# LLM Training Data Augmentation - Classification of Kaggle Disaster Data

The goal of this notebook is to prepare the data for augmentation by an LLM and classification by two models:

1. Logistic regression
2. Single hidden-layer neural network

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition at:  

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files:
+ *train.csv* - 7485 labeled tweets **after duplicate removals** 
+ *test.csv* - 3263 unlabeled tweets

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, 6090 labeled tweets
+ train_test.csv - held out and not used to train model, used as *pseudo-test* data, 1523 labeled tweets (~20% of the original training sample)

## Non-Transformer Models

Two types of models are created and compared:

1. Logistic Regression - This serves as the baseline
2. Single-hidden-layer neural network with 1000 nodes in the hidden layer

## LLM

ChatGPT 3.5 turbo will be used to augment the data used to train the models.

## Encodings

The Twitter GloVe embedding will be used to vectorize the input text.  These embeddings were downloaded from:

https://nlp.stanford.edu/data/glove.twitter.27B.zip


In [1]:
%pwd

'D:\\llmamd'

## Vocabulary and tokenization

### Empty string embedding

After running all the text pre-processing steps (the "pipeline"), some tweets were reduced to **empty strings**.  These are read in as **NaN** values when the data is loaded into a dataframe, which causes problems with `CountVectorizer`.  We need `CountVectorizer` to build the token data matrix (rows = tweets, cols = count of each token in the tweet).

There is an embedding for the empty string token in each of the `glove.twitter.27B...` embedding files at line 38523. Because there was no token to split on, the string "<>" was used as the token to represent the empty string so the `get_glove_embed` function could read this embedding properly.
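
The NaN behavior described above can be reproduced with a small in-memory CSV. This is a minimal sketch, not the notebook's actual pipeline: the CSV content is made up, and replacing the missing text with `<>` via `fillna` is only one possible way to get the placeholder token into the data.

```python
import io
import pandas as pd

# Made-up CSV for illustration: the second row has an empty text field
csv_text = "id,text,target\n1,love fruit,0\n2,,1\n"

df = pd.read_csv(io.StringIO(csv_text))
# By default pandas parses the empty field as NaN
n_missing = int(df["text"].isna().sum())

# Assumption: substitute the "<>" placeholder token for missing text
df_filled = df.fillna({"text": "<>"})
texts = df_filled["text"].tolist()
print(n_missing, texts)
```

After the substitution every row holds a string again, so a tokenizer that recognizes `<>` can count it like any other token.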

### Vectorizing a document using the entire input

In this project, a tweet is considered a document.  Each word/token in the document is represented by a d-dimensional vector.  We can concatenate all these word vectors to create one big vector.  For example, given the tweet *summer is lovely* and the 50d Twitter GloVe embeddings, each word is represented by one of the following vectors, where ... stands for the values of the other 45 dimensions in the 50d vector:

summer = [-0.40501, -0.56994, 0.34398, ..., -0.95337, 1.1409]
is = [0.18667, 0.21368, 0.14993, ..., -0.24608, -0.19549]
lovely = [-0.27926, -0.16338, 0.50486, ..., -0.15416, -0.20196]

The entire tweet would then be represented by the following 150d vector:

[-0.40501, -0.56994, 0.34398, ..., -0.95337, 1.1409 | 0.18667, 0.21368, 0.14993, ..., -0.24608, -0.19549 | -0.27926, -0.16338, 0.50486, ..., -0.15416, -0.20196]

where the pipe character | was inserted after each word so it's easier to see.

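There are a couple of challenges to representing documents this way.  The first challenge is that our classification models need a fixed input size, but the length of the concatenated vector depends on the number of tokens in the document.  The second challenge is that these vectors can get intractably large because they grow linearly with document length.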
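
The fixed-input-size problem can be seen in a small sketch. The 3-d embeddings below are made-up values rather than real GloVe vectors, and `concat_vectorize` is a hypothetical helper, not a function from `projtools`.

```python
import numpy as np

# Toy 3-d embeddings (made-up values, NOT real GloVe vectors)
toy_embs = {
    "summer": np.array([0.1, 0.2, 0.3]),
    "is": np.array([0.4, 0.5, 0.6]),
    "lovely": np.array([0.7, 0.8, 0.9]),
}

def concat_vectorize(tweet, embs):
    """Concatenate the embedding of each whitespace-separated token."""
    return np.concatenate([embs[tok] for tok in tweet.split()])

v3 = concat_vectorize("summer is lovely", toy_embs)  # 3 tokens
v2 = concat_vectorize("summer lovely", toy_embs)     # 2 tokens
print(v3.shape, v2.shape)
```

The two tweets produce vectors of different lengths (3 x 3 and 2 x 3 values), which is why some form of padding or pooling is needed before the vectors can be fed to a model.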


### Vectorize a document with mean, min and max vectors

A more common approach is to take the **mean** of the input embeddings and use this mean to represent the entire document.  A related approach is to create **min** and **max** vectors and concatenate them to form a 2d-dimensional vector, where d = number of dimensions in the embedding vectors.  These min and max vectors hold the minimum and maximum values of each dimension across the input embedding vectors, as described in the **Representing Document as Vectors** section of this workbook:

https://github.com/MichaelSzczepaniak/WordEmbeddings/blob/master/WordEmbeddings.ipynb
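
Both pooling approaches fit in a few lines of NumPy. This is a minimal sketch using made-up 3-d token vectors; `mean_vector` and `min_max_vector` are hypothetical helper names, not functions from `projtools` or the linked workbook.

```python
import numpy as np

def mean_vector(token_vecs):
    """Average the token embeddings dimension by dimension -> d values."""
    return np.mean(token_vecs, axis=0)

def min_max_vector(token_vecs):
    """Concatenate per-dimension minima and maxima -> 2d values."""
    return np.concatenate([np.min(token_vecs, axis=0),
                           np.max(token_vecs, axis=0)])

# Three tokens with made-up 3-d embeddings (one row per token)
token_vecs = np.array([[1.0, -2.0, 0.5],
                       [3.0, 0.0, -0.5],
                       [2.0, 4.0, 0.0]])

doc_mean = mean_vector(token_vecs)        # shape (3,)
doc_min_max = min_max_vector(token_vecs)  # shape (6,)
print(doc_mean, doc_min_max)
```

Either representation has a fixed size regardless of how many tokens the document contains, at the cost of discarding word order.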



In [2]:
import projtools as pt

dict_glove_embs = pt.get_glove_embeds()  # default is glove.twitter.27B.200d.txt, which takes ~1.3 min to load; ...50d.txt only takes ~25 sec to load

Indexing word vectors.
Found 1193514 word vectors.


In [3]:
print(dict_glove_embs["<>"].dtype)
print(dict_glove_embs["man"].dtype)

float64
float64


In [4]:
import numpy as np
# <> inserted as empty string embedding token in twitter embedding files at line 38523
# coeffs = "-0.29736 -0.57305 -0.39627 0.11851 0.16625 0.20137 0.15891 0.27938 -0.078399 -0.12866 0.21086 0.10652 -0.45356 -0.60928 -0.44878 -0.10511 0.32838 -0.088057 0.051537 0.46852 -0.13936 -0.71007 -0.65363 0.23445 -0.19538 0.6608 0.1313 -0.045464 0.43522 -0.96466 0.18855 0.93414 0.68161 -0.64802 0.059672 -0.69549 -0.31669 -0.48399 -0.63895 -0.35644 0.14326 0.79823 0.41653 -0.10187 0.17715 -0.20817 -0.47895 0.36954 0.4828 0.37621 -0.3492 -0.089045 0.40169 -0.8378 0.19303 -0.16941 0.2664 0.49512 -0.20796 0.69913 0.43428 0.15835 0.38629 0.24039 0.031994 -0.14381 0.52596 0.28369 -0.27033 0.22807 0.23541 -0.39603 -0.31054 -0.78715 -0.71227 -0.029253 0.24174 -0.44296 -0.836 0.064297 -0.94075 -0.18824 -0.16903 0.5849 -0.0074337 0.626 -0.49226 -0.71578 0.35292 -0.21006 -0.24776 0.57754 -0.27919 0.70211 0.039619 0.34539 -0.14673 -0.81167 0.68231 0.52827 -0.52141 -0.69099 -0.75099 0.11661 0.98226 0.35352 -0.11707 0.45133 0.69767 0.19557 -0.364 -0.035521 -0.71357 -0.83975 0.20347 -0.039052 -0.63665 -0.4491 -0.16223 0.51879 -0.7832 0.0896 -0.037932 0.23763 -0.51888 -0.17253 -0.014441 -0.5044 0.26391 -0.53308 0.92899 0.043442 -0.17849 -0.24523 -0.45531 -0.069423 -0.21187 -0.41407 -0.090711 -0.34815 0.1754 -0.21396 -0.13499 -0.64721 -0.3795 -0.14429 -0.30074 0.61857 -0.065655 -0.14137 0.45494 0.26353 -1.1331 1.0426 -0.027096 0.23131 0.32532 -0.25335 -0.34065 0.28641 -0.25686 -1.1398 0.22298 -0.2051 -0.48052 -0.065082 -0.32023 -0.045533 0.093544 -0.28296 -0.34975 0.19851 0.0086796 0.12968 0.96043 0.4946 0.47144 -0.10981 0.67961 -0.42269 0.23401 0.38641 -0.18864 -0.8254 -0.098215 -0.27643 -0.17081 0.30223 -0.62112 -0.2338 -0.39195 -0.049065 -0.28386 0.24707 -0.13131 -0.33601 -0.92245 -0.32083 -0.28469 -0.43977"
# lst_coeffs = coeffs.split()
# print(len(lst_coeffs))
# vec_coeffs = np.fromstring(coeffs, dtype='float', sep=' ')
# print(vec_coeffs.shape, vec_coeffs.shape[0])

In [5]:
# next 2 line commented out because empty string was manually added as <> token
# coeffs = np.fromstring(coeffs, dtype='float', sep=' ')
# dict_glove_embs[''] = coeffs

# find the nearest neighbor
# pt.word_NN("<>", dict_glove_embs, True)  # '\x94', U+0094, Cancel Character

### Checking token count distributions

In [6]:
import pandas as pd
import projtools as pt
display = pd.options.display
pd.set_option('display.max_colwidth', None)

df_train_clean_v09 = pd.read_csv("./data/train_clean_v09.csv", encoding="utf8")
df_aug_clean_v09 = pd.read_csv("./data/aug_clean_v09.csv", encoding="utf8")

In [7]:
print(df_aug_clean_v09.shape)
df_aug_clean_v09.head()

(7485, 3)


Unnamed: 0,id,text,target
0,20001,witness devastation cause powerful hurricane caribbean region pray safety all affect <hashtag> hurricane <hashtag> caribbean,1
1,20004,devastate forest fire near la heart go all affect tragic disaster <hashtag> <hashtag>,1
2,20005,break news authority local shelter place wildfire continue part california stay safe prepare <hashtag>,1
3,20006,massive flooding force <number> resident evacuate rescue effort underway home water <hashtag>,1
4,20007,receive video from <hashtag> california flood water engulf neighborhood stay safe everyone <hashtag>,1


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
# actual size of vocabulary
# vocabulary_size = 4872

## add the special tokens to token_pattern parameter so we can preserve them
## <> added to fix issue with empty string and possibly use as padding
vectorizer_v9 = CountVectorizer(analyzer = "word", tokenizer = None,
                                  token_pattern = r"(?u)\b\w\w+\b|<user>|<hashtag>|<url>|<number>|<>",
                                  preprocessor = None, max_features = None)  #max_features = vocabulary_size)
data_features_v09_train = vectorizer_v9.fit_transform(df_train_clean_v09['text'])
## each row rep's a tweet, each column rep's a word in the vocabulary
data_mat_v09_train = data_features_v09_train.toarray()  # each cell is the frequency of a word in a tweet
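
The effect of the extended `token_pattern` can be checked with the standard `re` module, since `CountVectorizer` extracts tokens by applying the pattern with `findall`. The sketch below compares the pattern from the cell above with scikit-learn's default pattern; the sample string is adapted from the augmented data shown earlier.

```python
import re

# Same pattern passed to CountVectorizer in the cell above
TOKEN_PATTERN = r"(?u)\b\w\w+\b|<user>|<hashtag>|<url>|<number>|<>"
# scikit-learn's default token_pattern, for comparison
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

sample = "massive flooding force <number> resident evacuate <hashtag>"

custom_tokens = re.findall(TOKEN_PATTERN, sample)    # keeps <number>, <hashtag>
default_tokens = re.findall(DEFAULT_PATTERN, sample) # strips the angle brackets
empty_tokens = re.findall(TOKEN_PATTERN, "<>")       # placeholder survives
print(custom_tokens)
print(default_tokens)
print(empty_tokens)
```

With the default pattern the special tokens lose their angle brackets and `<>` yields no token at all, which is what made the empty tweets show up with zero tokens.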

In [9]:
# keys are words in the vocabulary, each value is the column
# in the data matrix representing a word (key) in the vocabulary
voc_dict = vectorizer_v9.vocabulary_
vocab = list(voc_dict.keys())
vocab[0:10], len(vocab), data_mat_v09_train.shape  # why is |V| = 4819 and not 4872???

(['deed',
  'reason',
  '<hashtag>',
  'earthquake',
  'may',
  'allah',
  'forgive',
  'all',
  'forest',
  'fire'],
 4819,
 (7485, 4819))

In [10]:
voc_dict['reason'], voc_dict['<hashtag>'], voc_dict['<user>'], voc_dict['<url>'], voc_dict['earthquake']  # was (3473, 0, 3, 2, 1323) before <> was added to the vocabulary

(3474, 1, 4, 3, 1324)

In [11]:
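# look up the column by name: the hard-coded index 3473 went stale when <> shifted every index by 1
# (31 tweets have the word 'reason' in them - verified in NP++; the recorded output of 1 came from the stale index)
vec_reason = data_mat_v09_train[:, voc_dict['reason']]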
vec_reason.sum()

1

In [12]:
words_per_tweet_train = data_mat_v09_train.sum(axis=1)
print(f"words_per_tweet_train type: {type(words_per_tweet_train)}, shape: {words_per_tweet_train.shape}")
print(f"minimum tokens per original tweet: {words_per_tweet_train.min()}")
print(f"maximum tokens per original tweet: {words_per_tweet_train.max()}")

words_per_tweet_train type: <class 'numpy.ndarray'>, shape: (7485,)
minimum tokens per original tweet: 1
maximum tokens per original tweet: 29


In [13]:
# Which train tweets have 0 tokens? None after fix...
# indices_with_0 = np.where(words_per_tweet_train == 0)[0]
# id = 28, 36, 40, 6407, 7295, 8560 and 9919
# indices_with_0  # array([  19,   24,   28, 4419, 5018, 5890, 6799], dtype=int64)

In [14]:
vectorizer_v9_aug = CountVectorizer(analyzer = "word", tokenizer = None,
                                    token_pattern = r"(?u)\b\w\w+\b|<user>|<hashtag>|<url>|<number>|<>",
                                    preprocessor = None, max_features = None)
# fit the aug vectorizer (not vectorizer_v9) so the training vocabulary is not overwritten
data_features_v09_aug = vectorizer_v9_aug.fit_transform(df_aug_clean_v09['text'])
## each row rep's an augmented tweet, each column rep's a word in the vocabulary
data_mat_v09_aug = data_features_v09_aug.toarray()  # each cell is the frequency of a word in a tweet

In [15]:
words_per_tweet_aug = data_mat_v09_aug.sum(axis=1)
print(f"words_per_tweet_aug type: {type(words_per_tweet_aug)}, shape: {words_per_tweet_aug.shape}")
print(f"minimum tokens per augmented tweet: {words_per_tweet_aug.min()}")
print(f"maximum tokens per augmented tweet: {words_per_tweet_aug.max()}")

words_per_tweet_aug type: <class 'numpy.ndarray'>, shape: (7485,)
minimum tokens per augmented tweet: 1
maximum tokens per augmented tweet: 26


## Vectorize with all cleaned tweet tokens

Since the maximum number of tokens is 29 in the cleaned original training data and 26 in the cleaned augmented data, a 30-token input is used.  This gives a model input of 30 (tokens / tweet) x 50 (dimensions / token) = 1500 dimensions / tweet.

Since every tweet has fewer than 30 tokens, each input is padded with the empty string token (<>) up to 30 tokens.

In [16]:
# some short tweets from train_clean_v09, id=23,24,40,41
padded_token_count = 5
test_tweets = ["man", "love fruit", "<>", "you like pasta"]
tweet_vecs = {}
pad_vec = dict_glove_embs["<>"]
for test_tweet in test_tweets:
    tokens = test_tweet.split()
    padding_tokens = padded_token_count - len(tokens)
    token_vec = np.array([])
    for token in tokens:
        token_vec = np.hstack((token_vec, dict_glove_embs[token]))
    # add the padding
    for pad_token in range(padding_tokens):
        token_vec = np.hstack((token_vec, pad_vec))
    
    tweet_vecs[test_tweet] = token_vec
    np.set_printoptions(suppress=True)
    print(f"{test_tweet}: {tweet_vecs[test_tweet]}")
    print(f"shape of test_tweet vector: {tweet_vecs[test_tweet].shape}")

man: [ 0.454      0.091616  -0.013785  -0.82997   -0.96978    0.49015
  0.31657    0.74712   -0.48305    0.61955   -0.29768    0.92822
 -4.4227     0.12139    0.31704    0.48146    0.40892    0.04774
 -0.40095   -1.1193    -0.32822    1.0673    -0.10623    0.28587
  0.5852    -0.24609    0.20463   -0.8715    -0.47394   -0.80522
 -0.0020135  0.56237   -0.20017   -0.10347    0.80586   -0.031987
 -0.59742   -0.13249   -0.80784   -0.79229   -1.2715     0.46771
 -0.22299    1.0151    -0.51008    0.01278    0.96641    0.49574
  0.13395    0.37403    0.45973   -0.16703   -1.2028     0.41675
  0.14643   -0.39861   -0.35118   -0.46944    0.63799    0.49569
 -0.038122  -0.37854   -1.2221    -1.0439    -1.2604     0.01232
 -0.5159     0.1357    -0.093283   0.12307    0.48072   -0.66419
  0.50046   -0.58255    0.81583    0.72197   -0.101     -0.17283
  0.51572    0.3296    -0.0024615  0.19475    2.1163     0.20636
 -1.2026    -0.0767    -0.1058    -0.82518   -0.31287   -0.19303
  0.061489  -0.3042

In [17]:
# same thing, but with the function
np.set_printoptions(suppress=True)
for tweet in test_tweets:
    tweet_vec = pt.vectorize_tweet(tweet, dict_glove_embs, 5)
    print(f"{tweet}: {tweet_vec}")
    print(f"shape of tweet vector: {tweet_vec.shape}")

man: [ 0.454      0.091616  -0.013785  -0.82997   -0.96978    0.49015
  0.31657    0.74712   -0.48305    0.61955   -0.29768    0.92822
 -4.4227     0.12139    0.31704    0.48146    0.40892    0.04774
 -0.40095   -1.1193    -0.32822    1.0673    -0.10623    0.28587
  0.5852    -0.24609    0.20463   -0.8715    -0.47394   -0.80522
 -0.0020135  0.56237   -0.20017   -0.10347    0.80586   -0.031987
 -0.59742   -0.13249   -0.80784   -0.79229   -1.2715     0.46771
 -0.22299    1.0151    -0.51008    0.01278    0.96641    0.49574
  0.13395    0.37403    0.45973   -0.16703   -1.2028     0.41675
  0.14643   -0.39861   -0.35118   -0.46944    0.63799    0.49569
 -0.038122  -0.37854   -1.2221    -1.0439    -1.2604     0.01232
 -0.5159     0.1357    -0.093283   0.12307    0.48072   -0.66419
  0.50046   -0.58255    0.81583    0.72197   -0.101     -0.17283
  0.51572    0.3296    -0.0024615  0.19475    2.1163     0.20636
 -1.2026    -0.0767    -0.1058    -0.82518   -0.31287   -0.19303
  0.061489  -0.3042

## Vectorize tweet data



In [18]:
df_train_clean_v09 = pd.read_csv("./data/train_clean_v09.csv", encoding="utf8")
train_clean_vec_id = df_train_clean_v09['id']
train_clean_vec_text = df_train_clean_v09['text']
train_clean_vec_target = df_train_clean_v09['target']

In [19]:
# test by tokenizing first 5 tokens only
# train_clean_vec_text_vector = [pt.vectorize_tweet(tweet, dict_glove_embs, 5) for tweet in train_clean_vec_text]

In [20]:
# print(len(train_clean_vec_text_vector))      # 7485
# print(type(train_clean_vec_text_vector[0]))  # should be <class 'numpy.ndarray'>

In [21]:
# import random

# text_vector_list = [random.choices(range(100), k=1000) for _ in range(5)]
# print(len(text_vector_list))
# # text_vector_list[0]  # 1000 ints between 0 and 100
# train_clean_vec_text_vector_alt = [random.choices(range(100),k=1000) for _ in range(5)]

In [22]:
# check_index = 26  # "be nyc last week"
# print(f"embedding vector count: {len(train_clean_vec_text_vector)}")
# print(f"for tweet: {train_clean_vec_text[check_index]} | vector: {train_clean_vec_text_vector[check_index]}")

In [23]:
# vectorize train padded to 30 tokens
train_clean_vec_text_vector_pad30 = [pt.vectorize_tweet(tweet, dict_glove_embs) for tweet in train_clean_vec_text]
# only need the version below to properly write to csv without truncation
# train_pad30_list = [list(pt.vectorize_tweet(tweet, dict_glove_embs)) for tweet in train_clean_vec_text]

In [24]:
# print(type(train_clean_vec_text_vector_pad30[0]))        # <class 'numpy.ndarray'>
# print(type(train_clean_vec_text_vector_pad30[0].dtype))  # <class 'numpy.dtypes.Float64DType'>
# print(type(train_clean_vec_text_vector_pad30))     # list of numpy arrays
# print(type(train_clean_vec_text_vector_pad30[0]))  # <class 'numpy.ndarray'>

In [25]:
# create dataframe of vectorize original training data
# df_train_clean_vec = pd.DataFrame({"id": train_clean_vec_id,
#                                    "text": train_clean_vec_text,
#                                    # "text_vector": train_clean_vec_text_vector_pad30,
#                                    "text_vector": train_pad30_list,
#                                    "target": train_clean_vec_target})
# df_train_clean_vec.head()

In [26]:
# write cleaned and vectorize original training data
# df_train_clean_vec.to_csv(path_or_buf="./data/train_clean_vec.csv", index=False, encoding='utf-8')

In [30]:
feat_matrix_train = pt.make_tweet_feats(train_clean_vec_text_vector_pad30)
print(feat_matrix_train.shape, feat_matrix_train.dtype)

(7485, 1500)



In [31]:
feat_matrix_train[:3, :3]

array([[ 0.8568   ,  0.76506  , -0.0091393],
       [-0.18449  , -0.93946  , -0.16102  ],
       [ 0.33808  ,  0.24919  ,  0.25473  ]])

In [42]:
np.savetxt(fname='./data/feat_matrix_train.txt', X=feat_matrix_train, fmt='%9.6f')

In [34]:
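# test retrieval
x = np.loadtxt(fname='./data/feat_matrix_train.txt')
print(x.shape, x.dtype)
# NOTE: this displays the in-memory matrix, not the reloaded one; use x[:3, :3] to inspect
# the values read back from disk (rounded to 6 decimals by fmt='%9.6f')
feat_matrix_train[:3, :3]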

(7485, 1500) float64


array([[ 0.8568   ,  0.76506  , -0.0091393],
       [-0.18449  , -0.93946  , -0.16102  ],
       [ 0.33808  ,  0.24919  ,  0.25473  ]])
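
Because `fmt='%9.6f'` rounds every value to six decimals, the reloaded matrix is close to the original but not bit-for-bit identical. A minimal sketch of a round-trip check follows; it writes a small made-up matrix to a temporary directory rather than to `./data`, and the tolerance is an assumption matched to the six-decimal format.

```python
import os
import tempfile
import numpy as np

# Small stand-in for the feature matrix (values taken from the output above)
orig = np.array([[0.8568, 0.76506, -0.0091393],
                 [-0.18449, -0.93946, -0.16102]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feat_matrix_demo.txt")
    np.savetxt(fname=path, X=orig, fmt="%9.6f")
    reloaded = np.loadtxt(fname=path)

# Exact equality fails because -0.0091393 was written as -0.009139
exact_match = np.array_equal(orig, reloaded)
# Comparison with a tolerance matching the 6-decimal format succeeds
close_match = np.allclose(orig, reloaded, atol=1e-6)
print(reloaded.shape, exact_match, close_match)
```

If exact reproducibility matters, a binary format such as `np.save` / `np.load` avoids the rounding altogether.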

In [36]:
# create feature matrix for augmented data
df_aug_clean_v09 = pd.read_csv("./data/aug_clean_v09.csv", encoding="utf8")
aug_clean_vec_id = df_aug_clean_v09['id']
aug_clean_vec_text = df_aug_clean_v09['text']
aug_clean_vec_target = df_aug_clean_v09['target']

In [37]:
# vectorize aug padded to 30 tokens
aug_clean_vec_text_vector_pad30 = [pt.vectorize_tweet(tweet, dict_glove_embs) for tweet in aug_clean_vec_text]

In [40]:
# print(len(aug_clean_vec_text_vector_pad30), aug_clean_vec_text_vector_pad30[0].shape)  # (7485, (1500,)) check
feat_matrix_aug = pt.make_tweet_feats(aug_clean_vec_text_vector_pad30)  # takes ~1min, 20sec to run
print(feat_matrix_aug.shape, feat_matrix_aug.dtype)

(7485, 1500) float64


In [41]:
np.savetxt(fname='./data/feat_matrix_aug.txt', X=feat_matrix_aug, fmt='%9.6f')

In [44]:
# create feature matrix for test data
df_test_clean_v09 = pd.read_csv("./data/test_clean_v09.csv", encoding="utf8")
test_clean_vec_id = df_test_clean_v09['id']
test_clean_vec_text = df_test_clean_v09['text']

In [45]:
# vectorize test padded to 30 tokens
test_clean_vec_text_vector_pad30 = [pt.vectorize_tweet(tweet, dict_glove_embs) for tweet in test_clean_vec_text]

In [46]:
print(len(test_clean_vec_text_vector_pad30), test_clean_vec_text_vector_pad30[0].shape)  # (3263, (1500,)) check
feat_matrix_test = pt.make_tweet_feats(test_clean_vec_text_vector_pad30)  # takes ~1min, 20sec to run
print(feat_matrix_test.shape, feat_matrix_test.dtype)

(3263, 1500) float64


In [47]:
np.savetxt(fname='./data/feat_matrix_test.txt', X=feat_matrix_test, fmt='%9.6f')

In [51]:
# write out the train and aug targets - these should be identical - diff confirmed
# np.savetxt(fname='./data/targets_train.txt', X=train_clean_vec_target, fmt='%1d')
# np.savetxt(fname='./data/targets_aug.txt', X=aug_clean_vec_target, fmt='%1d')

In [55]:
train_targets = np.array(train_clean_vec_target)
print(train_targets.shape)
type(train_targets)
train_targets[:3]

(7485,)


array([1, 1, 1], dtype=int64)

## Logistic regression models

### Just the original training data

**Results look pretty good:**

+ Training error (train_train partition):  0.201737
+ Test error (train_test partition):  0.275217

In [59]:
# split training into train and train_test because test data is unlabeled
np.random.seed(711)
portion_test = 0.20  # 80/20 train/test split
train_train_samples = int((1. - portion_test) * train_targets.shape[0])
train_test_samples = train_targets.shape[0] - train_train_samples
portion_class0_test = 0.5
class0_test_samples = int(train_test_samples * portion_class0_test)
class1_test_samples = train_test_samples - class0_test_samples
print(f"{train_targets.shape[0]} total train samples, {train_train_samples} train_train samples, {train_test_samples} train_test samples")
print(f"{class0_test_samples} class 0 train test samples, {class1_test_samples} class 1 train test samples")

7485 total train samples, 5988 train_train samples, 1497 train_test samples
748 class 0 train test samples, 749 class 1 train test samples


In [62]:
# indices of 748 class 0 and 749 class 1 random samples; the [0] on the np.where calls is needed because
# np.where(condition) returns a tuple of index arrays (one per dimension) and train_targets is 1-D
train_test_inds = np.append(np.random.choice((np.where(train_targets==0))[0], class0_test_samples, replace=False),
                            np.random.choice((np.where(train_targets==1))[0], class1_test_samples, replace=False))
# build training set from indices not in the set
train_train_inds = list(set(range(len(train_targets))) - set(train_test_inds))
# original training data used to train model
train_train_data = feat_matrix_train[train_train_inds, ]
train_train_labels = train_targets[train_train_inds, ]
# original training data used to TEST the model (because provided test data is unlabeled)
train_test_data = feat_matrix_train[train_test_inds, ]
train_test_labels = train_targets[train_test_inds, ]

print(f"train TRAIN data feature matrix: {train_train_data.shape}")
print(f"train TEST data feature matrix: {train_test_data.shape}")

train TRAIN data feature matrix: (5988, 1500)
train TEST data feature matrix: (1497, 1500)


In [66]:
from sklearn.linear_model import SGDClassifier

## fit logistic classifier on training data: minimize neg log likelihood ("log"), no regularization penalty
logreg_model_orig = SGDClassifier(loss="log_loss", penalty=None)
logreg_model_orig.fit(train_train_data, train_train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = logreg_model_orig.coef_[0,:]
b = logreg_model_orig.intercept_
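
The extracted parameters are enough to reproduce the classifier's decisions by hand: logistic regression predicts class 1 when sigmoid(w.x + b) exceeds 0.5, which is the same as w.x + b > 0. The sketch below uses made-up values for `w_demo`, `b_demo` and `X_demo` rather than the fitted model, and `predict_from_params` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_from_params(X, w, b):
    """Return hard labels and P(class 1) from weights w and intercept b."""
    probs = sigmoid(X @ w + b)
    return (probs > 0.5).astype(int), probs

# Made-up parameters and inputs for illustration only
w_demo = np.array([2.0, -1.0])
b_demo = -0.5
X_demo = np.array([[1.0, 0.0],   # score  1.5 -> class 1
                   [0.0, 1.0],   # score -1.5 -> class 0
                   [0.5, 0.5]])  # score  0.0 -> probability 0.5 -> class 0

labels_demo, probs_demo = predict_from_params(X_demo, w_demo, b_demo)
print(labels_demo, probs_demo)
```

Applied to the fitted model, `predict_from_params(train_test_data, w, b)` should give the same labels as `logreg_model_orig.predict(train_test_data)`, assuming the labels are coded 0/1 as they are here.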

In [69]:
## Get predictions on training and test data
preds_train = logreg_model_orig.predict(train_train_data)
preds_test = logreg_model_orig.predict(train_test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (train_test_labels > 0.0))

print("Training error: ", float(errs_train)/len(train_train_labels))
print("Test error: ", float(errs_test)/len(train_test_labels))

Training error:  0.2017368069472278
Test error:  0.27521710086840345


### The original training data PLUS the augmented data

Now we'll add the augmented data to the training data and see if this improves our results.

**Training error was cut roughly in half, but test error improved only slightly (about 0.4 percentage points, ~1.5% relative):**

+ Training error:  0.108365
+ Test error:  0.271209

In [77]:
# create train + aug feature and label inputs
train_train_aug_data = np.vstack((train_train_data, feat_matrix_aug))
train_train_aug_labels = np.hstack((train_train_labels, np.array(aug_clean_vec_target)))
print(train_train_aug_data.shape, train_train_aug_labels.shape)

(13473, 1500) (13473,)
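
One thing that may be worth checking in this setup: `feat_matrix_aug` has 7485 rows, so it appears to contain an augmented version of every original tweet, including the 1497 tweets held out in `train_test_inds`. If row i of the augmented matrix corresponds to row i of the original matrix (an assumption; the notebook only confirms that the two target files are identical), the held-out rows could be dropped from the augmented data before stacking. The sketch below shows the idea on made-up arrays; `drop_held_out_aug` is a hypothetical helper.

```python
import numpy as np

def drop_held_out_aug(aug_feats, aug_labels, held_out_inds):
    """Remove augmented rows whose original tweet is in the held-out set.

    Assumes row i of aug_feats was generated from original row i.
    """
    keep = np.ones(aug_feats.shape[0], dtype=bool)
    keep[held_out_inds] = False
    return aug_feats[keep], aug_labels[keep]

# Made-up data: 6 augmented rows, originals 1 and 4 are held out
aug_feats_demo = np.arange(12, dtype=float).reshape(6, 2)
aug_labels_demo = np.array([1, 0, 1, 1, 0, 0])
held_out_demo = np.array([1, 4])

kept_feats, kept_labels = drop_held_out_aug(aug_feats_demo, aug_labels_demo, held_out_demo)
print(kept_feats.shape, kept_labels)
```

In the notebook this would correspond to something like `drop_held_out_aug(feat_matrix_aug, np.array(aug_clean_vec_target), train_test_inds)` before the `np.vstack` call.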


In [81]:
## fit logistic classifier on training data: minimize neg log likelihood ("log"), no regularization penalty
logreg_model_aug = SGDClassifier(loss="log_loss", penalty=None)
logreg_model_aug.fit(train_train_aug_data, train_train_aug_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = logreg_model_aug.coef_[0,:]
b = logreg_model_aug.intercept_

In [82]:
## Get predictions on training and test data
preds_train_aug = logreg_model_aug.predict(train_train_aug_data)
preds_test_aug = logreg_model_aug.predict(train_test_data)  # same train TEST partition, but different model

## Compute errors
errs_train_aug = np.sum((preds_train_aug > 0.0) != (train_train_aug_labels > 0.0))
errs_test_aug = np.sum((preds_test_aug > 0.0) != (train_test_labels > 0.0))  # same train TEST partition, but different model

print("Training error with aug data: ", float(errs_train_aug)/len(train_train_aug_labels))
print("Test error with aug data: ", float(errs_test_aug)/len(train_test_labels))

Training error with aug data:  0.10836487790395606
Test error with aug data:  0.2712090848363393
