# LLM Training Data Augmentation - Classification of Kaggle Disaster Data

The goal of this notebook is to prepare the data for augmentation by an LLM and classification by two models:

1. Logistic regression
2. Single hidden-layer neural network

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition at:  

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files:
+ *train.csv* - 7485 labeled tweets **after duplicate removals** 
+ *test.csv* - 3263 unlabeled tweets

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, 5988 labeled tweets
+ train_test.csv - held out and not used to train model, used as *pseudo-test* data, 1497 labeled tweets (~20% of the original training sample)

## Simpler NLP Classifier Models

Two types of models are created and compared:

1. Logistic Regression - This serves as the baseline
2. Single hidden-layer neural network with 1000 nodes in the hidden layer

## LLM

ChatGPT 3.5 turbo will be used to augment the data used to train the models.

## Encodings

The Twitter GloVe embeddings will be used to vectorize the input text.  These embeddings were downloaded from:

https://nlp.stanford.edu/data/glove.twitter.27B.zip


In [1]:
%pwd

'D:\\llmamd'

## Vocabulary and tokenization

### Empty string embedding

After running all the text pre-processing steps (the "pipeline"), some tweets were reduced to **empty strings**.  These are read in as **NaN** values when the data is loaded into a dataframe, which causes problems for `CountVectorizer`, which we need to build the token data matrix (rows = tweets, cols = count of each token in the tweet).
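A minimal sketch of how the NaN arises and one way to guard against it. The toy CSV content and the use of `fillna` with the `<>` placeholder are illustrative assumptions, not the project's actual pipeline.

```python
import io
import pandas as pd

# toy CSV: the second tweet was reduced to an empty string by pre-processing
csv_text = "id,text,target\n1,forest fire,1\n2,,0\n"

df_raw = pd.read_csv(io.StringIO(csv_text))
n_missing = int(df_raw["text"].isna().sum())  # the empty field is read as NaN

# replace the NaN with a placeholder token so every row is a string again
df_fixed = df_raw.copy()
df_fixed["text"] = df_fixed["text"].fillna("<>")
print(n_missing, df_fixed["text"].tolist())
```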

There is an embedding for the empty string token in each of the `glove.twitter.27B...` embedding files at line 38523. Because there was no token to split on, the string "<>" was used as the token to represent the empty string so the `get_glove_embed` function could read this embedding properly.
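The sketch below illustrates why an empty token is awkward to parse. It assumes the token field of that line is simply empty, so the line begins with the separator. `parse_glove_line` is a hypothetical helper, not the project's `get_glove_embed`, and it substitutes the placeholder at parse time, whereas the notebook edits the embedding file itself.

```python
def parse_glove_line(line, empty_token="<>"):
    """Hypothetical helper: split one embedding line into (token, vector)."""
    parts = line.rstrip("\n").split(" ")
    token, values = parts[0], parts[1:]
    if token == "":  # the line started with a space, so the token is empty
        token = empty_token
    return token, [float(v) for v in values]

# toy 3d lines; the real Twitter GloVe files hold 25, 50, 100 or 200 values per line
tok_a, vec_a = parse_glove_line("fire 0.1 -0.2 0.3\n")
tok_b, vec_b = parse_glove_line(" 0.4 0.5 -0.6\n")  # empty-string token
print(tok_a, vec_a)
print(tok_b, vec_b)
```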

### Vectorizing a document using the entire input

In this project, a tweet is considered a document.  Each word/token in the document is represented by a d-dimensional vector.  We can concatenate all these word vectors together to create one big vector.  For example, say we have the tweet *summer is lovely* and we are using the 50d Twitter GloVe embeddings.  Each word would be represented by one of the following vectors, where ... stands for the values of the other 45 dimensions in the 50d vector:

summer = [-0.40501, -0.56994, 0.34398, ..., -0.95337, 1.1409]
is = [0.18667, 0.21368, 0.14993, ..., -0.24608, -0.19549]
lovely = [-0.27926, -0.16338, 0.50486, ..., -0.15416, -0.20196]

The entire tweet would then be represented by the following 150d vector:

[-0.40501, -0.56994, 0.34398, ..., -0.95337, 1.1409 | 0.18667, 0.21368, 0.14993, ..., -0.24608, -0.19549 | -0.27926, -0.16338, 0.50486, ..., -0.15416, -0.20196]

where the pipe character | was inserted between the word vectors so the boundaries are easier to see.

There are a couple of challenges to representing documents this way.  The first challenge is that our classification models need a fixed input size, but the length of the concatenated vector varies with the number of tokens in the tweet.  The second challenge is that these vectors can get intractably large.
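The variable-length problem can be seen in a small sketch. The 4d random embeddings and the `concat_vector` helper are made up for illustration; the notebook itself uses the 50d GloVe vectors.

```python
import numpy as np

d = 4  # toy embedding size; the notebook uses 50
rng = np.random.default_rng(0)
toy_embeds = {w: rng.normal(size=d) for w in ["summer", "is", "lovely", "today"]}

def concat_vector(tweet, embeds):
    """Concatenate the embedding of every token in the tweet, in order."""
    return np.concatenate([embeds[tok] for tok in tweet.split()])

v3 = concat_vector("summer is lovely", toy_embeds)
v4 = concat_vector("summer is lovely today", toy_embeds)
print(v3.shape, v4.shape)  # the lengths differ: 3*d versus 4*d
```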


### Vectorize a document with mean, min and max vectors

A more common approach is to average the input embeddings and use this **mean** vector to represent the entire document.  A related approach is to build **min** and **max** vectors and concatenate them into a 2d-dimensional vector, where d = number of dimensions in the embedding vectors.  The min and max vectors hold the minimum and maximum value of each dimension across the input embedding vectors, respectively, as described in the **Representing Document as Vectors** section of this workbook:

https://github.com/MichaelSzczepaniak/WordEmbeddings/blob/master/WordEmbeddings.ipynb
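As a rough illustration of the mean and min/max representations, here is a sketch on a made-up 3-token tweet with 3d embeddings. The values are arbitrary and only the NumPy reductions matter.

```python
import numpy as np

# toy 3d embeddings for a 3-token tweet, one row per token
token_vecs = np.array([[ 1.0, -2.0, 0.5],
                       [ 3.0,  0.0, 0.5],
                       [-1.0,  2.0, 2.0]])

mean_vec = token_vecs.mean(axis=0)  # d values, one per embedding dimension
min_vec = token_vecs.min(axis=0)    # smallest value seen in each dimension
max_vec = token_vecs.max(axis=0)    # largest value seen in each dimension
minmax_vec = np.concatenate([min_vec, max_vec])  # 2d values

print(mean_vec.shape, minmax_vec.shape)
```

Both representations have a fixed size regardless of how many tokens the tweet contains, at the cost of discarding word order.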



In [2]:
import projtools as pt
import numpy as np
import pandas as pd

dict_glove_embs = pt.get_glove_embeds()  # default is glove.twitter.27B.50d.txt which takes ~21 sec to load

Indexing word vectors...
Found 1193514 word vectors
Retrieving embeddings took  0.35 minutes


In [3]:
# print(dict_glove_embs["<>"].dtype)
# print(dict_glove_embs["man"].dtype)

In [4]:
# <> inserted as empty string embedding token in twitter embedding files at line 38523
# coeffs = "-0.29736 -0.57305 -0.39627 0.11851 0.16625 0.20137 0.15891 0.27938 -0.078399 -0.12866 0.21086 0.10652 -0.45356 -0.60928 -0.44878 -0.10511 0.32838 -0.088057 0.051537 0.46852 -0.13936 -0.71007 -0.65363 0.23445 -0.19538 0.6608 0.1313 -0.045464 0.43522 -0.96466 0.18855 0.93414 0.68161 -0.64802 0.059672 -0.69549 -0.31669 -0.48399 -0.63895 -0.35644 0.14326 0.79823 0.41653 -0.10187 0.17715 -0.20817 -0.47895 0.36954 0.4828 0.37621 -0.3492 -0.089045 0.40169 -0.8378 0.19303 -0.16941 0.2664 0.49512 -0.20796 0.69913 0.43428 0.15835 0.38629 0.24039 0.031994 -0.14381 0.52596 0.28369 -0.27033 0.22807 0.23541 -0.39603 -0.31054 -0.78715 -0.71227 -0.029253 0.24174 -0.44296 -0.836 0.064297 -0.94075 -0.18824 -0.16903 0.5849 -0.0074337 0.626 -0.49226 -0.71578 0.35292 -0.21006 -0.24776 0.57754 -0.27919 0.70211 0.039619 0.34539 -0.14673 -0.81167 0.68231 0.52827 -0.52141 -0.69099 -0.75099 0.11661 0.98226 0.35352 -0.11707 0.45133 0.69767 0.19557 -0.364 -0.035521 -0.71357 -0.83975 0.20347 -0.039052 -0.63665 -0.4491 -0.16223 0.51879 -0.7832 0.0896 -0.037932 0.23763 -0.51888 -0.17253 -0.014441 -0.5044 0.26391 -0.53308 0.92899 0.043442 -0.17849 -0.24523 -0.45531 -0.069423 -0.21187 -0.41407 -0.090711 -0.34815 0.1754 -0.21396 -0.13499 -0.64721 -0.3795 -0.14429 -0.30074 0.61857 -0.065655 -0.14137 0.45494 0.26353 -1.1331 1.0426 -0.027096 0.23131 0.32532 -0.25335 -0.34065 0.28641 -0.25686 -1.1398 0.22298 -0.2051 -0.48052 -0.065082 -0.32023 -0.045533 0.093544 -0.28296 -0.34975 0.19851 0.0086796 0.12968 0.96043 0.4946 0.47144 -0.10981 0.67961 -0.42269 0.23401 0.38641 -0.18864 -0.8254 -0.098215 -0.27643 -0.17081 0.30223 -0.62112 -0.2338 -0.39195 -0.049065 -0.28386 0.24707 -0.13131 -0.33601 -0.92245 -0.32083 -0.28469 -0.43977"
# lst_coeffs = coeffs.split()
# print(len(lst_coeffs))
# vec_coeffs = np.fromstring(coeffs, dtype='float', sep=' ')
# print(vec_coeffs.shape, vec_coeffs.shape[0])

In [5]:
# next 2 lines commented out because empty string was manually added as <> token
# coeffs = np.fromstring(coeffs, dtype='float', sep=' ')
# dict_glove_embs[''] = coeffs

# find the nearest neighbor
# pt.word_NN("<>", dict_glove_embs, True)  # '\x94', U+0094, Cancel Character

## Tokens per tweet distribution

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# actual size of vocabulary
# vocabulary_size = 4872

## add the special tokens to token_pattern parameter so we can preserve them
## <> added to fix issue with empty string and possibly use as padding
vectorizer_v9 = CountVectorizer(analyzer = "word", tokenizer = None,
                                token_pattern = r"(?u)\b\w\w+\b|<user>|<hashtag>|<url>|<number>|<>",
                                preprocessor = None, max_features = None)  #max_features = vocabulary_size)
df_train_clean_v09 = pd.read_csv("./data/train_clean_v09.csv", encoding="utf8")
data_features_v09_train = vectorizer_v9.fit_transform(df_train_clean_v09['text'])
## each row rep's a tweet, each column rep's a word in the vocabulary
data_mat_v09_train = data_features_v09_train.toarray()  # each cell is the frequency of a word in a tweet
print(data_mat_v09_train.shape)

(7485, 4819)
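The effect of the extended `token_pattern` can be checked with the standard `re` module alone. This sketch assumes, as I understand it, that `CountVectorizer` applies the pattern as a regex find-all over each document; the sample tweet is made up.

```python
import re

default_pattern = r"(?u)\b\w\w+\b"
custom_pattern = r"(?u)\b\w\w+\b|<user>|<hashtag>|<url>|<number>|<>"

tweet = "<user> fire near <url> <>"
default_tokens = re.findall(default_pattern, tweet)  # angle brackets are stripped
custom_tokens = re.findall(custom_pattern, tweet)    # special tokens survive intact

print(default_tokens)
print(custom_tokens)
```

With the default pattern, `<user>` and `<url>` collapse to the ordinary words `user` and `url`, and `<>` disappears entirely, which is why the alternatives are appended to the pattern.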


In [7]:
# keys are words in the vocabulary, each value is the column index
# in the data matrix representing a word (key) in the vocabulary
voc_dict = vectorizer_v9.vocabulary_
vocab = list(voc_dict.keys())
vocab[0:10], len(vocab), data_mat_v09_train.shape  # why is |V| = 4819 and not 4872? probably due to stop word removals...

(['deed',
  'reason',
  '<hashtag>',
  'earthquake',
  'may',
  'allah',
  'forgive',
  'all',
  'forest',
  'fire'],
 4819,
 (7485, 4819))

In [8]:
voc_dict['reason'], voc_dict['<hashtag>'], voc_dict['<user>'], voc_dict['<url>'], voc_dict['earthquake']

(3474, 1, 4, 3, 1324)

In [9]:
vec_reason = data_mat_v09_train[:, voc_dict['reason']]  # 31 tweets have the word 'reason' in it - verified in NP++
vec_reason.sum()

31

In [10]:
words_per_tweet_train = data_mat_v09_train.sum(axis=1)
print(f"words_per_tweet_train type: {type(words_per_tweet_train)}, shape: {words_per_tweet_train.shape}")
print(f"minimum tokens per original tweet: {words_per_tweet_train.min()}")
print(f"maximum tokens per original tweet: {words_per_tweet_train.max()}")

words_per_tweet_train type: <class 'numpy.ndarray'>, shape: (7485,)
minimum tokens per original tweet: 1
maximum tokens per original tweet: 29
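Beyond the minimum and maximum, the full distribution of tokens per tweet can be tabulated from the same row sums. The sketch below uses a made-up 4 x 5 count matrix in place of `data_mat_v09_train`.

```python
import numpy as np

# toy count matrix: 4 tweets x 5 vocabulary terms
toy_counts = np.array([[1, 0, 2, 0, 0],
                       [0, 1, 0, 0, 0],
                       [1, 1, 1, 0, 0],
                       [0, 0, 0, 1, 0]])

tokens_per_tweet = toy_counts.sum(axis=1)  # total tokens in each tweet
# index i of the result holds the number of tweets with exactly i tokens
length_hist = np.bincount(tokens_per_tweet)

print(tokens_per_tweet, length_hist)
```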


## Vectorize with all cleaned tweet tokens

The maximum number of tokens per tweet is 29 in the cleaned original training data and 26 in the cleaned augmented data, so a 30 token input will be selected.  This will give us an input to the model that is 30 (tokens / tweet) x (50 dimensions / token) = 1500 dimensions / tweet.

Since all tweets have fewer than 30 tokens, each input will be padded to 30 tokens with the empty string token (<>).
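The padding scheme can be sketched as follows. `vectorize_padded` is a hypothetical stand-in written for illustration, not the project's `pt.vectorize_tweet`, and the fallback for out-of-vocabulary tokens is an assumption. Random vectors replace the real GloVe embeddings.

```python
import numpy as np

def vectorize_padded(tweet, embeds, max_tokens=30, pad_token="<>"):
    """Hypothetical sketch: concatenate token embeddings, padded to max_tokens."""
    tokens = tweet.split()[:max_tokens]
    tokens += [pad_token] * (max_tokens - len(tokens))
    # assumption: tokens missing from the vocabulary fall back to the pad embedding
    return np.concatenate([embeds.get(tok, embeds[pad_token]) for tok in tokens])

d = 50
rng = np.random.default_rng(0)
toy_embeds = {w: rng.normal(size=d) for w in ["summer", "is", "lovely", "<>"]}

vec = vectorize_padded("summer is lovely", toy_embeds)
print(vec.shape)  # (1500,)
```

Every tweet maps to the same 30 x 50 = 1500 values, so the vectors can be stacked directly into a feature matrix.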

## Build feature matrices

The following 4 feature matrices are built and exported so they can be read back in during modeling:

+ **feats_matrix_aug.txt** - augmented data, 7485 rows, 1500 columns.  Each row is a vectorized tweet padded to 30 tokens, with the empty string token used for padding.  Each token is represented by a 50d GloVe twitter embedding, giving 30 x 50 = 1500 columns
+ **feats_matrix_train_train.txt** - 80% of the original training data used to train each model, same vectorization as **feats_matrix_aug.txt**, 5988 rows, 1500 columns
+ **feats_matrix_train_test.txt** - 20% of the original training data used to test each model, same vectorization as **feats_matrix_aug.txt**, 1497 rows, 1500 columns
+ **feats_matrix_test.txt** - unlabeled test data provided by Kaggle to test submissions, same vectorization as **feats_matrix_aug.txt**, 3263 rows, 1500 columns

The first 3 feature matrices have the following corresponding labels (`feats_matrix_test.txt` contains unlabeled tweets):
+ labels_aug.txt
+ labels_train_train.txt
+ labels_train_test.txt

In [11]:
display = pd.options.display
pd.set_option('display.max_colwidth', None)

# df_train_clean_v09 = pd.read_csv("./data/train_clean_v09.csv", encoding="utf8")  # read in at cell xx
train_clean_vec_text = df_train_clean_v09['text']
train_clean_vec_target = df_train_clean_v09['target']
train_targets = np.array(train_clean_vec_target)

df_train_clean_v09_class0 = df_train_clean_v09.loc[df_train_clean_v09['target'] == 0, :]
df_train_clean_v09_class1 = df_train_clean_v09.loc[df_train_clean_v09['target'] == 1, :]
df_aug_clean_v09 = pd.read_csv("./data/aug_clean_v09.csv", encoding="utf8")

In [12]:
# start with feats_matrix_aug.txt and labels_aug.txt
print(f"shape of df_aug_clean_v09 {df_aug_clean_v09.shape}")
# df_aug_clean_v09.head()

aug_clean_vec_id = df_aug_clean_v09['id']
aug_clean_vec_text = df_aug_clean_v09['text']
aug_clean_vec_target = df_aug_clean_v09['target']
# vectorize aug padded to 30 tokens
aug_clean_vec_text_vector_pad30 = [pt.vectorize_tweet(tweet, dict_glove_embs) for tweet in aug_clean_vec_text]
# build the feature matrix from the list of numpy arrays
feats_matrix_aug = pt.make_tweet_feats(aug_clean_vec_text_vector_pad30)  # takes ~1min, 20sec to run
labels_aug = np.array(aug_clean_vec_target)
print(f"shape of feats_matrix_aug: {feats_matrix_aug.shape}, dtype of feats_matrix_aug: {feats_matrix_aug.dtype}")
print(f"shape of labels_aug: {labels_aug.shape}, dtype of labels_aug: {labels_aug.dtype}")

shape of df_aug_clean_v09 (7485, 3)
Building tweet feature matrix took  0.99 minutes
shape of feats_matrix_aug: (7485, 1500), dtype of feats_matrix_aug: float64
shape of labels_aug: (7485,), dtype of labels_aug: int64


In [13]:
# save feats_matrix_aug and labels_aug as text files
np.savetxt(fname='./data/feats_matrix_aug.txt', X=feats_matrix_aug, fmt='%9.6f')  # use np.loadtxt to bring this back in
labels_aug = np.array(aug_clean_vec_target)
np.savetxt(fname='./data/labels_aug.txt', X=labels_aug, fmt='%9.6f')

In [14]:
# work on feats_matrix_train_train.txt, start by splitting training into train_train
# and train_test because test data is unlabeled
np.random.seed(711)
portion_test = 0.20  # 80/20 train/test split
# compute the number of training and testing samples
train_train_samples = int((1. - portion_test) * train_targets.shape[0])
train_test_samples = train_targets.shape[0] - train_train_samples
print(f"{train_train_samples} samples from the train set will be used to train the model")
print(f"{train_test_samples} samples from the train set will be used to test the model")

5988 samples from the train set will be used to train the model
1497 samples from the train set will be used to test the model


In [15]:
train_clean_vec_text_vector_pad30 = [pt.vectorize_tweet(tweet, dict_glove_embs) for tweet in train_clean_vec_text]
feats_matrix_train = pt.make_tweet_feats(train_clean_vec_text_vector_pad30)  # all train features: train_train & train_test
print(feats_matrix_train.shape, feats_matrix_train.dtype)

Building tweet feature matrix took  0.99 minutes
(7485, 1500) float64


In [16]:
# calc the number of samples from each class used to test
portion_class0_test = 0.5
class0_test_samples = int(train_test_samples * portion_class0_test)
class1_test_samples = train_test_samples - class0_test_samples
print(f"{class0_test_samples} class 0 train test samples, {class1_test_samples} class 1 train test samples")

748 class 0 train test samples, 749 class 1 train test samples


In [17]:
# compute indices for 748 class 0 and 749 class 1 random samples,
# need the [0] on the np.where calls because np.where(condition) returns a tuple
# with one index array per dimension, and train_targets is 1-d
train_test_inds = np.append(np.random.choice((np.where(train_targets==0))[0], class0_test_samples, replace=False),
                            np.random.choice((np.where(train_targets==1))[0], class1_test_samples, replace=False))
# compute training set indices from indices not in the test set
train_train_inds = list(set(range(len(train_targets))) - set(train_test_inds))
# original training data used to train model
feats_matrix_train_train = feats_matrix_train[train_train_inds, ]
labels_train_train = train_targets[train_train_inds, ]
print(f"shape of the train_train_train_data: {feats_matrix_train_train.shape}, shape of the train_train_labels: {labels_train_train.shape}")

shape of the train_train_train_data: (5988, 1500), shape of the train_train_labels: (5988,)
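The same balanced hold-out logic can be packaged as a small function. The sketch below is an illustrative variant, not the notebook's code: `balanced_split` is a hypothetical helper, the labels are made up, and it uses `np.random.default_rng`, which will not reproduce the indices drawn with `np.random.seed(711)`.

```python
import numpy as np

def balanced_split(labels, n_test, rng):
    """Hypothetical helper: return (train_idx, test_idx) with a ~50/50 test set."""
    n0 = n_test // 2
    n1 = n_test - n0
    test_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == 0), n0, replace=False),
        rng.choice(np.flatnonzero(labels == 1), n1, replace=False),
    ])
    # everything not drawn for testing is kept for training, in sorted order
    train_idx = np.setdiff1d(np.arange(labels.shape[0]), test_idx)
    return train_idx, test_idx

rng = np.random.default_rng(711)
toy_labels = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = balanced_split(toy_labels, n_test=20, rng=rng)
print(len(train_idx), len(test_idx), int(toy_labels[test_idx].sum()))
```

Note that forcing a 50/50 test set means the test class balance can differ from that of the remaining training rows whenever the original classes are not evenly split.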


In [18]:
# write out the train_train_data and train_train_labels
np.savetxt(fname='./data/feats_matrix_train_train.txt', X=feats_matrix_train_train, fmt='%9.6f')
np.savetxt(fname='./data/labels_train_train.txt', X=labels_train_train, fmt='%9.6f')

In [19]:
# work on feats_matrix_train_test.txt and labels_train_test.txt
feats_matrix_train_test = feats_matrix_train[train_test_inds, ]
labels_train_test = train_targets[train_test_inds, ]
print(f"shape of the train_test_train_data: {feats_matrix_train_test.shape}, shape of the train_test_labels: {labels_train_test.shape}")
# write them out
np.savetxt(fname='./data/feats_matrix_train_test.txt', X=feats_matrix_train_test, fmt='%9.6f')
np.savetxt(fname='./data/labels_train_test.txt', X=labels_train_test, fmt='%9.6f')

shape of the train_test_train_data: (1497, 1500), shape of the train_test_labels: (1497,)


In [20]:
# build and write the feature matrix for the unlabeled test data
df_test_clean_v09 = pd.read_csv("./data/test_clean_v09.csv", encoding="utf8")
test_clean_vec_text = df_test_clean_v09['text']
test_clean_vec_text_vector_pad30 = [pt.vectorize_tweet(tweet, dict_glove_embs) for tweet in test_clean_vec_text]
feats_matrix_test = pt.make_tweet_feats(test_clean_vec_text_vector_pad30)
print(f"shape of the feats_matrix_test: {feats_matrix_test.shape}")
np.savetxt(fname='./data/feats_matrix_test.txt', X=feats_matrix_test, fmt='%9.6f')

Building tweet feature matrix took  0.19 minutes
shape of the feats_matrix_test: (3263, 1500)


In [21]:
# read in the data and check that it's similar to what we wrote out
feats_matrix_aug_ff = np.loadtxt('./data/feats_matrix_aug.txt')
feats_matrix_train_train_ff = np.loadtxt('./data/feats_matrix_train_train.txt')
feats_matrix_train_test_ff = np.loadtxt('./data/feats_matrix_train_test.txt')
feats_matrix_test_ff = np.loadtxt('./data/feats_matrix_test.txt')
# test the load
print(f"shape of computed feats_matrix_aug: {feats_matrix_aug.shape}, shape of read in feats_matrix_aug: {feats_matrix_aug_ff.shape}")
print(np.allclose(feats_matrix_aug, feats_matrix_aug_ff, rtol=1e-04, atol=1e-06))
# print(np.allclose(feats_matrix_aug, feats_matrix_aug_ff))  # default tolerances (rtol=1e-05, atol=1e-08) can fail because fmt='%9.6f' keeps only 6 decimals
print(np.allclose(feats_matrix_train_train, feats_matrix_train_train_ff, rtol=1e-04, atol=1e-06))
print(np.allclose(feats_matrix_train_test, feats_matrix_train_test_ff, rtol=1e-04, atol=1e-06))
print(np.allclose(feats_matrix_test, feats_matrix_test_ff, rtol=1e-04, atol=1e-06))

shape of computed feats_matrix_aug: (7485, 1500), shape of read in feats_matrix_aug: (7485, 1500)
True
True
True
True
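The looser tolerances are needed because `fmt='%9.6f'` writes only 6 decimal places, so each reloaded value can differ from the original by up to 5e-7. A small round trip on made-up values shows the effect; the file name and array are illustrative only.

```python
import os
import tempfile
import numpy as np

original = np.array([[0.0123456789, -0.4050149],
                     [1.14090001, 0.0000004]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_feats.txt")
    np.savetxt(fname=path, X=original, fmt="%9.6f")  # keeps 6 decimals
    reloaded = np.loadtxt(path)

max_abs_err = np.abs(original - reloaded).max()
strict_ok = np.allclose(original, reloaded)  # defaults: rtol=1e-05, atol=1e-08
loose_ok = np.allclose(original, reloaded, rtol=1e-04, atol=1e-06)
print(max_abs_err, strict_ok, loose_ok)
```

If exact round trips matter, a wider format such as `fmt='%.18e'` or the binary `np.save` / `np.load` pair avoids the rounding, at the cost of larger or non-human-readable files.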
