# LLM Training Data Augmentation - Classification of Kaggle Disaster Data

The goal of this notebook is to prepare the data for augmentation by an LLM and classification by two models:

1. Logistic regression
2. Single hidden-layer neural network

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition at:

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files:
+ *train.csv* - 7613 labeled tweets
+ *test.csv* - 3263 unlabeled tweets

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, 6090 labeled tweets
+ train_test.csv - held out and not used to train model, used as *pseudo-test* data, 1523 labeled tweets (~20% of the original training sample)
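
The notebook does not say how the split was produced, so the following is only a minimal sketch of one way to obtain the 6090/1523 sizes quoted above. It assumes a plain random split; the function name `split_rows`, the seed, and the use of row indices instead of real rows are all hypothetical.

```python
import random

def split_rows(rows, holdout_frac=0.2, seed=42):
    """Shuffle rows and split them into (train, holdout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = round(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Row indices stand in for the 7613 rows of train.csv.
train_rows, holdout_rows = split_rows(range(7613))
print(len(train_rows), len(holdout_rows))  # 6090 1523
```

A stratified split on `target` would keep the class balance identical in both files, which may be preferable if the classes are imbalanced.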

## Non-Transformer Models

Two types of models are created and compared:

1. Logistic regression - this serves as the baseline
2. Single hidden-layer neural network with 1000 nodes in the hidden layer

## LLM

ChatGPT 3.5 turbo will be used to augment the data used to train the models.

## Encodings

The Twitter GloVe embedding will be used to vectorize the input text.  These embeddings were downloaded from:

https://nlp.stanford.edu/data/glove.twitter.27B.zip
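
The GloVe text files store one token per line followed by its space-separated vector components. The sketch below shows how such a file could be parsed and how a tweet could be turned into a single vector. Averaging the token vectors is an assumption made for illustration; the notebook does not state how token vectors are combined. The names `load_glove` and `mean_vector` and the tiny 3-dimensional sample are hypothetical.

```python
import io

def load_glove(lines):
    """Parse GloVe text lines ('token v1 v2 ...') into {token: [floats]}."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Made-up 3-dimensional sample standing in for a real GloVe file.
sample = io.StringIO("<url> 0.1 0.2 0.3\nfire 0.5 0.0 -0.5\n")
emb = load_glove(sample)
vec = mean_vector(["fire", "<url>", "unknownword"], emb, dim=3)
```

With a real file, `load_glove(open(path, encoding="utf8"))` would take the place of the in-memory sample.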


# Preprocessing

## Manual inspection of train.csv

The issues observed in the data are listed below, numbered in the order in which they were processed; for example, spillover lines were fixed first. This order matters because removing punctuation too early would make it impossible to identify user names or hashtags in a tweet and would invalidate URLs.

### 1. Spillover lines

The first issue with this data is that while most of the samples are on their own line, some spill over onto adjacent lines. Here are a few examples:

>`61,ablaze,,"on the outside you're ablaze and alive`  
>`but you're dead inside",0`  
>`74,ablaze,India,"Man wife get six years jail for setting ablaze niece`  
>`http://t.co/eV1ahOUCZA",1`  
>`86,ablaze,Inang Pamantasan,"Progressive greetings!`  
>  
>`In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0`  
>`117,accident,,"mom: 'we didn't get home as fast as we wished'`  
>`me: 'why is that?'`  
>`mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0`

The custom function `fix_spillover_lines` was written to fix these lines. Its code is available in the projtools module.
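
The code of `fix_spillover_lines` is not shown here, so the following is only a sketch of one possible approach, not the projtools implementation. It relies on the fact that the spillover occurs inside quoted fields, which Python's `csv` module parses as a single field; the embedded line breaks are then collapsed to spaces. The function name `fix_spillover_sketch` is hypothetical.

```python
import csv
import io

def fix_spillover_sketch(raw_csv_text):
    """Parse CSV text and collapse whitespace runs (including line breaks) in each field."""
    reader = csv.reader(io.StringIO(raw_csv_text, newline=""))
    rows = []
    for row in reader:
        # str.split() with no arguments splits on any whitespace, newlines included.
        rows.append([" ".join(field.split()) for field in row])
    return rows

raw = (
    "id,keyword,location,text,target\n"
    '61,ablaze,,"on the outside you\'re ablaze and alive\n'
    'but you\'re dead inside",0\n'
)
rows = fix_spillover_sketch(raw)
print(rows[1][3])  # on the outside you're ablaze and alive but you're dead inside
```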

### 2. Text-Target Duplicates

If two or more rows in the original data have the same values in the `text` and `target` fields (columns), then these are considered **text-target duplicates**. Because we are only using the content of the tweet (value in the `text` field) to classify it as **disaster** or **not disaster**, only one of these instances provides useful information for the model to learn. For this reason, only the first instance of each set of duplicates is retained and the remainder are discarded from the training set.
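
A minimal sketch of this rule using pandas, on made-up rows. Whether the notebook uses `drop_duplicates` for this step is an assumption.

```python
import pandas as pd

# Made-up rows: ids 1 and 2 are text-target duplicates of each other.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["fire downtown", "fire downtown", "nice day", "fire downtown"],
    "target": [1, 1, 0, 0],
})

# Keep only the first row of each (text, target) pair.
deduped = df.drop_duplicates(subset=["text", "target"], keep="first")
print(deduped["id"].tolist())  # [1, 3, 4]
```

Row 4 survives this step because its `target` differs from rows 1 and 2, even though the text is the same.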

### 3. Cross-Target Duplicates

When two or more rows in the data have the same `text` value (tweet content) but different `target` values, these rows are considered **cross-target duplicates**. Examples of this type of duplicate are shown below. Since we don't know which row has the correct target value, all duplicates of this type are removed from the training set.

<img src='./visuals/cross_target_dupes.png'></img>
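
One way to express this rule in pandas is to count the distinct `target` values for each `text` and keep only texts with a single label. This is a sketch on made-up rows, not necessarily how the notebook implements the step.

```python
import pandas as pd

# Made-up rows: ids 1 and 3 share the same text but carry different targets.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["fire downtown", "nice day", "fire downtown", "flood warning"],
    "target": [1, 0, 0, 1],
})

# Number of distinct targets per text, broadcast back onto every row.
n_labels = df.groupby("text")["target"].transform("nunique")

# Drop every row whose text appears with more than one target.
cleaned = df[n_labels == 1]
print(cleaned["id"].tolist())  # [2, 4]
```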

### 4. Normalizing URLs

Some tweets contain one or more URLs. I assume that the content of a URL does not carry any useful information, but since a `<url>` token exists in the Twitter GloVe embeddings, URLs will be replaced by this token.

Although the actual URL may not contain much useful information, the number of URLs occurring in a tweet may be a useful feature, so URLs are counted before they are normalized. About 90% of the URLs in the training data are of the form `http://t.co/<10-character hash>`, for example `http://t.co/9FxPiXQuJt`. In about 10% of cases, these URLs start with `https://`.

The `replace_urls` function replaces each URL by the string "<url>" for the reasons stated above.
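
The counting and replacement steps can be sketched with a single regular expression. The pattern and the names `count_urls` and `replace_urls_sketch` are assumptions for illustration; the actual `replace_urls` in projtools may use a different pattern.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")

def count_urls(text):
    """Count URLs in a tweet; call this before the URLs are replaced."""
    return len(URL_RE.findall(text))

def replace_urls_sketch(text):
    """Replace every URL with the <url> placeholder token."""
    return URL_RE.sub("<url>", text)

tweet = "Man wife get six years jail for setting ablaze niece http://t.co/eV1ahOUCZA"
n_urls = count_urls(tweet)
cleaned = replace_urls_sketch(tweet)
print(n_urls, cleaned)
```

Because `\S+` is greedy, punctuation that directly follows a URL is swallowed along with it, which is acceptable when the URL content is discarded anyway.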

<s>#### 2.1 Counting URLs in each tweet</s>

<s>The custom function `make_url_counts` is used to create a `url_count` feature/column.  This needs to be called before calling `replace_urls`.</s>

### 5. Process Twitter-specific characters

Because the `@` and `#` characters have special meaning in tweets, they need to be processed before other punctuation is removed. `@<username>` in a tweet is a reference to the user whose name is `username`. `#<hashname>` specifies a hashtag, which is a reference to all tweets that use the `hashname` hashtag. In processing these characters, `@<username>` is converted to `<user> username` and `#<hashname>` is converted to `<hashtag> hashname`. These replacement tokens were selected because they also have mappings in the embeddings file described in the **Encodings** section.
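
The conversion described above can be sketched with two regular-expression substitutions. The function name `replace_twitter_specials_sketch` and the `\w+` pattern for user and hashtag names are assumptions; the projtools function `replace_twitter_specials` may differ.

```python
import re

def replace_twitter_specials_sketch(text):
    """Turn @name into '<user> name' and #tag into '<hashtag> tag'."""
    text = re.sub(r"@(\w+)", r"<user> \1", text)
    text = re.sub(r"#(\w+)", r"<hashtag> \1", text)
    return text

converted = replace_twitter_specials_sketch("@nasa watching the #wildfire from orbit")
print(converted)  # <user> nasa watching the <hashtag> wildfire from orbit
```

Running this after URL normalization avoids rewriting a `#fragment` that is part of a URL.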

### 6. Expanding Contractions

Contraction fragments are included as vectorized tokens in the Twitter GloVe embeddings, which means that we don't need to expand contractions manually. The spaCy tokenizer separates the first word from the contraction fragment: e.g., "you're" is tokenized into `["you", "'re"]`. Because the embeddings file has an entry for the contraction fragment token `'re` (as well as other contraction fragments such as `'m`, `'s`, `'ll`, etc.), we don't need to convert these to their actual word forms (e.g., "am", "is", "will", etc.) before vectorizing.

### 7. Normalize digits, remove stop words, lemmatize, make lower case and remove punctuation

The function `spacy_digits_and_stops` performs all of the tasks listed in this section, lower-casing the lemmatized text as a final step. Digit normalization replaces each run of consecutive digits with the token `<number>`. As with the other token replacements, this one was chosen because it has a mapping in the embeddings file.
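
The sketch below illustrates digit normalization, lower-casing, punctuation removal and stop-word removal using only the standard library. It is not the `spacy_digits_and_stops` implementation: lemmatization is omitted, and the stop-word set is a tiny hypothetical stand-in for spaCy's list.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration; the notebook uses spaCy's list.
STOP_WORDS = {"the", "a", "of", "in", "is"}

# Keep '<' and '>' so placeholder tokens such as <number> survive punctuation removal.
PUNCT = "".join(ch for ch in string.punctuation if ch not in "<>")

def normalize_sketch(text):
    """Replace digit runs, lower-case, strip punctuation, drop stop words."""
    text = re.sub(r"\d+", "<number>", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", PUNCT))
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

normalized = normalize_sketch("The fire destroyed 120 homes in 3 days!")
print(normalized)  # fire destroyed <number> homes <number> days
```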




## Test the complete pre-processing pipeline

The augmented data doesn't need the following steps that the original data does:

1. Fix spillover lines
2. Fix text-target duplicates
3. Fix cross-target duplicates

The following pre-processing steps are shared between the augmented and original data:

4. Replace URLs with the `<url>` token
5. Replace the Twitter-specific characters `@` with `<user>` and `#` with `<hashtag>`
6. Expand the contractions
7. Normalize digits, remove stop words, lemmatize, remove punctuation and make everything lower case

In [1]:
import pandas as pd
import spacy

df_train_v03 = pd.read_csv("./data/train_clean_v03.csv", encoding="utf8")
df_test_v01 = pd.read_csv("./data/test_clean_v01.csv", encoding="utf8")
print(df_train_v03.shape, df_test_v01.shape)

(7485, 5) (3263, 4)


In [2]:
nlp = spacy.load("en_core_web_md")

# Words to remove from spaCy's default stop-word list so that they are kept in the text

not_stops = {"you", "on", "not", "from", "was", "but", "your", "all", "no", "when",
             "now", "more", "over", "some", "first", "full", "down", "may", "only",
             "last", "many", "never", "any", "everyone", "every", "before", "under",
             "top", "most", "during", "next", "while", "call", "very", "nothing", 
              "anything", "everything", "sometimes", "serious", "everywhere", "none",
              "except", "within", "above", "below", "nobody", "afterwards", "anywhere"}
nlp.Defaults.stop_words -= not_stops  # nlp is instantiated above in this cell

In [3]:
import projtools as pt

train_v03_id = df_train_v03['id'].tolist()
train_v03_keyword = df_train_v03['keyword'].tolist()
train_v03_location = df_train_v03['location'].tolist()
train_v03_text = df_train_v03['text'].tolist()  # the only field the pipeline modifies
train_v03_target = df_train_v03['target'].tolist()

train_v04_text_urls_fixed = pt.replace_urls(train_v03_text)
train_v05_text_fixed = pt.replace_twitter_specials(train_v04_text_urls_fixed)
train_v06_text_fixed = pt.expand_contractions(train_v05_text_fixed)
df_contractions_expanded = pd.DataFrame({'id': train_v03_id,
                                         'text': train_v06_text_fixed})
dict_df_stops = pt.spacy_digits_and_stops(df_contractions_expanded)
df_train_clean_v07 = dict_df_stops['df']
stops_removed = dict_df_stops['stops_removed']
train_clean_v07_text_fixed = df_train_clean_v07['text'].tolist()

df_train_v07_full_pipe = pd.DataFrame({'id': train_v03_id,
                                       'keyword': train_v03_keyword,
                                       'location': train_v03_location,
                                       'text': train_clean_v07_text_fixed,
                                       'target': train_v03_target})

df_train_v07_full_pipe.to_csv(path_or_buf="./data/train_clean_v07b.csv", index=False, encoding='utf-8')

In [4]:
df_train_v07_full_pipe_function = pt.preproccess_pipeline("./data/train_clean_v03.csv", "./data/train_clean_v07c.csv")