# LLM Training Data Augmentation - Classification of Kaggle Disaster Data

The goal of this notebook is to explore whether a Hugging Face model (HFM) can enhance the performance of non-transformer-based text classification models by augmenting the training data.

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition at:  

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files:
+ *train.csv* - 7613 labeled tweets
+ *test.csv* - 3236 unlabeled tweets

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, 6090 labeled tweets
+ train_test.csv - held out and not used to train model, used as *pseudo-test* data, 1523 labeled tweets (~20% of the original training sample)
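
A minimal sketch of how an 80/20 hold-out split of this size could be produced. The helper name `split_indices`, the shuffling strategy, and the seed are assumptions for illustration; the notebook does not state how the actual split was made.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle row indices and hold out a fraction as pseudo-test data."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seed is an arbitrary choice for reproducibility
    n_test = round(n_rows * test_fraction)    # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(7613)
print(len(train_idx), len(test_idx))  # 6090 1523
```

The two index lists can then be used to select the rows written to *train_model.csv* and *train_test.csv*.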

## Non-Transformer Models

Two types of models are created and compared:

1. Logistic Regression - This serves as the baseline
2. Single-hidden-layer neural network with 1000 nodes in the hidden layer

## Hugging Face Models

The *TBD* Hugging Face transformer model was used to provide both uninformed and informed assistance through augmenting the data used to train the non-transformer-based models.

## Encodings

Two types of encodings are used to vectorize the inputs:

1. One-hot encoding
2. Twitter GloVe embedding: https://nlp.stanford.edu/data/glove.twitter.27B.zip


# Preprocessing

## Manual inspection of train.csv

The issues observed in the data are listed below, numbered in the order in which they were processed.  For example, spillover lines were fixed first, then URLs, and so on.  This order matters because removing punctuation too early would make it impossible to identify user names or hashtags in a tweet and would break URLs.

### 1. Spillover lines

The first issue with this data is that, while most of the samples are on their own line, some tweets contain line breaks and spill over onto several lines. Here are a few examples:

>`61,ablaze,,"on the outside you're ablaze and alive`  
>`but you're dead inside",0`  
>`74,ablaze,India,"Man wife get six years jail for setting ablaze niece`  
>`http://t.co/eV1ahOUCZA",1`  
>`86,ablaze,Inang Pamantasan,"Progressive greetings!`  
>  
>`In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0`  
>`117,accident,,"mom: 'we didn't get home as fast as we wished'`  
>`me: 'why is that?'`  
>`mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0`

The custom function `fix_spillover_lines` was written to fix these lines. Its code is available in the projtools module.
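
The idea behind such a fix can be sketched as follows. This is not the code from the projtools module; `join_spillover_lines` is a hypothetical stand-in that assumes a record is still open while an odd number of double quotes has been seen, and that fragments are joined with a single space.

```python
def join_spillover_lines(lines):
    """Hypothetical sketch: merge physical lines that belong to one quoted CSV record."""
    fixed = []
    buffer = []
    inside_quotes = False
    for raw in lines:
        line = raw.rstrip("\n")
        buffer.append(line)
        # an odd number of double quotes on a line opens or closes a quoted field
        if line.count('"') % 2 == 1:
            inside_quotes = not inside_quotes
        if not inside_quotes:
            fixed.append(" ".join(buffer))
            buffer = []
    if buffer:  # unterminated record at end of input: keep what was collected
        fixed.append(" ".join(buffer))
    return fixed

sample = [
    "id,keyword,location,text,target\n",
    '61,ablaze,,"on the outside you\'re ablaze and alive\n',
    'but you\'re dead inside",0\n',
]
for fixed_line in join_spillover_lines(sample):
    print(fixed_line)
```

Because escaped quotes inside a CSV field come in pairs (`""`), they do not change the parity that the sketch relies on.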

### 2. Normalizing URLs

Some tweets contain one or more URLs.  I assume that the content of a URL does not carry any useful information, but since a `<url>` token exists in the Twitter GloVe embeddings, URLs will be replaced by this token.  

Although the actual URL may not contain much useful information, the number of URLs occurring in a tweet may be a useful feature, so URLs are counted before they are normalized.  About 90% of the URLs in the training data are of the form `http://t.co/<10-character hash>`. For example: `http://t.co/9FxPiXQuJt`.  In about 10% of cases, these URLs start with `https://`.

#### 2.1 Counting URLs in each tweet

The custom function `make_url_counts` is used to create a `url_count` feature/column.  This needs to be called before calling `replace_urls`.

#### 2.2 Normalizing URLs

The `replace_urls` function replaces each URL with the string `<url>` for the reasons stated above.  It needs to be called after `make_url_counts` is called.
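
The count-then-replace order can be illustrated with a short sketch. The regular expression and the helper names `count_urls` and `replace_urls_in_text` are assumptions, and the sketch works on the tweet text alone rather than on whole CSV lines as the projtools functions do.

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+")

def count_urls(text):
    """Hypothetical helper: number of URLs found in a tweet's text."""
    return len(URL_PATTERN.findall(text))

def replace_urls_in_text(text):
    """Hypothetical helper: swap every URL for the <url> placeholder token."""
    return URL_PATTERN.sub("<url>", text)

tweet = "INEC Office in Abia Set Ablaze - http://t.co/xxxxxxxxxx"  # made-up hash
url_count = count_urls(tweet)          # must happen first, while the URLs still exist
tweet = replace_urls_in_text(tweet)
print(url_count, tweet)
```

Counting after replacement would require matching the `<url>` token instead, which is why the order of the two calls matters.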


### 3. Process Twitter-specific characters

Because the `@` and `#` characters have special meaning in tweets, they need to be processed before removing other punctuation.  When a `@<username>` is seen in a tweet, it is a reference to a user whose name is `username`.  When a `#<hashname>` is seen in a tweet, it specifies a hashtag, which is a reference to all tweets that use the `hashname` hashtag.  In processing these characters, `@<username>` is converted to `<user> username` and `#<hashname>` is converted to `<hashtag> hashname`.  These replacement tokens were selected because they also have mappings in the embeddings file described in the **Normalizing URLs** section.
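
A minimal sketch of this conversion, assuming that user names and hashtag names consist of word characters only. The patterns and the helper name `replace_specials_in_text` are illustrative and are not taken from the projtools module.

```python
import re

MENTION = re.compile(r"@(\w+)")   # assumption: a user name is a run of word characters
HASHTAG = re.compile(r"#(\w+)")   # assumption: same for hashtag names

def replace_specials_in_text(text):
    """Hypothetical helper: rewrite @name and #name using the embedding placeholder tokens."""
    text = MENTION.sub(r"<user> \1", text)
    text = HASHTAG.sub(r"<hashtag> \1", text)
    return text

print(replace_specials_in_text("@bbcmtd Wholesale Markets ablaze #metal"))
```

A pattern this simple would also rewrite the `@` inside an email address, so a stricter pattern may be needed if such text appears in the data.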

### 4. Tokenize text and clean up non-words

#### 4.1 Contractions

Contraction fragments are included as vectorized tokens in the Twitter GloVe embeddings, which means that we don't need to expand them manually.  The spaCy tokenizer separates the first word from the contraction fragment: e.g. "you're" is tokenized into `["you", "'re"]`.  Because the embeddings file has an entry for the contraction fragment token `'re` (as well as other contraction fragments such as `'m`, `'s`, `'ll`, etc.), we don't need to convert these to their actual word forms (e.g. "are", "am", "is", "will", etc.) before vectorizing.
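
To illustrate the lookup, here is a small sketch that parses text in the GloVe format (one token per line, followed by space-separated vector components) and checks that both tokens of a split contraction are present. The helper `parse_glove_lines` is hypothetical and the vector values are made up; the real file has far more entries and dimensions.

```python
import io

def parse_glove_lines(handle):
    """Hypothetical helper: read 'token v1 v2 ...' lines into a dict of float lists."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Toy stand-in for the embeddings file; values are invented for illustration
toy_file = io.StringIO("you 0.1 0.2 0.3\n're 0.4 0.5 0.6\n")
toy_embeddings = parse_glove_lines(toy_file)

tokens = ["you", "'re"]  # what the tokenizer is described as producing for "you're"
print([token in toy_embeddings for token in tokens])  # [True, True]
```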

#### 4.2 Remove remaining punctuation

The custom function `replace_with_space` is run to remove any remaining punctuation.
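
One way such a step could work is sketched below. It assumes that the placeholder tokens introduced earlier must survive and that apostrophes are kept so contraction fragments remain intact; both are assumptions about the intended behavior, and `punctuation_to_space` is a hypothetical helper rather than the projtools implementation.

```python
import re
import string

# Placeholder tokens that must not be broken up (assumed set)
PLACEHOLDERS = re.compile(r"(<url>|<user>|<hashtag>|<number>)")
# Map every punctuation character except the apostrophe to a space
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation if ch != "'"})

def punctuation_to_space(text):
    """Hypothetical helper: blank out punctuation outside the placeholder tokens."""
    parts = PLACEHOLDERS.split(text)  # the capturing group keeps the placeholders in the result
    cleaned = [p if PLACEHOLDERS.fullmatch(p) else p.translate(PUNCT_TO_SPACE) for p in parts]
    return " ".join("".join(cleaned).split())  # collapse repeated whitespace

print(punctuation_to_space("INEC Office in Abia Set Ablaze - <url>"))
```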

#### 4.3 Replace digits

The function `replace_numbers` is run to replace each run of consecutive digits with the token `<number>`.  As with the other token replacements, this one was chosen because it has a mapping in the embeddings file.
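
A minimal sketch of this replacement, assuming a run of digits is matched with `\d+`. The helper name `replace_numbers_in_text` is hypothetical, and the projtools version may treat cases such as decimals differently.

```python
import re

DIGIT_RUN = re.compile(r"\d+")  # one or more consecutive digits

def replace_numbers_in_text(text):
    """Hypothetical helper: replace each run of digits with the <number> token."""
    return DIGIT_RUN.sub("<number>", text)

print(replace_numbers_in_text("Est. September 2012 - Bristol"))
```

With this pattern a value such as `3.5` becomes `<number>.<number>`, because the decimal point is not a digit.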



In [1]:
import numpy as np
import string as st
import matplotlib as mp
import matplotlib.pyplot as plt

# To get around the "UnicodeDecodeError: 'charmap' codec can't decode byte ..." error,
# the encoding needs to be specified when reading this data in, as described in the solution I upvoted here:
# https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character
# with open("./data/train.csv", encoding="utf8") as f:  # works, but setting errors removes unneeded chars
with open("./data/train.csv", encoding="utf8", errors='ignore') as f_train:
    content_train = f_train.readlines()

with open("./data/test.csv", encoding="utf8", errors='ignore') as f_test:
    content_test = f_test.readlines()

print(len(content_train), len(content_test))  # 8562, 3700  BEFORE applying any fixes

8562 3700


In [2]:
# print some examples of spillover lines
with open("./debug/train_debug_spillover_chunk.txt", encoding="utf8", errors='ignore') as f:
    content_train_debug = f.readlines()

for i in [0, 42, 43, 53, 54, 64, 65, 66]:
    print(content_train_debug[i].strip())

id,keyword,location,text,target
61,ablaze,,"on the outside you're ablaze and alive
but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece
http://t.co/eV1ahOUCZA",1
86,ablaze,Inang Pamantasan,"Progressive greetings!

In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0


In [3]:
import projtools as pt
# test the fix for the spillover lines on the training data
fixed_train_debug = pt.fix_spillover_lines(content_train_debug)

# for i, line in enumerate(fixed_list):
#     print(i, line)

# check that good lines are still good and spillover lines (*) are fixed
# id = header 32 53 *61 *74 *86 *117 119 120
for j in [0, 22, 36, 42, 52, 62, 81, 83, 84]:
    print(fixed_train_debug[j])  # spillover lines are now consolidated to a single line

id,keyword,location,text,target
32,,,London is cool ;),0
53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE http://t.co/qqsmshaJ3N,0
61,ablaze,,"on the outside you're ablaze and alive but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece http://t.co/eV1ahOUCZA",1
86,ablaze,Inang Pamantasan,"Progressive greetings!  In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0
117,accident,,"mom: 'we didn't get home as fast as we wished' me: 'why is that?' mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0
119,accident,,Can wait to see how pissed Donnie is when I tell him I was in ANOTHER accident??,0
120,accident,"Arlington, TX",#TruckCrash Overturns On #FortWorth Interstate http://t.co/Rs22LJ4qFp Click here if you've been in a crash&gt;http://t.co/Ld0unIYw4k,1


In [4]:
# fix the spillover lines in the train and test data, then write out fixed data
# fixed_train = pt.fix_spillover_lines(content_train)
# with open(file='./data/train_clean_v01.csv', mode='w', encoding="utf8", errors='ignore') as f_train_out:
#     for line in fixed_train:
#         f_train_out.write(line)
#         f_train_out.write('\n')

# fixed_test = pt.fix_spillover_lines(content_test)
# with open(file='./data/test_clean_v01.csv', mode='w', encoding="utf8", errors='ignore') as f_test_out:
#     for line in fixed_test:
#         f_test_out.write(line)
#         f_test_out.write('\n')

## Normalizing URLs

In [5]:
# normalize urls
with open("./data/train_clean_v01.csv", encoding="utf8", errors='ignore') as f:
    v01_train_lines = f.readlines()

with open("./data/test_clean_v01.csv", encoding="utf8", errors='ignore') as f:
    v01_test_lines = f.readlines()

# first, count the urls before replacing them
v02_train_lines = pt.make_url_counts(v01_train_lines)
v02_test_lines = pt.make_url_counts(v01_test_lines)

# second, replace the urls with the <url> token
v02_train_lines = pt.replace_urls(v02_train_lines)
v02_test_lines = pt.replace_urls(v02_test_lines)


# look at some lines that have urls to check if they are getting counted
for i in range(32, 42):
    print(v02_train_lines[i])
print()
for j in range(16, 26):
    print(v02_test_lines[j])

# write file with url fixes
# pt.write_lines_to_csv(v02_train_lines, "./data/train_clean_v02.csv")
# pt.write_lines_to_csv(v02_test_lines, "./data/test_clean_v02.csv")

48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze <url>,1,1
49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT <url>,0,1
50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set ablaze in Aba. <url>,1,1
52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0,0
53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE <url>,0,1
54,ablaze,Pretoria,@PhDSquares #mufc they've built so much hype around new acquisitions but I doubt they will set the EPL ablaze this season.,0,0
55,ablaze,World Wide!!,INEC Office in Abia Set Ablaze - <url>,1,1
56,ablaze,,Barbados #Bridgetown JAMAICA ÛÒ Two cars set ablaze: SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintende...  <url>,1,1
57,ablaze,Paranaque City,Ablaze for you Lord :D,0,0
59,ablaze,Live On Webcam,Check these out: <url> <url> <url> <url> #nsfw,0,4

46,ablaze,London,Birmingham Wholesale Market is ablaze BBC News - Fire breaks out at Birmingham's Wholesale Market <url

## Process Twitter-specific characters

In [6]:
# process twitter-specific chars
with open("./data/train_clean_v02.csv", encoding="utf8", errors='ignore') as f:
    v02_train_lines = f.readlines()

with open("./data/test_clean_v02.csv", encoding="utf8", errors='ignore') as f:
    v02_test_lines = f.readlines()

v03_train_lines = pt.replace_twitter_specials(v02_train_lines)
v03_test_lines = pt.replace_twitter_specials(v02_test_lines)

# write file with twitter-specific chars fixes
# pt.write_lines_to_csv(v03_train_lines, "./data/train_clean_v03.csv")
# pt.write_lines_to_csv(v03_test_lines, "./data/test_clean_v03.csv")

## Tokenize text and clean up non-words

In [7]:
# load GloVe embeddings into dict
dict_embeddings = pt.get_glove_embeds(embed_path = "./embeddings/glove.twitter.200d.TEST.txt")
# test the embeddings read
test_keys = ["<user>", "na", "all"]
# for key in test_keys:
#     print("key = ", key)
#     print("value = ", dict_embeddings[key])

# load embeddings into spaCy Vocab


# export Vocab to save time later

Indexing word vectors.
Found 174 word vectors.


In [12]:
import spacy as sp

nlp = sp.load("en_core_web_lg")  # load language model (name matches the message printed below)
print("spaCy en_core_web_lg model loaded...")

# load embeddings into spaCy Vocab
vocab = sp.vocab.Vocab()
for i, (token, vector) in enumerate(dict_embeddings.items()):
    if i % 10 == 0:
        print("loading token ", i, " which is ", token)
    vocab.set_vector(token, vector)

# test the vocab load - grab a few vectors
for token in test_keys:
    print(f"token {token} has vector:")
    print(vocab.get_vector(token))

spaCy en_core_web_lg model loaded...
loading token  0  which is  <user>
loading token  10  which is  i
loading token  20  which is  )
loading token  30  which is  no
loading token  40  which is  la
loading token  50  which is  o
loading token  60  which is  >
loading token  70  which is  are
loading token  80  which is  we
loading token  90  which is  ♥
loading token  100  which is  _
loading token  110  which is  now
loading token  120  which is  ~
loading token  130  which is  people
loading token  140  which is  're
loading token  150  which is  >>
loading token  160  which is  [
loading token  170  which is  q
token <user> has vector:
[ 3.1553e-01  5.3765e-01  1.0177e-01  3.2553e-02  3.7980e-03  1.5364e-02
 -2.0344e-01  3.3294e-01 -2.0886e-01  1.0061e-01  3.0976e-01  5.0015e-01
  3.2018e-01  1.3537e-01  8.7039e-03  1.9110e-01  2.4668e-01 -6.0752e-02
 -4.3623e-01  1.9302e-02  5.9972e-01  1.3444e-01  1.2801e-02 -5.4052e-01
  2.7387e-01 -1.1820e+00 -2.7677e-01  1.1279e-01  4.6596e-01 

In [None]:
import pandas as pd
import spacy as sp

# break out the text column so it can be operated on separately
df_train04 = pd.read_csv('./data/train_clean_v03.csv', encoding="utf8")
df_test04 = pd.read_csv('./data/test_clean_v03.csv', encoding="utf8")

# df_train04 = pd.read_csv('./data/train_clean_v01.csv', encoding="utf8")
# df_test04 = pd.read_csv('./data/test_clean_v01.csv', encoding="utf8")

train_text_lines = df_train04['text'].to_list()
test_text_lines = df_test04['text'].to_list()

# the v03 files are the ones loaded above, so report those
print("..._clean_v03.csv tweets text loaded...")

nlp = sp.load("en_core_web_lg")  # load language model
print("spaCy en_core_web_lg model loaded...")

train_docs = []
# create spacy doc objects
for train_line in train_text_lines:
    train_docs.append(nlp(train_line))

print()  # a bare `print` without parentheses is a no-op

test_docs = []
for test_line in test_text_lines:
    test_docs.append(nlp(test_line))

for i in range(0, 40):
    print([token.text for token in train_docs[i]])

In [None]:
# need to understand the available tokens in the lg spaCy model


In [None]:
# read in training data as dataframe
# import pandas as pd

# df_train = pd.read_csv('./data/train_clean_v01.csv', encoding="utf8")
# print(df_train.shape)
# df_train.head()