# Classification of Kaggle Disaster Data

The goal of this notebook is to explore whether a HuggingFace model (HFM) can enhance the performance of non-transformer-based text classification models by augmenting the training data.

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition at:

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files: *train.csv* (x labeled tweets) and *test.csv* (y unlabeled tweets).

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, x labeled tweets
+ train_test.csv - not used to train model, used as *pseudo-test* data, y labeled tweets 
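
A minimal sketch of how such a split could be made with the standard library only. The 80/20 ratio, the fixed seed, and the helper name `split_rows` are assumptions for illustration; the notebook does not state how the split was actually performed.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle data rows and split them into (train_model, train_test) lists.

    `rows` excludes the CSV header. The 0.2 fraction and the seed are
    illustrative assumptions, not values taken from the notebook.
    """
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows in the same id,keyword,location,text,target layout as train.csv
rows = [f"{i},,,tweet {i},{i % 2}" for i in range(10)]
train_model, train_test = split_rows(rows)
print(len(train_model), len(train_test))
```

Seeding the shuffle keeps the pseudo-test set identical across runs, so model comparisons are made on the same held-out tweets.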

## Non-Transformer Models

Two types of models are created and compared:

1. Logistic Regression - This serves as the baseline
2. Single-hidden-layer neural network with 100 nodes in the hidden layer

## HuggingFace Models

The *TBD* Hugging Face transformer model was used to provide both uninformed and informed assistance through augmenting the data used to train the non-transformer-based models.

## Encodings

Two types of encodings are used to vectorize the inputs:

1. One-hot encoding
2. Twitter GloVe embedding: https://nlp.stanford.edu/data/glove.twitter.27B.zip
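
One possible reading of the one-hot encoding is sketched below: each vocabulary word gets one position, and a tweet is encoded as a 0/1 vector marking the words it contains. Whether the notebook uses this bag-of-words variant is an assumption, and `build_vocab` and `encode` are hypothetical helper names.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:  # words not seen when building the vocabulary are skipped
            vec[vocab[tok]] = 1
    return vec


texts = ["forest fire near la ronge", "london is cool"]
vocab = build_vocab(texts)
print(encode("fire near london", vocab))
```

With a realistic vocabulary these vectors become long and sparse, which is one reason to compare them against dense embeddings such as GloVe.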


# Preprocessing

## Manual inspection of train.csv

The issues observed in the data are listed below. They are numbered in the order in which they were processed: spillover lines were fixed first, then URLs, and so on. This order matters because removing punctuation too early would make it impossible to identify user names or hashtags in a tweet and would invalidate URLs.

### 1. Spillover lines

The first issue with this data is that, while most of the samples are on their own line, some spill over onto several lines. Here are a few examples:

>`61,ablaze,,"on the outside you're ablaze and alive`  
>`but you're dead inside",0`  
>`74,ablaze,India,"Man wife get six years jail for setting ablaze niece`  
>`http://t.co/eV1ahOUCZA",1`  
>`86,ablaze,Inang Pamantasan,"Progressive greetings!`  
>  
>`In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0`  
>`117,accident,,"mom: 'we didn't get home as fast as we wished'`  
>`me: 'why is that?'`  
>`mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0`

The custom function `fix_spillover_lines` was written to fix these lines. Its code is available in the projtools module.
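
The actual implementation lives in projtools and is not shown here. As a rough sketch of one way such a fix could work, the hypothetical function `merge_spillover_lines` below keeps joining physical lines with a space until the double quotes in the record are balanced. This assumes that spillover only happens inside a quoted text field.

```python
def merge_spillover_lines(lines):
    """Join physical lines until the double quotes in each record are balanced."""
    merged = []
    pending = None  # parts of a record whose quoted field is still open
    for raw in lines:
        line = raw.rstrip("\r\n")
        if pending is None:
            if line.count('"') % 2 == 0:
                merged.append(line)      # complete record on a single line
            else:
                pending = [line]         # a quoted field was opened but not closed
        else:
            pending.append(line)
            if sum(part.count('"') for part in pending) % 2 == 0:
                merged.append(" ".join(pending))
                pending = None
    if pending is not None:
        merged.append(" ".join(pending))  # unterminated record: keep what we have
    return merged


raw = [
    'id,keyword,location,text,target\n',
    '61,ablaze,,"on the outside you\'re ablaze and alive\n',
    'but you\'re dead inside",0\n',
]
for line in merge_spillover_lines(raw):
    print(line)
```

Counting quotes also works when a field contains escaped quotes (`""`), because each escaped quote adds two characters and leaves the parity unchanged.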

### 2. URLs

Some tweets contain one or more URLs. I assume that the content of a URL does not carry any useful information. However, the number of URLs occurring in a tweet may be a useful feature, so the URLs are counted before they are removed. About 90% of the URLs in the training data are of the form `http://t.co/<10-character alphanumeric hash>`, for example `http://t.co/9FxPiXQuJt`. In about 10% of cases, these URLs start with `https://` instead.

#### 2.1 Count URLs in each tweet

The custom function `make_url_counts` is used to create a `url_count` feature/column.  This is called before removing the URLs as described in the next section.
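
A minimal sketch of what such a counting step might look like, assuming the input is a list of already-merged CSV lines with the header first. The name `append_url_counts` and the regular expression are illustrative assumptions, not the projtools code.

```python
import re

# t.co short links with either scheme, followed by a 10-character hash
URL_PATTERN = re.compile(r"https?://t\.co/[A-Za-z0-9]{10}")


def append_url_counts(lines):
    """Append a url_count column to the header and to every data line."""
    if not lines:
        return []
    header, *rows = lines
    counted = [header + ",url_count"]
    for row in rows:
        counted.append(f"{row},{len(URL_PATTERN.findall(row))}")
    return counted


sample = [
    'id,keyword,location,text,target',
    '74,ablaze,India,"Man wife get six years jail for setting ablaze niece http://t.co/eV1ahOUCZA",1',
    '57,ablaze,Paranaque City,Ablaze for you Lord :D,0',
]
for line in append_url_counts(sample):
    print(line)
```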

#### 2.2 Remove URLs

The `replace_urls` function replaces each URL by the string "web link".

### 3. Punctuation

#### 3.1 Process Twitter-specific characters

Because the `@` and `#` characters have special meaning in tweets, they need to be processed before other punctuation is removed. A `@<username>` in a tweet is a reference to the user named `username`. A `#<hashname>` specifies a hashtag, which refers to all tweets that use the `hashname` hashtag. In processing these characters, `@<username>` is converted to `at username` and `#<hashname>` is converted to `hash tag hashname`.
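
The conversion can be sketched with two regular-expression substitutions. The function name `spell_out_twitter_specials` is hypothetical, and the replacement words (`at`, `hash tag`) are chosen to match the output printed by the notebook's `replace_twitter_specials` cell; how projtools implements it is not shown.

```python
import re


def spell_out_twitter_specials(text):
    """Rewrite @username as 'at username' and #tag as 'hash tag tag'."""
    text = re.sub(r"@(\w+)", r"at \1", text)        # user mentions
    text = re.sub(r"#(\w+)", r"hash tag \1", text)  # hashtags
    return text


print(spell_out_twitter_specials("@bbcmtd Wholesale Markets ablaze"))
print(spell_out_twitter_specials("We always try to bring the heavy. #metal #RT"))
```

Requiring at least one word character after the symbol leaves a stray `@` or `#` untouched, so it is handled by the later punctuation step.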

#### 3.2 Remove remaining punctuation

The custom function `replace_with_space` is run to remove any remaining punctuation.

#### 3.3 Remove digits

The function `replace_with_space` is run to remove any remaining digits.
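
The signature of `replace_with_space` is not shown in this notebook. The sketch below is one way to get the same effect on a tweet's text with `str.translate`; the helper names, the use of `string.punctuation` and `string.digits`, and the final whitespace collapse are all assumptions.

```python
import string


def replace_chars_with_space(text, chars):
    """Replace every character of `chars` that occurs in `text` with a space."""
    table = str.maketrans({c: " " for c in chars})
    return text.translate(table)


def strip_punctuation_and_digits(text):
    """Apply the punctuation pass, then the digit pass, then tidy whitespace."""
    text = replace_chars_with_space(text, string.punctuation)
    text = replace_chars_with_space(text, string.digits)
    return " ".join(text.split())  # collapse runs of spaces left behind


print(strip_punctuation_and_digits("I'll have more details at 5 at Your4State."))
```

This should be applied to the text field only: running it over a whole CSV line would also replace the commas that separate the columns.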



In [1]:
import numpy as np
import string as st
import matplotlib as mp
import matplotlib.pyplot as plt

# To avoid the "UnicodeDecodeError: 'charmap' codec can't decode byte ..." error,
# the encoding must be specified when reading this data, as described in this answer:
# https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character
# with open("./data/train.csv", encoding="utf8") as f:  # works, but errors='ignore' also drops undecodable characters
with open("./data/train.csv", encoding="utf8", errors='ignore') as f_train:
    content_train = f_train.readlines()

with open("./data/test.csv", encoding="utf8", errors='ignore') as f_test:
    content_test = f_test.readlines()

print(len(content_train), len(content_test))  # 8562, 3700  BEFORE applying any fixes

8562 3700


In [2]:
# print some examples of spillover lines
with open("./debug/train_debug_chunk.txt", encoding="utf8", errors='ignore') as f:
    content_train_debug = f.readlines()

for i in [0, 42, 43, 53, 54, 64, 65, 66]:
    print(content_train_debug[i].strip())

id,keyword,location,text,target
61,ablaze,,"on the outside you're ablaze and alive
but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece
http://t.co/eV1ahOUCZA",1
86,ablaze,Inang Pamantasan,"Progressive greetings!

In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0


In [3]:
import projtools as pt
# test the fix for the spillover lines on the training data
fixed_train_debug = pt.fix_spillover_lines(content_train_debug)

# for i, line in enumerate(fixed_list):
#     print(i, line)

# check that good lines are still good and spillover lines (*) are fixed
#id = header 32  53 *61 *74 *86 *117 119 120
for j in [0, 22, 36, 42, 52, 62, 81, 83, 84]:
    print(fixed_train_debug[j])  # spillover lines are now consolidated to a single line

id,keyword,location,text,target
32,,,London is cool ;),0
53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE http://t.co/qqsmshaJ3N,0
61,ablaze,,"on the outside you're ablaze and alive but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece http://t.co/eV1ahOUCZA",1
86,ablaze,Inang Pamantasan,"Progressive greetings!  In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0
117,accident,,"mom: 'we didn't get home as fast as we wished' me: 'why is that?' mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0
119,accident,,Can wait to see how pissed Donnie is when I tell him I was in ANOTHER accident??,0
120,accident,"Arlington, TX",#TruckCrash Overturns On #FortWorth Interstate http://t.co/Rs22LJ4qFp Click here if you've been in a crash&gt;http://t.co/Ld0unIYw4k,1


In [4]:
# fix the spillover lines in the train and test data, then write out fixed data
fixed_train = pt.fix_spillover_lines(content_train)
with open(file='./data/train_clean_v01.csv', mode='w', encoding="utf8", errors='ignore') as f_train_out:
    for line in fixed_train:
        f_train_out.write(line)
        f_train_out.write('\n')

fixed_test = pt.fix_spillover_lines(content_test)
with open(file='./data/test_clean_v01.csv', mode='w', encoding="utf8", errors='ignore') as f_test_out:
    for line in fixed_test:
        f_test_out.write(line)
        f_test_out.write('\n')
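
One way to sanity-check the fixed lines before moving on is to parse them with the standard `csv` module and confirm that every record has the expected five fields. This check is not part of the original notebook, and `count_fields` is a hypothetical helper.

```python
import csv


def count_fields(lines):
    """Parse already-merged lines as CSV and return the field count per record."""
    return [len(record) for record in csv.reader(lines)]


sample = [
    'id,keyword,location,text,target',
    '53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE http://t.co/qqsmshaJ3N,0',
    '61,ablaze,,"on the outside you\'re ablaze and alive but you\'re dead inside",0',
]
print(count_fields(sample))
```

Any record whose count differs from the header's count points to a line that still needs attention.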

In [18]:
# count urls and add url_count column
import re

train_url_counts = pt.make_url_counts(fixed_train)
# test: print the first 100 lines along with the t.co URLs found in each
for i, train_line in enumerate(train_url_counts[:100]):
    urls = re.findall(r"http://t\.co/[a-zA-Z0-9]{10}", train_line)
    urls.extend(re.findall(r"https://t\.co/[a-zA-Z0-9]{10}", train_line))
    print(i, "|", train_line)
    print(urls)

0 | id,keyword,location,text,target,url_count
[]
1 | 1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,0
[]
2 | 4,,,Forest fire near La Ronge Sask. Canada,1,0
[]
3 | 5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,0
[]
4 | 6,,,"13,000 people receive #wildfires evacuation orders in California ",1,0
[]
5 | 7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ,1,0
[]
6 | 8,,,#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires,1,0
[]
7 | 10,,,"#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas",1,0
[]
8 | 13,,,I'm on top of the hill and I can see a fire in the woods...,1,0
[]
9 | 14,,,There's an emergency evacuation happening now in the building across the street,1,0
[]
10 | 15,,,I'm afraid that the tornado is coming to our area...,1

In [27]:
import re

# replace every t.co URL (http or https) with the string "web link";
# a single pattern also covers lines that mix both schemes
url_pattern = re.compile(r"https?://t\.co/[a-zA-Z0-9]{10}")
revised_lines = [url_pattern.sub("web link", train_line)
                 for train_line in train_url_counts[32:100]]

for revised_line in revised_lines:
    print(revised_line)

48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze web link,1,1
49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT web link,0,1
50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set ablaze in Aba. web link,1,1
52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0,0
53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE web link,0,1
54,ablaze,Pretoria,@PhDSquares #mufc they've built so much hype around new acquisitions but I doubt they will set the EPL ablaze this season.,0,0
55,ablaze,World Wide!!,INEC Office in Abia Set Ablaze - web link,1,1
56,ablaze,,Barbados #Bridgetown JAMAICA ÛÒ Two cars set ablaze: SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintende...  web link,1,1
57,ablaze,Paranaque City,Ablaze for you Lord :D,0,0
59,ablaze,Live On Webcam,Check these out: web link web link web link web link #nsfw,0,4
61,ablaze,,"on the outside you're ablaze and alive but you're dead inside",0,0
62,ablaze,m

In [7]:
# fix special twitter characters @ and #
fix_train_special_chars = pt.replace_twitter_specials(fixed_train)

# check that lines with @ and # are fixed
#id = header 48   49 54  91  139
for j in [0, 32, 33, 37, 64, 98]:
    print(fix_train_special_chars[j])

id,keyword,location,text,target
48,ablaze,Birmingham,at bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C,1
49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. hash tag metal hash tag RT http://t.co/YAo1e0xngw,0
54,ablaze,Pretoria,at PhDSquares hash tag mufc they've built so much hype around new acquisitions but I doubt they will set the EPL ablaze this season.,0
91,ablaze,"Concord, CA",at Navista7 Steve these fires out here are something else! California is a tinderbox - and this clown was setting my 'hood ablaze at News24680,1
139,accident,"Hagerstown, MD",hash tag BREAKING: there was a deadly motorcycle car accident that happened to hash tag Hagerstown today. I'll have more details at 5 at Your4State. hash tag WHAG,1


In [8]:
# read in training data as dataframe
# import pandas as pd

# df_train = pd.read_csv('./data/train_clean_v01.csv', encoding="utf8")
# print(df_train.shape)
# df_train.head()