# Classification of Kaggle Disaster Data

The goal of this notebook is to explore whether a HuggingFace model (HFM) can enhance the performance of non-transformer-based text classification models by augmenting the training data.

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition:

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files: *train.csv* (x labeled tweets) and *test.csv* (y unlabeled tweets).

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, x labeled tweets
+ train_test.csv - not used to train model, used as *pseudo-test* data, y labeled tweets 
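
The split itself is not shown in this notebook. A minimal sketch of one way it could be done follows, assuming a shuffled split with a fixed seed. The helper name `split_rows`, the 20% pseudo-test fraction, and the seed are illustrative assumptions, not the values used in this project.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, pseudo-test)."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed -> repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the labeled tweets: (id, text, target)
rows = [(i, f"tweet {i}", i % 2) for i in range(10)]
train_model_rows, train_test_rows = split_rows(rows)
print(len(train_model_rows), len(train_test_rows))
```

Every labeled row lands in exactly one of the two parts, so the pseudo-test rows are never seen during training.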

## Non-Transformer Models

Two types of models are created and compared:

1. Logistic Regression - This serves as the baseline
2. Single-hidden-layer neural network with 100 nodes in the hidden layer
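
To make the structural difference between the two models concrete, here is a rough sketch of their forward passes in plain Python. Training is omitted and the weights are random placeholders. The ReLU hidden activation, the sigmoid output, and all function names are assumptions made for illustration; the notebook does not state which activations are used.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(x, w, b):
    """P(disaster | x) for a linear model: sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def one_hidden_layer(x, W1, b1, w2, b2):
    """Same output unit, but fed by a hidden layer (ReLU is an assumption)."""
    hidden = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
              for row, bi in zip(W1, b1)]
    return logistic_regression(hidden, w2, b2)

rng = random.Random(0)
n_features, n_hidden = 5, 100        # 100 hidden nodes, as in the list above
x = [rng.random() for _ in range(n_features)]
W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]

p_lr = logistic_regression(x, [0.0] * n_features, 0.0)  # untrained: all-zero weights
p_nn = one_hidden_layer(x, W1, b1, w2, 0.0)
print(p_lr, p_nn)
```

The neural network reuses the logistic output unit and differs only in the learned hidden representation it is fed, which is why logistic regression is a natural baseline for it.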

## HuggingFace Models

The *TBD* Hugging Face transformer model was used to provide both uninformed and informed assistance through augmenting the data used to train the non-transformer-based models.

## Encodings

Two types of encodings are used to vectorize the inputs:

1. One-hot encoding
2. Twitter GloVe embedding: https://nlp.stanford.edu/data/glove.twitter.27B.zip
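
As a small illustration of the first encoding, the sketch below builds a vocabulary and turns a tweet into a fixed-length 0/1 vector. It assumes whitespace tokenization, lowercasing, and a tweet-level vector that marks which vocabulary tokens are present; these choices and the function names are assumptions, not necessarily what this project does.

```python
def build_vocab(tweets):
    """Map each distinct lowercase token to a column index, in first-seen order."""
    vocab = {}
    for tweet in tweets:
        for token in tweet.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(token, vocab):
    """One-hot vector for a single token; unknown tokens give all zeros."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab[token]] = 1
    return vec

def encode_tweet(tweet, vocab):
    """Element-wise maximum of the one-hot vectors of the tweet's tokens."""
    vec = [0] * len(vocab)
    for token in tweet.lower().split():
        if token in vocab:
            vec[vocab[token]] = 1
    return vec

tweets = ["forest fire near town", "fire drill at school"]
vocab = build_vocab(tweets)
encoded = encode_tweet("fire near school", vocab)
print(one_hot("fire", vocab))
print(encoded)
```

The vector length equals the vocabulary size, so this encoding grows with the corpus, whereas a pretrained embedding such as GloVe has a fixed dimension.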


# Preprocessing

## Manual inspection of train.csv

The issues observed in the data are listed below. They are numbered in the order in which they were fixed: spillover lines were fixed first, then lines that start with ??, and so on.

### 1. Spillover lines

The first issue with this data is that, while most of the samples are on their own line, some spill over onto two or more lines. Here are a few examples:

>`61,ablaze,,"on the outside you're ablaze and alive`  
>`but you're dead inside",0`  
>`74,ablaze,India,"Man wife get six years jail for setting ablaze niece`  
>`http://t.co/eV1ahOUCZA",1`  
>`86,ablaze,Inang Pamantasan,"Progressive greetings!`  
>  
>`In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0`  
>`117,accident,,"mom: 'we didn't get home as fast as we wished'`  
>`me: 'why is that?'`  
>`mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0`

The custom function `fix_spillover_lines` was written to fix these lines. Its code is available in the projtools module.
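
The code of `fix_spillover_lines` is not reproduced here. As a rough idea of how such a fix can work, the sketch below merges physical lines until the accumulated record contains an even number of double quotes, meaning that no quoted field is still open. The function `merge_spillover_lines` is a hypothetical stand-in written for illustration, not the projtools implementation, and joining the pieces with a single space is an assumption based on the consolidated output shown later in this notebook.

```python
def merge_spillover_lines(lines):
    """Merge physical lines until each record has balanced double quotes."""
    records, pending = [], []
    for line in lines:
        pending.append(line.rstrip("\r\n"))
        # An odd number of double quotes means a quoted field is still open.
        if "".join(pending).count('"') % 2 == 0:
            records.append(" ".join(pending))
            pending = []
    if pending:  # unterminated quote at end of input: keep what was collected
        records.append(" ".join(pending))
    return records

raw = [
    'id,keyword,location,text,target\n',
    '61,ablaze,,"on the outside you\'re ablaze and alive\n',
    'but you\'re dead inside",0\n',
    '86,ablaze,Inang Pamantasan,"Progressive greetings!\n',
    '\n',
    'In about a month students would have set their pens ablaze in The Torch Publications\'... http://t.co/9FxPiXQuJt",0\n',
    '32,,,London is cool ;),0\n',
]
records = merge_spillover_lines(raw)
for record in records:
    print(record)
```

Counting quotes also copes with CSV-escaped quotes (`""`), since each escape adds two characters and leaves the parity unchanged. Apostrophes are ignored because only double quotes delimit fields in this file.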

### 2. TBD

In [1]:
import numpy as np
import string as st
import matplotlib as mp
import matplotlib.pyplot as plt

# To get around the decoding error discussed at the link below,
# the encoding must be specified when reading this data, as described in this answer:
# https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character
# with open("./data/train.csv", encoding="utf8") as f:  # works, but errors='ignore' also drops undecodable bytes
with open("./data/train.csv", encoding="utf8", errors='ignore') as f_train:
    content_train = f_train.readlines()

with open("./data/test.csv", encoding="utf8", errors='ignore') as f_test:
    content_test = f_test.readlines()

print(len(content_train), len(content_test))  # 8562, 3700  BEFORE applying any fixes

8562 3700


In [2]:
# print some examples of spillover lines
with open("./debug/train_debug_chunk.txt", encoding="utf8", errors='ignore') as f:
    content_train_debug = f.readlines()

for i in [0, 42, 43, 53, 54, 64, 65, 66]:
    print(content_train_debug[i].strip())

id,keyword,location,text,target
61,ablaze,,"on the outside you're ablaze and alive
but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece
http://t.co/eV1ahOUCZA",1
86,ablaze,Inang Pamantasan,"Progressive greetings!

In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0


In [3]:
import projtools as pt
# test the fix for the spillover lines on the training data
fixed_train_debug = pt.fix_spillover_lines(content_train_debug)

# for i, line in enumerate(fixed_train_debug):
#     print(i, line)

# check that good lines are still good and spillover lines (*) are fixed
# ids printed: header, 32, 53, *61, *74, *86, *117, 119, 120
for j in [0, 22, 36, 42, 52, 62, 81, 83, 84]:
    print(fixed_train_debug[j])  # spillover lines are now consolidated to a single line

id,keyword,location,text,target
32,,,London is cool ;),0
53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE http://t.co/qqsmshaJ3N,0
61,ablaze,,"on the outside you're ablaze and alive but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece http://t.co/eV1ahOUCZA",1
86,ablaze,Inang Pamantasan,"Progressive greetings!  In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0
117,accident,,"mom: 'we didn't get home as fast as we wished' me: 'why is that?' mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0
119,accident,,Can wait to see how pissed Donnie is when I tell him I was in ANOTHER accident??,0
120,accident,"Arlington, TX",#TruckCrash Overturns On #FortWorth Interstate http://t.co/Rs22LJ4qFp Click here if you've been in a crash&gt;http://t.co/Ld0unIYw4k,1


In [4]:
# fix the spillover lines in the train and test data, then write out fixed data
fixed_train = pt.fix_spillover_lines(content_train)
with open(file='./data/train_clean_v01.csv', mode='w', encoding="utf8", errors='ignore') as f_train_out:
    for line in fixed_train:
        f_train_out.write(line)
        f_train_out.write('\n')

fixed_test = pt.fix_spillover_lines(content_test)
with open(file='./data/test_clean_v01.csv', mode='w', encoding="utf8", errors='ignore') as f_test_out:
    for line in fixed_test:
        f_test_out.write(line)
        f_test_out.write('\n')

In [5]:
# read in training data that has fixed spillover lines
import pandas as pd

df_train = pd.read_csv('./data/train_clean_v01.csv', encoding="utf8")
df_train.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |
