# Classification of Kaggle Disaster Data

The goal of this notebook is to explore whether a HuggingFace model (HFM) can enhance the performance of non-transformer-based text classification models by augmenting the training data.

## Data

The data used in this project comes from the Kaggle *Natural Language Processing with Disaster Tweets* competition at:

https://www.kaggle.com/competitions/nlp-getting-started/data

This data consists of two files: *train.csv* (x labeled tweets) and *test.csv* (y unlabeled tweets).

Because the *test.csv* labels are not available, the *train.csv* file was split into the following two files:

+ train_model.csv - data used to train model, x labeled tweets
+ train_test.csv - not used to train model, used as *pseudo-test* data, y labeled tweets 
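
A split like this could be produced along the following lines. This is a minimal sketch only: the notebook does not state how the split was made, so the 80/20 ratio, the fixed seed, and the helper names `split_rows` and `split_csv` are assumptions for illustration.

```python
import csv
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, pseudo_test) lists.

    test_fraction and seed are illustrative assumptions, not values
    taken from this project.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_csv(src, train_dst, test_dst, test_fraction=0.2, seed=0):
    """Split a labeled CSV file into two CSV files that share its header."""
    # newline="" lets the csv module handle quoted fields containing line breaks
    with open(src, newline="", encoding="utf8", errors="ignore") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train_rows, test_rows = split_rows(rows, test_fraction, seed)
    for path, subset in ((train_dst, train_rows), (test_dst, test_rows)):
        with open(path, "w", newline="", encoding="utf8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
```

With the files laid out as in this notebook, the call might look like `split_csv("./data/train.csv", "./data/train_model.csv", "./data/train_test.csv")`.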

## Non-Transformer Models

Two types of models are created and compared:

1. Logistic Regression - This serves as the baseline
2. Single-hidden-layer neural network with 100 nodes in the hidden layer
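
To make the two architectures concrete, here is a rough NumPy sketch of their forward passes only. Training is omitted, and the ReLU hidden activation, the weight initialization, and all function names are assumptions; the notebook specifies only the model types and the 100 hidden nodes.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logistic_regression_predict(X, w, b):
    """Baseline model: P(target = 1 | x) = sigmoid(x . w + b)."""
    return sigmoid(X @ w + b)


def init_mlp(n_features, n_hidden=100, seed=0):
    """Random parameters for a network with one hidden layer of n_hidden nodes."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=n_hidden),
        "b2": 0.0,
    }


def mlp_predict(X, params):
    """Hidden layer (ReLU is an assumed activation) followed by a sigmoid output."""
    hidden = np.maximum(0.0, X @ params["W1"] + params["b1"])
    return sigmoid(hidden @ params["W2"] + params["b2"])


X = np.random.default_rng(1).random((4, 20))  # 4 toy samples with 20 features each
print(logistic_regression_predict(X, np.zeros(20), 0.0))  # zero weights give 0.5 everywhere
print(mlp_predict(X, init_mlp(20)).shape)  # one probability per sample
```

The only structural difference between the two is the extra hidden layer, which is what makes the logistic regression a natural baseline.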

## HuggingFace Models

The *TBD* Hugging Face transformer model was used to provide both uninformed and informed assistance by augmenting the data used to train the non-transformer-based models.

## Encodings

Two types of encodings are used to vectorize the inputs:

1. One-hot encoding
2. Twitter GloVe embedding: https://nlp.stanford.edu/data/glove.twitter.27B.zip
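
The two encodings can be sketched roughly as follows. Several details here are assumptions, since the notebook does not spell them out: whitespace tokenization, one 0/1 vector per tweet for the one-hot case, and averaging word vectors for the GloVe case. The parser relies only on the GloVe text format, where each line is a token followed by space-separated floats; all function names are hypothetical.

```python
def build_vocab(texts):
    """Map each distinct lowercase whitespace token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot_encode(text, vocab):
    """Return a 0/1 vector with a 1 for every vocabulary token present in text."""
    vector = [0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


def parse_glove_lines(lines):
    """Parse GloVe text format: a token followed by space-separated floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def mean_embedding(text, embeddings, dim):
    """Average the vectors of known tokens; all zeros if none are known."""
    vectors = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vocab = build_vocab(["forest fire near town", "fire in the woods"])
print(one_hot_encode("fire near the woods", vocab))  # [0, 1, 1, 0, 0, 1, 1]

# Toy two-dimensional vectors standing in for the real GloVe file
toy_glove = parse_glove_lines(["fire 1.0 0.0", "woods 0.0 1.0"])
print(mean_embedding("fire woods", toy_glove, dim=2))  # [0.5, 0.5]
```

For the real embeddings, `parse_glove_lines` could be fed an open file handle for one of the text files inside the downloaded zip archive.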


## Preprocessing

### Manual inspection of train.csv

The first issue we see with this data is that while most of the samples are on their own line, some spill over onto multiple lines. Here are a few examples:

>`61,ablaze,,"on the outside you're ablaze and alive`  
>`but you're dead inside",0`  
>`74,ablaze,India,"Man wife get six years jail for setting ablaze niece`  
>`http://t.co/eV1ahOUCZA",1`  
>`86,ablaze,Inang Pamantasan,"Progressive greetings!`  
>  
>`In about a month students would have set their pens ablaze in The Torch Publications'... http://t.co/9FxPiXQuJt",0`  
>`117,accident,,"mom: 'we didn't get home as fast as we wished'`  
>`me: 'why is that?'`  
>`mom: 'there was an accident and some truck spilt mayonnaise all over ??????",0`
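
In these examples the line breaks sit inside double-quoted fields, which is valid CSV. As an alternative to repairing lines by hand, Python's standard `csv` module can read such records directly. The sketch below parses two of the examples above from an in-memory string; the helper name `read_records` is hypothetical.

```python
import csv
import io

# Two records copied from the examples above; each spans two physical lines.
sample = """id,keyword,location,text,target
61,ablaze,,"on the outside you're ablaze and alive
but you're dead inside",0
74,ablaze,India,"Man wife get six years jail for setting ablaze niece
http://t.co/eV1ahOUCZA",1
"""


def read_records(stream):
    """Parse CSV text into (header, rows), collapsing whitespace runs
    (including line breaks inside quoted fields) to single spaces."""
    reader = csv.reader(stream)
    header = next(reader)
    rows = [[" ".join(field.split()) for field in row] for row in reader]
    return header, rows


# newline="" leaves line endings untouched so the csv module can interpret them
header, rows = read_records(io.StringIO(sample, newline=""))
print(len(rows))   # 2
print(rows[0][3])  # on the outside you're ablaze and alive but you're dead inside
```

For the real file, the same function could be given a handle opened with `open("./data/train.csv", newline="", encoding="utf8", errors="ignore")`.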

In [2]:
import numpy as np
import string as st
import matplotlib as mp
import matplotlib.pyplot as plt

# To get around the UnicodeDecodeError,
# the encoding (or an error handler) needs to be specified when reading this data,
# as described in the solution I upvoted here:
# https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character
# with open("./data/train.csv", encoding="utf8") as f:  # works, but errors="ignore" also drops unneeded chars
with open("./data/train.csv", errors="ignore") as f:
    content_train = f.readlines()

print(len(content_train))  # 8562

8562


In [3]:
# test
with open("./data/train_debug_chunk.txt", errors="ignore") as f:
    content_lines = f.readlines()

In [8]:
# fix_spillover_lines
import re

# A new record starts with a numeric id followed by a comma, e.g. "61,ablaze,,".
# Limitation: a spilled-over fragment that itself starts with digits and a comma
# would be mistaken for a new record.
record_start = re.compile(r"^[0-9]+,")

fixed_content = [content_lines[0].strip()]  # first line holds the headers

for line in content_lines[1:]:
    line = line.strip()
    if record_start.search(line):
        # line begins a new record
        fixed_content.append(line)
    elif line and len(fixed_content) > 1:
        # line is a fragment that spilled over from the previous record,
        # so join it to that record (this also covers the final record)
        fixed_content[-1] = fixed_content[-1] + " " + line

for fixed_line in fixed_content:
    print(fixed_line)


id,keyword,location,text,target

1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1
6,,,"13,000 people receive #wildfires evacuation orders in California ",1
7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ,1
8,,,#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires,1
10,,,"#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas",1
13,,,I'm on top of the hill and I can see a fire in the woods...,1
14,,,There's an emergency evacuation happening now in the building across the street,1
15,,,I'm afraid that the tornado is coming to our area...,1
16,,,Three people died from the heat wave so far,1
17,,,Haha South Tampa is getting flooded hah- WAIT

In [None]:
# Check the record-start pattern: the first three lines begin a record and match,
# while the last line is a spilled-over fragment and prints None.
test_lines = ['1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1',
              '4,,,Forest fire near La Ronge Sask. Canada,1',
              '74,ablaze,India,"Man wife get six years jail for setting ablaze niece',
              'http://t.co/eV1ahOUCZA",1']

for line in test_lines:
    str_test = re.search("^[0-9]+[,]", line)
    print(str_test)

