# Week 4 Mini Project: RNN Disaster Tweets Classification

By: Jaeyoung Oh

Repo: https://github.com/BlueJayVRStudio/CSCA5642_Week4

## Problem Statement

The objective of this week's mini project is to classify Tweets by their content. It is a binary classification task: each Tweet either is about a real disaster or is not. The data consists of a training set and a test set. The test set is reserved for submission only and is not used for validation. The training set consists of 7613 hand-classified data points, each composed of an ID, a keyword, a location, the text body and the target label. We will first explore the keyword, location and text body to identify potential input columns, and then perform the necessary preprocessing steps on the selected input.

## EDA

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [6]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


In [9]:
sample_df = train_df.sample(10)

In [15]:
sample_df

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 4289 | 6093 | hellfire | United Hoods of the Globe | HELLFIRE EP - SILENTMIND &amp; @_bookofdaniel ... | 0 |
| 5874 | 8391 | ruin | MÌ©rida, YucatÌÁn | babe I'm gonna ruin you if you let me stay | 0 |
| 375 | 538 | army | Burbank,CA | @AP what a violent country get the army involv... | 1 |
| 4343 | 6167 | hijack | Near Richmond, VA | Another Mac vuln!\n\nhttps://t.co/OxXRnaB8Un | 0 |
| 6807 | 9753 | tragedy | NaN | Rly tragedy in MP: Some live to recount horror... | 1 |
| 245 | 348 | annihilation | NaN | Evildead - Annihilation of Civilization http:/... | 0 |
| 2005 | 2881 | damage | ??? ?? ??????? | If Trillion crosses the line a 3rd time he doe... | 1 |
| 6885 | 9870 | traumatised | ELVY | Think I'm traumatised for life | 0 |
| 6157 | 8783 | siren | Honolulu,Hawaii | Serephina the Siren &lt;3 http://t.co/k6UEtsnLHT | 0 |
| 4725 | 6721 | lava | NaN | I lava you ?????? http://t.co/aeZ3aK1lRN | 0 |


The ID carries no predictive meaning, so we can drop it from the dataset. Keyword is missing in only 61 of the 7613 training rows, but it is a word selected from the text, so it is largely redundant information. Location might provide some extra context, but it is missing in 2533 training rows (about a third), and the values that are present are free-form and often uninformative (for example `United Hoods of the Globe` or `ELVY` in the sample above). Therefore, we only need to consider the text body as our input column.
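
As a quick check on the missing-value counts, the fractions can be worked out directly from the `train_df.info()` output above. This is a minimal sketch: the constants are copied from that output, and `missing_fraction` is a hypothetical helper name rather than part of the notebook.

```python
# Non-null counts copied from the train_df.info() output above.
TOTAL_ROWS = 7613
NON_NULL = {"keyword": 7552, "location": 5080, "text": 7613}


def missing_fraction(non_null: int, total: int) -> float:
    """Return the fraction of rows in which a column is null."""
    return (total - non_null) / total


missing = {col: missing_fraction(n, TOTAL_ROWS) for col, n in NON_NULL.items()}

for col, frac in missing.items():
    # Prints one line per column, formatted as a percentage.
    print(f"{col}: {frac:.1%} missing")
```

With the DataFrame loaded, `train_df.isna().mean()` should give the same fractions without hard-coding the counts.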

Now we will do some basic NLP preprocessing. *Here are some considerations*:
1. Many NLP pipelines remove stop words and apply Porter stemming, but for a context-dependent task like identifying a real disaster in a tweet, preserving stop words and suffixes may be essential, especially since RNNs can pick up on sequential patterns in the word order.
2. Because an RNN learns from sequential information, we feed it each tweet as an ordered sequence of word tokens rather than collapsing the tweet into a single order-free vector (such as bag-of-words counts).
3. URLs provide few contextual clues and unnecessarily increase vocabulary complexity, so we remove them.
4. For a similar reason, we convert all text to lowercase.
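
The considerations above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's actual preprocessing: the helper names (`clean_tweet`, `build_vocab`, `encode`), the whitespace tokenization, the URL pattern and the `<pad>`/`<unk>` ids are all hypothetical choices.

```python
import re
from collections import Counter

# Assumed URL pattern: anything starting with http:// or https://, or with www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

PAD_ID = 0  # assumed id used to pad short sequences
UNK_ID = 1  # assumed id for words not seen when building the vocabulary


def clean_tweet(text: str) -> str:
    """Remove URLs, lowercase, and collapse whitespace.

    Stop words and suffixes are deliberately kept (no stemming).
    """
    text = URL_PATTERN.sub(" ", text)
    text = text.lower()
    return " ".join(text.split())


def build_vocab(cleaned_texts, min_count: int = 1) -> dict:
    """Map each whitespace-separated token to an integer id by frequency."""
    counts = Counter(tok for text in cleaned_texts for tok in text.split())
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(cleaned_text: str, vocab: dict, max_len: int) -> list:
    """Turn a cleaned tweet into a fixed-length list of token ids.

    Word order is preserved; sequences are truncated or right-padded.
    """
    ids = [vocab.get(tok, UNK_ID) for tok in cleaned_text.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Two tweets from the sample above, cleaned and encoded.
raw = ["I lava you ?????? http://t.co/aeZ3aK1lRN",
       "Another Mac vuln!\n\nhttps://t.co/OxXRnaB8Un"]
cleaned = [clean_tweet(t) for t in raw]
vocab = build_vocab(cleaned)
encoded = [encode(t, vocab, max_len=6) for t in cleaned]
```

Keeping the tokens in their original order is what lets a recurrent layer use the sequence; in practice, a library tokenizer could replace the whitespace split.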

*Next we look at class distribution to make sure there isn't too much class imbalance*:
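
One way to run this check is to count the labels and compare their shares. The sketch below uses a hypothetical helper, `class_distribution`, and toy labels for illustration only; the toy labels are not the dataset's actual distribution.

```python
from collections import Counter


def class_distribution(labels) -> dict:
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Toy labels for illustration only (not taken from the dataset).
toy_distribution = class_distribution([0, 1, 0, 0, 1])
```

On the real data this would be called as `class_distribution(train_df['target'])`; `train_df['target'].value_counts(normalize=True)` is the pandas equivalent for the shares.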

*Finally, we will do an 80-20 train-validation split on the training dataset*:

In [30]:
from sklearn.model_selection import train_test_split

# 80-20 train-validation split (X_test and y_test hold the validation portion)
X_train, X_test, y_train, y_test = train_test_split(
    train_df['text'], train_df['target'], test_size=0.2, random_state=42)