# Analysis of "Natural Language Processing with Disaster Tweets" Competition

Submission by: Atul Parida

## Overview
This notebook contains the code and analysis for the "Natural Language Processing with Disaster Tweets" competition on Kaggle. In this competition, the goal is to build a machine learning model that predicts whether tweets are about real disasters or not. We will explore the dataset, preprocess the text data, build and evaluate NLP models, and make predictions.

## Table of Contents
1. [Data Exploration](#data-exploration)
2. [Data Preprocessing](#data-preprocessing)
3. [Model Building](#model-building)
4. [Model Evaluation](#model-evaluation)
5. [Results and Conclusion](#results-and-conclusion)


### Data Exploration

In [67]:
### ESSENTIAL IMPORTS ###

import jax.numpy as np
import jax
from jax import jit, vmap, grad

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os, shutil
import nltk
import random
from bs4 import BeautifulSoup

from sklearn.model_selection import train_test_split
from collections import defaultdict



In [68]:
### DATA IMPORTS ###
cwd = os.getcwd()
test_data_path = os.path.join(cwd, "data", "test.csv")
train_data_path = os.path.join(cwd, "data", "train.csv")

test_op_path = os.path.join(cwd, "outputs", "test_output.csv")
train_op_path = os.path.join(cwd, "outputs", "train_output.csv")

train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

In [69]:
### REVIEWING FORMATS ###
train_data.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


As we can see, each row has an `id`, a `keyword`, a `location`, the tweet `text`, and a `target` value, where 1 marks a disaster-related tweet and 0 a non-disaster-related tweet. The tweet text can contain additional elements such as emojis, hashtags, URLs, and HTML tags.

In [70]:
print("Length of training data: ", len(train_data))

train_data.isnull().sum() # Find the number of null values to determine the correct approach for data cleaning and imputation.

Length of training data:  7613


id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [71]:
train_data.target.value_counts() # Find the number of disaster and non-disaster tweets in the training dataset.

target
0    4342
1    3271
Name: count, dtype: int64

In [72]:
print("Length of testing data: ", len(test_data))

test_data.isnull().sum() # Find the number of null values to determine the correct approach for data cleaning and imputation.

Length of testing data:  3263


id             0
keyword       26
location    1105
text           0
dtype: int64

In [73]:
# Single quotes inside the f-strings keep them valid on Python versions before 3.12.
print(f"Unique locations vs total locations in training dataset: {train_data['location'].nunique()} unique in {len(train_data) - train_data.location.isnull().sum()} total")
print(f"Unique locations vs total locations in testing dataset: {test_data['location'].nunique()} unique in {len(test_data) - test_data.location.isnull().sum()} total")

Unique locations vs total locations in training dataset: 3341 unique in 5080 total
Unique locations vs total locations in testing dataset: 1602 unique in 2158 total


In [74]:
# Single quotes inside the f-strings keep them valid on Python versions before 3.12.
print(f"Unique keywords vs total keywords in training dataset: {train_data['keyword'].nunique()} unique in {len(train_data) - train_data.keyword.isnull().sum()} total")
print(f"Unique keywords vs total keywords in testing dataset: {test_data['keyword'].nunique()} unique in {len(test_data) - test_data.keyword.isnull().sum()} total")

Unique keywords vs total keywords in training dataset: 221 unique in 7552 total
Unique keywords vs total keywords in testing dataset: 221 unique in 3237 total


In [75]:
# Count keyword occurrences in each dataset, skipping missing values.
train_kw_dict = train_data["keyword"].dropna().value_counts().to_dict()
test_kw_dict = test_data["keyword"].dropna().value_counts().to_dict()

# Keywords that appear in both the training and testing datasets.
common_kw = [kw for kw in test_kw_dict if kw in train_kw_dict]

print(f"Number of common keywords in training and testing datasets: {len(common_kw)}")
print(f"Training keywords: {len(train_kw_dict)}")
print(f"Testing keywords: {len(test_kw_dict)}")


Number of common keywords in training and testing datasets: 221
Training keywords: 221
Testing keywords: 221


As observed, roughly 33% of tweets in both the training and testing datasets have no location, and roughly 0.8% have no keyword. The similar proportions suggest that both datasets were drawn from the same parent dataset of tweets.
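
The proportions quoted above can be checked with a few lines of arithmetic. The sketch below simply reuses the row and null counts printed earlier in this notebook; the dictionary layout and the helper name `missing_percentages` are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the notebook outputs above: total rows and null counts per column.
missing_counts = {
    "train": {"rows": 7613, "location": 2533, "keyword": 61},
    "test": {"rows": 3263, "location": 1105, "keyword": 26},
}


def missing_percentages(counts):
    """Return the percentage of missing values per column for one dataset."""
    rows = counts["rows"]
    return {
        column: round(100 * n_missing / rows, 2)
        for column, n_missing in counts.items()
        if column != "rows"
    }


missing_pct = {name: missing_percentages(c) for name, c in missing_counts.items()}

for name, pct in missing_pct.items():
    print(f"{name}: location {pct['location']}% missing, keyword {pct['keyword']}% missing")
```

On the loaded DataFrames themselves, `train_data.isnull().mean() * 100` should give the same percentages directly.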

Locations are often missing because they are entered by the user rather than automatically attached to the tweet. Additionally, user-provided locations may not correspond to real places, which produces a large number of unique values (3341 among 5080 non-null training entries) that may skew the model. For these reasons, `location` won't be considered as a feature.

Both datasets contain the same number of unique keywords (221), and all of them are common to both, which means that `keyword` can be used as a feature. This also suggests that the training and testing datasets were sampled from the same parent dataset.

From the features provided and their general characteristics, we can also define some metafeatures that may help increase our model's accuracy. These can include:
- **Word count:** number of total words in the text
- **Unique word count:** number of unique words in the text
- **Stop word count:** number of stop words in the text
- **Char count:** number of characters used in the text
- **Mean word length:** mean length of words used in the text
- **Slang word count:** number of slang terms used in the text

These general metafeatures are suggested because they apply to any tweet, irrespective of whether it includes URLs, emojis, or other elements, which makes the analysis more objective.
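
As a rough illustration of the metafeatures listed above, the following sketch computes them for a single tweet using whitespace tokenisation. The function name `text_metafeatures` and the tiny `STOP_WORDS` set are assumptions made for this example; a fuller stop-word list (for instance NLTK's English list) would normally be substituted. The slang word count is omitted because no slang lexicon has been chosen yet.

```python
# Tiny illustrative stop-word set (an assumption); swap in a fuller list in practice.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "to"}


def text_metafeatures(text):
    """Compute simple metafeatures for one tweet, splitting words on whitespace."""
    words = text.split()
    word_count = len(words)
    return {
        "word_count": word_count,
        "unique_word_count": len({w.lower() for w in words}),
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        "char_count": len(text),
        # Guard against empty strings to avoid dividing by zero.
        "mean_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
    }


# Sample text taken from the second row of train_data.head() above.
features = text_metafeatures("Forest fire near La Ronge Sask. Canada")
print(features)
```

To add these as columns, one option is `pd.DataFrame(train_data["text"].map(text_metafeatures).tolist(), index=train_data.index)`, joined back onto the original frame.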

Elements that can be cleaned from the tweets include the following:
- **Hashtags**
- **Mentions**
- **URLs**
- **Punctuation**
- **Stop words**
- **Special characters and emojis**
- **Slang words**

Additional cleaning steps could include lowercasing, duplicate-space removal, spelling correction, and lemmatization of the text.
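
A minimal sketch of some of these cleaning steps, using only the standard library, might look as follows. The function name `clean_tweet` and the regular expressions are assumptions chosen for illustration. The sketch covers lowercasing, URLs, mentions, ASCII punctuation (which also strips the `#` from hashtags while keeping the word), and duplicate spaces. Emojis, stop words, slang, spelling correction, and lemmatization are not handled here.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, punctuation, and extra spaces."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)      # drop URLs before punctuation is removed
    text = MENTION_PATTERN.sub(" ", text)  # drop @mentions
    text = text.translate(PUNCT_TABLE)     # drop ASCII punctuation, including '#'
    return " ".join(text.split())          # collapse duplicate whitespace


# Synthetic example: the mention and URL are made up for illustration.
cleaned = clean_tweet("13,000 people receive #wildfires evacuation @user http://t.co/abc  NOW!")
print(cleaned)
```

URLs are removed before punctuation on purpose: stripping punctuation first would break `http://...` into fragments that the URL pattern no longer matches.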

### Data Preprocessing

In [76]:
### CREATING PREPROCESSING METHODS ###
def include_word_count(input_df): # Addition of word_count metafeature
    pass

def include_unique_word_count(input_df): # Addition of unique_word_count metafeature
    pass

def include_stop_word_count(input_df): # Addition of stop_word_count metafeature
    pass

def include_char_count(input_df): # Addition of char_count metafeature
    pass

def include_mean_word_length(input_df): # Addition of mean_word_length metafeature
    pass

def keyword_imputation(input_df):
    pass
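
The `keyword_imputation` stub above is left unimplemented. One possible approach, sketched here as an assumption rather than the method this notebook settles on, is to treat a missing keyword as its own category by filling it with a sentinel string. The name `impute_keyword`, the `fill_value` default, and the toy DataFrame are all hypothetical.

```python
import pandas as pd


def impute_keyword(input_df, fill_value="no_keyword"):
    """Return a copy of input_df with missing keywords replaced by fill_value."""
    output_df = input_df.copy()
    output_df["keyword"] = output_df["keyword"].fillna(fill_value)
    return output_df


# Toy frame standing in for train_data; the values are illustrative only.
toy_df = pd.DataFrame({
    "keyword": [None, "wildfire", None],
    "text": ["Forest fire near La Ronge Sask. Canada", "example tweet", "another tweet"],
})
imputed_df = impute_keyword(toy_df)
print(imputed_df["keyword"].tolist())
```

Since only about 0.8% of keywords are missing, a sentinel category keeps those rows usable without guessing a keyword from the text. Returning a copy leaves the input DataFrame unchanged.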

In [None]:
### PREPROCESSING METHODS FOR REMOVAL ###
def remove_stop_words(input_df):
    pass

def remove_URLs(input_df):
    pass

def remove_punctuation(input_df):
    pass

def remove_mentions(input_df):
    pass



### Model Building

In [77]:
### CREATING ESSENTIAL CLASSES AND FUNCTIONS ###

### Model Evaluation

In [78]:
### ANALYZING HYPERPARAMETER VALUES AND BEHAVIOUR ###

### Results and Conclusion

In [79]:
### OPTIMAL MODEL ###

In [80]:
### OPTIMAL RESULTS ###

[Jump to top](#analysis-of-natural-language-processing-with-disaster-tweets-competition)