# Natural Language Processing with Disaster Tweets

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified.

## Evaluation
Submissions are evaluated using F1 between the predicted and expected answers.
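
Assuming this is the usual binary F1 with real-disaster tweets (target = 1) as the positive class, the score is the harmonic mean of precision and recall, where TP, FP and FN count true positives, false positives and false negatives:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

For example, with TP = 2, FP = 1 and FN = 2, precision is 2/3, recall is 1/2, and F1 is 4/7 ≈ 0.57.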

## Contents
* Importing libraries;
* Importing data;
* Light EDA;
* Preprocessing;
* Modeling;
* Tuning;
* Predictions.

## Importing libraries

In [15]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.max_rows', 0)  # 0 = auto-detect how many rows fit the display

import spacy
import re
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

## Importing data

In [28]:
train_df = pd.read_csv('./data/train.csv')
test_df =  pd.read_csv('./data/test.csv')
sample_df = pd.read_csv('./data/sample_submission.csv')

print(f"Train shape: {train_df.shape}")
display(train_df.head())
display(train_df.tail())

print(f"Test shape: {test_df.shape}")
display(test_df.head())
display(test_df.tail())

print(f"Sample shape: {sample_df.shape}")
display(sample_df.head())
display(sample_df.tail())

Train shape: (7613, 5)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |
| 7612 | 10873 | | | The Latest: More Homes Razed by Northern Calif... | 1 |


Test shape: (3263, 4)


| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |

| | id | keyword | location | text |
|---|---|---|---|---|
| 3258 | 10861 | | | EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE... |
| 3259 | 10865 | | | Storm in RI worse than last hurricane. My city... |
| 3260 | 10868 | | | Green Line derailment in Chicago http://t.co/U... |
| 3261 | 10874 | | | MEG issues Hazardous Weather Outlook (HWO) htt... |
| 3262 | 10875 | | | #CityofCalgary has activated its Municipal Eme... |


Sample shape: (3263, 2)


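| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |

| | id | target |
|---|---|---|
| 3258 | 10861 | 0 |
| 3259 | 10865 | 0 |
| 3260 | 10868 | 0 |
| 3261 | 10874 | 0 |
| 3262 | 10875 | 0 |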


## Light EDA

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [4]:
features_with_na = [feature for feature in train_df.columns if train_df[feature].isnull().sum() > 0]

for feature in features_with_na:
    print(f"{feature} has {np.round(train_df[feature].isnull().mean(), 2)* 100}% missing values.")

keyword has 1.0% missing values.
location has 33.0% missing values.
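
Rounding the fraction to two decimals before multiplying by 100 hides some detail: going by the non-null counts reported by `train_df.info()` above, `keyword` is missing in 61 of 7613 rows, which is closer to 0.8% than to 1%. Below is a small sketch of the same calculation with percent formatting. It uses only the counts shown above; the `non_null` dictionary and the `missing_report` helper are illustrative stand-ins for the DataFrame, not part of the notebook.

```python
# Counts taken from the train_df.info() output above.
total_rows = 7613
non_null = {"keyword": 7552, "location": 5080}

def missing_report(non_null_counts, total):
    """Return one formatted line per column with its share of missing values."""
    lines = []
    for column, count in non_null_counts.items():
        missing_fraction = (total - count) / total
        # The ".1%" format multiplies by 100 and keeps one decimal place.
        lines.append(f"{column} has {missing_fraction:.1%} missing values.")
    return lines

report = missing_report(non_null, total_rows)
for line in report:
    print(line)
```

On the DataFrame itself, the equivalent would be formatting `train_df[feature].isnull().mean()` with the same `:.1%` specifier.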


In [5]:
# Shuffle the non-null values of the location column and look at the first five
location_not_null_sample = train_df.loc[train_df.location.notnull(), 'location'].sample(frac=1).tolist()
location_not_null_sample[:5]

['Phoenix, Arizona, USA',
 'Leeds',
 'UK',
 'Kaneohe',
 '#1 Vacation Destination,HAWAII']

In [6]:
keyword_not_null_sample = train_df.loc[train_df.keyword.notnull(), 'keyword'].sample(frac=1).tolist()
keyword_not_null_sample[:5]

['explosion', 'disaster', 'mass%20murder', 'mudslide', 'panic']

So location has quite a lot of missing values, and I don't think it will be helpful for this problem, so I'll drop it later. On the other hand, keyword has few missing values and appears to be quite informative about what's happening in the tweet in question, so I'll fill in its missing values later.

Example of text

In [7]:
train_df.text[38]

'Barbados #Bridgetown JAMAICA \x89ÛÒ Two cars set ablaze: SANTA CRUZ \x89ÛÓ Head of the St Elizabeth Police Superintende...  http://t.co/wDUEaj8Q4J'

Just from this example it's possible to see that the text has a lot of symbols, and probably URLs and numbers too; these are things to remove during preprocessing.
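
To check what such cleaning does before running it over the whole column, the regular expressions can be tried on this single tweet first. The sketch below is only an illustration: `clean_tweet` is a hypothetical helper, and the steps it applies (lowercasing, dropping URLs, keeping only ASCII letters, digits and whitespace, dropping digits, collapsing spaces) are one reasonable choice rather than the only one.

```python
import re

def clean_tweet(text):
    """Apply a few regex-based cleaning steps to one tweet (illustrative only)."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)    # drop symbols, punctuation, mis-encoded characters
    text = re.sub(r"\d+", "", text)            # drop digits
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

sample = ("Barbados #Bridgetown JAMAICA \x89ÛÒ Two cars set ablaze: SANTA CRUZ \x89ÛÓ "
          "Head of the St Elizabeth Police Superintende...  http://t.co/wDUEaj8Q4J")
cleaned = clean_tweet(sample)
print(cleaned)
```

Collapsing whitespace last means that any gaps left behind by removed symbols or digits are cleaned up in the same pass.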

## Preprocessing

Now that I've gotten somewhat familiar with the data, I'll preprocess it to train some models.

In [8]:
def preprocessing(dataframe):
    # make all characters lowercase
    dataframe['text'] = dataframe['text'].apply(lambda x: x.lower())
    # remove urls
    dataframe['text'] = dataframe['text'].apply(lambda x: re.sub(r'http\S+', '', x))
    # remove everything except letters, digits and whitespace
    dataframe['text'] = dataframe['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
    # collapse repeated whitespace
    dataframe['text'] = dataframe['text'].apply(lambda x: re.sub(r'\s+', ' ', x))
    # remove numbers
    dataframe['text'] = dataframe['text'].apply(lambda x: re.sub(r'\d+', '', x))
    # fill missing values (keyword, location) with empty strings
    dataframe = dataframe.fillna('')
    # prepend the keyword to the text; the keyword itself is not cleaned,
    # so URL-encoded keywords such as 'mass%20murder' keep their '%20'
    dataframe['text'] = dataframe['keyword'] + ' ' + dataframe['text']
    # drop the location, id and keyword columns
    dataframe = dataframe.drop(['location', 'id', 'keyword'], axis=1)
    # lemmatize the text using spacy
    nlp = spacy.load('en_core_web_sm')
    dataframe['text'] = dataframe['text'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x)]))
    # remove stop words (the English stop-word set that ships with the loaded pipeline)
    stop_words = nlp.Defaults.stop_words
    dataframe['text'] = dataframe['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

    return dataframe

In [9]:
train_df_for_preprocess = train_df.copy()
train_preprocessed = preprocessing(train_df_for_preprocess)
display(train_preprocessed.head())
display(train_preprocessed.tail())

| | text | target |
|---|---|---|
| 0 | deed reason earthquake allah forgive | 1 |
| 1 | forest fire near la ronge sask canada | 1 |
| 2 | resident ask shelter place notify officer evac... | 1 |
| 3 | people receive wildfire evacuation order calif... | 1 |
| 4 | send photo ruby alaska smoke wildfires pour sc... | 1 |

| | text | target |
|---|---|---|
| 7608 | giant crane hold bridge collapse nearby home | 1 |
| 7609 | ariaahrary thetawni control wild fire californ... | 1 |
| 7610 | m utckm s volcano hawaii | 1 |
| 7611 | police investigate ebike collide car little po... | 1 |
| 7612 | late home raze northern california wildfire ab... | 1 |


In [10]:
train_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    7613 non-null   object
 1   target  7613 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 119.1+ KB


In [11]:
train_preprocessed.text[38]

'ablaze barbados bridgetown jamaica car set ablaze santa cruz head st elizabeth police superintende'

In [21]:
# Use CountVectorizer on my train_preprocessed.text column
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_preprocessed.text)
train_vec = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [22]:
train_vec.head()

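| | 20accident | 20bag | 20bagge | 20bagging | 20bags | 20bang | 20bomb | 20bombe | 20bomber | 20bombing | ... | zoom | zotar | zouma | zourryart | zrnf | zss | zumiez | zurich | zxathetis | zzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |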
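
The column names starting with `20` (`20accident`, `20bomb`, ...) appear to come from the URL-encoded keywords seen earlier (for example `mass%20murder`): the keyword is concatenated after the symbols have been stripped from the tweet text, so `%20` survives and the tokenizer splits on `%`. The sketch below reproduces this with the standard library only. It assumes the default `CountVectorizer` token pattern, `(?u)\b\w\w+\b`, and uses `urllib.parse.unquote` as one possible fix; `tokenize` is a hypothetical helper.

```python
import re
from urllib.parse import unquote

# Default token pattern of scikit-learn's CountVectorizer (assumed unchanged here).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Split text the way the default token pattern would."""
    return TOKEN_PATTERN.findall(text)

raw_keyword = "mass%20murder"

# '%' is not a word character, so the token after it starts with '20'.
tokens_raw = tokenize(raw_keyword)

# Decoding the percent-escape first turns '%20' back into a space.
tokens_decoded = tokenize(unquote(raw_keyword))

print(tokens_raw)
print(tokens_decoded)
```

In the notebook, this decoding could be applied to the `keyword` column before it is concatenated with `text`, which would merge columns such as `20bomb` and `bomb` into one feature.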


## Modeling

In [14]:
X_train, X_val, y_train, y_val = train_test_split(train_vec, train_preprocessed.target, test_size=0.2)

model_dict = {
    'MultinomialNB': MultinomialNB(),
    'RandomForestClassifier': RandomForestClassifier(),
    'XGBClassifier': XGBClassifier()
}

# F1 score of each model on the validation set, filled in by the loop below
f1_score_dict = {}

for model in model_dict:
    model_dict[model].fit(X_train, y_train)
    y_pred = model_dict[model].predict(X_val)

    f1_score_dict[model] = f1_score(y_val, y_pred, average='weighted')
    print(f"{model} f1 score: {f1_score_dict[model]}")

MultinomialNB f1 score: 0.793712861453275
RandomForestClassifier f1 score: 0.7866847160269715
XGBClassifier f1 score: 0.7739369255422799


MultinomialNB got the best F1 score on this validation split.
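
Note that the scores above use `average='weighted'`, which averages the F1 of both classes weighted by their support, while the F1 described in the Evaluation section is, as far as I can tell, the binary F1 of the positive class only. The two can differ noticeably. Here is a hand-rolled sketch on made-up labels; the label lists and helper names are illustrative, not competition data.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 of one class, treating `positive` as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    labels = sorted(set(y_true))
    return sum(
        f1_for_class(y_true, y_pred, label) * y_true.count(label) / len(y_true)
        for label in labels
    )

# Made-up example: 6 non-disaster tweets (0) and 4 disaster tweets (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

binary = f1_for_class(y_true, y_pred, positive=1)
weighted = weighted_f1(y_true, y_pred)
print(f"binary F1 (class 1): {binary:.3f}")
print(f"weighted F1:         {weighted:.3f}")
```

In scikit-learn, the binary variant is what `f1_score(y_val, y_pred)` computes with its default `average='binary'`, so dropping the `average` argument would give a number that is more directly comparable to the leaderboard.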

## Tuning

In [16]:
# Define the parameter grid for each model
param_grid = {
    'MultinomialNB': {'alpha': [0.1, 0.5, 1.0]},
    'RandomForestClassifier': {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]},
    'XGBClassifier': {'learning_rate': [0.1, 0.01], 'max_depth': [3, 5]}
}

# Initialize a dictionary to store the best models
best_models = {}

for model_name in model_dict:
    model = model_dict[model_name]
    param_grid_model = param_grid[model_name]

    # Create GridSearchCV for each model
    grid_search = GridSearchCV(model, param_grid_model, scoring='f1_weighted', cv=5)
    grid_search.fit(X_train, y_train)

    # Store the best model in the dictionary
    best_models[model_name] = grid_search.best_estimator_

# Evaluate the best models on the validation set
for model_name, best_model in best_models.items():
    y_pred = best_model.predict(X_val)
    f1_score_val = f1_score(y_val, y_pred, average='weighted')
    print(f"Best {model_name} f1 score on validation set: {f1_score_val}")

Best MultinomialNB f1 score on validation set: 0.793712861453275
Best RandomForestClassifier f1 score on validation set: 0.784837907752378
Best XGBClassifier f1 score on validation set: 0.7363425442993826


Actually, tuning didn't improve anything: MultinomialNB scored the same, and the other two models scored lower. Maybe I'll have to try some other parameters later.
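
When choosing new parameters to try, it helps to know how many fits each grid costs. The sketch below counts the parameter combinations in the grids above with `itertools.product` and multiplies by the 5 cross-validation folds. It does not count the final refit that `GridSearchCV` performs by default, and `count_candidates` is a hypothetical helper.

```python
from itertools import product

# Same grids as in the tuning cell above.
param_grid = {
    'MultinomialNB': {'alpha': [0.1, 0.5, 1.0]},
    'RandomForestClassifier': {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]},
    'XGBClassifier': {'learning_rate': [0.1, 0.01], 'max_depth': [3, 5]},
}

def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return len(list(product(*grid.values())))

cv_folds = 5
fits_per_model = {
    name: count_candidates(grid) * cv_folds for name, grid in param_grid.items()
}

for name, fits in fits_per_model.items():
    print(f"{name}: {fits} fits")
```

The MultinomialNB grid is the cheapest to widen. Its default `alpha` of 1.0 is already in the grid, so the identical validation score before and after tuning is consistent with the search simply selecting the default.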

## Predictions

In [17]:
test_df_for_preprocess = test_df.copy()
test_preprocessed = preprocessing(test_df_for_preprocess)
display(test_preprocessed.head())
display(test_preprocessed.tail())

| | text |
|---|---|
| 0 | happen terrible car crash |
| 1 | hear earthquake different city stay safe |
| 2 | forest fire spot pond geese flee street I save |
| 3 | apocalypse light spokane wildfire |
| 4 | typhoon soudelor kill china taiwan |

| | text |
|---|---|
| 3258 | earthquake safety los angeles safety fastener ... |
| 3259 | storm ri worse hurricane cityampother hard hit... |
| 3260 | green line derailment chicago |
| 3261 | meg issue hazardous weather outlook hwo |
| 3262 | cityofcalgary activate municipal emergency pla... |


Only transforming here (not fitting), so that the test data uses the same tokens as the train dataset.

In [29]:
X = vectorizer.transform(test_preprocessed.text)
test_vec = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
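
The reason for calling only `transform` here is that the model was trained on columns defined by the training vocabulary, so the test matrix must have exactly the same columns in the same order. A toy bag-of-words in plain Python makes the idea concrete. This is not how `CountVectorizer` is implemented, just a sketch of the fit/transform split with hypothetical helper names and a few short example strings.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn a sorted vocabulary (token -> column index) from training documents."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}

def transform(documents, vocabulary):
    """Count tokens per document, using only columns known from training."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        # Tokens that are not in the training vocabulary are silently ignored.
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows

train_docs = ["forest fire near la ronge", "giant crane hold bridge collapse"]
test_docs = ["forest fire spot pond", "green line derailment chicago"]

vocabulary = fit_vocabulary(train_docs)
test_rows = transform(test_docs, vocabulary)

print(list(vocabulary))
print(test_rows)
```

The second test document shares no token with the training documents, so its row is all zeros. Fitting a new vocabulary on the test set instead would produce different columns that the trained model could not interpret.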

In [30]:
test_vec

Unnamed: 0,20accident,20bag,20bagge,20bagging,20bags,20bang,20bomb,20bombe,20bomber,20bombing,...,zoom,zotar,zouma,zourryart,zrnf,zss,zumiez,zurich,zxathetis,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [33]:
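# XGBClassifier is used for the submission here, although MultinomialNB
# scored best on the validation set above
best_model = best_models['XGBClassifier']

# Refit on the full training data before predicting on the test set
X_train, y_train = train_vec, train_preprocessed.target
best_model.fit(X_train, y_train)
y_pred = best_model.predict(test_vec)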

test_df['target'] = y_pred

columns_to_drop = ['keyword', 'location', 'text']
columns_to_drop = [col for col in columns_to_drop if col in test_df.columns]
test_df.drop(columns=columns_to_drop, inplace=True)

test_df.to_csv('submission.csv', index=False)
test_df

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |
| 5 | 12 | 0 |
| 6 | 21 | 0 |
| 7 | 22 | 0 |
| 8 | 27 | 0 |
| ... | ... | ... |
