# Disaster Tweets Classification Using Natural Language Processing (NLP)
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster.

This dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website.

Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480

Competition link: https://www.kaggle.com/c/nlp-getting-started/overview

## Problem Statement
Classify whether a tweet reports a real disaster or not (binary classification).

## Project Planning
1. Import Libraries
2. Load Data

## Importing Libraries

In [None]:
#!pip install catboost

In [None]:
import warnings
warnings.filterwarnings("ignore")
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 150)

import seaborn as sns
import matplotlib.pyplot as plt

import re
import string
from wordcloud import STOPWORDS
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
from nltk import bigrams
import spacy
from spacy.lang.en.examples import sentences 
nlp = spacy.load("en_core_web_sm")

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score

# Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from lightgbm import LGBMClassifier
#from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

plt.rcParams.update({'font.size': 12})

## Load Data

In [None]:
# Load data
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
sub_sample = pd.read_csv("sample_submission.csv")

print (df_train.shape, df_test.shape, sub_sample.shape)

In [None]:
# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# df_train = pd.read_csv(r'/content/drive/MyDrive/Projects and Datasets/Disaster Tweet Classification NLP/train.csv')
# df_test = pd.read_csv(r'/content/drive/MyDrive/Projects and Datasets/Disaster Tweet Classification NLP/test.csv')

In [None]:
print('df_train data shape: ',df_train.shape)
print('df_test data shape: ',df_test.shape)

In [None]:
df_train.head()

In [None]:
df_test.head()

## Exploratory Data Analysis

In [None]:
df_train.info()

In [None]:
print('Null values from df_train data')
null_df_train = df_train.isnull().sum(axis=0)
print(null_df_train)

print('\n\nNull values from df_test data')
null_df_test = df_test.isnull().sum(axis=0)
print(null_df_test)

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))

sns.barplot(x = null_df_train.index, y = null_df_train.values/df_train.shape[0], ax=ax[0])
sns.barplot(x = null_df_test.index, y = null_df_test.values/df_test.shape[0], ax=ax[1])

ax[0].set_ylabel('Fraction of Null Values', size=17)
ax[0].set_title('Train Set', fontsize=17)
ax[1].set_title('Test Set', fontsize=17)

# Use a separate loop variable so the `ax` array is not overwritten
for axis in ax:
  axis.tick_params(labelsize=10)
  for p in axis.patches:
      axis.annotate('{:.2f}'.format(p.get_height()),
                    (p.get_x() + 0.4, p.get_height()),
                    ha='center', va='bottom', color='black', size=17)
plt.show()

The null-value distributions of the train and test sets are very similar, which suggests that both were sampled from the same population.

Null values in the 'keyword' column are first imputed with the placeholder 'None'. Because the keyword is an important feature that summarizes the disaster, a placeholder can then be replaced with a known keyword found in the tweet text. This is done when the 'keyword' column is treated.  
The 'location' field needs further exploration before its null values can be imputed.

Let's look at the target distribution and then explore the 'keyword' column.

In [None]:
# Proportion of Target Classes
class_count = df_train.groupby('target').count()['id']/df_train.shape[0]
print(class_count)

plt.figure(figsize = (10,10))
df_train.groupby('target').count()['id'].plot(kind='pie', 
                                          labels=['Not Disaster (57%)', 'Disaster (43%)'],
                                          title='Target distribution in Training Set',
                                          ylabel='')

In [None]:
df_train['keyword'].value_counts()

In [None]:
df_train_temp = df_train['keyword'].value_counts()
df_train_temp[df_train_temp.values < 30]

Some keywords contain the URL-encoded sequence '%20', which needs to be replaced with a space.
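
`%20` is the URL percent-encoding of a space. As a minimal sketch, assuming the keywords contain only standard percent-escapes, the standard library's `urllib.parse.unquote` decodes any such escape rather than one at a time. The keyword values below are illustrative samples, not output from the dataset.

```python
from urllib.parse import unquote

# Illustrative keyword values (hypothetical samples)
raw_keywords = ["forest%20fire", "body%20bags", "earthquake"]

# unquote() turns every percent-escape back into the character it encodes
decoded = [unquote(k) for k in raw_keywords]
print(decoded)
```

Values without an escape, such as `earthquake`, pass through unchanged.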

### Treating 'keyword' column

In [None]:
# Fill missing values with 'None'
df_train['keyword'] = df_train['keyword'].fillna('None')
df_test['keyword'] = df_test['keyword'].fillna('None')

# Replace the URL-encoded space '%20' in the 'keyword' column with a real space
df_train['keyword'] = df_train['keyword'].apply(lambda x: re.sub('%20', ' ', x))
df_test['keyword'] = df_test['keyword'].apply(lambda x: re.sub('%20', ' ', x))

In [None]:
# Filling 'None' values in 'keyword' column with a word from 'keyword' column values, which is present in that text.
# For each row with 'keyword' = None
#   Check corresponding 'text' for an existing 'keyword' value
#       If found, replace 'None' with that 'keyword' value
no_keyword = df_train['keyword'] == 'None'
keywords = np.unique(df_train[~no_keyword]['keyword'].to_numpy())

for df in [df_train, df_test]:
    for i in range(len(df)):
        if df.loc[i, 'keyword'] == 'None':
            for k in keywords:
                if k in df.loc[i, 'text'].lower():
                    df.loc[i, 'keyword'] = k
                    break
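
The loop above uses a plain substring test, so a short keyword can match inside a longer, unrelated word. A possible refinement, shown here only as a sketch, is to match on word boundaries with a regular expression. The helper name `find_keyword` and the sample inputs are hypothetical.

```python
import re

def find_keyword(text, keywords):
    """Return the first keyword found in `text` as a whole word or phrase, else None."""
    lowered = text.lower()
    for k in keywords:
        # \b anchors the match at word boundaries, so 'fire' does not match 'fired'
        if re.search(r"\b" + re.escape(k) + r"\b", lowered):
            return k
    return None

keywords = ["fire", "flood"]
print(find_keyword("He got fired yesterday", keywords))
print(find_keyword("Flood warning issued for the valley", keywords))
```

Whether stricter matching helps depends on the data: it avoids false positives but also skips inflected forms such as plurals.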

In [None]:
print('Number of missing values left:')
print('For Train set:', df_train[df_train['keyword'] == 'None'].shape[0])
print('For Test set:', df_test[df_test['keyword'] == 'None'].shape[0])

pd.concat([df_train[df_train['keyword'] == 'None']['text'], df_test[df_test['keyword'] == 'None']['text']])

These are the remaining tweets whose 'keyword' is still 'None'. Their text does not contain any known keyword, so they are left as they are.

In [None]:
# Fill missing values with 'None'
df_train['location'] = df_train['location'].fillna('None')
df_test['location'] = df_test['location'].fillna('None')

In [None]:
df_train.isna().sum()
df_test.isna().sum()

In [None]:
# Top 20 keywords for each class

disaster = df_train[df_train['target']==1]['keyword'].value_counts().head(20)
non_disaster = df_train[df_train['target']==0]['keyword'].value_counts().head(20)

fig, ax = plt.subplots(1,2, figsize=(20,7))

# Recent seaborn releases require x and y to be passed as keyword arguments
ax[0].set_title('Top keywords for disaster tweets')
ax[0].set_xlabel('Count')
sns.barplot(x=disaster.values, y=disaster.index, color='coral', ax=ax[0])

ax[1].set_title('Top keywords for non-disaster tweets')
ax[1].set_xlabel('Count')
sns.barplot(x=non_disaster.values, y=non_disaster.index, color='skyblue', ax=ax[1])

In [None]:
# Tweet Length for both classes

pos_tw_len = df_train[df_train['target'] == 1]['text'].str.len()
neg_tw_len = df_train[df_train['target'] == 0]['text'].str.len()

fig, ax = plt.subplots(1,2, figsize=(20,7))
ax[0].set_xlabel(' ')
ax[0].set_title('Length of Disastrous Tweets')
sns.distplot(pos_tw_len, label='Disaster Tweet length', ax=ax[0], color='red')

ax[1].set_xlabel(' ')
ax[1].set_title('Length of Non-Disastrous Tweets')
sns.distplot(neg_tw_len, label='Non-Disaster Tweet length', ax=ax[1])

In [None]:
# Word Count of Tweets in both classes
pos_tw_len = df_train[df_train['target'] == 1]['text'].apply(lambda x: len(x.split(' ')))
neg_tw_len = df_train[df_train['target'] == 0]['text'].apply(lambda x: len(x.split(' ')))

fig, ax = plt.subplots(1,2, figsize=(20,7))
ax[0].set_xlabel(' ')
ax[0].set_title('Word Count of Disastrous Tweets')
sns.distplot(pos_tw_len, label='Disaster Tweet length', ax=ax[0], color='red')

ax[1].set_xlabel(' ')
ax[1].set_title('Word Count of Non-Disastrous Tweets')
sns.distplot(neg_tw_len, label='Non-Disaster Tweet length', ax=ax[1])
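
One caveat about the word counts above: `split(' ')` splits on every single space, so consecutive spaces produce empty strings that inflate the count, whereas `split()` with no argument splits on any run of whitespace. A minimal illustration with a made-up tweet:

```python
tweet = "Forest  fire near La Ronge"  # note the double space after 'Forest'

naive_count = len(tweet.split(' '))   # splits on each single space
robust_count = len(tweet.split())     # splits on any run of whitespace
print(naive_count, robust_count)
```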

In [None]:
# Number of Unique words in Tweets in both classes
pos_tw_len = df_train[df_train['target'] == 1]['text'].apply(lambda x: len(set(x.split(' '))))
neg_tw_len = df_train[df_train['target'] == 0]['text'].apply(lambda x: len(set(x.split(' '))))

fig, ax = plt.subplots(1,2, figsize=(20,7))
ax[0].set_xlabel(' ')
ax[0].set_title('Unique Word Count of Disastrous Tweets')
sns.distplot(pos_tw_len, label='Disaster Tweet length', ax=ax[0], color='red')

ax[1].set_xlabel(' ')
ax[1].set_title('Unique Word Count of Non-Disastrous Tweets')
sns.distplot(neg_tw_len, label='Non-Disaster Tweet length', ax=ax[1])

In [None]:
# Number of occurrences of the '#' hashtag character in a tweet, for both classes
pos_tw = df_train[df_train['target'] == 1]['text'].apply(lambda x: x.count('#'))
neg_tw = df_train[df_train['target'] == 0]['text'].apply(lambda x: x.count('#'))

fig, ax = plt.subplots(1,2, figsize=(20,7))
ax[0].set_xlabel(' ')
ax[0].set_title('Hashtags Count of Disastrous Tweets')
sns.distplot(pos_tw, label='Disaster Tweet length', ax=ax[0], color='red')

ax[1].set_xlabel(' ')
ax[1].set_title('Hashtags Count of Non-Disastrous Tweets')
sns.distplot(neg_tw, label='Non-Disaster Tweet length', ax=ax[1])


In [None]:
# Top 20 Hashtags for each class
def find_hashtags(tweet):
    return " ".join([match.group(0)[1:] for match in re.finditer(r"#\w+", tweet)]) or 'None'
df_train['hashtags'] = df_train['text'].apply(lambda x: find_hashtags(x))
df_test['hashtags'] = df_test['text'].apply(lambda x: find_hashtags(x))

fig, ax = plt.subplots(1,2, figsize=(20,7))


freq_d = FreqDist(w for w in word_tokenize(' '.join(df_train.loc[df_train['target']==1, 'hashtags'])) if w != 'None')
df_d = pd.DataFrame.from_dict(freq_d, orient='index', columns=['count'])
hashtag_d = df_d.sort_values('count', ascending=False).head(20)
sns.barplot(x=hashtag_d['count'], y=hashtag_d.index, color='coral', ax=ax[0])
ax[0].set_title('Top 20 hashtags in disaster tweets')

freq_nd = FreqDist(w for w in word_tokenize(' '.join(df_train.loc[df_train['target']==0, 'hashtags'])) if w != 'None')
df_nd = pd.DataFrame.from_dict(freq_nd, orient='index', columns=['count'])
hashtag_nd = df_nd.sort_values('count', ascending=False).head(20)
sns.barplot(x=hashtag_nd['count'], y=hashtag_nd.index, ax=ax[1], color='skyblue')
ax[1].set_title('Top 20 hashtags in non-disaster tweets')

plt.show()
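
The hashtag frequencies can also be counted without joining and re-tokenizing the text. The sketch below uses only the standard library. Lower-casing the tags is an assumption added here (the cell above keeps the original case), and the tweets are made-up samples.

```python
import re
from collections import Counter

def extract_hashtags(tweet):
    """Return the hashtags in a tweet, lower-cased and without the leading '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

# Illustrative tweets (hypothetical samples)
tweets = [
    "Wildfire spreading fast #California #wildfire",
    "Stay safe everyone #Wildfire",
    "No hashtags here",
]

# Count every hashtag across all tweets
counts = Counter(tag for t in tweets for tag in extract_hashtags(t))
print(counts.most_common(2))
```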

In [None]:
# df_train[df_train['location'] != 'None']['location'].value_counts().plot(kind='pie')

## Preprocessing

In [None]:
df_train['text'][0:20]

### Data Cleaning
Use regular expressions to clean the text, then remove punctuation and stop words, and lemmatize the remaining words.

In [None]:
stop_words = set(list(STOPWORDS) + stopwords.words('english'))

In [None]:
def preprocess(data):
  '''The below preprocessing is performed.
    1. Lower casing
    2. Cleaning with RegExp
    3. Tokenizing
    4. Remove Punctuations
    5. Remove Stopwords
    6. Lemmatize
  '''
  # Converting all the text data to its lower form
  data = data.lower()

  # Cleaning with RegExp
  # Removing URLs from the text data
  data = re.sub(r'https?://\S+|www\.\S+', '', data)
  # Removing HTML Tags
  data = re.sub(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", '', data)
  #Removing Non-Ascii
  data = re.sub(r'[^\x00-\x7f]','', data)
  # Removing Emojis
  emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
  data = emoji_pattern.sub(r'', data)

  doc = nlp(data)

  # Remove punctuation (including multi-character tokens such as '...') and whitespace tokens
  data = [token for token in doc if not token.is_punct and not token.is_space]

  # Remove stopwords
  data = [token for token in data if not token.is_stop]

  # Lemmatize
  data = ' '.join([token.lemma_ for token in data])

  return data
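
To see what the regular-expression steps do in isolation, here is a small standard-library sketch that applies the same URL, HTML and non-ASCII patterns to one made-up tweet. The helper name `strip_noise` and the final whitespace-collapsing step are additions for illustration and are not part of `preprocess`.

```python
import re

def strip_noise(text):
    """Lower-case the text and drop URLs, HTML tags/entities, and non-ASCII characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", "", text)
    text = re.sub(r"[^\x00-\x7f]", "", text)
    # Collapse the whitespace left behind by the removals (extra step, assumption)
    return re.sub(r"\s+", " ", text).strip()

# Made-up tweet containing an HTML entity, tags, a URL and an emoji
sample = "Fire &amp; smoke near <b>downtown</b> http://t.co/abc123 \U0001F525 stay safe!"
print(strip_noise(sample))
```

Because the non-ASCII filter already removes emoji code points, the separate emoji pattern in `preprocess` acts as a safety net rather than doing additional work.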


In [None]:
%%time
df_train['cleaned_text'] = df_train['text'].apply(preprocess)
df_train.head()
df_train.shape

In [None]:
%%time
df_test['cleaned_text'] = df_test['text'].apply(preprocess)
df_test.head()
df_test.shape

In [None]:
# Dataset labels

labels = df_train['target']

## Model Building

### Experiment 1

In [None]:
from sklearn.preprocessing import MaxAbsScaler

In [None]:
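# train_bow and train_tfidf are created in the Experiment 2 and Experiment 3
# cells below; run those cells first.
scaler1 = MaxAbsScaler()
train_bow_scaled = scaler1.fit_transform(train_bow)

scaler2 = MaxAbsScaler()
train_tfidf_scaled = scaler2.fit_transform(train_tfidf)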

In [None]:
from sklearn.metrics import f1_score, accuracy_score

In [None]:
labels

In [None]:
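logreg1 = LogisticRegression(random_state=1)
# Fit and predict on the same (scaled) representation
logreg1.fit(train_tfidf_scaled, labels)

# Note: this is accuracy on the training data, not a held-out estimate
y_pred = logreg1.predict(train_tfidf_scaled)
print('Training accuracy: ', accuracy_score(labels, y_pred))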

### Experiment 2
Bag of Words with Array of ML Models

In [None]:
# Bag-of-Words Model
bow = CountVectorizer()
train_bow = bow.fit_transform(df_train['cleaned_text'])

print('Vocabulary Length : ', len(bow.vocabulary_))
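
For intuition, a bag-of-words matrix can be built by hand. This is a simplified sketch: it tokenizes on whitespace, whereas `CountVectorizer` applies its own token pattern and lower-casing by default, and it returns dense lists rather than a sparse matrix. The documents are made-up samples.

```python
from collections import Counter

docs = ["forest fire near town", "fire fire everywhere", "calm sunny day"]

# Vocabulary: every distinct token, sorted so the column order is deterministic
vocab = sorted({tok for d in docs for tok in d.split()})

# One row per document, one column per vocabulary term, holding raw counts
rows = []
for d in docs:
    term_counts = Counter(d.split())
    rows.append([term_counts[t] for t in vocab])

print(vocab)
for r in rows:
    print(r)
```

Word order is discarded: each document is represented only by how often each vocabulary term occurs in it.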

In [None]:
# Model Experimentations
# Machine Learning models

# pipe_lr = make_pipeline(MinMaxScaler(), LogisticRegression(random_state=1)) 
# pipe_svm = make_pipeline(MinMaxScaler(), SVC(random_state=1))


model = {'Logistic Regression' : LogisticRegression(random_state=1),
         'Support Vector Machines' : SVC(random_state=1),
         'Multinomial Naive Bayes' : MultinomialNB(),
         'Decision Trees' : DecisionTreeClassifier(random_state=1),
         'Random Forest Classifier' : RandomForestClassifier(random_state=1),
         'lightGBM': LGBMClassifier(random_state=1),
         'XG Boosting' : XGBClassifier(random_state=1)}


def _model_experimentation_pipeline(X, Y, models):
    model_score = {}
    for name, model in models.items():
        model_ = model
        print("5-Fold Cross-Validation : ", name)
        
        model_score[name] = np.mean(cross_val_score(model_,X, Y,
                                              cv=5,
                                              scoring='accuracy',
                                              verbose=2,
                                              n_jobs=-1))
        
    # Converting model_score to DataFrame
    model_score = {'5-Fold CV Score': model_score}
    model_score_df = pd.DataFrame(model_score)
    model_score_df.rename_axis('Model', inplace=True)
    model_score_df.reset_index(inplace=True)
    model_score_df.sort_values('5-Fold CV Score', ascending=False, inplace=True)
    return model_score_df
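
As a rough picture of what 5-fold cross-validation does with the data, the sketch below partitions sample indices into five folds, each used once as the test set. It is a plain, unshuffled split written for illustration: for classifiers with an integer `cv`, scikit-learn uses stratified folds that also preserve class proportions, which this sketch does not attempt. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for plain, unshuffled k-fold splitting."""
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample appears in exactly one test fold, and the reported score is the mean over the five test folds.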

In [None]:
# GridSearch
# pipe_lr = make_pipeline(MinMaxScaler(), LogisticRegression(random_state=1, max_iter=1000))
# logreg = GridSearchCV(estimator=pipe_lr, 
#                       param_grid=lr_param_grid,
#                       scoring='accuracy', cv=5)

# grid_arr = [logreg]

In [None]:
%%time
bow_score = _model_experimentation_pipeline(train_bow, labels, model)
bow_score

### Experiment 3
TF-IDF Vectorization with Array of ML models

In [None]:
# TF-IDF Model
tfidf = TfidfVectorizer()
train_tfidf = tfidf.fit_transform(df_train['cleaned_text'])
print('Vocabulary Length : ', len(tfidf.vocabulary_))
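
The weighting behind TF-IDF can be worked through by hand. The sketch below assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, and the documents are made-up samples.

```python
import math
from collections import Counter

docs = ["fire fire flood", "flood warning", "sunny day"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(toks):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(toks)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

In the first document, `fire` outweighs `flood` because it occurs twice and appears in fewer documents overall.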

In [None]:
%%time
tfidf_score = _model_experimentation_pipeline(train_tfidf, labels, model)
tfidf_score