# Natural Language Processing with Disaster Tweets

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Things to keep in mind that differ from main.ipynb:
- Use all the columns
- Processing pipeline (lowercasing, stopword removal, punctuation removal, lemmatization, tokenization, and padding)
- Use ML classification algorithms

In [1]:
import pandas as pd

import numpy as np

import re
import spacy

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [4]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')

In [12]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [8]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [14]:
print(f'Shape of train set: {train.shape}.')
print(f'Shape of test set: {test.shape}.')

Shape of train set: (7613, 5).
Shape of test set: (3263, 4).


In [15]:
print(train.isnull().sum()) 

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64


# Exploratory data analysis

- **id** is a unique identifier for each tweet; it is not useful for the prediction task.
- **keyword** is missing in about 1% of the rows (61 of 7,613); we can fill the gaps with a placeholder such as '<NKW>' (no keyword).
- **text** is the text of the tweet, usually one long sentence and sometimes shorter; it may contain URLs, mentions of other accounts, and hashtags.
- **location** is missing in about 33% of the rows (2,533 of 7,613); we can still use the column by filling the missing values with 'unknown'.
- **target** is the target variable; 1 means the tweet is about a real disaster and 0 means it is not.
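
The missing-value percentages quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the counts printed by `train.isnull().sum()` and the row count from `train.shape`; the helper name `missing_share` is made up for illustration.

```python
# Counts copied from the train.isnull().sum() output and train.shape above.
n_rows = 7613
missing = {"keyword": 61, "location": 2533}

def missing_share(count, total):
    """Return the percentage of missing values, rounded to one decimal."""
    return round(100 * count / total, 1)

shares = {col: missing_share(n, n_rows) for col, n in missing.items()}
for col, pct in shares.items():
    print(f"{col}: {pct}% missing")
```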

## Preprocessing

In [16]:
nlp = spacy.load("en_core_web_sm")

def preprocessing(df):
    
    df.fillna('', inplace=True) # Temporary: missing values will be filled more carefully later
    
    df['text'] = df['text'].apply(lambda x: re.sub(r'http[s]?://\S+|www\.\S+', 'twitterimagelink', x)) # TODO: check whether removing the URL entirely works better than replacing it with 'twitterimagelink'
    # Another idea for the future: replace @username mentions with a placeholder such as 'referring a friend'.

    df['combined_text'] = df['keyword'] + ' ' + df['location'] + ' ' + df['text'] # NaN + str = NaN, so a missing value in any column would make the result NaN; hence the fillna above.
    df = df.drop(['id','keyword','location','text'], axis=1)
    
    # Lower case
    df['combined_text'] = df['combined_text'].str.lower()
    
    # Stopword removal
    stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
        
    for stopword in stopwords:
        df['combined_text'] = df['combined_text'].str.replace(f' {stopword} ' , ' ', regex=False)
        
    return df

def create_tokenizer(df):
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(df['combined_text'])
    return tokenizer

def tokenization(df, tokenizer):
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(df['combined_text'])
    #padded = pad_sequences(sequences)
    print(f'-- Current vocab size: {len(word_index)} --')
    return sequences, word_index

def stemming(sequences, word_index):
    stemmed_sequences = []
    stemmed_word_index = {}
    reverse_word_index = {v: k for k, v in word_index.items()}
    
    for sequence in sequences:
        stemmed_seq = []
        for token_id in sequence:
            word = reverse_word_index.get(token_id, '')
            stemmed_word = nlp(word)[0].lemma_
            if stemmed_word not in stemmed_word_index:
                stemmed_word_index[stemmed_word] = len(stemmed_word_index) + 1
            stemmed_seq.append(stemmed_word_index[stemmed_word])
        stemmed_sequences.append(stemmed_seq)
    
    print(f'-- Vocab size after stemming: {len(stemmed_word_index)} --')
    return stemmed_sequences, stemmed_word_index
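
One thing to note about the stopword loop in `preprocessing`: it only replaces a stopword that has a space on both sides, so a stopword that is the last word of a tweet stays in the text. A possible alternative, sketched here with a hypothetical `clean_text` helper and a shortened stopword set (both are assumptions for illustration, not part of the notebook), is to split the text into tokens and filter them against a set.

```python
import re

# A short subset of the stopword list used in preprocessing(), for illustration only.
STOPWORDS = {"i", "the", "is", "a", "of", "in", "and", "this"}
# Equivalent to the URL pattern used in preprocessing().
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, replace URLs with a placeholder token, and drop stopwords token by token."""
    text = URL_PATTERN.sub("twitterimagelink", text.lower())
    tokens = text.split()
    return " ".join(tok for tok in tokens if tok not in stopwords)

print(clean_text("The fire is in the forest http://t.co/abc"))
# fire forest twitterimagelink
```

Because the filter works on whole tokens, it does not depend on where in the string a stopword appears. Punctuation attached to a word (for example "the,") would still need to be stripped first.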

In [5]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")
tokenizer = None

def preprocessing(df):
    
    df.fillna('', inplace=True) # Temporary: missing values will be filled more carefully later
    
    df['text'] = df['text'].apply(lambda x: re.sub(r'http[s]?://\S+|www\.\S+', 'twitterimagelink', x)) # TODO: check whether removing the URL entirely works better than replacing it with 'twitterimagelink'
    # Another idea for the future: replace @username mentions with a placeholder such as 'referring a friend'.

    df['combined_text'] = df['keyword'] + ' ' + df['location'] + ' ' + df['text'] # NaN + str = NaN, so a missing value in any column would make the result NaN; hence the fillna above.
    df = df.drop(['id','keyword','location','text'], axis=1)
    
    # Lower case
    df['combined_text'] = df['combined_text'].str.lower()
    
    # Stopword removal
    stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
        
    for stopword in stopwords:
        df['combined_text'] = df['combined_text'].str.replace(f' {stopword} ' , ' ', regex=False)
        
    return df

def create_tokenizer(df):
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(df['combined_text'])
    return tokenizer

def tokenization(df, tokenizer):
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(df['combined_text'])
    return sequences, word_index

def stemming(sequences, word_index):
    stemmed_sequences = []
    stemmed_word_index = {}
    reverse_word_index = {v: k for k, v in word_index.items()}
    
    for sequence in sequences:
        stemmed_seq = []
        for token_id in sequence:
            word = reverse_word_index.get(token_id, '')
            stemmed_word = nlp(word)[0].lemma_
            if stemmed_word not in stemmed_word_index:
                stemmed_word_index[stemmed_word] = len(stemmed_word_index) + 1
            stemmed_seq.append(stemmed_word_index[stemmed_word])
        stemmed_sequences.append(stemmed_seq)
    
    print(f'-- Vocab size after stemming: {len(stemmed_word_index)} --')
    return stemmed_sequences, stemmed_word_index

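def main_pipeline(df, dataset_name):
    # Without this declaration the assignment below makes `tokenizer` local,
    # and any call with dataset_name != 'train' raises UnboundLocalError.
    global tokenizer
    df = preprocessing(df)
    if dataset_name == 'train':
        tokenizer = create_tokenizer(df)
    
    sequences, word_index = tokenization(df, tokenizer)
    stemmed_sequences, stemmed_word_index = stemming(sequences, word_index)
    padded = pad_sequences(stemmed_sequences)
    return padded, stemmed_word_index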

padded, stemmed_word_index = main_pipeline(train.drop(['target'], axis=1), 'train')

-- Vocab size after stemming: 16572 --
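
The `stemming` function calls `nlp(word)` for every token of every tweet, so the same word is lemmatized many times. A possible speed-up, sketched below, is to lemmatize each vocabulary word once and then remap the sequences through a lookup table. The helper `remap_sequences` and the toy lemmatizer are assumptions for illustration; in the notebook the `lemmatize` argument could be something like `lambda w: nlp(w)[0].lemma_`. Unlike `stemming`, this sketch assigns ids in vocabulary order and to every vocabulary word, so the resulting ids differ from the ones produced above.

```python
def remap_sequences(sequences, word_index, lemmatize):
    """Map token ids to lemma ids, calling `lemmatize` once per vocabulary word."""
    lemma_index = {}   # lemma -> new id (ids start at 1; 0 stays free for padding)
    old_to_new = {}    # original token id -> new lemma id
    for word, token_id in word_index.items():
        lemma = lemmatize(word)
        new_id = lemma_index.setdefault(lemma, len(lemma_index) + 1)
        old_to_new[token_id] = new_id
    remapped = [[old_to_new[t] for t in seq] for seq in sequences]
    return remapped, lemma_index

# Toy stand-in for a real lemmatizer, used only to keep the example self-contained.
toy_lemmas = {"fires": "fire", "burning": "burn"}
word_index = {"<OOV>": 1, "fire": 2, "fires": 3, "burning": 4}
sequences = [[2, 3], [4, 1]]

remapped, lemma_index = remap_sequences(
    sequences, word_index, lambda w: toy_lemmas.get(w, w)
)
print(remapped)      # [[2, 2], [3, 1]]
print(lemma_index)   # {'<OOV>': 1, 'fire': 2, 'burn': 3}
```

Since the lookup table is built from the tokenizer vocabulary rather than from one dataset, the same table can be reused for the test set, which keeps the ids consistent between train and test.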


In [17]:
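# Train and validation data
y = train['target']
X = preprocessing(train.drop(['target'], axis=1))

tokenizer = create_tokenizer(X)
sequences, word_index = tokenization(X, tokenizer)

# Experimental and slow; delete if not working. The lemmatized ids are not fed
# to the model yet because they would not match the ids produced for the test set.
stemmed_sequences, stemmed_word_index = stemming(sequences, word_index)

# The model needs padded integer arrays, not the DataFrame of strings
X = pad_sequences(sequences)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Test data: reuse the tokenizer fitted on the training set and the same length
test = preprocessing(test)
test_sequences, _ = tokenization(test, tokenizer)
test = pad_sequences(test_sequences, maxlen=X.shape[1])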

-- Current vocab size: 19979 --


KeyboardInterrupt: 

So far we have reduced the vocabulary size from 22701 to:
- 19979
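
Padding is the last step of the processing pipeline. By default, `pad_sequences` adds zeros at the front of each sequence and, when `maxlen` is given, truncates from the front as well. The pure-Python sketch below mimics that default behavior with a hypothetical `pad_pre` helper, mainly to show why the test set should be padded to the length found on the training set.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length (maxlen >= 1)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

train_seqs = [[5, 3], [7, 1, 4, 2]]
test_seqs = [[9], [8, 6, 5, 4, 3]]

train_padded = pad_pre(train_seqs)                           # length inferred from train: 4
test_padded = pad_pre(test_seqs, maxlen=len(train_padded[0]))

print(train_padded)  # [[0, 0, 5, 3], [7, 1, 4, 2]]
print(test_padded)   # [[0, 0, 0, 9], [6, 5, 4, 3]]
```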

## Model building

In [9]:
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1024, activation='relu', name='L2'),
    tf.keras.layers.Dense(512, activation='relu', name='L3'),
    tf.keras.layers.Dense(256, activation='relu', name='L4'),
    tf.keras.layers.Dense(128, activation='relu', name='L5'),
    tf.keras.layers.Dense(64, activation='relu', name='L6'),
    tf.keras.layers.Dense(32, activation='relu', name='L7'),
    tf.keras.layers.Dense(16, activation='relu', name='L8'),
    tf.keras.layers.Dense(8, activation='relu', name='L9'),
    tf.keras.layers.Dense(4, activation='relu', name='L10'),
    tf.keras.layers.Dense(2, activation='relu', name='L11'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='L12'),
])

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['f1_score', 'accuracy'])

history = model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), verbose=2)

Epoch 1/50
191/191 - 3s - 16ms/step - accuracy: 0.6066 - f1_score: 0.6019 - loss: 0.6411 - val_accuracy: 0.7439 - val_f1_score: 0.5976 - val_loss: 0.5013
Epoch 2/50
191/191 - 2s - 11ms/step - accuracy: 0.8294 - f1_score: 0.6019 - loss: 0.3949 - val_accuracy: 0.7932 - val_f1_score: 0.5976 - val_loss: 0.4550
Epoch 3/50
191/191 - 2s - 11ms/step - accuracy: 0.8998 - f1_score: 0.6019 - loss: 0.2496 - val_accuracy: 0.7984 - val_f1_score: 0.5976 - val_loss: 0.5157
Epoch 4/50
191/191 - 2s - 12ms/step - accuracy: 0.9384 - f1_score: 0.6019 - loss: 0.1541 - val_accuracy: 0.7761 - val_f1_score: 0.5976 - val_loss: 0.5633
Epoch 5/50
191/191 - 2s - 13ms/step - accuracy: 0.9645 - f1_score: 0.6023 - loss: 0.0892 - val_accuracy: 0.7393 - val_f1_score: 0.5976 - val_loss: 0.7381
Epoch 6/50
191/191 - 2s - 12ms/step - accuracy: 0.9731 - f1_score: 0.6067 - loss: 0.0647 - val_accuracy: 0.7498 - val_f1_score: 0.5972 - val_loss: 1.1284
Epoch 7/50
191/191 - 2s - 12ms/step - accuracy: 0.9713 - f1_score: 0.6079 - 
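
In the log above, the training loss keeps falling while the validation loss rises after epoch 2, which suggests the model starts to overfit early. The sketch below replays the logged validation losses through a simple patience rule to show where early stopping would have halted training. The helper `early_stop_epoch` is made up for illustration; in Keras the same idea is available through the `tf.keras.callbacks.EarlyStopping` callback.

```python
# Validation losses copied from epochs 1-6 of the training log above.
val_losses = [0.5013, 0.4550, 0.5157, 0.5633, 0.7381, 1.1284]

def early_stop_epoch(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed, for patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stop_epoch(val_losses, patience=2)
print(f"best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
# best epoch: 2, training would stop after epoch 4
```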

## Prepare upload

In [10]:
predictions = model.predict(test)
predictions = np.round(predictions).astype(int)
predictions = predictions.flatten()

[1m102/102[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step


In [11]:
print(predictions)

[0 0 1 ... 0 0 0]
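
The predictions above are obtained by rounding the sigmoid outputs. Note that NumPy rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0. An explicit threshold makes that choice visible and easy to tune on the validation set. The sketch below uses made-up probabilities and two hypothetical helpers, `to_labels` and `binary_f1`; in the notebook, `f1_score` from scikit-learn (already imported) would do the same job as `binary_f1`.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert probabilities to 0/1 labels; values at the threshold count as 1."""
    return [1 if p >= threshold else 0 for p in probabilities]

def binary_f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example values, not results from the notebook.
y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.5, 0.4, 0.7, 0.1]

labels = to_labels(probs)
print(labels)                               # [1, 1, 0, 1, 0]
print(round(binary_f1(y_true, labels), 3))  # 0.667
```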


In [12]:
chosen_model_name = '2048_nn_changed_processing'
chosen_model_predictions = predictions

now = datetime.now()
date_time_str = now.strftime("%Y%m%d_%H%M%S")

# `test` was overwritten with the padded sequences, so the ids are re-read from disk
submission = pd.DataFrame({
    'id': pd.read_csv('data/test.csv')['id'],
    'target': chosen_model_predictions
})

submission.to_csv(f'output/submission_{chosen_model_name}_{date_time_str}.csv', index=False)
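
Writing the file fails if the `output/` directory does not exist yet. A minimal sketch of a helper that creates the directory first and uses the same timestamped naming scheme is shown below. It relies only on the standard library, and the name `write_submission` and its parameters are assumptions for illustration.

```python
import csv
from datetime import datetime
from pathlib import Path

def write_submission(ids, targets, model_name, out_dir="output"):
    """Write an id,target CSV with a timestamped name and return its path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    file_path = out_path / f"submission_{model_name}_{stamp}.csv"
    with file_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "target"])
        writer.writerows(zip(ids, targets))
    return file_path

# Example call (writes a file under ./output):
# write_submission([0, 2, 3], [0, 0, 1], "demo")
```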