# Natural Language Processing with Disaster Tweets

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Things to do differently compared to main.ipynb:
- Use all the columns
- Processing pipeline (lowercasing, stopword removal, punctuation removal, lemmatization, tokenization, and padding)
- Use ML classification algorithms

In [1]:
import pandas as pd

import numpy as np

import re
import spacy

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [29]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')

In [12]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [8]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [14]:
print(f'Shape of train set: {train.shape}.')
print(f'Shape of test set: {test.shape}.')

Shape of train set: (7613, 5).
Shape of test set: (3263, 4).


In [15]:
print(train.isnull().sum()) 

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64


# Exploratory data analysis

- **id** is a unique identifier for each tweet; it is not useful for the prediction task.
- **keyword** is missing in about 1% of the rows (61 of 7613). We can fill the gaps with a placeholder such as '<NKW>' (no keyword).
- **text** is the text of the tweet. Most tweets are one long sentence, some are shorter. The text may contain URLs, mentions of other accounts, and hashtags.
- **location** is missing in about 33% of the rows (2533 of 7613). We can still use the column by filling the missing values with 'unknown'.
- **target** is the target variable, 1 means the tweet is about a real disaster and 0 means it's not.
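
The missing-value shares quoted above can be checked, and the suggested placeholders applied, with a few lines of pandas. This is a minimal sketch on a tiny made-up frame: the `<NKW>` and `unknown` placeholders come from the notes above, while the `toy` rows are assumptions for illustration only. Note that the `preprocessing` function further down fills with empty strings instead.

```python
import pandas as pd

# Toy stand-in for train.csv (made-up rows, same column names as the dataset)
toy = pd.DataFrame({
    "keyword": ["fire", None, "flood", "storm"],
    "location": [None, None, "Canada", "Texas"],
    "text": ["forest fire near town", "what a day", "river flooding", "storm warning"],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100

# Fill each column with its own placeholder instead of a blanket empty string
filled = toy.fillna({"keyword": "<NKW>", "location": "unknown"})

print(missing_pct.round(1))
print(filled)
```

Using distinct placeholders keeps "no keyword" and "no location" distinguishable once the columns are concatenated into one text field.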

## Preprocessing

In [30]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")
tokenizer = None
stemmed_word_index = None

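def preprocessing(df):
    # Fill missing keyword/location values with empty strings.
    # fillna returns a copy, so the caller's DataFrame is left untouched.
    df = df.fillna('')
    
    # Replace every URL with a single placeholder token
    df['text'] = df['text'].apply(lambda x: re.sub(r'http[s]?://\S+|www\.\S+', 'twitterimagelink', x))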

    df['combined_text'] = df['keyword'] + ' ' + df['location'] + ' ' + df['text']
    df = df.drop(['id','keyword','location','text'], axis=1)
    
    # Lower case
    df['combined_text'] = df['combined_text'].str.lower()
    
    # Stopword removal
    stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
        
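    # Note: only stopwords with a space on both sides are matched, so a stopword
    # at the very end of the text or next to punctuation is left in place.
    for stopword in stopwords:
        df['combined_text'] = df['combined_text'].str.replace(f' {stopword} ' , ' ', regex=False)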
        
    return df

def create_tokenizer(df):
    global tokenizer
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(df['combined_text'])
    return tokenizer

def tokenization(df):
    global tokenizer
    sequences = tokenizer.texts_to_sequences(df['combined_text'])
    return sequences

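def stemming(sequences):
    # Despite the name, this lemmatizes each token with spaCy (token.lemma_)
    # and builds a new word index over the lemmas.
    global stemmed_word_index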
    stemmed_sequences = []
    
    if stemmed_word_index is None:
        stemmed_word_index = {}
    
    for sequence in sequences:
        stemmed_seq = []
        for token_id in sequence:
            word = tokenizer.index_word.get(token_id, '')
            stemmed_word = nlp(word)[0].lemma_
            if stemmed_word not in stemmed_word_index:
                stemmed_word_index[stemmed_word] = len(stemmed_word_index) + 1
            stemmed_seq.append(stemmed_word_index[stemmed_word])
        stemmed_sequences.append(stemmed_seq)
    
    print(f'-- Vocab size after stemming: {len(stemmed_word_index)} --')
    return stemmed_sequences

def main_pipeline(df, is_training=False):
    global tokenizer, stemmed_word_index
    
    df = preprocessing(df)
    
    if is_training:
        tokenizer = create_tokenizer(df)
        stemmed_word_index = None
    
    sequences = tokenization(df)
    #stemmed_sequences = stemming(sequences)
    padded = pad_sequences(sequences)
    
    return padded

train_padded = main_pipeline(train.drop(['target'], axis=1), is_training=True)
test_padded = main_pipeline(test, is_training=False)
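
The stopword loop in `preprocessing` only matches stopwords that have a space on both sides. One alternative, sketched here on the assumption that whole-word matching is the intent, is a single compiled regular expression with word boundaries. The shortened `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names for illustration and are not part of the notebook.

```python
import re

# Hypothetical shortened stopword list; the notebook's full list would be used in practice
STOPWORDS = {"i", "the", "a", "is", "of", "in", "at"}

# \b anchors each alternative to word boundaries, so "in" does not match inside "raining"
STOPWORD_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, sorted(STOPWORDS))) + r")\b")

def remove_stopwords(text: str) -> str:
    """Drop whole-word stopwords and collapse the leftover whitespace."""
    without = STOPWORD_RE.sub(" ", text.lower())
    return " ".join(without.split())

print(remove_stopwords("The fire is in the centre of town"))
print(remove_stopwords("It is raining at the coast."))
```

A single pass over the text also avoids rewriting the whole column once per stopword, which is what the loop does.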

In [31]:
# Train and validation data
X = train_padded

train = pd.read_csv('data/train.csv')
y = train['target']

X_train, X_val, y_train, y_val  = train_test_split(X, y, test_size=0.2, random_state=42)

test_padded = np.array(test_padded)

In [32]:
print(X_train)
print(y_train)

[[    0     0     0 ...  4815  7047     2]
 [    0     0     0 ...   226    82 13154]
 [    0     0     0 ...     5    56     2]
 ...
 [    0     0     0 ...  8947     2     2]
 [    0     0     0 ... 19978 19979     2]
 [    0     0     0 ...    52 19525     2]]
4996    1
3263    0
4907    1
2855    1
4716    0
       ..
5226    0
5390    0
860     0
7603    1
7270    1
Name: target, Length: 6090, dtype: int64
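
The leading zeros in `X_train` come from `pad_sequences`, which pads at the front by default. Because the pipeline passes no `maxlen`, the training and test matrices are each padded to their own longest sequence and can end up with different widths. Below is a minimal pure-Python sketch of fixed-length pre-padding; `pad_to_length` and `MAX_LEN` are hypothetical names, and in the notebook the same effect would more likely come from passing one shared `maxlen` to `pad_sequences`.

```python
MAX_LEN = 6  # hypothetical fixed width; in practice derived from the training set

def pad_to_length(sequences, max_len=MAX_LEN, pad_value=0):
    """Left-pad (or left-truncate) every sequence to exactly max_len items."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-max_len:]  # keep the last max_len tokens
        padded.append([pad_value] * (max_len - len(kept)) + kept)
    return padded

# Short sequences are padded, long ones are cut, and both end up the same width
train_like = pad_to_length([[4815, 7047, 2], [226, 82]])
test_like = pad_to_length([[5, 56, 2, 9, 11, 3, 8, 1]])

print(train_like)
print(test_like)
```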


So far, preprocessing has reduced the vocabulary size from 22701 to:
- 19979
- 16572

## Model building

In [36]:
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2048, activation='relu', name='L1'),
    tf.keras.layers.Dense(1024, activation='relu', name='L2'),
    tf.keras.layers.Dense(512, activation='relu', name='L3'),
    tf.keras.layers.Dense(256, activation='relu', name='L4'),
    tf.keras.layers.Dense(128, activation='relu', name='L5'),
    tf.keras.layers.Dense(64, activation='relu', name='L6'),
    tf.keras.layers.Dense(32, activation='relu', name='L7'),
    tf.keras.layers.Dense(16, activation='relu', name='L8'),
    tf.keras.layers.Dense(8, activation='relu', name='L9'),
    tf.keras.layers.Dense(4, activation='relu', name='L10'),
    tf.keras.layers.Dense(2, activation='relu', name='L11'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='L12'),
])

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), verbose=2)

Epoch 1/100
191/191 - 1s - 7ms/step - accuracy: 0.6374 - loss: 0.6413 - val_accuracy: 0.7741 - val_loss: 0.5502
Epoch 2/100
191/191 - 1s - 4ms/step - accuracy: 0.8222 - loss: 0.4278 - val_accuracy: 0.8030 - val_loss: 0.4526
Epoch 3/100
191/191 - 1s - 4ms/step - accuracy: 0.8828 - loss: 0.2951 - val_accuracy: 0.7932 - val_loss: 0.4618
Epoch 4/100
191/191 - 1s - 4ms/step - accuracy: 0.9192 - loss: 0.2133 - val_accuracy: 0.7656 - val_loss: 0.5078
Epoch 5/100
191/191 - 1s - 4ms/step - accuracy: 0.9478 - loss: 0.1435 - val_accuracy: 0.7800 - val_loss: 0.5432
Epoch 6/100
191/191 - 1s - 4ms/step - accuracy: 0.9654 - loss: 0.1044 - val_accuracy: 0.7695 - val_loss: 0.5946
Epoch 7/100
191/191 - 1s - 4ms/step - accuracy: 0.9762 - loss: 0.0728 - val_accuracy: 0.7768 - val_loss: 0.6507
Epoch 8/100
191/191 - 1s - 4ms/step - accuracy: 0.9806 - loss: 0.0588 - val_accuracy: 0.7466 - val_loss: 0.7269
Epoch 9/100
191/191 - 1s - 5ms/step - accuracy: 0.9836 - loss: 0.0471 - val_accuracy: 0.7242 - val_loss:
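
The log above shows validation loss bottoming out at epoch 2 while training accuracy keeps climbing, which is the usual sign of overfitting. The patience rule behind early stopping can be sketched in plain Python using the `val_loss` values from epochs 1 to 8. `early_stop_epoch` is a hypothetical helper written for this illustration; in Keras, comparable behaviour is available through the `tf.keras.callbacks.EarlyStopping` callback passed to `model.fit`.

```python
# val_loss values copied from epochs 1-8 of the training log
val_losses = [0.5502, 0.4526, 0.4618, 0.5078, 0.5432, 0.5946, 0.6507, 0.7269]

def early_stop_epoch(losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based, under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here
            return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stop_epoch(val_losses)
print(best_epoch, stop_epoch)
```

With a patience of 3, the rule keeps epoch 2 as the best checkpoint and stops after epoch 5, well short of the 100 epochs requested.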

## Prepare upload

In [34]:
predictions = model.predict(test_padded)
predictions = np.round(predictions).astype(int)
predictions = predictions.flatten()

[1m102/102[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step


In [35]:
chosen_model_name = '2048_nn_changed_processing'
chosen_model_predictions = predictions

# Timestamp the file name so earlier submissions are not overwritten
now = datetime.now()
date_time_str = now.strftime("%Y%m%d_%H%M%S")

submission = pd.DataFrame({
    'id': pd.read_csv('data/test.csv')['id'],
    'target': chosen_model_predictions
})

submission.to_csv(f'output/submission_{chosen_model_name}_{date_time_str}.csv', index=False)
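
The notebook imports `accuracy_score`, `f1_score`, `precision_score` and `recall_score` but never calls them, and accuracy alone hides how each class is treated. Before uploading, the same thresholding step could be scored on the validation split. The sketch below spells the arithmetic out in plain Python on made-up numbers; `precision_recall_f1` is a hypothetical helper, and the scikit-learn functions already imported would normally be used instead.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Threshold probabilities, then compute precision, recall and F1 for class 1."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and sigmoid outputs, standing in for y_val and model.predict(X_val)
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

precision, recall, f1 = precision_recall_f1(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```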

# Conclusion

- Best result so far: a score of 0.75881 on a Kaggle upload. I believe the more I preprocess the text, the less accuracy I get.
- Watch videos about NLP.
- I don't know if deep learning is the best approach; it's the only one I know how to do.
- Explore other ML models.
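
As one way to follow up on the last point, a classical baseline can be assembled from scikit-learn pieces: TF-IDF features fed into logistic regression. This is a sketch under assumptions, not something the notebook ran. The toy tweets and labels are invented, and the settings shown are untuned starting values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data; in the notebook this would be train['text'] and train['target']
texts = [
    "forest fire near the town, residents evacuated",
    "earthquake destroys several buildings",
    "flood waters rising, people trapped",
    "my mixtape is fire",
    "this traffic is a disaster lol",
    "i love sunny days at the beach",
]
labels = [1, 1, 1, 0, 0, 0]

# The vectorizer handles lowercasing, tokenization and stopword removal itself
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)

preds = baseline.predict(["wildfire evacuation ordered for residents", "love this beach"])
print(preds)
```

A pipeline like this takes raw strings directly, so it would bypass the Keras tokenizer and padding steps and give a quick reference score to compare the neural network against.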