# Natural Language Processing with Disaster Tweets

In this competition, you're challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren't. You'll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

In [73]:
import pandas as pd
import numpy as np

from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [78]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')

In [12]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [8]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


In [14]:
print(f'Shape of train set: {train.shape}.')
print(f'Shape of test set: {test.shape}.')

Shape of train set: (7613, 5).
Shape of test set: (3263, 4).


In [15]:
print(train.isnull().sum()) 

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64


## Exploratory data analysis

- **id**: a unique identifier for each tweet; not useful for the prediction task.
- **keyword**: about 1% of the values are missing (61 of 7,613); they could be filled with a placeholder such as `<NKW>` (no keyword).
- **text**: the text of the tweet, usually one long sentence and sometimes shorter; it may contain URLs, mentions of other accounts, and hashtags.
- **location**: about 33% of the values are missing (2,533 of 7,613); the column is still usable if the missing values are filled with 'unknown'.
- **target**: the target variable; 1 means the tweet is about a real disaster and 0 means it is not.
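
The placeholder idea from the list above could be sketched as follows. This is only an illustration under stated assumptions: `fill_missing` is a hypothetical helper, and the three sample rows are made up rather than taken from `train.csv`.

```python
import pandas as pd

# Tiny stand-in for train.csv; these rows are made up for illustration.
sample = pd.DataFrame({
    "keyword": ["ablaze", None, "flood"],
    "location": [None, "Canada", None],
    "text": ["example tweet one", "example tweet two", "example tweet three"],
})

def fill_missing(df):
    """Return a copy with missing keyword/location values replaced by placeholders."""
    filled = df.copy()
    filled["keyword"] = filled["keyword"].fillna("<NKW>")      # no keyword
    filled["location"] = filled["location"].fillna("unknown")  # unknown location
    return filled

filled = fill_missing(sample)
print(filled[["keyword", "location"]])
```

Working on a copy leaves the original frame untouched, so the raw null counts can still be inspected afterwards.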

## Preprocessing
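
This notebook passes the raw tweet text straight to the tokenizer. Because tweets may contain URLs, mentions, and hashtags, an optional cleaning step could be applied first. The sketch below is one possible approach, not something this notebook does: `clean_tweet` and the two regular expressions are hypothetical, and whether cleaning improves the score would have to be checked on the validation set.

```python
import re

# Assumed patterns: good enough for a sketch, not a complete URL or handle grammar.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Remove URLs and @mentions, keep hashtag words, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")   # "#wildfires" becomes "wildfires"
    return " ".join(text.split())

print(clean_tweet("Apocalypse lighting. #Spokane #wildfires"))
# → Apocalypse lighting. Spokane wildfires
```

If this were used, it could be applied with `train['text'].map(clean_tweet)` before fitting the tokenizer, and the same function would need to be applied to the test set.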

In [79]:
y = train['target']
train = train.drop(['id','keyword','location','target'], axis=1)
test = test.drop(['id','keyword','location',], axis=1)

In [80]:
def preprocessing(df):
    """Fit a tokenizer on df['text'] and return the padded sequences and the tokenizer."""
    # Words not seen during fitting are mapped to the "<OOV>" token
    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(df['text'])
    word_index = tokenizer.word_index
    # Turn each tweet into a list of integer word ids
    sequences = tokenizer.texts_to_sequences(df['text'])
    # Pad with zeros (at the front, by default) to the length of the longest tweet
    padded = pad_sequences(sequences)
    # Report the vocabulary size
    print(len(word_index))
    return padded, tokenizer

In [81]:
X, tokenizer = preprocessing(train)

X_train, X_val, y_train, y_val  = train_test_split(X, y, test_size=0.2, random_state=42)

22701
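
`pad_sequences` is called with its defaults in `preprocessing`. By default it pads with zeros at the front, and when no `maxlen` is given it pads to the longest sequence in the batch. The pure-Python stand-in below illustrates that behavior. `pad_pre` is a hypothetical helper written for this sketch, not part of Keras, and the token ids are illustrative.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the last maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)   # zeros go in front
    return padded

seqs = [[35, 914, 6, 1952, 131, 93], [480, 3940, 7669, 1501]]
print(pad_pre(seqs))             # padded to the longest sequence (6)
print(pad_pre(seqs, maxlen=8))   # padded to a fixed length
print(pad_pre(seqs, maxlen=3))   # longer sequences lose their first ids
```

One consequence is that padding the training and test sets separately without `maxlen` can give them different widths. Passing the same `maxlen` to both calls keeps the input shapes aligned.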


## Model building

In [68]:
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1024, activation='relu', name='L2'),
    tf.keras.layers.Dense(512, activation='relu', name='L3'),
    tf.keras.layers.Dense(256, activation='relu', name='L4'),
    tf.keras.layers.Dense(128, activation='relu', name='L5'),
    tf.keras.layers.Dense(64, activation='relu', name='L6'),
    tf.keras.layers.Dense(32, activation='relu', name='L7'),
    tf.keras.layers.Dense(16, activation='relu', name='L8'),
    tf.keras.layers.Dense(8, activation='relu', name='L9'),
    tf.keras.layers.Dense(4, activation='relu', name='L10'),
    tf.keras.layers.Dense(2, activation='relu', name='L11'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='L12'),
])

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), verbose=2)

Epoch 1/100
191/191 - 4s - 19ms/step - accuracy: 0.6279 - loss: 0.6410 - val_accuracy: 0.7564 - val_loss: 0.5891
Epoch 2/100
191/191 - 3s - 14ms/step - accuracy: 0.8082 - loss: 0.5256 - val_accuracy: 0.7269 - val_loss: 0.6401
Epoch 3/100
191/191 - 3s - 15ms/step - accuracy: 0.8172 - loss: 0.4582 - val_accuracy: 0.7196 - val_loss: 0.6007
Epoch 4/100
191/191 - 3s - 16ms/step - accuracy: 0.8445 - loss: 0.4146 - val_accuracy: 0.4773 - val_loss: 0.7439
Epoch 5/100
191/191 - 3s - 16ms/step - accuracy: 0.8680 - loss: 0.3767 - val_accuracy: 0.7525 - val_loss: 0.5570
Epoch 6/100
191/191 - 3s - 16ms/step - accuracy: 0.9235 - loss: 0.2787 - val_accuracy: 0.7748 - val_loss: 0.8125
Epoch 7/100
191/191 - 4s - 19ms/step - accuracy: 0.9392 - loss: 0.2381 - val_accuracy: 0.7623 - val_loss: 0.8050
Epoch 8/100
191/191 - 3s - 15ms/step - accuracy: 0.9287 - loss: 0.2482 - val_accuracy: 0.7078 - val_loss: 0.8479
Epoch 9/100
191/191 - 3s - 14ms/step - accuracy: 0.9445 - loss: 0.2071 - val_accuracy: 0.7610 - 
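
In the log above, training accuracy keeps rising while `val_loss` reaches its lowest value at epoch 5 and then increases, which suggests overfitting. Keras provides the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights` arguments) for this situation. The pure-Python sketch below only simulates the patience rule on the first eight `val_loss` values from the log. `simulate_early_stopping` is a hypothetical helper and a simplification, not the Keras implementation.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (last_epoch_run, best_epoch) under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop training here
    return len(val_losses), best_epoch

# val_loss for epochs 1 to 8, copied from the training log
val_losses = [0.5891, 0.6401, 0.6007, 0.7439, 0.5570, 0.8125, 0.8050, 0.8479]

print(simulate_early_stopping(val_losses, patience=3))  # → (4, 1)
print(simulate_early_stopping(val_losses, patience=5))  # → (8, 5)
```

With a patience of 3 the run would stop at epoch 4 and miss the better epoch 5, so the patience value is a trade-off between training time and the chance of finding a later minimum.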

## Prepare upload

In [69]:
test = pd.read_csv('data/test.csv')
test = test.drop(['id','keyword','location',], axis=1)

In [70]:
print(test.shape)

(3263, 1)


In [71]:
# Convert the test tweets to integer sequences with the tokenizer fitted on the training set
testing_sequences = tokenizer.texts_to_sequences(test['text'])
print(testing_sequences[:3])  # show only the first few sequences

# Pad to the same width as the training matrix X so train and test inputs have the same shape
testing_padded = pad_sequences(testing_sequences, maxlen=X.shape[1])

# Predict probabilities, then round them to 0/1 class labels
predictions = model.predict(testing_padded, verbose=0)
predictions = np.round(predictions).astype(int)
predictions = predictions.flatten()

[[35, 914, 6, 1952, 131, 93], [475, 57, 264, 12, 1202, 2649, 606, 2322, 246], [78, 12, 6, 190, 46, 20, 826, 3576, 1, 25, 5168, 872, 5, 770, 11, 1415, 506, 98, 41]]

In [66]:
print(predictions)

[0 1 1 ... 1 0 1]


In [72]:
# Create the submission DataFrame

chosen_model_name = '2048_nn'
chosen_model_predictions = predictions

submission = pd.DataFrame({
    'id': pd.read_csv('data/test.csv')['id'],  # re-read the ids, since 'id' was dropped from test above
    'target': chosen_model_predictions
})

# Get the current date and time
now = datetime.now()
# Format the date and time as a string
date_time_str = now.strftime("%Y%m%d_%H%M%S")

# Save the DataFrame to a CSV file with the date and time in the filename
submission.to_csv(f'data/submission_{chosen_model_name}_{date_time_str}.csv', index=False)