# Disaster Tweets

## Introduction

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

We are given a dataset of tweets, each labelled as describing a real disaster (1) or not (0). The task is to predict that label from the tweet text.

Link for the Kaggle Competition - __[Disaster Tweets Kaggle](https://www.kaggle.com/c/nlp-getting-started/overview)__

## Code

### Importing required libraries and packages

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
# LSTM is imported from tensorflow.keras.layers; the old keras.layers.recurrent path is not available in recent releases
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, BatchNormalization, Dropout, LSTM

### Analysing Training Dataset

In [2]:
df=pd.read_csv('train.csv')

In [3]:
df

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


In [6]:
a = set()   # unique whitespace-separated words seen so far
size = -1   # length (in words) of the longest tweet
for text in df['text']:
    temp = text.split()
    size = max(size, len(temp))
    a.update(temp)


In [218]:
# Number of unique words in the dataset
print(len(a))

31924


In [219]:
# Maximum length of the text
print(size)

31
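The count above comes from splitting on whitespace, so `Fire`, `fire` and `#fire` are all treated as different words. The Keras `Tokenizer` used later lowercases text and strips punctuation by default, so its vocabulary will generally be smaller. The sketch below illustrates the difference on three made-up tweets; the sample texts and both helper functions are hypothetical, and the normalisation only approximates what the tokenizer does.

```python
# Punctuation the Keras Tokenizer filters by default (assumed to match its documented default)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def raw_vocab(texts):
    """Unique tokens after a plain whitespace split (case and punctuation kept)."""
    vocab = set()
    for text in texts:
        vocab.update(text.split())
    return vocab


def normalized_vocab(texts):
    """Unique tokens after lowercasing and replacing filtered characters with spaces."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    vocab = set()
    for text in texts:
        vocab.update(text.lower().translate(table).split())
    return vocab


# Hypothetical sample data, not taken from train.csv
tweets = [
    "Forest fire near La Ronge",
    "forest FIRE spreading!",
    "#fire near the forest.",
]

raw_count = len(raw_vocab(tweets))
norm_count = len(normalized_vocab(tweets))
print(raw_count, norm_count)  # 11 7
```

Because the whitespace count is an upper bound in this sense, using it (plus one) as `num_words` simply means no word is dropped for being too rare.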


### Creating Training set and Validation set

In [85]:
training_sentences=[]
for text in df['text']:
    training_sentences.append(text)

In [220]:
# Some important initializations
vocab_size = 31925   # unique whitespace-separated words counted above (31924) plus one
embedding_dim = 15
max_len = 31         # longest tweet, in words
oov_token = "<OOV>"

In [87]:
# Creating a tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded_sequences = pad_sequences(sequences, truncating='post', maxlen=max_len)
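Note that `pad_sequences` is called with `truncating='post'` but keeps its default `padding='pre'`: long sequences lose their tail, while short ones get zeros added at the front. The following pure-Python sketch mimics that behaviour on toy token ids; `pad_pre_truncate_post` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and cut long ones at the end."""
    padded = []
    for seq in sequences:
        kept = list(seq)[:maxlen]                # truncating='post': keep the first maxlen ids
        filler = [value] * (maxlen - len(kept))  # padding='pre': filler goes in front
        padded.append(filler + kept)
    return padded


# Toy token ids, not produced by the real tokenizer
example = pad_pre_truncate_post([[5, 3], [7, 1, 4, 9, 2, 8]], maxlen=4)
print(example)  # [[0, 0, 5, 3], [7, 1, 4, 9]]
```

Since `max_len` equals the longest training tweet, truncation should only matter for test tweets that are longer than anything seen in training.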

In [109]:
# Splitting the training and validation sets (70% / 30%)
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    padded_sequences, df.target, test_size=0.3, random_state=277)

### Creating the Model for training

In [222]:
def get_model():
    # One padded sequence of max_len token ids per tweet
    inputs = keras.Input(shape=(max_len,))

    # Learn a dense vector for every token id
    x = Embedding(vocab_size, embedding_dim)(inputs)
    x = Dropout(0.75)(x)

    # Summarise the whole sequence into a single 64-dimensional vector
    x = LSTM(64)(x)
    x = Dropout(0.6)(x)

    # Disabled experiments (GlobalAveragePooling1D instead of the LSTM,
    # a second LSTM(16), and Dense blocks of 64, 128 and 512 units) are omitted.

    # Dense block: linear layer -> batch norm -> ReLU -> dropout
    x = Dense(32)(x)
    x = BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = Dropout(0.3)(x)

    x = Dense(10)(x)
    x = BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = Dropout(0.3)(x)

    # Single sigmoid unit: probability that the tweet describes a disaster
    outputs = Dense(1, activation='sigmoid')(x)

    model = keras.Model(inputs=inputs, outputs=outputs)
    return model
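To get a feel for where the capacity of this network sits, the parameter counts of the active layers can be worked out by hand. The sketch below uses the standard formulas for Keras layers with their default settings (biases enabled, four parameter vectors per batch-normalisation feature); the helper names are hypothetical, and `model.summary()` remains the authoritative source.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per token id
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(n_in, n_out):
    # Weight matrix plus bias vector
    return n_in * n_out + n_out


def batchnorm_params(features):
    # gamma, beta, moving mean, moving variance (the last two are not trainable)
    return 4 * features


counts = {
    "embedding": embedding_params(31925, 15),
    "lstm": lstm_params(15, 64),
    "dense_32": dense_params(64, 32),
    "bn_32": batchnorm_params(32),
    "dense_10": dense_params(32, 10),
    "bn_10": batchnorm_params(10),
    "output": dense_params(10, 1),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n}")
print(f"{'total':>10}: {total}")
```

Under these assumptions the embedding table accounts for the vast majority of the parameters, which is one reason heavy dropout right after it is a plausible choice for a dataset of this size.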

In [223]:
model=get_model()

In [224]:
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

In [225]:
epochs = 10
history = model.fit(X_train, y_train, epochs=epochs, batch_size=8, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [226]:
model.evaluate(X_valid,y_valid)



[0.6157493591308594, 0.8042907118797302]
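The two numbers returned by `model.evaluate` are the validation loss (binary cross-entropy) and the validation accuracy, in that order. As a rough illustration of what they measure, the sketch below computes both from a handful of made-up labels and predicted probabilities. The data and helper functions are hypothetical, and the accuracy uses the same `>= 0.5` cut-off that this notebook applies to its predictions.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and probabilities, not taken from the validation set
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 4), acc)  # 0.5403 0.5
```

A loss that stays fairly high while accuracy is good, as in the result above, usually means some predictions are confidently wrong, since cross-entropy penalises those far more than accuracy does.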

### Making Predictions for the test dataframe

In [19]:
df_test=pd.read_csv('test.csv')

In [227]:
testing_sentences=[]
for text in df_test['text']:
    testing_sentences.append(text)

In [228]:
sequences = tokenizer.texts_to_sequences(testing_sentences)
padded_sequences = pad_sequences(sequences, truncating='post', maxlen=max_len)

In [229]:
result = model.predict(padded_sequences)

In [230]:
result

array([[0.03745618],
       [0.08779734],
       [0.97163355],
       ...,
       [0.6078888 ],
       [0.6590822 ],
       [0.43105322]], dtype=float32)

In [231]:
# Threshold the predicted probabilities at 0.5 to get 0/1 labels
result = (result >= 0.5).astype(np.float32)


### Making the Submission CSV File

In [232]:
submit=np.concatenate([np.array(df_test['id']).reshape(-1,1).astype(np.int64),result],axis=1)

In [233]:
submit=pd.DataFrame(submit)

In [234]:
submit

| | 0 | 1 |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 2.0 | 0.0 |
| 2 | 3.0 | 1.0 |
| 3 | 9.0 | 0.0 |
| 4 | 11.0 | 1.0 |
| ... | ... | ... |
| 3258 | 10861.0 | 0.0 |
| 3259 | 10865.0 | 1.0 |
| 3260 | 10868.0 | 1.0 |
| 3261 | 10874.0 | 1.0 |


In [235]:
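# Kaggle expects integer `id` and `target` columns and no index column
submit.columns = ['id', 'target']
submit.astype(int).to_csv('submit.csv', index=False)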
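For reference, the competition's submission file has an `id,target` header followed by one integer pair per test tweet. The sketch below writes that layout with only the standard library, using the first three ids and (rounded) probabilities shown earlier as sample input; `write_submission` is a hypothetical helper, not part of the notebook.

```python
import csv
import io


def write_submission(ids, probabilities, handle, threshold=0.5):
    """Write `id,target` rows, turning each probability into a 0/1 label."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "target"])
    for tweet_id, prob in zip(ids, probabilities):
        writer.writerow([int(tweet_id), int(prob >= threshold)])


# An in-memory buffer stands in for the real submit.csv file
buffer = io.StringIO()
write_submission([0, 2, 3], [0.037, 0.088, 0.972], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing an open file handle instead of the buffer, for example one from `open('submit.csv', 'w', newline='')`, would write the same rows to disk.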