# Tweets analysis

The purpose of this project is to build a text classifier that decides whether a tweet is related to a disaster event or not. We work on a Kaggle dataset, and the model is a neural network for text classification.

## Data import and first analysis

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
train = pd.read_csv('gdrive/MyDrive/Colab Notebooks/deep learning project/train.csv')
test = pd.read_csv('gdrive/MyDrive/Colab Notebooks/deep learning project/test.csv')
submission = pd.read_csv('gdrive/MyDrive/Colab Notebooks/deep learning project/sample_submission.csv')

In [None]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [None]:
train.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


In [None]:
train.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [None]:
train.keyword.value_counts()

fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64
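
Some keywords in the output above contain `%20`, which is the URL encoding of a space. If the keyword column is used later, decoding it first may help. A minimal sketch with the standard library, using a few keywords copied from the output above:

```python
from urllib.parse import unquote

# keywords copied from the value_counts output above
keywords = ["fatalities", "forest%20fire", "radiation%20emergency"]

# unquote turns percent-encoded sequences such as %20 back into characters
decoded = [unquote(k) for k in keywords]
print(decoded)
```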

In [None]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [None]:
test.describe()

Unnamed: 0,id
count,3263.0
mean,5427.152927
std,3146.427221
min,0.0
25%,2683.0
50%,5500.0
75%,8176.0
max,10875.0


In [None]:
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0


We have 3 datasets:
  - train.csv:
    5 columns:
      - id: we will not use it, so we drop it
      - keyword: a particular keyword from the tweet (may be blank). We drop this column at first and may add it back later; in this project we try to classify the tweets from the text alone.
      - location: the location (may be blank). As for keyword, we will not use it.
      - text: the text of the tweet, which we will use to classify it.
      - target: 1 for a real disaster, else 0
  - test.csv: same as train but without the target column. We will not use it at first; it is only needed for the Kaggle challenge.
  - sample_submission.csv: maps each tweet id to the target predicted by the model, for the Kaggle submission.


For now we only use the train.csv file (the other files are for the Kaggle competition), and only the text and target columns.

In [None]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [None]:
train = train[['text', 'target']]

In [None]:
train.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


Let's read the first 10 tweets of each category (disaster / no disaster):

In [None]:
print("*** No disaster tweets ***")
for i in range(10) :
  print(train[train['target'] == 0].iloc[i, 0])

print()
print("*** Disaster tweets ***")
for i in range(10) :
  print(train[train['target'] == 1].iloc[i, 0])


*** No disaster tweets ***
What's up man?
I love fruits
Summer is lovely
My car is so fast
What a goooooooaaaaaal!!!!!!
this is ridiculous....
London is cool ;)
Love skiing
What a wonderful day!
LOOOOOOL

*** Disaster tweets ***
Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornad

This may be specific to this sample, but we can see that:

- the vocabulary is (obviously) not the same
- disaster tweets seem to have more characters
- hashtags (#) are used more often in disaster-related tweets
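
These impressions can be checked with simple per-class statistics. The sketch below does so on a handful of tweets copied from the samples printed above; the helper `class_stats` is hypothetical and six tweets are far too few to conclude anything, so this only illustrates the computation:

```python
# a few tweets copied from the samples printed above: (text, target)
sample = [
    ("What's up man?", 0),
    ("I love fruits", 0),
    ("Summer is lovely", 0),
    ("Forest fire near La Ronge Sask. Canada", 1),
    ("13,000 people receive #wildfires evacuation orders in California", 1),
    ("#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas", 1),
]

def class_stats(rows, label):
    """Return (mean character count, mean number of '#') for one class."""
    texts = [text for text, target in rows if target == label]
    mean_len = sum(len(t) for t in texts) / len(texts)
    mean_hashtags = sum(t.count("#") for t in texts) / len(texts)
    return mean_len, mean_hashtags

for label in (0, 1):
    mean_len, mean_hashtags = class_stats(sample, label)
    print(label, round(mean_len, 1), round(mean_hashtags, 2))
```

On the full dataframe, something like `train.groupby('target')['text'].apply(lambda s: s.str.len().mean())` should give the equivalent mean length per class.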

## Text pre-processing

We have to pre-process the text for the model to work better.

In [None]:
# load the English spaCy pipeline:
import en_core_web_sm
nlp = en_core_web_sm.load()


In [None]:
# import stop words :
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
# now we add a 'text_clean' column to our dataframe:

import re

train['text_clean'] = train['text'].apply(lambda x : ''.join(ch for ch in x if ch.isalnum() # keep only alphanumeric characters (mostly)
                                                                            or ch==" " # keep spaces, needed for tokenizing
                                                                            or ch=="#")) # hashtags matter on Twitter, so we keep '#' here

# collapse multiple spaces into one, lowercase the text and strip leading/trailing spaces:
train['text_clean'] = train['text_clean'].apply(lambda x : re.sub(' +', ' ', x).lower().strip())

# lemmatize words and remove stop words
train['text_clean'] = train['text_clean'].apply(lambda x : " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) and (token.text not in STOP_WORDS)]))

# finally, as lemmatization separated '#' from its word, we stick them back together:
train['text_clean'] = train['text_clean'].apply(lambda x : x.replace("# ", "#"))
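
To see what the first two steps do before lemmatization, here is a minimal standalone sketch applied to two tweets from the dataset. The function name `basic_clean` is hypothetical; it only reproduces the character filter and the space/case normalization, not the spaCy step:

```python
import re

def basic_clean(text):
    """Keep alphanumerics, spaces and '#', then normalise spaces and case."""
    kept = "".join(ch for ch in text if ch.isalnum() or ch == " " or ch == "#")
    return re.sub(" +", " ", kept).lower().strip()

print(basic_clean("13,000 people receive #wildfires evacuation orders in California "))
# prints: 13000 people receive #wildfires evacuation orders in california

print(basic_clean("Annihilated Abs . ?? http://t.co/1xPw292tJe"))
# prints: annihilated abs httptco1xpw292tje
```

The second example shows a side effect of the character filter: a hyperlink is squashed into a single meaningless token.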


In [None]:
train.head()

Unnamed: 0,text,target,text_clean
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfire evacuation orde...
4,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...


In [None]:
vocabulary_size = 10000

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocabulary_size,  filters='!"$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n') # instantiate the tokenizer
# we specify filters here because the default Tokenizer filters would strip '#'
tokenizer.fit_on_texts(train["text_clean"])
train["review_encoded"] = tokenizer.texts_to_sequences(train["text_clean"])

train["len_review"] = train["review_encoded"].apply(lambda x: len(x))

train = train[train["len_review"]!=0]

In [None]:
train.head()

Unnamed: 0,text,target,text_clean,review_encoded,len_review
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,"[3737, 399, 657, 2301, 1922]",5
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,"[118, 2, 151, 514, 5707, 5708, 1056]",7
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",11
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfire evacuation orde...,"[2302, 6, 2303, 930, 183, 280, 40]",7
4,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,"[169, 127, 5710, 3738, 162, 930, 2304, 97]",8


In [None]:
tokenizer.index_word[657]

'#earthquake'
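
As a rough mental model of what the tokenizer does, the sketch below ranks words by descending frequency starting at 1 (index 0 stays free for padding) and drops words whose rank is `num_words` or higher when encoding. It is a simplified pure-Python imitation with hypothetical helper names, not the Keras implementation:

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based rank by descending frequency (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, num_words):
    """Encode a text, dropping unknown words and words ranked num_words or higher."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

texts = ["forest fire near la ronge", "fire in the wood", "#wildfire fire smoke"]
word_index = fit_word_index(texts)

print(word_index["fire"])  # the most frequent word gets index 1
print(encode("forest fire #wildfire", word_index, num_words=10000))
```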

In [None]:
reviews_pad = tf.keras.preprocessing.sequence.pad_sequences(train.review_encoded, padding="post")

In [None]:
full_ds = tf.data.Dataset.from_tensor_slices((reviews_pad, train.target.values))

In [None]:
next(iter(full_ds))

(<tf.Tensor: shape=(25,), dtype=int32, numpy=
 array([3737,  399,  657, 2301, 1922,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0], dtype=int32)>,
 <tf.Tensor: shape=(), dtype=int64, numpy=1>)
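
The trailing zeros in the tensor above come from `padding="post"`: every sequence is right-padded with 0 up to the length of the longest one. A minimal pure-Python sketch of that behaviour (the helper `pad_post` is hypothetical):

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]

# the first two encoded tweets from the dataframe above
encoded = [[3737, 399, 657, 2301, 1922], [118, 2, 151, 514, 5707, 5708, 1056]]
for row in pad_post(encoded):
    print(row)
```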

In [None]:
# Train Test Split
TAKE_SIZE = int(0.7*train.shape[0])

train_data = full_ds.take(TAKE_SIZE).shuffle(TAKE_SIZE)
train_data = train_data.batch(32)

test_data = full_ds.skip(TAKE_SIZE)
test_data = test_data.batch(32)
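
The split above is positional: `take` keeps the first 70% of the rows for training and `skip` leaves the rest for validation. Below is a pure-Python sketch of the same logic on a toy list, with a shuffled variant as one possible alternative in case the rows happen to be ordered in the CSV (both helper names are hypothetical):

```python
import random

def positional_split(rows, train_fraction=0.7):
    """Mimic take/skip: the first rows go to training, the rest to validation."""
    take_size = int(train_fraction * len(rows))
    return rows[:take_size], rows[take_size:]

def shuffled_split(rows, train_fraction=0.7, seed=0):
    """Same split, but on a shuffled copy so both parts are drawn from all rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return positional_split(shuffled, train_fraction)

rows = list(range(10))
train_rows, val_rows = positional_split(rows)
print(train_rows, val_rows)

train_rows_s, val_rows_s = shuffled_split(rows)
print(train_rows_s, val_rows_s)
```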

In [None]:
for tweet, target in train_data.take(1):
  print(tweet, target)

tf.Tensor(
[[ 127 4871 4872   59 1365  135 1865 2229    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [9299   41 9300  230 9301 9302 4442    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [ 236  410  400  219  458   39  151 8790 2086   14  471  727  594  198
  8791    0    0    0    0    0    0    0    0    0    0]
 [ 241 2138 2763  223  934 1247 2637    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [ 262 3611   66 3612 2223 3613 3614    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [2947   56 1744  644 1176  630  529  481  100 4696    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [6053   62    8  776  238  250  482   12 6054    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [  22 1412  108 7414  714 3057 1008  240    0    0    0    0    0    0
     0 

## Model creation, training, and results observation

In [None]:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, GRU, SimpleRNN, LSTM
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
embedding_dim=64

model = Sequential([
  Embedding(vocabulary_size, embedding_dim, name="embedding", mask_zero=True, input_length = reviews_pad.shape[1]),
  GRU(units=64, return_sequences=True), # maintains the sequential nature
  GRU(units=32, return_sequences=False), # returns the last output
  Dense(1, activation='sigmoid')
])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 64)            640000    
                                                                 
 gru (GRU)                   (None, 25, 64)            24960     
                                                                 
 gru_1 (GRU)                 (None, 32)                9408      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 674,401
Trainable params: 674,401
Non-trainable params: 0
_________________________________________________________________
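
The parameter counts in the summary can be checked by hand. The sketch below recomputes them, assuming the GRU variant with two bias vectors per gate (Keras' `reset_after=True`, which should be the TensorFlow 2 default); the helper functions are illustrative and not part of Keras:

```python
def embedding_params(vocab_size, dim):
    # one vector of size `dim` per vocabulary index
    return vocab_size * dim

def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and two bias vectors
    # (the two biases assume the reset_after=True variant)
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # weights plus one bias per unit
    return input_dim * units + units

counts = [
    embedding_params(10000, 64),
    gru_params(64, 64),
    gru_params(64, 32),
    dense_params(32, 1),
]
print(counts, sum(counts))
```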


In [None]:
from tensorflow.keras.optimizers import Adam

In [None]:
opt = Adam(learning_rate = 0.00005)

model.compile(optimizer=opt,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['binary_accuracy'])

In [None]:
history = model.fit(train_data, validation_data = test_data, epochs = 10) # we could use weights

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
history.history

{'binary_accuracy': [0.5644402503967285,
  0.5753527879714966,
  0.594355583190918,
  0.7900282144546509,
  0.8498588800430298,
  0.8698024749755859,
  0.8867356777191162,
  0.8985888957977295,
  0.9100658297538757,
  0.9194731712341309],
 'loss': [0.6895737648010254,
  0.6816421151161194,
  0.6595494747161865,
  0.5154650807380676,
  0.3892464339733124,
  0.3363129794597626,
  0.2976192235946655,
  0.2667684257030487,
  0.23966412246227264,
  0.21576406061649323],
 'val_binary_accuracy': [0.5583845376968384,
  0.5583845376968384,
  0.6071115136146545,
  0.7392449378967285,
  0.738366961479187,
  0.7352941036224365,
  0.7317822575569153,
  0.7330992221832275,
  0.7221246957778931,
  0.7172958850860596],
 'val_loss': [0.6885481476783752,
  0.6841686367988586,
  0.657689094543457,
  0.5554649233818054,
  0.5449925661087036,
  0.5461546778678894,
  0.5603995323181152,
  0.576607346534729,
  0.599363386631012,
  0.6323287487030029]}
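
In the history above, training accuracy keeps rising to about 0.92 while validation loss bottoms out around epoch 5 and then climbs back, which suggests the model starts overfitting from there. A small sketch that locates the best epoch from the recorded values (numbers copied and rounded from the output; the helper `best_epoch` is hypothetical):

```python
# validation metrics copied (rounded) from history.history above
val_loss = [0.6885, 0.6842, 0.6577, 0.5555, 0.5450, 0.5462, 0.5604, 0.5766, 0.5994, 0.6323]
val_acc = [0.5584, 0.5584, 0.6071, 0.7392, 0.7384, 0.7353, 0.7318, 0.7331, 0.7221, 0.7173]

def best_epoch(values, mode):
    """Return the 1-based epoch where `values` is lowest ('min') or highest ('max')."""
    pick = min if mode == "min" else max
    return values.index(pick(values)) + 1

print("lowest val_loss at epoch", best_epoch(val_loss, "min"))
print("highest val_accuracy at epoch", best_epoch(val_acc, "max"))
```

One option would be to stop training around that point, for example with Keras' `EarlyStopping` callback monitoring `val_loss`.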

The model is trained. Still to do:
- apply the model to each row of the dataframe
- add a predictions column
- extract the false negatives and false positives
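
The last step boils down to comparing the thresholded probability with the true label. A minimal sketch, assuming a 0.5 threshold; the helper `error_type` is hypothetical and the (probability, target) pairs are illustrative:

```python
def error_type(probability, target, threshold=0.5):
    """Classify one prediction as correct, false positive or false negative."""
    predicted = 1 if probability > threshold else 0
    if predicted == target:
        return "correct"
    # predicted a disaster that was not one, or missed a real one
    return "false positive" if predicted == 1 else "false negative"

examples = [(0.898325, 1), (0.541776, 0), (0.369447, 1)]
for probability, target in examples:
    print(probability, target, error_type(probability, target))
```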

In [None]:
train.head()

Unnamed: 0,text,target,text_clean,review_encoded,len_review
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,"[3737, 399, 657, 2301, 1922]",5
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,"[118, 2, 151, 514, 5707, 5708, 1056]",7
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",11
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfire evacuation orde...,"[2302, 6, 2303, 930, 183, 280, 40]",7
4,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,"[169, 127, 5710, 3738, 162, 930, 2304, 97]",8


In [None]:
#model(np.array([train.review_encoded[0]]))
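def padding_25(list_1) :
  # pads the list with zeros up to length 25
  # note: append modifies list_1 in place, so the 'review_encoded' column ends up padded as well (visible in the output below)
  if len(list_1) == 25 :
    return np.array(list_1)
  else :
    while len(list_1) < 25 :
      list_1.append(0)
    return np.array(list_1)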




train["review_encoded_padded"] = train["review_encoded"].apply(lambda x : padding_25(x))

In [None]:
train.head()

Unnamed: 0,text,target,text_clean,review_encoded,len_review,review_encoded_padded
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",5,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,..."
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",7,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ..."
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",11,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3..."
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfire evacuation orde...,"[2302, 6, 2303, 930, 183, 280, 40, 0, 0, 0, 0,...",7,"[2302, 6, 2303, 930, 183, 280, 40, 0, 0, 0, 0,..."
4,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,"[169, 127, 5710, 3738, 162, 930, 2304, 97, 0, ...",8,"[169, 127, 5710, 3738, 162, 930, 2304, 97, 0, ..."


In [None]:
train['prediction'] = model(
    reviews_pad
).numpy()

In [None]:
train['prediction_rounded'] = train['prediction'].apply(lambda x : 1 if x > 0.5 else 0)

train.head()

Unnamed: 0,text,target,text_clean,review_encoded,len_review,review_encoded_padded,prediction,prediction_rounded
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",5,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",0.898325,1
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",7,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",0.995952,1
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",11,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",0.976372,1
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfire evacuation orde...,"[2302, 6, 2303, 930, 183, 280, 40, 0, 0, 0, 0,...",7,"[2302, 6, 2303, 930, 183, 280, 40, 0, 0, 0, 0,...",0.984936,1
4,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,"[169, 127, 5710, 3738, 162, 930, 2304, 97, 0, ...",8,"[169, 127, 5710, 3738, 162, 930, 2304, 97, 0, ...",0.811689,1


In [None]:
train['good_answers'] = (train['prediction_rounded'] == train['target'])

train.head(30)

Unnamed: 0,text,target,text_clean,review_encoded,len_review,review_encoded_padded,prediction,prediction_rounded,good_answers
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",5,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",0.898325,1,True
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",7,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",0.995952,1,True
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",11,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",0.976372,1,True
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive #wildfire evacuation orde...,"[2302, 6, 2303, 930, 183, 280, 40, 0, 0, 0, 0,...",7,"[2302, 6, 2303, 930, 183, 280, 40, 0, 0, 0, 0,...",0.984936,1,True
4,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,"[169, 127, 5710, 3738, 162, 930, 2304, 97, 0, ...",8,"[169, 127, 5710, 3738, 162, 930, 2304, 97, 0, ...",0.811689,1,True
5,#RockyFire Update => California Hwy. 20 closed...,1,#rockyfire update california hwy 20 close dire...,"[2305, 170, 40, 1232, 436, 361, 818, 819, 264,...",12,"[2305, 170, 40, 1232, 436, 361, 818, 819, 264,...",0.99845,1,True
6,#flood #disaster Heavy rain causes flash flood...,1,#flood #disaster heavy rain cause flash floodi...,"[1491, 1492, 658, 158, 46, 619, 379, 362, 5711...",12,"[1491, 1492, 658, 158, 46, 619, 379, 362, 5711...",0.999233,1,True
7,I'm on top of the hill and I can see a fire in...,1,hill fire wood,"[1057, 2, 1677, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,"[1057, 2, 1677, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.524506,1,True
8,There's an emergency evacuation happening now ...,1,s emergency evacuation happen building street,"[4, 14, 183, 171, 28, 362, 0, 0, 0, 0, 0, 0, 0...",6,"[4, 14, 183, 171, 28, 362, 0, 0, 0, 0, 0, 0, 0...",0.547367,1,True
9,I'm afraid that the tornado is coming to our a...,1,afraid tornado come area,"[1923, 310, 11, 198, 0, 0, 0, 0, 0, 0, 0, 0, 0...",4,"[1923, 310, 11, 198, 0, 0, 0, 0, 0, 0, 0, 0, 0...",0.594123,1,True


In [None]:
train.good_answers.value_counts()

True     6585
False    1008
Name: good_answers, dtype: int64

In [None]:
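# share of correct predictions (counts typed in by hand; they differ slightly from the value_counts output above)
6590/(6590+1003)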

0.8679046490188331

In [None]:
train[ train['good_answers'] == False ]

Unnamed: 0,text,target,text_clean,review_encoded,len_review,review_encoded_padded,prediction,prediction_rounded,good_answers
18,My car is so fast,0,car fast,"[41, 515, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,"[41, 515, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.541776,1,False
150,@mickinyman @TheAtlantic That or they might be...,0,mickinyman theatlantic kill airplane accident ...,"[5960, 5961, 8, 339, 60, 177, 41, 49, 2883, 29...",10,"[5960, 5961, 8, 339, 60, 177, 41, 49, 2883, 29...",0.876917,1,False
167,Statistically I'm at more of risk of getting k...,0,statistically risk kill cop die airplane accident,"[5994, 697, 8, 732, 83, 339, 60, 0, 0, 0, 0, 0...",7,"[5994, 697, 8, 732, 83, 339, 60, 0, 0, 0, 0, 0...",0.572237,1,False
195,when you don't know which way an ambulance is ...,1,know way ambulance come ltlt,"[25, 65, 250, 11, 3822, 0, 0, 0, 0, 0, 0, 0, 0...",5,"[25, 65, 250, 11, 3822, 0, 0, 0, 0, 0, 0, 0, 0...",0.369447,0,False
214,Annihilated Abs . ?? http://t.co/1xPw292tJe,1,annihilate ab httptco1xpw292tje,"[342, 3829, 6079, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...",3,"[342, 3829, 6079, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...",0.480348,0,False
...,...,...,...,...,...,...,...,...,...
7584,These boxes are ready to explode! Exploding Ki...,0,box ready explode explode kitten finally arriv...,"[945, 699, 85, 85, 3156, 535, 752, 0, 0, 0, 0,...",7,"[945, 699, 85, 85, 3156, 535, 752, 0, 0, 0, 0,...",0.521999,1,False
7586,#Sismo DETECTADO #JapÌ_n 15:41:07 Seismic inte...,1,#sismo detectado #japìn 154107 seismic intensi...,"[1221, 2820, 2821, 494, 1845, 1026, 5497, 2822...",8,"[1221, 2820, 2821, 494, 1845, 1026, 5497, 2822...",0.247449,0,False
7592,An IS group suicide bomber detonated an explos...,1,group suicide bomber detonate explosivespacke ...,"[288, 37, 132, 116, 3710, 314, 417, 287, 1090,...",14,"[288, 37, 132, 116, 3710, 314, 417, 287, 1090,...",0.278345,0,False
7598,Father-of-three Lost Control of Car After Over...,1,fatherofthree lose control car overtake collid...,"[266, 542, 41, 82, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,"[266, 542, 41, 82, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.489022,0,False


Here we can see the tweets that the model misclassified. Since predictions were computed on the whole dataframe, this includes tweets seen during training.

In [None]:
train[ train['good_answers'] == False ]['target'].value_counts()

1    617
0    391
Name: target, dtype: int64

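Misclassifications occur in both classes, but not evenly: 617 disaster tweets were missed (false negatives) against 391 non-disaster tweets flagged as disasters (false positives).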
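
Using the counts printed above, a quick sketch of how the errors split between the two classes:

```python
# misclassified tweets per true class, copied from the value_counts output above
errors_by_target = {1: 617, 0: 391}

total_errors = sum(errors_by_target.values())
false_negative_share = errors_by_target[1] / total_errors  # disaster tweets predicted as 0
false_positive_share = errors_by_target[0] / total_errors  # other tweets predicted as 1

print(total_errors, round(false_negative_share, 3), round(false_positive_share, 3))
```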

Let's look more closely at the misclassified tweets to try to understand why.

In [None]:
train.head(3)

Unnamed: 0,text,target,text_clean,review_encoded,len_review,review_encoded_padded,prediction,prediction_rounded,good_answers
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",5,"[3737, 399, 657, 2301, 1922, 0, 0, 0, 0, 0, 0,...",0.898325,1,True
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",7,"[118, 2, 151, 514, 5707, 5708, 1056, 0, 0, 0, ...",0.995952,1,True
2,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",11,"[1354, 435, 1676, 309, 5709, 279, 183, 1676, 3...",0.976372,1,True


In [None]:
df_reduced = train[train['good_answers'] == False]
df_reduced = df_reduced[['text', 'text_clean', 'prediction', 'target']]

In [None]:
df_reduced.head(15)

Unnamed: 0,text,text_clean,prediction,target
18,My car is so fast,car fast,0.541776,0
150,@mickinyman @TheAtlantic That or they might be...,mickinyman theatlantic kill airplane accident ...,0.876917,0
167,Statistically I'm at more of risk of getting k...,statistically risk kill cop die airplane accident,0.572237,0
195,when you don't know which way an ambulance is ...,know way ambulance come ltlt,0.369447,1
214,Annihilated Abs . ?? http://t.co/1xPw292tJe,annihilate ab httptco1xpw292tje,0.480348,1
229,Ready to get annihilated for the BUCS game,ready annihilate bucs game,0.437179,1
247,annihilating quarterstaff of annihilation,annihilate quarterstaff annihilation,0.440427,1
251,U.S National Park Services Tonto National Fore...,national park services tonto national forest s...,0.419535,1
269,World Annihilation vs Self Transformation http...,world annihilation vs self transformation http...,0.356114,1
271,U.S National Park Services Tonto National Fore...,national park services tonto national forest s...,0.44684,1


In [None]:
for i in range(50):
  print(df_reduced['text'].iloc[i])
  print(df_reduced['text_clean'].iloc[i])
  print()

My car is so fast
car fast

@mickinyman @TheAtlantic That or they might be killed in an airplane accident in the night a car wreck! Politics at it's best.
mickinyman theatlantic kill airplane accident night car wreck politic good

Statistically I'm at more of risk of getting killed by a cop than I am of dying in an airplane accident.
statistically risk kill cop die airplane accident

when you don't know which way an ambulance is coming from &lt;&lt;
know way ambulance come ltlt

Annihilated Abs . ?? http://t.co/1xPw292tJe
annihilate ab httptco1xpw292tje

Ready to get annihilated for the BUCS game
ready annihilate bucs game

annihilating quarterstaff of annihilation
annihilate quarterstaff annihilation

U.S National Park Services Tonto National Forest: Stop the Annihilation of the Salt River Wild Horse... https://t.co/sW1sBua3mN via @Change
national park services tonto national forest stop annihilation salt river wild horse httpstcosw1sbua3mn change

World Annihilation vs Self Transform

Still to do in the pre-processing:

remove all links (strings starting with http) and words starting with @

## New pre-processing + model training

We want to:
- delete mentions (words beginning with @)
- delete hyperlinks (words beginning with http)
- treat # differently: we want the model to take into account both the word and the hashtag. So when we have a "#word1", we replace it with "word1" at its position in the sentence and add "#word1" at the end of the sentence.
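
The three steps above can be sketched with regular expressions on a raw tweet. This is a minimal illustration under simplifying assumptions (mentions and links are whitespace-delimited, hashtags are made of word characters); the function `preprocess_tweet` and the first example tweet are hypothetical:

```python
import re

def preprocess_tweet(text):
    """Drop mentions and links, keep hashtag words in place and repeat the hashtags at the end."""
    text = re.sub(r"@\S+", " ", text)         # drop mentions
    text = re.sub(r"http\S+", " ", text)      # drop hyperlinks
    hashtags = re.findall(r"#\w+", text)      # remember the hashtags
    text = text.replace("#", "")              # keep the bare word at its position
    text = re.sub(r"\s+", " ", text).strip()  # tidy up the spaces left behind
    return " ".join([text] + hashtags)

print(preprocess_tweet("@user Forest fire near #LaRonge http://t.co/abc"))
# prints: Forest fire near LaRonge #LaRonge

print(preprocess_tweet("deed reason #earthquake allah forgive"))
# prints: deed reason earthquake allah forgive #earthquake
```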

In [None]:
# we delete mentions (words beginning with @):

def delete_arobase(sentence) :
  # remove every '@' together with the non-space characters that follow it
  # (uses the re module imported earlier)
  return re.sub(r"@\S*", "", sentence)

In [None]:
train['text_clean_2'] = train['text'].apply(lambda x : delete_arobase(x))
train['text_clean_2'] = train['text_clean_2'].apply(lambda x : ''.join(ch for ch in x if ch.isalnum() # keep only alphanumeric characters (mostly)
                                                                            or ch==" " # keep spaces, needed for tokenizing
                                                                            or ch=="#")) # hashtags matter on Twitter, so we keep '#' here

# collapse multiple spaces into one, lowercase the text and strip leading/trailing spaces:
train['text_clean_2'] = train['text_clean_2'].apply(lambda x : re.sub(' +', ' ', x).lower().strip())

# lemmatize words, remove stop words and remove hyperlinks
train['text_clean_2'] = train['text_clean_2'].apply(lambda x : " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) and (token.text not in STOP_WORDS) and ("http" not in token.lemma_)]))
# stick '#' back to its word after lemmatization
train['text_clean_2'] = train['text_clean_2'].apply(lambda x : x.replace("# ", "#"))


In [None]:
# we want to treat hashtags differently: keep the word alone at its position and add the hashtag at the end of the sentence. That way we keep the meaning of the word and add the power of the hashtag,
# which tells us a lot about the context
def manage_hashtags(sentence) :
  # collect every hashtag ('#' followed by non-space characters)
  words_with_hashtags = re.findall(r"#\S+", sentence)
  new_sentence = sentence

  # replace each hashtag by the bare word at its position
  for word in words_with_hashtags :
    word_without_hashtag = word.replace("#", "")
    new_sentence = new_sentence.replace(word, word_without_hashtag)

  # append the hashtags at the end of the sentence
  for word in words_with_hashtags :
    new_sentence = new_sentence + " " + word

  return new_sentence


In [None]:
train['text_clean_2'] = train['text_clean_2'].apply(lambda x : manage_hashtags(x))

In [None]:
for i in range(60):
  print(train['text'].iloc[i])
  print(train['text_clean'].iloc[i])
  print(train['text_clean_2'].iloc[i])
  print()

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
deed reason #earthquake allah forgive
deed reason earthquake allah forgive #earthquake

Forest fire near La Ronge Sask. Canada
forest fire near la ronge sask canada
forest fire near la ronge sask canada

All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
resident ask shelter place notify officer evacuation shelter place order expect
resident ask shelter place notify officer evacuation shelter place order expect

13,000 people receive #wildfires evacuation orders in California 
13000 people receive #wildfire evacuation order california
13000 people receive wildfire evacuation order california #wildfire

Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
send photo ruby #alaska smoke #wildfire pour school
send photo ruby alaska smoke wildfire pour school #alaska #wildfire

#RockyFire Update => Califo

In [None]:
vocabulary_size = 10000

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocabulary_size,  filters='!"$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n') # instantiate the tokenizer
# we specify filters here because the default Tokenizer filters would strip '#'
tokenizer.fit_on_texts(train["text_clean_2"])
train["review_encoded_2"] = tokenizer.texts_to_sequences(train["text_clean_2"])

train["len_review_2"] = train["review_encoded_2"].apply(lambda x: len(x))

train = train[train["len_review_2"]!=0]

In [78]:
reviews_pad_2 = tf.keras.preprocessing.sequence.pad_sequences(train.review_encoded_2, padding="post")

In [79]:
full_ds_2 = tf.data.Dataset.from_tensor_slices((reviews_pad_2, train.target.values))

In [80]:
# Train Test Split
TAKE_SIZE = int(0.7*train.shape[0])

train_data = full_ds_2.take(TAKE_SIZE).shuffle(TAKE_SIZE)
train_data = train_data.batch(32)

test_data = full_ds_2.skip(TAKE_SIZE)
test_data = test_data.batch(32)

In [97]:
embedding_dim=64

model_2 = Sequential([
  Embedding(vocabulary_size, embedding_dim, name="embedding", mask_zero=True, input_length = reviews_pad_2.shape[1]), # this model is trained on reviews_pad_2
  GRU(units=64, return_sequences=True), # maintains the sequential nature
  GRU(units=32, return_sequences=False), # returns the last output
  Dense(1, activation='sigmoid')
])

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 64)            640000    
                                                                 
 gru (GRU)                   (None, 25, 64)            24960     
                                                                 
 gru_1 (GRU)                 (None, 32)                9408      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 674,401
Trainable params: 674,401
Non-trainable params: 0
_________________________________________________________________


In [98]:
opt = Adam(learning_rate = 0.00005)

model_2.compile(optimizer=opt,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['binary_accuracy'])

In [99]:
history_2 = model_2.fit(train_data, validation_data = test_data, epochs = 10) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


This is not significantly better than our previous attempt. The score is slightly higher, but it is hard to tell whether the difference is meaningful.

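As we saw that disaster tweets seem longer than the others, let's see whether setting mask_zero=False makes a difference. With masking disabled, the GRU also processes the padding zeros, so the amount of padding (and hence the tweet length) becomes visible to the model: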
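
For reference, `mask_zero=True` tells Keras to treat index 0 as padding so that downstream recurrent layers skip those steps. A minimal pure-Python sketch of the mask this corresponds to (the helper `padding_mask` is hypothetical, not a Keras function):

```python
def padding_mask(sequence, pad_value=0):
    """True for real tokens, False for padding."""
    return [token != pad_value for token in sequence]

# an encoded tweet followed by three padding zeros
padded = [3737, 399, 657, 2301, 1922, 0, 0, 0]
mask = padding_mask(padded)

print(mask)
print(sum(mask), "real steps out of", len(padded))
```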

In [102]:
embedding_dim=64

model_3 = Sequential([
  Embedding(vocabulary_size, embedding_dim, name="embedding", mask_zero=False, input_length = reviews_pad_2.shape[1]), # this model is trained on reviews_pad_2
  GRU(units=64, return_sequences=True), # maintains the sequential nature
  GRU(units=32, return_sequences=False), # returns the last output
  Dense(1, activation='sigmoid')
])
opt = Adam(learning_rate = 0.00007) # we increase the learning rate a little

model_3.compile(optimizer=opt,
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['binary_accuracy'])

history_3 = model_3.fit(train_data, validation_data = test_data, epochs = 10) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


It seems a little better! Note that the learning rate changed as well, so the gain cannot be attributed to mask_zero alone.

We could dig deeper to see what could be improved; we will if we get some time :)