#  Tweet classification

* L'objectif est d'entrainer un modele afin qu'elle puisse determiner si un tweet est sur une catastrophe ou pas

* telecharger la base de donnees depuis kaggle dans lien suivant :
https://www.kaggle.com/c/nlp-getting-started

In [1]:
#pip install panda

In [2]:
# import the necassary libariries :
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import os

In [3]:
# load the data : 
data = pd.read_csv('train.csv')
print(data.shape)

# Visualisation de l'entete de la base de donnees :
data.head()

(7613, 5)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
# Calcul combien de disaster et non disaster on a :
print("Disaster = ", (data.target == 1).sum())
print("Non disaster = ", (data.target == 0).sum())

Disaster =  3271
Non disaster =  4342


### Preprocessing : 

###### 1 - clean the data

In [5]:
import re
import string

# Enlever les URLs :
def remove_URL(text):
    url = re.compile(r"https?://\S+|www\.\S+")
    return url.sub(r"", text)

# Enlever les ponctuations :
def remove_punct(text):
    translator = str.maketrans("", "", string.punctuation) # string.punctuation contient toutes les ponstuations
    return text.translate(translator)

# Enlever les stopwords = les mots de connexion comme 'then', 'around', 'first' .....
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words("english"))
def remove_stopwords(text):
    mots_filtrer = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(mots_filtrer)


# Enlever ces trois inutiles mots depuis les textes de notre base de donnees :
data['text'] = data.text.map(remove_URL)
data['text'] = data.text.map(remove_punct)
data['text'] = data.text.map(remove_stopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
print(data['text'].shape)
data.text

(7613,)


0            deeds reason earthquake may allah forgive us
1                   forest fire near la ronge sask canada
2       residents asked shelter place notified officer...
3       13000 people receive wildfires evacuation orde...
4       got sent photo ruby alaska smoke wildfires pou...
                              ...                        
7608    two giant cranes holding bridge collapse nearb...
7609    ariaahrary thetawniest control wild fires cali...
7610                      m194 0104 utc5km volcano hawaii
7611    police investigating ebike collided car little...
7612    latest homes razed northern california wildfir...
Name: text, Length: 7613, dtype: object

##### 2 - Bags of word

In [7]:
from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(data['text'])
counter = t.word_index
nbr_mots = len(counter) # nombre des mots uniques

print(nbr_mots)
print(counter)

17971


#### 3 - Split the data into training and validation : 

In [8]:
# on veut 80% pour l'entrainement :
train_size = int((data['text'].shape[0])*0.8)

train = data[:train_size]
test = data[train_size:]
print("train size : ", len(train))
print("test size : ", len(test))

# define train and test data
train_sentences = train["text"]
train_target = train["target"]

test_sentences = test["text"]
test_target = test["target"]

# les convetir en numpy
train_sentences = np.array(train_sentences)
train_target = np.array(train_target)
test_sentences = np.array(test_sentences)
test_target = np.array(test_target)

train size :  6090
test size :  1523


#### 4 - Tokenize : 

c'est comme le principe de one hot :
* tout d'abord, tokenize permet de donner a chaque mot du vocabulaire un indice unique
* Ensuite, elle permet de convetir un sentence (texte) en une liste d'entier (sequences) qui correspond aux indices des mots contenus dans ce texte : "HEllo World oussama" => [ 5, 6, 9] avec 5, 6, 9 les indices des hello world oussama parmi tout les mots du vocabulaire

In [9]:
# Entrainer le tokenizer :
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words = nbr_mots) # nbr_mots est le nombre des mots unique du vocabualire = toutes les textes
tokenizer.fit_on_texts(train_sentences)

# Visualiser les indices des mots :
word_indices = tokenizer.word_index
print("Les indices des mots : ", word_indices)


# convetir les textes de la data en des listes d'entiers (qui correspondent aux indices des mots)
train_seq = tokenizer.texts_to_sequences(train_sentences)
test_seq = tokenizer.texts_to_sequences(test_sentences)



In [10]:
# exemple d'une sequence :
print(train_sentences[0])
print(train_seq[0])

deeds reason earthquake may allah forgive us
[3739, 696, 235, 41, 1282, 3740, 14]


#### 5 - Padding pour avoir la meme taille chez toutes les sequences  

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Definir la taille maximale que vous voulez pour vos mots :
max_length = 20

# Appliquer ceci sur les sequnces :
train_seq = pad_sequences(train_seq, maxlen=max_length, padding="post", truncating="post")
test_seq = pad_sequences(test_seq, maxlen=max_length, padding="post", truncating="post")
print(train_seq.shape)
print(test_seq.shape)
print(train_seq[0])

(6090, 20)
(1523, 20)
[3739  696  235   41 1282 3740   14    0    0    0    0    0    0    0
    0    0    0    0    0    0]


### decodage : Si on veut avoir le texte en se basant sur la liste des indices  

In [12]:
# On inverse le dictionnaire de bag of words afin d'avoir la cle = l'indice et la valeur = le mot
reverse_counter = dict([(idx, word) for (word, idx) in counter.items()])

# definition d'une fonction pour convetir un ensemble d'indice en des mots
def decode(sequence):
    return " ".join( [ reverse_counter.get(idx, '?') for idx in sequence])

# exemple : conversion de la sequence d'indice 10 
text_decoder = decode([indice_text for indice_text in train_seq[10] if indice_text!=0])
print(train_seq[10])
print(text_decoder)

[520   8 395 156 297 411   0   0   0   0   0   0   0   0   0   0   0   0
   0   0]
person people bioterror next phone thank


### Classification

#### 1 - create LSTM model:

In [14]:
model = keras.Sequential()
model.add(keras.layers.Embedding(input_dim=nbr_mots, output_dim=32, input_length=max_length))
# nbr_mots = nbre de mots unique dans le vocabulaire, max_length = taille de chaque sequence = 20
model.add(keras.layers.LSTM(64, dropout=0.1))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(1, activation='sigmoid'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 32)            575072    
                                                                 
 lstm_1 (LSTM)               (None, 64)                24832     
                                                                 
 flatten_1 (Flatten)         (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 599,969
Trainable params: 599,969
Non-trainable params: 0
_________________________________________________________________
None


#### 2 - Compile the model : definit l'optimiseur et loss

In [15]:
# Defintion des hyperparametres :
nbr_epochs = 20
lr = 0.001
batch_size = 32

# Defintion d'optimiseur
optim = keras.optimizers.Adam(learning_rate=lr)

# Defintion du loss
loss = keras.losses.BinaryCrossentropy(from_logits=False) 
# from_logits=False prq on a deja definit la fonction d'activation dans la derniere couche du modele

# Compiler le model
model.compile(optimizer=optim, loss=loss, metrics=['accuracy'])

#### 3 - Entrainer le modele :

In [16]:
model.fit(train_seq, train_target, epochs=nbr_epochs, validation_data=(test_seq, test_target))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x13ccb369400>

In [17]:
# D'apres les resultats d'entrainement, on remarque un overfitting

#### 4 -  Pour l'evaluation du modele :

In [18]:
model.evaluate(test_seq, test_target)



[1.4328607320785522, 0.7406434416770935]

##### 5 -  Prediction 

In [19]:
# On fait l'hypothese qu'on sait pas les targets de test data et on veut donner une prediction
predictions = model.predict(test_seq) # ceci donne des valeurs de proba entre 0 et 1
predictions = [ 1 if p> 0.5 else 0 for p in predictions]

# visualisation de quelque predictions :
print("tweet = ", decode([indice_text for indice_text in train_seq[100] if indice_text!=0]))
print("prediction", "Disaster" if predictions[100]==1 else "Not Disaster")
print("truth", "Disaster" if test_target[100]==1 else "Not Disaster")

tweet =  duck chick burning tells bioterror around see collapse thunderstorm
prediction Not Disaster
truth Not Disaster
