# GloVe & CNN

GloVe works in a similar way to Word2Vec. Whereas, as seen above, Word2Vec is a "predictive" model that predicts a word from its context, GloVe learns by building a co-occurrence matrix (words × contexts) that essentially counts how often a word appears in a given context. Because this matrix is huge, it is factorized to obtain a lower-dimensional representation. GloVe involves many more details, but that is the rough idea.
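
To make the idea concrete, here is a minimal sketch of "count co-occurrences, then factorize". The toy corpus, the window size and the number of components `k` are assumptions made for illustration. The factorization uses a truncated SVD of the log counts as a stand-in: GloVe itself fits the vectors with a weighted least-squares objective on the log co-occurrence counts, which is not reproduced here.

```python
import numpy as np

# Toy corpus (hypothetical, not taken from the tweet dataset).
corpus = [
    "forest fire near town",
    "wild fire near forest",
    "flood near town",
]
window = 2  # assumed context window: words on each side

vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears within `window` words of another word.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[w], index[words[j]]] += 1

# Factorize log(1 + counts) and keep the top-k components as word vectors.
k = 2
u, s, _ = np.linalg.svd(np.log1p(cooc))
embeddings = u[:, :k] * s[:k]

print(embeddings.shape)  # one k-dimensional vector per vocabulary word
```

Each row of `embeddings` is a dense vector for one word; words that share contexts tend to end up with similar rows.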

In [83]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding, Activation, Dropout
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
from tensorflow.keras.optimizers import Adam

from sklearn.model_selection import train_test_split
import numpy as np
from numpy import array
import pandas as pd

In [47]:
from NLP_Natural_Disasters.data import get_data, clean_data

In [48]:
train_df = get_data()
cleaned_df = clean_data(train_df)

In [49]:
cleaned_df

Unnamed: 0,id,text,target
0,1,deed reason earthquake may allah forgive u,1
1,4,forest fire near la ronge sask canada,1
2,5,resident asked shelter place notified officer ...,1
3,6,people receive wildfire evacuation order calif...,1
4,7,got sent photo ruby alaska smoke wildfire pour...,1
...,...,...,...
7608,10869,two giant crane holding bridge collapse nearby...,1
7609,10870,ariaahrary thetawniest control wild fire calif...,1
7610,10871,utckm volcano hawaii,1
7611,10872,police investigating ebike collided car little...,1


## Data Cleaning to add eventually 

The goal is to drop rare words that appear only once, such as misspelled words, contractions, etc.

In [50]:
#def transform_to_list_word(serie):
#    split_list=[]  
#    for sentence in serie.str.split():
#        for word in sentence:
#            split_list.append(word)
#    return split_list

In [51]:
#list_word = transform_to_list_word(cleaned_df['text'])
#len(list_word)

69795

**FOR DATA CLEANING**

In [52]:
twitts = ' '.join(cleaned_df['text'])
twitts = twitts.split()
freq_comm = pd.Series(twitts).value_counts()
rare = freq_comm[freq_comm.values == 1]
rare

peterhowenecn      1
cadusd             1
wiwnpfxa           1
wlandslide         1
ofclans            1
                  ..
selfdestruction    1
takecare           1
cinla              1
windowgatribble    1
symptom            1
Length: 9896, dtype: int64

In [53]:
#list_word_new = [word for word in list_word if word not in rare]
#len(list_word_new)

59899

In [54]:
#cleaned_df



**FOR DATA CLEANING**

In [55]:
%%time
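def get_clean_text(sentence):
    # Drop the words that occur only once in the corpus.
    # `word not in rare` tests the index labels of the Series, i.e. the words themselves.
    if isinstance(sentence, str):
        return " ".join(word for word in sentence.split() if word not in rare)
    return None  # non-string values (e.g. NaN) are mapped to None

cleaned_df['tweet'] = cleaned_df['text'].apply(get_clean_text)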

CPU times: user 84.3 ms, sys: 2.92 ms, total: 87.2 ms
Wall time: 87.1 ms
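
The same filtering can be sketched with the standard library only, which may help when testing the logic outside pandas. The toy tweets and the names `rare_words` and `drop_rare` are hypothetical. Returning an empty string for non-string input is a choice of this sketch and differs from `get_clean_text`, which returns `None` in that case.

```python
from collections import Counter

# Hypothetical toy tweets, already lowercased and stripped of punctuation.
tweets = [
    "forest fire near la ronge sask canada",
    "forest fire near canada",
    "la fire",
]

# Count every word over the whole corpus and keep those seen exactly once.
counts = Counter(word for tweet in tweets for word in tweet.split())
rare_words = {word for word, n in counts.items() if n == 1}

def drop_rare(sentence, rare_words):
    """Return the sentence without rare words; non-strings become ''."""
    if not isinstance(sentence, str):
        return ""
    return " ".join(w for w in sentence.split() if w not in rare_words)

cleaned = [drop_rare(t, rare_words) for t in tweets]
print(cleaned)
```

Storing the rare words in a `set` keeps each membership test constant-time regardless of vocabulary size.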


## Preprocessing

In [11]:
cleaned_df

Unnamed: 0,id,text,target,tweet
0,1,deed reason earthquake may allah forgive u,1,deed reason earthquake may allah forgive u
1,4,forest fire near la ronge sask canada,1,forest fire near la canada
2,5,resident asked shelter place notified officer ...,1,resident asked shelter place officer evacuatio...
3,6,people receive wildfire evacuation order calif...,1,people receive wildfire evacuation order calif...
4,7,got sent photo ruby alaska smoke wildfire pour...,1,got sent photo alaska smoke wildfire school
...,...,...,...,...
7608,10869,two giant crane holding bridge collapse nearby...,1,two giant crane holding bridge collapse nearby...
7609,10870,ariaahrary thetawniest control wild fire calif...,1,ariaahrary thetawniest control wild fire calif...
7610,10871,utckm volcano hawaii,1,utckm volcano hawaii
7611,10872,police investigating ebike collided car little...,1,police investigating ebike collided car little...


### Token

In [56]:
cleaned_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [124]:
text = cleaned_df['tweet'].tolist()
text

['deed reason earthquake may allah forgive u',
 'forest fire near la canada',
 'resident asked shelter place officer evacuation shelter place order expected',
 'people receive wildfire evacuation order california',
 'got sent photo alaska smoke wildfire school',
 'rockyfire update california hwy closed direction due lake county fire cafire wildfire',
 'flood disaster heavy rain cause flash flooding street colorado spring area',
 'im top hill see fire wood',
 'there emergency evacuation happening building across street',
 'im afraid tornado coming area',
 'three people died heat wave far',
 'haha south tampa getting flooded hah wait second live south tampa gon na gon na flooding',
 'raining flooding florida tampa day ive lost count',
 'flood bago myanmar arrived bago',
 'damage school bus multi car crash breaking',
 'whats man',
 'love fruit',
 'summer lovely',
 'car fast',
 '',
 'ridiculous',
 'london cool',
 'love',
 'wonderful day',
 '',
 'cant eat shit',
 'nyc last week',
 'love gir

In [58]:
y = cleaned_df['target']

In [59]:
token = Tokenizer()
token.fit_on_texts(text)

In [60]:
vocab_size = len(token.word_index) + 1
vocab_size

5912
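
The `+ 1` is there because the Keras `Tokenizer` numbers words from 1 (most frequent first) and leaves index 0 free, which is the value used later for padding. The sketch below imitates that numbering in plain Python on hypothetical toy texts; it is an illustration of the convention, not the Keras implementation.

```python
from collections import Counter

# Hypothetical toy texts.
texts = ["fire near forest", "fire in forest", "fire"]

counts = Counter(w for t in texts for w in t.split())

# The most frequent word gets index 1; index 0 is never assigned to a word.
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

# One extra row so that an embedding matrix can be indexed by 0 (padding) as well.
vocab_size = len(word_index) + 1

print(word_index, vocab_size)
```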

In [61]:
dict_token_tweet = token.index_word
dict_token_tweet

{1: 'fire',
 2: 'like',
 3: 'amp',
 4: 'im',
 5: 'get',
 6: 'u',
 7: 'new',
 8: 'via',
 9: 'one',
 10: 'people',
 11: 'news',
 12: 'dont',
 13: 'time',
 14: 'video',
 15: 'emergency',
 16: 'disaster',
 17: 'year',
 18: 'body',
 19: 'day',
 20: 'building',
 21: 'police',
 22: 'home',
 23: 'family',
 24: 'would',
 25: 'still',
 26: 'say',
 27: 'life',
 28: 'go',
 29: 'crash',
 30: 'storm',
 31: 'got',
 32: 'california',
 33: 'back',
 34: 'look',
 35: 'burning',
 36: 'know',
 37: 'bomb',
 38: 'suicide',
 39: 'world',
 40: 'train',
 41: 'flood',
 42: 'see',
 43: 'car',
 44: 'man',
 45: 'death',
 46: 'attack',
 47: 'rt',
 48: 'first',
 49: 'love',
 50: 'pm',
 51: 'going',
 52: 'cant',
 53: 'nuclear',
 54: 'make',
 55: 'two',
 56: 'today',
 57: 'war',
 58: 'youtube',
 59: 'dead',
 60: 'killed',
 61: 'accident',
 62: 'want',
 63: 'need',
 64: 'let',
 65: 'full',
 66: 'woman',
 67: 'hiroshima',
 68: 'think',
 69: 'may',
 70: 'take',
 71: 'weapon',
 72: 'good',
 73: 'watch',
 74: 'way',
 75: 'm

In [123]:
encoded_text = token.texts_to_sequences(text)
encoded_text

[[3994, 452, 156, 69, 1399, 3995, 6],
 [107, 1, 149, 504, 1067],
 [1529, 1400, 1877, 453, 319, 162, 1877, 453, 362, 956],
 [10, 3996, 76, 162, 362, 32],
 [31, 1068, 111, 1690, 188, 76, 97],
 [2542, 190, 32, 1300, 732, 957, 454, 898, 302, 1, 3997, 76],
 [41, 16, 733, 163, 124, 701, 180, 431, 899, 809, 191],
 [4, 141, 1150, 42, 1, 1878],
 [218, 15, 162, 1069, 20, 734, 431],
 [4, 2153, 303, 164, 191],
 [505, 10, 525, 219, 135, 526],
 [735,
  545,
  2543,
  144,
  2544,
  3100,
  527,
  363,
  127,
  545,
  2543,
  199,
  82,
  199,
  82,
  180],
 [2545, 180, 1691, 2543, 19, 233, 637, 3101],
 [41, 3998, 766, 1530, 3998],
 [136, 97, 192, 3999, 43, 29, 260],
 [506, 44],
 [49, 1531],
 [234, 1532],
 [43, 664],
 [],
 [2546],
 [900, 432],
 [49],
 [2154, 19],
 [],
 [52, 1692, 145],
 [1151, 77, 235],
 [49, 1879],
 [],
 [2, 4000],
 [193],
 [2547, 288, 528],
 [209, 665, 850, 733, 1070, 47],
 [260, 589, 165, 528, 1009],
 [810, 165, 528],
 [1533, 590, 34, 736, 77, 166, 528],
 [4001, 2155, 1880, 114, 3

In [125]:
max_length = 120  # TODO: replace with the maximum number of words per tweet in the dataset
X = pad_sequences(encoded_text, maxlen=max_length, padding='post')
X

array([[3994,  452,  156, ...,    0,    0,    0],
       [ 107,    1,  149, ...,    0,    0,    0],
       [1529, 1400, 1877, ...,    0,    0,    0],
       ...,
       [3734,  446, 1388, ...,    0,    0,    0],
       [  21,  991, 2826, ...,    0,    0,    0],
       [ 131,   22,  451, ...,    0,    0,    0]], dtype=int32)

In [81]:
X.shape

(7613, 120)
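
Instead of hard-coding 120, the maximum length could be read from the encoded sequences themselves. The sketch below does that on a few hypothetical sequences and reproduces post-padding with NumPy only, assuming no sequence is longer than the chosen length. The names `encoded`, `max_len` and `padded` are illustrative.

```python
import numpy as np

# Hypothetical encoded tweets; the empty list stands for a tweet left empty by cleaning.
encoded = [[3994, 452, 156], [107, 1], [], [10, 3996, 76, 162]]

# Longest sequence in the data.
max_len = max(len(seq) for seq in encoded)

# Post-padding: copy each sequence to the start of a zero-filled row.
padded = np.zeros((len(encoded), max_len), dtype="int32")
for row, seq in enumerate(encoded):
    padded[row, :len(seq)] = seq

print(max_len)
print(padded)
```

Empty tweets become rows made entirely of zeros, so it may be worth filtering them out before training.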

### GloVe Vectors

In [65]:
glove_vectors = dict()

In [66]:
%%time
with open('../glove/glove.twitter.27B.200d.txt', encoding='utf-8') as file:
    for line in file:
        values = line.split()
        word = values[0]
        # Parse the coefficients as floats rather than keeping them as strings.
        vectors = np.asarray(values[1:], dtype='float32')
        glove_vectors[word] = vectors

CPU times: user 34.2 s, sys: 2.34 s, total: 36.6 s
Wall time: 38.2 s


In [67]:
len(glove_vectors.keys())

1193514

In [70]:
glove_vectors.get('random').shape

(200,)

In [73]:
word_vector_matrix = np.zeros((vocab_size,200))
to_delete = []

for word, index in token.word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:
        word_vector_matrix[index] = vector
    else :
        to_delete.append(word)

In [174]:
to_delete

['\x89û',
 'legionnaire',
 '\x89ûò',
 'bioterror',
 'prebreak',
 're\x89û',
 '\x89ûó',
 'typhoondevastated',
 'bioterrorism',
 'bestnaijamade',
 'soudelor',
 'disea',
 'reddits',
 'funtenna',
 'don\x89ûªt',
 'udhampur',
 'sensorsenso',
 '\x89ûïwhen',
 'selfimage',
 'spos',
 'irandeal',
 'rea\x89û',
 'it\x89ûªs',
 'inundation',
 'mediterran',
 'icemoon',
 'djicemoon',
 'ices\x89û',
 'microlight',
 'mhtwfnet',
 'rì©union',
 'linkury',
 'canaanite',
 'animalrescue',
 'china\x89ûªs',
 'you\x89ûªve',
 'can\x89ûªt',
 'let\x89ûªs',
 'chicagoarea',
 'read\x89û',
 'mikeparractor',
 'wheavenly',
 'standuser',
 'i\x89ûªm',
 'prophetmuhammad',
 'by\x89û',
 'sinjar',
 'meatloving',
 'be\x89û',
 'viralspell',
 'gtgtgt',
 '\x89û÷politics',
 'grief\x89ûª',
 'usagov',
 'collisionno',
 'summerfate',
 'here\x89ûªs',
 'sittwe',
 'strategicpatience',
 'of\x89û',
 'explosionproof',
 'socialnews',
 'america\x89ûªs',
 'injuryi',
 'youngheroesid',
 'pantherattack',
 'nasahurricane',
 'naved',
 'twia',
 'kerric

### TF2.0 and Keras Model Building

In [84]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size= 0.2, stratify = y)
vec_size = 200

In [86]:
model = Sequential()

# TODO: add EarlyStopping and increase the number of epochs
model.add(Embedding(vocab_size, vec_size, input_length=max_length, weights = [word_vector_matrix], trainable = False))

model.add(Conv1D(64, 8, activation ='relu'))
model.add(MaxPooling1D(2))
model.add(Dropout(0.5))

model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(16, activation='relu'))

model.add(GlobalMaxPooling1D())

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=Adam(learning_rate = 0.0001), loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(X_train, y_train, epochs = 30, validation_data = (X_test, y_test))

Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x1c9eaaac0>
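
To follow what each layer does to the data, the output shapes (batch dimension omitted) can be worked out by hand. The sketch below assumes the Keras defaults used in the model above: `Conv1D` with `'valid'` padding and stride 1, and `MaxPooling1D` with a stride equal to its pool size. The helper names are hypothetical.

```python
def conv1d_valid_len(n, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (n - kernel_size) // stride + 1

def maxpool1d_valid_len(n, pool_size):
    """Output length of 1D max pooling with stride equal to pool_size."""
    return (n - pool_size) // pool_size + 1

max_length, vec_size = 120, 200

steps = conv1d_valid_len(max_length, 8)        # Conv1D(64, 8)
pooled_steps = maxpool1d_valid_len(steps, 2)   # MaxPooling1D(2)

shapes = {
    "embedding": (max_length, vec_size),
    "conv1d": (steps, 64),
    "max_pooling1d": (pooled_steps, 64),
    "dense_32": (pooled_steps, 32),   # Dense acts on the last axis only
    "dense_16": (pooled_steps, 16),
    "global_max_pooling1d": (16,),    # max over the remaining time axis
    "dense_1": (1,),
}

for name, shape in shapes.items():
    print(name, shape)
```

Because the two `Dense` layers come before `GlobalMaxPooling1D`, they are applied independently at each of the remaining time steps; the sequence is only collapsed to a single vector just before the sigmoid output.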

In [169]:
def get_encode(phrase):
    phrase = get_clean_text(phrase)
    phrase = token.texts_to_sequences([phrase])
    phrase = pad_sequences(phrase, maxlen=max_length, padding='post')
    return phrase

In [170]:
get_encode("hi how are you")

array([[1346,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0]],
      dtype=int32)
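
Only one non-zero index comes back for a four-word phrase. This is consistent with how `texts_to_sequences` behaves when the `Tokenizer` is created without an `oov_token`: words missing from `word_index` are silently skipped. The other three words are presumably absent from the vocabulary. The sketch below imitates that behaviour in plain Python with a hypothetical toy vocabulary; `encode` is an illustrative helper, not part of Keras.

```python
# Hypothetical toy vocabulary (index 0 is reserved for padding).
word_index = {"fire": 1, "forest": 2, "hi": 3}

def encode(text, word_index, max_length):
    """Map known words to indices, skip unknown ones, then post-pad with zeros."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    # Keep the last max_length ids, like the default truncating='pre' of pad_sequences.
    ids = ids[-max_length:]
    return ids + [0] * (max_length - len(ids))

print(encode("hi how are you", word_index, 6))
```

A phrase made only of unknown words is encoded as all zeros, so the model's prediction for it carries no information about the text.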

In [173]:
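# `predict_classes` was removed in TensorFlow 2.6; thresholding `predict` at 0.5 is the equivalent for a sigmoid output.
(model.predict(get_encode("forest fire near la ronge sask canada")) > 0.5).astype("int32")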

array([[1]], dtype=int32)