# GloVe & CNN

GloVe serves the same purpose as Word2Vec but learns its vectors differently. As seen above, Word2Vec is a "predictive" model that predicts a word from its context. GloVe instead builds a co-occurrence matrix (words × contexts) that counts how often each word appears in each context. Because this matrix is huge, it is factorized to obtain a lower-dimensional representation. GloVe involves many more details, but this is the rough idea.
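
To make the idea concrete, here is a minimal sketch on a toy corpus: it counts co-occurrences in a symmetric window, then factorizes the log-counts. The corpus, the window size and `dim` are illustrative assumptions. Plain SVD is used as a stand-in for the factorization step; the real GloVe algorithm fits the vectors with a weighted least-squares objective instead.

```python
import numpy as np

# Toy corpus (illustrative only, not the tweet dataset)
corpus = [
    "forest fire near town",
    "wild fire near forest",
    "flood near town",
]
window = 2  # assumed size of the context window on each side

# Build the vocabulary and give each word a row/column index
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears within `window` positions of another
cooc = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[w], index[words[j]]] += 1

# Factorize the log-counts to get low-dimensional word vectors.
# NOTE: SVD is a simplification; GloVe itself minimizes a weighted
# least-squares loss on the log co-occurrence counts.
dim = 2
u, s, _ = np.linalg.svd(np.log1p(cooc))
word_vectors = u[:, :dim] * s[:dim]

print(word_vectors.shape)  # one row per word, `dim` columns
```

Each row of `word_vectors` is a dense vector for one vocabulary word, which is the same kind of object the pretrained GloVe vectors provide later in this notebook.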

In [5]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Flatten, Embedding, Activation, Dropout, Masking
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split

import numpy as np
from numpy import array
import pandas as pd

from NLP_Natural_Disasters.data import get_data, clean_data

In [6]:
pd.set_option('display.max_colwidth', None)

# Preprocessing

In [7]:
train_df = get_data()
X_train = clean_data(train_df)

y_train = X_train['target']

X_train_text = X_train['text']
X_train_text.tail()

7608                                                                two giant crane holding bridge collapse nearby home
7609                             ariaahrary thetawniest control wild fire california even northern part state troubling
7610                                                                                               utckm volcano hawaii
7611    police investigating ebike collided car little portugal ebike rider suffered serious nonlife threatening injury
7612                                                            latest home razed northern california wildfire abc news
Name: text, dtype: object

### 👉 Set the id column apart; it will be joined to the prediction output later

In [8]:
X_train_id = X_train['id']

In [9]:
X_train['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

## 😲 Apply changes to test.csv

In [10]:
X_test = pd.read_csv('/home/dianavo/code/AxelCatelan/NLP_Natural_Disasters/raw_data/test.csv')
X_test = clean_data(X_test)
X_test.tail()

| Unnamed: 0 | id | text |
|---|---|---|
| 3258 | 10861 | earthquake safety los angeles ûò safety fastener xrwn |
| 3259 | 10865 | storm ri worse last hurricane cityampothers hardest hit yard look like bombed around k still without power |
| 3260 | 10868 | green line derailment chicago |
| 3261 | 10874 | meg issue hazardous weather outlook hwo |
| 3262 | 10875 | cityofcalgary activated municipal emergency plan yycstorm |


> New clean_data() version:
Removes words that appear only once |
Removes Twitter handles |
Converts accented characters, etc. to ASCII characters |
Drops empty rows
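
The four steps listed above could look roughly like the sketch below. This is not the project's actual `clean_data()` implementation, which is not shown in this notebook; the helper names (`strip_handles`, `to_ascii`, `drop_rare_words`, `clean_texts_sketch`) and the regular expression are hypothetical and only illustrate the idea.

```python
import re
import unicodedata
from collections import Counter


def strip_handles(text):
    """Remove Twitter handles such as @user."""
    return re.sub(r"@\w+", " ", text)


def to_ascii(text):
    """Replace accented characters with their closest ASCII form."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


def drop_rare_words(texts, min_count=2):
    """Drop words seen fewer than `min_count` times in the whole corpus."""
    counts = Counter(w for t in texts for w in t.split())
    return [" ".join(w for w in t.split() if counts[w] >= min_count) for t in texts]


def clean_texts_sketch(texts):
    """Hypothetical stand-in for the four cleaning steps listed above."""
    texts = [to_ascii(strip_handles(t)).lower() for t in texts]
    texts = [" ".join(t.split()) for t in texts]  # collapse whitespace
    texts = drop_rare_words(texts)
    return [t for t in texts if t]  # drop rows that ended up empty


tweets = ["@bob Café fire near café", "fire near home", "@alice hello"]
cleaned = clean_texts_sketch(tweets)
print(cleaned)
```

Because the last step drops rows, any column kept aside (such as the ids) has to be taken from the cleaned data, not from the raw file, to stay aligned.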

### Data cleaning to add eventually: done

Goal: remove rare words that appear only once, such as misspelled words, contractions, etc.

In [11]:
#def transform_to_list_word(serie):
#    split_list=[]  
#    for sentence in serie.str.split():
#        for word in sentence:
#            split_list.append(word)
#    return split_list

In [12]:
#list_word = transform_to_list_word(cleaned_df['text'])
#len(list_word)

**FOR DATA CLEANING**

In [13]:
# twitts = ' '.join(cleaned_df['text'])
# twitts = twitts.split()
# freq_comm = pd.Series(twitts).value_counts()
# rare = freq_comm[freq_comm.values == 1]
# rare

In [14]:
#list_word_new = [word for word in list_word if word not in rare]
#len(list_word_new)

In [15]:
#cleaned_df

**FOR DATA CLEANING**

In [16]:
# %%time
# def get_clean_text(sentence):
#     """ split """
#     if type(sentence) is str:
#         sentence = " " .join(word for word in sentence.split() if word not in rare)
#         return sentence

# cleaned_df['tweet'] = cleaned_df['text'].apply(lambda x: get_clean_text(x))

In [17]:
# phrase = 'i love chocolate'
# phrase.split()

In [18]:
# token = Tokenizer()
# token.fit_on_texts(phrase)

In [19]:
# encoded_text = token.texts_to_sequences(phrase)
# encoded_text

### Token

In [20]:
# modify cleaned_df['text'] -> ['tweet']
text = X_train_text.tolist()
text

['deed reason earthquake may allah forgive u',
 'forest fire near la ronge sask canada',
 'resident asked shelter place notified officer evacuation shelter place order expected',
 'people receive wildfire evacuation order california',
 'got sent photo ruby alaska smoke wildfire pours school',
 'rockyfire update california hwy closed direction due lake county fire cafire wildfire',
 'flood disaster heavy rain cause flash flooding street manitou colorado spring area',
 'im top hill see fire wood',
 'there emergency evacuation happening building across street',
 'im afraid tornado coming area',
 'three people died heat wave far',
 'haha south tampa getting flooded hah wait second live south tampa gon na gon na fvck flooding',
 'raining flooding florida tampabay tampa day ive lost count',
 'flood bago myanmar arrived bago',
 'damage school bus multi car crash breaking',
 'whats man',
 'love fruit',
 'summer lovely',
 'car fast',
 '',
 'ridiculous',
 'london cool',
 'love skiing',
 'wonderf

In [21]:
# #install autocorrect
# !pip install autocorrect
# from autocorrect import Speller 

In [22]:
# #create function to spell check strings
# def spell_check(sentence):
#     spell = Speller(lang='en')
#     return " ".join([spell(word) for word in sentence.split()])

# #showcase spellcheck 
# mispelled = 'Pleaze spelcheck this sentince'
# spell_check(mispelled)
# phrase= 'realized dude onlyftf way blew tusky game'
# spell_check(phrase)

In [23]:
token = Tokenizer()
token.fit_on_texts(text)

In [24]:
vocab_size = len(token.word_index) + 1
vocab_size

15688
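
The `+ 1` in `vocab_size` is there because the Keras `Tokenizer` starts numbering words at 1, leaving index 0 free for padding. The pure-Python sketch below mimics that frequency-ranked indexing on two toy sentences. The function names and the sample sentences are assumptions for illustration, and the real `Tokenizer` also applies filters and lowercasing that are omitted here.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.split())
    # sorted() is stable, so ties keep their order of first appearance
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in t.split() if w in word_index] for t in texts]


sample = ["forest fire near town", "fire near home"]
word_index = build_word_index(sample)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
sequences = texts_to_sequences(sample, word_index)

print(word_index)
print(vocab_size)
print(sequences)
```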

In [25]:
dict_token_tweet = token.index_word
dict_token_tweet

{1: 'fire',
 2: 'like',
 3: 'amp',
 4: 'im',
 5: 'get',
 6: 'u',
 7: 'new',
 8: 'via',
 9: 'one',
 10: 'people',
 11: 'news',
 12: 'dont',
 13: 'time',
 14: 'video',
 15: 'emergency',
 16: 'disaster',
 17: 'year',
 18: 'body',
 19: 'day',
 20: 'building',
 21: 'police',
 22: 'home',
 23: 'family',
 24: 'would',
 25: 'still',
 26: 'say',
 27: 'life',
 28: 'go',
 29: 'crash',
 30: 'storm',
 31: 'got',
 32: 'california',
 33: 'back',
 34: 'look',
 35: 'burning',
 36: 'know',
 37: 'bomb',
 38: 'suicide',
 39: 'world',
 40: 'train',
 41: 'flood',
 42: 'see',
 43: 'car',
 44: 'man',
 45: 'death',
 46: 'attack',
 47: 'rt',
 48: 'first',
 49: 'love',
 50: 'pm',
 51: 'going',
 52: 'cant',
 53: 'nuclear',
 54: 'make',
 55: 'two',
 56: 'today',
 57: 'war',
 58: 'youtube',
 59: 'dead',
 60: 'killed',
 61: 'accident',
 62: 'want',
 63: 'need',
 64: 'let',
 65: 'full',
 66: 'woman',
 67: 'hiroshima',
 68: 'think',
 69: 'may',
 70: 'take',
 71: 'weapon',
 72: 'good',
 73: 'watch',
 74: 'way',
 75: 'm

In [26]:
encoded_text = token.texts_to_sequences(text)
X_train_encoded = encoded_text

Max length

In [27]:
def find_max_len(lst):
    """Return the longest list in a nested list, together with its length."""
    max_list = max(lst, key=len)
    max_length = len(max_list)

    return max_list, max_length

print(find_max_len(encoded_text))

([8416, 44, 8417, 8418, 8419, 333, 412, 590, 1747, 2813, 4745, 3200, 1821, 160, 8420, 8421, 1821, 267, 3416, 4745, 8422, 204, 490], 23)


In [28]:
max_length = 23  # set this to the maximum number of words per tweet in the dataset
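
Hard-coding the length means it has to be updated by hand whenever the cleaning changes. A minimal sketch of deriving it from the data and padding with zeros at the end is shown below. `pad_post` is a hypothetical helper written in plain NumPy, meant to illustrate what `pad_sequences(..., padding='post')` does with its default truncation; it is not the Keras implementation.

```python
import numpy as np


def pad_post(sequences, maxlen=None):
    """Zero-pad integer sequences at the end so they all share one length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # keep the last `maxlen` items, like Keras' default 'pre' truncation
        padded[row, :len(kept)] = kept
    return padded


# Toy encoded tweets (illustrative values, not the real dataset)
encoded_sample = [[3, 1, 2, 4], [1, 2, 5], []]
sample_max_length = max(len(seq) for seq in encoded_sample)
padded = pad_post(encoded_sample, sample_max_length)

print(sample_max_length)
print(padded)
```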

### GloVe Vectors (loaded with gensim)

In [31]:
!pip install gensim python-Levenshtein
import gensim.downloader as api



In [32]:
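# load pretrained 50-dimensional GloVe vectors through gensim
# (the variable keeps its original name even though these are GloVe vectors, not word2vec)
word2vec_transfer = api.load("glove-wiki-gigaword-50")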

def embed_sentence_with_TF(word2vec, sentence):
    """convert a sentence (list of words) into a matrix representing the words in the embedding space"""
    embedded_sentence = []
    for word in sentence:
        if word in word2vec:
            embedded_sentence.append(word2vec[word])
        
    return np.array(embedded_sentence)



def embedding(word2vec, sentences):
    """converts a list of sentences into a list of matrices"""
    embed = []
    
    for sentence in sentences:
        embedded_sentence = embed_sentence_with_TF(word2vec, sentence)
        embed.append(embedded_sentence)
        
    return embed

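# Embed the training and test sentences.
# embed_sentence_with_TF looks words up by string, so pass lists of words
# rather than the integer-encoded sequences or the raw DataFrame
# (iterating over a DataFrame yields its column names, not its rows).
X_train_embed_2 = embedding(word2vec_transfer, [sentence.split() for sentence in text])
X_test_embed_2 = embedding(word2vec_transfer, [sentence.split() for sentence in X_test['text']])

# Pad the embedded training and test sentences to a fixed length
X_train_pad_2 = pad_sequences(X_train_embed_2, dtype='float32', padding='post', maxlen=25)
X_test_pad_2 = pad_sequences(X_test_embed_2, dtype='float32', padding='post', maxlen=25)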

In [33]:
X_train_pad_2.shape
# X.shape = (n_sentences, max_sentence_length, embedding_dim)

(7613, 25, 50)
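
The three axes of that shape are (sentences, padded length, embedding dimension). The sketch below reproduces the same embed-then-pad logic in plain NumPy with a tiny hand-made vector table, so the shape can be checked without downloading GloVe. `toy_vectors`, `embed_and_pad` and the sample sentences are illustrative assumptions, and long sentences are cut at the end here, which differs from the Keras default.

```python
import numpy as np

embedding_dim = 3  # toy dimension; the notebook uses 50

# Hand-made stand-in for the pretrained vectors (values are arbitrary)
toy_vectors = {
    "fire": np.array([0.1, 0.2, 0.3]),
    "flood": np.array([0.4, 0.5, 0.6]),
    "near": np.array([0.7, 0.8, 0.9]),
}


def embed_and_pad(sentences, vectors, maxlen, dim):
    """Look up each known word, skip unknown ones, zero-pad at the end."""
    out = np.zeros((len(sentences), maxlen, dim), dtype="float32")
    for row, sentence in enumerate(sentences):
        known = [vectors[w] for w in sentence.split() if w in vectors][:maxlen]
        if known:
            out[row, :len(known)] = np.stack(known)
    return out


sentences = ["fire near town", "flood", "unknownword"]
X = embed_and_pad(sentences, toy_vectors, maxlen=4, dim=embedding_dim)

print(X.shape)  # (n_sentences, maxlen, embedding_dim)
```

Rows that contain only zeros are what the `Masking` layer in the model further down is meant to skip.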

### TF2.0 and Keras Model Building

In [34]:
# included in model.fit() X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size= 0.2, stratify = y)
maxlen = 25  # new vec_size 
embedding_dim = 50

### CNN

In [35]:
# TEST #

In [36]:
from tensorflow.keras import backend as K  # same Keras namespace as the other imports

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

# compile the model
# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc',f1_m,precision_m, recall_m])

In [40]:
# keras metric precision :
precision = tf.keras.metrics.Precision(thresholds=None, top_k=None, class_id=None, name=None, dtype=None)

### Model RNN

In [41]:
def model_init():
    model = Sequential()
    model.add(layers.Masking())
    model.add(layers.LSTM(10, return_sequences=True))
    model.add(layers.LSTM(10))
    model.add(layers.Dense(30, activation='relu'))
    model.add(layers.Dropout(0.15))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy', precision])
    
    return model

In [38]:
model2 = model_init()

In [None]:
#  restore_best_weights=True

In [39]:
es = EarlyStopping(patience=5)

history = model2.fit(X_train_pad_2, y_train, 
          batch_size = 16,
          epochs=400,
          validation_split=0.3,
          callbacks=[es]
         )

Epoch 1/400
Epoch 2/400
Epoch 3/400
Epoch 4/400
Epoch 5/400
Epoch 6/400
Epoch 7/400
Epoch 8/400
Epoch 9/400
Epoch 10/400
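
With `patience=5`, training stops once the monitored value (`val_loss` by default) has not improved for 5 epochs in a row. The sketch below replays that rule in pure Python. The loss values are invented for illustration, since the log above does not show any metrics, and `early_stopping_epoch` is a hypothetical helper, not part of Keras.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Invented validation losses: best value at epoch 5, no improvement afterwards
val_losses = [0.60, 0.55, 0.52, 0.50, 0.49, 0.50, 0.51, 0.50, 0.52, 0.53]
print(early_stopping_epoch(val_losses, patience=5))
```

Note that `restore_best_weights` defaults to `False` in Keras, so without it the model keeps the weights of the last epoch rather than those of the best one.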


> F1: the F1 score can be a better measure than accuracy when we need a balance between precision and recall AND the class distribution is uneven (a large number of actual negatives).
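
As a quick check of the definitions, the sketch below computes precision, recall and F1 on a small hand-made example with NumPy. The labels, the probabilities and the 0.5 threshold are assumptions for illustration. It works on the whole array at once, whereas custom Keras metrics such as `f1_m` are computed per batch and averaged, so the two can differ slightly.

```python
import numpy as np


def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Compute precision, recall and F1 for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example: 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.2, 0.6, 0.1, 0.3, 0.2, 0.4]

precision_value, recall_value, f1_value = precision_recall_f1(y_true, y_prob)
accuracy_value = float(np.mean((np.asarray(y_prob) >= 0.5) == np.asarray(y_true)))

print(precision_value, recall_value, f1_value, accuracy_value)
```

Here accuracy is 0.75 while F1 is about 0.67, because the many correctly predicted negatives raise accuracy but do not enter the F1 computation.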

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
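# NOTE: model_cnn is not defined in this section, and evaluate() needs labels,
# which the cleaned test.csv shown above (columns id and text) does not provide.
test_loss, test_acc = model_cnn.evaluate(X_test) # test.csv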
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
# print("Test Loss", history[0])
# print("Test Accuracy", history[1])


## Test on Camille's model

Prediction👌😍

In [None]:
y_new = model_cnn.predict(X_test)
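
Once predictions are available, the id column set apart earlier can be joined back to them. The sketch below shows one way to do it with pandas. The three ids come from the test output shown above, but the probabilities are invented stand-ins for the output of `predict()`, and the 0.5 threshold and the `target` column name are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-ins: ids as in X_test['id'], probabilities as a sigmoid model would return them
test_ids = pd.Series([10861, 10865, 10868], name="id")
y_prob = np.array([[0.91], [0.35], [0.62]])  # invented values, shape (n_samples, 1)

# Flatten the (n, 1) output, threshold at 0.5 and attach the ids
submission = pd.DataFrame({
    "id": test_ids.to_numpy(),
    "target": (y_prob.ravel() >= 0.5).astype(int),
})

print(submission)
```

The ids should be taken from the cleaned test data, because `clean_data()` drops empty rows and the predictions only cover the rows that remain.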