<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item">
<li><span><a href="#Workshop-Skills" data-toc-modified-id="Workshop-Skills-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Workshop Skills</a></span></li>
<li><span><a href="#Objectives" data-toc-modified-id="Objectives-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Objectives</a></span></li>
<li><span><a href="#Philosophy-and-recommendations" data-toc-modified-id="Philosophy-and-recommendations-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Philosophy and recommendations</a></span></li>
<li><span><a href="#Pretreatment-and-environmental-preparation" data-toc-modified-id="Pretreatment-and-environmental-preparation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Environmental pretreatment and preparation</a></span><ul class="toc-item">
<li><span><a href="#Cleaning-tweets" data-toc-modified-id="Cleaning-tweets-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Tweets cleanup</a></span></li>
<li><span><a href="#Tokenization-or-segmentation" data-toc-modified-id="Tokenization-or-segmentation-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Tokenization or segmentation</a></span></li>
<li><span><a href="#Word-Embedding:-convertir-un-document-en-une-matrice-de-nombres" data-toc-modified-id="Word-Embedding:-convertir-un-document-en-une-matrice-de-nombres-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Word Embedding: convert a document into a matrix of numbers</a></span></li>
<li><span><a href="#Pipeline-Preparation" data-toc-modified-id="Pipeline-Preparation-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Pipeline preparation</a></span></li>
</ul></li>
<li><span><a href="#Model-training" data-toc-modified-id="Model-training-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Model training</a></span><ul class="toc-item">
<li><span><a href="#RNN:-Simple-recurrent-neuron-networks" data-toc-modified-id="RNN:-Simple-recurrent-neuron-networks-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>RNN: Simple recurrent neural networks</a></span></li>
<li><span><a href="#GRU:-Gated-Recurrent-Units" data-toc-modified-id="GRU:-Gated-Recurrent-Units-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>GRU: Gated Recurrent Units</a></span></li>
<li><span><a href="#LSTM" data-toc-modified-id="LSTM-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>LSTM</a></span></li>
</ul></li>
<li><span><a href="#Evaluation-of-model-performance" data-toc-modified-id="Evaluation-of-model-performance-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Performance evaluation of models</a></span><ul class="toc-item">
<li><span><a href="#Evaluation-of-models-by-the-dataset-test" data-toc-modified-id="Evaluation-of-models-by-the-dataset-test-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Evaluation of models by the test dataset</a></span></li>
<li><span><a href="#Evaluation-of-LSTM-results" data-toc-modified-id="Evaluation-of-LSTM-results-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Evaluation of LSTM results</a></span></li>
<li><span><a href="#Pour-aller-plus-loin-avec-NLP" data-toc-modified-id="Pour-aller-plus-loin-avec-NLP-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>To go further with NLP</a></span></li>
</ul></li>
</ul></div>


# Natural language processing and classification using Recurrent Neural Networks (RNNs)

![GPI-CESI.jpg](attachment:GPI-CESI.jpg)
                    
<table>
<thead>
  <tr><th>Author</th><th>Reviewer</th><th>Center</th><th>Date</th></tr>
</thead>
<tbody>
  <tr><td>Genane YOUNESS</td><td>Benjamin COHEN BOULAKIA</td><td>Nanterre</td><td>2023-02-07</td></tr>
</tbody>
</table>

The aim of this Workshop is to introduce you to the basics of [NLP (Natural Language Processing)](https://lbourdois.github.io/blog/nlp/) and to build a binary classification model using recurrent neural networks (RNNs).

Unlike traditional feed-forward neural networks, recurrent neural networks process sequences, whether spatial or temporal. In other words, the hidden layer at the present time step is linked to the hidden layer at the next one.
The application you are about to carry out involves building a model to identify Twitter posts ("tweets") announcing a disaster. This task belongs to natural language processing (NLP).
This binary text classification is important, as it could help state agencies to quickly identify and respond to disasters.
The available data are labeled tweets, each either reporting a disaster or not.
First, we'll clean up the text data before moving on to binary classification with RNNs.

The dataset used in this Workshop is that of the Kaggle challenge [_Natural Language Processing with Disaster Tweets_](https://www.kaggle.com/competitions/nlp-getting-started/data). It is provided to you in the archive <code>nlp-getting-started.zip</code>. Unzip its contents into a directory <code>nlp-getting-started</code>, which you will place in the same directory as this Workshop.

## Pre-treatment and environmental preparation

In [14]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

We start by loading the Twitter publications ("tweets"). We keep the columns we're interested in, which are the <code>text</code> and <code>target</code> columns. The <code>text</code> column holds the tweet as a string, and <code>target</code> is an _int64_ label: <code>1</code> for a disaster, <code>0</code> otherwise.

In [41]:
import os

# Change according to your dataset download path
# os.chdir('')
train_data = pd.read_csv('train.csv', usecols=['text', 'target'], dtype={'text': str, 'target': np.int64})

train_data.shape

(7613, 2)

In [42]:
train_data.iloc[:5, :]

# train_data = train_data.dropna()

train_data.iloc[:4, :]

                                                text  target
0  Our Deeds are the Reason of this #earthquake M...       1
1             Forest fire near La Ronge Sask. Canada       1
2  All residents asked to 'shelter in place' are ...       1
3  13,000 people receive #wildfires evacuation or...       1


We will also load the test data to evaluate the performance of the classification model.

In [43]:
train2 = pd.read_csv('./train.csv')
train2.iloc[:3, :]

   id keyword location                                               text  target
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...       1
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada       1
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...       1


In [44]:
test_data = pd.read_csv('./test.csv')
test_data.iloc[:3, :]

   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...


In [45]:
x = train_data['text']
y = train_data['target']

x[:5], y[:5]

(0    Our Deeds are the Reason of this #earthquake M...
 1               Forest fire near La Ronge Sask. Canada
 2    All residents asked to 'shelter in place' are ...
 3    13,000 people receive #wildfires evacuation or...
 4    Just got sent this photo from Ruby #Alaska as ...
 Name: text, dtype: object,
 0    1
 1    1
 2    1
 3    1
 4    1
 Name: target, dtype: int64)

In [46]:
train_data=train_data.drop(train_data.index[[4415, 4400, 4399,4403,4397,4396, 4394,4414, 4393,4392,4404,4407,4420,
                                             4412,4408,4391,4405,6840,6834,6837,6841,6816,6828,6831,601,576,584,608,
                                             606,603,592,604,591, 587,3913,3914,3936,3921,3941,3937,3938,3136,3133,
                                             3930,3933,3924,3917,246,270,266,259,253,251,250,271,6119,6122,6123,6131,
                                             6160,6166,6167,6172,6212,6221,6230,6091,6108,7435,7460,7464,7466,7469,
                                             7475,7489,7495,7500,7525,7552,7572,7591,7599]])

# Class distribution
print(train_data.target.value_counts())

target
0    4308
1    3223
Name: count, dtype: int64


In [47]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# train_data is left untouched: the cleaning steps below need the full DataFrame.
# (pd.DataFrame(x_train, y_train) would use the labels as the index and yield NaN rows.)
x_train.shape, y_train.shape

((6090,), (6090,))

Let's have a look at the first texts:

In [33]:
x_train.head().values

array(['Courageous and honest analysis of need to use Atomic Bomb in 1945. #Hiroshima70 Japanese military refused surrender. https://t.co/VhmtyTptGR',
       '@ZachZaidman @670TheScore wld b a shame if that golf cart became engulfed in flames. #boycottBears',
       "Tell @BarackObama to rescind medals of 'honor' given to US soldiers at the Massacre of Wounded Knee. SIGN NOW &amp; RT! https://t.co/u4r8dRiuAc",
       'Worried about how the CA drought might affect you? Extreme Weather: Does it Dampen Our Economy? http://t.co/fDzzuMyW8i',
       '@YoungHeroesID Lava Blast &amp; Power Red #PantherAttack @JamilAzzaini @alifaditha'],
      dtype=object)

Some tweets are mislabeled, so we remove them from the dataset (for more details on these tweets, see the [work of Dmitri Kalyaevs](https://www.kaggle.com/code/dmitri9149/transformer-svm-semantically-identical-tweets/notebook)).


### Cleaning tweets

As with all datasets, natural language data such as tweets require a great deal of cleaning. This step is part of what we call pre-processing. What do you think needs to be cleaned up in tweets? Which elements must be removed so that the tweets are ready for the training phase?

<em>PLEASE COMPLETE</em>

First of all, we're going to set up the pipeline: preparing the libraries and tools for cleaning up tweets. We use _NLTK_ (Natural Language Toolkit), one of the most widely used natural language processing libraries in Python, designed for symbolic and statistical language processing. We use it for tokenization, stemming, lemmatization and stopword loading. For this purpose, we need to download a few NLTK resources ([Punkt](https://www.nltk.org/api/nltk.tokenize.punkt.html), OMW...).

Here's what our pipeline looks like:

In [None]:
# Preparing libraries and tools for cleaning up tweets
# Regular expressions
import re
# Punctuation
import string

#Tokenization
import nltk
nltk.download('punkt')
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize

# Lemmatization
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
lem=WordNetLemmatizer()

# Load stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords


# Load Stemming, which consists in reducing a word to its "root" form
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')


We're going to start by writing a function to remove blank spaces, numbers, replace uppercase characters with lowercase ones, and remove special characters. To do this, we'll use _RegEx_ or regular expression, which is a sequence of characters that forms a search pattern. For more details, see https://www.w3schools.com/python/python_regex.asp or https://docs.python.org/fr/3/howto/regex.html
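As a quick illustration of the `re.sub` calls that this kind of function relies on, here is a minimal sketch on a made-up string. The sample text and both patterns are assumptions chosen for the demonstration; they are not the patterns expected in the exercise that follows.

```python
import re

# Made-up sample string (an assumption for this demo, not taken from the dataset)
sample = "Fire   in  2015\nnear the BRIDGE"

# Lowercase, then collapse every run of whitespace (spaces, newlines) into one space
lowered = sample.lower()
collapsed = re.sub(r"\s+", " ", lowered).strip()

# Remove every run of digits, then tidy up the double space left behind
no_digits = re.sub(r"\d+", "", collapsed)
no_digits = re.sub(r"\s+", " ", no_digits).strip()

print(collapsed)
print(no_digits)
```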

    

In [None]:
# Function for cleaning each document: nlp_pipeline
# Tweet= tweet corpus = document
# A RegEx, or regular expression, is a sequence of characters that forms a search pattern.
def nlp_pipeline(text):
    # Convert uppercase letters to lowercase
    text = text.lower()

    # Replace new line with a space
    text = text.replace('\n', ' ').replace('\r', '')
    text = ' '.join(text.split())

    # Remove uppercase letters, all strings that are not letters or numbers
    #PLEASE COMPLETE

    # Remove special characters
    text = re.sub(r"(\s\-|-$)", "", text)
    #PLEASE COMPLETE
    text = re.sub(r"\x89û", "", text)
    return text

One of the most important steps in cleaning up tweets is to remove URLs and HTML tags, and then use _stemming_ to remove the end of the word and keep only the root.

In [None]:
# Remove URLs
def remove_url(sentence):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', sentence)

def remove_html(sentence):
    html = re.compile(r'<.*?>')
    return html.sub(r'', sentence)

# Stemming: strips the ending of a word, leaving only its root.
# Example: "finding" becomes "find".
stemmer = SnowballStemmer('english')

def stem_words(sentence):
    words = sentence.split()
    words = [stemmer.stem(word) for word in words ]

    return ' '.join(words)

# Apply stemming to the word "fired".
print(stemmer.stem('fired'))

# Apply stemming to the word "emergency"
print(stemmer.stem('emergency'))

Use the [stopwords module documentation](https://pythonspot.com/nltk-stop-words/) to find the stop words of the English language, then remove them.

In [None]:
mots_vides=  #PLEASE COMPLETE
print('\n')
print(mots_vides)

In [None]:
# Function that removes stopwords: words that are very common in the language studied but carry little meaning,
# such as the French words et, à, le, la, etc. (https://pythonspot.com/nltk-stop-words/)
mots_vides=stopwords.words('english')

def remove_stopwords(sentence):
    words = sentence.split()
    words = #PLEASE COMPLETE

    return ' '.join(words)

Inspect the punctuation characters using the [string](https://docs.python.org/3/library/string.html) module, then remove them. To find them, search that page for the keyword _punctuation_.

In [None]:
punctuations = #PLEASE COMPLETE
print(punctuations)

In [None]:
# Remove punctuation
def remove_punctuation(sentence):
    #PLEASE COMPLETE

    words = #PLEASE COMPLETE

    return ' '.join(words)

Remove the emojis used in tweets.

Note that the different types of pictographic characters often occupy contiguous ranges in Unicode, which we can exploit with character ranges in a RegEx. Each emoji is assigned a unique Unicode code point. To write a code point above U+FFFF in a Python string, replace "U+" with "\U" and pad the hexadecimal value with leading zeros up to eight digits. For example, U+1F605 is written \U0001F605.
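To make the escape notation concrete, the sketch below writes U+1F605 as a Python string and removes characters that fall in the emoticon range U+1F600 to U+1F64F. The sample sentence is made up for the demonstration, and only this one range is covered.

```python
import re

# U+1F605 written with Python's 8-hex-digit escape: \U followed by 0001F605
emoji = "\U0001F605"
print(len(emoji), hex(ord(emoji)))

# Character class covering the emoticon block U+1F600 to U+1F64F
emoticons = re.compile("[\U0001F600-\U0001F64F]+")

sample = "forest fire \U0001F605\U0001F622 stay safe"  # made-up sentence
cleaned = emoticons.sub("", sample)
print(cleaned)
```

Note that removing the emojis leaves a double space behind, which a later whitespace-normalization step can collapse.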

In [None]:
# Remove emojis
def remove_emoji(sentence):
    emoji_pattern = re.compile("["
                                # emoticons
                           u"\U0001F600-\U0001F64F"
                               # symbols & pictographs
                           #PLEASE COMPLETE
                                # transport & map symbols
                           #PLEASE COMPLETE
                               # flags (iOS)
                           #PLEASE COMPLETE
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "+", flags=re.UNICODE)

    return emoji_pattern.sub(r'', sentence)


Write a function that removes word endings and isolates the canonical form of the word, known as the lemma. The lemma is often the word's [stem](https://www.espacefrancais.com/radicaux-prefixes-et-suffixes/), but not always, particularly for verbs, which must be put in the infinitive. For example, the word "found" has the lemma "find".

In [None]:
from nltk.stem import WordNetLemmatizer
lem=WordNetLemmatizer()

def lem_word(sentence):
    words = sentence.split()
    words = #PLEASE COMPLETE
    return ' '.join(words)

Very short words (two characters or fewer) often carry little information, and it is better to remove them.

Write the function that removes words with less than two characters:

In [None]:
# Remove words with less than two characters
def remove_small(sentence):

    words = sentence.split()
    words = #PLEASE COMPLETE
    return ' '.join(words)

Let's write the _clean_text_ function, which chains all these functions to clean the tweets.

In [None]:
def clean_text(data):
    data['text'] = data['text'].apply(lambda x : remove_url(x))
    data['text'] = data['text'].apply(lambda x : remove_html(x))
    #data['text'] = data['text'].apply(lambda x : stem_words(x))
    data['text'] = data['text'].apply(lambda x : remove_punctuation(x))
    data['text'] = data['text'].apply(lambda x : remove_stopwords(x))
    data['text'] = data['text'].apply(lambda x : remove_emoji(x))
    data['text'] = data['text'].apply(lambda x : remove_small(x))
    data['text'] = data['text'].apply(lambda x : lem_word(x))
    data['text'] = data['text'].apply(lambda x : nlp_pipeline(x))
    return data

Now that the pre-processing pipeline is ready, let's apply it to the two types of data: the training dataset and the test dataset.

In [None]:
# Apply cleaning to both types of data: the training data set and the test data set
D = clean_text(train_data)
test_data_c=clean_text(test_data)

# Data after cleaning (clean_text modifies train_data in place, so D and train_data are the same object)
train_data.head()

Check for empty tweets after cleaning and remove them.

In [None]:
print(\
      #PLEASE COMPLETE

# Eliminate empty tweets if any.
D = #PLEASE COMPLETE

print(D.shape)

D.text.head()


We can visually identify certain terms that are most often associated with our topic of interest, which in this case is "disaster". This can be done using a word cloud:

In [None]:
# You need to install wordCloud
#!pip install wordcloud

from wordcloud import WordCloud

# Create Wordcloud from tweets
# https://www.geeksforgeeks.org/generating-word-cloud-python/
D.text
all_words = ' '.join([text for text in D.text])
wordcloud = #PLEASE COMPLETE

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()


We're going to split the training data set of tweets into two parts: training and validation.
As a reminder, what will each part be used for?

<em>PLEASE COMPLETE</em>

In [None]:
# Train/validation split
from sklearn.model_selection import train_test_split
dtrain, dtest = #PLEASE COMPLETE

# Checking the split
print(dtrain.shape)
print(dtest.shape)

The output shows a total of 7531 tweets, 5648 of which go to training and 1883 to validation. Now that the dataset is ready, it's time to build the dictionary from the training sample.

### Tokenization or segmentation  

A model won't understand what to do with a string representing a sentence. Instead, it needs to be converted into an array of numbers representing the words in the sentence. A tokenizer should come in handy. How does it work?

<em>PLEASE COMPLETE</em>
    
To find out more about the various Tokenizer methods, see the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer/).

In [None]:
# Tokenization with Keras
from keras.preprocessing.text import Tokenizer

def define_tokenizer(train_sentences, val_sentences, test_sentences):
    sentences = pd.concat([train_sentences, val_sentences, test_sentences])
    tokenizer = tf.keras.preprocessing.text.Tokenizer()

    ## Creation of the dictionary from the sample documents
    tokenizer.fit_on_texts(sentences)

    return tokenizer

def encode(sentences, tokenizer):
    encoded_sentences = tokenizer.texts_to_sequences(sentences)
    encoded_sentences = #PLEASE COMPLETE

    return encoded_sentences

What is the role of the <code>padding='post'</code> option in the second function?

<em>PLEASE COMPLETE</em>

Apply tokenization to the three datasets: build the Tokenizer, then encode each sentence as an array of index numbers. Name the results `encoded_sentences`, `val_encoded_sentences` and `encoded_test_sentences`, respectively:

In [None]:
tokenizer = #PLEASE COMPLETE

encoded_sentences = encode(dtrain['text'], tokenizer)
val_encoded_sentences = #PLEASE COMPLETE
encoded_test_sentences = #PLEASE COMPLETE

# Number of documents processed
print(tokenizer.document_count)

The tokenizer provides some interesting information about the sentences it encodes. To get the index number assigned to a word, we can look it up in the Tokenizer's `word_index` (which is just a Python dictionary with the words as keys and the index numbers as values). We can look at some other information too.
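The idea of a word index can be sketched in plain Python. The toy version below is an illustration under simplifying assumptions (made-up sentences, whitespace splitting, no filtering); it is not the Keras implementation. It gives index 1 to the most frequent word and leaves index 0 free for padding.

```python
from collections import Counter

# Made-up, already-cleaned sentences (assumption for the demo)
sentences = ["forest fire near lake", "fire emergency declared", "emergency fire crews"]

# Count word frequencies over the whole corpus
counts = Counter(word for s in sentences for word in s.split())

# Most frequent word gets index 1; index 0 is left free for padding
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Encode each sentence as a list of index numbers
encoded = [[word_index[w] for w in s.split()] for s in sentences]

print(word_index["fire"], word_index["emergency"])
print(encoded[1])
```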

In [None]:
print(tokenizer.word_index['disaster'])
print(tokenizer.word_index['target'])

# Vocabulary size
print(len(tokenizer.word_index))

# List of words and their frequencies
print(list(tokenizer.word_counts.items())[:10])

#List sorted in order of decreasing frequency
print(sorted(list(tokenizer.word_counts.items()),key=lambda x: -x[1])[:20])

The most frequently used important term is <em>like</em>, appearing 487 times in one or more documents. The term <em>fire</em> appears 484 times, <em>emergency</em> appears 226 times, <em>disaster</em> appears 220 times.

Right, let's move on to the document matrix (text to matrix).

### Word Embedding: convert a document into a matrix of numbers

After preprocessing the text data and creating the dictionary, we need to do some <em>Word Embedding</em>.

Why do we need to use Word Embedding in NLP?

<em>PLEASE COMPLETE</em>
   
There are various techniques available, depending on the model use case and dataset. We can cite One Hot Encoding, TF-IDF, Word2Vec and FastText (https://towardsdatascience.com/word-embedding-techniques-word2vec-and-tf-idf-explained-c5d02e34d08). We have chosen [GloVe](https://datamahadev.com/nlp-stanfords-glove-for-word-embedding/) for this task. GloVe (<em>Global Vectors for Word Representation</em>) was created at Stanford University. As its name suggests, it helps preserve global context: it builds a global co-occurrence matrix by estimating the probability that a given word co-occurs with other words. It is therefore well suited to word-analogy and word-similarity tasks. The <code>glove.6B</code> release used here provides pre-trained dense vectors for a vocabulary of 400,000 tokens, learned from a corpus of about 6 billion tokens, and it also covers many general-purpose characters such as commas, braces and semicolons.

Once we've pre-processed the text data and created the dictionary, we need to go through the GloVe file of a specific dimension and compare each word with all the words in the dictionary. If there's a match, we copy the corresponding vector from GloVe into `embedding_matrix` at the matching index. The first thing to do, then, is to load the embedding.

First, we'll download and decompress the GloVe embeddings. Specifically, we're going to retrieve the [pre-trained GloVe representation](http://nlp.stanford.edu/data/glove.6B.zip). Please take your time to explore this data!


In [None]:
from zipfile import ZipFile

zf = ZipFile('glove.6B.zip', 'r')
zf.extractall('glove.6B')
zf.close()

Now let's compute an index that maps words to known embeddings, by analyzing the database of pre-trained embeddings:

In [None]:
embedding_dict = {}

f=open(os.path.join('glove.6B', 'glove.6B.100d.txt'),'r',encoding='utf-8')
for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:],'float32')
        # Transform each word into a vector of dimension 100.
        #PLEASE COMPLETE

f.close()


# Check number of terms
print('Found %s word vectors.' % len(embedding_dict))

Let's test the presence of certain terms, and their similarity:

In [None]:
# Coordinates of the terms good and nice
print(embedding_dict['good'])
print(embedding_dict['nice'])

# Similarity between good and nice: if the value is close to 1, then there's a strong similarity
from scipy.spatial import distance
print(1.0 - distance.cosine(embedding_dict['good'], embedding_dict['nice']))
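SciPy's `distance.cosine` returns the cosine *distance*, which is why the cell above subtracts it from 1 to obtain a similarity. The plain-Python sketch below spells out that similarity on made-up 3-dimensional vectors (the real GloVe vectors used here have 100 dimensions); the helper name `cosine_similarity` is introduced only for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for the demonstration
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, whatever their lengths.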

We'll use the code below to check that each term of our dictionary is present in GloVe's pre-trained representation. At this point, we can use our `embedding_dict` dictionary and the tokenizer's `word_index` to compute our embedding matrix:

In [None]:
# Initial embedding matrix for our dataset
hit=0
misses=[]

# Number of tokens ( numpy is zero-based)
num_words = #PLEASE COMPLETE

# Dimension of representation =100 according to Glove chosen
embedding_matrix = np.zeros((num_words, 100))

# Fill the matrix with the coordinates from the pre-trained representation
# Provided that the dictionary term we're looking for is present in GloVe's pre-trained representation
for word, i in tokenizer.word_index.items():
    if i >= num_words:  # index i would fall outside the matrix
        continue

    emb_vec = embedding_dict.get(word)

    if emb_vec is not None:
        embedding_matrix[i] = emb_vec
        hit = hit+1
    else:
        misses.append(word)

# Control display: the number of terms found and not found in GloVe's pre-trained representation.

print( #PLEASE COMPLETE

We find 6009 non-referenced terms, i.e. 6009 terms present in the tweets but not in GloVe. The corresponding rows stay at zero in our matrix. This is a loss of information: because the embedding layer is frozen, these vectors are never learned.

In [None]:
print(misses[:10])

We can check that these terms are indeed absent from the chosen GloVe file. Some of them are missing because they were incorrectly transformed during the pre-processing phase.


### Preparing the pipeline

With the sentences encoded, they can now be prepared for input into the model. TensorFlow provides an API for formatting data in its own format. Although data can be inserted in a more common format (such as numpy arrays), TensorFlow seems to prefer its own format and provides some handy features as incentives.

As a first step, therefore, we're going to convert the encoded sentences and labels into tensors.

In [None]:
tf_data = tf.data.Dataset.from_tensor_slices((encoded_sentences, dtrain['target'].values))

Now that the data is in TensorFlow format, a few practical methods can be added to improve training. These include shuffling the data at each epoch, preparing the next batch while the current one is being used for training (prefetching), and defining each batch as a padded batch.
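To see what shuffling and padded batches amount to, here is a plain-Python illustration of the idea. It is a sketch under simplifying assumptions, not how `tf.data` is implemented: the whole list is shuffled at once (no shuffle buffer), and the helper `padded_batches` and the sample sequences are hypothetical.

```python
import random

def padded_batches(sequences, batch_size, seed=0):
    """Shuffle the sequences, group them into batches, and pad each batch
    with zeros up to the length of its longest sequence."""
    order = list(sequences)
    random.Random(seed).shuffle(order)
    batches = []
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        batches.append([seq + [0] * (width - len(seq)) for seq in batch])
    return batches

# Made-up encoded sentences of different lengths
encoded = [[4, 7], [1], [9, 2, 5], [3, 3, 8, 6], [2]]
batches = padded_batches(encoded, batch_size=2)
for batch in batches:
    print(batch)
```

Padding per batch rather than over the whole dataset keeps short sentences from being stretched to the length of the longest tweet in the corpus.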

In [None]:
def pipeline(tf_data, buffer_size=100, batch_size=32):
    tf_data = tf_data.shuffle(buffer_size)
    tf_data = tf_data.prefetch(tf.data.experimental.AUTOTUNE)

    tf_data = #PLEASE COMPLETE

    return tf_data

tf_data = pipeline(tf_data, buffer_size=1000, batch_size=32)

print(tf_data)

Let's define a similar pipeline for the validation dataset. The difference is that the data is not shuffled, which speeds up validation.

In [None]:
tf_val_data = #PLEASE COMPLETE

def val_pipeline(tf_data, batch_size=1):
    tf_data = tf_data.prefetch(tf.data.experimental.AUTOTUNE)
    tf_data = #PLEASE COMPLETE

    return tf_data

tf_val_data = val_pipeline(tf_val_data, batch_size=len(dtest))

print(tf_val_data)

## Training the model

We're now ready to define and train the model. First, let's define the model.

![graphRNN1.png](attachment:graphRNN1.png)

Three layers need to be defined: an embedding layer, an RNN layer and a dense layer.
Explain how each layer works.

<em>PLEASE COMPLETE</em>

In [None]:
# Embedding layer
embedding = tf.keras.layers.Embedding(
    len(tokenizer.word_index)+1,
    100,
    embeddings_initializer = tf.keras.initializers.Constant(embedding_matrix),
    trainable=False  # this embedding layer is frozen: its weights
                     # must not be updated during training
)

### RNN: Simple recurrent neural networks

Recurrent neural networks are designed for sequential data, of which natural language is one example.
In practice, however, simple RNNs don't perform well as the RNN layer; indeed, one of their main problems is vanishing gradients. Sequences can be quite long, and the network may have difficulty backpropagating gradients through all the time steps back to the earliest ones. When this happens, the network cannot learn the relationships between distant tokens.
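The shrinking effect can be illustrated with simple arithmetic. During backpropagation through time, the gradient is multiplied by one factor per time step; the sketch below assumes a single constant factor of 0.9, which is a deliberate simplification (in a real RNN the factor is a matrix that changes at every step).

```python
# Assumed per-step gradient factor (hypothetical value, smaller than 1)
factor = 0.9

for steps in (1, 10, 50, 100):
    # Gradient reaching a token that lies `steps` time steps in the past
    print(steps, factor ** steps)
```

Even with a factor this close to 1, the contribution of a token 100 steps back is several orders of magnitude smaller than that of the most recent one.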

What do you think is the consequence of the vanishing gradient problem?

<em>PLEASE COMPLETE</em>

If you like, you can try out simple RNNs and see how they perform in our case. Don't do it during the Workshop, though: it would use up learning time for little benefit.

How can you avoid this problem?

<em>PLEASE COMPLETE</em>

Which of the three RNN architectures can do this?

<em>PLEASE COMPLETE</em>

We're going to apply and compare these types of recurrent neural networks.

### GRU: Gated Recurrent Units

First, we'll try to understand the architecture of GRU (Gated Recurrent Unit) neural networks.
These neural networks are made up of two gates:

![image-2.png](attachment:image-2.png)

Each "gate" corresponds to a small neural network with a sigmoid activation function, the aim of which is to bring the values of the gate's input vectors between 0 and 1.
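As a minimal sketch of this squashing effect, the code below applies a sigmoid to made-up gate inputs and multiplies the result element-wise with a made-up state vector. All values are assumptions for the demonstration; a real gate computes its inputs from learned weights.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up pre-activation values of a gate and a made-up state vector
gate_inputs = [-6.0, 0.0, 6.0]
state = [2.0, 2.0, 2.0]

gate = [sigmoid(x) for x in gate_inputs]
# Element-wise product: values near 0 block the component, values near 1 let it through
filtered = [g * s for g, s in zip(gate, state)]

print([round(g, 3) for g in gate])
print([round(f, 3) for f in filtered])
```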

What are the names of the two gates (A and B in the figure)? How do they work?

<em>PLEASE COMPLETE</em>

A more detailed description of GRU can be found in [Illustrated Guide to LSTM's and GRU's: A step by step explanation](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21).

So let's create Model 1, a GRU with a single hidden layer of 128 units and a dropout of 0.2:

In [None]:
# Creation of model 1 architecture
model1 = tf.keras.Sequential([
    embedding,
    tf.keras.layers.SpatialDropout1D(0.2),
    # GRU with 128 neurons and dropout=0.2
    #PLEASE COMPLETE
    ,tf.keras.layers.Dense(1, activation='sigmoid')
])

model1.summary()

Then we compile the model by defining the optimizer (Adam with a learning rate of 0.01) and the loss function (here, binary cross-entropy). We also pass a `metrics` argument so that the model's accuracy is printed at each epoch.
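For intuition about what this loss measures, here is a plain-Python sketch of the mean binary cross-entropy, with made-up labels and probabilities. The helper `binary_crossentropy` and its `eps` clipping value are assumptions for the illustration, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]             # made-up labels
confident = [0.9, 0.1, 0.8, 0.2]  # mostly right
hesitant = [0.5, 0.5, 0.5, 0.5]   # always undecided

print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, hesitant))
```

The loss drops as the predicted probabilities move toward the correct labels, which is the quantity the optimizer tries to minimize.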

In [None]:
model1.compile(
    loss= #PLEASE COMPLETE
    , optimizer=tf.keras.optimizers.Adam(0.01),
    metrics=['accuracy']
)

Finally, we start training the model on the train part, with 50 epochs:

In [None]:
history1 = model1.fit(
    tf_data,
    validation_data = tf_val_data,
    epochs = 50
  )


We'll visualize both metrics: error and accuracy, per epoch, during model training to get a better idea of how the training went.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history1.history['loss'], label='train')
axs[0].plot(history1.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history1.history['accuracy'], label='train')
axs[1].plot(history1.history['val_accuracy'], label='val')
axs[1].legend()



### LSTM

We're going to test another model based on LSTM networks, which are very effective for NLP tasks.

An LSTM network is organized similarly to an RNN, but two states are passed from one time step to the next: the cell state and the hidden state. At each unit, the hidden state is combined with the input, and together they control what happens to the cell state and to the output via gates. Each gate has a sigmoid activation (output in the range [0, 1]), which can be seen as an element-wise mask when multiplied by the state vector. An LSTM has three such gates.

Note that the horizontal line in the following diagram corresponds to the almost unchanged propagation of the cell state vector (_cell state_), which ensures that the initial information is carried forward.

![LSTM1.jpg](attachment:LSTM1.jpg)

What is the name and function of these different gates?

<em>PLEASE COMPLETE</em>

More detailed explanations of this architecture can be found [here](https://larevueia.fr/quest-ce-quun-reseau-lstm/).

For scenarios that require random access to the input sequence, it makes more sense to run the recurrent calculation in both directions. RNNs that enable two-way computation are called <em>bidirectional</em> RNNs, and they can be created by wrapping the recurrent layer with a special bidirectional layer.

The `Bidirectional` layer makes two copies of the layer it contains, and sets the `go_backwards` property of one of these copies to `True`, making it go in the opposite direction along the sequence.

Recurrent networks, unidirectional or bidirectional, capture patterns in a sequence, and store them in state vectors or return them as output. As with convolutional networks, we can build another recurrent layer after the first to capture higher-level patterns, built from lower-level patterns extracted by the first layer. This leads us to the notion of a multi-layer RNN, which consists of two or more recurrent networks, where the output of the previous layer is passed on to the next layer as input.

![image.png](attachment:image.png)

Figure by Fernando López.
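The idea of running the recurrence in both directions, then stacking a second layer, can be sketched without Keras. The example below uses a toy scalar recurrence `h_t = tanh(w_x * x_t + w_h * h_{t-1})` with hand-picked weights. The function names `run_rnn` and `run_bidirectional` are hypothetical, and the sketch illustrates the principle only, not how Keras implements its `Bidirectional` wrapper.

```python
import math

def run_rnn(inputs, w_x, w_h, go_backwards=False):
    """Toy scalar RNN: returns one hidden value per time step, in input order."""
    steps = reversed(range(len(inputs))) if go_backwards else range(len(inputs))
    outputs = [0.0] * len(inputs)
    h = 0.0
    for t in steps:
        h = math.tanh(w_x * inputs[t] + w_h * h)
        outputs[t] = h
    return outputs

def run_bidirectional(inputs, fwd_weights, bwd_weights):
    """Pair, at each time step, the forward and backward hidden values."""
    forward = run_rnn(inputs, *fwd_weights)
    backward = run_rnn(inputs, *bwd_weights, go_backwards=True)
    return list(zip(forward, backward))

# Made-up input sequence and weights (illustrative values only).
sequence = [1.0, 0.0, -1.0, 0.5]
first_layer = run_bidirectional(sequence, (0.8, 0.5), (0.6, 0.4))

# Stacking: a second recurrent layer reads the first layer's outputs.
# For brevity, only the forward component is passed on here.
second_layer = run_rnn([fwd for fwd, _ in first_layer], 1.0, 0.3)

print(first_layer)
print(second_layer)
```

At each position, the forward value summarizes everything to the left and the backward value everything to the right, which is why the two are combined.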

Let's build a bidirectional single-hidden-layer LSTM with 128 neurons and dropout equal to 0.3.

In [None]:
model2 = tf.keras.Sequential([
    embedding,
    tf.keras.layers.SpatialDropout1D(0.4),
    # bi-directional single hidden layer LSTM with 128 neurons and dropout equal to 0.3
    #PLEASE COMPLETE
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model2.compile(
    # loss function binary cross entropy
    loss= #PLEASE COMPLETE
    optimizer=tf.keras.optimizers.Adam(0.01),
    # accuracy metric
    #PLEASE COMPLETE

history2 = model2.fit( tf_data, validation_data = tf_val_data, epochs = 50 )

fig, axs = plt.subplots(1, 2, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history2.history['loss'], label='train')
axs[0].plot(history2.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
# Plot the training accuracy on the second axis
#PLEASE COMPLETE
# Plot the validation accuracy on the second axis
#PLEASE COMPLETE
axs[1].legend()


Compare the LSTM results with those of the GRU. What can we conclude?

<em>PLEASE COMPLETE</em>

## Model performance evaluation

We now know which architecture is more suitable, but it's interesting to understand why. Let's see which sentences our two models misinterpreted. To do this, the models need to produce predictions for the dataset. This requires a slightly different pipeline.

In [None]:
prediction1 = model1.predict(tf_val_data)
prediction1 = np.concatenate(prediction1).round().astype(int)
dtest['prediction1'] = prediction1

prediction2 = model2.predict(tf_val_data)
prediction2 = np.concatenate(prediction2).round().astype(int)
dtest['prediction2'] = prediction2
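As a small illustration of the post-processing used above: a model ending in a single sigmoid unit returns one probability per row, in an array of shape `(n, 1)`. The sketch below replaces the real model output with made-up probabilities to show how they become 0/1 labels.

```python
import numpy as np

# Made-up sigmoid outputs standing in for model.predict(...): shape (4, 1).
probabilities = np.array([[0.91], [0.08], [0.55], [0.49]])

# np.concatenate joins the rows into a flat vector of shape (4,);
# round() then thresholds around 0.5 and astype(int) yields 0/1 labels.
labels = np.concatenate(probabilities).round().astype(int)

print(labels.shape, labels)
```

Note that NumPy rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0.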

### Model evaluation using the test dataset

Let's take a look at the results of Model 1 and Model 2 to get an idea of their performance. The quickest and simplest method of evaluation is to examine the metrics produced by the model. The final metrics can be extracted using the `evaluate` method on the validation dataset.

In [None]:
# GRU model evaluation on the test part
# evaluate expects the model inputs and labels (here the batched dataset), not the predictions
score1 = model1.evaluate(tf_val_data, verbose=1)

print("Validation Score model1:", score1[0])
print("Validation Accuracy model1:", score1[1])
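Assuming a model compiled with binary cross-entropy as the loss and accuracy as the only metric, `evaluate` returns the loss first and the accuracy second. The sketch below recomputes both quantities by hand on made-up labels and probabilities. The helper names are hypothetical, and Keras additionally clips probabilities away from 0 and 1 for numerical stability, so its values can differ slightly.

```python
import math

def binary_cross_entropy(targets, probabilities):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(targets, probabilities)
    ]
    return sum(losses) / len(losses)

def accuracy(targets, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the target."""
    hits = sum(int(p > threshold) == y for y, p in zip(targets, probabilities))
    return hits / len(targets)

# Made-up labels and predicted probabilities (illustrative only).
targets = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.1]

loss = binary_cross_entropy(targets, probabilities)
acc = accuracy(targets, probabilities)
print(f"loss={loss:.4f} accuracy={acc:.2f}")
```

The loss penalizes confident mistakes more heavily than the accuracy does, which is why the two numbers are worth reading together.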

Display the evaluation of the LSTM model on the test part. We need both the validation loss and the accuracy.

In [None]:
# Evaluation of the LSTM model on the test part
score2 = #PLEASE COMPLETE

# Display the loss (Validation Score model2) and the accuracy (Validation Accuracy model2)
print( #PLEASE COMPLETE
print( #PLEASE COMPLETE

What do you think of these values? What do you conclude?

<em>PLEASE COMPLETE</em>

Let's now try to go into a little more detail about the performance of this architecture.

### Evaluation of LSTM results
Let's start by looking at the false positives. What situation do these false positives correspond to?

<em>PLEASE COMPLETE</em>

Now let's count these false positives and take a look at the first 10 tweets.

In [None]:
false_positives = #PLEASE COMPLETE

print('Count of false positives: ' + str(len(false_positives)))

false_positives.head(10)

Then do the same with false negatives. What network behavior do they correspond to?

<em>PLEASE COMPLETE</em>

In [None]:
false_negatives = #PLEASE COMPLETE

print('Count of false negatives: ' + str(len(false_negatives)))

false_negatives.head(10)

Once the model is trained, only a few steps remain: load the test data and use the model to label test sentences as disaster or not. First, convert the data into a TensorFlow dataset and apply the pipeline methods. The pipeline has been adjusted slightly, because we don't want shuffling and the input has a different shape (no labels).

In [None]:
tf_test_data = tf.data.Dataset.from_tensor_slices((encoded_test_sentences))


def test_pipeline(tf_data, batch_size=1):
    tf_data = tf_data.prefetch(tf.data.experimental.AUTOTUNE)
    tf_data = tf_data.padded_batch(batch_size, padded_shapes=([None]))

    return tf_data

tf_test_data = test_pipeline(tf_test_data)

print(len(tf_test_data))

# Use the model to label test sentences as disaster or not
predictions = model2.predict(tf_test_data)

predictions = np.concatenate(predictions).round().astype(int)
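The `padded_batch` step in `test_pipeline` pads every sequence in a batch to the length of the longest one. The following plain-Python sketch mimics that behavior on made-up token ids. The helper `pad_batch` is hypothetical, and the padding value 0 is assumed to match the default used for numeric tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad each sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]

# Made-up encoded sentences of different lengths.
batch = [[12, 7, 305], [4], [9, 81]]
padded = pad_batch(batch)
print(padded)
```

With `batch_size=1`, as in the pipeline above, each batch holds a single sentence, so nothing is actually padded.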

Build a submission from the predictions on the test dataset and save it in a CSV file, so that each tweet is labeled as referring to a disaster or not.

In [None]:

submission = #PLEASE COMPLETE
submission.index = #PLEASE COMPLETE
submission.to_csv('submission.csv')


submission.head()

With this study, we were able to classify tweets with recurrent neural networks, predicting whether or not each one refers to a disaster.


### Going further with NLP

The training, validation and test datasets are each likely to contain words that the others do not. If the model is only trained on the words in the training dataset, it may generalize poorly when it meets words it doesn't recognize in the validation and test datasets.

The question is: _how much of a problem is this?_
The function below takes two datasets and counts how many words match and how many don't, which will give us an idea.

In [None]:
def compare_words(train_words, test_words):
    unique_words = len(np.union1d(train_words, test_words))
    matching = #PLEASE COMPLETE
    not_in_train = len(np.setdiff1d(test_words, train_words))
    not_in_test = len(np.setdiff1d(train_words, test_words))

    print('Number of unique words across dtrain and dtest: ' + str(unique_words))
    print('Number of words found in both datasets: ' + str(matching))
    print('Number of words in the dtrain dataset and not in the test dataset: ' + str(not_in_test))
    print('Number of words in the test dataset and not in the dtrain dataset: ' + str(not_in_train))

# Comparison between training data and validation data
compare_words(encoded_sentences, val_encoded_sentences)

This shows that only 25% of words are found in both datasets, and that 16% of words are in the validation dataset but not in the training dataset. This is likely to hurt model performance.

In [None]:
# Comparison of training and test data
compare_words(encoded_sentences, encoded_test_sentences)

28% are found in both datasets.
47% are in the training dataset and not in the test dataset.
25% are in the test dataset and not in the training dataset.
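How such percentages can be derived is shown below on toy vocabularies, using Python sets. The word lists are made up and the helper `overlap_shares` is hypothetical. Shares are computed relative to the combined vocabulary of both datasets, which is an assumption consistent with the three figures above summing to 100%.

```python
def overlap_shares(train_words, test_words):
    """Return the share of the combined vocabulary falling in each group."""
    train, test = set(train_words), set(test_words)
    total = len(train | test)
    return {
        "in_both": len(train & test) / total,
        "train_only": len(train - test) / total,
        "test_only": len(test - train) / total,
    }

# Made-up vocabularies (illustrative only).
train_words = ["fire", "flood", "storm", "sunny", "rescue"]
test_words = ["fire", "storm", "quake", "calm"]

shares = overlap_shares(train_words, test_words)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```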

One alternative is a [lexical rule-based approach](https://datapeaker.com/fr/Big-Data/analyse-des-sentiments-bas%C3%A9e-sur-des-r%C3%A8gles-en-python-pour-les-scientifiques-des-donn%C3%A9es/). Such simple approaches, widely used in NLP, include TextBlob, VADER and SentiWordNet. They look for opinion words in a text and then classify the text according to the number of such words, here words announcing a disaster or not.
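A rule-based classifier of this kind can be sketched in a few lines. The two lexicons below are tiny hand-made word lists, not the lexicons shipped with TextBlob, VADER or SentiWordNet, and the function `lexicon_label` is hypothetical. It simply counts matching words and compares the two totals.

```python
# Tiny hand-made lexicons (hypothetical; real tools ship far larger ones).
DISASTER_WORDS = {"fire", "flood", "earthquake", "evacuate", "explosion"}
CALM_WORDS = {"lovely", "fun", "happy", "relax", "movie"}

def lexicon_label(tweet):
    """Return 1 if disaster words outnumber calm words in the tweet, else 0."""
    tokens = tweet.lower().split()
    disaster_hits = sum(token in DISASTER_WORDS for token in tokens)
    calm_hits = sum(token in CALM_WORDS for token in tokens)
    return int(disaster_hits > calm_hits)

print(lexicon_label("Fire spreading fast residents told to evacuate"))
print(lexicon_label("Lovely evening for a fun movie"))
```

Such a counter ignores context and word order, so a figurative use of "fire" is treated exactly like a literal one.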

# Conclusion

Congratulations, you've just completed your first natural language processing (NLP) analysis! You've analyzed tweets using the GRU and LSTM recurrent neural networks.

You've seen how to clean up corpora and visualize Twitter results, and you've been able to implement, train and evaluate an RNN, then improve its performance by trying out GRU and LSTM.

But the model's performance can still be improved: the accuracy is only 79.3%. Note that increasing the size of the network will have a significant impact on training speed. There are certainly other approaches to improving results. Here, in the word embedding step, we used GloVe, an unsupervised learning algorithm that maps words into a vector space where semantic similarity between words is reflected by the distance between them. Other word embedding techniques, such as Word2vec or Seq2seq, can also be used. There are also pre-trained models based on the Transformer architecture that you can use, such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019). To learn more about these approaches, see [this guide](https://france.devoteam.com/paroles-dexperts/lstm-transformers-gpt-bert-guide-des-principales-techniques-en-nlp/).
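The notion of semantic similarity as closeness between word vectors can be illustrated with cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are made up for the example and are not real GloVe embeddings. The helper `cosine_similarity` is written by hand rather than taken from a library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" (illustrative only).
vectors = {
    "flood": [0.9, 0.8, 0.1],
    "storm": [0.8, 0.9, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(vectors["flood"], vectors["storm"])
sim_far = cosine_similarity(vectors["flood"], vectors["pizza"])
print(f"flood/storm: {sim_close:.3f}  flood/pizza: {sim_far:.3f}")
```

With real embeddings the vectors have tens to hundreds of dimensions, but the comparison works the same way.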