# TP4 Deep Learning NLP
> This lab was completed by:
**Sandra Mourali** - **Anas Chaibi** - **Salma Ghabri** - **Aziz Bellaaj** - **Louay Badri**
---

# Importing the libraries


In [None]:
import pandas as pd
import numpy as np
import re
import nltk
import tensorflow
from nltk.corpus import stopwords
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot

# from keras_preprocessing.sequence import pad_sequences
from keras.models import Sequential

# from keras.layers.core import Activation, Dropout, Dense
from keras.utils import pad_sequences
from keras.layers import Activation, Dropout, Dense
from keras.layers import Flatten
from keras.layers import Conv1D
from keras.layers import GlobalMaxPooling1D
from keras.layers import LSTM, Embedding

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer

# Importing the dataset


In [None]:
!wget https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv

In [None]:
imdb = pd.read_csv("IMDB-Dataset.csv")

In [None]:
imdb.info()
imdb.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.countplot(x="sentiment", data=imdb)
plt.title("Class Distribution")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.show()

In [None]:
imdb.iloc[3]

# Preprocessing


## Cleaning the text data


In [None]:
TAG_RE = re.compile(r"<[^>]+>")


def remove_tags(text):
    """Strip HTML tags such as <br /> from the text."""
    return TAG_RE.sub("", text)


def preprocess_text(sen):
    """Clean one review: drop tags, non-letters, single letters, extra spaces."""
    # Remove HTML tags
    sentence = remove_tags(sen)
    # Remove punctuation and numbers
    sentence = re.sub("[^a-zA-Z]", " ", sentence)
    # Remove single characters surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)
    # Collapse multiple spaces
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence

In [None]:
X = []
sentences = list(imdb["review"])
for sen in sentences:
    X.append(preprocess_text(sen))

In [None]:
X[3]
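
To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same four regular expressions to a single made-up string. The function name `clean_review` and the sample text are assumptions introduced for illustration; they are not part of the notebook.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean_review(text):
    """Apply the same four cleaning steps to one string (illustrative copy)."""
    text = TAG_RE.sub("", text)                   # drop HTML tags
    text = re.sub("[^a-zA-Z]", " ", text)         # keep letters only
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)   # drop isolated single letters
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return text


sample = "A great film!<br /><br />I rated it 10/10."  # hypothetical review
cleaned = clean_review(sample)
print(repr(cleaned))
```

Two details are worth noticing in the result: the leading "A" survives because the single-letter pattern requires whitespace on both sides, and the string keeps one trailing space because the function never strips its output.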

## Converting the labels to numbers


In [None]:
y = imdb["sentiment"]
y = np.array(list(map(lambda x: 1 if x == "positive" else 0, y)))

## Splitting the dataset into training and test sets


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# The embedding layer


## Word-to-index dictionary


In [None]:
tokenizer = Tokenizer(num_words=5000)  # keeps word indices below 5000, i.e. the 4999 most frequent words
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [None]:
X_train[3]

Each list has a different length because the reviews have different lengths.
The script below computes the vocabulary size, then pads the training and
test sets so that every sequence has the same length.


In [None]:
# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1
maxlen = 100
X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)
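
As a rough illustration of what the padding step does, the sketch below re-implements post-padding in plain Python. It assumes the Keras defaults for everything not set in the call above, in particular `truncating="pre"`, which keeps the last `maxlen` tokens of a long sequence. The helper `pad_post` and the toy index lists are hypothetical.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad on the right and truncate on the left so every row has length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # keep the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


toy_sequences = [[4, 12, 7], [9], [3, 8, 15, 16, 23, 42]]  # hypothetical index lists
toy_padded = pad_post(toy_sequences, maxlen=4)
for row in toy_padded:
    print(row)
```

Short reviews are filled with zeros at the end, while the third sequence loses its first two tokens rather than its last two.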

In [None]:
vocab_size

All lists now have the same length, 100. The variable `vocab_size` holds the value 92547: the 92546 unique words found in the training reviews, plus one for the reserved index 0. Note that `word_index` lists every word seen during fitting, not only those retained by `num_words`.
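
The difference between the size of the word index and the `num_words` cap can be sketched with a toy tokenizer in plain Python. This is a simplified stand-in, not the Keras implementation: the helpers `build_word_index` and `to_sequence`, the whitespace tokenisation, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 stays reserved."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; drop the rest."""
    return [
        word_index[w]
        for w in text.lower().split()
        if w in word_index and word_index[w] < num_words
    ]


toy_texts = ["good movie good plot", "bad movie", "good acting"]  # hypothetical corpus
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1  # +1 for the reserved index 0
toy_seq = to_sequence("good movie bad plot", toy_index, num_words=3)
print(toy_index, toy_vocab_size, toy_seq)
```

The index still contains all five words, so the vocabulary size is 6, but with `num_words=3` only the two most frequent words appear in the encoded sequence.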


## GloVe : Global Vectors for Word Representation


Let us build a dictionary whose keys are words and whose values are the corresponding embedding vectors.


In [None]:
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()
# Parse every line of the GloVe file: the first token is the word,
# the remaining tokens are the components of its vector.
with open("glove.6B.100d.txt", encoding="utf8") as glove_file:
    for line_number, line in enumerate(glove_file):
        if line_number < 5:
            print(line)  # preview the first five lines without skipping them
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype="float32")
        embeddings_dictionary[word] = vector_dimensions
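
The parsing logic can be tried without downloading the GloVe file. The sketch below reads the same `word v1 v2 ...` text format from an in-memory buffer; the helper `load_vectors` and the two 3-dimensional vectors are made up for illustration.

```python
import io

import numpy as np


def load_vectors(handle):
    """Parse `word v1 v2 ...` lines into a dict of float32 vectors."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


# Hypothetical 3-dimensional vectors standing in for the 100-dimensional file.
fake_file = io.StringIO("the 0.1 0.2 0.3\nfilm -0.5 0.0 0.25\n")
toy_vectors = load_vectors(fake_file)
print(toy_vectors["film"], toy_vectors["film"].shape)
```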

Finally, we build an embedding matrix in which each row number corresponds to
the index of a word in the corpus. The matrix has 100 columns, and each row holds
the GloVe embedding of the corresponding word; words without a GloVe vector keep an all-zero row.


In [None]:
embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [None]:
embedding_matrix[7]
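
Since words missing from the pretrained vectors keep an all-zero row, it can be useful to measure how much of the vocabulary is actually covered. The following sketch does this on toy data; `build_embedding_matrix`, the word index, and the 2-dimensional vectors are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    found = 0
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
            found += 1
    return matrix, found / len(word_index)


toy_word_index = {"film": 1, "great": 2, "zzzz": 3}  # hypothetical index
toy_vectors = {"film": np.array([0.1, 0.2]), "great": np.array([0.3, 0.4])}
toy_matrix, coverage = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(toy_matrix.shape, round(coverage, 2))
```

Row 0 (the padding index) and the row of the unknown word both remain at zero, and the coverage ratio reports the fraction of indexed words that received a vector.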

# Modeling


## A. Classification with a simple neural network


In [None]:
# Build the sequential model
ann_model = Sequential()

# Embedding layer initialised with the GloVe matrix and kept frozen
ann_model.add(
    Embedding(
        vocab_size,
        100,
        weights=[embedding_matrix],
        input_length=maxlen,
        trainable=False,
    )
)

# Flatten the (maxlen, 100) embedding output into a single vector
ann_model.add(Flatten())

# Single sigmoid unit for binary classification
ann_model.add(Dense(units=1, activation="sigmoid"))

# Compile the model
ann_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

# Train the model
history = ann_model.fit(
    X_train, y_train, epochs=6, batch_size=128, verbose=1, validation_split=0.2
)
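
As a worked example of the model's size, the arithmetic below counts the parameters of an Embedding, Flatten, Dense(1) stack, using the values quoted in this notebook (`vocab_size` of 92547, 100-dimensional embeddings, `maxlen` of 100). The helper name is an assumption.

```python
def dense_network_params(vocab_size, embed_dim, maxlen):
    """Parameter counts for Embedding -> Flatten -> Dense(1)."""
    embedding = vocab_size * embed_dim   # lookup table, frozen in this model
    dense = maxlen * embed_dim + 1       # one weight per flattened input, plus a bias
    return embedding, dense


frozen, trainable = dense_network_params(vocab_size=92547, embed_dim=100, maxlen=100)
print(frozen, trainable)
```

Almost all parameters sit in the frozen embedding table; the part that actually learns is a logistic regression over 10,000 flattened inputs.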

In [None]:
score = ann_model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

In [None]:
plt.plot(history.history["acc"])
plt.plot(history.history["val_acc"])
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()

## B. Classification with a convolutional neural network


In [None]:
from keras.layers import Conv1D, MaxPooling1D

cnn_model = Sequential()

cnn_model.add(
    Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen)
)

cnn_model.add(Conv1D(filters=128, kernel_size=5, activation="relu"))

cnn_model.add(MaxPooling1D())

cnn_model.add(Flatten())

cnn_model.add(Dense(units=1, activation="sigmoid"))

cnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

history = cnn_model.fit(
    X_train, y_train, epochs=6, batch_size=128, verbose=1, validation_split=0.2
)
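
To follow the tensor shapes through the convolutional model, the sketch below computes the sequence length after each layer. It assumes the Keras defaults that the code above does not override: `padding="valid"` and a stride of 1 for `Conv1D`, and `pool_size=2` for `MaxPooling1D`. The helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_length(length, pool_size=2):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size


after_conv = conv1d_valid_length(100, kernel_size=5)
after_pool = maxpool1d_length(after_conv)
flattened = after_pool * 128  # 128 filters per remaining position
print(after_conv, after_pool, flattened)
```

Under these assumptions the 100-token input shrinks to 96 positions after the convolution and 48 after pooling, so the final dense layer sees 6144 features.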

In [None]:
score = cnn_model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

In [None]:
plt.plot(history.history["acc"])
plt.plot(history.history["val_acc"])
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()

## C. Classification with a recurrent neural network (LSTM)


In [None]:
lstm_model = Sequential()

lstm_model.add(
    Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen)
)

lstm_model.add(LSTM(units=128))

lstm_model.add(Dense(units=1, activation="sigmoid"))

lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

history = lstm_model.fit(
    X_train, y_train, epochs=6, batch_size=128, verbose=1, validation_split=0.2
)
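
The size of the recurrent layer follows from the standard LSTM structure: four gates, each with input weights, recurrent weights, and a bias. The short calculation below applies that formula to the 100-dimensional input and 128 units used here; the function name is an assumption.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


lstm_trainable = lstm_params(input_dim=100, units=128)
dense_trainable = 128 + 1  # final sigmoid unit: 128 weights and a bias
print(lstm_trainable, dense_trainable)
```

The same function shows how the layer grows with wider embeddings, for example `lstm_params(300, 128)` for 300-dimensional vectors.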

In [None]:
score = lstm_model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

In [None]:
plt.plot(history.history["acc"])
plt.plot(history.history["val_acc"])
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()

- Accuracy of the 1st classifier (simple neural network): 0.843999981880188
- Accuracy of the 2nd classifier (CNN): 0.8684999942779541
- Accuracy of the 3rd classifier (LSTM): 0.8684999942779541

In [None]:
single_input_data = X_test[4]
input_data = np.array([single_input_data])
predicted_prob = lstm_model.predict(input_data)

# predict returns an array of shape (1, 1); extract the scalar before rounding
predicted_class = int(np.round(predicted_prob[0, 0]))

print("Predicted Probability:", predicted_prob)
print("Predicted Class:", predicted_class)
print("Correct Class:", y_test[4])

# Embedding / classifier comparison


## Utils


In [None]:
def plot_results(history):
    plt.plot(history.history["acc"])
    plt.plot(history.history["val_acc"])
    plt.title("model accuracy")
    plt.ylabel("accuracy")
    plt.xlabel("epoch")
    plt.legend(["train", "validation"], loc="upper left")
    plt.show()
    plt.plot(history.history["loss"])
    plt.plot(history.history["val_loss"])
    plt.title("model loss")
    plt.ylabel("loss")
    plt.xlabel("epoch")
    plt.legend(["train", "validation"], loc="upper left")
    plt.show()


def evaluate_model(model):
    score = model.evaluate(X_test, y_test, verbose=0)
    print("Test Score:", score[0])
    print("Test Accuracy:", score[1])

## Models


In [None]:
from keras.layers import GRU

from keras.callbacks import EarlyStopping


def train_lstm(out_dim):
    lstm_model = Sequential()

    lstm_model.add(
        Embedding(vocab_size, out_dim, weights=[embedding_matrix], input_length=maxlen)
    )

    lstm_model.add(LSTM(units=128))

    lstm_model.add(Dense(units=1, activation="sigmoid"))

    lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

    early_stopping = EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )

    history = lstm_model.fit(
        X_train,
        y_train,
        epochs=6,
        batch_size=128,
        verbose=1,
        validation_split=0.2,
        callbacks=[early_stopping],
    )

    return lstm_model, history


def train_cnn_rnn(out_dim):
    model = Sequential()

    model.add(
        Embedding(vocab_size, out_dim, weights=[embedding_matrix], input_length=maxlen)
    )

    model.add(Conv1D(filters=128, kernel_size=5, activation="relu"))

    model.add(MaxPooling1D())
    model.add(LSTM(units=128))

    model.add(Flatten())

    model.add(Dense(units=1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

    early_stopping = EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )

    history = model.fit(
        X_train,
        y_train,
        epochs=6,
        batch_size=128,
        verbose=1,
        validation_split=0.2,
        callbacks=[early_stopping],
    )

    return model, history


def train_gru(out_dim):
    model = Sequential()

    model.add(
        Embedding(vocab_size, out_dim, weights=[embedding_matrix], input_length=maxlen)
    )

    model.add(GRU(units=128))

    model.add(Flatten())

    model.add(Dense(units=1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

    early_stopping = EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )

    history = model.fit(
        X_train,
        y_train,
        epochs=6,
        batch_size=128,
        verbose=1,
        validation_split=0.2,
        callbacks=[early_stopping],
    )

    return model, history
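
All three training functions rely on the same early-stopping rule: stop once the validation loss has failed to improve for `patience` consecutive epochs. The plain-Python sketch below simulates that rule on a made-up list of losses. It is an illustration of the logic, not the Keras callback itself, and the function name and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop: no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran through all epochs


toy_losses = [0.40, 0.35, 0.36, 0.37, 0.38, 0.30]  # hypothetical validation losses
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch, best_epoch)
```

With these losses, training stops after the fifth epoch (index 4) and the best epoch is the second (index 1), so the sixth epoch never runs. When the stop triggers, `restore_best_weights=True` rolls the model back to the weights of `best_epoch`. With `epochs=6` and `patience=3`, the rule can fire at the fourth epoch at the earliest.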

In [None]:
def experience(out_dim, train):
    model, history = train(out_dim)
    plot_results(history)
    evaluate_model(model)

In [None]:
experience(100, train_cnn_rnn)

In [None]:
experience(100, train_gru)


| Model     | GloVe  |
| --------- | ------ |
| LSTM      | 0.8655 |
| CNN + RNN | 0.8672 |
| GRU       | 0.8768 |


## word2vec


In [None]:
!wget https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz

In [None]:
!gunzip GoogleNews-vectors-negative300-SLIM.bin.gz


In [None]:
from gensim.models import KeyedVectors

# Load word2vec model
word2vec_model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300-SLIM.bin", binary=True
)

# Create an empty dictionary to store word vectors
embeddings_dictionary = {}

# Iterate through the word2vec model vocabulary and store word vectors in the dictionary
for word in word2vec_model.key_to_index:
    embeddings_dictionary[word] = word2vec_model[word]

# Initialize the embedding matrix with zeros
embedding_matrix = zeros((vocab_size, word2vec_model.vector_size))

# Iterate through tokenizer word index and update the embedding matrix with word vectors
for word, index in tokenizer.word_index.items():
    if word in embeddings_dictionary:
        embedding_matrix[index] = embeddings_dictionary[word]

In [None]:
from scipy.spatial.distance import cosine

# Calculate cosine similarity between "dog" and "cat"
dog_embedding = embedding_matrix[tokenizer.word_index["dog"]]
cat_embedding = embedding_matrix[tokenizer.word_index["cat"]]
similarity = 1 - cosine(dog_embedding, cat_embedding)
print(f"Similarity between 'dog' and 'cat': {similarity:.2f}")
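
Because words without a pretrained vector keep an all-zero row in the embedding matrix, a cosine similarity involving such a word has no defined direction. The NumPy sketch below spells out the formula and guards against that case; the function name and the toy vectors are assumptions for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; None if either vector is all zeros."""
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return None  # the word has no pretrained vector (row left at zero)
    return float(np.dot(u, v) / (norm_u * norm_v))


a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, b))
print(cosine_similarity(a, np.zeros(2)))
```

Returning `None` for a zero row makes a missing embedding visible instead of letting it surface as a confusing numeric result.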

In [None]:
embedding_matrix.shape[1]

In [None]:
experience(embedding_matrix.shape[1], train_lstm)

In [None]:
experience(embedding_matrix.shape[1], train_cnn_rnn)

In [None]:
experience(embedding_matrix.shape[1], train_gru)

| Model     | Word2vec |
| --------- | -------- |
| LSTM      | 0.863 |
| CNN + RNN | 0.872 |
| GRU       | 0.861 |


## [fasttext](https://www.kaggle.com/code/vsmolyakov/keras-cnn-with-fasttext-embeddings)


In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

In [None]:
import codecs

embeddings_dictionary = {}
f = codecs.open("wiki.en.vec", encoding="utf-8")
for line in f:
    values = line.rstrip().rsplit(" ")
    word = values[0]
    coef = np.asarray(values[1:], dtype="float32")
    embeddings_dictionary[word] = coef
f.close()

In [None]:
embedding_matrix = zeros((vocab_size, 300))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        if embedding_vector.shape == (0,):
            continue
        embedding_matrix[index] = embedding_vector

In [None]:
embedding_matrix[200].shape

In [None]:
# Calculate cosine similarity between "dog" and "cat"
dog_embedding = embedding_matrix[tokenizer.word_index["dog"]]
cat_embedding = embedding_matrix[tokenizer.word_index["cat"]]
similarity = 1 - cosine(dog_embedding, cat_embedding)
print(f"Similarity between 'dog' and 'cat': {similarity:.2f}")

In [None]:
# import fasttext.util

# fasttext.util.download_model('en', if_exists='ignore')
# ft = fasttext.load_model('cc.en.300.bin')

In [None]:
experience(embedding_matrix.shape[1], train_lstm)

In [None]:
experience(embedding_matrix.shape[1], train_cnn_rnn)

In [None]:
experience(embedding_matrix.shape[1], train_gru)

| Model     | FastText |
| --------- | -------- |
| LSTM      | 0.861 |
| CNN + RNN | 0.873 |
| GRU       | 0.874 |


## TF-IDF


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(X)
tfidf_matrix = tfidf_matrix.toarray()

In [None]:
tfidf_matrix[17].shape
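
To show what one row of the TF-IDF matrix contains, here is a small pure-Python version of the weighting. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenisation, the function name, and the three toy reviews are simplifications introduced for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenised for w in doc})
    n = len(docs)
    df = Counter(w for doc in tokenised for w in set(doc))  # document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


toy_docs = ["good movie", "bad movie", "good good plot"]  # hypothetical reviews
toy_vocab, toy_rows = tfidf(toy_docs)
for row in toy_rows:
    print([round(x, 3) for x in row])
```

In the second review, "bad" receives a larger weight than "movie" because it appears in fewer documents. Each row is a fixed-length vector over the vocabulary, with no information about word order.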

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, y, test_size=0.2, random_state=42
)

In [None]:
from keras.layers import Reshape


def train_lstm_with_tfidf(output_dim=128):
    lstm_model = Sequential()
    lstm_model.add(Dense(64, activation="relu", input_shape=(X_train.shape[1],)))
    lstm_model.add(Reshape((64, 1)))

    lstm_model.add(LSTM(128))
    lstm_model.add(Dense(1, activation="sigmoid"))

    lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

    early_stopping = EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )

    history = lstm_model.fit(
        X_train,
        y_train,
        epochs=6,
        batch_size=128,
        verbose=1,
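        callbacks=[early_stopping],
        # validation_data takes precedence over validation_split, so the test set
        # also drives early stopping here; the test accuracy may be optimistic.
        validation_data=(X_test, y_test),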
    )

    return lstm_model, history


def train_cnn_rnn_with_tfidf(output_dim=128):
    model = Sequential()
    model.add(Dense(64, activation="relu", input_shape=(X_train.shape[1],)))
    model.add(Reshape((64, 1)))
    model.add(Conv1D(filters=128, kernel_size=5, activation="relu"))
    model.add(MaxPooling1D())
    model.add(LSTM(units=128))
    model.add(Flatten())
    model.add(Dense(units=1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

    early_stopping = EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )

    history = model.fit(
        X_train,
        y_train,
        epochs=6,
        batch_size=128,
        verbose=1,
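        callbacks=[early_stopping],
        # validation_data takes precedence over validation_split, so the test set
        # also drives early stopping here; the test accuracy may be optimistic.
        validation_data=(X_test, y_test),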
    )

    return model, history


def train_gru_with_tfidf(output_dim=128):

    model = Sequential()
    model.add(Dense(64, activation="relu", input_shape=(X_train.shape[1],)))
    model.add(Reshape((64, 1)))
    model.add(GRU(units=128))
    model.add(Flatten())
    model.add(Dense(units=1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])

    early_stopping = EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )

    history = model.fit(
        X_train,
        y_train,
        epochs=6,
        batch_size=128,
        verbose=1,
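        callbacks=[early_stopping],
        # validation_data takes precedence over validation_split, so the test set
        # also drives early stopping here; the test accuracy may be optimistic.
        validation_data=(X_test, y_test),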
    )

    return model, history

An embedding layer maps integer word indices to dense vectors, so it applies when each review is a sequence of discrete tokens. Here the input is a TF-IDF vector: one continuous weight per vocabulary term, with no word order. An embedding layer is therefore not needed for this input. The models above instead project the TF-IDF vector through a dense layer and reshape the result into a sequence for the recurrent or convolutional layers.
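
The shape changes performed by the `Dense(64)` and `Reshape((64, 1))` layers can be traced with NumPy alone. The sketch below uses random weights and a random batch in place of real TF-IDF rows, so only the shapes are meaningful; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((4, 1000))               # 4 hypothetical TF-IDF rows, 1000 features
weights = rng.standard_normal((1000, 64))   # stand-in for the Dense(64) kernel
bias = np.zeros(64)

projected = np.maximum(batch @ weights + bias, 0.0)  # Dense(64, relu) -> (4, 64)
as_sequence = projected.reshape(4, 64, 1)            # Reshape((64, 1)) -> (4, 64, 1)
print(projected.shape, as_sequence.shape)
```

The recurrent layer then reads 64 time steps with one feature each. That "time" axis is the order of the dense units, not the order of words in the review.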


In [None]:
model, history = train_lstm_with_tfidf(128)

In [None]:
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

In [None]:
from keras.layers import Conv1D, MaxPooling1D

In [None]:
model, history = train_cnn_rnn_with_tfidf(128)
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

In [None]:
model, history = train_gru_with_tfidf(128)
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

| Model     | TF-IDF accuracy    |
| --------- | ------------------ |
| LSTM      | 0.8693000078201294 |
| CNN + RNN | 0.8691999912261963 |
| GRU       | 0.8708000183105469 |


| Model     | GloVe      | Word2vec | FastText  | TF-IDF     |
| --------- | ---------- | -------- | --------- | ---------- |
| LSTM      | 0.8655     | 0.863    | 0.861     | **0.8693** |
| CNN + RNN | 0.8672     | 0.872    | **0.873** | 0.8692     |
| GRU       | **0.8768** | 0.861    | 0.874     | 0.8708     |

GRU combined with GloVe embeddings reaches the highest test accuracy (0.8768) among the combinations tested. The gaps are small, however: every accuracy in the table falls between 0.861 and 0.877.
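
The comparison can also be read programmatically. The snippet below copies the accuracies from the summary table into a dictionary, then extracts the best combination and the spread between the best and worst results. Only the variable names are introduced here; the numbers are those reported in the table.

```python
results = {
    "LSTM": {"GloVe": 0.8655, "Word2vec": 0.863, "FastText": 0.861, "TF-IDF": 0.8693},
    "CNN + RNN": {"GloVe": 0.8672, "Word2vec": 0.872, "FastText": 0.873, "TF-IDF": 0.8692},
    "GRU": {"GloVe": 0.8768, "Word2vec": 0.861, "FastText": 0.874, "TF-IDF": 0.8708},
}

# Flatten to (accuracy, model, embedding) triples and pick the highest accuracy.
triples = [(acc, model, emb) for model, row in results.items() for emb, acc in row.items()]
best_acc, best_model, best_embedding = max(triples)
spread = best_acc - min(acc for acc, _, _ in triples)
print(best_model, best_embedding, best_acc, round(spread, 4))
```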