**COURSE: PRDL/MLLB**

**PROJECT: Deep Learning**

**TEACHER: Luis Hernández Gómez**

**AUTHORS: MARONE Mamadou / RACHIDI Inass**

**NOTEBOOK: CNN & FFNN**

# SETUP

## INSTALLING MODULES

In [None]:
%%capture
!pip install tensorflow
!pip install tqdm

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model
import random
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import precision_recall_fscore_support
import seaborn as sns




#  Load and Prepare Data

In [None]:
# os.chdir(r"C:\Users\maron\OneDrive\02-Documents\00.ETUDES\00.ECOLE_D_INGE\00.CYCLE_ING_FORMATION_INIT\00.3EME_ANNEE_INIT\00.A_COURS\00.PRDL\06.PROJECTS")

In [None]:
df_cleaned = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/PROJET_DL_MLLB/DATA/CLEANED/corpus_cleaned.csv")

Unnamed: 0,category,title,body,text
0,ARTS & CULTURE,model agenc enabl sexual predat year former ag...,octob carolyn kramer receiv disturb phone call...,model agenc enabl sexual predat year former ag...
1,ARTS & CULTURE,actor jeff hiller talk bright color bold patte...,week talk actor jeff hiller hit broadway play ...,actor jeff hiller talk bright color bold patte...
2,ARTS & CULTURE,new yorker cover put trump hole racist comment,new yorker take presid donald trump ask u woul...,new yorker cover put trump hole racist comment...


So that the model can work with the categories, we transform them into numbers. This step is called label encoding. We use the LabelEncoder tool provided by scikit-learn to perform it automatically.

In [None]:
# Encode the labels
label_encoder = LabelEncoder()
df_cleaned['encoded_labels'] = label_encoder.fit_transform(df_cleaned['category'])
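
As a rough illustration of what label encoding does, the sketch below reproduces the idea in plain Python: the distinct categories are sorted and each one receives its position as an integer code. The helper name `fit_label_mapping` and every category except `ARTS & CULTURE` are hypothetical, chosen only for this example.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


# Hypothetical category column (only "ARTS & CULTURE" appears in the data preview)
categories = ["SPORTS", "ARTS & CULTURE", "SPORTS", "BUSINESS"]

mapping = fit_label_mapping(categories)
encoded = [mapping[name] for name in categories]

# Inverse mapping, useful to turn predictions back into category names
inverse = {index: name for name, index in mapping.items()}
decoded = [inverse[index] for index in encoded]

print(mapping)
print(encoded)  # [2, 0, 2, 1]
```

With the real encoder, `label_encoder.classes_` holds the category names in code order, and `label_encoder.inverse_transform` plays the role of `inverse` here.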

Now we can separate the features from the labels.

In [None]:
# Split the data into features and labels
X = df_cleaned['text']
y = df_cleaned['encoded_labels']

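Then we split the dataset into training and testing sets. The testing set is held out for the final evaluation; the validation data is taken from the training set during training (see `validation_split` in the call to `fit`).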

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
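
To see how much data each stage receives, here is a small sketch of the arithmetic, assuming the 0.2 test fraction used above and the 0.2 `validation_split` passed to `fit` later in this notebook. The helper `split_sizes` and the corpus size of 1000 are hypothetical. Keras takes the validation samples from the end of the training arrays, before shuffling.

```python
import math


def split_sizes(n_samples, test_size, validation_split):
    """Return (train, validation, test) sample counts for a two-stage split."""
    n_test = math.ceil(n_samples * test_size)          # held out by train_test_split
    n_train_full = n_samples - n_test                  # passed to model.fit
    n_train = math.floor(n_train_full * (1.0 - validation_split))
    n_val = n_train_full - n_train                     # held out by validation_split
    return n_train, n_val, n_test


# Hypothetical corpus size, for illustration only
n_train, n_val, n_test = split_sizes(1000, test_size=0.2, validation_split=0.2)
print(n_train, n_val, n_test)
```

Roughly 64% of the corpus is used to update the weights, 16% to monitor validation loss, and 20% for the final test.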

# TEXT PRE-PROCESSING

## Tokenize and Pad Text Data

In [None]:
# Tokenize the text
max_words = 5000
tokenizer = Tokenizer(num_words = max_words, oov_token = '<OOV>')
tokenizer.fit_on_texts(X_train)

In [None]:
# Convert text to sequences
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

In [None]:
# Pad sequences for equal length
max_length = 200
X_train_padded = pad_sequences(X_train_sequences, maxlen = max_length, padding = 'post', truncating = 'post')
X_test_padded = pad_sequences(X_test_sequences, maxlen = max_length, padding = 'post', truncating = 'post')
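
The three steps above (fit a vocabulary, convert text to integer sequences, pad to a fixed length) can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it splits on whitespace only, and the helper names are hypothetical. It follows the same index conventions as the cells above: 0 is reserved for padding, 1 for the out-of-vocabulary token, and only indices below `num_words` are kept.

```python
from collections import Counter


def fit_vocab(texts, num_words, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is the OOV token, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {oov_token: 1}
    for word, _ in counts.most_common():
        next_index = len(vocab) + 1
        if next_index >= num_words:  # keep only indices below num_words
            break
        vocab[word] = next_index
    return vocab


def to_sequence(text, vocab):
    """Replace each word by its index, or by 1 (OOV) if it is unknown."""
    return [vocab.get(word, 1) for word in text.split()]


def pad_post(sequence, maxlen):
    """Truncate at the end, then pad with zeros at the end."""
    clipped = sequence[:maxlen]
    return clipped + [0] * (maxlen - len(clipped))


# Tiny illustrative corpus
train_texts = ["new yorker cover", "new cover art"]
vocab = fit_vocab(train_texts, num_words=4)

long_row = pad_post(to_sequence("new yorker cover art extra fresh", vocab), maxlen=5)
short_row = pad_post(to_sequence("cover new", vocab), maxlen=5)
print(vocab)
print(long_row)   # [2, 1, 3, 1, 1]
print(short_row)  # [3, 2, 0, 0, 0]
```

Fitting the vocabulary on the training texts only, as done above with `X_train`, keeps words that appear only in the test set mapped to the OOV index.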

#  Build the Model

In [None]:
model = Sequential()




## Embedding

In [None]:
embedding_dim = 128

# Embedding layer
model.add(Embedding(input_dim = max_words, output_dim = embedding_dim, input_length = max_length))

## One-Dimensional Convolutional Layer

In [None]:
num_filters = 256 #(2**8)
filter_size = 3

# Convolutional layer
model.add(Conv1D(num_filters, filter_size, activation = 'relu'))

# Global max pooling layer
model.add(GlobalMaxPooling1D())
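
To make the two layers above more concrete, here is a minimal NumPy sketch of a 1D convolution without padding, followed by ReLU and global max pooling. It is an illustration under simplified assumptions (random inputs, 8 channels, 4 filters), not the Keras implementation; the function name is hypothetical.

```python
import numpy as np


def conv1d_valid_relu(x, kernel, bias):
    """x: (length, channels); kernel: (size, channels, filters); bias: (filters,)."""
    size = kernel.shape[0]
    out_len = x.shape[0] - size + 1  # no padding: 200 tokens and size 3 give 198 positions
    out = np.empty((out_len, kernel.shape[2]))
    for t in range(out_len):
        window = x[t:t + size]  # (size, channels) slice of consecutive tokens
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))          # 200 token embeddings of dimension 8
kernel = rng.normal(size=(3, 8, 4))    # 4 filters spanning 3 tokens each
bias = np.zeros(4)

features = conv1d_valid_relu(x, kernel, bias)
pooled = features.max(axis=0)          # global max pooling: one value per filter

print(features.shape, pooled.shape)    # (198, 4) (4,)
```

Global max pooling keeps only the strongest response of each filter, so the pooled vector has the same size whatever the sequence length.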

## FFNN Layers

In [None]:
# Dense layers
model.add(Dense(256, activation='relu'))

# Add a dropout to reduce overfitting
model.add(Dropout(0.5))

# Output layer
num_classes = len(label_encoder.classes_)
model.add(Dense(num_classes, activation = 'softmax'))

## Additional parameter settings & summary of the model

In [None]:
# Compile the model
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

# Display the model summary
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 100)          500000    
                                                                 
 conv1d (Conv1D)             (None, 198, 256)          77056     
                                                                 
 global_max_pooling1d (Glob  (None, 256)               0         
 alMaxPooling1D)                                                 
                                                                 
 dense (Dense)               (None, 256)               65792     
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 14)                3598      
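
The parameter counts in this summary can be checked by hand. The sketch below recomputes them from the layer sizes; the helper names are hypothetical. Note that the embedding output shape `(None, 200, 100)` and its 500000 parameters correspond to an embedding dimension of 100, so the recorded summary appears to come from a run with that value rather than 128.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one vector per vocabulary index


def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases


def dense_params(inputs, units):
    return inputs * units + units  # weights + biases


embedding_dim_in_summary = 100  # inferred from the output shape (None, 200, 100)

counts = {
    "embedding": embedding_params(5000, embedding_dim_in_summary),
    "conv1d": conv1d_params(3, embedding_dim_in_summary, 256),
    "dense": dense_params(256, 256),
    "dense_1": dense_params(256, 14),
}

for name, value in counts.items():
    print(f"{name:10s} {value}")
```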
                                                       

#  Train the Model

In [None]:
epochs = 7
batch_size = 5 #32 #10

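from tensorflow.keras.callbacks import EarlyStopping

# Define the EarlyStopping callback: stop when val_loss has not improved
# for 5 consecutive epochs and restore the best weights seen so far
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Fit the model with integer-encoded labels (as expected by sparse_categorical_crossentropy)
# and the EarlyStopping callback; 20% of the training data is held out for validation
history = model.fit(X_train_padded, y_train, epochs=epochs, batch_size=batch_size,
                    validation_split=0.2, callbacks=[early_stopping])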

Epoch 1/7


Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
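
The patience rule behind early stopping can be sketched in plain Python. This mirrors the usual logic (stop once the monitored loss has failed to improve for `patience` consecutive epochs) and is not the Keras source; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training stops here
    return len(val_losses) - 1, best_epoch  # ran through all epochs


# Hypothetical validation losses, for illustration only
stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

With `patience=5` and only 7 epochs, as configured above, the rule can only cut training short if the validation loss stops improving almost immediately.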


In [None]:
# Plot training and validation loss values
plt.plot(history.history['loss'], label = 'Training Loss')
plt.plot(history.history['val_loss'], label = 'Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# EVALUATION

In [None]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test_padded, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Test Loss: 0.9543549418449402
Test Accuracy: 0.7311046719551086
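
For reference, the two numbers reported by `model.evaluate` can be reproduced from predicted probabilities. The sketch below computes sparse categorical cross-entropy and accuracy with NumPy on a hypothetical two-sample, three-class example; it illustrates the formulas only.

```python
import numpy as np


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true (integer) class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


def accuracy(probs, labels):
    """Share of samples whose most probable class is the true class."""
    return float((probs.argmax(axis=1) == labels).mean())


# Hypothetical softmax outputs for two samples and three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])  # integer-encoded labels, as produced by the label encoder

loss_value = sparse_categorical_crossentropy(probs, labels)
acc_value = accuracy(probs, labels)
print(loss_value, acc_value)
```

A loss can stay high even when accuracy is good if the model is confident on the samples it gets wrong, which is why both values are reported.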


In [None]:
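# Predict class labels on the test set
# (predict_classes was removed from Keras; take the argmax of the softmax output instead)
y_pred = np.argmax(model.predict(X_test_padded), axis=1)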

# Calculate additional metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Test Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

# Extract precision, recall, and F1 score
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')

print("Weighted Precision:", precision)
print("Weighted Recall:", recall)
print("Weighted F1 Score:", f1_score)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
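
The weighted scores printed above can be derived directly from a confusion matrix. The following NumPy sketch shows the computation on a hypothetical two-class matrix (rows are actual classes, columns are predicted classes); the function name is illustrative.

```python
import numpy as np


def weighted_scores(conf):
    """Support-weighted precision, recall and F1 from a confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    true_pos = np.diag(conf)
    support = conf.sum(axis=1)    # samples per actual class
    predicted = conf.sum(axis=0)  # samples per predicted class

    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, support, out=np.zeros_like(true_pos), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(true_pos), where=denom > 0)

    weights = support / support.sum()
    return float(precision @ weights), float(recall @ weights), float(f1 @ weights)


# Hypothetical confusion matrix, for illustration only
w_precision, w_recall, w_f1 = weighted_scores([[8, 2],
                                               [1, 9]])
print(w_precision, w_recall, w_f1)
```

Weighted recall is always equal to overall accuracy, so it adds no information beyond the accuracy already printed; the per-class rows of the classification report are more informative when classes are imbalanced.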

# INTERPRETATION

Text-to-Image Conversion: The input text (news article) is converted into an image-like representation, with one row per token and one column per vector dimension. In this notebook the Embedding layer learns the vector for each token; TF-IDF (Term Frequency-Inverse Document Frequency) weights are an alternative way to represent words as vectors.

1D CNN Feature Extraction: A 1D CNN captures local features in the text representation, here windows of 3 consecutive tokens. This is similar to how an image-based CNN captures features in different regions of an image.

Global Pooling: Class activation mapping is usually described with Global Average Pooling (GAP), which condenses the extracted features into a single vector. The model above uses global max pooling instead, so each filter contributes only its strongest response. In both cases, this step connects the CNN features to the subsequent FFNN.

Visualization: Project the activations of the last CNN layer through the weights that connect it to the first FFNN layer. Larger values suggest that the corresponding regions of the text matter more. The result can be overlaid on the original text or displayed as a heatmap.

In [None]:
# Pick a random training article (convert to a list first: random.choice on a
# shuffled pandas Series would index by label and can raise a KeyError)
news_article_text = random.choice(X_train.tolist())

# Tokenize and pad the input text with the tokenizer fitted on the training set,
# so that word indices match those seen by the trained Embedding layer
text_sequence = tokenizer.texts_to_sequences([news_article_text])
padded_sequence = pad_sequences(text_sequence, maxlen=max_length, padding='post', truncating='post')

# Locate the Conv1D layer and the first Dense layer by type rather than by name
conv_layer = next(layer for layer in model.layers if isinstance(layer, Conv1D))
dense_layer = next(layer for layer in model.layers if isinstance(layer, Dense))

# Kernel of the first Dense layer, shape (num_filters, units)
weights = dense_layer.get_weights()[0]

# Create a model to get intermediate layer outputs
activation_model = Model(inputs=model.input, outputs=conv_layer.output)

# Get the Conv1D output for the input text, shape (1, positions, num_filters)
activations = activation_model.predict(padded_sequence)

# Calculate the importance by multiplying activations with weights, shape (positions, units)
cam_output = np.dot(activations[0], weights)

# Visualize the result
plt.imshow(cam_output.T, cmap='viridis', aspect='auto')  # Transpose: token positions on the x-axis
plt.xlabel('Token position')
plt.ylabel('Dense unit')
plt.colorbar()
plt.show()
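
Because the model uses global max pooling, each filter passes on only the position where it responds most strongly. One way to read the heatmap in terms of words, sketched below under that assumption, is to count how many filters peak at each position and map the top positions back to the token windows they cover. The helper names and the activation values are hypothetical; the tokens come from the data preview at the top of the notebook.

```python
import numpy as np


def position_votes(activations):
    """activations: (positions, filters). Count filters whose maximum falls at each position."""
    winners = activations.argmax(axis=0)
    active = activations.max(axis=0) > 0  # ignore filters that never fire
    return np.bincount(winners[active], minlength=activations.shape[0])


def top_ngrams(tokens, votes, filter_size, k=2):
    """Return the k token windows that the most filters selected."""
    order = np.argsort(-votes, kind="stable")[:k]
    return [(" ".join(tokens[p:p + filter_size]), int(votes[p])) for p in order if votes[p] > 0]


tokens = ["new", "yorker", "cover", "put", "trump"]

# Hypothetical Conv1D output: 3 positions (5 tokens, filter size 3) and 3 filters
activations = np.array([[0.1, 0.0, 0.9],
                        [0.8, 0.0, 0.2],
                        [0.3, 0.0, 0.1]])

votes = position_votes(activations)
ranked = top_ngrams(tokens, votes, filter_size=3)
print(votes.tolist())  # [1, 1, 0]
print(ranked)
```

With the real model, `activations` would be the Conv1D output for one article and `tokens` the words of that article after tokenization, truncated to `max_length`.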