## 1. Reading the CSV

In [110]:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly import graph_objects as go
from sklearn.metrics import classification_report, confusion_matrix
import tiktoken
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split
from torchinfo import summary

In [111]:
df_spam = pd.read_csv("../datas/spam_clean.csv", encoding="iso-8859-1")
df_spam.head()

Unnamed: 0,label_text,message,label
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


---

## 2. Tokenization

I tokenize the messages with the "cl100k_base" tokenizer (based on byte pair encoding, BPE).

In [112]:
tokenizer = tiktoken.get_encoding("cl100k_base")

def encode_texts(texts):
    return [tokenizer.encode(text) for text in texts]

tokens = encode_texts(df_spam["message"])

In [113]:
tokens[0][:10]

[11087, 3156, 16422, 647, 1486, 11, 14599, 497, 16528, 1193]

In [114]:
tokens[1][:10]

[11839, 45555, 1131, 622, 10979, 289, 333, 577, 389, 72]
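To give an intuition for the byte pair encoding mentioned above, here is a toy sketch in plain Python. It is not how `tiktoken` is implemented: the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the tiny corpus are made up for illustration, and it works on characters rather than bytes. The idea it shows is the core loop only: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most frequent one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words (toy version)."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words

# Made-up corpus: "ee" is the most frequent pair, then "r" + "ee"
merges, words = learn_bpe(["free", "free", "free", "fee", "tree"], num_merges=2)
print(merges)    # [('e', 'e'), ('r', 'ee')]
print(words[0])  # ['f', 'ree']
```

A real tokenizer such as "cl100k_base" ships with a fixed, pre-learned merge table, which is why `tokenizer.encode` returns the same ids for the same text on every run.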

NLP models usually require sequences (lists of tokens) of uniform length.

Mean sequence length:

In [115]:
seq_lens = [len(seq) for seq in tokens]
np.mean(seq_lens)

np.float64(22.893933955491743)

Distribution of sequence lengths

In [116]:
fig = px.histogram(seq_lens, nbins=30, color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(title="Distribution of sequence lengths",
                  yaxis_title="Number of sequences", xaxis_title="Sequence length", title_x=0.5, showlegend=False)
fig.show()

The mean sequence length is about 23 tokens (22.9). We keep sequences of 30 tokens: longer messages are truncated and shorter ones are padded.

In [117]:
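def pad_sequences(sequences, max_length=30):
    # Truncate longer sequences to max_length and right-pad shorter ones with 0.
    # Caveat: id 0 is also a real token in cl100k_base (it decodes to "!"),
    # so padding and a genuine "!" share the same id.
    return [seq[:max_length] + [0] * (max_length - len(seq)) for seq in sequences]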

tokens = pad_sequences(tokens)
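The decoded outputs further down in this notebook show that id 0 is a real token: `[81710, 0, 4718, ...]` decodes to "PRIVATE! Your ...", and padded messages end in a run of "!". A possible variant, sketched below, reserves a padding id outside the vocabulary. `PAD_ID`, `pad_sequences_with_id` and `coverage` are hypothetical names, and using the vocabulary size as the padding id is an assumption, not what this notebook does.

```python
VOCAB_SIZE = 100277   # size of the embedding table shown in the model summary
PAD_ID = VOCAB_SIZE   # assumption: first id that no real token uses

def pad_sequences_with_id(sequences, max_length=30, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then right-pad it with pad_id."""
    return [seq[:max_length] + [pad_id] * (max_length - len(seq)) for seq in sequences]

def coverage(seq_lens, max_length):
    """Fraction of sequences that fit within max_length without truncation."""
    return sum(length <= max_length for length in seq_lens) / len(seq_lens)

# Made-up sequences: a short one containing a real 0 token, and a long one
sequences = [[11649, 13, 0], list(range(1, 41))]
padded = pad_sequences_with_id(sequences, max_length=5)
print(padded[0])  # [11649, 13, 0, 100277, 100277]
print(padded[1])  # [1, 2, 3, 4, 5]

share_kept_whole = coverage([len(seq) for seq in sequences], max_length=30)
print(share_kept_whole)  # 0.5
```

With this variant the embedding layer would need `VOCAB_SIZE + 1` rows and `padding_idx=PAD_ID`. Applying `coverage` to `seq_lens` is also a quick way to check how many messages the 30-token limit truncates.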

---

## 3. Dataset and data split

Build the Dataset and the DataLoaders, and split the messages into a training set (80%) and a validation set (20%).

In [118]:
# Class ATTDataset
class ATTDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = torch.tensor(texts, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

df_dataset = ATTDataset(tokens, df_spam["label"])

# Split dataset into training (80%) and validation (20%)
train_size = int(0.8 * len(df_dataset))
val_size = len(df_dataset) - train_size
train_dataset, val_dataset = random_split(df_dataset, [train_size, val_size])

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
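The `random_split` call above is not seeded, so the split changes from run to run, and it does not guarantee the same spam share in both subsets. A minimal sketch of a seeded, stratified split in plain Python follows. `stratified_split_indices` is a hypothetical helper and the labels are made up; the resulting index lists could then be passed to `torch.utils.data.Subset`.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, val_fraction=0.2, seed=42):
    """Split indices into train/validation while keeping class proportions."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Made-up labels: 80 ham (0) and 20 spam (1)
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split_indices(labels)
print(len(train_idx), len(val_idx))     # 80 20
print(sum(labels[i] for i in val_idx))  # 4
```

If reproducibility alone is enough, `random_split` also accepts a `generator` argument, for example `torch.Generator().manual_seed(42)`.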

In [119]:
text, label = next(iter(train_loader))
print(label)
print(text)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.])
tensor([[ 5519,   577,  2103,   617, 42272,  1941, 17401,   323,   264, 60588,
           584,  1436, 17636,    30,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [11649,    13,  1472,  1440,  1148,   602,  3152,    13,  2991,   287,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [57744,   816,    46,   816,    46,  7866,  9060,  8662,  5045,  3202,
            30,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [19701,    11,   358,  3358,  1650,  3010,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     

---

## 4. First prediction model (Classification)

#### Model definition

1 embedding layer (maps each token id to a vector)

1 pooling layer (averages the embeddings over the sequence)

1 Linear layer

Sigmoid activation, since this is a binary classification problem

In [120]:
vocab_size = tokenizer.n_vocab

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.pooling = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text):
        embedded = self.embedding(text)
        pooled = self.pooling(embedded.permute(0, 2, 1)).squeeze(2)
        return torch.sigmoid(self.fc(pooled))

model = TextClassifier(vocab_size=vocab_size,
                      embed_dim=16,
                      num_class=1)
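One detail of the pooling step is worth spelling out. With `padding_idx=0`, padded positions map to zero vectors, and `AdaptiveAvgPool1d(1)` averages over all 30 positions, so the pooled vector of a short message is scaled down by its padding. The plain-Python sketch below illustrates the arithmetic with made-up numbers and compares it with a masked mean, a common alternative that this notebook's model does not use. `mean_pool` and `masked_mean_pool` are hypothetical helpers.

```python
def mean_pool(vectors):
    """Average over every position (what average pooling over the sequence axis does)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def masked_mean_pool(vectors, token_ids, pad_id=0):
    """Average only over positions whose token id is not the padding id."""
    kept = [v for v, t in zip(vectors, token_ids) if t != pad_id]
    if not kept:  # sequence made of padding only
        return [0.0] * len(vectors[0])
    return mean_pool(kept)

# Made-up example: 2 real tokens followed by 2 padded positions, embed_dim = 2
token_ids = [5, 7, 0, 0]
embedded = [[2.0, 4.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

plain = mean_pool(embedded)
masked = masked_mean_pool(embedded, token_ids)
print(plain)   # [1.5, 1.0]
print(masked)  # [3.0, 2.0]
```

In PyTorch, the same idea could be written with a mask such as `(text != 0).unsqueeze(-1)`, dividing the summed embeddings by the number of real tokens.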

In [121]:

print(model)

# Print model summary
summary(model, input_data=text)

TextClassifier(
  (embedding): Embedding(100277, 16, padding_idx=0)
  (pooling): AdaptiveAvgPool1d(output_size=1)
  (fc): Linear(in_features=16, out_features=1, bias=True)
)


Layer (type:depth-idx)                   Output Shape              Param #
TextClassifier                           [32, 1]                   --
├─Embedding: 1-1                         [32, 30, 16]              1,604,432
├─AdaptiveAvgPool1d: 1-2                 [32, 16, 1]               --
├─Linear: 1-3                            [32, 1]                   17
Total params: 1,604,449
Trainable params: 1,604,449
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 51.34
Input size (MB): 0.01
Forward/backward pass size (MB): 0.12
Params size (MB): 6.42
Estimated Total Size (MB): 6.55

#### Training

Loss function: binary cross-entropy, suited to binary classification

Optimizer: Adam

The model is trained for 20 epochs
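As a reminder of what the binary cross-entropy computes, here is a small worked sketch in plain Python. For a predicted probability `p` and a label `y`, the per-example loss is `-(y*log(p) + (1-y)*log(1-p))`, and `nn.BCELoss` averages it over the batch by default. The function names are made up for illustration; the second function shows the equivalent, numerically stable form computed from the logit `z` (the value before the sigmoid).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_prob(p, y):
    """Binary cross-entropy for one example, from a probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z, y):
    """Same loss from the logit z, in a form that avoids log(0) for large |z|."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def mean_bce(probs, labels):
    """Batch loss: average of the per-example losses."""
    return sum(bce_from_prob(p, y) for p, y in zip(probs, labels)) / len(probs)

# A confident correct prediction costs little, a confident wrong one costs a lot
print(round(bce_from_prob(0.9, 1), 4))  # 0.1054
print(round(bce_from_prob(0.9, 0), 4))  # 2.3026

# Both forms agree on a moderate logit
z = 1.5
print(math.isclose(bce_from_prob(sigmoid(z), 1), bce_from_logit(z, 1)))  # True
```

PyTorch provides `nn.BCEWithLogitsLoss`, which combines the sigmoid and the loss in this stable way; using it would require the model to return the raw output of the Linear layer instead of applying `torch.sigmoid`.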

In [122]:
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train(model, train_loader, val_loader, criterion, optimizer, epochs=100):

    # Dictionary to store training & validation loss and accuracy over epochs
    history = {"loss": [], "val_loss": [], "accuracy": [], "val_accuracy": []}

    for epoch in range(epochs):  # Loop over the number of epochs
        model.train()  # Set model to training mode
        total_loss, correct = 0, 0  # Initialize total loss and correct predictions

        # Training loop
        for inputs, labels in train_loader:
            optimizer.zero_grad()  # Reset gradients before each batch
            outputs = model(inputs).squeeze(1)  # Forward pass; squeeze only dim 1 so a batch of size 1 keeps its shape
            loss = criterion(outputs, labels)  # Compute loss
            loss.backward()  # Backpropagation (compute gradients)
            optimizer.step()  # Update model parameters

            total_loss += loss.item()  # Accumulate batch loss
            correct += ((outputs > 0.5) == labels).sum().item()  # Count correct predictions

        # Compute average loss and accuracy for training
        train_loss = total_loss / len(train_loader)
        train_acc = correct / len(train_loader.dataset)

        # Validation phase (without gradient computation)
        model.eval()  # Set model to evaluation mode
        val_loss, val_correct = 0, 0
        with torch.no_grad():  # No need to compute gradients during validation
            for inputs, labels in val_loader:
                outputs = model(inputs).squeeze(1)  # Forward pass; squeeze only dim 1 so a batch of size 1 keeps its shape
                loss = criterion(outputs, labels)  # Compute loss
                val_loss += loss.item()  # Accumulate validation loss
                val_correct += ((outputs > 0.5) == labels).sum().item()  # Count correct predictions

        # Compute average loss and accuracy for validation
        val_loss /= len(val_loader)
        val_acc = val_correct / len(val_loader.dataset)

        # Store metrics in history dictionary
        history["loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["accuracy"].append(train_acc)
        history["val_accuracy"].append(val_acc)

        # Print training progress
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {train_loss:.4f}, Acc: {train_acc:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

    return history  # Return training history

history = train(model,
                train_loader=train_loader,
                val_loader=val_loader,
                criterion=criterion,
                optimizer=optimizer,
                epochs=20)

Epoch [1/20], Loss: 0.6295, Acc: 0.8212, Val Loss: 0.5860, Val Acc: 0.8717
Epoch [2/20], Loss: 0.5290, Acc: 0.8773, Val Loss: 0.4817, Val Acc: 0.8825
Epoch [3/20], Loss: 0.4233, Acc: 0.8932, Val Loss: 0.3840, Val Acc: 0.9013
Epoch [4/20], Loss: 0.3350, Acc: 0.9177, Val Loss: 0.3069, Val Acc: 0.9247
Epoch [5/20], Loss: 0.2671, Acc: 0.9385, Val Loss: 0.2492, Val Acc: 0.9363
Epoch [6/20], Loss: 0.2164, Acc: 0.9547, Val Loss: 0.2060, Val Acc: 0.9570
Epoch [7/20], Loss: 0.1777, Acc: 0.9652, Val Loss: 0.1735, Val Acc: 0.9695
Epoch [8/20], Loss: 0.1492, Acc: 0.9722, Val Loss: 0.1497, Val Acc: 0.9776
Epoch [9/20], Loss: 0.1274, Acc: 0.9771, Val Loss: 0.1315, Val Acc: 0.9794
Epoch [10/20], Loss: 0.1106, Acc: 0.9807, Val Loss: 0.1175, Val Acc: 0.9812
Epoch [11/20], Loss: 0.0973, Acc: 0.9843, Val Loss: 0.1062, Val Acc: 0.9839
Epoch [12/20], Loss: 0.0860, Acc: 0.9863, Val Loss: 0.0974, Val Acc: 0.9848
Epoch [13/20], Loss: 0.0768, Acc: 0.9883, Val Loss: 0.0902, Val Acc: 0.9848
Epoch [14/20], Loss: 

#### Saving the model

In [123]:
checkpoint_path = "../models/AT_T_DeepLearning__Model.pth"
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "history": history,
}, checkpoint_path)

#### Analysis of the results

Visualization of the loss and the accuracy

In [124]:
color_chart = ["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]

fig = go.Figure(data=[
                      go.Scatter(
                          y=history["loss"],
                          name="Training loss",
                          mode="lines",
                          marker=dict(
                              color=color_chart[0]
                          )),
                      go.Scatter(
                          y=history["val_loss"],
                          name="Validation loss",
                          mode="lines",
                          marker=dict(
                              color=color_chart[1]
                          ))
])
fig.update_layout(
    title="Training and val loss across epochs",
    xaxis_title="epochs",
    yaxis_title="Cross Entropy"
)
fig.show()

In [125]:
color_chart = ["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]

fig = go.Figure(data=[
                      go.Scatter(
                          y=history["accuracy"],
                          name="Training Accuracy",
                          mode="lines",
                          marker=dict(
                              color=color_chart[0]
                          )),
                      go.Scatter(
                          y=history["val_accuracy"],
                          name="Validation Accuracy",
                          mode="lines",
                          marker=dict(
                              color=color_chart[1]
                          ))
])
fig.update_layout(
    title="Training and val Accuracy across epochs",
    xaxis_title="epochs",
    yaxis_title="Accuracy"
)
fig.show()

In [126]:
final_history = {key: valeur[-1] for key, valeur in history.items()}
print(final_history)

{'loss': 0.038253845780023506, 'val_loss': 0.0624382901138493, 'accuracy': 0.9943908458604442, 'val_accuracy': 0.989237668161435}


At first glance, the model classifies spam and ham rather well.

On the training set, the loss is 0.038 and the accuracy is 0.994.

On the validation set, the loss is 0.062 and the accuracy is 0.989.
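Because the classes are imbalanced, these accuracies are easier to read next to a trivial baseline. The sketch below computes the accuracy of a classifier that always predicts the majority class, using the validation supports that appear in the classification report of this notebook (965 ham, 150 spam). `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Supports taken from the validation classification report
val_counts = {"ham": 965, "spam": 150}
baseline = majority_baseline_accuracy(val_counts)
print(f"{baseline:.3f}")  # 0.865
```

A validation accuracy of 0.989 is therefore well above the 0.865 that "always ham" would reach, but accuracy alone hides how many spam messages are missed, which is why the per-class recall is examined later.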

#### Error analysis: where the model got it wrong

In [127]:
# Function to evaluate the model and get worst predictions
def evaluate_worst_predictions(model, dataloader, tokenizer):
    # Set model to evaluation mode to disable dropout and batch normalization
    model.eval()

    # Lists to store all predictions, labels, errors, and inputs for analysis
    list_predictions = []
    list_labels = []
    list_errors = []
    list_inputs = []

    # No gradients needed during evaluation for efficiency
    with torch.no_grad():
        for batch in dataloader:
            # Extract inputs and labels from the batch
            inputs, labels = batch
            outputs = model(inputs) # Forward pass: Get model predictions

            # Convert outputs to predicted class for classification problems
            #preds = torch.argmax(outputs, dim=1)
            preds = (outputs >= 0.5).int().squeeze()
            errors = (preds != labels).float()  # Misclassified observations
            
            # Store predictions, labels, errors, and raw inputs for further analysis
            list_predictions.extend(preds.cpu().numpy())
            list_labels.extend(int(x) for x in labels.cpu().numpy())
            list_errors.extend(errors.cpu().numpy())
            list_inputs.extend(inputs.cpu().numpy())

    # Convert stored results into a Pandas DataFrame for easy analysis
    # Decode tokenized text back into human-readable text
    df_results = pd.DataFrame({
        "True_Label": list_labels,
        "Predicted": list_predictions,
        "Error": list_errors,
        "Inputs": list_inputs,
        "Text" : [tokenizer.decode(input) for input in list_inputs]
    })

    # Sort so that misclassified rows (Error = 1.0) come first
    df_results_sorted = df_results.sort_values(by="Error", ascending=False)

    # Return the sorted DataFrame (misclassified rows first)
    return df_results_sorted

# Evaluate worst predictions on validation set
worst_predictions_val = evaluate_worst_predictions(model, val_loader, tokenizer)

# Evaluate worst predictions on training set
worst_predictions_train = evaluate_worst_predictions(model, train_loader, tokenizer)
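The `Error` column above is binary, so sorting by it only separates wrong predictions from right ones. To rank predictions from most to least wrong, one option is to keep the sigmoid probability and sort by its absolute distance to the label. The sketch below shows the idea in plain Python with made-up probabilities; `rank_by_confidence_error` is a hypothetical helper, not part of the notebook.

```python
def rank_by_confidence_error(probs, labels, texts, top_k=3):
    """Return the top_k examples with the largest |probability - label|."""
    rows = [
        {"text": text, "label": label, "prob_spam": prob, "abs_error": abs(prob - label)}
        for prob, label, text in zip(probs, labels, texts)
    ]
    rows.sort(key=lambda row: row["abs_error"], reverse=True)
    return rows[:top_k]

# Made-up sigmoid outputs and labels (1 = spam)
probs = [0.02, 0.91, 0.10, 0.55]
labels = [0, 1, 1, 0]
texts = ["msg a", "msg b", "msg c", "msg d"]

worst = rank_by_confidence_error(probs, labels, texts, top_k=2)
print([row["text"] for row in worst])  # ['msg c', 'msg d']
```

In `evaluate_worst_predictions`, this would amount to storing `outputs` alongside `preds` and sorting on the new column.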

In [128]:
worst_predictions_train["Predicted"].value_counts()

Predicted
0    3882
1     575
Name: count, dtype: int64

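Predictions on the training set. Because the DataFrame is sorted by descending `Error`, `tail(10)` shows correctly classified messages (`Error` = 0.0); the misclassified ones are at the top and can be displayed with `head(10)`. The trailing "!" characters in the `Text` column are the decoded padding (id 0).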

In [129]:
worst_predictions_train.tail(10)

Unnamed: 0,True_Label,Predicted,Error,Inputs,Text
1482,0,0,0.0,"[2460, 13305, 1903, 2523, 315, 757, 3432, 13, ...",All boys made fun of me today. Ok i have no pr...
1490,0,0,0.0,"[2675, 2751, 2663, 264, 5507, 30, 0, 0, 0, 0, ...",You got called a tool?!!!!!!!!!!!!!!!!!!!!!!!!
1489,1,1,0.0,"[81710, 0, 4718, 220, 1049, 18, 8785, 22504, 3...",PRIVATE! Your 2003 Account Statement for 07808...
1488,0,0,0.0,"[1829, 480, 715, 45, 1428, 31949, 52, 5745, 79...",IM GONNAMISSU SO MUCH!!I WOULD SAY IL SEND U A...
1487,0,0,0.0,"[40, 1440, 29011, 1661, 1193, 497, 29651, 376,...",I know complain num only..bettr directly go to...
1486,0,0,0.0,"[40, 2751, 2500, 2683, 0, 578, 832, 520, 279, ...",I got another job! The one at the hospital doi...
1485,0,0,0.0,"[40, 50116, 270, 74, 602, 3358, 17257, 3686, 1...","I dun thk i'll quit yet... Hmmm, can go jazz ?..."
1484,0,0,0.0,"[24220, 656, 0, 4418, 33895, 126, 231, 19321, ...",Yeah do! DonÂÃÃ·t stand to close tho- youÂÃ...
1483,0,0,0.0,"[32576, 353, 374, 1790, 4228, 3041, 701, 2759,...",Perhaps * is much easy give your account ident...
1491,0,0,0.0,"[51537, 706, 9670, 369, 2500, 1938, 11, 6693, ...","Night has ended for another day, morning has c..."


Classification report

In [130]:
print(classification_report(worst_predictions_train["True_Label"], worst_predictions_train["Predicted"]))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3860
           1       1.00      0.96      0.98       597

    accuracy                           0.99      4457
   macro avg       1.00      0.98      0.99      4457
weighted avg       0.99      0.99      0.99      4457



Confusion matrix

In [131]:
mat = confusion_matrix(worst_predictions_train["True_Label"], worst_predictions_train["Predicted"])

labels = df_spam["label_text"].unique()
df_mat = pd.DataFrame(mat, index=labels, columns=labels)

px.imshow(df_mat, text_auto=True, color_continuous_scale=px.colors.sequential.Aggrnyl)

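Predictions on the validation set. As for the training set, `tail(10)` shows correctly classified messages (`Error` = 0.0); `head(10)` would show the misclassified ones.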

In [132]:
worst_predictions_val.tail(10)

Unnamed: 0,True_Label,Predicted,Error,Inputs,Text
365,1,1,0.0,"[1539, 38, 1863, 0, 4718, 6505, 2360, 220, 258...",URGENT! Your mobile No 07xxxxxxxxx won a Ã¥Â£2...
373,0,0,0.0,"[2460, 2884, 11, 682, 23415, 304, 13, 4418, 95...","All done, all handed in. Don't know if mega sh..."
372,0,0,0.0,"[2181, 1120, 5084, 1093, 16682, 18912, 430, 27...",It just seems like weird timing that the night...
371,0,0,0.0,"[19321, 126, 234, 19321, 126, 237, 2751, 30125...",ÃÃ got wat to buy tell us then Ã_ no need t...
370,0,0,0.0,"[42, 358, 3358, 387, 2771, 311, 636, 709, 1603...",K I'll be sure to get up before noon and see w...
369,0,0,0.0,"[96714, 11, 8009, 1167, 582, 3250, 956, 2559, ...","Damn, poor zac doesn't stand a chance!!!!!!!!!..."
368,0,0,0.0,"[50, 894, 3067, 497, 73, 267, 63632, 1193, 602...",Sry da..jst nw only i came to home..!!!!!!!!!!...
367,0,0,0.0,"[2746, 358, 574, 358, 5828, 956, 12798, 6666, ...",If I was I wasn't paying attention!!!!!!!!!!!!...
366,0,0,0.0,"[44, 76, 779, 499, 4691, 757, 539, 311, 1650, ...",Mm so you asked me not to call radio!!!!!!!!!!...
374,1,1,0.0,"[31192, 31338, 473, 3145, 13, 2057, 3802, 4433...",Sunshine Hols. To claim ur med holiday send a ...


Classification report

In [133]:
print(classification_report(worst_predictions_val["True_Label"], worst_predictions_val["Predicted"]))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.99      0.93      0.96       150

    accuracy                           0.99      1115
   macro avg       0.99      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
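To make the report easier to read, the sketch below recomputes precision, recall, F1 and accuracy for the spam class from confusion-matrix counts. The counts are illustrative: they are one combination consistent with the rounded figures above, not values read from the notebook's confusion matrix. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Illustrative counts (assumption): 150 spam and 965 ham in total
metrics = binary_metrics(tp=140, fp=2, fn=10, tn=963)
print({name: round(value, 2) for name, value in metrics.items()})
# {'precision': 0.99, 'recall': 0.93, 'f1': 0.96, 'accuracy': 0.99}
```

Recall is the share of actual spam that is caught (`tp / (tp + fn)`), so a recall of 0.93 on 150 spam messages corresponds to roughly 10 or 11 missed messages.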



Confusion matrix

In [134]:
mat = confusion_matrix(worst_predictions_val["True_Label"], worst_predictions_val["Predicted"])

labels = df_spam["label_text"].unique()
df_mat = pd.DataFrame(mat, index=labels, columns=labels)

px.imshow(df_mat, text_auto=True, color_continuous_scale=px.colors.sequential.Aggrnyl)

---

## Conclusion

The model identifies spam and ham rather well.

However, on the validation set, the recall for spam is 0.93, which means that about 7% of spam messages go undetected.

Several factors may explain this:
- Label 1 (spam) is under-represented in the dataset (about 13% of the messages), which makes it harder to predict.
- The dataset is small to begin with (5,572 messages).
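The imbalance can be quantified from the supports reported in the two classification reports. The sketch below computes the spam share and, as an aside, the ratio of negatives to positives that is often used as a positive-class weight in a weighted loss. The weighting is only one common way to compensate for imbalance and is not something this notebook does; the function names are hypothetical.

```python
def positive_share(n_pos, n_neg):
    """Fraction of examples that belong to the positive class."""
    return n_pos / (n_pos + n_neg)

def positive_class_weight(n_pos, n_neg):
    """Ratio of negatives to positives, a usual choice for a positive-class weight."""
    return n_neg / n_pos

# Supports from the reports: train (3860 ham, 597 spam), validation (965 ham, 150 spam)
n_spam = 597 + 150
n_ham = 3860 + 965

share = positive_share(n_spam, n_ham)
weight = positive_class_weight(n_spam, n_ham)
print(n_spam + n_ham)    # 5572
print(round(share, 3))   # 0.134
print(round(weight, 2))  # 6.46
```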

To improve spam detection, the next step is to rely on pre-existing models that are more sophisticated and trained on larger datasets, and to check whether the ham / spam classification improves.