#  AT&T Spam Detector


# <span style="color:red"><b>TODO & Ideas - TO BE COMMENTED</b></span>

* ~~In 03_bert.ipynb make a test with ``bert-base-uncased`` then ``bert-base-cased`` and compare~~
* ~~Analyze results of the model~~
* ~~Explain the number of trainable parameters of the model~~
* ~~Use plotly to draw a confusion matrix per epoch. Can plotly do that ?~~
* ~~Try over-under sampling (see 02_att_02.ipynb and 02_att_03.ipynb)~~
* Make a test with an additional Dropout layer in the Baseline model to prevent overfitting?
* See ``07_deep_learning\08_Word_Embedding\04-Embedding_for_sentiment_analysis.ipynb``
* https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
* https://medium.com/@samia.khalid/bert-explained-a-complete-guide-with-theory-and-tutorial-3ac9ebc8fa7c
* https://medium.com/@yashvardhanvs/classification-using-pre-trained-bert-model-transfer-learning-2d50f404ed4c
* https://blog.mirkopeters.com/bert-based-transfer-learning-in-nlp-6887b4538bd0
* https://thesai.org/Downloads/Volume12No1/Paper_64-Comparison_of_Deep_and_Traditional_Learning_Methods.pdf
* https://medium.com/towards-data-science/spam-filtering-system-with-deep-learning-b8070b28f9e0
* https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
* https://www.tensorflow.org/tensorboard/get_started
* https://www.geeksforgeeks.org/sms-spam-detection-using-tensorflow-in-python/
* https://medium.com/@Coursesteach/deep-learning-part-37-bias-variance-3837b4f6caa1
* https://www.kaggle.com/code/bishowlamsal/spam-detector-using-bert
* https://www.kaggle.com/code/kshitij192/spam-email-classification-using-bert


Since we will use TensorFlow for this project, I highly recommend creating a virtual environment named ``tf_cpu1`` (Python 3.10 is important for TF)

```
conda deactivate
conda create --name tf_cpu1 --file ./assets/requirements.txt -c conda-forge -y
conda activate tf_cpu1
code .

```


# Summary of the specs

* https://app.jedha.co/course/projects-deep-learning-ft/att-spam-detector-ft

## Goals
* Your goal is to build a spam detector that can automatically flag spam as it arrives, based solely on the sms content.

## Start simple
* A good <span style="color:orange"><b>deep learning model</b></span> does not necessarily have to be super complicated!

## Transfer learning
* You do not have access to a whole lot of data. Perhaps channeling the power of a more sophisticated model trained on billions of observations might help!



# EDA

In [70]:
# prelude

import pandas as pd
import re
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import confusion_matrix
import plotly.graph_objects as go

from pathlib import Path
# k_CurrentDir  = Path(__file__).parent    # __file__ is not known in Jupyter context
k_Current_dir   = Path.cwd()
k_AssetsDir     = "assets"
k_Gold          = 1.618                    # golden ratio
k_Width         = 12
k_Height        = k_Width/k_Gold
k_WidthPx       = 1024
k_HeightPx      = k_WidthPx/k_Gold
k_random_state  = 42
k_test_size     = 0.3
k_num_words     = 1_000                    # the number of most freq words to keep during tokenization
k_epochs        = 50                       # I tried 10, 20, 50 and 100

In [28]:
# -----------------------------------------------------------------------------
def quick_View(df: pd.DataFrame) -> pd.DataFrame:

    """
    Generates a summary DataFrame for each column in the input DataFrame.

    This function analyzes each column in the given DataFrame and creates a summary that includes
    data type, number of null values, percentage of null values, number of non-null values, 
    number of distinct values, min and max values, outlier bounds (for numeric columns),
    and the frequency of distinct values.

    Args:
        df (pd.DataFrame): The input DataFrame to analyze.

    Returns:
        pd.DataFrame: A DataFrame containing the summary of each column from the input DataFrame. 
                      Each row in the resulting DataFrame represents a column from the input DataFrame
                      with the following information:
                      - "name": Column name
                      - "dtype": Data type of the column
                      - "# null": Number of null values
                      - "% null": Percentage of null values
                      - "# NOT null": Number of non-null values
                      - "distinct val": Number of distinct values
                      - "-3*sig": Lower bound for outliers (mean - 3*std) for numeric columns
                      - "min": Minimum value for numeric columns
                      - "max": Maximum value for numeric columns
                      - "+3*sig": Upper bound for outliers (mean + 3*std) for numeric columns
                      - "distinct val count": Dictionary of distinct value counts or top 10 values for object columns
    """

    summary_lst = []
  
    for col_name in df.columns:
        col_dtype               = df[col_name].dtype
        num_of_null             = df[col_name].isnull().sum()
        percent_of_null         = num_of_null/len(df)
        num_of_non_null         = df[col_name].notnull().sum()
        num_of_distinct_values  = df[col_name].nunique()

        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts    = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts  = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        if col_dtype != "object":
            max_of_col = df[col_name].max()
            min_of_col = df[col_name].min()
            outlier_hi = df[col_name].mean() + 3*df[col_name].std()
            outlier_lo = df[col_name].mean() - 3*df[col_name].std()
        else:
            max_of_col = -1
            min_of_col =  1
            outlier_hi = -1
            outlier_lo =  1

        summary_lst.append({
            "name"                : col_name,
            "dtype"               : col_dtype,
            "# null"              : num_of_null,
            "% null"              : (100*percent_of_null).round(2),
            "# NOT null"          : num_of_non_null,
            "distinct val"        : num_of_distinct_values,
            "-3*sig"              : round(outlier_lo,2) ,
            "min"                 : round(min_of_col,2),
            "max"                 : round(max_of_col,2),
            "+3*sig"              : round(outlier_hi,2) ,
            "distinct val count"  : distinct_values_counts
        })

    df_tmp = pd.DataFrame(summary_lst)
    return df_tmp

In [29]:
# -----------------------------------------------------------------------------
# drop empty cols and duplicates, rename cols...
def cleaner(df:pd.DataFrame) -> pd.DataFrame:
    df.drop(columns="Unnamed: 2", inplace=True)
    df.drop(columns="Unnamed: 3", inplace=True)
    df.drop(columns="Unnamed: 4", inplace=True)

    df.drop_duplicates(inplace=True)

    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace("/", "_")

    df.rename(columns={"v1": "target"}, inplace=True)
    df.rename(columns={"v2": "text"}, inplace=True)

    return df

In [None]:
# df = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/project/spam.csv", encoding="cp1252")
df = pd.read_csv(k_Current_dir / k_AssetsDir / "spam.csv", encoding="cp1252")
print(f"\n\nPreview of the dataset (raw) :")
display(df.head())

df = cleaner(df)
# print(df.shape)

print(f"\n\nPreview of the initial dataset :")
display(df.head())

print(f"\n\nThe dataset consists of :")
print(f"\t{len(df.shape):>9_} dimensions")
print(f"\t{df.shape[0]:>9_} observations")
print(f"\t{df.shape[1]:>9_} features    ")

print(f"\n\nIt's a binary classification problem")

df_types = pd.DataFrame ({
  "types" : df.dtypes.value_counts()
})
df_types["as_%"] = (100 * df_types["types"]/df_types["types"].sum()).round(2)

print(f"\n\n% of data type :")
display(df_types)

df_tmp = quick_View(df)
print(f"\n\nQuickView :")
display(df_tmp.sort_values(by="# null", ascending=False))   

print(f"\n\n% of missing values :")
display(round(df.isnull().sum()/len(df)*100, 2))

## <span style="color:orange"><b>Comments :</b></span>
* There are no missing values
* 5k observations. Will it be enough ?
* Unbalanced target

## Spam & ham balance

In [None]:
counts = df["target"].value_counts()
print(f"Nb spam : {counts['spam']:>7_}")
print(f"Nb ham  : {counts['ham']:>7_}")

_ = counts.plot.pie(title="Weight as %", autopct="%1.1f%%", figsize=(k_Width, k_Height))

### <span style="color:orange"><b>Comments :</b></span>
* No surprise, the target is heavily unbalanced

# Text processing


## What do ham and spam texts look like?

In [None]:
print(f"\n\nHAM : ")
# pd.set_option("display.max_colwidth", 1000)
# print(df[df["target"]=="ham"].head(20))
print(df[df["target"]=="ham"].head(20).to_string())

In [None]:
print(f"\n\nSPAM : ")
print(df[df["target"]=="spam"].head(20).to_string())

## Cleaning

In [None]:
# The 2 lines below can help to print the punctuation signs
# import string 
# string.punctuation

# Remove punctuation (raw string, so the backslashes reach the regex engine untouched)
df["clean_docs"] = df["text"].apply(lambda x: re.sub(r"[!\"#$%&()*+,\-./:;<=>?@\[\]^_`{|}~\\]+", " ", x))

# fillna() makes sure NA is replaced with "" so that lower-casing does not raise an error
df["clean_docs"] = df["clean_docs"].fillna("").apply(lambda x: x.lower())

# df["clean_docs"].head(20)
df
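
The punctuation class used above can also be built from `string.punctuation`, as the commented lines hint. This is a minimal sketch, not the notebook's method: `strip_punctuation` is a hypothetical helper, and unlike the hand-written pattern, `string.punctuation` also contains the apostrophe, so words like "don't" get split.

```python
import re
import string

# Build a character class from every ASCII punctuation sign.
# Assumption: unlike the hand-written pattern, this one also removes "'".
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]+")

def strip_punctuation(text: str) -> str:
    """Replace each run of punctuation with one space, then lower-case."""
    return PUNCT_RE.sub(" ", text).lower()

print(strip_punctuation("WINNER!! Call 0800-123 now..."))
```

`re.escape()` protects the characters that have a special meaning inside a character class (`]`, `\`, `^`, `-`), which avoids escaping them by hand.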


## Tokenization

In [None]:
nlp = en_core_web_sm.load()

# Tokenize the cleaned document
df["tokenized_docs"] = df["clean_docs"].fillna("").apply(lambda x: nlp(x))

# remove stop-words, replace words with their lemma
df["tokenized_docs"] = df["tokenized_docs"].apply(lambda x: [token.lemma_ for token in x if token.text not in STOP_WORDS])
# df["tokenized_docs"].head(20)

# clean up tokenized documents
df["clean_tokens"] = [" ".join(x) for x in df["tokenized_docs"]]

# set the target as boolean value (spam=1) 
df["target"] = df["target"].map({"ham":0,"spam":1})
df


## Word Cloud

In [None]:
for i in set(df["target"]):
    words = ""
    for doc in df[df["target"] == i]["text"]:
        words += doc + " "
    wordcloud = WordCloud().generate(words)
    plt.figure(figsize = (k_Width, k_Height))           # figsize expects (width, height)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(f"WordCloud for {'spam' if i==1 else 'ham'}")
    plt.show()

### <span style="color:orange"><b>Comments :</b></span>

* Unsurprisingly, "free", "call", "now"... dominate spam sms
* Surprisingly "sex" and "viagra" are missing 😁

## Word occurrences

In [42]:
def word_occurences_report(target):
    df_tmp = " ".join(df[df["target"] == target]["clean_tokens"])
    df_tmp = pd.DataFrame(df_tmp.split(" "))
    df_tmp = df_tmp.value_counts(ascending=False)

    df_tmp = df_tmp.reset_index()  
    df_tmp.columns = ["word", "occurrences"]  
    df_tmp.set_index("word", inplace=True)  
    display(df_tmp)

    # print(f"About words in ham :")
    print(f"{len(df_tmp):>6_} different words")
    print(f"{df_tmp['occurrences'].sum():>6_} occurrences")
    pareto = int(df_tmp['occurrences'].sum()*0.8)
    print(f"{pareto:>6_} = 80% of those occurrences")

    i = 100
    while df_tmp['occurrences'].head(i).sum()<pareto:
        i+=100

    print(f"{i:6_} ({100*i/len(df_tmp):.0f} %) words are needed to cover 80% of the occurrences")
    print(f"\n\n\n")
    return

In [None]:
print(f"About words in spam :")
word_occurences_report(1)

print(f"About words in ham :")
word_occurences_report(0)
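
The loop above advances in steps of 100 words, so the reported threshold is rounded up to the next hundred. Here is a minimal sketch of the exact computation with `collections.Counter`. The helper name `words_for_coverage` and the toy sentence are assumptions for illustration only.

```python
from collections import Counter

def words_for_coverage(tokens, ratio=0.8):
    """Return how many of the most frequent words cover `ratio` of all occurrences."""
    counts = Counter(tokens)
    target = ratio * sum(counts.values())
    covered = 0
    # Walk the words from most to least frequent and accumulate their counts
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target:
            return rank
    return len(counts)

# Toy data: 10 occurrences, 6 distinct words
tokens = "free call now free call free win prize txt free".split()
print(words_for_coverage(tokens))
```

On the real data, the tokens would come from `" ".join(df["clean_tokens"]).split()`.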


## Encoding

In [None]:
# oov_token= out of vocabulary token
# When the oov_token is specified in the tokenizer, words not present in the learned vocabulary will be replaced by this token. 
# This enables the model to handle new words that appear in test or inference data, while reducing the risk of errors or inaccuracies.
# In a new sentence, 2 unknown words will be represented by the OOV token, preserving information even if the model hasn't seen these words before

# If oov_token not specified any word not in the vocabulary will be ignored and not tokenized
# This may result in the loss of important information during inference or testing.

# keep the 1_000 most frequent words during tokenization
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=k_num_words, oov_token="_UNKNOWN_") 
tokenizer.fit_on_texts(df["clean_tokens"])
df["sms_encoded"] = tokenizer.texts_to_sequences(df["clean_tokens"])
df
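
To make the effect of `num_words` and `oov_token` concrete, here is a pure-Python imitation of the encoding step. It is a simplified sketch of the behaviour described in the comments above, not the Keras implementation. `build_index` and `encode` are hypothetical helpers.

```python
from collections import Counter

def build_index(texts, oov_token=None):
    """Rank words by frequency; index 0 stays free for padding."""
    counts = Counter(w for t in texts for w in t.split())
    words = [w for w, _ in counts.most_common()]
    if oov_token is not None:
        words.insert(0, oov_token)          # the OOV token takes index 1
    return {w: i for i, w in enumerate(words, start=1)}

def encode(text, index, num_words, oov_token=None):
    """Keep only indices below num_words; map the rest to OOV or drop them."""
    oov = index.get(oov_token)
    seq = []
    for w in text.split():
        i = index.get(w)
        if i is not None and i < num_words:
            seq.append(i)
        elif oov is not None:
            seq.append(oov)                 # unknown or too rare: keep a placeholder
    return seq

corpus = ["free call now", "free call", "free"]
with_oov = build_index(corpus, oov_token="_UNKNOWN_")
without_oov = build_index(corpus)

print(encode("free prize call", with_oov, num_words=4, oov_token="_UNKNOWN_"))
print(encode("free prize call", without_oov, num_words=4))
```

With the OOV token, the unseen word "prize" keeps its position in the sequence. Without it, the word silently disappears and the sequence gets shorter.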


## Padding

In [None]:
# TensorFlow cannot build a regular tensor from lists of different lengths
# The encoded sms must be stored in a rectangular numpy array before creating the TensorFlow dataset
# However, not all the sequences have the same length
# This is where `tf.keras.preprocessing.sequence.pad_sequences()` comes in 
# It will add zero padding at the beginning (`padding="pre"`) or at the end (`padding="post"`) of our sequences so they all have equal length
sms_padded = tf.keras.preprocessing.sequence.pad_sequences(df["sms_encoded"], padding="post")
print(sms_padded)
print(sms_padded.shape)

# max_words = df['clean_tokens'].apply(lambda x: len(x.split())).max()
# print(max_words)
# 77
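
What `padding="post"` does can be sketched in a few lines of plain Python. This only illustrates the idea (zeros appended up to the longest sequence). `pad_post` is a hypothetical helper, and the real function offers more options such as `maxlen` and truncation.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (width - len(seq)) for seq in sequences]

encoded = [[5, 12, 7], [3], [8, 2]]     # toy encoded sms of unequal length
for row in pad_post(encoded):
    print(row)
```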

## Split

In [50]:
# stratify
X_train, X_test, y_train, y_test = train_test_split(sms_padded, df["target"], test_size = k_test_size, random_state = k_random_state, stratify = df["target"])

### <span style="color:orange"><b>Comments :</b></span>

* No over/under sampling on ``X_train``, ``y_train``
* See ``02_att_02.ipynb`` or ``02_att_03.ipynb``

## Create Tensor

In [51]:
# creates a TensorFlow Dataset object to be used for data loading, batching, shuffling, and preprocessing when training the model
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [52]:
# Create generators, making sure that data are blended/mixed and divided into batches of 64 observations 
train_batch = train_ds.shuffle(len(train_ds)).batch(64)
test_batch = test_ds.shuffle(len(test_ds)).batch(64)

# for sms, ham_spam in train_batch.take(1):
#   print(sms, ham_spam)

# print()
# print(sms.shape)

# Using a Baseline Model


In [None]:
sequence_length = sms_padded.shape[1]

model = tf.keras.Sequential([
    # This Embedding layer converts integer-encoded words (from the tokenizer) into dense vectors of size 8 
    # tokenizer.num_words should be equal to k_num_words
    # +1 because index 0 is reserved for padding (the OOV token takes index 1)
    # sequence_length is the length of the input sequences. See the output of the padding cell (77)
    # For the output dimension I tried: 16, 8, 4 and 2
    # I like 8
    
    # tf.keras.layers.Embedding(tokenizer.num_words + 1, 8, input_shape=[sms.shape[1],], name="embedding"),
    tf.keras.layers.Embedding(tokenizer.num_words + 1, 8, input_shape=[sequence_length], name="embedding"),

    # Global average pooling
    # Reduces the dimensionality by averaging the vectors across the sequence length
    # The model loses the order of the words, which is acceptable here
    tf.keras.layers.GlobalAveragePooling1D(),

    # Fully connected (Dense) layer with 16 neurons. 
    # relu activation function introduces non-linearity (this helps the model to capture more complex patterns)
    tf.keras.layers.Dense(16, activation="relu"),

    # Since this is a binary classification problem (based on the sigmoid activation), there's a single output neuron. 
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.summary()

In [None]:
path = Path(f"{k_Current_dir/k_AssetsDir/'baseline_arch.png'}")
tf.keras.utils.plot_model(model, path, show_shapes=True)

## <span style="color:orange"><b>Comments :</b></span>

About the number of trainable parameters of the model

```
Inputs                 : 1000 words    (see k_num_words in prelude)
Embedding layer        :  8 neurons => (1000 + 1) * 8 = 8008 params
GlobalAveragePooling1D :    average => no params      =    0 params
Dense layer            : 16 neurons => 8  * 16 + 16   =  144 params
Dense layer            :  1 neuron  => 16 *  1 +  1   =   17 params
                                                TOTAL = 8169 params
```
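
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch with hypothetical helper names. The total should match what `model.summary()` reports.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    return vocab_size * dim                 # one vector of size `dim` per index

def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out             # weights + biases

k_num_words = 1_000
layers = {
    "embedding": embedding_params(k_num_words + 1, 8),
    "global_average_pooling": 0,            # averaging has nothing to learn
    "dense_16": dense_params(8, 16),
    "dense_1": dense_params(16, 1),
}
total = sum(layers.values())
print(layers, total)
```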


In [55]:
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.Recall(name="recall"), tf.keras.metrics.Precision(name="precision"), "accuracy"]       # name=... avoid recall_1 for example
)

In [None]:
confusion_matrices = []

def get_data_from_dataset(dataset):
    features = []
    labels = []
    for batch_features, batch_labels in dataset:
        features.append(batch_features.numpy())
        labels.append(batch_labels.numpy())
    return np.concatenate(features), np.concatenate(labels)

class ConfusionMatrixCallback(tf.keras.callbacks.Callback):             # inherit from tf.keras.callbacks.Callback
    def __init__(self, val_data):
        super().__init__()                                              # let Keras initialise the base callback
        # store data once
        self.val_data = val_data                                        

    def on_epoch_end(self, epoch, logs=None):
        # extract features and labels from validation set
        val_features, val_labels = get_data_from_dataset(self.val_data)
        
        # make prediction with validation features
        val_pred = (self.model.predict(val_features) > 0.5).astype("int32")
        
        # compute the associated confusion matrix
        cm = confusion_matrix(val_labels, val_pred)

        # flip the 2 rows: plotly draws row 0 at the bottom, so once flipped the heatmap reads like sklearn's confusion matrix display
        cm_flipped = np.flipud(cm)
        
        confusion_matrices.append(cm_flipped)

history = model.fit(
    train_batch,
    epochs=k_epochs,
    validation_data=test_batch,
    callbacks=[ConfusionMatrixCallback(test_batch)]
)

## Evaluating

### <span style="color:orange"><b>Sync point :</b></span>

<p align="center">
<img src="./assets/metrics.png" alt="drawing" width="400"/>
</p>

* Sms identified as spam (=1)
* Can I accept spam in my sms? I don't know...
* Can I accept to see an sms from my beloved CEO classified as spam? No! 
* So I want FP to tend towards 0 and precision $\frac{TP}{TP+FP}$ towards 1 (even if the recall, $\frac{TP}{TP+FN}$, is not that great )
* I decide to favour precision over recall
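
To make the trade-off concrete, here is a small worked example. The counts are invented for illustration (they are not results from this notebook) and the matrix uses the scikit-learn layout `[[TN, FP], [FN, TP]]`.

```python
# Hypothetical confusion matrix, scikit-learn layout: rows = true, cols = predicted
cm = [[1440, 2],     # true ham : TN, FP
      [  24, 176]]   # true spam: FN, TP

(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)   # how trustworthy a "spam" verdict is
recall    = tp / (tp + fn)   # share of real spam that gets caught

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

With only 2 false positives, precision stays close to 1 even though 24 spam messages slip through. This is the behaviour favoured here.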

Just to make sure...
* **Overfitting**
    * If accuracy continues to increase on training but starts to decrease or stagnate on validation, this indicates overfitting.

* **Underfitting** 
    * If the loss curves do not decrease or if accuracy remains low, this could indicate that the model has not yet learned sufficiently.
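
The overfitting rule above can be turned into a quick check on the curves stored in `history.history`. This is a minimal sketch that takes a list of validation losses, one per epoch. `overfit_epoch` and its `patience` parameter are hypothetical names, and the numbers are made up.

```python
def overfit_epoch(val_loss, patience=3):
    """Return the 1-based epoch of the best validation loss once it has not
    improved for `patience` consecutive epochs, or None if it keeps improving."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_loss, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None

# Made-up curve: improves until epoch 5, then degrades
val_loss = [0.60, 0.30, 0.12, 0.08, 0.07, 0.08, 0.09, 0.10, 0.12]
print(overfit_epoch(val_loss))
```

Keras implements the same idea in the `tf.keras.callbacks.EarlyStopping` callback, which can stop training automatically.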


In [62]:
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = Path(f"{k_Current_dir/k_AssetsDir/fig_id}.{fig_extension}")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
    return

### Loss

In [None]:
# "val_" stands for validation
# history.history.keys()

plt.plot(history.history["loss"], color="b", label="Train Loss")
plt.plot(history.history["val_loss"], color="r", label="Val Loss")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("Baseline : Loss")
plt.legend()
plt.ylim(0,1)
save_fig("baseline_loss", fig_extension="png")
plt.show()


# display(history.history['loss'][-10:])
display([[round(f, 6) for f in history.history['loss'][-10:]]])
display([[round(f, 6) for f in history.history['val_loss'][-10:]]])


#### <span style="color:orange"><b>Comments :</b></span>
* Interpretation
    * The loss measures the error of the model on the training data
    * The learning rate seems OK because the curve shows a drop and then a plateau
    * The 2 curves don't separate immediately. The model doesn't learn "by heart". This is a good indicator
    * Validation loss rises while train loss remains on its plateau. This is an early sign of overfitting (see the definition in the sync point above), so it is worth watching
* Analysis
    * A loss of 0.05 is relatively low, indicating that the model has learned the characteristics of the data pretty well 
    * However, a low loss does not necessarily mean that the model is perfect 
    * See comments about other metrics below

### Accuracy

In [None]:
plt.plot(history.history["accuracy"], color="b", label="Train Accuracy")
plt.plot(history.history["val_accuracy"], color="r", label="Val Accuracy")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("Baseline : Accuracy")
plt.legend()
plt.ylim(0,1)
save_fig("baseline_accuracy", fig_extension="png")
plt.show()

display([[round(f, 6) for f in history.history['accuracy'][-10:]]])
display([[round(f, 6) for f in history.history['val_accuracy'][-10:]]])


#### <span style="color:orange"><b>Comments :</b></span>
* Interpretation
    * Accuracy measures the total proportion of correct predictions (TP + TN) out of all predictions made (all cells of the matrix above)
    * Diagonal of the matrix (see above, sync point)
    * An accuracy of 0.985 means that the model correctly classifies 98.5% of sms
* Analysis: 
    * High accuracy is generally a good sign
    * **BUT** in a spam detection problem where the classes are unbalanced (many more non-spam than spam), high accuracy can be misleading 
    * For example, predicting that all sms are non-spam could give a good accuracy if spam is rare
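
A quick calculation shows how misleading accuracy can be. The class counts below are hypothetical (replace them with the output of `df["target"].value_counts()`).

```python
# Hypothetical class counts, for illustration only
n_ham, n_spam = 4_500, 650

# A "model" that always answers ham gets every ham right and every spam wrong
accuracy_always_ham = n_ham / (n_ham + n_spam)
recall_always_ham = 0 / n_spam            # no spam is ever caught

print(f"accuracy = {accuracy_always_ham:.3f}")
print(f"recall   = {recall_always_ham:.3f}")
```

Such a classifier is useless as a spam filter, yet its accuracy looks respectable. This is why precision and recall are tracked alongside accuracy.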

### Precision

In [None]:
plt.plot(history.history["precision"], color="b", label="Train Precision")
plt.plot(history.history["val_precision"], color="r", label="Val Precision")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("Baseline : Precision")
plt.legend()
plt.ylim(0,1)
save_fig("baseline_precision", fig_extension="png")
plt.show()

display([[round(f, 6) for f in history.history['precision'][-10:]]])
display([[round(f, 6) for f in history.history['val_precision'][-10:]]])


#### <span style="color:orange"><b>Comments :</b></span>
* Interpretation
    * Precision measures the proportion of true positives (sms correctly identified as spam) versus all sms classified as spam by the model (TP + FP). 
    * Right hand side column of the matrix above
    * A precision of 0.994 means that when the model predicts that an sms is spam, it is right 99.4% of the time.
* Analysis
    * Precision is excellent and this is what we want
    * The model is very effective at minimizing false positives, i.e. non-spam sms wrongly classified as spam.

### Recall

In [None]:
plt.plot(history.history["recall"], color="b", label="Train Recall")
plt.plot(history.history["val_recall"], color="r", label="Val Recall")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("Baseline : Recall")
plt.legend()
plt.ylim(0,1)
save_fig("baseline_recall", fig_extension="png")
plt.show()

display([[round(f, 6) for f in history.history['recall'][-10:]]])
display([[round(f, 6) for f in history.history['val_recall'][-10:]]])


#### <span style="color:orange"><b>Comments :</b></span>


* Interpretation
    * Recall (AKA sensitivity) measures the proportion of true positives (correctly identified spam sms) to all true spam (TP + FN)
    * Bottom line of the matrix above
    * A recall of 0.892 means that the model correctly detects 89.2% of sms that are really spam
* Analysis
    * Although the recall is relatively high, the model still misses around 10.8% of spam (false negatives)

In [67]:
def f1_calculus(name, rec, prec):
    df_tmp=pd.DataFrame()
    df_tmp[name] = 2*np.array(rec)*np.array(prec)/(np.array(rec)+np.array(prec)+tf.keras.backend.epsilon()) # epsilon avoid runtimeWarning: divide by zero encountered in divide...
    return df_tmp

### F1 Score

In [None]:
df_tmp = f1_calculus("f1", history.history["recall"], history.history["precision"])
df_val_tmp = f1_calculus("val_f1", history.history["val_recall"], history.history["val_precision"])

plt.plot(df_tmp["f1"], color="b", label="Train F1")
plt.plot(df_val_tmp["val_f1"], color="r", label="Val F1")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("Baseline : F1")
plt.legend()
plt.ylim(0,1)
save_fig("baseline_f1", fig_extension="png")
plt.show()

display(df_tmp.tail(10))
display(df_val_tmp.tail(10))




#### <span style="color:orange"><b>Comments :</b></span>
* Interpretation: 
    * F1 score provides a balanced measure of the model when there's a trade-off between precision and recall 
    * It is the harmonic mean between precision and recall
    * An F1 score of 0.94 shows that the model has a good balance between precision and recall
* Analysis: 
    * The F1 score confirms that the model handles the trade-off between precision and recall pretty well
    * This is crucial in spam detection problems where false negatives (undetected spam sms) can be annoying. 
        * It is great to keep legitimate sms out of the spam folder, but it would be even better to also have no spam at all in our inbox
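
As a check, the harmonic mean can be recomputed from the precision (0.994) and recall (0.892) quoted in the sections above. A minimal sketch:

```python
precision, recall = 0.994, 0.892   # values quoted in the comments above

f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
arithmetic = (precision + recall) / 2                # for comparison

print(f"F1              = {f1:.3f}")
print(f"arithmetic mean = {arithmetic:.3f}")
```

The harmonic mean is never above the arithmetic mean and is pulled towards the weaker of the two metrics, so a poor recall cannot hide behind an excellent precision.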

### Confusion matrix (animated)

In [None]:
def plot_confusion_matrices(confusion_matrices):
    epochs = range(1, len(confusion_matrices) + 1)      # one frame per recorded matrix
    frames = []

    # Create a frame per epoch
    for epoch, cm in zip(epochs, confusion_matrices):
        annotations = [[f'{value}' for value in row] for row in cm]
        frame = go.Frame(data=[go.Heatmap(
            z=cm, 
            text=annotations,
            colorscale='Viridis',                       # mimic default color of sklearn.metrics.confusion_matrix()
            hoverinfo='skip',                           # ! Hide hover info (x, y, z)
            showscale=True,                             # colorbar visible
            texttemplate="%{text}",                     # string to be displayed in the cells
            textfont={"size": 14},  
        )], name=f"Epoch {epoch}")
        frames.append(frame)

    # Initialization with the first confusion matrix
    fig = go.Figure(
        data=[go.Heatmap(
            z=confusion_matrices[0], 
            text=[[f'{value}' for value in row] for row in confusion_matrices[0]], 
            colorscale='Viridis',
            hoverinfo='skip',  
            showscale=True,  
            texttemplate="%{text}",  
            textfont={"size": 14},  
        )],
        layout=go.Layout(
            autosize=False,
            width=1000,
            height=1000,
            title="Animated confusion matrix",
            xaxis_title="Predicted label",
            yaxis_title="True Label",
            # Make sure 0 is on top and 1 is bottom on y axis 
            # Make sure we read "Ham" and "Spam"
            xaxis=dict(
                tickvals=[0, 1], 
                ticktext=["Ham", "Spam"]
            ),
            yaxis=dict(
                tickvals=[0, 1],  
                ticktext=["Spam", "Ham"],  
                categoryorder="array",  
                categoryarray=[1, 0]  
            ),
            updatemenus=[{
                'type': 'buttons',
                'buttons': [{
                    'label': 'Play',
                    'method': 'animate',
                    'args': [None, {
                        'frame': {'duration': 500, 'redraw': True},
                        'fromcurrent': True,
                        'mode': 'immediate',
                    }]
                }]
            }]
        ),
        frames=frames  # assign frames to the Figure
    )

    # Add a slider
    fig.update_layout(
        sliders=[{
            'steps': [{
                'args': [[f"Epoch {epoch}"], {'frame': {'duration': 500, 'redraw': True}}],
                'label': f'Epoch {epoch}',
                'method': 'animate'
            } for epoch in epochs],
            'currentvalue': {'prefix': 'Epoch: '}
        }]
    )
    fig.show()

plot_confusion_matrices(confusion_matrices)

### <span style="color:orange"><b>Conclusion (baseline model evaluation) :</b></span>
* Overall, the baseline model performs well and has good metrics
* However, there are a few areas for improvement
    * Recall is 89%, so about 11% of spam goes undetected
    * With the other models, let's try to improve the recall without losing too much precision
    * Is it possible with such a small dataset?
    * Should we consider techniques to balance the classes (SMOTE...)?
    * Should we put the model in production and collect feedback and see how it goes?

# Using RNN with LSTM

* Recurrent Neural Networks
* Long Short-Term Memory
* Advantages:
    * Efficiently captures word relationships
    * Handles long, complex sequences well
* Disadvantages:
    * Much slower to train than CNN-type models (see below)
    * Risk of overfitting if data is limited (which is the case here: 5K sms)

In [None]:
model = tf.keras.Sequential([
    # slow : 12 min for 100 epochs
    # tf.keras.layers.Embedding(tokenizer.num_words + 1, 64, input_shape=[sms.shape[1],], name="embedding"),
    # tf.keras.layers.LSTM(units=64, return_sequences=True, name="lstm_1"),  
    # tf.keras.layers.LSTM(units=32, return_sequences=False, name="lstm_2"), 
    # tf.keras.layers.Dense(16, activation='relu', name="dense_1"),
    # tf.keras.layers.Dense(1, activation="sigmoid", name="dense_2")

    tf.keras.layers.Embedding(tokenizer.num_words + 1, 16, input_shape=[sequence_length], name="embedding"),
    tf.keras.layers.LSTM(units=16, return_sequences=True, name="lstm_1"),  
    tf.keras.layers.LSTM(units=8, return_sequences=False, name="lstm_2"), 
    tf.keras.layers.Dense(8, activation='relu', name="dense_1"),
    tf.keras.layers.Dense(1, activation="sigmoid", name="dense_2")

    # ! Never tried, see https://medium.com/towards-data-science/spam-detection-in-emails-de0398ea3b48
    # Creating an embedding layer to vectorize
    # model.add(Embedding(max_feature, embedding_vector_length, input_length=max_len))
    # Adding Bi-directional LSTM
    # model.add(Bidirectional(tf.keras.layers.LSTM(64)))
    # Relu allows converging quickly and allows backpropagation
    # model.add(Dense(16, activation='relu'))
    # Deep Learning models can overfit easily, to avoid this, we add randomization using dropout
    # model.add(Dropout(0.1))
    # Adding sigmoid activation function to normalize the output
    # model.add(Dense(1, activation='sigmoid'))

])
model.summary()

In [None]:
path = Path(f"{k_Current_dir/k_AssetsDir/'lstm_arch.png'}")
tf.keras.utils.plot_model(model, path, show_shapes=True)
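
The size of this model can be estimated by hand. An LSTM layer holds four gates, each with an input kernel, a recurrent kernel and a bias, which gives `4 * (units * (input_dim + units) + units)` parameters. This is a minimal sketch with hypothetical helper names. Compare the result with the output of `model.summary()`.

```python
def lstm_params(input_dim: int, units: int) -> int:
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out             # weights + biases

layers = {
    "embedding": (1_000 + 1) * 16,          # tokenizer.num_words + 1 vectors of size 16
    "lstm_1": lstm_params(16, 16),
    "lstm_2": lstm_params(16, 8),
    "dense_1": dense_params(8, 8),
    "dense_2": dense_params(8, 1),
}
print(layers, sum(layers.values()))
```

Most of the weights sit in the embedding layer. The two LSTM layers stay small, but they are processed step by step along the sequence, which explains the longer training time.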

In [74]:
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.Recall(name="recall"), tf.keras.metrics.Precision(name="precision"), "accuracy"]       # name=... avoid recall_1 for example
)

In [None]:
history = model.fit(
    train_batch,
    epochs = k_epochs,
    validation_data=test_batch,
)

In [None]:
plt.plot(history.history["loss"], color="b", label="Train Loss")
plt.plot(history.history["val_loss"], color="r", label="Val Loss")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("LSTM : Loss")
plt.legend()
plt.ylim(0,1)
save_fig("lstm_loss", fig_extension="png")
plt.show()

# display(history.history['loss'][-10:])
display([[round(f, 6) for f in history.history['loss'][-10:]]])
display([[round(f, 6) for f in history.history['val_loss'][-10:]]])

In [None]:
plt.plot(history.history["accuracy"], color="b", label="Train Accuracy")
plt.plot(history.history["val_accuracy"], color="r", label="Val Accuracy")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("LSTM : Accuracy")
plt.legend()
plt.ylim(0,1)
save_fig("lstm_accuracy", fig_extension="png")
plt.show()

display([[round(f, 6) for f in history.history['accuracy'][-10:]]])
display([[round(f, 6) for f in history.history['val_accuracy'][-10:]]])

### <span style="color:orange"><b>Comments :</b></span>
* Let's try another model...


# Using a CNN model 

* CNNs can be used for text classification since they can capture local patterns (phrases or word combinations).
* Advantages:
    * Quick to train
    * Effective for capturing local features
* Disadvantages:
    * Can be less effective at capturing long-distance relationships in text
    * This should not be an issue here since sms are short

In [None]:
model = tf.keras.Sequential([
    # tf.keras.layers.Embedding(input_dim=k_num_words, output_dim=embedding_dim, input_length=sms.shape[1]),
    # tf.keras.layers.Embedding(tokenizer.num_words + 1, 8, input_shape=[sms.shape[1],], name="embedding"),
    
    # Might be too complex
    # tf.keras.layers.Embedding(tokenizer.num_words + 1, 128, input_shape=[sms.shape[1],], name="embedding"),
    # tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation='relu'),
    # tf.keras.layers.GlobalMaxPooling1D(),
    # tf.keras.layers.Dense(64, activation='relu'),
    # tf.keras.layers.Dropout(0.5),
    # tf.keras.layers.Dense(1, activation='sigmoid') 

    # tf.keras.layers.Embedding(tokenizer.num_words + 1, 16, input_shape=[sms.shape[1],], name="embedding"),
    tf.keras.layers.Embedding(tokenizer.num_words + 1, 16, input_shape=[sequence_length], name="embedding"),
    tf.keras.layers.Conv1D(filters=8, kernel_size=5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')  
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

In [None]:
path = k_Current_dir / k_AssetsDir / "cnn_arch.png"
tf.keras.utils.plot_model(model, path, show_shapes=True)

In [80]:
model.compile(
    optimizer='adam', 
    loss=tf.keras.losses.BinaryCrossentropy(), 
    # metrics=['accuracy']),
    metrics=[tf.keras.metrics.Recall(name="recall"), tf.keras.metrics.Precision(name="precision"), "accuracy"]       # explicit names avoid auto-generated keys such as recall_1
)


In [None]:
history = model.fit(
    train_batch,
    epochs = k_epochs,
    validation_data=test_batch,
)

## Evaluating

In [None]:
plt.plot(history.history["loss"], color="b", label="Train Loss")
plt.plot(history.history["val_loss"], color="r", label="Val Loss")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("CNN : Loss")
plt.legend()
plt.ylim(0,1)
save_fig("cnn_loss")
plt.show()

# display(history.history['loss'][-10:])
display([[round(f, 6) for f in history.history['loss'][-10:]]])
display([[round(f, 6) for f in history.history['val_loss'][-10:]]])


### <span style="color:orange"><b>Comments :</b></span>
* Note how the model has been simplified
* Also note that a dropout layer (which randomly drops some features during training) helps to fight overfitting
* We could also try L1 or L2 regularization (which adds a penalty on the weights to the loss function)
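
As a rough illustration of "adding a cost to the loss function", the sketch below computes a binary cross-entropy and then adds an L1 or an L2 penalty on a set of weights. The labels, predictions, weights and the strength `lam` are hypothetical values chosen for the example, and the helper names are not part of the notebook.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weights (pushes weights to zero)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam * sum of squared weights (discourages large weights)."""
    return lam * sum(w * w for w in weights)


y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.7]
weights = [0.5, -1.5, 2.0]  # hypothetical layer weights
lam = 0.01                  # hypothetical regularization strength

data_loss = binary_crossentropy(y_true, y_pred)
loss_l1 = data_loss + l1_penalty(weights, lam)
loss_l2 = data_loss + l2_penalty(weights, lam)
print(round(data_loss, 4), round(loss_l1, 4), round(loss_l2, 4))
```

In Keras, the same idea is usually expressed through the `kernel_regularizer` argument of a layer, for example with `tf.keras.regularizers.l2`; this has not been tried on the model above.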


In [None]:
plt.plot(history.history["accuracy"], color="b", label="Train Accuracy")
plt.plot(history.history["val_accuracy"], color="r", label="Val Accuracy")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("CNN : Accuracy")
plt.legend()
plt.ylim(0,1)
save_fig("cnn_accuracy")
plt.show()

display([[round(f, 6) for f in history.history['accuracy'][-10:]]])
display([[round(f, 6) for f in history.history['val_accuracy'][-10:]]])


In [None]:
plt.plot(history.history["precision"], color="b", label="Train Precision")
plt.plot(history.history["val_precision"], color="r", label="Val Precision")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("CNN : Precision")
plt.legend()
plt.ylim(0,1)
save_fig("cnn_precision")
plt.show()

display([[round(f, 6) for f in history.history['precision'][-10:]]])
display([[round(f, 6) for f in history.history['val_precision'][-10:]]])


In [None]:
plt.plot(history.history["recall"], color="b", label="Train Recall")
plt.plot(history.history["val_recall"], color="r", label="Val Recall")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("CNN : Recall")
plt.legend()
plt.ylim(0,1)
save_fig("cnn_recall")
plt.show()

display([[round(f, 6) for f in history.history['recall'][-10:]]])
display([[round(f, 6) for f in history.history['val_recall'][-10:]]])


In [None]:
df_tmp = f1_calculus("f1", history.history["recall"], history.history["precision"])
df_val_tmp = f1_calculus("val_f1", history.history["val_recall"], history.history["val_precision"])

plt.plot(df_tmp["f1"], color="b", label="Train F1")
plt.plot(df_val_tmp["val_f1"], color="r", label="Val F1")
plt.ylabel("Values")
plt.xlabel("Epochs")
plt.title("CNN : F1")
plt.legend()
plt.ylim(0,1)
save_fig("cnn_f1", "png")
plt.show()

display(df_tmp.tail(10))
display(df_val_tmp.tail(10))
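
The `f1_calculus` helper used above is defined elsewhere in the notebook and is not reproduced here. As a rough sketch of the underlying computation, assuming it combines the per-epoch recall and precision lists with the usual harmonic-mean formula, a hypothetical stand-in (`f1_per_epoch`, which returns a plain list rather than a DataFrame) could look like this:

```python
def f1_per_epoch(recalls, precisions):
    """Return one F1 score per epoch.

    F1 = 2 * precision * recall / (precision + recall).
    When both precision and recall are 0, F1 is defined here as 0.0
    to avoid a division by zero.
    """
    scores = []
    for r, p in zip(recalls, precisions):
        scores.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)
    return scores


# Hypothetical per-epoch values, in the same shape as history.history[...]
recalls = [0.0, 0.5, 0.8]
precisions = [0.0, 1.0, 0.8]

f1_scores = f1_per_epoch(recalls, precisions)
print(f1_scores)
```

The zero guard matters in the first epochs, when a model that predicts no spam at all can report a recall and a precision of 0.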


# Transfer learning : BERT

### <span style="color:orange"><b>Comments :</b></span>
* See 03_bert.ipynb

# <span style="color:red"><b>Scrap book - Please ignore</b></span>

In [48]:
# k_Current_dir/k_AssetsDir/fig_id

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # Build the output path from the function's own argument, not the global `ext`
    path = Path(f"{k_Current_dir/k_AssetsDir/fig_id}.{fig_extension}")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [None]:
fig_id = "bob"
ext = "png"
filename = Path(f"{k_Current_dir/k_AssetsDir/fig_id}.{ext}")
print(filename)

In [None]:
_, ax = plt.subplots()  
ax.set(
    title="Confusion Matrix on Train set"
)  
ConfusionMatrixDisplay.from_estimator(
    classifier, 
    X_train, 
    Y_train, 
    ax=ax,
)  
plt.show()


In [None]:
# https://stackoverflow.com/questions/60860121/plotly-how-to-make-an-annotated-confusion-matrix-using-a-heatmap

import plotly.figure_factory as ff

z = [[0.1, 0.3, 0.5, 0.2],
     [1.0, 0.8, 0.6, 0.1],
     [0.1, 0.3, 0.6, 0.9],
     [0.6, 0.4, 0.2, 0.2]]

x = ['healthy', 'multiple diseases', 'rust', 'scab']
y =  ['healthy', 'multiple diseases', 'rust', 'scab']

# convert each element of z to a string for the annotations
# (loop names chosen so they do not shadow the x and y axis labels above)
z_text = [[str(value) for value in row] for row in z]

# set up figure 
fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text, colorscale='Viridis')

# add title
fig.update_layout(title_text='<i><b>Confusion matrix</b></i>',
                  #xaxis = dict(title='x'),
                  #yaxis = dict(title='x')
                 )

# add custom xaxis title
fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=0.5,
                        y=-0.15,
                        showarrow=False,
                        text="Predicted value",
                        xref="paper",
                        yref="paper"))

# add custom yaxis title
fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=-0.35,
                        y=0.5,
                        showarrow=False,
                        text="Real value",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

# adjust margins to make room for yaxis title
fig.update_layout(margin=dict(t=50, l=200))

# add colorbar
fig['data'][0]['showscale'] = True
fig.show()

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'The cat eats a mouse',
    'The dog barks loudly'
]

new_sentence = ['The lion roars']

# --------------------------
tokenizer_no_oov = Tokenizer(num_words=10)
tokenizer_no_oov.fit_on_texts(sentences)
print("Vocabulary without oov_token :", tokenizer_no_oov.word_index)

sequence_no_oov = tokenizer_no_oov.texts_to_sequences(new_sentence)
print("Sequence without oov_token   :", sequence_no_oov)




# --------------------------
tokenizer_with_oov = Tokenizer(num_words=10, oov_token='<OOV>')
tokenizer_with_oov.fit_on_texts(sentences)
print("Vocabulary with oov_token    :", tokenizer_with_oov.word_index)

sequence_with_oov = tokenizer_with_oov.texts_to_sequences(new_sentence)
print("Sequence with oov_token      :", sequence_with_oov)


