<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M08-deep-learning/AT%26T_logo_2016.svg" alt="AT&T LOGO" width="50%" height="50%"/>

# AT&T SPAM DETECTOR
## Company's Description 📇

AT&T Inc. is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the world's largest telecommunications company by revenue and the third largest provider of mobile telephone services in the U.S. As of 2022, AT&T was ranked 13th on the Fortune 500 rankings of the largest United States corporations, with revenues of $168.8 billion! 😮

## Project 🚧

One of the main pain points that AT&T users face is constant exposure to spam messages.

AT&T has been flagging spam messages manually for some time, but it is now looking for an automated way of detecting spam to protect its users.

## Goals 🎯

Your goal is to build a spam detector that can automatically flag spam messages as they arrive, based solely on the content of the SMS.

# 1. Importing data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('spam.csv', encoding='latin-1')

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.describe()

In [None]:
df.isnull().sum()



*   We can see that the data has 3 "Unnamed" columns that contain almost only missing values: we will delete these columns.
*   Let's rename columns v1 and v2 for better understanding: v1 = **label**, v2 = **message**



In [None]:
df = df[['v1', 'v2']].copy()

In [None]:
df.head()

In [None]:
df.rename(columns={'v1': 'label', 'v2': 'message'}, inplace=True)

In [None]:
df.head()

## Now let's check missing values

In [None]:
print(df.isnull().sum())

## Now let's convert values of the column label to numeric

*   **ham: 0**
*   **spam: 1**



In [None]:
df["label"] = df["label"].map({"ham": 0, "spam": 1})

In [None]:
df.head()

## Now we have to remove punctuation and lowercase all characters


In [None]:
import string

In [None]:
string.punctuation

In [None]:
# Lower case
df['lower_message'] = df['message'].fillna('').apply(lambda x: x.lower())

In [None]:
# Remove punctuation
df['clean_message'] = df['lower_message'].str.replace(r"[!\"#$%&()*+,-./:;<=>?@[\\\]^_`{|}~]+", " ", regex=True)

In [None]:
df.head()
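The two cleaning steps above can also be wrapped in a single helper. The sketch below is only an illustration: `clean_text` is a hypothetical name, and unlike the regex above it uses the full `string.punctuation` set (so apostrophes are removed too) and collapses repeated spaces.

```python
import string

# Map every ASCII punctuation character to a space.
# Assumption: the full string.punctuation set, which also includes the apostrophe.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Lowercase a message, replace punctuation with spaces, collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    return " ".join(no_punct.split())


print(clean_text("FREE entry!!! Call 0800-123 now..."))
# prints: free entry call 0800 123 now
```

With pandas, such a helper could be applied as `df['message'].fillna('').apply(clean_text)`.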

## Let's check whether the **label** column is balanced

In [None]:
df["label"].value_counts()



*   As expected, the **label** column is imbalanced: it is normal to have more **ham** than **spam** messages.
*   We will have to use a technique that compensates for this imbalance.



# 2. Preprocessing

We will use class weights to compensate for the imbalanced dataset, which avoids both adding artificial data and deleting data. For this we will use **compute_class_weight** provided by **sklearn**.

In [None]:
from sklearn.utils.class_weight import compute_class_weight

classes = np.array([0, 1])  # 0 = ham, 1 = spam
weights = compute_class_weight(class_weight='balanced', classes=classes, y=df['label'])
class_weights = dict(zip(classes, weights))
print("class weights:", class_weights)
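To make the `'balanced'` option less of a black box, here is a small sketch of the formula it applies: each class gets the weight `n_samples / (n_classes * count_of_that_class)`. The helper name `balanced_class_weights` and the toy labels are made up for illustration, they are not part of the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight of class c = n_samples / (n_classes * number of samples in c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy example: 8 ham (0) and 2 spam (1) messages
toy_labels = [0] * 8 + [1] * 2
print(balanced_class_weights(toy_labels))
# prints: {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so a mistake on a spam message costs more during training than a mistake on a ham message.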

We have to transform the text into numbers before the model can use it. For this we will use **TfidfVectorizer**, provided by **sklearn**, a common tool in **Natural Language Processing** (NLP). TfidfVectorizer is a feature extraction technique that converts a collection of raw text documents into a **matrix** of **TF-IDF** (Term Frequency-Inverse Document Frequency) features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

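# Split first, so the vectorizer learns its vocabulary and IDF weights from the training messages only
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['clean_message'], df['label'], test_size=0.2, random_state=42
)
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)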
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
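To illustrate what a TF-IDF matrix contains, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n_docs) / (1 + doc_freq)) + 1` followed by L2 normalisation, which is how the default TfidfVectorizer settings are documented. The tokenisation is a plain whitespace split (no stop words, no `max_features`), and the function name `tfidf` and the toy messages are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + freq)) + 1
        for term, freq in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


toy_docs = ["free prize call now", "call me later", "see you later"]
for vec in tfidf(toy_docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first toy message, "free" appears in one document only and therefore gets a higher weight than "call", which appears in two.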

# 3. Deep Learning Model

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")
])

In [None]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
model.summary()

In [None]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')

In [None]:
type(y_train)

In [None]:
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

In [None]:
type(y_train)

In [None]:
# Model training
history = model.fit(
    X_train.toarray(), y_train,  # Convert the sparse TF-IDF matrix to a dense array before passing it to Keras
    epochs=10,
    batch_size=32,
    validation_data=(X_test.toarray(), y_test),
    class_weight=class_weights  # use the class weights computed earlier
)
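The `class_weight` argument makes errors on the rarer class count more in the loss. Below is a minimal sketch of the idea, assuming a simple mean over weighted per-sample losses. The function name `weighted_bce` and the toy values are made up, and the exact reduction Keras applies internally is not claimed here.

```python
import math


def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Mean binary cross-entropy, each sample scaled by the weight of its class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * loss
    return total / len(y_true)


# Toy batch: three ham messages predicted well, one spam message missed
toy_true = [0, 0, 0, 1]
toy_prob = [0.1, 0.2, 0.1, 0.3]

print("unweighted:", weighted_bce(toy_true, toy_prob, {0: 1.0, 1: 1.0}))
print("weighted:  ", weighted_bce(toy_true, toy_prob, {0: 0.625, 1: 2.5}))
```

With the larger weight on class 1, the missed spam message dominates the loss, which pushes the model to pay more attention to spam.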

Let's save the model

In [None]:
model.save('spam_baseline.keras')  # relative path: saves next to the notebook instead of the filesystem root

Now let's evaluate the model

In [None]:
loss, accuracy = model.evaluate(X_test.toarray(), y_test)
print(f"Test accuracy: {accuracy:.4f} - Test Loss: {loss:.4f}")

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = (model.predict(X_test.toarray()) > 0.5).astype("int32")
print(classification_report(y_test, y_pred))
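For readers who want to see where the numbers in the classification report come from, here is a small sketch that computes precision, recall and F1 for the spam class by hand. The function name `binary_metrics` and the toy labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: 2 spam caught, 1 spam missed, 1 ham wrongly flagged
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 1, 0, 0]
print(binary_metrics(toy_true, toy_pred))
```

For a spam detector, precision tells us how many flagged messages were really spam, while recall tells us how many spam messages were caught.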

In [None]:
y_pred = model.predict(X_test.toarray())
y_pred = (y_pred > 0.5).astype(int)

In [None]:
import plotly.express as px

cm1 = confusion_matrix(y_test, y_pred)
fig = px.imshow(
    cm1,
    labels=dict(x="Predicted", y="Actual", color="Count"),
    x=["Ham", "Spam"],
    y=["Ham", "Spam"],
    text_auto=True,
    color_continuous_scale='Blackbody'
)

fig.update_layout(
    title="Confusion Matrix",
    xaxis_title="Prediction",
    yaxis_title="Actual"
)

fig.show()

# 4. Transfer Learning

For transfer learning, we will use a model hosted on Hugging Face: a text classification model published by Michael Shenoda. This model uses RoBERTa (Robustly Optimized BERT Pretraining Approach), which builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates. (https://huggingface.co/docs/transformers/model_doc/roberta)

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Load model and tokenizer from huggingface
model_name = "mshenoda/roberta-spam"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

# Check for GPU use
device = "/GPU:0" if tf.config.list_physical_devices('GPU') else "/CPU:0"



In [None]:
import pandas as pd

df = pd.read_csv("spam.csv",  encoding='latin-1')
df = df[['v1', 'v2']].copy()
df.columns = ['label', 'message']
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

In [None]:
df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

In [None]:
encodings = tokenizer(
    list(X_test), padding=True, truncation=True, max_length=512, return_tensors="tf"
)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(encodings))).batch(16)

In [None]:
encodings

In [None]:
test_dataset

In [None]:
# model predictions
y_pred_logits = model.predict(test_dataset).logits

# Convert logits into classes (0 or 1)
y_pred = np.argmax(y_pred_logits, axis=1)
y_true = y_test.values
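The model outputs raw logits, not probabilities. The sketch below shows how logits could be turned into probabilities with a softmax, and that taking the argmax of the logits or of the probabilities gives the same class. The `softmax` helper and the logit values are made up for illustration, they are not outputs of the model above.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for two messages: [ham_score, spam_score]
toy_logits = [[2.0, -1.0], [-0.5, 1.5]]

for row in toy_logits:
    probs = softmax(row)
    pred_class = max(range(len(probs)), key=probs.__getitem__)
    print(pred_class, [round(p, 3) for p in probs])
```

Keeping the probabilities around can be useful when a decision threshold other than "most likely class" is wanted.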


In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px

In [None]:
print(classification_report(y_true, y_pred))

As we can see, we get good scores, so we don't need to do any fine-tuning here: inference alone is enough.

In [None]:
cm2 = confusion_matrix(y_true, y_pred)
fig = px.imshow(
    cm2,
    labels=dict(x="Predicted", y="Actual", color="Count"),
    x=["Ham", "Spam"],
    y=["Ham", "Spam"],
    text_auto=True,
    color_continuous_scale='Blackbody'
)

fig.update_layout(
    title="Confusion Matrix",
    xaxis_title="Prediction",
    yaxis_title="Actual"
)

fig.show()

Save the model and tokenizer: to save the model we will use **save_pretrained**, provided by the transformers library, which is the recommended way to save models from this library.

In [None]:
model.save_pretrained("spam_tl")
tokenizer.save_pretrained("spam_tokenizer_tl")
print("Model & tokenizer saved!")

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Confusion matrix baseline model
sns.heatmap(cm1, annot=True, fmt='d', cmap="gnuplot", xticklabels=["Ham", "Spam"], yticklabels=["Ham", "Spam"], ax=axes[0])
axes[0].set_title("Confusion matrix - Baseline")
axes[0].set_xlabel("Predict")
axes[0].set_ylabel("Real")

# Confusion matrix transfer learning
sns.heatmap(cm2, annot=True, fmt='d', cmap="gnuplot", xticklabels=["Ham", "Spam"], yticklabels=["Ham", "Spam"], ax=axes[1])
axes[1].set_title("Confusion matrix - Transfer Learning")
axes[1].set_xlabel("Predict")
axes[1].set_ylabel("")

# Show charts
plt.tight_layout()
plt.show()


# 5. Gradio

In [None]:
!pip install gradio

In [None]:
import gradio as gr

In [None]:
import numpy as np

Load model and tokenizer

In [None]:
from transformers import TFRobertaForSequenceClassification, AutoTokenizer

model = TFRobertaForSequenceClassification.from_pretrained("spam_tl")
tokenizer = AutoTokenizer.from_pretrained("spam_tokenizer_tl")
print("Model & tokenizer loaded!")


In [None]:
def predict_spam(text):
    # Text Tokenization
    inputs = tokenizer(text, return_tensors="tf", truncation=True, padding=True, max_length=512)

    # Prediction with the model
    logits = model(inputs.data)[0]

    # Get predicted class (0 = Ham, 1 = Spam)
    pred_class = np.argmax(logits, axis=1)[0]

    result=""
    if pred_class == 1:
        result = "Spam 🛑"
    else:
        result = "Ham ✅"

    return result

In [None]:
app = gr.Interface(
    fn=predict_spam,
    inputs=gr.Textbox(label="Enter a message "),
    outputs=gr.Label(label="Result"),
    title="Spam Detector with Transfer Learning",
    description="Enter a message to see if it is detected as 'Spam' or 'Ham'!"
)

app.launch(debug=True)
