<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M08-deep-learning/AT%26T_logo_2016.svg" alt="AT&T LOGO" width="30%" />

# SPAM detector

## Company's Description 📇

AT&T Inc. is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the world's largest telecommunications company by revenue and the third largest provider of mobile telephone services in the U.S. As of 2022, AT&T was ranked 13th on the Fortune 500 rankings of the largest United States corporations, with revenues of $168.8 billion! 😮

## Project 🚧

One of the main pain points AT&T users face is constant exposure to spam messages.

AT&T has been able to flag spam messages manually for some time, but the company is looking for an automated way to detect spam and protect its users.

## Goals 🎯

Your goal is to build a spam detector that can automatically flag spam messages as they arrive, based solely on the SMS content.

## Deliverable 📬

To complete this project, your team should:

* Write a notebook that runs preprocessing and trains one or more deep learning models to predict whether an SMS is spam or ham
* State the achieved performance clearly

# Imports

In [3]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
dataset = pd.read_csv("/content/drive/MyDrive/Fichiers/2.Scolarité/1. Jedha_Data_Science/CERTIF_PROJECTS/ML_Engineer_Certification_Projects/06_DEEP_LEARNING_At&t/src/src_spam.csv", on_bad_lines='skip', encoding = "cp1252")

# Dataset exploring

In [6]:
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [8]:
dataset["Unnamed: 3"].unique()

array([nan, ' MK17 92H. 450Ppw 16"', ' why to miss them', 'GE',
       'U NO THECD ISV.IMPORTANT TOME 4 2MORO\\""',
       'i wil tolerat.bcs ur my someone..... But',
       ' ILLSPEAK 2 U2MORO WEN IM NOT ASLEEP...\\""',
       'whoever is the KING\\"!... Gud nyt"', ' TX 4 FONIN HON',
       ' \\"OH No! COMPETITION\\". Who knew', 'IåÕL CALL U\\""'],
      dtype=object)

### I will concatenate Unnamed: 2, 3 and 4 with v2 because they seem to contain the continuation of the text message.

In [9]:
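columns_to_concatenate = ["v2", "Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
# Caveat: astype(str) runs before fillna(''), so missing values have already become the string "nan"
# when fillna is called; this is why "nannannan" is appended to some messages in the output.
dataset["text_msg"] = dataset[columns_to_concatenate].astype(str).fillna('').apply(lambda row: ''.join(row), axis=1)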
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4,text_msg
0,ham,"Go until jurong point, crazy.. Available only ...",,,,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...,,,,Ok lar... Joking wif u oni...nannannan
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...,,,,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,,"Nah I don't think he goes to usf, he lives aro..."
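
The `nannannan` suffix visible in `text_msg` comes from the order of operations: `astype(str)` turns missing values into the string `nan` before `fillna('')` can act on them. A minimal sketch of one way to avoid this is to fill the gaps first and cast afterwards. The toy frame below is an assumption made for illustration; its values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw layout; the values are made up for illustration
toy = pd.DataFrame({
    "v2": ["Ok lar...", "first part,"],
    "Unnamed: 2": [np.nan, " second part"],
})
cols = ["v2", "Unnamed: 2"]

# Replace missing cells with empty strings BEFORE casting to str,
# then join the pieces of each row into a single message
toy["text_msg"] = (
    toy[cols].fillna("").astype(str).apply(lambda row: "".join(row), axis=1)
)

print(toy["text_msg"].tolist())
```

Applying the same ordering to the real columns would change the vocabulary and the encoded sequences shown later, so the notebook would need to be re-run from this cell onward.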


In [10]:
dataset = dataset.drop(columns=columns_to_concatenate)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   v1        5572 non-null   object
 1   text_msg  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [11]:
# Message length statistics (mean number of words per message)
len_message = dataset.copy()
len_message["word_count"] = len_message["text_msg"].apply(lambda x : len(x.split()))
mean_len_message = len_message["word_count"].mean()
mean_len_message

15.628320172290021

## Preprocessing the dataset for local training

In [12]:
# Encode labels: True for spam, False for ham
dataset["target"] = dataset["v1"] == "spam"
dataset.head()

Unnamed: 0,v1,text_msg,target
0,ham,"Go until jurong point, crazy.. Available only ...",False
1,ham,Ok lar... Joking wif u oni...nannannan,False
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,True
3,ham,U dun say so early hor... U c already then say...,False
4,ham,"Nah I don't think he goes to usf, he lives aro...",False


In [13]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [14]:
import en_core_web_sm
nlp = en_core_web_sm.load()
from spacy.lang.en.stop_words import STOP_WORDS

In [15]:
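# Preprocess the text messages so the model can be trained on them.
# Caveat: each step below reads from "text_msg", so the first two results are overwritten and only
# the last step (lemmatization and stop-word removal) ends up in "clean_text_msg".
# Also, str.replace(" +", " ") matches the literal characters " +", not a run of spaces.
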
dataset["clean_text_msg"] = dataset["text_msg"].apply(lambda x:''.join(ch for ch in x if ch.isalnum() or ch==" "))
dataset["clean_text_msg"] = dataset["text_msg"].apply(lambda x: x.replace(" +"," ").lower().strip())
dataset["clean_text_msg"] = dataset["text_msg"].apply(lambda x: " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) & (token.text not in STOP_WORDS)]))
data_clean = dataset[["clean_text_msg","target"]]
print(data_clean.info())
data_clean.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   clean_text_msg  5572 non-null   object
 1   target          5572 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 49.1+ KB
None


Unnamed: 0,clean_text_msg,target
0,"jurong point , crazy .. available bugis n grea...",False
1,ok lar ... joke wif u oni ... nannannan,False
2,free entry 2 wkly comp win FA Cup final tkts 2...,True
3,u dun early hor ... u c ... nannannan,False
4,"nah I think usf , live thoughnannannan",False
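
In the cell above, each cleaning step reads from `text_msg`, so only the lemmatization step is reflected in `clean_text_msg`, which is why punctuation and some capital letters survive in the output. As a minimal sketch, assuming the intent was to chain the steps, the character filtering, whitespace collapsing and lowercasing could be grouped in one helper. `clean_message` is a hypothetical name, and the spaCy lemmatization step is left out to keep the example self-contained.

```python
import re


def clean_message(text: str) -> str:
    """Keep letters, digits and spaces, collapse repeated spaces, lowercase."""
    # Step 1: drop every character that is neither alphanumeric nor a space
    kept = "".join(ch for ch in text if ch.isalnum() or ch == " ")
    # Step 2: collapse runs of spaces with a regular expression
    collapsed = re.sub(r" +", " ", kept)
    # Step 3: lowercase and trim, working on the result of the previous steps
    return collapsed.lower().strip()


print(clean_message("Free  entry!! WIN a £100 prize"))
```

In a DataFrame this helper could be applied with `dataset["text_msg"].apply(clean_message)` before the lemmatization step, so that each stage works on the output of the previous one.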


In [16]:
# Tokenizing the text messages

data_clean = data_clean.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(data_clean["clean_text_msg"])
data_clean["text_msg_encoded"] = tokenizer.texts_to_sequences(data_clean["clean_text_msg"])
data_clean["len_msg"] = data_clean["text_msg_encoded"].apply(len)
# Drop messages that are empty after encoding
data_clean = data_clean[data_clean["len_msg"] != 0]
data_clean.head()


Unnamed: 0,clean_text_msg,target,text_msg_encoded,len_msg
0,"jurong point , crazy .. available bugis n grea...",False,"[4005, 313, 477, 478, 1072, 31, 53, 199, 1073,...",15
1,ok lar ... joke wif u oni ... nannannan,False,"[12, 226, 677, 314, 3, 1652, 1]",7
2,free entry 2 wkly comp win FA Cup final tkts 2...,True,"[13, 321, 4, 579, 730, 41, 1653, 959, 453, 165...",26
3,u dun early hor ... u c ... nannannan,False,"[3, 132, 181, 2675, 3, 42, 1]",7
4,"nah I think usf , live thoughnannannan",False,"[791, 2, 22, 792, 137, 1076]",6
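
To make the encoding step concrete, the sketch below imitates the core idea of the Keras `Tokenizer` in plain Python: words are ranked by frequency and the most frequent word gets index 1, while index 0 stays free for padding. It is a simplified stand-in, not the Keras implementation (which also lowercases and strips punctuation by default); `build_word_index` and `encode` are hypothetical helpers and the sample texts are made up.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, starting at 1 for the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


sample = ["free entry win", "win free prize", "ok free"]
word_index = build_word_index(sample)
print(word_index)
print(encode(sample, word_index))
```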


In [17]:
# Train/validation split of the encoded messages and their target
X_train, X_val, y_train, y_val = train_test_split(data_clean['text_msg_encoded'], data_clean['target'].astype("int"), test_size=0.2, random_state=42)

In [18]:
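# Padding data (zeros are appended after each sequence)
X_train_pad = tf.keras.preprocessing.sequence.pad_sequences(X_train, padding="post")
# Pad the validation set to the training length so both match the model's input shape
X_val_pad = tf.keras.preprocessing.sequence.pad_sequences(X_val, padding="post", maxlen=X_train_pad.shape[1])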
y_train = y_train.to_numpy()
y_val = y_val.to_numpy()
print(X_train_pad.shape)
print(X_val_pad.shape)
print(y_train.shape)
print(y_val.shape)

(4457, 79)
(1115, 79)
(4457,)
(1115,)
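
`pad_sequences(..., padding="post")` appends zeros after each sequence until all rows share the same length. A rough pure-Python equivalent is sketched below, assuming a fixed `maxlen` and the Keras default of dropping the earliest tokens when a sequence is too long. `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence with `value` up to `maxlen` tokens."""
    padded = []
    for seq in sequences:
        # Sequences longer than maxlen keep only their last maxlen tokens
        trimmed = list(seq)[-maxlen:]
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded


print(pad_post([[12, 226, 677], [3]], maxlen=4))
print(pad_post([[1, 2, 3], [4]], maxlen=2))
```

Using the same `maxlen` for the training and validation sets guarantees that both match the input shape expected by the model.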


In [19]:
# Create TF datasets

train_dataset = tf.data.Dataset.from_tensor_slices((X_train_pad, y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val_pad, y_val))

train_dataset = train_dataset.shuffle(len(train_dataset)).batch(64)
val_dataset = val_dataset.batch(64)
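
With a batch size of 64, the number of batches per epoch follows from the split sizes printed earlier (4457 training rows and 1115 validation rows). A quick check of that arithmetic:

```python
import math

batch_size = 64
n_train, n_val = 4457, 1115  # row counts taken from the printed shapes

train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
# The last training batch is smaller because 4457 is not a multiple of 64
last_train_batch = n_train - (train_batches - 1) * batch_size

print(train_batches, val_batches, last_train_batch)  # 70 18 41
```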

### First test: locally trained simple RNN

In [32]:
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, GRU, LSTM, Input, Dropout

vocab_size = len(tokenizer.index_word) + 1  # +1 because index 0 is reserved for padding
model_simple_rnn = tf.keras.Sequential([
                Input([X_train_pad.shape[1],]),
                Embedding(vocab_size, 128, name="embedding"),
                SimpleRNN(units=256, return_sequences=False),
                Dense(64, activation='relu'),
                Dense(1, activation="sigmoid", name="last")
])

model_simple_rnn.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 79, 128)           1111936   
                                                                 
 simple_rnn_7 (SimpleRNN)    (None, 256)               98560     
                                                                 
 dense_6 (Dense)             (None, 64)                16448     
                                                                 
 last (Dense)                (None, 1)                 65        
                                                                 
Total params: 1227009 (4.68 MB)
Trainable params: 1227009 (4.68 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
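
The parameter counts in this summary can be reproduced by hand. The sketch below assumes the usual formulas for these layers (one weight per input and per recurrent connection, plus one bias per unit) and infers the vocabulary size from the embedding row of the summary.

```python
embedding_dim, rnn_units, dense_units = 128, 256, 64
vocab_size = 1111936 // embedding_dim  # inferred from the summary: 8687 entries

embedding = vocab_size * embedding_dim
# SimpleRNN: input kernel + recurrent kernel + bias, per unit
simple_rnn = rnn_units * (embedding_dim + rnn_units + 1)
# Dense layers: one weight per input plus one bias, per unit
dense = rnn_units * dense_units + dense_units
last = dense_units * 1 + 1

total = embedding + simple_rnn + dense + last
print(embedding, simple_rnn, dense, last, total)  # 1111936 98560 16448 65 1227009
```

About 90% of the parameters sit in the embedding layer, so the vocabulary size drives the model size far more than the recurrent layer does.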


In [33]:
# Train simple RNN model

model_simple_rnn.compile(optimizer='adam',
              loss="binary_crossentropy",
              metrics=['accuracy'])

history = model_simple_rnn.fit(
    train_dataset,
    epochs=10,
    validation_data=val_dataset
    )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [34]:
# GRU model

vocab_size = len(tokenizer.index_word) + 1  # +1 because index 0 is reserved for padding
model_gru = tf.keras.Sequential([
                Input([X_train_pad.shape[1],]),
                Embedding(vocab_size, 128, name="embedding"),
                GRU(units=264, return_sequences=False),
                Dense(32, activation='relu'),
                Dense(1, activation="sigmoid", name="last")
])

model_gru.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 79, 128)           1111936   
                                                                 
 gru (GRU)                   (None, 264)               312048    
                                                                 
 dense_7 (Dense)             (None, 32)                8480      
                                                                 
 last (Dense)                (None, 1)                 33        
                                                                 
Total params: 1432497 (5.46 MB)
Trainable params: 1432497 (5.46 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
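
The 312,048 parameters of the GRU layer follow from its three gates. Assuming the Keras default `reset_after=True`, each gate has an input kernel, a recurrent kernel and two bias vectors, which the short check below reproduces using the layer sizes from the model definition.

```python
embedding_dim, gru_units, dense_units = 128, 264, 32

# 3 gates (update, reset, candidate), each with:
#   input kernel + recurrent kernel + 2 bias vectors (reset_after=True)
gru = 3 * (gru_units * (embedding_dim + gru_units) + 2 * gru_units)
dense = gru_units * dense_units + dense_units
last = dense_units * 1 + 1

total = 1111936 + gru + dense + last  # embedding count taken from the summary
print(gru, dense, last, total)  # 312048 8480 33 1432497
```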


In [36]:
# Train GRU model

model_gru.compile(optimizer='adam',
              loss="binary_crossentropy",
              metrics=['accuracy'])

history = model_gru.fit(
    train_dataset,
    epochs=20,
    validation_data=val_dataset
    )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [37]:
# LSTM model

vocab_size = len(tokenizer.index_word) + 1  # +1 because index 0 is reserved for padding
model_lstm = tf.keras.Sequential([
                Input([X_train_pad.shape[1],]),
                Embedding(vocab_size, 128, name="embedding"),
                LSTM(units=256, return_sequences=False),
                Dense(64, activation='relu'),
                Dense(1, activation="sigmoid", name="last")
])

model_lstm.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 79, 128)           1111936   
                                                                 
 lstm (LSTM)                 (None, 256)               394240    
                                                                 
 dense_8 (Dense)             (None, 64)                16448     
                                                                 
 last (Dense)                (None, 1)                 65        
                                                                 
Total params: 1522689 (5.81 MB)
Trainable params: 1522689 (5.81 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
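
The LSTM layer holds four sets of weights (input, forget and output gates plus the cell candidate), each shaped like those of a plain recurrent layer with the same number of units. Under that assumption, the figure in the summary can be checked as follows:

```python
embedding_dim, lstm_units, dense_units = 128, 256, 64

# Each of the 4 weight sets has an input kernel, a recurrent kernel and a bias
per_gate = lstm_units * (embedding_dim + lstm_units) + lstm_units
lstm = 4 * per_gate
dense = lstm_units * dense_units + dense_units
last = dense_units * 1 + 1

total = 1111936 + lstm + dense + last  # embedding count taken from the summary
print(per_gate, lstm, total)  # 98560 394240 1522689
```

The per-gate count (98,560) equals the size of a `SimpleRNN` layer with 256 units on 128-dimensional inputs, so the LSTM layer is exactly four times larger.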


In [38]:
# Train LSTM model

model_lstm.compile(optimizer='adam',
              loss="binary_crossentropy",
              metrics=['accuracy'])

history = model_lstm.fit(
    train_dataset,
    epochs=20,
    validation_data=val_dataset
    )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
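
Since the deliverable asks for the achieved performance to be stated clearly, and accuracy alone can be misleading when classes are imbalanced (as is often the case for spam), precision, recall and F1 on the validation set are worth reporting too. A minimal sketch in plain Python follows. The labels and probabilities are made-up placeholders, not results from this notebook, and `binary_report` is a hypothetical helper. In the notebook, the probabilities would presumably come from a call such as `model_lstm.predict(X_val_pad)`.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class."""
    y_pred = [int(p >= threshold) for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Placeholder values for illustration only
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
report = binary_report(y_true, y_prob)
print(report)
```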
