# Tier 2. Module 4 - Deep Learning. Homework

## Lessons 9-10: Introduction to NLP. Word Embeddings

The task is to create a model that can distinguish spam from legitimate messages using text data techniques. The deep learning model must classify emails into spam (unwanted messages) and "ham" (legitimate messages) based on the provided [Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification), which contains samples of emails classified as spam or ham.

Technical task:
1. Conduct a preliminary analysis of the data.
2. Prepare the data for training the model.
3. Apply count-based methods and use pre-trained embeddings to represent the text data and build a classification model.
4. Evaluate the performance of the model and interpret the results.

### 1. Import of the required libraries

In [5]:
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import datetime
import gensim
from gensim.models import word2vec

from gensim.models import KeyedVectors #  implements word vectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, roc_auc_score

import matplotlib.cm as cm

import spacy

from tqdm.auto import tqdm
tqdm.pandas()

import matplotlib.pyplot as plt
import re

### 2. Data preparation

Data structure.

In [6]:
dataset_dir = "/kaggle/input/email-spam-detection-dataset-classification/"

In [3]:
!tree {dataset_dir} -L 2

/kaggle/input/email-spam-detection-dataset-classification/
`-- spam.csv

0 directories, 1 file


Data loading.

In [7]:
df = pd.read_csv(dataset_dir + "spam.csv", encoding = "ISO-8859-1") 

df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Data exploration.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [9]:
df_no_nan = df.dropna(subset=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
df_no_nan.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
281,ham,\Wen u miss someone,the person is definitely special for u..... B...,why to miss them,"just Keep-in-touch\"" gdeve.."""
1038,ham,"Edison has rightly said, \A fool can ask more ...",GN,GE,"GNT:-)"""
2255,ham,I just lov this line: \Hurt me with the truth,I don't mind,i wil tolerat.bcs ur my someone..... But,"Never comfort me with a lie\"" gud ni8 and swe..."
3525,ham,\HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...,HAD A COOL NYTHO,TX 4 FONIN HON,"CALL 2MWEN IM BK FRMCLOUD 9! J X\"""""
4668,ham,"When I was born, GOD said, \Oh No! Another IDI...",GOD said,"\""OH No! COMPETITION\"". Who knew","one day these two will become FREINDS FOREVER!"""


In [10]:
label_counts = df["v1"].value_counts()
label_counts

v1
ham     4825
spam     747
Name: count, dtype: int64

The label counts show that the dataset is strongly unbalanced: 4,825 ham messages versus only 747 spam messages.
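
To put a number on the imbalance, the counts printed above can be turned into class shares. This is a minimal sketch in plain Python; the dictionary simply restates the `value_counts()` output, and the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
label_counts = {"ham": 4825, "spam": 747}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
imbalance_ratio = label_counts["ham"] / label_counts["spam"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"ham-to-spam ratio: {imbalance_ratio:.2f}")
```

A classifier that always predicts ham would already reach about 86.6% accuracy on this data, which is one reason accuracy alone is a weak metric here.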

In addition, the columns 'Unnamed: 2' to 'Unnamed: 4' are almost entirely empty (50, 12 and 6 non-null values out of 5,572 rows), and the values that are present mostly belong to ham messages. These three columns are therefore dropped, because they carry almost no information.

Let's also rename the columns 'v1' and 'v2' to 'label' and 'message' to make the data easier to read.

In [11]:
df_prep = df[["v1", "v2"]].rename({"v1": "label", "v2": "message"}, axis=1)

df_prep.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Let's also drop duplicates, if any.

In [12]:
df_prep = df_prep.drop_duplicates().reset_index(drop=True)
label_counts = df_prep["label"].value_counts()
label_counts

label
ham     4516
spam     653
Name: count, dtype: int64

Let's convert the target variable into 0 and 1, typical labels for binary classification.

In [13]:
df_prep["label"] = df_prep["label"].apply(lambda x: 1 if x == "spam" else 0)
df_prep.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Text preprocessing.

Contractions from [source](http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python).

In [14]:
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

Stop-words.

In [15]:
stop_words = set(stopwords.words('english')).union({'also', 'would', 'much', 'many'})

negations = {
    'aren',
    "aren't",
    'couldn',
    "couldn't",
    'didn',
    "didn't",
    'doesn',
    "doesn't",
    'don',
    "don't",
    'hadn',
    "hadn't",
    'hasn',
    "hasn't",
    'haven',
    "haven't",
    'isn',
    "isn't",
    'mightn',
    "mightn't",
    'mustn',
    "mustn't",
    'needn',
    "needn't",
    'no',
    'nor',
    'not',
    'shan',
    "shan't",
    'shouldn',
    "shouldn't",
    'wasn',
    "wasn't",
    'weren',
    "weren't",
    'won',
    "won't",
    'wouldn',
    "wouldn't"
}

stop_words = stop_words.difference(negations)
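
Keeping negations out of the stop-word list matters because removing them can flip the meaning of a message. Here is a small illustration with a hand-picked toy stop-word set (not the NLTK list used above); the helper `remove_stop_words` is a hypothetical name introduced only for this example.

```python
# Toy stop-word set for illustration only; the notebook uses NLTK's English list.
toy_stop_words = {"i", "do", "not", "this", "is", "the"}
toy_negations = {"not", "no", "nor"}

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

message = "I do not want this offer"

with_negations_removed = remove_stop_words(message, toy_stop_words)
with_negations_kept = remove_stop_words(message, toy_stop_words - toy_negations)

print(with_negations_removed)  # want offer
print(with_negations_kept)     # not want offer
```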

In [16]:
stemmer = PorterStemmer() # During text normalization below, lemmatization is used instead of stemming

nlp = spacy.load("en_core_web_sm", disable = ['parser','ner'])

Text normalization.

In [17]:
def normalize_text(raw_review):
    
    # Remove HTML tags
    text = re.sub("<[^>]*>", " ", raw_review) # match "<", any characters except ">", then ">"
    
    # Remove e-mail addresses
    text = re.sub("\\S*@\\S*[\\s]+", " ", text) # match non-whitespace around "@", followed by whitespace
    
    # Remove links
    text = re.sub("https?:\\/\\/.*?[\\s]+", " ", text) # match "http" or "https", "://", then as few characters
                                                    # as possible up to the next run of whitespace
        
    # Convert to lower case, split into individual words
    text = text.lower().split()
    
    # Replace contractions with their full versions
    text = [contractions.get(word, word) for word in text]
   
    # Re-splitting for the correct stop-words extraction
    text = " ".join(text).split()    
    
    # Remove stop words
    text = [word for word in text if word not in stop_words]

    text = " ".join(text)
    
    # Remove non-letters        
    text = re.sub("[^a-zA-Z' ]", "", text) # match everything except letters and '

    # Stem words. Need to define porter stemmer above
    # text = [stemmer.stem(word) for word in text.split()]

    # Lemmatize words. Need to define lemmatizer above
    doc = nlp(text)
    text = " ".join([token.lemma_ for token in doc if len(token.lemma_) > 1 ])
    
    # Remove excessive whitespace
    text = re.sub("[\\s]+", " ", text)    
    
    # Return the normalized text as a single space-separated string.
    return text
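
To see what the regex and contraction steps do to a single message, here is a reduced sketch of the same pipeline. It leaves out stop-word removal and spaCy lemmatization so that it runs with the standard library alone. The function name `normalize_lite`, the two-entry contraction map and the sample message are illustrative and not part of the notebook.

```python
import re

# Two-entry stand-in for the full contractions dictionary defined above.
toy_contractions = {"don't": "do not", "i'm": "i am"}

def normalize_lite(raw_text):
    text = re.sub(r"<[^>]*>", " ", raw_text)        # strip HTML tags
    text = re.sub(r"\S*@\S*\s+", " ", text)         # strip e-mail addresses followed by whitespace
    text = re.sub(r"https?://.*?\s+", " ", text)    # strip links followed by whitespace
    words = [toy_contractions.get(w, w) for w in text.lower().split()]
    text = re.sub(r"[^a-zA-Z' ]", "", " ".join(words))  # keep letters, apostrophes, spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "<b>WIN</b> cash now! Mail me@spam.com or visit http://spam.example/win today, don't wait"
print(normalize_lite(sample))
# win cash now mail or visit today do not wait
```

Both the e-mail and the link pattern require trailing whitespace, so an address or link at the very end of a message is not matched and survives as plain letters: `normalize_lite("see http://x.io")` gives `"see httpxio"`.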

In [18]:
df_prep['text_normalized'] = df_prep['message'].progress_apply(normalize_text)

df_prep.head()

  0%|          | 0/5169 [00:00<?, ?it/s]

Unnamed: 0,label,message,text_normalized
0,0,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis great wo...
1,0,Ok lar... Joking wif u oni...,ok lar joking wif oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...
3,0,U dun say so early hor... U c already then say...,dun say early hor already say
4,0,"Nah I don't think he goes to usf, he lives aro...",nah not think go usf life around though


Test / train split.

In [19]:
train_idxs = df_prep.sample(frac=0.8, random_state=42).index
test_idxs = [idx for idx in df_prep.index if idx not in train_idxs]

In [20]:
X_train = df_prep.loc[train_idxs, 'text_normalized']
X_test = df_prep.loc[test_idxs, 'text_normalized']

y_train = df_prep.loc[train_idxs, 'label']
y_test = df_prep.loc[test_idxs, 'label']
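
Because the classes are unbalanced, a purely random 80/20 split can leave the test set with a slightly different spam share than the training set. One common alternative is a stratified split, which samples each class separately. The sketch below uses only the standard library; `stratified_split` is a hypothetical helper, not something the notebook defines.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=42):
    """Return (train_idxs, test_idxs) with roughly train_frac of each class in train."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idxs, test_idxs = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_frac)
        train_idxs.extend(idxs[:cut])
        test_idxs.extend(idxs[cut:])
    return sorted(train_idxs), sorted(test_idxs)

# Toy labels with a similar kind of imbalance: 90 ham (0) and 10 spam (1).
labels = [0] * 90 + [1] * 10
train_idxs, test_idxs = stratified_split(labels)

print(len(train_idxs), len(test_idxs))
print(sum(labels[i] for i in train_idxs), sum(labels[i] for i in test_idxs))
```

With pandas and scikit-learn, the same effect is usually obtained by passing `stratify=y` to `train_test_split`.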

### 3. Application of BoW and TF-IDF methods

Create and fit two vectorizers, both using unigrams and bigrams and keeping all features (`max_features=None`).

In [23]:
ngrams=(1,2) # using unigrams and bigrams
max_feats = None

Bag of Words vectorization.

In [24]:
vect_cv = CountVectorizer(ngram_range=ngrams, max_features=max_feats).fit(X_train)

len(vect_cv.vocabulary_)

30989
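
With `ngram_range=(1, 2)`, each document is described by counts of its single words and of its adjacent word pairs. The following plain-Python sketch mimics that idea on two toy messages. It is a simplification and not a reimplementation of `CountVectorizer`, which among other things applies its own tokenization rules; `word_ngrams` is a hypothetical helper.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max from a whitespace-tokenized text."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

docs = ["free entry win cup", "win free cup"]

# Vocabulary: sorted set of every n-gram seen in the corpus.
vocabulary = sorted({g for doc in docs for g in word_ngrams(doc)})

# Document-term matrix: one row of counts per document.
matrix = [[Counter(word_ngrams(doc))[term] for term in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Even these two short messages produce nine features, which gives a feel for why unigrams plus bigrams over the whole training set yield a vocabulary of about 31,000 entries.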

Term frequency-inverse document frequency vectorization.

In [25]:
vect_tfidf = TfidfVectorizer(ngram_range=ngrams, max_features=max_feats).fit(X_train)

len(vect_tfidf.vocabulary_)

30989
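
TF-IDF reweights the raw counts so that terms occurring in many documents contribute less. With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the inverse document frequency is `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and each document vector is then scaled to unit Euclidean length. Below is a small sketch of that computation on made-up toy counts, in plain Python.

```python
import math

# Toy document-term counts: rows are documents, columns are terms.
terms = ["free", "win", "meeting"]
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [0, 0, 3],
]

n_docs = len(counts)
# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed idf, as used by scikit-learn when smooth_idf=True.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight term counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for term, d, w in zip(terms, df, idf):
    print(f"{term}: df={d}, idf={w:.3f}")
```

This also explains why both vectorizers report the same vocabulary size: they extract identical n-grams and differ only in how the counts are weighted.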

Feature examples.

In [26]:
vect_cv.get_feature_names_out()[:5]

array(['aah', 'aah cuddle', 'aah speak', 'aaooooright',
       'aaooooright work'], dtype=object)

In [27]:
vect_tfidf.get_feature_names_out()[:5]

array(['aah', 'aah cuddle', 'aah speak', 'aaooooright',
       'aaooooright work'], dtype=object)

Transform the documents in the training and testing data to a document-term matrix.

In [28]:
X_train_vectorized_cv = vect_cv.transform(X_train)
X_test_vectorized_cv = vect_cv.transform(X_test)

In [29]:
X_train_vectorized_tfidf = vect_tfidf.transform(X_train)
X_test_vectorized_tfidf = vect_tfidf.transform(X_test)

### 4. Using pre-trained embeddings

Pre-trained FastText embeddings will be used for this task.

Load the embeddings pre-trained on WikiNews from Kaggle.

In [37]:
fasttext_path = '/kaggle/input/fasttext-wikinews/wiki-news-300d-1M.vec'

In [38]:
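def load_embeddings(filepath):
    embeddings_index = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            if len(values) == 2:  # skip the "<vocabulary size> <dimension>" header of the .vec format
                continue
            word = values[0]  # the first field is the word, the rest is its vector
            vector = np.array(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index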

In [39]:
fasttext_embeddings = load_embeddings(fasttext_path)

Function for converting text data into embedding vectors.

In [40]:
def text_to_embedding(text, embeddings_index, embedding_dim=300):
    words = text.split()  # Tokenize text into words
    # Words missing from the embedding index are represented by zero vectors
    embeddings = [embeddings_index.get(word, np.zeros(embedding_dim)) for word in words]
    if embeddings:  # The text contains at least one word
        return np.mean(embeddings, axis=0)  # Average word embeddings
    else:  # Empty text
        return np.zeros(embedding_dim)
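
One consequence of the function above is that out-of-vocabulary words are represented by zero vectors and still count toward the mean, which pulls the message vector toward the origin. The sketch below contrasts that with averaging only over known words, using a made-up 3-dimensional embedding table and plain lists instead of NumPy arrays. `mean_pool` and its `skip_unknown` flag are hypothetical names introduced for illustration.

```python
def mean_pool(text, embeddings_index, dim, skip_unknown=False):
    """Average word vectors; unknown words become zeros unless skip_unknown is set."""
    vectors = []
    for word in text.split():
        if word in embeddings_index:
            vectors.append(embeddings_index[word])
        elif not skip_unknown:
            vectors.append([0.0] * dim)
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Made-up vectors, for illustration only.
toy_embeddings = {"free": [1.0, 0.0, 2.0], "prize": [3.0, 2.0, 0.0]}

text = "free prize wkly txt"   # "wkly" and "txt" are not in the toy table

print(mean_pool(text, toy_embeddings, dim=3))                     # [1.0, 0.5, 0.5]
print(mean_pool(text, toy_embeddings, dim=3, skip_unknown=True))  # [2.0, 1.0, 1.0]
```

Which variant works better is an empirical question. The results reported in this notebook were obtained with the zero-vector variant.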

Convert the train and test datasets to embeddings.

In [42]:
def convert_dataset_to_embeddings(dataset, embeddings_index, embedding_dim=300):
    return np.array([text_to_embedding(text, embeddings_index, embedding_dim) for text in dataset])

In [45]:
X_train_embeddings = convert_dataset_to_embeddings(X_train, fasttext_embeddings)
X_test_embeddings = convert_dataset_to_embeddings(X_test, fasttext_embeddings)

In [46]:
X_train_embeddings[0]

array([-1.44466653e-01,  7.34333321e-02,  2.71666665e-02,  2.91000009e-02,
        9.23333392e-02,  8.66666716e-03,  1.01333335e-02,  6.39000013e-02,
        4.94000018e-02,  5.71333319e-02,  1.42666651e-02, -8.85666609e-02,
        4.80666645e-02,  6.28666654e-02, -3.33332755e-05, -3.49999964e-03,
        7.04333335e-02, -9.06000063e-02, -1.47666678e-01, -3.73000018e-02,
       -2.38133326e-01, -9.63666663e-02,  1.23133332e-01,  3.40666659e-02,
       -5.15666716e-02, -1.40133336e-01,  1.08333305e-02,  2.94999983e-02,
        1.90000108e-03,  4.24666665e-02,  1.29133329e-01,  8.56333300e-02,
        6.51333332e-02,  1.06666656e-02,  1.89666655e-02,  1.59966663e-01,
        1.51666701e-02, -3.09666693e-02,  7.86666665e-03,  5.83333010e-03,
       -5.61666675e-02, -6.70999959e-02, -5.96666569e-03,  1.62333325e-02,
        3.65666673e-02, -8.49666670e-02, -2.21333355e-02,  4.26666699e-02,
        5.13333315e-03, -5.96666569e-03, -4.26666951e-03, -6.01666681e-02,
       -7.34700024e-01,  

### 5. Building and training models

Model builder and trainer.

In [51]:
from sklearn.ensemble import GradientBoostingRegressor

def get_trained_model(x_train, y_train):
    model = GradientBoostingRegressor(random_state=42)
    model.fit(x_train, y_train)
    return model

Model trained on BoW data.

In [52]:
model_bow = get_trained_model(X_train_vectorized_cv, y_train)

Model trained on TF-IDF data.

In [53]:
model_tfidf = get_trained_model(X_train_vectorized_tfidf, y_train)

Model trained on FastText embeddings.

In [54]:
model_fasttext = get_trained_model(X_train_embeddings, y_train)

### 6. Model evaluation

Function for model evaluation. ROC AUC (area under the ROC curve) is used as the metric because the regressor outputs continuous scores for each message rather than discrete ham/spam labels. The standard accuracy metric requires discrete predictions, whereas AUC depends only on how well the scores rank spam above ham.

In [None]:
def evaluate_model(model, x_test, y_test):
    predictions = model.predict(x_test)
    print('AUC: ', roc_auc_score(y_test, predictions))
    return predictions
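
ROC AUC only needs a ranking of the test messages, not calibrated probabilities: it equals the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message, with ties counted as one half. Here is a brute-force sketch of that definition in plain Python. It is fine for small inputs (`roc_auc_score` is far more efficient); `pairwise_auc` and the toy scores are illustrative.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and regressor-style scores; scores may fall outside [0, 1].
y_true = [0, 0, 1, 1, 0, 1]
scores = [-0.05, 0.40, 0.35, 0.90, 0.10, 1.08]

print(pairwise_auc(y_true, scores))  # 8 of 9 pairs are ranked correctly
```

To report accuracy as well, the scores would first have to be turned into labels by choosing a threshold, for example 0.5.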

Model trained on BoW data.

In [61]:
predictions_bow = evaluate_model(model_bow, X_test_vectorized_cv, y_test)

AUC:       0.9600751241972616


Model trained on TF-IDF data.

In [62]:
predictions_tfidf = evaluate_model(model_tfidf, X_test_vectorized_tfidf, y_test)

AUC:       0.9611190336381175


Model trained on FastText embeddings.

In [63]:
predictions_fasttext = evaluate_model(model_fasttext, X_test_embeddings, y_test)

AUC:       0.9687572817343809


### 7. Analysis and interpretation of results

All three models, built on Bag of Words, TF-IDF (term frequency-inverse document frequency) and FastText embeddings pre-trained on WikiNews, showed similar results: their test ROC AUC lies between roughly 0.960 and 0.969. The model built on FastText embeddings performed best (0.969 versus 0.960 for BoW and 0.961 for TF-IDF), but only by a small margin. This suggests that all three text representations are effective for this task and can be applied in practice.

To improve the results, one could try balancing the spam and ham classes, or use FastText embeddings trained on a larger corpus than WikiNews.
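
As a starting point for the balancing idea, one simple option is to weight each training example inversely to its class frequency. The sketch below applies the common "balanced" heuristic, `weight = n_samples / (n_classes * class_count)`, to the class counts obtained after de-duplication. These counts cover the whole dataset; in practice the weights would be computed on the training split only.

```python
# Class counts after dropping duplicates (0 = ham, 1 = spam), from the output above.
class_counts = {0: 4516, 1: 653}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Per-example weights built from this mapping could then be passed through the `sample_weight` argument of the estimator's `fit` method. Whether this actually improves AUC on this dataset would need to be checked experimentally.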