# NLP First Assignment - Spam Filtering

Import necessary modules. 

**If these cells fail, you won't be able to run the whole notebook. Make sure that you are able to import all the needed packages!**

In [1]:
!nvidia-smi

'nvidia-smi' is not recognized as an internal or external command,
operable program or batch file.


Make sure to match the cuda version of spacy!

In [2]:
#Update Spacy for 3.0 functionality (new features), install GPU support for better performance
!pip install -U pip setuptools wheel
!pip install --upgrade 'spacy[cuda112]'

Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 125.5 kB/s eta 0:00:00
Collecting setuptools
  Downloading setuptools-67.6.1-py3-none-any.whl (1.1 MB)
     ---------------------------------------- 1.1/1.1 MB 129.6 kB/s eta 0:00:00
Collecting wheel
  Downloading wheel-0.40.0-py3-none-any.whl (64 kB)
     -------------------------------------- 64.5/64.5 kB 119.9 kB/s eta 0:00:00
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: wheel
    Found existing installation: wheel 0.38.4
    Uninstalling wheel-0.38.4:
      Successfully uninstalled wheel-0.38.4
Successfully installed pip-23.0.1 setuptools-67.6.1 wheel-0.40.0



[notice] A new release of pip available: 22.3.1 -> 23.0.1
[notice] To update, run: C:\Users\Zee Tech\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
ERROR: Invalid requirement: "'spacy[cuda112]'"


In [3]:
from typing import List
import pandas as pd
import spacy

import numpy as np
import sklearn

## 1. Task - Corpus visualization (2 Points)

In [4]:
# Download corpus
!wget http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip
!unzip smsspamcollection.zip

'wget' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


- Read in the data into the `df` variable. 
- Display the first 5 rows of the DataFrame. 
- Check if there are missing values in the database.

*Hint: use pandas's `read_csv` method, the dataset does not have headers, and each column is separated by a tab.*

In [5]:
########################## Implement your solution BELOW ##########################
df=pd.read_csv("SMSSpamCollection.txt",sep="\t",names=["label","sms"])
df.head()
########################## Implement your solution ABOVE ##########################

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


- Discribe the database below in one or two sentences.
- The given database is of messages that are labeled either spam or ham. There are 5573 records and 2 columns. The 482 records are ham  and 747 are spam. The given data is severely imbalanced.

    **YOUR SOLUTION COMES HERE...**

In [6]:
df.isnull().sum()

label    0
sms      0
dtype: int64

In [7]:
df.shape

(5572, 2)

In [8]:
df["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

## 2. Task - Spacy POS-tagging and tokenization (5 Points)

- Download the `en_core_web_sm` spacy English model.
- Load the model into the nlp variable.

In [9]:
########################## Implement your solution BELOW ##########################

nlp=spacy.load("en_core_web_sm")

########################## Implement your solution ABOVE ##########################
print("Pipeline:", nlp.pipe_names)



Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


### Define a function for tokenization which,
  - Takes two arguments: a list of strings - SMSs - to be tokenized and an nlp model
  - Returns the tokenized version of each email in a `Doc` object in a list

  **Be smart with how you achieve the tokenization. We only accept an optimal (fast) solution!**

In [10]:
########################## Implement your solution BELOW ##########################
def tokenize(sms, spacy_nlp):
    tokens=[]
    doc = spacy_nlp(sms)
    for token in doc:
        tokens.append(token.text)
    return tokens
########################## Implement your solution ABOVE ##########################

- Construct a "tokenized" column to our original database, with your `tokenize` function applied to the data from the appropriate column of the database.

In [11]:
df["tokenized"]=df["sms"].apply(lambda x:tokenize(x,nlp))

In [12]:
df.head()

Unnamed: 0,label,sms,tokenized
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,..."


### Define a function for POS tagging, which
  - Takes the tokenized document (string) - a Doc object - as the only parameter
  - Returns a list with the POS tags (string) of each token in the document

In [13]:
def pos_tag(tokens):
    pos_list= []
    ######################## Implement your solution BELOW ########################
    doc=nlp(" ".join(tokens))
    for token in doc:
        pos_list.append(token.pos_)
    ######################## Implement your solution ABOVE ########################
    return pos_list

- Construct a `pos_tagged` column to our original database, with your `pos_tag` function applied to the data from the appropriate column of our database.

In [14]:
df["pos_tagged"]=df["tokenized"].apply(lambda x:pos_tag(x))

In [15]:
df.head()

Unnamed: 0,label,sms,tokenized,pos_tagged
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[VERB, ADP, PROPN, NOUN, PUNCT, ADJ, PUNCT, AD..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[INTJ, ADJ, PUNCT, VERB, NOUN, PROPN, NOUN, PU..."
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[ADJ, NOUN, ADP, NUM, DET, ADJ, NOUN, PART, VE..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[PROPN, PROPN, VERB, ADV, ADJ, NOUN, PUNCT, NO..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[INTJ, PRON, AUX, PART, VERB, PRON, VERB, ADP,..."


## 3. Task - Spacy stopword filtering (3 Points)

Use Spacy to apply stopword filtering to the corpus. Remove words like "is", "a", etc. 

Run the following cell, to import `STOP_WORDS`.

In [16]:
from spacy.lang.en.stop_words import STOP_WORDS

Define the stopword filtering function which also lemmatizes the tokens and,
  - Takes the tokenized document (string) - a Doc object - as the only parameter
  - And returns a list of strings with the stop words filtered and the tokens reduced to their lemmas. 

In [17]:
def stopword_filter_and_lematize(tokens):
    filtered_lemmas= []
    ######################## Implement your solution BELOW ########################
    doc=nlp(" ".join(tokens))
    filtered_tokens = [token.text for token in doc if not token.is_stop]   #stopwords filtering
    filtered_tokens= " ".join(filtered_tokens)
    doc=nlp(filtered_tokens)
    filtered_lemmas = [token.lemma_ for token in doc]           #lemmatization
    ######################## Implement your solution ABOVE ########################
    return filtered_lemmas

- Construct a `stopword_filtered_lemmas` column to our original database, with your `stopword_filter_and_lematize` function applied to the data from the appropriate column of the database.

In [18]:
df["stopword_filtered_lemmas"]=df["tokenized"].apply(lambda x:stopword_filter_and_lematize(x))

In [19]:
df.head()

Unnamed: 0,label,sms,tokenized,pos_tagged,stopword_filtered_lemmas
0,ham,"Go until jurong point, crazy.. Available only ...","[Go, until, jurong, point, ,, crazy, .., Avail...","[VERB, ADP, PROPN, NOUN, PUNCT, ADJ, PUNCT, AD...","[jurong, point, ,, crazy, .., available, bugis..."
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, ..., Joking, wif, u, oni, ...]","[INTJ, ADJ, PUNCT, VERB, NOUN, PROPN, NOUN, PU...","[ok, lar, ..., joke, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[ADJ, NOUN, ADP, NUM, DET, ADJ, NOUN, PART, VE...","[free, entry, 2, wkly, comp, win, FA, Cup, fin..."
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, ..., U, c, alrea...","[PROPN, PROPN, VERB, ADV, ADJ, NOUN, PUNCT, NO...","[U, dun, early, hor, ..., u, c, ...]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[Nah, I, do, n't, think, he, goes, to, usf, ,,...","[INTJ, PRON, AUX, PART, VERB, PRON, VERB, ADP,...","[Nah, think, go, usf, ,, live]"


## 4. Task - Prepare the data for classification (5 Points)

We would like to do binary classification into two categories: spam and not spam.

- Construct a new column named `label` to our database by transforming the string labels "ham" and "spam" of the appropriate column into machine understandable binary values.

  - *Hint: Use the `apply` method!*

In [20]:
########################## Implement your solution BELOW ##########################
df["label"]=df["label"].apply(lambda x:1 if x=="spam" else 0)   #replacing ham with 0 spam with 1
########################## Implement your solution ABOVE ##########################

- Split the database into train and test sets, save both into `df_train` and `df_test` respectively.

*Hint: Don't forget to shuffle the database before the split!*

In [21]:
df=df.sample(frac=1)

In [22]:
########################## Implement your solution BELOW ##########################
from sklearn.model_selection import train_test_split
df_train,df_test=train_test_split(df,test_size=0.20,random_state=23)
########################## Implement your solution ABOVE ##########################

- Construct a `TfidfVectoreizer` model, which
  - uses the `"word"` analyzer
  - its token pattern is `None`
  - **and make sure to specify the `tokenizer` and the `preprocessor` as the identity function.**
    - *Note: The reason behind this is that we already did these steps with `spacy`.*

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
def identity_tokenizer(stopword_filtered_leamms):
    return " ".join(stopword_filtered_leamms)

def identity_preprocessor(stopword_filtered_leamms):
    return " ".join(stopword_filtered_leamms)

In [25]:
########################## Implement your solution BELOW ##########################
vectorizer = TfidfVectorizer(analyzer='word', token_pattern=None,
                        tokenizer=identity_tokenizer, preprocessor=identity_preprocessor)
########################## Implement your solution ABOVE ##########################

- Use the `df_train` set's `stopword_filtered_leamms` column to fit the vectorizer and transform the data to the `x_train` variable.
  - *Hint: There is a method for exactly this.*
- Use the trained vectorizer to transform the `df_test` set's `stopword_filtered_leamms` column and save the result into the `x_test` variable.
- Create the `y_train` and `y_test` variables by copying the data from the appropriate set's `label` column to a numpy array.

In [26]:
########################## Implement your solution BELOW ##########################
x_train=vectorizer.fit_transform(df_train["stopword_filtered_lemmas"])
x_test=vectorizer.transform(df_test["stopword_filtered_lemmas"])
y_train=df_train.label
y_test=df_test.label
########################## Implement your solution ABOVE ##########################
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(4457, 112) (4457,)
(1115, 112) (1115,)


## 5. Task - ML-based Classification (5 Points)

You should compare 3 machine learning based classifications models by evaluating their performance on the corpus.

Use the `metrics` module to evaluate the performance of the classifiers.

In [27]:
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

### 5/a. Random Forest

In [28]:
from sklearn.ensemble import RandomForestClassifier

- Train multiple `RandomForestClassifier`s. It has a hyperparameter, namely the depth of the tree. Explore how different values effect the performance by testing with 4 different depths.
  - *Hint: Make sure to initalize each RandomForest with the same random state to achieve a fair comparison.*

In [29]:
########################## Implement your solution BELOW ##########################
rnd_forest_clf_list=RandomForestClassifier()
########################## Implement your solution ABOVE ##########################

- Predict with each `RandomForestClassifier` into the `rnd_test_perds` list.
- Evaluate the results of each classifier with the results of the following functions from the `metrics` module:
  - The `accuracy_score` (save the results into the `rnd_acc_list` variable).
  - The `balanced_accuracy_score` (save the results into the `rnd_bal_acc_list` varaible).
  - The `confusion_matrix` (save the results into the `rnd_conf_mtx_list` variable).
- Print out the results.

In [30]:

# Define the values of max_depth to try
max_depths = [5, 10, 15, 20]

# Create a list to store the classifiers, accuracy scores, balanced accuracy scores, and confusion matrices
rnd_forest_clf_list = []
rnd_acc_list = []
rnd_bal_acc_list = []
rnd_conf_mtx_list = []

# Loop over the values of max_depth and fit a classifier for each value
for max_depth in max_depths:
    clf = RandomForestClassifier(max_depth=max_depth)
    clf.fit(x_train, y_train)
    rnd_forest_clf_list.append(clf)
    
    # Make predictions on the test data and evaluate the performance
    rnd_test_perds = clf.predict(x_test)
    rnd_acc_list.append(accuracy_score(y_test, rnd_test_perds))
    rnd_bal_acc_list.append(balanced_accuracy_score(y_test, rnd_test_perds))
    rnd_conf_mtx_list.append(confusion_matrix(y_test, rnd_test_perds))

# Print out the results
for i, max_depth in enumerate(max_depths):
    print(f"Results for max_depth={max_depth}:")
    print(f"Accuracy score: {rnd_acc_list[i]}")
    print(f"Balanced accuracy score: {rnd_bal_acc_list[i]}")
    print(f"Confusion matrix:\n{rnd_conf_mtx_list[i]}\n")


Results for max_depth=5:
Accuracy score: 0.9829596412556054
Balanced accuracy score: 0.9379084967320261
Confusion matrix:
[[962   0]
 [ 19 134]]

Results for max_depth=10:
Accuracy score: 0.9874439461883409
Balanced accuracy score: 0.9542483660130718
Confusion matrix:
[[962   0]
 [ 14 139]]

Results for max_depth=15:
Accuracy score: 0.989237668161435
Balanced accuracy score: 0.9635325370619489
Confusion matrix:
[[961   1]
 [ 11 142]]

Results for max_depth=20:
Accuracy score: 0.9874439461883409
Balanced accuracy score: 0.9569965893495305
Confusion matrix:
[[961   1]
 [ 13 140]]



- Discribe the results of the evaluation. What can be said about the different hyperparameters?


    **YOUR SOLUTION COMES HERE...**


### 5/b. Naive Bayes

In [31]:
from sklearn.naive_bayes import MultinomialNB

- Train a `MultinomialNB` (Naive Bayes) model on the training set.
- Predict with the trained model on the test set and save the prediction into the `nb_test_pred` variable.
- Evaluate the results of the classifier with the results of the following functions from the `metrics` module:
  - The `accuracy_score` (save the results into the `nb_acc` variable).
  - The `balanced_accuracy_score` (save the results into the `nb_bal_acc` varaible).
  - The `confusion_matrix` (save the results into the `nb_conf_mtx` variable).
- Print out the results.

In [32]:
########################## Implement your solution BELOW ##########################
nb_clf=MultinomialNB()

nb_clf.fit(x_train,y_train)
# predict values
nb_test_pred = nb_clf.predict(x_test)

# Accuracy
nb_acc = accuracy_score(y_test, nb_test_pred)
nb_bal_acc=balanced_accuracy_score(y_test, nb_test_pred)
nb_conf_mtx = confusion_matrix(y_test, nb_test_pred)

print(f"Accuracy score:{nb_acc}")
print(f"Balanced accuracy score: {nb_bal_acc}")
print(f"Confusion matrix:\n{nb_conf_mtx}")
########################## Implement your solution ABOVE ##########################

Accuracy score:0.8753363228699551
Balanced accuracy score: 0.5484998573233868
Confusion matrix:
[[961   1]
 [138  15]]


### 5/c. LinearModel - SVM

In [33]:
from sklearn.svm import LinearSVC

- Train a `LinearSVC` (Support Vector Classifier) model on the training set.
- Predict with the trained model on the test set and save the prediction into the `svm_test_pred` variable.
- Evaluate the results of the classifier with the results of the following functions from the `metrics` module:
  - The `accuracy_score` (save the results into the `svm_acc` variable).
  - The `balanced_accuracy_score` (save the results into the `svm_bal_acc` varaible).
  - The `confusion_matrix` (save the results into the `svm_conf_mtx` variable).
- Print out the results.

In [34]:
########################## Implement your solution BELOW ##########################
svm_clf=LinearSVC()


svm_clf.fit(x_train,y_train)
# predict values
svm_test_pred = nb_clf.predict(x_test)

# Accuracy
svm_acc = accuracy_score(y_test, svm_test_pred)
svm_bal_acc=balanced_accuracy_score(y_test, svm_test_pred)
svm_conf_mtx = confusion_matrix(y_test, svm_test_pred)


print(f"Accuracy score:{svm_acc}")
print(f"Balanced accuracy score: {svm_bal_acc}")
print(f"Confusion matrix:\n{svm_conf_mtx}")

########################## Implement your solution ABOVE ##########################

Accuracy score:0.8753363228699551
Balanced accuracy score: 0.5484998573233868
Confusion matrix:
[[961   1]
 [138  15]]


### 5/d Model Comparisons
- Discribe the results of the whole evaluation of the 3 different models. What can be said about their performance?


Basically, we tried three different models each of them with a different intuition to solve a problem. The Random Forest Classifier this the best model among these three. It has an accuracy score is equal to 0.9802690582959641, and a balanced accuracy score of 0.9397713199144326 with max_depth=20. The remaining two models giving the same strangely gave the same results of 88% accuracy with a 54% balanced accuracy score. There is a significant difference between the balanced accuracy score of RFC and SVM and NB. When there is a significant difference between the balanced accuracy scores of two models on the same data, it indicates that the models are performing differently in terms of their ability to correctly classify each class in a balanced way. Balanced accuracy is a metric that takes into account the distribution of the classes in the data set, and it provides a more accurate assessment of model performance when the classes are imbalanced. A balanced accuracy score of 0.5 indicates that the model is performing no better than random, while a score of 1.0 indicates perfect classification accuracy. If one model has a significantly higher balanced accuracy score than another model on the same data, it means that the former model is better at correctly classifying each class in a balanced way. This could be due to a variety of factors, such as the choice of algorithm, the quality of the features used for classification, or the tuning of hyperparameters. It is important to note that while balanced accuracy is a useful metric for assessing model performance on imbalanced datasets, it should be used in conjunction with other metrics such as precision, recall, and F1-score to get a more complete picture of the model's performance.

