# Machine Learning - Exercise 3
# SMS SPAM classification

To run the experiments on the SMSSpamCollection dataset, you need to set up Colab so that it can load the data. To do this, perform the following actions:

*   download the dataset available at this [link](https://drive.google.com/a/diag.uniroma1.it/file/d/17YZemn1MidhFA0-wenfVolZAwclLRUXM/view)
*   copy the dataset into a folder of your personal Drive
*   mount your Google Drive (more details will follow)
*   set the correct path for loading the dataset (more details will follow)

## Import needed libraries

In [12]:
import numpy as np
import pandas as pd
import random

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import *
from sklearn.naive_bayes import *
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print('Libraries imported.')

Libraries imported.


## Load data

Mount Google Drive by running the cell below and following the authorization instructions that Colab displays.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


To load the file, set the correct path to the dataset in your Drive. Once mounted, your Drive behaves like a Linux file system, so you can inspect folders by running commands such as `ls` or `cd` preceded by `%`.

In [3]:
# Example path for a dataset copied into the My Drive folder: '/content/drive/My Drive/SMSSpamCollection'
filename = 'C:/Users/Gianmarco/Università-Git/MachineLearning/test/Datasets/SMSSpamCollection'
db = pd.read_csv(filename, sep='\t', header=None, names=['label', 'text'])
print('File loaded: %d samples.' %(len(db.label)))

File loaded: 5572 samples.


Show a random sample

In [4]:
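# Use sample_id rather than id, which would shadow the Python built-in id().
sample_id = random.randrange(0,len(db.label))
print('- ID: %d\n- Label: %s\n- Text: %s' %(sample_id,db.label[sample_id],db.text[sample_id]))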

- ID: 1211
- Label: ham
- Text: Guessin you ain't gonna be here before 9?


## Choose vectorizer

Convert every message into a feature vector with the selected vectorizer. More info:

*   [Hashing](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
*   [Count](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
*   [Tfidf](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)



In [5]:
vectorizer_type = "count" # "hashing", "count" or "tfid"

if vectorizer_type == "hashing":
  vectorizer = HashingVectorizer(stop_words='english') # multivariate
elif vectorizer_type == "count":
  vectorizer = CountVectorizer(stop_words='english') # multinomial
elif vectorizer_type == "tfid":
  vectorizer = TfidfVectorizer(stop_words='english')

X_all = vectorizer.fit_transform(db.text)
y_all = db.label

print(F"X (Sparse Matrix) shape: {X_all.shape}")
print(F"Y (GT Labels) shape: {y_all.shape}")
 

X (Sparse Matrix) shape: (5572, 8444)
Y (GT Labels) shape: (5572,)
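To make the shape of `X_all` less abstract, here is a minimal sketch of what a count vectorizer produces. The three messages are hypothetical and not taken from SMSSpamCollection, and the sketch leaves out `stop_words='english'` so that every short word stays in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from SMSSpamCollection.
toy_messages = [
    "free prize call now",
    "call me now",
    "free free prize",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each term to its column index; sort the terms by column.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense_counts = toy_counts.toarray()

print(vocabulary)    # one column per distinct term
print(dense_counts)  # one row per message, one count per term
```

Each row is one message and each column is one term, which is why the full dataset yields one row per SMS and one column per distinct term.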


## Split data

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
          test_size=0.2, random_state=16)

print("Train: %d - Test: %d" %(X_train.shape[0],X_test.shape[0]))

#id = random.randrange(0,X_train.shape[0])
#print('%d ' %(id))
#print('%d %s %s' %(id,str(y_train[id]),str(X_train[id])))


Train: 4457 - Test: 1115


## Create and fit Model

In [7]:
model_type = "multinomial" # "bernoulli" or "multinomial"

if model_type == "bernoulli":
  model = BernoulliNB().fit(X_train, y_train)
  print('Bernoulli Model created')
elif model_type == "multinomial":
  model = MultinomialNB().fit(X_train, y_train)
  print('Multinomial Model created')

Multinomial Model created
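The two models treat the features differently: `BernoulliNB` only models whether a term is present, while `MultinomialNB` models how often it occurs. The sketch below illustrates the first half of that statement on a hypothetical toy count matrix. It relies on the default `binarize=0.0` of `BernoulliNB`, which maps every positive value to 1 before fitting.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy count matrix: 4 messages x 3 terms.
toy_counts = np.array([[2, 0, 1],
                       [3, 1, 0],
                       [0, 2, 0],
                       [0, 1, 4]])
toy_labels = np.array(["spam", "spam", "ham", "ham"])

# Fit once on raw counts and once on presence/absence indicators.
bernoulli_on_counts = BernoulliNB().fit(toy_counts, toy_labels)
bernoulli_on_presence = BernoulliNB().fit((toy_counts > 0).astype(int), toy_labels)

# The learned per-class term probabilities are identical in both cases.
same_parameters = np.allclose(bernoulli_on_counts.feature_log_prob_,
                              bernoulli_on_presence.feature_log_prob_)
print(same_parameters)
```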


## Evaluation

In [8]:
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[962   9]
 [ 10 134]]
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       971
        spam       0.94      0.93      0.93       144

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115
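The numbers in the report can be recomputed by hand from the confusion matrix. The sketch below does this for the `spam` class, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`ham`, `spam`).

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are
# predicted labels, both in the order [ham, spam].
cm = np.array([[962, 9],
               [10, 134]])

true_ham, false_spam = cm[0]   # ham kept as ham, ham flagged as spam
false_ham, true_spam = cm[1]   # spam missed as ham, spam caught as spam

spam_precision = true_spam / (true_spam + false_spam)
spam_recall = true_spam / (true_spam + false_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
accuracy = np.trace(cm) / cm.sum()

print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
print(f"spam f1-score:  {spam_f1:.2f}")
print(f"accuracy:       {accuracy:.2f}")
```

The rounded values agree with the `spam` row and the accuracy row of the report: 0.94, 0.93, 0.93 and 0.98.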



## Prediction

In [9]:
smsnew1 = np.array(['Hello, did you solve ML exercise?'])
# We ask our model to transform our previous SMS into a 1x8444 sparse Matrix
# Then we predict based on the previous transformation
xnew1 = vectorizer.transform(smsnew1)
ynew1 = model.predict(xnew1) 
print('%s has been classified as:  %s' %(smsnew1,ynew1))

# Same goes here.
smsnew2 = np.array(['You won $1,000! Call now 1-800-1234567'])
xnew2 = vectorizer.transform(smsnew2)
ynew2 = model.predict(xnew2)
print('%s has been classified as:  %s' %(smsnew2,ynew2))


['Hello, did you solve ML exercise?'] has been classified as:  ['ham']
['You won $1,000! Call now 1-800-1234567'] has been classified as:  ['spam']
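The transform-then-predict steps can also be bundled into a single object, so that raw strings go straight to `predict`. This is a minimal sketch on a hypothetical toy training set, not the notebook's fitted `vectorizer` and `model`. It also shows `predict_proba`, which returns one probability per class instead of only the label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set, not taken from SMSSpamCollection.
toy_texts = [
    "win a free prize now",
    "free cash call now",
    "are you coming to class",
    "see you at lunch",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the text and then feeds the result to the classifier.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

new_sms = ["free prize call now"]
predicted_label = toy_pipeline.predict(new_sms)
class_probabilities = toy_pipeline.predict_proba(new_sms)

# Columns of class_probabilities follow the order of classes_.
classes = toy_pipeline.named_steps["multinomialnb"].classes_
print(classes)
print(predicted_label, class_probabilities)
```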


## Home Exercises

**Question 1**

Design and implement an evaluation procedure to assess and compare the performance of the three vectorizers and the two models proposed above.




In [18]:
def generatePredictions(model, vectorizerName):
    """Fit `model` on features from the named vectorizer and return its test accuracy."""

    # Choose the appropriate vectorizer.
    if vectorizerName == "hashing":
        vectorizer = HashingVectorizer(stop_words='english') # multivariate
    elif vectorizerName == "count":
        vectorizer = CountVectorizer(stop_words='english') # multinomial
    elif vectorizerName == "tfid":
        vectorizer = TfidfVectorizer(stop_words='english')
    else:
        raise ValueError(F"Unknown vectorizer: {vectorizerName}")

    X_all = vectorizer.fit_transform(db.text)
    y_all = db.label

    # random.seed(None) returns None, so the original call was equivalent to
    # random_state=None: every call draws a new random split, and each
    # configuration is therefore scored on a different test set.
    X_train, X_test, y_train, y_test = train_test_split(X_all, y_all,
            test_size=0.2, random_state=None)

    # Fit the model on the training split.
    model.fit(X_train, y_train)

    # Predict on the held-out test split created just above.
    y_pred = model.predict(X_test)

    # accuracy_score expects the true labels first.
    acc = accuracy_score(y_test, y_pred)

    return acc

df = pd.DataFrame({'Model': ["Bernoulli", "Bernoulli", "Bernoulli", "Multinomial", "Multinomial", "Multinomial"], 
                    'Vectorizer': ["Hashing","Count","TFID", "Hashing","Count","TFID"],
                     'Accuracy': [0.0] * 6,  # float placeholders, so scores are stored without a dtype change
                     '': ["","","","","",""]
                    })

print("Predictions for the two models Bernoulli / Multinomial with the three vectorizers.")


### ==== BERNOULLI / Hashing - Count - TFID ==== ###
modelBernoulli = BernoulliNB()

df.iat[0, 2] = generatePredictions(modelBernoulli, "hashing")
df.iat[1, 2] = generatePredictions(modelBernoulli, "count")
df.iat[2, 2] = generatePredictions(modelBernoulli, "tfid")

### ========= Multinomial / Hashing - Count - TFID ======== ###
modelMultinomial = MultinomialNB()

# MultinomialNB rejects negative feature values, which HashingVectorizer
# produces by default, so this call raises an error and is left disabled.
# The Multinomial/Hashing row therefore keeps its placeholder accuracy of 0.
#df.iat[3, 2]  = generatePredictions(modelMultinomial, "hashing")
df.iat[4, 2]  = generatePredictions(modelMultinomial, "count")
df.iat[5, 2]  = generatePredictions(modelMultinomial, "tfid")

df = df.reset_index(drop=True)

max_value = df["Accuracy"].max()

for index, row in df.iterrows():
    # Flag every row that reaches the best accuracy.
    if row['Accuracy'] == max_value:
        df.iat[index, 3] = "[BEST]"

print(F"\n{df}")

Predictions for the two models Bernoulli / Multinomial with the three vectorizers.

         Model Vectorizer  Accuracy        
0    Bernoulli    Hashing  0.860987        
1    Bernoulli      Count  0.973991        
2    Bernoulli       TFID  0.974888        
3  Multinomial    Hashing  0.000000        
4  Multinomial      Count  0.981166  [BEST]
5  Multinomial       TFID  0.970404        
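Each score in the table comes from a different random split, and the Multinomial/Hashing cell is empty, so the ranking should be read with caution. One possible way to make the comparison fairer is sketched below: every combination is scored with the same cross-validation procedure, and `HashingVectorizer` is created with `alternate_sign=False` so that its features stay non-negative and `MultinomialNB` accepts them. The toy corpus, `n_features=2**10` and `cv=2` are assumptions of this sketch; with the real data one would pass `db.text` and `db.label` and use more folds.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, not taken from SMSSpamCollection.
toy_texts = [
    "win a free prize now",
    "free cash call now",
    "claim your free reward today",
    "urgent call now to win cash",
    "are you coming to class",
    "see you at lunch",
    "can you call me later",
    "lunch at noon today",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

# Factories, so that every combination gets fresh, unfitted objects.
vectorizers = {
    # alternate_sign=False keeps hashed features non-negative.
    "Hashing": lambda: HashingVectorizer(alternate_sign=False, n_features=2**10),
    "Count": CountVectorizer,
    "Tfidf": TfidfVectorizer,
}
models = {"Bernoulli": BernoulliNB, "Multinomial": MultinomialNB}

results = {}
for model_name, make_model in models.items():
    for vectorizer_name, make_vectorizer in vectorizers.items():
        # The pipeline refits the vectorizer inside each fold, so the test
        # fold never influences the vocabulary or the IDF weights.
        pipeline = make_pipeline(make_vectorizer(), make_model())
        scores = cross_val_score(pipeline, toy_texts, toy_labels,
                                 cv=2, scoring="accuracy")
        results[(model_name, vectorizer_name)] = scores.mean()

for (model_name, vectorizer_name), mean_accuracy in results.items():
    print(f"{model_name:12s} {vectorizer_name:8s} {mean_accuracy:.3f}")
```

The scores on such a tiny corpus are not meaningful in themselves. The point is the procedure: identical folds for every combination and no disabled cell.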
