# Machine Learning - Exercise 3
# SMS SPAM classification

*   Download the dataset available at this [link](https://drive.google.com/a/diag.uniroma1.it/file/d/17YZemn1MidhFA0-wenfVolZAwclLRUXM/view)
*   Copy the dataset into a folder of your personal Drive
*   Mount your Google Drive (more details will follow)
*   Set the correct path for loading the dataset (more details will follow)





## Import needed libraries

In [1]:
import numpy as np
import pandas as pd
import random

from sklearn.model_selection import train_test_split
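# Import only the names used below instead of wildcard imports
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB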
from sklearn.metrics import confusion_matrix, classification_report

print('Libraries imported.')

Libraries imported.


## Load data



To load the file, set `filename` to the path of the dataset in your Drive. Once mounted, the Drive behaves like a Linux file system, so you can inspect folders with commands such as `ls` or `cd`, prefixed with `%`. The cell below reads a copy of the dataset placed in Colab's `sample_data` folder.
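As a minimal sketch of the mounting step, the snippet below mounts the Drive when it runs inside Colab and otherwise falls back to the `sample_data` path used in the loading cell. The folder name `ml_exercises` is a hypothetical placeholder, not something given in the exercise; replace it with the folder where you copied the dataset.

```python
import os

try:
    # google.colab is only importable inside a Colab runtime
    from google.colab import drive
    drive.mount('/content/drive')
    # 'ml_exercises' is a hypothetical folder name: use your own folder
    base_dir = '/content/drive/MyDrive/ml_exercises'
except ImportError:
    # Outside Colab: fall back to the path used by the loading cell
    base_dir = '/content/sample_data'

filename = os.path.join(base_dir, 'SMSSpamCollection')
print('Dataset path:', filename, '- exists:', os.path.exists(filename))
```

If the printed path does not exist, check the folder name with `%ls` before loading the file.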

In [12]:
# open the dataset in sample_data
filename = '/content/sample_data/SMSSpamCollection'
db = pd.read_csv(filename, sep='\t', header=None, names=['label', 'text'])
print('File in '+filename+' loaded: %d samples.' %(len(db.label)))

File in /content/sample_data/SMSSpamCollection loaded: 5572 samples.


Show a random sample

In [13]:
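# Use sample_id rather than id, which would shadow the Python built-in
sample_id = random.randrange(0, len(db.label))
print('ID: %d\nLabel: %s\nDescription: %s' %(sample_id, db.label[sample_id], db.text[sample_id]))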

ID: 1507
Label: spam
Description: Thanks for the Vote. Now sing along with the stars with Karaoke on your mobile. For a FREE link just reply with SING now.


## Choose vectorizer

Compute the feature matrix for all messages with one of the following vectorizers. More info:

*   [Hashing](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)
*   [Count](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
*   [Tfidf](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
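To make the difference between the three vectorizers concrete, here is a small sketch on a made-up three-message corpus (the messages are illustrative and not taken from the dataset). It shows raw counts, L2-normalized TF-IDF weights, and the fixed-width output of the hashing vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

# Made-up toy corpus, not part of the SMS dataset
toy_messages = ['free prize call now', 'call me later', 'free free prize']

# CountVectorizer: one column per vocabulary term, values are raw counts
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(toy_messages)
# vocabulary_ maps each term to its column index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(X_count.toarray())

# TfidfVectorizer: same columns, counts reweighted and each row L2-normalized
X_tfidf = TfidfVectorizer().fit_transform(toy_messages)
row_norms = np.sqrt(X_tfidf.multiply(X_tfidf).sum(axis=1))
print(row_norms.ravel())

# HashingVectorizer: no vocabulary, fixed number of columns (2**20 by default)
X_hash = HashingVectorizer().fit_transform(toy_messages)
print(X_hash.shape)
```

The hashing vectorizer keeps no vocabulary, which is why its matrix has the same width regardless of the corpus.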



In [56]:
vectorizer_type = "count" # "hashing", "count" or "tfidf"

if vectorizer_type == "hashing":
  # hashed features, signed by default (use with BernoulliNB)
  vectorizer = HashingVectorizer(stop_words='english')
elif vectorizer_type == "count":
  # raw term counts (natural input for MultinomialNB)
  vectorizer = CountVectorizer(stop_words='english')
elif vectorizer_type == "tfidf":
  # term counts reweighted by inverse document frequency
  vectorizer = TfidfVectorizer(stop_words='english')

X_all = vectorizer.fit_transform(db.text)
y_all = db.label

print(X_all.shape)
print(y_all.shape)

# HOMEWORK - evaluate all the vectorizers
# Build the feature matrix with each vectorizer
print('\n--- Homework')
print('Hashing Vectorizer')
vectorizer_1 = HashingVectorizer(stop_words='english')
X_all_1 = vectorizer_1.fit_transform(db.text)
y_all_1 = db.label
print(X_all_1.shape)
print(y_all_1.shape)
print('')

print('Count Vectorizer')
vectorizer_2 = CountVectorizer(stop_words='english')
X_all_2 = vectorizer_2.fit_transform(db.text)
y_all_2 = db.label
print(X_all_2.shape)
print(y_all_2.shape)
print('')

print('Tfidf Vectorizer')
vectorizer_3 = TfidfVectorizer(stop_words='english')
X_all_3 = vectorizer_3.fit_transform(db.text)
y_all_3 = db.label 
print(X_all_3.shape)
print(y_all_3.shape)


(5572, 8444)
(5572,)

--- Homework
Hashing Vectorizer
(5572, 1048576)
(5572,)

Count Vectorizer
(5572, 8444)
(5572,)

Tfidf Vectorizer
(5572, 8444)
(5572,)


## Split data

In [57]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
          test_size=0.2, random_state=16)

print("Train: %d - Test: %d" %(X_train.shape[0],X_test.shape[0]))

print('\n---Homework')
print('Split the hashing vector')
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_all_1, y_all_1, 
          test_size=0.2, random_state=16)

print("Train: %d - Test: %d\n" %(X_train_1.shape[0], X_test_1.shape[0]))

print('Split the count vector')
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_all_2, y_all_2, 
          test_size=0.2, random_state=16)

print("Train: %d - Test: %d\n" %(X_train_2.shape[0], X_test_2.shape[0]))

print('Split the tfidf vector')
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_all_3, y_all_3, 
          test_size=0.2, random_state=16)

print("Train: %d - Test: %d\n" %(X_train_3.shape[0], X_test_3.shape[0]))



Train: 4457 - Test: 1115

---Homework
Split the hashing vector
Train: 4457 - Test: 1115

Split the count vector
Train: 4457 - Test: 1115

Split the tfidf vector
Train: 4457 - Test: 1115



## Create and fit Model

In [59]:
model_type = "multinomial" # "bernoulli" or "multinomial"

if model_type == "bernoulli":
  model = BernoulliNB().fit(X_train, y_train)
  print('Bernoulli Model created')
elif model_type == "multinomial":
  model = MultinomialNB().fit(X_train, y_train)
  print('Multinomial Model created')

print('\n---Homework')
model_1_B = BernoulliNB().fit(X_train_1, y_train_1)
# MultinomialNB cannot be fitted on the default HashingVectorizer output
# because the hashed features contain negative values.

model_2_B = BernoulliNB().fit(X_train_2, y_train_2)
model_2_M = MultinomialNB().fit(X_train_2, y_train_2)

model_3_B = BernoulliNB().fit(X_train_3, y_train_3)
model_3_M = MultinomialNB().fit(X_train_3, y_train_3)
print('Models created!')

Multinomial Model created

---Homework
Models created!
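The negative values come from the `alternate_sign=True` default of `HashingVectorizer`. As a hedged sketch of one possible workaround (not part of the original homework), passing `alternate_sign=False` yields non-negative features that `MultinomialNB` accepts. The toy messages and the name `model_hash_M` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for db.text / db.label
toy_texts = ['win a free prize now', 'free cash reward claim today',
             'see you at lunch tomorrow', 'running late be home soon']
toy_labels = ['spam', 'spam', 'ham', 'ham']

# Default hashing: features may be negative, which MultinomialNB rejects
X_signed = HashingVectorizer(stop_words='english').fit_transform(toy_texts)
try:
    MultinomialNB().fit(X_signed, toy_labels)
except ValueError as err:
    print('MultinomialNB rejected the signed features:', err)

# alternate_sign=False keeps every hashed feature non-negative
X_unsigned = HashingVectorizer(stop_words='english',
                               alternate_sign=False).fit_transform(toy_texts)
model_hash_M = MultinomialNB().fit(X_unsigned, toy_labels)
print('Smallest unsigned feature value:', X_unsigned.min())
```

With this setting the hashing vectorizer can be paired with both models, at the cost of hash collisions adding up instead of partially cancelling.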


## Evaluation

In [62]:
y_pred = model.predict(X_test)
#print(confusion_matrix(y_test, y_pred))

# zero_division=1: report 1.0 (without a warning) when a metric is undefined
print(classification_report(y_test, y_pred, zero_division=1))


print('\n---Homework --------------------------------------')
y_pred_1_B = model_1_B.predict(X_test_1)

y_pred_2_B = model_2_B.predict(X_test_2)
y_pred_2_M = model_2_M.predict(X_test_2)

y_pred_3_B = model_3_B.predict(X_test_3)
y_pred_3_M = model_3_M.predict(X_test_3)

print('Classification for hashing vector - model Bernoulli:')
print(classification_report(y_test_1, y_pred_1_B, zero_division=1))
print('')

print('Classification for count vector - model Bernoulli:')
print(classification_report(y_test_2, y_pred_2_B, zero_division=1))
print('')
print('Classification for count vector - model Multinomial:')
print(classification_report(y_test_2, y_pred_2_M, zero_division=1))
print('')

print('Classification for tfidf vector - model Bernoulli:')
print(classification_report(y_test_3, y_pred_3_B, zero_division=1))
print('')
print('Classification for tfidf vector - model Multinomial:')
print(classification_report(y_test_3, y_pred_3_M, zero_division=1))
print('')


              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       971
        spam       0.94      0.93      0.93       144

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115


---Homework --------------------------------------
Classification for hashing vector - model Bernoulli:
              precision    recall  f1-score   support

         ham       0.87      1.00      0.93       971
        spam       1.00      0.00      0.00       144

    accuracy                           0.87      1115
   macro avg       0.94      0.50      0.47      1115
weighted avg       0.89      0.87      0.81      1115


Classification for count vector - model Bernoulli:
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       971
        spam       0.97      0.85      0.91       144

    accuracy             
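The spam row of the hashing/Bernoulli report (precision 1.00, recall 0.00) appears consistent with a model that never predicts `spam`: precision is then undefined and `zero_division=1` substitutes 1.0. The sketch below reproduces that effect on made-up labels, so the figures are illustrative only.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: the classifier never predicts 'spam'
y_true = ['ham', 'ham', 'ham', 'spam']
y_pred = ['ham', 'ham', 'ham', 'ham']

# Precision for 'spam' is 0/0 here, so zero_division decides the reported value
p_one = precision_score(y_true, y_pred, pos_label='spam', zero_division=1)
p_zero = precision_score(y_true, y_pred, pos_label='spam', zero_division=0)
# Recall is well defined: 0 of 1 spam messages were found
r_spam = recall_score(y_true, y_pred, pos_label='spam', zero_division=1)

print('precision (zero_division=1):', p_one)
print('precision (zero_division=0):', p_zero)
print('recall:', r_spam)
```

A precision of 1.00 next to a recall of 0.00 should therefore be read as "no spam predicted", not as a perfect precision.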

## Prediction

In [50]:
smsnew1 = np.array(['Hello, what is your name?'])
xnew1 = vectorizer.transform(smsnew1)
ynew1 = model.predict(xnew1)
print('%s %s' %(smsnew1,ynew1))

smsnew2 = np.array(['Your account is blocked! Do login now'])
xnew2 = vectorizer.transform(smsnew2)
ynew2 = model.predict(xnew2)
print('%s %s' %(smsnew2,ynew2))


['Hello, what is your name?'] ['ham']
['Your account is blocked! Do login now'] ['spam']


## Home Exercises

**[X] Question 1**

`Design and implement an evaluation procedure to assess and compare the performance of the three vectorizers and the two models proposed above.`


**Done.**
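The comparison above can also be written as a single loop. The following is a minimal sketch, not the original solution: it wraps each vectorizer/model pair in a pipeline so the vectorizer is fitted on the training split only, and scores each pair with macro-averaged F1. The toy messages stand in for `db.text` and `db.label`, and `alternate_sign=False` is an assumption made so that the hashing vectorizer also works with `MultinomialNB`.

```python
from itertools import product

from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data; in the notebook use list(db.text) and list(db.label)
texts = [
    'are we still meeting for lunch today', 'see you at the station tonight',
    'can you send me the lecture notes', 'happy birthday hope you have a great day',
    'running late will be home soon', 'thanks for dinner it was lovely',
    'did you watch the football match yesterday', 'remember to buy milk and bread',
    'win a free prize call now', 'urgent claim your cash reward today',
    'free entry text win to claim', 'congratulations you won a free holiday call now',
    'your account is blocked login now to claim', 'cheap loans approved reply now free',
    'free ringtone reply sing now', 'you have won cash prize claim now',
]
labels = ['ham'] * 8 + ['spam'] * 8

# Factories so every combination gets a fresh, unfitted estimator
vectorizers = {
    'hashing': lambda: HashingVectorizer(stop_words='english', alternate_sign=False),
    'count': lambda: CountVectorizer(stop_words='english'),
    'tfidf': lambda: TfidfVectorizer(stop_words='english'),
}
models = {'bernoulli': BernoulliNB, 'multinomial': MultinomialNB}

# Split the raw texts once so all combinations see the same train/test messages
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=16, stratify=labels)

scores = {}
for (v_name, make_vec), (m_name, make_model) in product(
        vectorizers.items(), models.items()):
    clf = make_pipeline(make_vec(), make_model())
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores[(v_name, m_name)] = f1_score(
        y_test, y_pred, average='macro', zero_division=0)

# Print the combinations from best to worst macro F1
for (v_name, m_name), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print('%-8s + %-11s macro F1 = %.2f' % (v_name, m_name, score))
```

Macro F1 is used here because the classes are imbalanced (971 ham against 144 spam in the test split above), so accuracy alone can hide a model that ignores the spam class.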