<a href="https://colab.research.google.com/github/EnFiore/ai-machine-learning-modelli-e-algoritmi/blob/main/3%20-%20Naive%20Bayes/naive_bayes_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Filtering with Naive Bayes
In this exercise you will use the full SMS spam dataset to build a spam classifier with a Naive Bayes algorithm.
#### Tasks:
- Download the [dataset from Kaggle](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset) (requires a free account); you can also do this through the Kaggle API.
- Process the dataset to obtain a dataframe with the same structure as the one seen in the hands-on lessons.
- Build and evaluate your models, optimizing the metrics you consider appropriate for the problem at hand.
- Once you have selected the final model, pick 3 spam emails and 3 non-spam emails from your inbox and try to classify them with the model. (N.B. it is fine if your model does not classify all of them correctly; remember that the dataset consists of SMS messages, not emails.)

## Import the modules

In [1]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, ComplementNB
from sklearn.metrics import log_loss, classification_report, confusion_matrix

## Define the constants

In [None]:
KAGGLE_USERNAME = "guizard"
KAGGLE_KEY = "<your-kaggle-api-key>"  # placeholder: never commit a real API key

RANDOM_SEED = 0

## Download the dataset

In [None]:
# Reuse the constants defined above instead of repeating the credentials
os.environ['KAGGLE_USERNAME'] = KAGGLE_USERNAME
os.environ['KAGGLE_KEY'] = KAGGLE_KEY

In [None]:
!kaggle datasets download -d uciml/sms-spam-collection-dataset

Downloading sms-spam-collection-dataset.zip to /content
  0% 0.00/211k [00:00<?, ?B/s]
100% 211k/211k [00:00<00:00, 79.1MB/s]


In [None]:
!unzip sms-spam-collection-dataset.zip

Archive:  sms-spam-collection-dataset.zip
  inflating: spam.csv                


## Data preprocessing

In [None]:
df = pd.read_csv("spam.csv", encoding = "ISO-8859-1")
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:
df = df[["v1", "v2"]]
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df = df.rename(columns={"v1":"SPAM", "v2":"MESSAGE"})
df.head()

Unnamed: 0,SPAM,MESSAGE
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Encode the labels as integers: spam -> 1, ham -> 0
classes_encoding = {"spam": 1, "ham": 0}
df["SPAM"] = df["SPAM"].map(classes_encoding)
df.head()

Unnamed: 0,SPAM,MESSAGE
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
X = df["MESSAGE"]
y = df["SPAM"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=RANDOM_SEED)

We should check whether the target classes are balanced.
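A minimal sketch of such a check, assuming a dataframe with the same `SPAM` and `MESSAGE` columns as above. The `toy_df` rows are hypothetical stand-ins, not the real dataset. The second half shows the `stratify` argument of `train_test_split`, which is one possible way to keep the class proportions equal in both splits; the notebook itself does not use it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the notebook's column names (6 ham, 2 spam).
toy_df = pd.DataFrame({
    "SPAM": [0, 0, 0, 0, 0, 0, 1, 1],
    "MESSAGE": [f"message {i}" for i in range(8)],
})

# Share of each class; values far from 0.5 indicate an imbalanced target.
class_share = toy_df["SPAM"].value_counts(normalize=True)
print(class_share)

# stratify=... preserves the class proportions in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["MESSAGE"], toy_df["SPAM"],
    test_size=0.5, random_state=0, stratify=toy_df["SPAM"],
)
print("spam share in train:", y_tr.mean(), "in test:", y_te.mean())
```

On the real data the same call would be `df["SPAM"].value_counts(normalize=True)`.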

## Bernoulli Naive Bayes

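We vectorize the text by counting how many times each word appears. With its default `binarize=0.0`, `BernoulliNB` then reduces these counts to presence or absence of each word.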

In [None]:
bow = CountVectorizer(stop_words="english", max_features=1000)
X_train_bow = bow.fit_transform(X_train.tolist())
X_test_bow = bow.transform(X_test)
X_train_bow.shape

(3900, 1000)
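To make the difference between counts and presence concrete, here is a small sketch on three hypothetical messages. It thresholds the count matrix by hand, which mirrors what `BernoulliNB` does internally with its default `binarize=0.0`; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, not taken from the dataset.
docs = ["free free free prize", "call me later", "free call"]

counter = CountVectorizer().fit(docs)
vocabulary = sorted(counter.vocabulary_)        # column order of the matrix
count_matrix = counter.transform(docs).toarray()

# Any count above 0 becomes 1: the word is present, however often it occurs.
presence_matrix = (count_matrix > 0).astype(int)

print(vocabulary)
print(count_matrix[0])     # "free" counted three times
print(presence_matrix[0])  # "free" simply marked as present
```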

Train the model.

In [None]:
bnb = BernoulliNB()
bnb.fit(X_train_bow, y_train)
# classification_report expects (y_true, y_pred), in that order
report = classification_report(y_test, bnb.predict(X_test_bow), digits=3)
print(report)

              precision    recall  f1-score   support

           0      0.981     0.999     0.990      1434
           1      0.991     0.887     0.936       238

    accuracy                          0.983      1672
   macro avg      0.986     0.943     0.963      1672
weighted avg      0.983     0.983     0.982      1672



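The test set yields a high accuracy (0.983). Because only 238 of the 1672 test messages are spam, accuracy alone can hide errors on the spam class, so the confusion matrix is worth checking as well.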

In [None]:
confusion_matrix(y_test, bnb.predict(X_test_bow))

array([[1432,    2],
       [  27,  211]])
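In scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted ones, so the matrix above says that 2 ham messages were flagged as spam and 27 spam messages were missed. The metric functions all take `(y_true, y_pred)` in that order. The sketch below, on hypothetical labels, shows how the four cells are read and that swapping the two arguments exchanges precision and recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels (1 = spam, 0 = ham), for illustration only.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]

# Rows = true class, columns = predicted class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("ham kept:", tn, "ham flagged:", fp, "spam missed:", fn, "spam caught:", tp)

precision = precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = recall_score(y_true, y_pred)         # tp / (tp + fn)

# With the arguments swapped, "precision" is really the recall.
swapped_precision = precision_score(y_pred, y_true)
print(precision, recall, swapped_precision)
```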

## Multinomial Naive Bayes

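Multinomial Naive Bayes models how strongly each word is represented in a message rather than only whether it occurs, so here the text is vectorized with TF-IDF weights (`TfidfVectorizer`) instead of simple occurrences.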

In [None]:
tfidf = TfidfVectorizer(stop_words="english", max_features=1000)
X_train_tfidf = tfidf.fit_transform(X_train.tolist())
X_test_tfidf = tfidf.transform(X_test)
X_train_tfidf.shape

(3900, 1000)
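As a rough illustration of what TF-IDF does, the sketch below fits a `TfidfVectorizer` with default settings on three hypothetical messages. A word that appears in every message ("now") ends up with a lower weight than a word that appears in only one ("win"), and each row is scaled to unit length. The weights are never negative, which is what `MultinomialNB` requires of its input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages; "now" occurs in all of them, "win" in just one.
docs = ["win cash now", "call me now", "see you now"]

tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(docs).toarray()
vocab = tfidf_demo.vocabulary_      # maps each word to its column index

first_row = weights[0]
win_weight = first_row[vocab["win"]]
now_weight = first_row[vocab["now"]]
print(f"win={win_weight:.3f} now={now_weight:.3f}")

# Each row is L2-normalized by default.
row_norm = (first_row ** 2).sum() ** 0.5
```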

In [None]:
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)
# classification_report expects (y_true, y_pred), in that order
report = classification_report(y_test, mnb.predict(X_test_tfidf), digits=3)
print(report)

              precision    recall  f1-score   support

           0      0.977     0.996     0.986      1434
           1      0.971     0.857     0.911       238

    accuracy                          0.976      1672
   macro avg      0.974     0.926     0.948      1672
weighted avg      0.976     0.976     0.975      1672



In [None]:
confusion_matrix(y_test, mnb.predict(X_test_tfidf))

array([[1428,    6],
       [  34,  204]])

### Complement Naive Bayes

In [None]:
cnb = ComplementNB()
cnb.fit(X_train_tfidf, y_train)
# classification_report expects (y_true, y_pred), in that order
report = classification_report(y_test, cnb.predict(X_test_tfidf), digits=3)
print(report)

              precision    recall  f1-score   support

           0      0.996     0.938     0.966      1434
           1      0.724     0.979     0.832       238

    accuracy                          0.944      1672
   macro avg      0.860     0.958     0.899      1672
weighted avg      0.957     0.944     0.947      1672



In [None]:
confusion_matrix(y_test, cnb.predict(X_test_tfidf))

array([[1345,   89],
       [   5,  233]])
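The three confusion matrices can be compared side by side to choose a final model. The sketch below copies them from the cells above and computes precision, recall and an F-beta score for the spam class. Using `beta=0.5`, which weights precision more than recall, is an assumption made here on the grounds that flagging a legitimate message as spam is usually the costlier mistake; a different trade-off would call for a different `beta`. The helper `spam_scores` is hypothetical.

```python
import numpy as np

# Confusion matrices from the cells above
# (rows = true class, columns = predicted class, class 1 = spam).
matrices = {
    "BernoulliNB": np.array([[1432, 2], [27, 211]]),
    "MultinomialNB": np.array([[1428, 6], [34, 204]]),
    "ComplementNB": np.array([[1345, 89], [5, 233]]),
}

def spam_scores(cm, beta=0.5):
    """Return precision, recall and F-beta for the spam class."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return {"precision": precision, "recall": recall,
            "fbeta": fbeta, "false_alarms": int(fp)}

scores = {name: spam_scores(cm) for name, cm in matrices.items()}
for name, s in scores.items():
    print(f"{name:14s} precision={s['precision']:.3f} recall={s['recall']:.3f} "
          f"F0.5={s['fbeta']:.3f} ham flagged as spam={s['false_alarms']}")

best_by_fbeta = max(scores, key=lambda name: scores[name]["fbeta"])
best_by_recall = max(scores, key=lambda name: scores[name]["recall"])
```

Under this assumption the Bernoulli model comes out ahead, while the Complement model catches the most spam at the price of 89 false alarms.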

## Let's try the model on spam emails

In [None]:
email = """
Unique bot for lazy traders!
Now it`s easier to make money. The newbot allows you to make from 900
dollars with minimal investment. Doubt it?
A unique opportunity to test the capabilities of our bot on a PRO
tariff for a month!
The offer expires in a week.
"""

email_bow = bow.transform([email])
bnb.predict(email_bow)

array([1])

In [None]:
email = """
Hi Giuseppe,
If you’re looking for a new way to build customer relationships, grow your email list or get consistent bookings for your business, give lead ads a try.
Lead ads allow you to collect contact information, book appointments and qualify potential customers directly on Facebook and Instagram through an Instant Form.
"""

email_bow = bow.transform([email])
bnb.predict(email_bow)

array([1])

In [None]:
email = """
We noticed a new sign-in to your Google Account on a Windows device. If this was you, you don’t need to do anything. If not, we’ll help you secure your account.
"""

email_bow = bow.transform([email])
bnb.predict(email_bow)

array([0])
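Each test above repeats the same two steps: transform the text with the fitted vectorizer, then call `predict`. One possible way to bundle them is a scikit-learn `Pipeline`, sketched here on a hypothetical toy training set (the notebook itself keeps the vectorizer and the model separate, and also passes `stop_words="english"` and `max_features=1000`, which are omitted here).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (1 = spam, 0 = ham), for illustration only.
train_texts = [
    "win free cash prize", "free prize claim today", "win cash today",
    "lunch at noon tomorrow", "coming home late tonight", "meeting moved to noon",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and classifies it in a single call.
spam_filter = make_pipeline(CountVectorizer(), BernoulliNB())
spam_filter.fit(train_texts, train_labels)

new_messages = ["claim your free cash prize", "lunch at home tomorrow"]
pred = spam_filter.predict(new_messages)
proba = spam_filter.predict_proba(new_messages)  # columns: P(ham), P(spam)
print(pred)
print(proba.round(3))
```

`predict_proba` is useful when an email is borderline, since it shows how confident the model is rather than only the final label.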