<a href="https://colab.research.google.com/github/AlessandroTenani02/work-dump/blob/main/sup_learning_su_ham_spam_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised learning models

This notebook compares the performance of some of the main supervised learning models. <br>The scikit-learn (`sklearn`) implementations are used.
<br>The dataset used is the [spam detection dataset](https://raw.githubusercontent.com/niccosala-st/Text-Processing/main/spam_text_messages.csv).

## Preliminary steps, dataset cleaning and definition of the encoding models

Import the dataset and clean the data. <br> The 'Cleaned' and 'Labels' columns are added to the dataset.



In [2]:
!pip install cleantext
import math
import numpy as np
import cleantext
import torch
import torch.nn as nn
import torchtext
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from tqdm import tqdm
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cluster

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

url = 'https://raw.githubusercontent.com/niccosala-st/Text-Processing/main/spam_text_messages.csv'
df = pd.read_csv(url)

Collecting cleantext
  Downloading cleantext-1.1.4-py3-none-any.whl (4.9 kB)
Installing collected packages: cleantext
Successfully installed cleantext-1.1.4


In [3]:
def clean_text(text_list):
    messages = []

    for message in text_list:
        corpus = cleantext.clean_words(message,
                              clean_all= False,    # Execute all cleaning operations
                              extra_spaces=True,   # Remove extra white spaces
                              stemming=True,       # Stem the words
                              stopwords=True,      # Remove stop words
                              lowercase=True,      # Convert to lowercase
                              numbers=False,       # Remove all digits
                              punct=True,          # Remove all punctuations
                              stp_lang='english'   # Language
        )
        cleaned_message = ' '.join(corpus)
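        # Replace e-mail addresses, digit runs and URLs with placeholder tokens
        cleaned_message = re.sub(r"([\w.\-]+@[\w.\-]+)", "<EMAIL>", cleaned_message)
        cleaned_message = re.sub(r"(\d+)", "<NUMBER>", cleaned_message)
        cleaned_message = re.sub(r"(http\S+)", "<URL>", cleaned_message)
        # Collapse whitespace runs into one space (an empty replacement would glue words together)
        cleaned_message = re.sub(r"\s{2,}", " ", cleaned_message)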

        messages.append(cleaned_message)


    return messages

In [4]:
df['Cleaned'] = clean_text(df['Message'])
df.head(20)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Category,Message,Cleaned
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entri <NUMBER> wkli comp win fa cup final...
3,ham,U dun say so early hor... U c already then say...,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think goe usf live around though
5,spam,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey darl <NUMBER> week word back id li...
6,ham,Even my brother is not like to speak with me. ...,even brother like speak treat like aid patent
7,ham,As per your request 'Melle Melle (Oru Minnamin...,per request mell mell oru minnaminungint nurun...
8,spam,WINNER!! As a valued network customer you have...,winner valu network custom select receivea £<N...
9,spam,Had your mobile 11 months or more? U R entitle...,mobil <NUMBER> month u r entitl updat latest c...
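
The `<NUMBER>` tokens visible in the 'Cleaned' column come from the regular-expression substitutions in `clean_text`. The following is a minimal sketch of that placeholder step in isolation, applied to raw text and without `cleantext`; the function name `replace_placeholders` and the sample sentence are hypothetical, and the URL pattern is applied before the digit pattern here so that digits inside a URL are not rewritten first.

```python
import re


def replace_placeholders(text: str) -> str:
    """Replace e-mail addresses, URLs and digit runs with placeholder tokens."""
    text = re.sub(r"[\w.\-]+@[\w.\-]+", "<EMAIL>", text)  # e-mail addresses
    text = re.sub(r"http\S+", "<URL>", text)              # anything starting with "http"
    text = re.sub(r"\d+", "<NUMBER>", text)               # runs of digits
    text = re.sub(r"\s{2,}", " ", text)                   # collapse repeated whitespace
    return text.strip()


sample = "Win 1000 now  at http://example.com or mail a.b@example.org"
print(replace_placeholders(sample))
```

Mapping every number, address and link to a single token keeps the vocabulary small, since the classifier only needs to know that a number or a link was present.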


In [5]:
label_mapping = {'ham': 0, 'spam': 1}
df['Labels'] = df['Category'].map(label_mapping)

Split the dataset into train and test sets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df['Cleaned'], df['Labels'], test_size=0.2, random_state=100)

Define the *evaluation()* function.

In [7]:
def evaluation(predicted_labels, true_labels):
  accuracy = accuracy_score(true_labels, predicted_labels)
  precision = precision_score(true_labels, predicted_labels, average='weighted')
  recall = recall_score(true_labels, predicted_labels, average='weighted')
  f1 = f1_score(true_labels, predicted_labels, average='weighted')

  print("Accuracy:", accuracy)
  print("Precision:", precision)
  print("Recall:", recall)
  print("F1 Score:", f1)
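
With `average='weighted'`, each per-class score is weighted by the number of true samples of that class. One consequence is that weighted recall is always equal to accuracy, which is why the two values coincide in every result reported in this notebook. The sketch below illustrates this on a hypothetical toy label set (not taken from the dataset) and contrasts it with the recall on the spam class alone.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels: 0 = ham, 1 = spam (one spam message is missed)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average='weighted')
weighted_precision = precision_score(y_true, y_pred, average='weighted')
# Recall restricted to the positive (spam) class
spam_recall = recall_score(y_true, y_pred, average='binary', pos_label=1)

print("Accuracy:          ", acc)
print("Weighted recall:   ", weighted_recall)
print("Weighted precision:", weighted_precision)
print("Spam-only recall:  ", spam_recall)
```

On an imbalanced dataset such as this one, the spam-only scores may be a useful complement, since the weighted averages are dominated by the majority class.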

Define the TF-IDF vectorizer, which converts the dataset into a matrix whose features are the words.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# process the train set (the vocabulary is learned here)
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
input_tfidf_train = tfidf_vectorizer.fit_transform(X_train)
print(input_tfidf_train.shape)

# process the test set (reusing the vocabulary learned on the train set)
input_tfidf_test = tfidf_vectorizer.transform(X_test)
print(input_tfidf_test.shape)

(4457, 6190)
  (0, 5511)	0.4530820826162595
  (0, 5144)	0.475205450149337
  (0, 5012)	0.25412815947077416
  (0, 3635)	0.475205450149337
  (0, 3250)	0.3256571660032935
  (0, 2484)	0.4152619360157454
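
The vectorizer is fitted on the train set only and then merely applied to the test set, so both matrices share the same columns and words unseen during training are ignored. A minimal sketch on a hypothetical toy corpus (the documents below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, already cleaned and stemmed
train_docs = ["free entri win cup", "ok lar joke", "win free prize"]
test_docs = ["free joke unknownword"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vectorizer.vocabulary_))  # terms seen in the train set
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.nnz)  # non-zero entries in the test row
```

Fitting on the train set alone avoids leaking information about the test messages into the features.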


Load the *Hugging Face* model used to obtain the embeddings from the dataset.

In [None]:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
embeddings_train = model.encode(X_train.tolist())
embeddings_test = model.encode(X_test.tolist())

---

## 1. SVM

### **a. Linear**

In [72]:
from sklearn.svm import SVC
svm = SVC(kernel='linear')

### 1.a.1 TF-IDF Vectorizer

In [73]:
svm.fit(input_tfidf_train, y_train)
y_pred = svm.predict(input_tfidf_test)

In [74]:
evaluation(y_pred, y_test)

Accuracy: 0.9811659192825112
Precision: 0.9809731960170872
Recall: 0.9811659192825112
F1 Score: 0.9810272529271684


### 1.a.2 *Hugging Face* Embeddings

In [75]:
svm.fit(embeddings_train, y_train)
y_pred = svm.predict(embeddings_test)

In [76]:
evaluation(y_pred, y_test)

Accuracy: 0.9766816143497757
Precision: 0.976440158198864
Recall: 0.9766816143497757
F1 Score: 0.9761061488234982


### **b. RBF**

In [77]:
from sklearn.svm import SVC
svm = SVC(kernel='rbf')

### 1.b.1 TF-IDF Vectorizer

In [78]:
svm.fit(input_tfidf_train, y_train)
y_pred = svm.predict(input_tfidf_test)

In [79]:
evaluation(y_pred, y_test)

Accuracy: 0.9766816143497757
Precision: 0.9763229663443185
Recall: 0.9766816143497757
F1 Score: 0.9763309930233398


### 1.b.2 *Hugging Face* Embeddings

In [80]:
svm.fit(embeddings_train, y_train)
y_pred = svm.predict(embeddings_test)

In [81]:
evaluation(y_pred, y_test)

Accuracy: 0.9838565022421525
Precision: 0.9838722686910532
Recall: 0.9838565022421525
F1 Score: 0.9835108658157986


### **c. Polynomial**

In [82]:
from sklearn.svm import SVC
svm = SVC(kernel='poly', degree=3)
# degree=3: degree 1 would reduce to a (scaled) linear kernel, already covered in section a
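
A degree of 1 is not worth testing because a degree-1 polynomial kernel with zero offset (the `SVC` default is `coef0=0.0`) is just the linear kernel multiplied by `gamma`. The sketch below checks this numerically on random data; the matrix `X` and the value of `gamma` are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # arbitrary sample matrix
gamma = 0.5                  # arbitrary scaling factor

# Polynomial kernel: (gamma * <x, y> + coef0) ** degree
poly_degree_1 = polynomial_kernel(X, degree=1, gamma=gamma, coef0=0.0)
scaled_linear = gamma * linear_kernel(X)

print(np.allclose(poly_degree_1, scaled_linear))
```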

### 1.c.1 TF-IDF Vectorizer

In [83]:
svm.fit(input_tfidf_train, y_train)
y_pred = svm.predict(input_tfidf_test)

In [84]:
evaluation(y_pred, y_test)

Accuracy: 0.9641255605381166
Precision: 0.9639193299788033
Recall: 0.9641255605381166
F1 Score: 0.9623609515037331


### 1.c.2 *Hugging Face* Embeddings

In [85]:
svm.fit(embeddings_train, y_train)
y_pred = svm.predict(embeddings_test)

In [86]:
evaluation(y_pred, y_test)

Accuracy: 0.9838565022421525
Precision: 0.9839948747620156
Recall: 0.9838565022421525
F1 Score: 0.9834581030316525


### **d. Sigmoid**

In [87]:
from sklearn.svm import SVC
svm = SVC(kernel='sigmoid')

### 1.d.1 TF-IDF Vectorizer

In [88]:
svm.fit(input_tfidf_train, y_train)
y_pred = svm.predict(input_tfidf_test)

In [89]:
evaluation(y_pred, y_test)

Accuracy: 0.979372197309417
Precision: 0.9791525031727287
Recall: 0.979372197309417
F1 Score: 0.9792203246345177


### 1.d.2 *Hugging Face* Embeddings

In [90]:
svm.fit(embeddings_train, y_train)
y_pred = svm.predict(embeddings_test)

In [91]:
evaluation(y_pred, y_test)

Accuracy: 0.9730941704035875
Precision: 0.9726073030108904
Recall: 0.9730941704035875
F1 Score: 0.9726045781091706


## 2. Decision Tree Classifier

In [92]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()

### 2.1 TF-IDF Vectorizer

In [93]:
dtc.fit(input_tfidf_train, y_train)
y_pred = dtc.predict(input_tfidf_test)

In [94]:
evaluation(y_pred, y_test)

Accuracy: 0.9587443946188341
Precision: 0.9601753794169909
Recall: 0.9587443946188341
F1 Score: 0.9593158389806852


### 2.2 *Hugging Face* Embeddings

In [95]:
dtc.fit(embeddings_train, y_train)
y_pred = dtc.predict(embeddings_test)

In [96]:
evaluation(y_pred, y_test)

Accuracy: 0.9174887892376682
Precision: 0.9246244161413846
Recall: 0.9174887892376682
F1 Score: 0.9202792125466626


## 3. Logistic Regression

In [97]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()

### 3.1 TF-IDF Vectorizer

In [98]:
log_reg.fit(input_tfidf_train, y_train)
y_pred = log_reg.predict(input_tfidf_test)

In [99]:
evaluation(y_pred, y_test)

Accuracy: 0.9614349775784753
Precision: 0.9604001263525085
Recall: 0.9614349775784753
F1 Score: 0.9601586225264362


### 3.2 *Hugging Face* Embeddings

In [100]:
log_reg.fit(embeddings_train, y_train)
y_pred = log_reg.predict(embeddings_test)

In [101]:
evaluation(y_pred, y_test)

Accuracy: 0.9748878923766816
Precision: 0.9746387478722345
Recall: 0.9748878923766816
F1 Score: 0.9741846840761177


## 4. KNN

### a. `K = 5`

In [102]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()  # n_neighbors defaults to 5

### 4.a.1 TF-IDF Vectorizer

In [103]:
knn.fit(input_tfidf_train, y_train)
y_pred = knn.predict(input_tfidf_test)

In [104]:
evaluation(y_pred, y_test)

Accuracy: 0.9237668161434978
Precision: 0.9261900514892635
Recall: 0.9237668161434978
F1 Score: 0.9112555820914525


### 4.a.2 *Hugging Face* Embeddings

In [105]:
knn.fit(embeddings_train, y_train)
y_pred = knn.predict(embeddings_test)

In [106]:
evaluation(y_pred, y_test)

Accuracy: 0.9739910313901345
Precision: 0.9744366863052607
Recall: 0.9739910313901345
F1 Score: 0.9741748247561244


### b. `K = 1`

In [107]:
from sklearn.neighbors import KNeighborsClassifier
onenn = KNeighborsClassifier(n_neighbors=1)

### 4.b.1 TF-IDF Vectorizer

In [108]:
onenn.fit(input_tfidf_train, y_train)
y_pred = onenn.predict(input_tfidf_test)

In [109]:
evaluation(y_pred, y_test)

Accuracy: 0.9623318385650225
Precision: 0.9638982373573681
Recall: 0.9623318385650225
F1 Score: 0.9595927398695062


### 4.b.2 *Hugging Face* Embeddings

In [110]:
onenn.fit(embeddings_train, y_train)
y_pred = onenn.predict(embeddings_test)

In [111]:
evaluation(y_pred, y_test)

Accuracy: 0.9748878923766816
Precision: 0.9758890092937687
Recall: 0.9748878923766816
F1 Score: 0.9752357280751998


## 5. Naive Bayesian Classifier

In [112]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

### 5.1 TF-IDF Vectorizer

In [113]:
nb.fit(input_tfidf_train.toarray(), y_train)
y_pred = nb.predict(input_tfidf_test.toarray())

In [114]:
evaluation(y_pred, y_test)

Accuracy: 0.852017937219731
Precision: 0.9081104890280469
Recall: 0.852017937219731
F1 Score: 0.8688228302292337
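
The `.toarray()` calls above are needed because `GaussianNB` expects dense input, while the TF-IDF vectorizer returns a SciPy sparse matrix. A minimal sketch of this behaviour on hypothetical toy data (the arrays below are invented for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy features: two well-separated classes
X_dense = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]])
y = np.array([0, 0, 1, 1])
X_sparse = csr_matrix(X_dense)

nb = GaussianNB()
try:
    nb.fit(X_sparse, y)          # sparse input is rejected
    sparse_accepted = True
except (TypeError, ValueError):
    sparse_accepted = False

nb.fit(X_sparse.toarray(), y)    # dense copy works
pred = nb.predict(np.array([[0.1, 0.95]]))
print(sparse_accepted, pred)
```

Converting to dense is feasible here because the matrix is small (4457 x 6190); for much larger vocabularies the dense copy could become a memory concern.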


### 5.2 *Hugging Face* Embeddings

In [115]:
nb.fit(embeddings_train, y_train)
y_pred = nb.predict(embeddings_test)

In [116]:
evaluation(y_pred, y_test)

Accuracy: 0.9291479820627803
Precision: 0.9448607019651665
Recall: 0.9291479820627803
F1 Score: 0.933695029243523


## 6. AdaBoost

In [117]:
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier()

### 6.1 TF-IDF Vectorizer

In [118]:
adaboost.fit(input_tfidf_train, y_train)
y_pred = adaboost.predict(input_tfidf_test)

In [119]:
evaluation(y_pred, y_test)

Accuracy: 0.9551569506726457
Precision: 0.9546796004195107
Recall: 0.9551569506726457
F1 Score: 0.9548939182112518


### 6.2 *Hugging Face* Embeddings

In [120]:
adaboost.fit(embeddings_train, y_train)
y_pred = adaboost.predict(embeddings_test)

In [121]:
evaluation(y_pred, y_test)

Accuracy: 0.9533632286995516
Precision: 0.9519914515083157
Recall: 0.9533632286995516
F1 Score: 0.9523647234678626


## 7. Random Forest

In [122]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()

### 7.1 TF-IDF Vectorizer

In [123]:
random_forest.fit(input_tfidf_train, y_train)
y_pred = random_forest.predict(input_tfidf_test)

In [124]:
evaluation(y_pred, y_test)

Accuracy: 0.97847533632287
Precision: 0.9789960943150585
Recall: 0.97847533632287
F1 Score: 0.9776504682590524


### 7.2 *Hugging Face* Embeddings

In [125]:
random_forest.fit(embeddings_train, y_train)
y_pred = random_forest.predict(embeddings_test)

In [126]:
evaluation(y_pred, y_test)

Accuracy: 0.9605381165919282
Precision: 0.9617830661989966
Recall: 0.9605381165919282
F1 Score: 0.957668584625197


## Considerations

All the tested models score well on every evaluation metric, with accuracy ranging from about 0.85 to 0.98; **Support Vector Machine**, **Random Forest** and **K-Nearest Neighbors** stand out. <br><br>
Note that, for SVM, the models trained on the *Hugging Face* embeddings work better with the more complex RBF and polynomial kernels (accuracy of about 0.984 in both cases), whereas TF-IDF is slightly ahead with the linear and sigmoid kernels.
<br>
For KNN, performance improves when the number of neighbors is reduced from 5 to 1, most visibly with TF-IDF (accuracy from 0.924 to 0.962). Here too, the models trained on the embeddings perform better than those trained on the output of the TF-IDF vectorizer.
<br><br>
The worst performance comes from **Naive Bayes** (accuracy 0.852 with TF-IDF and 0.929 with the embeddings), which still achieves acceptable results in the benchmarks.