<figure style="margin-left: 20px; margin-right: 20px;">
  <img src="../../figures/logo-esi-sba.png" width="256" height="256" align="right" alt="Logo">
</figure>

# Email spam classification using semi-supervised learning techniques

*Directed by* 
- Fellah Abdnour (ab.fellah@esi-sba.dz) 
- Benyamina Yacine Lazreg (yl.benyamina@esi-sba.dz) 
- Mokadem Adel Abdelkader (aa.mokadem@esi-sba.dz) 
- Benounene Abdelrahmane (a.benounene@esi-sba.dz) 

# Notebook 3: Label Propagation & Label Spreading
This notebook explores Label Propagation and Label Spreading, two graph-based semi-supervised algorithms. Both build a similarity graph over all samples and propagate the known labels along its edges to infer labels for the unlabeled samples, which makes them useful when labeled data is scarce.
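
To make the propagation idea concrete, here is a minimal NumPy sketch of the hard-clamping variant on a toy chain graph. The function name `propagate_labels`, the affinity matrix `W`, and the toy labels are illustrative assumptions, not part of this project, and the sketch assumes every node has at least one edge. scikit-learn's implementations differ in detail (Label Spreading, for instance, uses a normalized graph Laplacian and soft clamping).

```python
import numpy as np

def propagate_labels(W, y, n_classes, max_iter=1000, tol=1e-6):
    """Spread labels over a graph with affinity matrix W (hypothetical helper).

    y uses -1 for unlabeled nodes, matching scikit-learn's convention.
    """
    y = np.asarray(y)
    labeled = y != -1
    # Row-normalise the affinities into transition probabilities.
    T = W / W.sum(axis=1, keepdims=True)
    # One-hot label distributions; unlabeled rows start at zero.
    F = np.zeros((len(y), n_classes))
    F[labeled, y[labeled]] = 1.0
    clamp = F[labeled].copy()
    for _ in range(max_iter):
        F_next = T @ F
        F_next[labeled] = clamp  # hard clamping: known labels never change
        converged = np.abs(F_next - F).sum() < tol
        F = F_next
        if converged:
            break
    return F.argmax(axis=1)

# Chain graph 0 - 1 - 2 - 3 - 4 - 5, labeled only at the two ends.
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
y = [0, -1, -1, -1, -1, 1]

pred = propagate_labels(W, y, n_classes=2)
print(pred)
```

Each unlabeled node ends up with the label of the closer labeled end of the chain.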

## Outline
- [Necessary packages](#necessary_packages)
- [Data Loading](#data_loading)
- [Label Spreading Models](#ls_models)
- [Label Propagation Models](#lp_models)
- [Evaluation](#eva)
- [Conclusions](#con)
- [Save the best model](#save_model)

<div id="necessary_packages" >
    <h3>Necessary packages</h3>
</div>

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_validate, KFold, train_test_split
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score,
                             roc_auc_score, confusion_matrix, make_scorer, roc_curve,
                             brier_score_loss, classification_report)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, LabelEncoder
from sklearn.semi_supervised import LabelSpreading, LabelPropagation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from joblib import load,dump


import sys
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

<div id="data_loading" >
    <h3>Data Loading</h3>
</div>

In [2]:
df = pd.read_csv(os.path.join("..", "..", "data", "clean_df.csv"))  # portable across operating systems
df.head()

In [33]:
path = os.path.join("..","..","models","ssl")
os.listdir(path)

['dtree.joblib',
 'knn.joblib',
 'lr.joblib',
 'nb.joblib',
 'sgd.joblib',
 'svm.joblib',
 'xgboost.joblib']

In [34]:
def load_models(path):

    models = {}
    
    files = os.listdir(path)
    files = list(filter(lambda x:x.endswith("joblib"), files))

    for file in files:
        key = file.split(".")[0]
        models[key] = load(filename=os.path.join(path,file))

    return models

In [16]:
X = df["content"]
y = df['class']

In [17]:
#tfidf = TfidfVectorizer()
#countvec = CountVectorizer()
#X = tfidf.fit_transform(df['content']).toarray()
#X.sum()

In [18]:
X_labeled = X[y != -1]
y_labeled = y[y != -1]

In [19]:
X_unlabeled = X[y == -1]
y_unlabeled = y[y == -1]

In [20]:
# Split only the labeled rows; unlabeled rows (class == -1) all stay in the training set.
train_idx,test_idx = train_test_split(X_labeled.index,test_size=0.3)

In [21]:
X_train = X.loc[~df.index.isin(test_idx)]
y_train = y.loc[~df.index.isin(test_idx)]
X_test = X.loc[df.index.isin(test_idx)]
y_test = y.loc[df.index.isin(test_idx)]
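
The split above holds out 30% of the labeled rows for testing and keeps everything else, including all unlabeled rows, for training. The following sketch reproduces that logic on a toy frame; the data, the `toy` name, and `random_state=0` are assumptions added for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for clean_df.csv: -1 marks an unlabeled e-mail.
toy = pd.DataFrame({
    "content": [f"mail {i}" for i in range(10)],
    "class": [0, 1, -1, -1, 0, 1, -1, 0, 1, -1],
})

# Split only the indices of the labeled rows.
labeled_idx = toy.index[toy["class"] != -1]
train_idx, test_idx = train_test_split(labeled_idx, test_size=0.3, random_state=0)

# Everything that is not in the test split goes to training.
in_test = toy.index.isin(test_idx)
train, test = toy[~in_test], toy[in_test]

print(len(train), len(test))
print((train["class"] == -1).sum(), "unlabeled rows in train")
```

The test set therefore contains labeled rows only, so the metrics are never computed against the `-1` placeholder.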

In [72]:
lp_models = {}
ls_models = {}

<div id="ls_models" >
    <h3>Label Spreading Models</h3>
</div>

In [100]:
ls_models["ls_tfid_knn"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("estimator", LabelSpreading(
        kernel='knn', n_neighbors=5
    ))
])

In [101]:
ls_models["ls_tfid_knn"].fit(X_train,y_train)

In [84]:
ls_models["ls_tfid_knn_svd"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),  # Adjust n_components as needed
    ("estimator", LabelSpreading(kernel='knn', n_neighbors=5))
])

In [85]:
ls_models["ls_tfid_knn_svd"].fit(X_train,y_train)
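
The pipelines fix `n_components=100` with a note to adjust it as needed. One possible way to guide that choice is to inspect the cumulative explained variance of the SVD, as in this sketch. The toy corpus, `n_components=3`, and `random_state=0` are assumptions for illustration; on the real data one would fit on `X_train` with larger values.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the e-mail contents.
docs = [
    "win free money now",
    "claim your free prize now",
    "meeting agenda attached",
    "project meeting notes attached",
    "free money prize",
    "agenda for the project meeting",
]

X_tfidf = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=3, random_state=0).fit(X_tfidf)

# Fraction of variance retained by the first k components, for k = 1..3.
cumulative = svd.explained_variance_ratio_.cumsum()
for k, ratio in enumerate(cumulative, start=1):
    print(f"{k} components: {ratio:.3f}")
```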

In [102]:
ls_models["ls_tfid_rbf"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("estimator", LabelSpreading(
        kernel='rbf'
    ))
])

In [103]:
ls_models["ls_tfid_rbf"].fit(X_train,y_train)

In [94]:
ls_models["ls_tfid_rbf_svd"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),  # Adjust n_components as needed
    ("estimator", LabelSpreading(kernel='rbf'))
])

In [95]:
ls_models["ls_tfid_rbf_svd"].fit(X_train,y_train)

In [104]:
ls_models["ls_cv_knn"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("estimator", LabelSpreading(
        kernel='knn', n_neighbors=5
    ))
])

In [105]:
ls_models["ls_cv_knn"].fit(X_train,y_train)

In [106]:
ls_models["ls_cv_knn_svd"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),  # Adjust n_components as needed
    ("estimator", LabelSpreading(kernel='knn', n_neighbors=5))
])

In [107]:
ls_models["ls_cv_knn_svd"].fit(X_train,y_train)

In [98]:
ls_models["ls_cv_rbf"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("estimator", LabelSpreading(
        kernel='rbf'
    ))
])

In [99]:
ls_models["ls_cv_rbf"].fit(X_train,y_train)

In [96]:
ls_models["ls_cv_rbf_svd"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),  # Adjust n_components as needed
    ("estimator", LabelSpreading(kernel='rbf'))
])

In [97]:
ls_models["ls_cv_rbf_svd"].fit(X_train,y_train)

<div id="lp_models" >
    <h3>Label Propagation Models</h3>
</div>

In [125]:
lp_models["lp_tfid_knn"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("estimator", LabelPropagation(
        kernel='knn', n_neighbors=5
    ))
])

In [126]:
lp_models["lp_tfid_knn"].fit(X_train,y_train)

In [127]:
lp_models["lp_tfid_knn_svd"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),
    ("estimator", LabelPropagation(
        kernel='knn', n_neighbors=5
    ))
])

In [128]:
lp_models["lp_tfid_knn_svd"].fit(X_train,y_train)

In [129]:
lp_models["lp_tfid_rbf"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("estimator", LabelPropagation(
        kernel='rbf'
    ))
])

In [130]:
lp_models["lp_tfid_rbf"].fit(X_train,y_train)

In [131]:
lp_models["lp_tfid_rbf_svd"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),
    ("estimator", LabelPropagation(
        kernel='rbf'
    ))
])

In [132]:
lp_models["lp_tfid_rbf_svd"].fit(X_train,y_train)

In [133]:
lp_models["lp_cv_knn"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("estimator", LabelPropagation(
        kernel='knn', n_neighbors=5
    ))
])

In [134]:
lp_models["lp_cv_knn"].fit(X_train,y_train)

In [135]:
lp_models["lp_cv_knn_svd"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),
    ("estimator", LabelPropagation(
        kernel='knn', n_neighbors=5
    ))
])

In [136]:
lp_models["lp_cv_knn_svd"].fit(X_train,y_train)

In [137]:
lp_models["lp_cv_rbf"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("estimator", LabelPropagation(
        kernel='rbf'
    ))
])

In [138]:
lp_models["lp_cv_rbf"].fit(X_train,y_train)

In [139]:
lp_models["lp_cv_rbf_svd"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),
    ("estimator", LabelPropagation(
        kernel='rbf'
    ))
])

In [140]:
lp_models["lp_cv_rbf_svd"].fit(X_train,y_train)
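
Each model family above covers the same 2 x 2 x 2 grid: vectorizer (TF-IDF or counts), kernel (kNN or RBF), and with or without SVD. As a sketch, the same pipelines could be generated in a loop. The helper name `build_grid` is hypothetical; the step names, key names, and hyperparameters mirror the cells above.

```python
from itertools import product

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

def build_grid(estimator_cls, prefix):
    """Build the eight unfitted pipelines for one estimator class (hypothetical helper)."""
    vectorizers = {"tfid": TfidfVectorizer, "cv": CountVectorizer}
    kernels = {
        "knn": {"kernel": "knn", "n_neighbors": 5},
        "rbf": {"kernel": "rbf"},
    }
    models = {}
    for (vec_name, vec_cls), (k_name, k_params), use_svd in product(
        vectorizers.items(), kernels.items(), (False, True)
    ):
        steps = [("cv", vec_cls())]
        if use_svd:
            steps.append(("svd", TruncatedSVD(n_components=100)))
        steps.append(("estimator", estimator_cls(**k_params)))
        name = f"{prefix}_{vec_name}_{k_name}" + ("_svd" if use_svd else "")
        models[name] = Pipeline(steps=steps)
    return models

ls_grid = build_grid(LabelSpreading, "ls")
lp_grid = build_grid(LabelPropagation, "lp")
print(sorted(lp_grid))
```

Fitting would then be a single loop, for example `for m in lp_grid.values(): m.fit(X_train, y_train)`.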

<div id="eva" >
    <h3>Evaluation</h3>
</div>

In [117]:
def validate(models,metrics,X_test,y_test):
    # Score every fitted model on the held-out labeled test set.
    result = []
    for model in models.values():
        y_hat = model.predict(X_test)
        # One row per model, one column per metric.
        scores = [metric(y_test,y_hat) for metric in metrics]
        result.append(scores)
    columns = [metric.__name__ for metric in metrics]
    return pd.DataFrame(data=result,columns=columns,index=models.keys())

In [141]:
lp_evaluation_df = validate(lp_models,[accuracy_score,recall_score,f1_score,precision_score],X_test,y_test)

In [145]:
lp_evaluation_df.sort_values(by='f1_score', ascending=False)

| model | accuracy_score | recall_score | f1_score | precision_score |
|---|---|---|---|---|
| lp_tfid_knn | 0.917763 | 0.741176 | 0.834437 | 0.954545 |
| lp_tfid_rbf | 0.891447 | 0.611765 | 0.759124 | 1.0 |
| lp_cv_knn_svd | 0.8125 | 0.882353 | 0.724638 | 0.614754 |
| lp_tfid_knn_svd | 0.756579 | 0.964706 | 0.689076 | 0.535948 |
| lp_cv_rbf_svd | 0.819079 | 0.682353 | 0.678363 | 0.674419 |
| lp_cv_knn | 0.730263 | 0.964706 | 0.666667 | 0.509317 |
| lp_tfid_rbf_svd | 0.592105 | 1.0 | 0.578231 | 0.406699 |
| lp_cv_rbf | 0.743421 | 0.364706 | 0.442857 | 0.563636 |


In [149]:
ls_evaluation_df = validate(ls_models,[accuracy_score,recall_score,f1_score,precision_score],X_test,y_test)

In [150]:
ls_evaluation_df.sort_values(by='f1_score', ascending=False)

| model | accuracy_score | recall_score | f1_score | precision_score |
|---|---|---|---|---|
| ls_tfid_rbf | 0.907895 | 0.811765 | 0.831325 | 0.851852 |
| ls_tfid_knn | 0.914474 | 0.752941 | 0.831169 | 0.927536 |
| ls_cv_knn_svd | 0.861842 | 0.847059 | 0.774194 | 0.712871 |
| ls_tfid_knn_svd | 0.805921 | 0.941176 | 0.730594 | 0.597015 |
| ls_cv_rbf_svd | 0.845395 | 0.729412 | 0.725146 | 0.72093 |
| ls_cv_knn | 0.789474 | 0.929412 | 0.711712 | 0.576642 |
| ls_tfid_rbf_svd | 0.657895 | 1.0 | 0.620438 | 0.449735 |
| ls_cv_rbf | 0.763158 | 0.364706 | 0.462687 | 0.632653 |
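
Both tables are ranked by F1, which is the harmonic mean of precision and recall. As a quick sanity check, this sketch recomputes F1 from the reported precision and recall of the three top-scoring models; the helper name `f1_from` is an assumption, and the numbers are copied from the tables.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f1) taken from the evaluation tables.
reported = {
    "lp_tfid_knn": (0.954545, 0.741176, 0.834437),
    "ls_tfid_rbf": (0.851852, 0.811765, 0.831325),
    "ls_tfid_knn": (0.927536, 0.752941, 0.831169),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    print(f"{name}: reported={f1:.6f} recomputed={recomputed:.6f}")
```

The recomputed values agree with the reported ones up to rounding, and they show that the top three models are separated by less than 0.004 in F1.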


<div id="con" >
    <h3>Conclusions</h3>
</div>

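Label Propagation with TF-IDF features (`TfidfVectorizer`) and a kNN kernel (`lp_tfid_knn`) gives the best result, with an F1 score of 0.834 on the test set. It is only narrowly ahead of Label Spreading with TF-IDF features and either an RBF or a kNN kernel (both about 0.831). Adding `TruncatedSVD` lowered the F1 score for every TF-IDF pipeline.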

<div id="save_model" >
    <h3>Save the best model</h3>
</div>

In [153]:
# Pick the Label Propagation pipeline with the highest F1 score on the test set.
best_name = lp_evaluation_df["f1_score"].idxmax()
model = lp_models[best_name]
dump(value = model,filename=os.path.join("..","..","models", "lp.joblib"))

['..\\..\\models\\lp.joblib']