# Modeling & training

## Naive Bayes

## Outline
- [Necessary packages](#necessary_packages)
- [Data loading](#data_loading)
- [Modeling and training](#modeling_and_training)
- [Conclusion](#conclusion)
- [Save the best model](#save_the_best_model)

<div id="necessary_packages" >
    <h3>Necessary packages</h3>
</div>

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB,ComplementNB,GaussianNB,BernoulliNB,CategoricalNB
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import cross_validate,KFold
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,make_scorer
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from joblib import dump

<div id="data_loading" >
    <h3>Data loading</h3>
</div>

In [2]:
path = os.path.join("..","..","data","clean_df.csv")
df = pd.read_csv(path, encoding="iso-8859-1")
df.fillna("",inplace=True)

In [3]:
df.columns

Index(['class', 'content', 'urls_count', 'digits_count',
       'contains_currency_symbols', 'length'],
      dtype='object')

<div id="modeling_and_training" >
    <h3>Modeling & training</h3>
</div>

In [4]:
X = df["content"]
y = df["class"]
X = X[y != -1]
y = y[y != -1]
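
The filter above can be sketched on a toy frame. This assumes that `-1` marks rows without a usable label, which the notebook does not state explicitly; the toy rows and the name `labeled` are illustrative only. Building one boolean mask and applying it to both columns keeps `X` and `y` aligned in a single step.

```python
import pandas as pd

# Toy stand-in for clean_df.csv; these rows are made up for illustration.
toy = pd.DataFrame({
    "content": ["win cash now", "see you at noon", "free prize", "call me"],
    "class": [1, 0, -1, -1],
})

# Assumption: -1 means "no label", so those rows are dropped before training.
labeled = toy["class"] != -1
X_toy = toy.loc[labeled, "content"]
y_toy = toy.loc[labeled, "class"]

# Inspect the class balance of what remains.
print(y_toy.value_counts())
```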

In [5]:
scoring = {
    "accuracy":make_scorer(accuracy_score),
    "f1_score":make_scorer(f1_score),
    "precision":make_scorer(precision_score),
    "recall":make_scorer(recall_score)
}
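
As a rough illustration of how this dictionary is consumed: `cross_validate` stores each metric under the dictionary key prefixed with `test_`, and scikit-learn's built-in scorer names (for example `"f1"`) should give the same values as wrapping the metric with `make_scorer`. The toy data and the `DummyClassifier` below are assumptions made only to keep the sketch small.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold, cross_validate

# Toy binary problem: labels alternate 0, 1 and the features carry no signal.
X_toy = np.zeros((20, 1))
y_toy = np.array([0, 1] * 10)

# A classifier that always predicts the positive class.
clf = DummyClassifier(strategy="constant", constant=1)
cv = KFold(n_splits=5)

custom = cross_validate(clf, X_toy, y_toy, cv=cv,
                        scoring={"f1_score": make_scorer(f1_score)})
builtin = cross_validate(clf, X_toy, y_toy, cv=cv, scoring=["f1"])

# Result keys are "test_" + the scorer name.
print(sorted(custom.keys()))
print(custom["test_f1_score"], builtin["test_f1"])
```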

In [6]:
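def evaluate_cv(models, metrics, cv, X, y):
    """Cross-validate each model and return the mean of every metric, one row per model."""
    rows = []
    index = models.keys()
    # cross_validate returns fit_time and score_time first, then one test_<name> entry per metric.
    columns = ["fit_time", "score_time"] + list(metrics.keys())

    for model in models.values():
        results = cross_validate(model, X, y, cv=cv, scoring=metrics)
        # Average each per-fold array so every model contributes a single row.
        rows.append([score.mean() for score in results.values()])

    return pd.DataFrame(data=rows, index=index, columns=columns)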

In [7]:
models = {}

In [8]:
models["mnb_cv"] = Pipeline(steps=[
    ("cv", CountVectorizer(binary=True)),
    ("estimator", MultinomialNB())
])

In [9]:
models["mnb_tfidf"] = Pipeline(steps=[
    ("cv", TfidfVectorizer(binary=True)),
    ("estimator", MultinomialNB())
])

In [10]:
models["cnb_cv"] = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("estimator", ComplementNB())
])

In [11]:
models["cnb_tfidf"] = Pipeline(steps=[
    ("cv", TfidfVectorizer()),
    ("estimator", ComplementNB())
])

In [12]:
models["bnb_cv"] = Pipeline(steps=[
    ("cv", CountVectorizer(binary=True)),
    ("estimator", BernoulliNB())
])

In [13]:
models["bnb_tfidf"] = Pipeline(steps=[
    ("cv", TfidfVectorizer(binary=True)),
    ("estimator", BernoulliNB())
])

In [14]:
evaluation_df = evaluate_cv(models,scoring,cv=KFold(shuffle=True),X=X,y=y)
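
`KFold(shuffle=True)` without a `random_state` reshuffles every time it is split, so each pipeline is scored on different folds and the numbers change between runs. A possible variant, which is not what produced the results in this notebook, fixes the seed and stratifies by class so every fold keeps the class ratio. `StratifiedKFold`, the seed value `0`, and the toy labels are assumptions in this sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 16 negatives and 4 positives.
y_toy = np.array([0] * 16 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))

# A fixed integer seed makes every call to split() return the same folds.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
splits_a = [test for _, test in cv.split(X_toy, y_toy)]
splits_b = [test for _, test in cv.split(X_toy, y_toy)]

same = all(np.array_equal(a, b) for a, b in zip(splits_a, splits_b))
positives_per_fold = [int(y_toy[test].sum()) for test in splits_a]
print(same, positives_per_fold)
```

Passing such a splitter as `cv` would let all six pipelines be compared on identical folds.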

In [15]:
evaluation_df

| model | fit_time | score_time | accuracy | f1_score | precision | recall |
|---|---|---|---|---|---|---|
| mnb_cv | 0.101409 | 0.024082 | 0.938668 | 0.880051 | 0.98373 | 0.797399 |
| mnb_tfidf | 0.090092 | 0.025386 | 0.794284 | 0.446246 | 1.0 | 0.2924 |
| cnb_cv | 0.086805 | 0.024823 | 0.964381 | 0.936708 | 0.944505 | 0.930767 |
| cnb_tfidf | 0.088114 | 0.025045 | 0.910974 | 0.818369 | 0.975368 | 0.708708 |
| bnb_cv | 0.086715 | 0.025592 | 0.778413 | 0.392714 | 0.923687 | 0.253771 |
| bnb_tfidf | 0.087969 | 0.02578 | 0.777418 | 0.386089 | 0.933822 | 0.244209 |


<div id="conclusion" >
    <h3>Conclusion</h3>
</div>

- Complement naive Bayes with bag-of-words features (`CountVectorizer`) gives the best results: the highest F1 score (0.937) and the highest accuracy (0.964).
- Bernoulli naive Bayes performs poorly on data points that belong to the minority class: its recall is about 0.25 with either vectorizer.
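
The near-identical `bnb_cv` and `bnb_tfidf` rows are consistent with how `BernoulliNB` treats its input: with the default `binarize=0.0` it maps every non-zero feature to 1, so counts and TF-IDF weights collapse to the same presence/absence matrix when the vocabulary is the same. The remaining small differences in the table would then plausibly come from the unseeded fold shuffle. A minimal sketch on made-up sentences (the toy corpus and labels are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up corpus: 1 = spam-like, 0 = ordinary message.
docs = [
    "win free cash now", "free prize win", "cash prize now",
    "meeting at noon", "see you at lunch", "lunch at noon today",
]
labels = [1, 1, 1, 0, 0, 0]

X_count = CountVectorizer(binary=True).fit_transform(docs)
X_tfidf = TfidfVectorizer(binary=True).fit_transform(docs)

# BernoulliNB binarizes both matrices at 0.0 before fitting.
proba_count = BernoulliNB().fit(X_count, labels).predict_proba(X_count)
proba_tfidf = BernoulliNB().fit(X_tfidf, labels).predict_proba(X_tfidf)

print(np.allclose(proba_count, proba_tfidf))
```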

<div id="save_the_best_model" >
    <h3>Save the best model</h3>
</div>

In [16]:
best_model = models[evaluation_df["f1_score"].idxmax()].fit(X, y)  # cross_validate only fits clones, so refit on all labeled data before saving
dump(value=best_model, filename=os.path.join("..", "..", "models", "ssl", "nb.joblib"))

['../../models/ssl/nb.joblib']
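
A saved pipeline can later be restored with `joblib.load` and applied directly to raw text, since the vectorizer is stored together with the classifier. The round trip below is only a sketch: it uses a temporary directory rather than the notebook's `models` folder, scikit-learn's `Pipeline` rather than the imblearn one, and a made-up toy corpus.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Made-up corpus: 1 = spam-like, 0 = ordinary message.
docs = [
    "win free cash now", "free prize win", "cash prize now",
    "meeting at noon", "see you at lunch", "lunch at noon today",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline must be fitted before it is saved, or the file holds an untrained model.
pipeline = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("estimator", ComplementNB()),
]).fit(docs, labels)

new_messages = ["free cash prize", "lunch at noon"]

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "nb.joblib")  # hypothetical location
    dump(pipeline, model_path)
    restored = load(model_path)

predictions = restored.predict(new_messages)
print(predictions)
```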