# MBTI Project - Modeling (step 5)
<br>

<div class="span5 alert alert-info">
<h3>Introduction</h3>
    <p>This notebook contains the <b>Modeling</b> step which comes after the <b>Feature Engineering & Preprocessing</b> step. The main goal of this step involves selecting, training and deploying a model to make predictive insights.<p>
</div>

<div class="span5 alert alert-danger">

<h3>Disclaimer</h3>
    <p>The purpose of this notebook is to go over certain aspects of Natural Language Processing. There might be some parts of the notebook that do not have particular use for the future of this project but they are useful for learning purposes so I left them inside. I also would like to mention that some of the code here is recycled from online articles and notebooks on GitHub, I will try to mention every source as best as possible.
</div>

<a id=top><a>

<br>

### Table of Contents

- [Summarized goals](#goals)
- [Importing Libraries](#importing)
- [Review of our Dataset](#review)
- [Models Introduction](#model)
- [Parameters and Models](#parameters)
- [Stopwords](#stopwords)
- [Train Test Split](#train_test)
- [CountVectorizer and tf-id](#cv)
- [Report Function](#report)
- [Let's Start Modeling](#modeling) Every model is created with CountVectorizer, TF-IDF words, TF-IDF n_grams, TF-IDF characters
    - [MACHINE LEARNING](#ml)
        - [Multinomial Naive Bayes Models](#nb)  
        - [Logistic Regression](#lr)  
        - [Support Vector Machines](#svm)  
        - [K-Nearest Neightbors](#knn)  
        - [Random Forest](#NB)      
        - [Stocastic Gradient Descent](#sgd)
        - [Boosting](#boost)
            - [Gradient Boosting Classifier](#gbc)
            - [XGBoost](#xgb)
            - [Catboost](#cb) - Pending
            - [Adaboost](#ab) - Pending        
            - [LightGBM](#lgbm) - Pending       
    - [DEEP LEARNING](#dl) - Pending all section
        - [Shallow Neural Network](#snn)
        - [Deep Neural Network](#dnn)
        - [Transformers](#trans)

<a id=goals></a>

## Summarized Goals
***

Find the best model that classifies each post into the pair of attributes of the MBTI:
 - Introversion vs. Extraversion (I vs. E)
 - Intuition vs. Sensing (N vs. S)
 - Thinking vs. Feeling (T vs. F) --> This notebook focuses on this attribute
 - Judging vs. Perceiving (J vs. P)

<a id=importing></a>

## Imports
***

In [1]:
# data wrangiling libraries
import pandas as pd
import numpy as np

# visualization libraries
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.transforms
from matplotlib.patches import Patch
import seaborn as sns

%matplotlib inline
sns.set() #

# natural language processing libraries
import nltk
import nltk.corpus 
import textstat

# other libraries
import os
import re
import random
import string
import pickle
import itertools
from tqdm import tqdm, tqdm_pandas
tqdm.pandas(desc="Progress!")
import time
import warnings
pd.options.mode.chained_assignment = None  # default='warn'

  from pandas import Panel


In [2]:
df = pd.read_csv('../../data/mbti_nlp.csv', index_col=0)

<a id=model></a>

<br>

## Models Introduction
***

In [3]:
# Extract only the columns we need
T = df[['T','text_clean_joined']]

**Note** I will be using `Christophe Pere's` notebook as the basis for this model. All credits go to him, [here is the original notebook](https://github.com/Christophe-pere/Model-Selection/blob/master/Text_Classification_Compare_Models.ipynb) and here is his [TowardsDataScience article](https://towardsdatascience.com/model-selection-in-text-classification-ac13eedf6146)

In [4]:
import glob
import sklearn
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import decomposition, ensemble
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score, recall_score, f1_score
from sklearn.metrics import make_scorer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier

print(sklearn.__version__)

0.23.2


In [5]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.utils import to_categorical

print(tf.__version__)

2.3.1


In [6]:
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [7]:
import fasttext
import fasttext.util

In [8]:
# Functions to extract the true, false positive and true false negative
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

<a id=parameters></a>

## Parameters & Models
***

In [9]:
TEXT           = "text_clean_joined"
LABEL          = "T"
NAME_SAVE_FILE = "model_selection_results_TF" # put just the name the .csv will be added at the end

# global parameters
num_gpu                = len(tf.config.experimental.list_physical_devices('GPU'))   # detect the number of gpu
CV_splits              = 5        # Number of splits for cross-validation and k-folds
save_results           = True     # if you want an output file containing all the results
lang                   = False    # test if you want to use Google API detection (you will need to "import from googletrans import Translator")
sample                 = True     # use just a sample of data
nb_sample              = 6000     # default value of rows if sample selected
save_model             = True     # concat all the data representation
root_dir               = "models/"       # Place here the path where you want your models stored or use /path/to/your/folder/

In [10]:
# These are the names how the files will be saved as 
NAME_ENCODER                  = "encoder.sav"
NAME_COUNT_VECT_MODEL         = "count_vect_model.sav"
NAME_TF_IDF_MODEL             = "TF_IDF_model.sav"
NAME_TF_IDF_NGRAM_MODEL       = "TF_IDF_ngram_model.sav"
NAME_TF_IDF_NGRAM_CHAR_MODEL  = "TF_IDF_ngram_chars_model.sav"
NAME_TOKEN_EMBEDDINGS         = "token_embeddings.sav"

In [11]:
# models 
multinomial_naive_bayes= True
logistic_regression    = True
svm_model              = True
k_nn_model             = True
sgd                    = True
random_forest          = True
gradient_boosting      = True
xgboost_classifier     = True
adaboost_classifier    = True 
catboost_classifier    = True 
lightgbm_classifier    = True 
extratrees_classifier  = True
shallow_network        = True
deep_nn                = True
rnn                    = True
lstm                   = True
cnn                    = True
gru                    = True
cnn_lstm               = True
cnn_gru                = True
bidirectional_rnn      = True
bidirectional_lstm     = True
bidirectional_gru      = True
rcnn                   = True
transformers           = False
pre_trained            = False

In [12]:
if save_model:
    # will create the folder to save all the models
    try:
        dir_name =  NAME_SAVE_FILE
        os.makedirs(os.path.join(root_dir,dir_name))
        print("The folder is created")
    except:
        print("The folder can not be created")

The folder is created


In [13]:
# Here you can put all the metrics you want (included in sklearn.metrics).
score_metrics = {'acc': accuracy_score,
               'balanced_accuracy': balanced_accuracy_score,
               'prec': precision_score,
               'recall': recall_score,
               'f1-score': f1_score,
               'tp': tp, 'tn': tn,
               'fp': fp, 'fn': fn,
               'cohens_kappa':cohen_kappa_score,
               'matthews_corrcoef':matthews_corrcoef,
               "roc_auc":roc_auc_score}

`Christophe Pere` has a set of functions to clean the text but we have already done that so I will not add them

<a id='stopwords'></a>

## Stopwords
***

In [14]:
# we will do add a remove stop words function
def remove_stop_words( x, stop_word):
        '''
        Function to remove a list of words
        @param x : (str) text 
        @param stop_word: (list) list of stopwords to delete 
        @return: (str) new string without stopwords 
        '''
        x_new = text_to_word_sequence(x)    # tokenize text 
        x_ = []
        for i in x_new:
            if i not in stop_word:
                x_.append(i)
        return " ".join(x_)

In [15]:
# MBTI types are rarely discussed in day to day converstaions, we will take them out since they would have low prediction power
types = [x.lower() for x in df['type'].unique()] 
types_plural = [x+'s' for x in types]

# some words that appear a lot but do not add value
additional_stop_words = ['ll','type','fe','ni','na','wa','ve','don','nt','nf', 'ti','se','op','ne'] 

# We put these together and include the normal stopwords from the English language
stop_words = sklearn.feature_extraction.text.ENGLISH_STOP_WORDS.union(additional_stop_words + types + types_plural)

In [16]:
T[TEXT] = T.loc[:,TEXT].progress_apply(lambda x : remove_stop_words(x, stop_words))

Progress!: 100%|██████████| 8675/8675 [00:03<00:00, 2607.81it/s]


<a id='train_test'></a>

## Train Test Split
***

To keep the code simple (for the time being), I will start by focusing on the `Thinking / Feeling` and later implement the same process for the rest.

In [17]:
df = T.copy()
X_train, X_test, y_train, y_test = train_test_split(df[TEXT], df[LABEL], test_size=0.25, random_state=42, stratify=df[LABEL])

**Personal note on stratify:** if variable y is a binary categorical variable with values 0 and 1 and there are 25% of zeros and 75% of ones, stratify=y will make sure that your random split has 25% of 0's and 75% of 1's.

In [18]:
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                 classes=np.unique(y_train),
                                                 y=y_train)

In [19]:
print(*[f'Class weight: {round(i[0],4)}\tclass: {i[1]}' for i in zip(class_weights, np.unique(y_train))], sep='\n')

Class weight: 0.9241	class: 0
Class weight: 1.0894	class: 1


In [20]:
# Determined if the dataset is balanced or imbalanced 
ratio = np.min(df[LABEL].value_counts()) / np.max(df[LABEL].value_counts())
if ratio > 0.1:      # Ratio 1:10 -> limite blanced / imbalanced 
    balanced = True
    print(f"\nThe dataset is balanced (ratio={round(ratio, 3)})")
else:
    balanced = False
    print(f"\nThe dataset is imbalanced (ratio={round(ratio, 3)})")
    #from imblearn.over_sampling import ADASYN
    # put class for debalanced data 
    # in progress


The dataset is balanced (ratio=0.848)


<a id='cv'></a>

## CountVectorizer & TF-IDF
***

This section transforms our data into something interpretable by the machine

In [21]:
%%time
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df[TEXT])

# transform the training and validation data using count vectorizer object
x_train_count =  count_vect.transform(X_train)
x_test_count =  count_vect.transform(X_test)

if save_model:
    # save the model to disk
    filename = NAME_COUNT_VECT_MODEL
    pickle.dump(count_vect, open(os.path.join(root_dir, dir_name,filename), 'wb'))

CPU times: user 9.44 s, sys: 259 ms, total: 9.7 s
Wall time: 10.4 s


In [22]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=10000)
tfidf_vect.fit(df[TEXT])
x_train_tfidf =  tfidf_vect.transform(X_train)
x_test_tfidf =  tfidf_vect.transform(X_test)
print("word level tf-idf done")

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram.fit(df[TEXT])
x_train_tfidf_ngram =  tfidf_vect_ngram.transform(X_train)
x_test_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)
print("ngram level tf-idf done")

# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char',  ngram_range=(2,3), max_features=10000) 
tfidf_vect_ngram_chars.fit(df[TEXT])
x_train_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_train) 
x_test_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_test) 
print("characters level tf-idf done")

if save_model:
    # save the model tf-idf to disk
    filename = NAME_TF_IDF_MODEL
    pickle.dump(tfidf_vect, open(os.path.join(root_dir, dir_name,filename), 'wb'))

    # save the model ngram to disk
    filename = NAME_TF_IDF_NGRAM_MODEL
    pickle.dump(tfidf_vect_ngram, open(os.path.join(root_dir, dir_name,filename), 'wb'))
    
    # save the model ngram char to disk
    filename = NAME_TF_IDF_NGRAM_CHAR_MODEL
    pickle.dump(tfidf_vect_ngram_chars, open(os.path.join(root_dir, dir_name,filename), 'wb'))

word level tf-idf done
ngram level tf-idf done
characters level tf-idf done
CPU times: user 2min 17s, sys: 2.44 s, total: 2min 19s
Wall time: 2min 21s


<a id='report'></a>

## Report Function
***

The followign function will generate, for each model we create, a set of metrics that evaluate how well that model did

In [23]:
def report(clf, x, y, X_test, y_test, name='classifier', cv=5, dict_scoring=None, fit_params=None, save=save_model):
    '''
    Function create a metric report automatically with cross_validate function.
    @param clf: (model) classifier
    @param x: (list or matrix or tensor) training x data
    @param y: (list) label data 
    @param name: (string) name of the model (default classifier)
    @param cv: (int) number of fold for cross-validation (default 5)
    @param dict_scoring: (dict) dictionary of metrics and names
    @param fit_aparams: (dict) add parameters for model fitting 
    @param save: (bool) determine if the model need to be saved
    @return: (pandas.dataframe) dataframe containing all the results of the metrics 
    for each fold and the mean and std for each of them
    '''
    
    '''{'acc': accuracy_score,
               'balanced_accuracy': balanced_accuracy_score,
               'prec': precision_score,
               'recall': recall_score,
               'f1-score': f1_score,
               'tp': tp, 'tn': tn,
               'fp': fp, 'fn': fn,
               'cohens_kappa':cohen_kappa_score,
               'matthews_corrcoef':matthews_corrcoef,
               "roc_auc":roc_auc_score}'''
    
    
    if dict_scoring!=None:
        score = dict_scoring.copy() # save the original dictionary
        for i in score.keys():
            if len(set(y))>2:
                if i in ["prec", "recall", "f1-score"]:
                    score[i] = make_scorer(score[i], average = 'weighted') # make each function scorer
                elif i=="roc_auc":
                    score[i] = make_scorer(score[i], average = 'weighted', multi_class="ovo",needs_proba=True) # make each function scorer
                else:
                    score[i] = make_scorer(score[i]) # make each function scorer
                    
            else:
                score[i] = make_scorer(score[i]) # make each function scorer
            
    try:
        scores = cross_validate(clf, x, y, scoring=score,
                         cv=cv, return_train_score=False, n_jobs=-1,  fit_params=fit_params)
    except:
        scores = cross_validate(clf, x, y, scoring=score,
                         cv=cv, return_train_score=False,  fit_params=fit_params)
        
    # Train test on the overall data
    fit_start = time.time()
    _model = clf
    _model.fit(x, y)
        
    fit_end = time.time() - fit_start

    
    score_start = time.time()
    y_pred = _model.predict(X_test)#>0.5).astype(int)
    score_end = time.time() - score_start
    
    # this saves the model for reuse
    if save:
        filename= name+".sav"
        pickle.dump(_model, open(os.path.join(root_dir, dir_name,filename), 'wb'))
    
    # initialisation 
    index = []
    value = []
    index.append("Model")
    value.append(name)
    for i in scores:  # loop on each metric generate text and values
        if i == "estimator":
            continue
        for j in enumerate(scores[i]):
            index.append(i+"_cv"+str(j[0]+1))
            value.append(j[1])
        
        
        index.append(i+"_mean")
        value.append(np.mean(scores[i]))
        index.append(i+"_std")
        value.append(np.std(scores[i]))
    
     # add metrics averall dataset on the dictionary 
    
    for i in scores:    # compute metrics 
        if i == "fit_time":
            
            scores[i] = np.append(scores[i] ,fit_end)
            index.append(i.split("test_")[-1]+'_overall')
            value.append(fit_end)
            continue
        if i == "score_time":
            
            scores[i] = np.append(scores[i] ,score_end)
            index.append(i.split("test_")[-1]+'_overall')
            value.append(score_end)
            continue
              
        
        scores[i] = np.append(scores[i] ,score[i.split("test_")[-1]](_model, X_test, y_test))
        index.append(i.split("test_")[-1]+'_overall')
        value.append(scores[i][-1])
    
    return pd.DataFrame(data=value, index=index).T

<a id='modeling'></a>

## Let's Start Modeling!
***

In [24]:
# We start by creating the empty dataframe we will use to put the results of each model we create
df_results = pd.DataFrame()

<a id='ml'></a>

## Machine Learning Models

<a id='nb'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Multinomial Naïve Bayes</H5>
</div>

In [25]:
%%time
if multinomial_naive_bayes:
    df_results = df_results.append(report(naive_bayes.MultinomialNB(),x_train_count, y_train, x_test_count, y_test, name='NB_Count_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(naive_bayes.MultinomialNB(),x_train_tfidf, y_train, x_test_tfidf, y_test, name='NB_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(naive_bayes.MultinomialNB(),x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='NB_N-Gram_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(naive_bayes.MultinomialNB(),x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='NB_CharLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))

CPU times: user 803 ms, sys: 272 ms, total: 1.08 s
Wall time: 5.83 s


<a id='lr'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Logistic Regression</H5>
</div>

In [26]:
%%time
if logistic_regression:
    df_results = df_results.append(report(linear_model.LogisticRegression(max_iter=1000), x_train_count, y_train, x_test_count, y_test, name='LR_Count_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(linear_model.LogisticRegression(max_iter=1000), x_train_tfidf, y_train, x_test_tfidf, y_test, name='LR_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(linear_model.LogisticRegression(max_iter=1000), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='LR_N-Gram_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(linear_model.LogisticRegression(max_iter=1000), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='LR_CharLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))

CPU times: user 40.2 s, sys: 4.09 s, total: 44.3 s
Wall time: 56.8 s


<a id='svm'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Support Vector Machine</H5>
</div>

In [27]:
%%time
if svm_model:
    df_results = df_results.append(report(svm.SVC(), x_train_count, y_train, x_test_count, y_test, name='SVM_Count_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(svm.SVC(), x_train_tfidf, y_train, x_test_tfidf, y_test, name='SVM_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(svm.SVC(), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='SVM_N-Gram_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(svm.SVC(), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='SVM_CharLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))

CPU times: user 1h 1min 54s, sys: 7.14 s, total: 1h 2min 1s
Wall time: 1h 28min 9s


<a id='knn'></a>

<br>
<div class="span5 alert alert-info">
    <H5>K-Nearest Neighbors</H5>
</div>

In [28]:
%%time
if k_nn_model:
    df_results = df_results.append(report(KNeighborsClassifier(n_neighbors=20, weights='distance', n_jobs=-1), x_train_count, y_train, x_test_count, y_test, name='kNN_Count_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(KNeighborsClassifier(n_neighbors=20, weights='distance', n_jobs=-1), x_train_tfidf, y_train, x_test_tfidf, y_test, name='kNN_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(KNeighborsClassifier(n_neighbors=20, weights='distance', n_jobs=-1), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='kNN_N-Gram_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(KNeighborsClassifier(n_neighbors=20, weights='distance', n_jobs=-1), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test,  name='kNN_CharLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))

CPU times: user 16min 40s, sys: 19.2 s, total: 16min 59s
Wall time: 5min 52s


<a id='rf'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Random Forest</H5>
</div>

In [29]:
%%time
if random_forest:
    df_results = df_results.append(report(ensemble.RandomForestClassifier(bootstrap=True,min_impurity_decrease=1e-7,n_jobs=-1, random_state=42), x_train_count, y_train, x_test_count, y_test, name='RF_Count_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(ensemble.RandomForestClassifier(bootstrap=True,min_impurity_decrease=1e-7,n_jobs=-1, random_state=42), x_train_tfidf, y_train, x_test_tfidf, y_test, name='RF_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(ensemble.RandomForestClassifier(bootstrap=True,min_impurity_decrease=1e-7,n_jobs=-1, random_state=42), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='RF_N-Gram_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(ensemble.RandomForestClassifier(bootstrap=True,min_impurity_decrease=1e-7,n_jobs=-1, random_state=42), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test,  name='RF_CharLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))

CPU times: user 2min 41s, sys: 1.49 s, total: 2min 43s
Wall time: 2min 49s


<a id='sgd'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Stocastis Gradient Descent</H5>
</div>

Linear classifiers (SVM, logistic regression, etc.) with SGD training. This estimator implements regularized linear models with stochastic gradient descent (SGD) learning

In [30]:
%%time
if sgd:
    df_results = df_results.append(report(SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3,   n_iter_no_change=10, early_stopping=True, n_jobs=-1 ), x_train_count, y_train, x_test_count, y_test, name='SGD_Count_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3,   n_iter_no_change=10, early_stopping=True, n_jobs=-1 ), x_train_tfidf, y_train, x_test_tfidf, y_test, name='SGD_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3,   n_iter_no_change=10, early_stopping=True, n_jobs=-1 ), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='SGD_N-Gram_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3,   n_iter_no_change=10, early_stopping=True, n_jobs=-1 ), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='SGD_CharLevel_Vectors', cv=CV_splits, dict_scoring=score_metrics, save=save_model))

CPU times: user 1.73 s, sys: 212 ms, total: 1.94 s
Wall time: 5.54 s


<a id='boost'></a>

<br>

### Boosting

<a id='gbc'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Gradient Boosting Classifier</H5>
</div>

In [31]:
%%time
if gradient_boosting:
    df_results = df_results.append(report(ensemble.GradientBoostingClassifier(n_estimators=1000,
                                               validation_fraction=0.2,
                                               n_iter_no_change=10, tol=0.01,
                                               random_state=0, verbose=0 ), 
                                          x_train_count, y_train, x_test_count, y_test,
                                          name='GB_Count_Vectors', 
                                          cv=CV_splits, 
                                          dict_scoring=score_metrics, 
                                          save=save_model))

CPU times: user 31.3 s, sys: 220 ms, total: 31.5 s
Wall time: 1min 45s


In [32]:
%%time
if gradient_boosting:
    df_results = df_results.append(report(ensemble.GradientBoostingClassifier(n_estimators=1000,
                                               validation_fraction=0.2,
                                               n_iter_no_change=10, tol=0.01,
                                               random_state=0, verbose=0 ), 
                                          x_train_tfidf, y_train, x_test_tfidf, y_test,
                                          name='GB_WordLevel_TF-IDF', 
                                          cv=CV_splits, 
                                          dict_scoring=score_metrics, 
                                          save=save_model))

CPU times: user 39 s, sys: 170 ms, total: 39.1 s
Wall time: 2min 4s


In [33]:
%%time
if gradient_boosting:
    df_results = df_results.append(report(ensemble.GradientBoostingClassifier(n_estimators=1000,
                                               validation_fraction=0.2,
                                               n_iter_no_change=10, tol=0.01,
                                               random_state=0, verbose=0 ), 
                                          x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test,
                                          name='GB_N-Gram_TF-IDF', cv=CV_splits, 
                                          dict_scoring=score_metrics, save=save_model))

CPU times: user 5.01 s, sys: 54.3 ms, total: 5.06 s
Wall time: 17 s


In [34]:
%%time
if gradient_boosting:
    df_results = df_results.append(report(ensemble.GradientBoostingClassifier(n_estimators=1000,
                                               validation_fraction=0.2,
                                               n_iter_no_change=10, tol=0.01,
                                               random_state=0, verbose=0 ), 
                                          x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test,
                                          name='GB_CharLevel_TF-IDF', cv=CV_splits, 
                                          dict_scoring=score_metrics, save=save_model))

CPU times: user 2min 46s, sys: 848 ms, total: 2min 47s
Wall time: 9min 25s


<a id='xgb'></a>

<br>
<div class="span5 alert alert-info">
    <H5>XGBoost</H5>
</div>

In [35]:
%%time
if xgboost_classifier:
    fit_params={'early_stopping_rounds':10,'eval_set':[(x_test_count, y_test)]}
    
    if num_gpu>0:    # Config for GPU
        df_results = df_results.append(report(XGBClassifier(tree_method='gpu_hist',n_estimators=1000, subsample=0.8), x_train_count, y_train, x_test_count, y_test, name='XGB_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    else:
        # run on CPU
        df_results = df_results.append(report(XGBClassifier(n_estimators=1000, subsample=0.8), x_train_count, y_train, x_test_count, y_test, name='XGB_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    if save_results:
        df_results.to_csv(NAME_SAVE_FILE+".csv", sep=";", index=False)

CPU times: user 13min 36s, sys: 2.61 s, total: 13min 39s
Wall time: 4min 32s


In [36]:
%%time
if xgboost_classifier:
    fit_params={'early_stopping_rounds':10,'eval_set':[(x_test_tfidf, y_test)]}
    
    if num_gpu>0:    # Config for GPU
        df_results = df_results.append(report(XGBClassifier(tree_method='gpu_hist', n_estimators=1000, subsample=0.8), x_train_tfidf, y_train, x_test_tfidf, y_test, name='XGB_WordLevel_TF-IDF', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    else:
        df_results = df_results.append(report(XGBClassifier(n_estimators=1000, subsample=0.8),x_train_tfidf, y_train, x_test_tfidf, y_test, name='XGB_WordLevel_TF-IDF', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    if save_results:
        df_results.to_csv(NAME_SAVE_FILE+".csv", sep=";", index=False)

CPU times: user 11min 52s, sys: 1.23 s, total: 11min 53s
Wall time: 4min 23s


In [37]:
%%time
if xgboost_classifier:
    fit_params={'early_stopping_rounds':10, 'eval_set':[(x_test_tfidf_ngram, y_test)]}
    
    if num_gpu>0:    # Config for GPU
        df_results = df_results.append(report(XGBClassifier(tree_method='gpu_hist',n_estimators=1000, subsample=0.8), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='XGB_N-Gram_TF-IDF', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    else:
        df_results = df_results.append(report(XGBClassifier(n_estimators=1000, subsample=0.8), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='XGB_N-Gram_TF-IDF', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    if save_results:
        df_results.to_csv(NAME_SAVE_FILE+".csv", sep=";", index=False)

CPU times: user 5min 35s, sys: 779 ms, total: 5min 36s
Wall time: 1min 54s


In [38]:
%%time
if xgboost_classifier:
    fit_params={'early_stopping_rounds':10, 'eval_set':[(x_test_tfidf_ngram_chars, y_test)]}

    if num_gpu>0:    # Config for GPU
        df_results = df_results.append(report(XGBClassifier(tree_method='gpu_hist',n_estimators=1000, subsample=0.8), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='XGB_CharLevel_TF-IDF', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    else:
        df_results = df_results.append(report(XGBClassifier(n_estimators=1000, subsample=0.8), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='XGB_CharLevel_TF-IDF', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics, save=save_model))
    
    if save_results:
        df_results.to_csv(NAME_SAVE_FILE+".csv", sep=";", index=False)


CPU times: user 44min 24s, sys: 5.12 s, total: 44min 29s
Wall time: 18min 54s


In [39]:
df_results

Unnamed: 0,Model,fit_time_cv1,fit_time_cv2,fit_time_cv3,fit_time_cv4,fit_time_cv5,fit_time_mean,fit_time_std,score_time_cv1,score_time_cv2,...,prec_overall,recall_overall,f1-score_overall,tp_overall,tn_overall,fp_overall,fn_overall,cohens_kappa_overall,matthews_corrcoef_overall,roc_auc_overall
0,NB_Count_Vectors,0.148857,0.164567,0.193882,0.1851,0.0552094,0.149523,0.0497097,0.139927,0.179873,...,0.79,0.714573,0.750396,711,985,189,284,0.557644,0.559858,0.776792
0,NB_WordLevel_TF-IDF,0.0427721,0.040756,0.0472,0.04,0.027494,0.0396444,0.0065706,0.0290551,0.0328178,...,0.822157,0.566834,0.671029,564,1052,122,431,0.474141,0.496038,0.731458
0,NB_N-Gram_TF-IDF,0.021641,0.0164397,0.013272,0.0159199,0.00968599,0.0153917,0.00393648,0.0325251,0.0329893,...,0.728169,0.519598,0.606452,517,981,193,478,0.363132,0.377194,0.677601
0,NB_CharLevel_TF-IDF,0.356506,0.354018,0.329918,0.334819,0.168894,0.308831,0.070735,0.078177,0.0425143,...,0.832,0.104523,0.185714,104,1153,21,891,0.0928224,0.185244,0.543318
0,LR_Count_Vectors,24.231,24.7313,24.0905,24.4714,8.17439,21.1397,6.48633,0.0406101,0.0223329,...,0.739604,0.750754,0.745137,747,911,263,248,0.526129,0.52618,0.763367
0,LR_WordLevel_TF-IDF,0.600737,0.623403,0.601045,0.430498,0.346689,0.520474,0.111198,0.0434902,0.0351679,...,0.769763,0.782915,0.776283,779,941,233,216,0.583688,0.58376,0.792224
0,LR_N-Gram_TF-IDF,0.36437,0.514269,0.379806,0.492275,0.293075,0.408759,0.0828236,0.0358307,0.0250649,...,0.698768,0.627136,0.661017,624,905,269,371,0.401139,0.402979,0.699002
0,LR_CharLevel_TF-IDF,2.98576,3.37208,4.0606,4.068,2.36705,3.3707,0.650797,0.0413959,0.0436339,...,0.755342,0.710553,0.732263,707,945,229,288,0.517834,0.51862,0.757747
0,SVM_Count_Vectors,123.176,123.122,123.839,124.329,71.9894,113.291,20.6557,38.2718,38.1412,...,0.751961,0.770854,0.76129,767,921,253,228,0.55429,0.554439,0.777676
0,SVM_WordLevel_TF-IDF,124.359,125.554,125.471,125.911,75.9015,115.439,19.7757,33.7508,33.8758,...,0.761905,0.78794,0.774704,784,929,245,211,0.577746,0.578033,0.789626


In [40]:
if save_results:
    df_results.to_csv(NAME_SAVE_FILE+".csv", sep=";", index=False)

<a id='cb'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Catboost</H5>
</div>

**PENDING**

In [None]:
%%time 
if catboost_classifier:
    # work in progress
    if num_gpu>0:  # test gpu available
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10, task_type="GPU"), x_train_count, y_train, x_test_count, y_test, name='Catboost_Count_Vectors', cv=CV_splits,  dict_scoring=score_metrics))
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10, task_type="GPU"), x_train_tfidf, y_train, x_test_tfidf, y_test, name='Catboost_WordLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics))
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10, task_type="GPU"), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='Catboost_N-Gram_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics))
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10, task_type="GPU"), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='Catboost_CharLevel_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics))
    else:
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10), x_train_count, y_train, x_test_count, y_test, name='Catboost_Count_Vectors', cv=CV_splits,  dict_scoring=score_metrics))
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10), x_train_tfidf, y_train, x_test_tfidf, y_test, name='Catboost_WordLevel_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics))
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='Catboost_N-Gram_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics))
        df_results = df_results.append(report(CatBoostClassifier(n_estimators=1000, early_stopping_rounds=10), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='Catboost_CharLevel_TF-IDF', cv=CV_splits, dict_scoring=score_metrics))

<a id='ab'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Adaboost</H5>
</div>

**PENDING**

In [None]:
%%time 
if adaboost_classifier:
    # work in progress
    df_results = df_results.append(report(AdaBoostClassifier(n_estimators=1000), x_train_count, y_train, x_test_count, y_test, name='Adaboost_Count_Vectors', cv=CV_splits,  dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(AdaBoostClassifier(n_estimators=1000), x_train_tfidf, y_train, x_test_tfidf, y_test, name='Adaboost_WordLevel_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(AdaBoostClassifier(n_estimators=1000), x_train_tfidf_ngram,y_train, x_test_tfidf_ngram, y_test, name='Adaboost_N-Gram_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics, save=save_model))
    df_results = df_results.append(report(AdaBoostClassifier(n_estimators=1000), x_train_tfidf_ngram_chars,y_train, x_test_tfidf_ngram_chars, y_test, name='Adaboost_CharLevel_TF-IDF', cv=CV_splits,  dict_scoring=score_metrics, save=save_model))

<a id='lgbm'></a>

<br>
<div class="span5 alert alert-info">
    <H5>LightGBM</H5>
</div>

**PENDING**

In [None]:
%%time 
if lightgbm_classifier:
    
    # work in progress
    fit_params = {'early_stopping_rounds':10,'eval_set':[(x_test_count, y_test)]}
    if num_gpu>0:
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000, device = "gpu"), xtrain_count,train_y_sw, xvalid_count, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    else:   
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000), xtrain_count,train_y_sw, xvalid_count, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    
    
    fit_params = {'early_stopping_rounds':10,'eval_set':[(x_test_tfidf, y_test)]}
    if num_gpu>0:
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000, device = "gpu"), xtrain_tfidf,train_y_sw, xvalid_tfidf, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    else:   
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000), xtrain_tfidf,train_y_sw, xvalid_tfidf, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    
    
    fit_params = {'early_stopping_rounds':10,'eval_set':[(x_test_tfidf_ngram, y_test)]}
    if num_gpu>0:
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000, device = "gpu"), xtrain_tfidf_ngram,train_y_sw, xvalid_tfidf_ngram, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    else:   
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000), xtrain_tfidf_ngram,train_y_sw, xvalid_tfidf_ngram, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    
    
    fit_params = {'early_stopping_rounds':10,'eval_set':[(x_test_tfidf_ngram_chars, y_test)]}
    if num_gpu>0:
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000, device = "gpu"), xtrain_tfidf_ngram_chars,train_y_sw, xvalid_tfidf_ngram_chars, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))
    else:   
        df_results = df_results.append(report(LGBMClassifier(n_estimators = 1000), xtrain_tfidf_ngram_chars,train_y_sw, xvalid_tfidf_ngram_chars, valid_y, name='LGM_Count_Vectors', cv=CV_splits, fit_params=fit_params, dict_scoring=score_metrics))

In [110]:
df_results

Unnamed: 0,Model,fit_time_cv1,fit_time_cv2,fit_time_cv3,fit_time_cv4,fit_time_cv5,fit_time_mean,fit_time_std,score_time_cv1,score_time_cv2,...,prec_overall,recall_overall,f1-score_overall,tp_overall,tn_overall,fp_overall,fn_overall,cohens_kappa_overall,matthews_corrcoef_overall,roc_auc_overall
0,NB_Count_Vectors,0.083657,0.087841,0.084295,0.0876408,0.0806589,0.0848186,0.00268425,0.0557067,0.0472348,...,0.791759,0.714573,0.751189,711,987,187,284,0.559446,0.561763,0.777644
0,NB_WordLevel_TF-IDF,0.0570619,0.0624511,0.0512819,0.0743842,0.034615,0.0559588,0.0131171,0.0463171,0.0433159,...,0.821277,0.58191,0.681176,579,1048,126,416,0.485369,0.504886,0.737292
0,NB_N-Gram_TF-IDF,0.0146081,0.031333,0.0386279,0.0167282,0.0169821,0.0236558,0.00956548,0.0413001,0.0396559,...,0.730563,0.547739,0.626077,545,973,201,450,0.383852,0.394977,0.688265
0,NB_CharLevel_TF-IDF,0.849732,0.797022,1.1039,1.13063,0.286705,0.833599,0.303969,0.125775,0.0687079,...,0.834646,0.106533,0.188948,106,1153,21,889,0.0949604,0.188135,0.544323
0,LR_Count_Vectors,18.4121,19.1657,17.922,19.1001,9.19119,16.7582,3.81129,0.0704119,0.0472348,...,0.738872,0.750754,0.744766,747,910,264,248,0.525238,0.525295,0.762941
0,LR_WordLevel_TF-IDF,0.667456,0.669091,0.61599,0.690176,0.490711,0.626685,0.0722426,0.0417788,0.0403888,...,0.771372,0.779899,0.775612,776,944,230,219,0.583496,0.583527,0.791994
0,LR_N-Gram_TF-IDF,0.495503,0.429021,0.355603,0.409944,0.34902,0.407818,0.0535282,0.0507712,0.0475352,...,0.701226,0.632161,0.664905,629,906,268,366,0.406937,0.408658,0.701941
0,LR_CharLevel_TF-IDF,4.52604,5.30917,5.46984,5.11398,3.58113,4.80003,0.68807,0.07074,0.0679901,...,0.753205,0.708543,0.730192,705,943,231,290,0.514104,0.514884,0.75589
0,SVM_Count_Vectors,142.24,143.032,143.314,144.206,113.892,137.337,11.7393,45.6131,45.6117,...,0.750984,0.766834,0.758826,763,921,253,232,0.550446,0.55055,0.775666
0,SVM_WordLevel_TF-IDF,171.507,164.749,171.223,171.68,118.665,159.565,20.6152,51.6323,49.8943,...,0.760194,0.786935,0.773333,783,927,247,212,0.575001,0.575303,0.788271


<a id='dl'></a>

## Deep Learning Models

In [48]:
pretrained = fasttext.FastText.load_model('/Users/diego/Documents/NLP/crawl-300d-2M-subword/crawl-300d-2M-subword.bin')



In [49]:
%%time 
# create a tokenizer 
token = Tokenizer(oov_token='<OOV>')
token.fit_on_texts(df[TEXT])
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors 
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(X_train), maxlen=300)
test_seq_x = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=300)

# create token-embedding mapping
embedding_matrix = np.zeros((len(word_index) + 1, 300))
words = []

for word, i in tqdm(word_index.items()):
    embedding_vector = pretrained.get_word_vector(word) #embeddings_index.get(word)
    words.append(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

if save_model:
    filename = NAME_TOKEN_EMBEDDINGS
    pickle.dump(token, open(os.path.join(root_dir, dir_name,filename), 'wb'))

100%|██████████| 88180/88180 [00:04<00:00, 18509.68it/s]


CPU times: user 9.54 s, sys: 1.81 s, total: 11.4 s
Wall time: 13.5 s


In [113]:
class_w = {}
for i in zip(range(len(class_weights)), class_weights):
    class_w[i[0]] = i[1]

In [114]:
from tensorflow.keras import backend as K

In [163]:
def cross_validate_NN(model, X, y, X_test, y_test,name="NN", fit_params=None, scoring=None, n_splits=5, save=save_model, batch_size = 32,  use_multiprocessing=True):
    '''
    Function create a metric report automatically with cross_validate function.
    @param model: (model) neural network model
    @param X: (list or matrix or tensor) training X data
    @param y: (list) label data 
    @param X_test: (list or matrix or tensor) testing X data
    @param y_test: (list) label test data 
    @param name: (string) name of the model (default classifier)
    @param fit_aparams: (dict) add parameters for model fitting 
    @param scoring: (dict) dictionary of metrics and names
    @param n_splits: (int) number of fold for cross-validation (default 5)
    @return: (pandas.dataframe) dataframe containing all the results of the metrics 
    for each fold and the mean and std for each of them
    '''
    # ---- Parameters initialisation
    es = tf.keras.callbacks.EarlyStopping(monitor='loss', mode='auto', patience=3)
    seed = 42
    k = 1
    np.random.seed(seed)
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    
    # Creation of list for each metric
    if scoring==None:        # create a dictionary if none is passed
        dic_scoring = {}
    if scoring!=None:        # save the dict 
        dic_score = scoring.copy()
    
    dic_score["fit_time"] = None   # initialisation for time fitting and scoring
    dic_score["score_time"] = None
    scorer = {}
    for i in dic_score.keys(): 
        scorer[i] = []
    
    index = ["Model"]
    results = [name]
    # ---- Loop on k-fold for cross-valisation
    for train, test in kfold.split(X, y):   # training NN on each fold 
        # create model
        print(f"k-fold : {k}")
        fit_start = time.time()
        _model = tf.keras.models.clone_model(model)
        if len(np.unique(y))==2: # binary
            _model.compile(optimizer='adam',
                  loss=tf.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
        else:  # multiclass 
            _model.compile(optimizer='adam',
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
        _model.fit(X[train], y.iloc[train],
                        epochs=1000, callbacks=[es], validation_data=(X[test], y.iloc[test]),
                         verbose=False, batch_size = batch_size,  use_multiprocessing=use_multiprocessing)
        
        fit_end = time.time() - fit_start

        score_start = time.time()
        y_pred = (_model.predict(X[test])>0.5).astype(int)
        score_end = time.time() - score_start
        #if len(set(y))>2:
        #    y_pred =np.argmax(y_pred,axis=1)
        #print(y_test[0], y_pred[0])
        if len(set(y))==2:
            print(f"Precision: {round(100*precision_score(y.iloc[test], y_pred), 3)}% , Recall: {round(100*recall_score(y.iloc[test], y_pred), 3)}%, Time \t {round(fit_end, 4)} ms")
        else: 
            print(f"Precision: {round(100*precision_score(y.iloc[test], np.argmax(y_pred,axis=1), average='weighted'), 3)}% , Recall: \
        {round(100*recall_score(y.iloc[test], np.argmax(y_pred,axis=1), average='weighted'), 3)}%, Time \t {round(fit_end, 4)} ms")
        
        
        # ---- save each metric
        for i in dic_score.keys():    # compute metrics 
            if i == "fit_time":
                scorer[i].append(fit_end)
                index.append(i+'_cv'+str(k))
                results.append(fit_end)
                continue
            if i == "score_time":
                scorer[i].append(score_end)
                index.append(i+'_cv'+str(k))
                results.append(score_end)
                continue
            
            if len(set(y))>2:
                if i in ["prec", "recall", "f1-score"]:
                    scorer[i].append(dic_score[i](y.iloc[test], np.argmax(y_pred,axis=1), average = 'weighted')) # make each function scorer

                elif i=="roc_auc":
                    scorer[i].append(dic_score[i](to_categorical(y.iloc[test]), y_pred, average = 'macro', multi_class="ovo")) # make each function scorer
                else:
                    scorer[i].append(dic_score[i]( y.iloc[test], np.argmax(y_pred,axis=1))) # make each function scorer

            else:
                scorer[i].append(dic_score[i]( y.iloc[test], y_pred)) # make each function scorer
            #scorer[i].append(dic_score[i]( y.iloc[test], y_pred))
            index.append("test_"+i+'_cv'+str(k))
            results.append(scorer[i][-1])
        K.clear_session()
        del _model
        k+=1
    
    # Train test on the overall data
    print("Overall train-test data")
    fit_start = time.time()
    _model =  tf.keras.models.clone_model(model)
    if len(np.unique(y))==2: # binary
        _model.compile(optimizer='adam',
                  loss=tf.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    else:  # multiclass 
        _model.compile(optimizer='adam',
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
        _model.fit(X[train], y.iloc[train],
                        epochs=1000, callbacks=[es], validation_data=(X[test], y.iloc[test]),
                         verbose=False)
    if save:
        check_p = tf.keras.callbacks.ModelCheckpoint(os.path.join(root_dir, dir_name, name+".h5"), save_best_only=True)
        _model.fit(X, y,epochs=1000, callbacks=[es, check_p], validation_split=0.2, batch_size = batch_size, 
                   verbose=False, use_multiprocessing=use_multiprocessing)
        
    else:
        _model.fit(X, y,epochs=1000, callbacks=[es],  validation_split=0.2, batch_size = batch_size, 
                   verbose=False, use_multiprocessing=use_multiprocessing)
        
    fit_end = time.time() - fit_start

    #_acc = _model.evaluate(X_test, y_test, verbose=0)

    score_start = time.time()
    y_pred = (_model.predict(X_test)>0.5).astype(int)
    score_end = time.time() - score_start
    #if len(set(y))>2:
    #    y_pred =np.argmax(y_pred,axis=1)
    if len(set(y))==2:
        print(f"Precision: {round(100*precision_score(y_test, y_pred), 3)}% , Recall: {round(100*recall_score(y_test, y_pred), 3)}%, Time \t {round(fit_end, 4)} ms")
    else: 
        print(f"Precision: {round(100*precision_score(y_test, np.argmax(y_pred,axis=1), average='weighted'), 3)}% , Recall: \
        {round(100*recall_score(y_test, np.argmax(y_pred,axis=1), average='weighted'), 3)}%, Time \t {round(fit_end, 4)} ms")

    # Compute mean and std for each metric
    for i in scorer: 
        
        results.append(np.mean(scorer[i]))
        results.append(np.std(scorer[i]))
        if i == "fit_time":
            index.append(i+"_mean")
            index.append(i+"_std")
            continue
        if i == "score_time":
            index.append(i+"_mean")
            index.append(i+"_std")
            continue
        
        index.append("test_"+i+"_mean")
        index.append("test_"+i+"_std")
        
    # add metrics averall dataset on the dictionary 
    for i in dic_score.keys():    # compute metrics 
        if i == "fit_time":
            scorer[i].append(fit_end)
            index.append(i+'_overall')
            results.append(fit_end)
            continue
        if i == "score_time":
            scorer[i].append(score_end)
            index.append(i+'_overall')
            results.append(score_end)
            continue
        
        if len(set(y))>2:
            if i in ["prec", "recall", "f1-score"]:
                scorer[i].append(dic_score[i](y_test, np.argmax(y_pred,axis=1), average = 'weighted')) # make each function scorer

            elif i=="roc_auc":
                scorer[i].append(dic_score[i](to_categorical(y_test), y_pred, average = 'weighted', multi_class="ovo")) # make each function scorer
            else:
                scorer[i].append(dic_score[i](y_test, np.argmax(y_pred,axis=1))) # make each function scorer

        else:
            scorer[i].append(dic_score[i](y.iloc[test], y_pred))                             
            #scorer[i].append(dic_score[i](_model, X_test, y_test))
        index.append(i+'_overall')
        results.append(scorer[i][-1])
    
            
    return pd.DataFrame(results, index=index).T

In [116]:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

<a id='snn'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Shallow Neural Network</H5>
</div>

In [117]:
def shallow_neural_networks(word_index, label=labels, embedding_matrix=embedding_matrix, pre_trained=False):
    '''
    Function to generate a shallow neural network for binary or multiclass classification.
    @param word_index: (matrix) unique token in corpus
    @param label: (list) list of labels to determine if it,s a binary or multiclass
    @param embedding_matrix: (matrix) matrix of integer for each word in the 
    @param pre_trained: (bool) determine if the model will use pretrained model
    @return: (model) shallow neural network 
    '''
    if pre_trained==False:
        embedded = keras.layers.Embedding(len(word_index) + 1, 16)
    else:
        print("Pre-trained model used")
        embedded = keras.layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)
    
    model = keras.Sequential([
      embedded,
      
      keras.layers.GlobalAveragePooling1D(),
        
      #keras.layers.Dense(6, activation="relu"),
      keras.layers.Dense(1 if len(label)<=2 else len(label), activation='sigmoid' if len(label)<=2 else "softmax")])

    return model #

In [151]:
scoring = score_metrics

# Creation of list for each metric
if scoring==None:        # create a dictionary if none is passed
    dic_scoring = {}
if scoring!=None:        # save the dict 
    dic_score = scoring.copy()

dic_score["fit_time"] = None   # initialisation for time fitting and scoring
dic_score["score_time"] = None
scorer = {}
for i in dic_score.keys(): 
    scorer[i] = []

index = ["Model"]
results = ['Shallow_NN_WE']

In [154]:
for i in dic_score.keys():
    scorer[i].append(dic_score[i](_model, X_test, y_test))

In [162]:
dic_score['acc'](y_test, )

<function sklearn.metrics._classification.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)>

In [149]:
dic_score[i](_model, X_test, y_test)

5430    fanboy ism h k stuff scar h acr exactly aesthe...
5089    curse aye awww good old bargaining alarm clock...
111     usually psychic quality knowing come specific ...
7536    thank advice appreciate just fantastic advice ...
1113    trying understand difference sx sx just differ...
                              ...                        
2926    true comfortable contradiction hi following re...
3821    pretty christian ish t believe jesus christ go...
5805    bat sherlock holmes quite alright agnostic spi...
3140    try lang studying japanese sent gt using tapat...
2770    ugh honestly inclined think behavior abusive t...
Name: text_clean_joined, Length: 2169, dtype: object

In [164]:
%%time
if shallow_network:
    df_results = df_results.append(cross_validate_NN(shallow_neural_networks(word_index, pre_trained=pre_trained), 
                                                     train_seq_x, y_train, test_seq_x, y_test,
                                                     name="Shallow_NN_WE", scoring=score_metrics, 
                                                     n_splits=CV_splits, save=save_model))

k-fold : 1
Precision: 71.454% , Recall: 67.391%, Time 	 237.6945 ms
k-fold : 2
Precision: 72.438% , Recall: 68.677%, Time 	 365.2575 ms
k-fold : 3
Precision: 70.662% , Recall: 66.164%, Time 	 396.282 ms
k-fold : 4
Precision: 73.179% , Recall: 69.012%, Time 	 263.4461 ms
k-fold : 5
Precision: 71.357% , Recall: 71.357%, Time 	 327.027 ms
Overall train-test data
Precision: 70.657% , Recall: 67.035%, Time 	 338.0158 ms


ValueError: Found input variables with inconsistent numbers of samples: [1301, 2169]

In [None]:
if save_results:
    df_results.to_csv(NAME_SAVE_FILE+".csv", sep=";", index=False)

<a id='dnn'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Deep Neural Net</H5>
</div>

In [146]:
def deep_neural_networks(word_index, label=labels, embedding_matrix=embedding_matrix, pre_trained=False):
    '''
    Function to generate a deep neural network for binary or multiclass classification.
    @param word_index: (matrix) unique token in corpus
    @param label: (list) list of labels to determine if it,s a binary or multiclass
    @param embedding_matrix: (matrix) matrix of integer for each word in the 
    @param pre_trained: (bool) determine if the model will use pretrained model
    @return: (model) deep neural network 
    '''
    if pre_trained==False:
        embedded = keras.layers.Embedding(len(word_index) + 1, 50)
    else:
        print("Pre-trained model used")
        embedded = keras.layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)
    
    model = keras.Sequential([
      embedded,
      keras.layers.GlobalAveragePooling1D(),
      keras.layers.Dense(16, activation="relu"),#tf.nn.swish),
      keras.layers.Dense(1 if len(label)<=2 else len(label), activation='sigmoid' if len(label)<=2 else "softmax")])

    #print(model.summary())
    
    return model


In [None]:
%%time
if deep_nn:
    df_results = df_results.append(cross_validate_NN(deep_neural_networks(word_index, pre_trained=pre_trained), 
                                                     train_seq_x, y_train, test_seq_x, y_test,
                                                     name="Deep_NN_WE",scoring=score_metrics, 
                                                     n_splits=CV_splits , save=save_model))

In [None]:
def deep_neural_networks_var1(word_index, label=labels, embedding_matrix=embedding_matrix, pre_trained=False):
    '''
    Function to generate a deep neural network for binary or multiclass classification.
    @param word_index: (matrix) unique token in corpus
    @param label: (list) list of labels to determine if it,s a binary or multiclass
    @param embedding_matrix: (matrix) matrix of integer for each word in the 
    @param pre_trained: (bool) determine if the model will use pretrained model
    @return: (model) deep neural network 
    '''
    if pre_trained==False:
        embedded = keras.layers.Embedding(len(word_index) + 1, 100)
    else:
        embedded = keras.layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)
    
    model = keras.Sequential([
      embedded,
      keras.layers.GlobalAveragePooling1D(),
      keras.layers.Dense(16, activation="relu"),#tf.nn.swish),
      keras.layers.Dense(16, activation="relu"),#tf.nn.swish),
      keras.layers.Dense(1  if len(label)<=2 else len(label), activation='sigmoid' if len(label)<=2 else "softmax")])

    #print(model.summary())
    
    return model

In [None]:
%%time
if deep_nn:
    df_results = df_results.append(cross_validate_NN(deep_neural_networks_var1(word_index, pre_trained=pre_trained), 
                                                     train_seq_x, y_train, test_seq_x, y_test,
                                                     name="Deep_NN_var1_WE", 
                                                     scoring=score_metrics, n_splits=CV_splits, save=save_model))

<a id='trans'></a>

<br>
<div class="span5 alert alert-info">
    <H5>Transformers</H5>
</div>

![model](img/transformers_model_architecture.png)

The Transformer – Model Architecture - [Source](https://arxiv.org/abs/1706.03762)

<a id=vidya></a>