## 1. Introduction

We live in an era of abundant information. Almost any piece of human knowledge can be found online within a few clicks, and most of it is free.
The downside of this connected world is the lack of a filter: anyone with a smartphone can post information online, and the more that information is repeated and liked (or disliked), the more potent, influential, and profitable it becomes, regardless of its veracity.
The term fake news denotes any false or misleading information presented as news [1]. In recent years fake news has interfered with elections and COVID-19 vaccination programs, and has ruined the reputation of many people.
Automatically detecting fake news is not an easy problem. In this paper we briefly review the state of the art and try to add novelty to one particular approach.

### References

[1] https://en.wikipedia.org/wiki/Fake_news

[2] Y. Chen, N. J. Conroy, and V. L. Rubin, “News in an online world: The need for an automatic crap detector,” Proceedings of the Association for Information Science and Technology, vol. 52, no. 1, pp. 1–4, 2015.

[3] S. B. Parikh and P. K. Atrey, “Media-Rich Fake News Detection: A Survey,” IEEE, 2018, pp. 436.

[4] W. Y. Wang, “‘Liar, liar pants on fire’: A new benchmark dataset for fake news detection,” arXiv preprint arXiv:1705.00648, 2017.

[5] Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.

[6] Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

[7] https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset


### 1.1 Domain-specific Area

According to [2], automatically detecting fake news is hard for two reasons:

1. Fake news content can be an image, a video, or a podcast, all of which are very easy to fake but far more complex to analyze and preprocess than plain text.
2. There is no way of knowing where people get their information from. The web is full of platforms that provide news, and governance is basically non-existent.

Nonetheless, the ML community has developed a series of solutions that tackle the problem with promising results (at least in the text-based news domain). In their survey, Shivam B. Parikh and Pradeep K. Atrey [3] divide the approaches into six methodology groups:

1. Linguistic Features based Methods, based on the extraction and classification of linguistic features from fake news, usually using a tf-idf representation of the text.
2. Deception Modeling based Methods, based on extracting the relations between the text units of a story as a hierarchical tree.
3. Clustering based Methods, based on agglomerative clustering algorithms (for example with a k-nearest-neighbour approach) applied to a large number of data sets.
4. Predictive Modeling based Methods, based on logistic regression, whose positive or negative coefficients indicate the deception probability of a given text.
5. Content Cues based Methods, based on the assumption that fake news, unlike real news, is created solely to engage readers, and that certain linguistic patterns are an indicator of this purpose.
6. Non-Text Cues based Methods, which focus on the analysis of two non-text components of a news item: images and user behavior.

In this work the focus is on methodologies of class 1: the text is processed and stored as a tf-idf matrix, and different models are evaluated against a baseline.
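
To make the tf-idf representation concrete, the following is a minimal pure-Python sketch. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation, which is the default behaviour documented for scikit-learn's `TfidfTransformer`. The function name `tfidf` and the toy documents are illustrative only and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, matrix) with one L2-normalised tf-idf row per document.

    `docs` is a list of token lists. Hypothetical helper for illustration.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalise the row; guard against an all-zero row.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix

docs = [
    ["trump", "say", "russia", "probe", "fair"],
    ["congress", "vote", "avert", "shutdown"],
    ["trump", "tweet", "tax", "victori"],
]
vocab, matrix = tfidf(docs)
```

In this toy corpus "trump" occurs in two of the three documents, so within the first document it receives a lower weight than "russia", which occurs only once in the whole corpus.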



### 1.2 Description of the selected dataset

Three datasets are used for this analysis:

The <b>LIAR</b> dataset, presented in [4], is publicly available. It is composed of 12.8K human-labeled short statements from politifact.com, each labeled with one of six truthfulness ratings: pants-fire, false, barely-true, half-true, mostly-true, and true. The dataset is well balanced and, since the analysis is binary (fake news: yes/no), we apply the following mapping to maintain the balance:

 - pants-fire: fake news
 
 - false: fake news
  
 - barely-true: fake news
 
 - half-true: true
 
 - mostly-true: true
 
 - true: true
 

The LIAR dataset is downloaded as three TSV files (train, test, and validation sets). It is composed of the following columns:

1. ID - Text

2. Label - Text

3. Statement - Text

4. Subject - Text

5. Speaker - Text

6. Speaker Job Title - Text

7. State - Text

8. Party affiliation - Text - [democrat, republican]

9. Barely true counts - Integer

10. False counts - Integer

11. Half true counts - Integer

12. Mostly true counts - Integer

13. Pants on fire counts - Integer

14. Venue / location of the statement - Text


The second dataset is the <b>ISOT Fake News Dataset</b>, introduced by Ahmed H, Traore I and Saad S. in [5], [6] and available on Kaggle [7]. It is composed of 21,417 true news articles and 23,481 fake news articles. The truthful articles were obtained by crawling Reuters.com, and the fake ones were collected from different sources, mostly unreliable websites flagged by PolitiFact and Wikipedia.

The ISOT dataset is downloaded as two CSV files, True.csv and Fake.csv. It is composed of the following columns:

1. Title - Text

2. Text - Text

3. Subject - Text

4. Date - Date


Both datasets described above are reduced to the same format for this analysis:

1. Article - Text

2. isFake - Boolean


From the LIAR dataset we sample 3K random rows from the train file, and from the ISOT dataset we sample 1.5K random rows from the true file and 1.5K random rows from the fake file.


The third dataset, used for validation, is the concatenation of the two samples above. It has the same standard format and is composed of 6K rows.
 

### 1.3 Objectives

This project has two main objectives:

1. Explore different ensemble methodologies on top of a set of classifiers, which also serve as the baseline against which the ensembles are evaluated. Study the differences between those methodologies and find out whether one of them is best suited to the task of detecting fake news. The ensemble techniques used are:

<b>Hard Blending Ensemble</b>, a form of Stacked Generalization without k-fold cross-validation. The class predictions of the base models on a hold-out set form the meta-features on which a "blending model" (in our case a logistic regression) is trained; the blender then makes the actual predictions.

<b>Soft Blending Ensemble</b>, as above, except that the class probabilities given by the base models, rather than their class predictions, are used as meta-features to train the blender.

<b>Soft Weighted Voting Ensemble</b>, a form of Voting Ensemble in which the class probabilities predicted by the base models are averaged, each weighted by the accuracy of that base model on a validation set, and the class with the highest weighted average is predicted.


2. Create an ensemble that outperforms every single "weak learner" it is composed of.
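
As a small worked example of the weighted soft vote mentioned in objective 1, the sketch below averages per-model class probabilities with per-model weights and compares the result with a plain majority vote on hard labels. The function name `weighted_soft_vote`, the probabilities, and the weights are made up for illustration; they are not results from this notebook.

```python
def weighted_soft_vote(probas, weights):
    """Weighted average of class probabilities, then argmax.

    probas:  one [P(real), P(fake)] pair per base model (illustrative values).
    weights: one weight per base model, e.g. its validation accuracy.
    """
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[c] for p, w in zip(probas, weights)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: avg[c])
    return avg, label

# Three hypothetical base models scoring a single article.
probas = [[0.40, 0.60], [0.70, 0.30], [0.45, 0.55]]
weights = [0.50, 0.60, 0.55]

avg, label = weighted_soft_vote(probas, weights)

# Plain majority vote on the hard labels, for comparison.
hard_votes = [0 if p[0] >= p[1] else 1 for p in probas]
hard_majority = 1 if sum(hard_votes) > len(hard_votes) / 2 else 0
```

Here two of the three models lean towards "fake", so the majority vote returns 1, while the weighted average of the probabilities favours class 0 because the most confident model also carries the largest weight.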

### 1.4 Evaluation Methodology

## 2. Implementation

### 2.1 Pre-processing

In [40]:
import pandas as pd
import nltk
import string
import random
import numpy as np
from numpy import hstack
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer
from nltk import ngrams
from functools import reduce
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.utils.extmath import softmax


from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn import tree


from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import RocCurveDisplay

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mmenna/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
def text_processing(text, n = 1):
    """
    Takes in a string of text, then performs the following:
    1. Convert text to lower case and remove all punctuation
    2. Remove English stopwords
    3. Apply Snowball stemming
    4. Apply n-gram tokenisation
    5. Return the tokenised text as a list
    """
    
    stemmer = SnowballStemmer("english")
    # a set makes the stopword membership test constant-time
    stop = set(stopwords.words('english'))
    # lower function
    t_1 = lambda x : x.lower()
    # Remove punctuation function
    t_2 = lambda x : x.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    t_3 = lambda x : " ".join([w for w in x.split() if w not in stop])
    # Snowball stemming
    t_4 = lambda x : " ".join([stemmer.stem(w) for w in x.split()])
    # Ngrams with n number of grams
    t_5 = lambda x : [" ".join(ng) for ng in list(ngrams(x.split(), n))]
    
    
    #List of transformation functions
    t = [t_1, t_2, t_3, t_4, t_5]
    
    #Apply transformations
    tokenised = reduce(lambda r, f: f(r), t, text)
    
    return tokenised
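
The n-gram step of `text_processing` can be illustrated in isolation. The sketch below is a plain-Python stand-in for `nltk.ngrams` followed by the `" ".join(...)` used above; the helper name `word_ngrams` is hypothetical and the tokens are an already stemmed toy example.

```python
def word_ngrams(tokens, n=1):
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["congress", "vote", "avert", "shutdown"]
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
```

With `n=1`, as used throughout this notebook, every token is its own feature; larger values of `n` produce overlapping word sequences and a list that is `n - 1` elements shorter.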

In [3]:
n_1 = sum(1 for line in open('True.csv')) - 1
n_2 = sum(1 for line in open('Fake.csv')) - 1
s = 1500
skip_1 = sorted(random.sample(range(1,n_1+1),n_1-s)) 
skip_2 = sorted(random.sample(range(1,n_2+1),n_2-s))


raw_data_true = pd.read_csv('True.csv', skiprows=skip_1)
raw_data_fake = pd.read_csv('Fake.csv', skiprows=skip_2)
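
One caveat of sizing a sample from a physical line count, sketched here with the standard library only: when a quoted CSV field contains a newline, the number of lines in the file is larger than the number of records. Whether `True.csv` and `Fake.csv` contain such fields is not stated here, so this is only an illustration of a possible reason why a sample can deviate slightly from `s` rows; the toy CSV string is invented.

```python
import csv
import io
import random

# Toy CSV in which the second record has a newline inside a quoted field.
raw = (
    'title,text\n'
    '"a","one line"\n'
    '"b","two\nlines"\n'
    '"c","one line"\n'
)

# Counting physical lines, as done with sum(1 for line in open(...)) - 1.
physical_lines = len(raw.splitlines())
line_estimate = physical_lines - 1

# Counting parsed records instead.
records = list(csv.reader(io.StringIO(raw)))
data_rows = len(records) - 1

# Sampling on parsed records always yields exactly the requested size.
rng = random.Random(0)
sample = rng.sample(records[1:], 2)
```

In this toy file the line-based estimate is 4 while only 3 records exist, so any row selection derived from line numbers would be slightly off.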

In [4]:
raw_data_true['isFake'] = 0
raw_data_fake['isFake'] = 1
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
raw_data = pd.concat([raw_data_true, raw_data_fake])


In [5]:
data_t1 = pd.DataFrame()
data_t1['article'] = raw_data['title'] + ' ' + raw_data['text']
data_t1['isFake'] = raw_data['isFake']

In [6]:
bag = data_t1['article'].apply(text_processing, n=1)
bag

0       [trump, say, russia, probe, fair, timelin, unc...
1       [mcconnel, happier, trump, tweet, tax, victori...
2       [republican, aim, ride, economi, elect, victor...
3       [congress, vote, avert, shutdown, send, trump,...
4       [us, hous, approv, 81, billion, disast, aid, w...
                              ...                        
1493    [peac, prize, presid, obama, approv, 200, bill...
1494    [reopen, kurt, cobain, case, poll, 21st, centu...
1495    [plastic, persona, behind, scene, ted, cruz, m...
1496    [activist, this, make, impact, 21st, centuri, ...
1497    [trial, youtub, mainstream, media, use, second...
Name: article, Length: 2998, dtype: object

In [7]:
identity = lambda x : x
corpus = bag.values
print('Count Vectorizing...')
vectorizer = CountVectorizer(tokenizer = identity, preprocessor = identity)
count_vector = vectorizer.fit_transform(corpus).toarray()
print('Transforming to tfidf matrix...')
tfidfTransformer = TfidfTransformer()
text_tfidf = tfidfTransformer.fit_transform(count_vector)


Count Vectorizing...
Transforming to tfidf matrix...


In [8]:
X = pd.DataFrame(text_tfidf.toarray())
y = data_t1['isFake']

In [98]:
n = sum(1 for line in open('liar_dataset/train.tsv')) - 1
s = 3000
skip = sorted(random.sample(range(1,n-1),n-s)) 

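# Column names follow the column order of the LIAR TSV files; only 'Label' and 'Statement' are used below
liar_columns = ['ID', 'Label', 'Statement', 'Subject', 'Speaker', 'Speaker Job', 'State', 'Party Aff',
                'Barely true', 'False', 'Half true', 'Mostly true', 'Pants on fire', 'Context']
l_raw_data = pd.read_csv('liar_dataset/train.tsv', sep='\t', names=liar_columns, skiprows=skip)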

In [99]:
liar_mapper = {
    'false': 1,
    'half-true': 0,
    'mostly-true': 0,
    'true': 0,
    'barely-true': 1,
    'pants-fire': 1
}
reduce_fake = lambda x : liar_mapper[x]
l_data_t1 = pd.DataFrame()
l_data_t1['article'] = l_raw_data['Statement']
l_data_t1['isFake'] = l_raw_data['Label'].apply(reduce_fake)

l_data_t1

| | article | isFake |
|---|---|---|
| 0 | Says the Annies List political group supports ... | 1 |
| 1 | The Chicago Bears have had more starting quart... | 0 |
| 2 | Since 2000, nearly 12 million Americans have s... | 0 |
| 3 | Says Mitt Romney wants to get rid of Planned P... | 1 |
| 4 | We have a federal government that thinks they ... | 0 |
| ... | ... | ... |
| 2991 | As a result of Obamacare, California seniors f... | 1 |
| 2992 | For the first time since the Korean War, total... | 0 |
| 2993 | The proudest accomplishment (of my tenure) was... | 0 |
| 2994 | Mayor Fung wants to punish our childrens educa... | 1 |


In [100]:
bag = l_data_t1['article'].apply(text_processing, n=1)
bag

0       [say, anni, list, polit, group, support, third...
1       [chicago, bear, start, quarterback, last, 10, ...
2       [sinc, 2000, near, 12, million, american, slip...
3       [say, mitt, romney, want, get, rid, plan, pare...
4       [feder, govern, think, author, regul, toilet, ...
                              ...                        
2991    [result, obamacar, california, senior, face, b...
2992    [first, time, sinc, korean, war, total, feder,...
2993    [proudest, accomplish, tenur, leav, state, 12,...
2994    [mayor, fung, want, punish, children, educ, re...
2995    [rule, suprem, court, lobbyist, could, go, leg...
Name: article, Length: 2996, dtype: object

In [101]:
corpus = bag.values
print('Count Vectorizing...')
vectorizer = CountVectorizer(tokenizer = identity, preprocessor = identity)
count_vector = vectorizer.fit_transform(corpus).toarray()
print('Transforming to tfidf matrix...')
tfidfTransformer = TfidfTransformer()
text_tfidf = tfidfTransformer.fit_transform(count_vector)

Count Vectorizing...
Transforming to tfidf matrix...


In [102]:
l_X = pd.DataFrame(text_tfidf.toarray())
l_y = l_data_t1['isFake']

In [103]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
t_data_t1 = pd.concat([data_t1, l_data_t1])
t_data_t1

| | article | isFake |
|---|---|---|
| 0 | Trump says Russia probe will be fair, but time... | 0 |
| 1 | McConnell happier with Trump tweets after tax ... | 0 |
| 2 | As Republicans aim to ride economy to election... | 0 |
| 3 | Congress votes to avert shutdown, sends Trump ... | 0 |
| 4 | U.S. House approves $81 billion for disaster a... | 0 |
| ... | ... | ... |
| 2991 | As a result of Obamacare, California seniors f... | 1 |
| 2992 | For the first time since the Korean War, total... | 0 |
| 2993 | The proudest accomplishment (of my tenure) was... | 0 |
| 2994 | Mayor Fung wants to punish our childrens educa... | 1 |


In [104]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
t_data_t1 = pd.concat([data_t1, l_data_t1])
bag = t_data_t1['article'].apply(text_processing, n=1)
corpus = bag.values
print('Count Vectorizing...')
vectorizer = CountVectorizer(tokenizer = identity, preprocessor = identity)
count_vector = vectorizer.fit_transform(corpus).toarray()
print('Transforming to tfidf matrix...')
tfidfTransformer = TfidfTransformer()
text_tfidf = tfidfTransformer.fit_transform(count_vector)

Count Vectorizing...
Transforming to tfidf matrix...


In [105]:
t_X = pd.DataFrame(text_tfidf.toarray())
t_y = t_data_t1['isFake']

### 2.2 Baseline performance

In [106]:
class RidgeClassifierWithProba(RidgeClassifier):
    def predict_proba(self, X):
        d = self.decision_function(X)
        d_2d = np.c_[-d, d]
        
        return softmax(d_2d)
    
models= [tree.DecisionTreeClassifier(), 
         LogisticRegression(random_state=42), 
         SGDClassifier(max_iter=1000, tol=1e-3, loss='modified_huber'),
         RidgeClassifierWithProba(),
         MultinomialNB()
        ]
model_names = [
    "Decision Tree",
    "Logistic Regression",
    "Stocasthic Gradient Descent",
    "Ridge",
    "Naive Bayes"
]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
l_X_train, l_X_test, l_y_train, l_y_test = train_test_split(l_X, l_y, test_size=0.2)
t_X_train, t_X_test, t_y_train, t_y_test = train_test_split(t_X, t_y, test_size=0.2)
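
`RidgeClassifier` has no `predict_proba`, so `RidgeClassifierWithProba` above derives pseudo-probabilities by applying a softmax to the pair `[-d, d]`, where `d` is the decision function value. The sketch below checks, in plain Python, that the second softmax component equals a logistic sigmoid evaluated at `2 * d`. The helper names `softmax_pair` and `sigmoid` and the sample values of `d` are illustrative.

```python
import math

def softmax_pair(d):
    """Softmax over the two scores [-d, d], mirroring np.c_[-d, d] above."""
    m = max(-d, d)  # subtract the maximum for numerical stability
    e_neg, e_pos = math.exp(-d - m), math.exp(d - m)
    s = e_neg + e_pos
    return e_neg / s, e_pos / s

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

# Compare both formulations on a few decision values.
decision_values = (-1.5, 0.0, 0.3, 2.0)
rows = [(d, softmax_pair(d)[1], sigmoid(2 * d)) for d in decision_values]
```

These values are a monotone rescaling of the decision function rather than calibrated probabilities, which is worth keeping in mind when they are averaged with the outputs of other models in a soft vote.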


In [107]:
scores = []
l_scores = []
t_scores = []
for i,model in enumerate(models):
    print('Fitting model ', model_names[i], 'for Fake News dataset...')
    model.fit(X_train, y_train)
    yhat = model.predict(X_test)
    scores.append([
        model_names[i],
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])
    print('Fitting model ', model_names[i], 'for Liar dataset...')
    model.fit(l_X_train, l_y_train)
    yhat = model.predict(l_X_test)
    l_scores.append([
        model_names[i],
        yhat,
        roc_auc_score(l_y_test, yhat),
        f1_score(l_y_test, yhat),
        precision_score(l_y_test, yhat),
        recall_score(l_y_test, yhat),
        accuracy_score(l_y_test, yhat)
    ])
    print('Fitting model ', model_names[i], 'for Total dataset...')
    model.fit(t_X_train, t_y_train)
    yhat = model.predict(t_X_test)
    t_scores.append([
        model_names[i],
        yhat,
        roc_auc_score(t_y_test, yhat),
        f1_score(t_y_test, yhat),
        precision_score(t_y_test, yhat),
        recall_score(t_y_test, yhat),
        accuracy_score(t_y_test, yhat)
    ])
    

Fitting model  Decision Tree for Fake News dataset...
Fitting model  Decision Tree for Liar dataset...
Fitting model  Decision Tree for Total dataset...
Fitting model  Logistic Regression for Fake News dataset...
Fitting model  Logistic Regression for Liar dataset...
Fitting model  Logistic Regression for Total dataset...
Fitting model  Stocasthic Gradient Descent for Fake News dataset...
Fitting model  Stocasthic Gradient Descent for Liar dataset...
Fitting model  Stocasthic Gradient Descent for Total dataset...
Fitting model  Ridge for Fake News dataset...
Fitting model  Ridge for Liar dataset...
Fitting model  Ridge for Total dataset...
Fitting model  Naive Bayes for Fake News dataset...
Fitting model  Naive Bayes for Liar dataset...
Fitting model  Naive Bayes for Total dataset...


In [108]:
scores_df = pd.DataFrame(scores, columns= ['Model', 'Predictions', 'ROC AUC', 'F1-Score', 'Precision', 'Recall', 'Accuracy'])
l_scores_df = pd.DataFrame(l_scores, columns= ['Model', 'Predictions', 'ROC AUC', 'F1-Score', 'Precision', 'Recall', 'Accuracy'])
t_scores_df = pd.DataFrame(t_scores, columns= ['Model', 'Predictions', 'ROC AUC', 'F1-Score', 'Precision', 'Recall', 'Accuracy'])

baseline_scores = scores
l_baseline_scores = l_scores
t_baseline_scores = t_scores

In [109]:
scores_df

| | Model | Predictions | ROC AUC | F1-Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Logistic Regression | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.976621 | 0.976351 | 0.982993 | 0.969799 | 0.976667 |
| 2 | Stocasthic Gradient Descent | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.981677 | 0.981575 | 0.979933 | 0.983221 | 0.981667 |
| 3 | Ridge | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.988299 | 0.988196 | 0.99322 | 0.983221 | 0.988333 |
| 4 | Naive Bayes | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.954831 | 0.953528 | 0.978799 | 0.92953 | 0.955 |


In [110]:
l_scores_df

| | Model | Predictions | ROC AUC | F1-Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, ... | 0.552486 | 0.537005 | 0.513158 | 0.563177 | 0.551667 |
| 1 | Logistic Regression | [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... | 0.578579 | 0.491736 | 0.574879 | 0.429603 | 0.59 |
| 2 | Stocasthic Gradient Descent | [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, ... | 0.554548 | 0.541096 | 0.514658 | 0.570397 | 0.553333 |
| 3 | Ridge | [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, ... | 0.56463 | 0.515038 | 0.537255 | 0.494585 | 0.57 |
| 4 | Naive Bayes | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0.573711 | 0.402948 | 0.630769 | 0.296029 | 0.595 |


In [111]:
t_scores_df

| | Model | Predictions | ROC AUC | F1-Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | [0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, ... | 0.772792 | 0.758135 | 0.72437 | 0.795203 | 0.770642 |
| 1 | Logistic Regression | [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, ... | 0.784553 | 0.765297 | 0.757685 | 0.773063 | 0.785655 |
| 2 | Stocasthic Gradient Descent | [0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, ... | 0.765551 | 0.753873 | 0.706452 | 0.808118 | 0.761468 |
| 3 | Ridge | [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, ... | 0.778695 | 0.757604 | 0.756906 | 0.758303 | 0.780651 |
| 4 | Naive Bayes | [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, ... | 0.747051 | 0.704569 | 0.783296 | 0.640221 | 0.757298 |
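
The columns of the score tables can be reproduced from the four confusion-matrix counts. The sketch below uses made-up labels, not notebook data, and the helper name `binary_metrics` is hypothetical. It also illustrates that when ROC AUC is computed from hard 0/1 predictions, as in the baseline loop, the curve has a single operating point and the area reduces to the mean of recall and specificity.

```python
def binary_metrics(y_true, y_pred):
    """Compute the table metrics from hard 0/1 labels (1 = fake)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
        # ROC AUC of a single operating point: mean of TPR and TNR.
        "roc_auc_from_labels": (recall + specificity) / 2,
    }

# Toy labels for ten articles.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Passing probabilities or decision scores to `roc_auc_score` instead of hard labels would evaluate the full ranking of each model and could give different AUC values from the ones reported in these tables.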


### 2.3 Classification Approach

In [112]:
# fit the blending ensemble
def fit_ensemble(models, X_train, X_val, y_train, y_val, hard=True):
    # fit all models on the training set and predict on hold out set
    meta_X = list()
    for model in models:
        # fit in training set
        model.fit(X_train, y_train)
        # predict on hold out set
        yhat = model.predict(X_val) if hard else model.predict_proba(X_val)
        # reshape predictions into a matrix with one column
        if hard:
            yhat = yhat.reshape(len(yhat), 1)
        # store predictions as input for blending
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # define blending model
    blender = LogisticRegression()
    # fit on predictions from base models
    blender.fit(meta_X, y_val)
    return blender

# make a prediction with the blending ensemble
def predict_ensemble(models, blender, X_test, hard=True):
    # make predictions with base models
    meta_X = list()
    for model in models:
        # predict with base model
        yhat = model.predict(X_test) if hard else model.predict_proba(X_test)
        # reshape predictions into a matrix with one column
        if hard: 
            yhat = yhat.reshape(len(yhat), 1)
        # store prediction
        meta_X.append(yhat)
    # create 2d array from predictions, each set is an input feature
    meta_X = hstack(meta_X)
    # predict
    return blender.predict(meta_X)

# evaluate each base model
def evaluate_models(models, X_train, X_val, y_train, y_val):
    # fit and evaluate the models
    scores = list()
    for model in models:
        # fit the model
        model.fit(X_train, y_train)
        # evaluate the model
        yhat = model.predict(X_val)
        acc = accuracy_score(y_val, yhat)
        # store the performance
        scores.append(acc)
    # report model performance
    return scores
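
The only difference between the hard and the soft variant of `fit_ensemble` is the shape of the meta-feature matrix given to the blender. The following plain-Python sketch mimics the `hstack` call with nested lists; the helper name `build_meta_features` and the toy outputs of two base models on three samples are invented for illustration.

```python
def build_meta_features(per_model_outputs):
    """Column-stack per-model outputs into one meta-feature row per sample.

    per_model_outputs[m][i] is the output of model m for sample i, given as
    a list: [label] in the hard case, [P(real), P(fake)] in the soft case.
    """
    n_samples = len(per_model_outputs[0])
    return [
        [value for output in per_model_outputs for value in output[i]]
        for i in range(n_samples)
    ]

# Hard blending: one predicted label per model and sample.
hard_outputs = [
    [[0], [1], [1]],   # model A
    [[0], [0], [1]],   # model B
]
# Soft blending: one probability pair per model and sample.
soft_outputs = [
    [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]],   # model A
    [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]],   # model B
]

hard_meta = build_meta_features(hard_outputs)   # 3 rows x 2 columns
soft_meta = build_meta_features(soft_outputs)   # 3 rows x 4 columns
```

With five base models and two classes, as in this notebook, the blender therefore sees 5 columns in the hard case and 10 columns in the soft case.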

In [113]:
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(l_X, l_y, test_size=0.5, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)

blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
yhat = predict_ensemble(models, blender, X_test)

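# copy the baseline rows so that appending ensemble results does not modify the baseline list
l_scores = list(l_baseline_scores)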

l_scores.append([
        'Hard Voting Blender',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])
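
The two nested calls to `train_test_split` determine how many rows each stage sees. The sketch below reproduces that arithmetic for the 2996 sampled LIAR rows, assuming the test share is rounded up and the training part receives the remainder, which is the convention scikit-learn is understood to use for a float `test_size`. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (assumed rounding)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 2996                                        # rows in the LIAR sample
n_train_full, n_test = split_sizes(n_total, 0.5)      # first split
n_train, n_val = split_sizes(n_train_full, 0.33)      # second split
```

Under this assumption the base models are fitted on 1003 rows, the blender on 495 validation rows, and the ensembles are scored on 1498 test rows. The validation accuracies printed further below are consistent with a denominator of 495 (for example 295 / 495 ≈ 0.596).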


In [114]:
blender = fit_ensemble(models, X_train, X_val, y_train, y_val, False)
yhat = predict_ensemble(models, blender, X_test, False)

l_scores.append([
        'Soft Voting Blender',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

In [115]:
accuracies = evaluate_models(models, X_train, X_val, y_train, y_val)
print(accuracies)
ensemble = VotingClassifier(estimators=list(zip(model_names, models)), voting='soft', weights=accuracies)
ensemble.fit(X_train, y_train)
yhat = ensemble.predict(X_test)

l_scores.append([
        'Soft Weighted Ensemble',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

[0.503030303030303, 0.5959595959595959, 0.5555555555555556, 0.5696969696969697, 0.591919191919192]


In [116]:
l_scores_df = pd.DataFrame(l_scores, columns= ['Model', 'Predictions', 'ROC AUC', 'F1-Score', 'Precision', 'Recall', 'Accuracy'])


l_scores_df

| | Model | Predictions | ROC AUC | F1-Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, ... | 0.552486 | 0.537005 | 0.513158 | 0.563177 | 0.551667 |
| 1 | Logistic Regression | [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... | 0.578579 | 0.491736 | 0.574879 | 0.429603 | 0.59 |
| 2 | Stocasthic Gradient Descent | [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, ... | 0.554548 | 0.541096 | 0.514658 | 0.570397 | 0.553333 |
| 3 | Ridge | [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, ... | 0.56463 | 0.515038 | 0.537255 | 0.494585 | 0.57 |
| 4 | Naive Bayes | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0.573711 | 0.402948 | 0.630769 | 0.296029 | 0.595 |
| 5 | Hard Voting Blender | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0.543951 | 0.363996 | 0.547619 | 0.272593 | 0.570761 |
| 6 | Soft Voting Blender | [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, ... | 0.545735 | 0.42669 | 0.523707 | 0.36 | 0.564085 |
| 7 | Soft Weighted Ensemble | [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, ... | 0.547226 | 0.457096 | 0.515829 | 0.41037 | 0.560748 |


In [117]:
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)

blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
yhat = predict_ensemble(models, blender, X_test)

scores.append([
        'Hard Voting Blender',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])


In [118]:
blender = fit_ensemble(models, X_train, X_val, y_train, y_val, False)
yhat = predict_ensemble(models, blender, X_test, False)

scores.append([
        'Soft Voting Blender',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

In [126]:
accuracies = evaluate_models(models, X_train, X_val, y_train, y_val)
ensemble = VotingClassifier(estimators=list(zip(model_names, models)), voting='soft', weights=accuracies)
ensemble.fit(X_train, y_train)
yhat = ensemble.predict(X_test)

scores_bck = scores.copy()
scores.append([
        'Soft Weighted Ensemble',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

In [120]:
scores_df = pd.DataFrame(scores, columns= ['Model', 'Predictions', 'ROC AUC', 'F1-Score', 'Precision', 'Recall', 'Accuracy'])

scores_df

| | Model | Predictions | ROC AUC | F1-Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Logistic Regression | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.976621 | 0.976351 | 0.982993 | 0.969799 | 0.976667 |
| 2 | Stocasthic Gradient Descent | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.981677 | 0.981575 | 0.979933 | 0.983221 | 0.981667 |
| 3 | Ridge | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.988299 | 0.988196 | 0.99322 | 0.983221 | 0.988333 |
| 4 | Naive Bayes | [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, ... | 0.954831 | 0.953528 | 0.978799 | 0.92953 | 0.955 |
| 5 | Hard Voting Blender | [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ... | 0.991623 | 0.99187 | 0.99026 | 0.993485 | 0.991667 |
| 6 | Soft Voting Blender | [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ... | 0.991701 | 0.991843 | 0.993464 | 0.990228 | 0.991667 |
| 7 | Soft Weighted Ensemble | [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ... | 0.986815 | 0.986885 | 0.993399 | 0.980456 | 0.986667 |


In [122]:
# split dataset into train and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(t_X, t_y, test_size=0.2, random_state=1)
# split training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.33, random_state=1)

blender = fit_ensemble(models, X_train, X_val, y_train, y_val)
yhat = predict_ensemble(models, blender, X_test)

t_scores.append([
        'Hard Voting Blender',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

In [124]:
blender = fit_ensemble(models, X_train, X_val, y_train, y_val, False)
yhat = predict_ensemble(models, blender, X_test, False)

t_scores.append([
        'Soft Voting Blender',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

In [123]:
accuracies = evaluate_models(models, X_train, X_val, y_train, y_val)
ensemble = VotingClassifier(estimators=list(zip(model_names, models)), voting='soft', weights=accuracies)
ensemble.fit(X_train, y_train)
yhat = ensemble.predict(X_test)

t_scores.append([
        'Soft Weighted Ensemble',
        yhat,
        roc_auc_score(y_test, yhat),
        f1_score(y_test, yhat),
        precision_score(y_test, yhat),
        recall_score(y_test, yhat),
        accuracy_score(y_test, yhat)
    ])

In [125]:
t_scores_df = pd.DataFrame(t_scores, columns= ['Model', 'Predictions', 'ROC AUC', 'F1-Score', 'Precision', 'Recall', 'Accuracy'])

t_scores_df

| | Model | Predictions | ROC AUC | F1-Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | [0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, ... | 0.772792 | 0.758135 | 0.72437 | 0.795203 | 0.770642 |
| 1 | Logistic Regression | [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, ... | 0.784553 | 0.765297 | 0.757685 | 0.773063 | 0.785655 |
| 2 | Stocasthic Gradient Descent | [0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, ... | 0.765551 | 0.753873 | 0.706452 | 0.808118 | 0.761468 |
| 3 | Ridge | [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, ... | 0.778695 | 0.757604 | 0.756906 | 0.758303 | 0.780651 |
| 4 | Naive Bayes | [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, ... | 0.747051 | 0.704569 | 0.783296 | 0.640221 | 0.757298 |
| 5 | Hard Voting Blender | [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, ... | 0.778331 | 0.754468 | 0.803607 | 0.710993 | 0.782319 |
| 6 | Soft Weighted Ensemble | [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, ... | 0.769378 | 0.748387 | 0.779271 | 0.719858 | 0.77231 |
| 7 | Soft Voting Blender | [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, ... | 0.771944 | 0.754306 | 0.7718 | 0.737589 | 0.773978 |


## 3. Conclusion

### 3.1 Evaluation

### 3.2 Summary and conclusions