# STEP 4.1b: Modeling AdaBoost

## Description of the methodology
> * Finalize binning of the target variable
* Create train and test datasets
* Create an AdaBoost pipeline
* Define key parameters
* Run the model on the sub-train dataset and test accuracy on the validation dataset
* Select the most accurate model based on the hyper-parameters, run it to get the confusion matrix
* Select the best AdaBoost model candidate and apply it to a larger dataset

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt

import re
from sklearn.preprocessing import Normalizer
import os
from sklearn import preprocessing
from scipy.stats import kurtosis
from scipy.stats import skew
from scipy import stats
from sklearn.preprocessing import power_transform
from sklearn.preprocessing import KBinsDiscretizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.ensemble import AdaBoostClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import ParameterGrid

from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.decomposition import PCA

from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import LinearSVC
from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
import scikitplot as skplt

# t-SNE dimensionality reduction
from sklearn.manifold import TSNE

import random
from sklearn import ensemble

from sklearn.model_selection import StratifiedShuffleSplit

import warnings

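# Suppress library warnings for the rest of the notebook
# (a catch_warnings() block restores the previous filters as soon as it exits)
warnings.simplefilter("ignore")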


# Activate Seaborn style
sns.set()

## Load the file for analysis

In [2]:
# Importing the file and creating a dataframe
master_modeling = pd.read_csv(
    "master_modeling.csv",
    low_memory=False,
    skipinitialspace=True,
)  # , sep='\t'

In [3]:
# display all columns
pd.set_option("display.max_columns", None)

In [4]:
# remove the Unnamed column
master_modeling.drop("Unnamed: 0", axis=1, inplace=True)
master_modeling.shape

(194484, 351)

In [5]:
# Create a dataframe for the modeling phase (without text and irrelevant features)
df_modeling = master_modeling.drop(["Title", "Post_ID", "Snippet"], axis=1)

In [6]:
df_modeling.shape

(194484, 348)

## Definition of the number of classes for the target variable 'ALL_Impact'

> * We split the variable into 3 classes using scikit-learn's preprocessing class KBinsDiscretizer with the following parameters: n_bins=3, encode="ordinal" and strategy="quantile"
* Ordinal encoding was selected because we are modeling a hierarchy between low and high tweet impact
* The quantile strategy aims for a similar number of data points per class, so that the model learns the features of each class equally (avoiding unbalanced classes)
* We may reconsider some of these parameter values depending on the modeling results

In [7]:
ai_bin = master_modeling[["ALL_Impact"]]

In [8]:
# Fit the discretizer and bin the target variable
est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
est.fit(ai_bin)
new_ai = est.transform(ai_bin)

In [9]:
# Show the edges of the 3 bins
est.bin_edges_[0]

array([ 0., 30., 41., 80.])
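
The edges above define three intervals. The following is a minimal sketch of how raw scores map to the ordinal classes, assuming the same convention as `KBinsDiscretizer` (search against the interior edges, with a value equal to an edge going to the upper bin). The sample scores are hypothetical and chosen to sit on and around the edges.

```python
import numpy as np

# Bin edges reported by est.bin_edges_[0] above
edges = np.array([0.0, 30.0, 41.0, 80.0])

# Hypothetical impact scores (not taken from the dataset)
scores = np.array([0, 29, 30, 40, 41, 80])

# Only the interior edges are needed; side="right" sends a value equal to an
# edge into the upper bin, so the bins are [0, 30), [30, 41) and [41, 80]
classes = np.searchsorted(edges[1:-1], scores, side="right")
print(classes)  # [0 0 1 1 2 2]
```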

In [10]:
new_ai_df = pd.DataFrame(new_ai)

In [11]:
new_ai_df.shape

(194484, 1)

In [12]:
df_modeling["All_impact bin"] = new_ai_df

In [13]:
df_modeling["All_impact bin"].value_counts()

2.0    69658
1.0    63049
0.0    61777
Name: All_impact bin, dtype: int64
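
The three classes are close to balanced but not identical in size. A likely reason is that many tweets share the same score: all rows with a given value fall in the same bin, so quantile edges cannot split them evenly. The sketch below illustrates this on synthetic integer scores; the data and the assumption that `ALL_Impact` is similarly discrete are mine, not the notebook's.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Synthetic integer scores in [0, 80] with many repeated values
# (assumption: the real ALL_Impact column is similarly discrete)
scores = rng.integers(0, 81, size=3000).reshape(-1, 1).astype(float)

est_demo = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
labels = est_demo.fit_transform(scores).ravel().astype(int)

# Rows sharing a value land in the same bin, so counts are only roughly equal
counts = np.bincount(labels, minlength=3)
print(est_demo.bin_edges_[0], counts)
```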

In [14]:
# Remove the original ALL_Impact feature along with the hashtag, author and account-name columns
df_modeling2 = df_modeling.drop(
    ["ALL_Impact", "TW_Hashtags", "ALL_Author", "TW_Account_Name"], axis=1
)

In [15]:
# Transform new All Impact feature type into int64
df_modeling2["All_impact bin"] = df_modeling2["All_impact bin"].astype(np.int64)

In [16]:
df_modeling2.shape

(194484, 345)

## Create train, validation and test datasets (from the main train set)
> * I am facing a lack of computing resources (laptop with an Intel i7 chip and 16 GB RAM, no GPU), which means very long training times, especially when tuning hyper-parameters. As a consequence, I combined my own computing resources with Google Colaboratory in order to tune several parameters in parallel.
* **The overall dataset is divided into 3 buckets:**
* Bucket 1 (train/test): split for training the best selected model (if more computing resources become available)
* Bucket 2 (train1/valid1): split for training the best model candidate of a given class (no cross-validation)
* Bucket 3 (train2/valid2): split for hyper-parameter tuning leading to the selection of the best model candidate (cross-validation may be considered in some cases)
* We could limit the risk of overfitting by using a cross-validation approach. However, this would be very demanding in computing resources, as it combines hyper-parameter optimization (GridSearch) with a large dataset (194 484 rows x 344 variables).
* A compromise would be to use the standard train/test split and leverage cross-validation for the validation phase when selecting the best model.
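
The compromise described in the last bullet can be sketched as follows: hold out a test set once, then cross-validate only on the training part. This is an illustration on synthetic data, not the notebook's actual run; the names `X_demo`, `y_demo`, the reduced sizes and the small `n_estimators` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tweet features (3 classes, 20 features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Hold out a test set once; cross-validation only sees the training part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

pipe_demo = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=10)),
        ("ab", AdaBoostClassifier(n_estimators=20, random_state=0)),
    ]
)

# 3 stratified folds: each fold serves once as the validation set
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe_demo, X_tr, y_tr, cv=cv, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

The cost grows with the number of folds, since the pipeline is refitted once per fold, which is the trade-off the bullets above describe.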

### Create X and y arrays

In [17]:
# Create an array from df_modeling2 excluding the target variable All impact bin
X = df_modeling2.drop(["All_impact bin"], axis=1)
X = np.array(X)
X.shape

(194484, 344)

In [18]:
# Create y array for the target variable All impact bin
y = df_modeling2["All_impact bin"]
y = np.array(y)
y.shape

(194484,)

In [19]:
# Convert the type of the input matrix to float
# (the np.float alias was removed in NumPy 1.24; the builtin float is equivalent)
X = X.astype(float)

# Create train set
X_tr_main, X_test, y_tr_main, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create train and validation sets for the best model selected for a given class
X_tr_2nd, X_valid1, y_tr_2nd, y_valid1 = train_test_split(
    X_tr_main, y_tr_main, test_size=4000, train_size=15000, random_state=0
)

# Create train and validation sets for hyper-parameter tuning and selection of the best model candidate
X_tr_3rd, X_valid2, y_tr_3rd, y_valid2 = train_test_split(
    X_tr_2nd, y_tr_2nd, test_size=1500, train_size=5000, random_state=0
)

print("Train:", X_tr_main.shape, y_tr_main.shape)
print("Test:", X_test.shape, y_test.shape)
print("Train1:", X_tr_2nd.shape, y_tr_2nd.shape)
print("Valid1:", X_valid1.shape, y_valid1.shape)
print("Train2:", X_tr_3rd.shape, y_tr_3rd.shape)
print("Valid2:", X_valid2.shape, y_valid2.shape)

Train: (155587, 344) (155587,)
Test: (38897, 344) (38897,)
Train1: (15000, 344) (15000,)
Valid1: (4000, 344) (4000,)
Train2: (5000, 344) (5000,)
Valid2: (1500, 344) (1500,)


In [20]:
# Class proportions in the validation set (pd.value_counts is deprecated in recent pandas)
pd.Series(y_valid2).value_counts(normalize=True)

2    0.364667
0    0.324000
1    0.311333
dtype: float64
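
The proportions above drift slightly from one random split to the next. If identical class shares across buckets were desired, one option (not used in this notebook) would be the `stratify` argument of `train_test_split`. A minimal sketch on synthetic labels, where the class probabilities are an approximation of the shares in the full dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic labels and features; the shares approximate the full dataset
y_demo = rng.choice([0, 1, 2], size=6000, p=[0.32, 0.32, 0.36])
X_demo = rng.normal(size=(6000, 5))

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=1500, train_size=4500, stratify=y_demo, random_state=0
)

# With stratification, the class shares of each part match the overall shares
overall = np.bincount(y_demo) / len(y_demo)
valid = np.bincount(y_b) / len(y_b)
print(overall.round(3), valid.round(3))
```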

### Create an AdaBoost pipeline

In [21]:
# Create pipeline
pipe_ab1 = Pipeline(
    [
        ("scaler", StandardScaler()),  # with standardization StandardScaler()
        ("PCA", PCA(n_components=200)),  # 200 components to explain 95% of the variance
        ("ab", AdaBoostClassifier(random_state=0)),
    ]
)
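
The comment in the pipeline states that 200 components explain 95% of the variance. One way to check such a figure is to fit a full PCA on the standardized data and read the cumulative explained variance. The sketch below does this on synthetic data; the array `X_demo` and its dimensions are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic correlated features: 40 observed columns driven by 10 latent factors
latent = rng.normal(size=(500, 10))
X_demo = latent @ rng.normal(size=(10, 40)) + 0.1 * rng.normal(size=(500, 40))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_scaled)  # keep all components to inspect the spectrum

# Smallest number of components whose cumulative variance reaches 95%
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_95, cum_var[n_95 - 1])
```

scikit-learn can also select this number itself when `n_components` is given as a fraction, for example `PCA(n_components=0.95, svd_solver="full")`.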

In [22]:
# Get parameters
# pipe_ab1.get_params()

### Define the grid of parameters

In [23]:
# Grid of parameters
grid_ab1 = ParameterGrid(
    {
        "ab__learning_rate": [0.2, 0.1, 0.01],  # learning rate
        "ab__n_estimators": [200, 1000],  # number of boosting stages
    }
)

# Print the number of combinations
print("Number of combinations:", len(grid_ab1))

Number of combinations: 6
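
For reference, this short sketch lists the six combinations that `ParameterGrid` generates from the same dictionary. The `ab__` prefix routes each parameter to the pipeline step named `"ab"`.

```python
from sklearn.model_selection import ParameterGrid

grid_demo = ParameterGrid(
    {
        "ab__learning_rate": [0.2, 0.1, 0.01],
        "ab__n_estimators": [200, 1000],
    }
)

# 3 learning rates x 2 estimator counts = 6 combinations
combos = list(grid_demo)
for i, params in enumerate(combos):
    print(i, params)
```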


### Run the model on the sub-train dataset (5 000 tweets) and test accuracy on the validation dataset (1 500 tweets)

In [24]:
#  Save accuracy on train and validation sets
train_scores = []
valid_scores = []

# Enumerate combinations starting from 1
for i, params_dict in enumerate(grid_ab1, 1):
    # Print progress
    print("Combination {}/{}".format(i, len(grid_ab1)))  # Total number of combinations

    # Set parameters
    pipe_ab1.set_params(**params_dict)

    # Fit the AdaBoost pipeline
    pipe_ab1.fit(X_tr_3rd, y_tr_3rd)

    # Save accuracy on train and validation sets
    params_dict["accuracy_train"] = pipe_ab1.score(X_tr_3rd, y_tr_3rd)
    params_dict["accuracy_valid"] = pipe_ab1.score(X_valid2, y_valid2)

    # Save result
    train_scores.append(params_dict)
    valid_scores.append(params_dict)

print("done")

Combination 1/6
Combination 2/6
Combination 3/6
Combination 4/6
Combination 5/6
Combination 6/6
done
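
The manual loop above evaluates every combination on one fixed validation set. A possible alternative, sketched here on synthetic data, is to let `GridSearchCV` do the same bookkeeping with a `PredefinedSplit`. The data, sizes and the reduced grid are assumptions to keep the example fast; this is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=3, random_state=0
)

# -1 marks rows used only for training, 0 marks the single validation fold
test_fold = np.r_[np.full(300, -1), np.zeros(100, dtype=int)]

pipe_demo = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("ab", AdaBoostClassifier(random_state=0)),
    ]
)

search = GridSearchCV(
    pipe_demo,
    param_grid={"ab__learning_rate": [0.2, 0.1], "ab__n_estimators": [10, 20]},
    cv=PredefinedSplit(test_fold),
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# mean_train_score and mean_test_score play the role of accuracy_train / accuracy_valid
print(search.cv_results_["params"])
print(search.cv_results_["mean_train_score"], search.cv_results_["mean_test_score"])
```

In the notebook, the sub-train and validation arrays would first have to be stacked (for example with `np.vstack` and `np.concatenate`) so that `test_fold` can mark which rows belong to which part.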


In [25]:
# Create DataFrame with validation scores
scores_df = pd.DataFrame(valid_scores)
# Print scores
scores_df.sort_values(by="accuracy_valid", ascending=False)

   ab__learning_rate  ab__n_estimators  accuracy_train  accuracy_valid
1                0.2              1000          0.7818        0.740667
3                0.1              1000          0.7606           0.734
0                0.2               200          0.7346        0.731333
2                0.1               200          0.7196        0.719333
5               0.01              1000          0.6956        0.701333
4               0.01               200          0.6364        0.646667


### Conclusions on the AdaBoost classifier
> * The best AdaBoost model reaches an accuracy of about 0.74 on the validation dataset (vs about 0.78 on train), which is slightly better than the best Random Forest model (0.71 on validation vs 0.99 on train)
* In addition, AdaBoost overfits much less than Random Forest. Note that the number of trees differs between the two classifiers (2000 for the Random Forest model and 1000 for AdaBoost), which may explain part of the overfitting.
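
A simple way to quantify overfitting is the gap between train and validation accuracy. The sketch below recomputes it from the scores in the tuning table; the `gap` column is an addition for illustration.

```python
import pandas as pd

# Scores copied from the tuning table above (row order follows the grid index)
scores = pd.DataFrame(
    {
        "ab__learning_rate": [0.2, 0.2, 0.1, 0.1, 0.01, 0.01],
        "ab__n_estimators": [200, 1000, 200, 1000, 200, 1000],
        "accuracy_train": [0.7346, 0.7818, 0.7196, 0.7606, 0.6364, 0.6956],
        "accuracy_valid": [0.731333, 0.740667, 0.719333, 0.734, 0.646667, 0.701333],
    }
)

# A positive gap means the model scores higher on data it has already seen
scores["gap"] = scores["accuracy_train"] - scores["accuracy_valid"]
print(scores.sort_values("accuracy_valid", ascending=False))
```

The largest gap, about 0.04, belongs to the most accurate combination (learning rate 0.2, 1000 estimators), while the slowest learning rate slightly underfits.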

### Classification report of the most accurate model (based on accuracy)

### 1st model (2 hyper-parameters tuned: learning rate, n_estimators)

In [26]:
# Create pipeline
pipe_ab2 = Pipeline(
    [
        ("scaler", StandardScaler()),  # with standardization StandardScaler()
        ("PCA", PCA(n_components=200)),  # 200 components to explain 95% of the variance
        (
            "ab",
            AdaBoostClassifier(learning_rate=0.2, n_estimators=1000, random_state=0),
        ),
    ]
)

In [27]:
# Fit the AdaBoost pipeline
model_ab2 = pipe_ab2.fit(X_tr_3rd, y_tr_3rd)

In [28]:
# Make predictions on the X_valid2 dataset
y_pred_ab2 = pipe_ab2.predict(X_valid2)

In [29]:
# Classification report
target_names = ["class 0", "class 1", "class 2"]
print(classification_report(y_valid2, y_pred_ab2, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.92      0.73      0.82       486
     class 1       0.56      0.84      0.67       467
     class 2       0.89      0.68      0.77       547

    accuracy                           0.75      1500
   macro avg       0.79      0.75      0.75      1500
weighted avg       0.80      0.75      0.75      1500
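
In the report above, class 1 combines high recall (0.84) with low precision (0.56), which suggests that tweets from the other classes are often predicted as class 1. A confusion matrix shows where such errors go. The sketch below uses toy labels only; in the notebook the equivalent call would be `confusion_matrix(y_valid2, y_pred_ab2)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not the notebook's predictions)
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])
print(cm)

# Per-class recall = diagonal / row sums; precision = diagonal / column sums
recall = cm.diagonal() / cm.sum(axis=1)
precision = cm.diagonal() / cm.sum(axis=0)
print(recall, precision)
```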



### 1st model based on a larger set of data (train 15 000, validation 4 000) - (2 hyper-parameters tuned: learning rate, n_estimators)

In [30]:
# Create pipeline
pipe_ab3 = Pipeline(
    [
        ("scaler", StandardScaler()),  # with standardization StandardScaler()
        ("PCA", PCA(n_components=200)),  # 200 components to explain 95% of the variance
        (
            "ab",
            AdaBoostClassifier(learning_rate=0.2, n_estimators=1000, random_state=0),
        ),
    ]
)

In [31]:
# Fit the AdaBoost pipeline
model_ab3 = pipe_ab3.fit(X_tr_2nd, y_tr_2nd)

In [32]:
# Make predictions on the X_valid1 dataset
y_pred_ab3 = pipe_ab3.predict(X_valid1)

In [33]:
# Classification report
target_names = ["class 0", "class 1", "class 2"]
print(classification_report(y_valid1, y_pred_ab3, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.90      0.77      0.83      1240
     class 1       0.62      0.82      0.71      1328
     class 2       0.87      0.72      0.79      1432

    accuracy                           0.77      4000
   macro avg       0.80      0.77      0.78      4000
weighted avg       0.80      0.77      0.78      4000



### Conclusions
> * Training on a larger dataset does not bring a large improvement in performance (accuracy rises from 0.75 to 0.77 and macro-average f1 from 0.75 to 0.78)
* This may be because the model already captured the essence of the data structure with the smaller dataset (5 000 vs 15 000 tweets), and/or because it would need to be trained on the entire dataset (about 194 500 tweets) to give a picture closer to reality
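
Whether more data would still help can be explored with a learning curve, which refits the model on growing subsets and tracks train and validation accuracy. This is a minimal sketch on synthetic data using `sklearn.model_selection.learning_curve`; the dataset, sizes and small `n_estimators` are assumptions chosen for speed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import learning_curve

X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

# Fit on 25%, 50% and 100% of the available training rows, with 3-fold CV
sizes, train_sc, valid_sc = learning_curve(
    AdaBoostClassifier(n_estimators=20, random_state=0),
    X_demo,
    y_demo,
    train_sizes=[0.25, 0.5, 1.0],
    cv=3,
    scoring="accuracy",
)

# One row per training size, one column per fold; average over folds
for n, tr, va in zip(sizes, train_sc.mean(axis=1), valid_sc.mean(axis=1)):
    print(f"{int(n):4d} rows  train={tr:.3f}  valid={va:.3f}")
```

A validation curve that flattens while the train curve stays close to it would support the first explanation above; a curve that is still rising would support the second.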

### Save results for later visualization and overall selection - Best AdaBoost Model with 2 hyper-parameters (learning_rate = 0.2, n_estimators = 1000)

In [None]:
# Scores of the model tuned on 5 000 tweets (c1, c2, c3 correspond to class 0, 1, 2)
AdaBoost_acc = 0.75
c1_ab_f1 = 0.82
c2_ab_f1 = 0.67
c3_ab_f1 = 0.77

%store AdaBoost_acc
%store c1_ab_f1
%store c2_ab_f1
%store c3_ab_f1
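
`%store` is an IPython magic, so the values are only reachable from other IPython or Jupyter sessions. If the results also need to be read from a plain Python script, one option is to write them to a JSON file. This is a sketch only: the file name is hypothetical and a temporary directory is used to avoid side effects.

```python
import json
import tempfile
from pathlib import Path

# Scores reported above for the model tuned on 5 000 tweets
results = {
    "AdaBoost_acc": 0.75,
    "c1_ab_f1": 0.82,
    "c2_ab_f1": 0.67,
    "c3_ab_f1": 0.77,
}

# Hypothetical file name inside a temporary directory
out_path = Path(tempfile.mkdtemp()) / "adaboost_results.json"
out_path.write_text(json.dumps(results, indent=2))

# Another notebook or script can read the values back
restored = json.loads(out_path.read_text())
print(restored["AdaBoost_acc"])
```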