365 Classification Modeling

**Goal:  Develop a machine learning model to predict whether a Free Plan user would convert to a paid subscriber or not.**

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocessing-the-data" data-toc-modified-id="Preprocessing-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing the data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Split-your-data-into-training-and-testing-sets." data-toc-modified-id="Split-your-data-into-training-and-testing-sets.-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Split your data into training and testing sets.</a></span></li><li><span><a href="#SMOTE-(Synthetic-Minority-Oversampling)" data-toc-modified-id="SMOTE-(Synthetic-Minority-Oversampling)-1.0.2"><span class="toc-item-num">1.0.2&nbsp;&nbsp;</span>SMOTE (Synthetic Minority Oversampling)</a></span></li><li><span><a href="#Undersampling-Oversampled-Data" data-toc-modified-id="Undersampling-Oversampled-Data-1.0.3"><span class="toc-item-num">1.0.3&nbsp;&nbsp;</span>Undersampling Oversampled Data</a></span></li><li><span><a href="#Oversampling-->-Undersampling-Pipeline" data-toc-modified-id="Oversampling-->-Undersampling-Pipeline-1.0.4"><span class="toc-item-num">1.0.4&nbsp;&nbsp;</span>Oversampling -&gt; Undersampling Pipeline</a></span></li></ul></li><li><span><a href="#Beginning-model-testing" data-toc-modified-id="Beginning-model-testing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Beginning model testing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Print-Functions" data-toc-modified-id="Print-Functions-1.1.0.1"><span class="toc-item-num">1.1.0.1&nbsp;&nbsp;</span>Print Functions</a></span></li></ul></li><li><span><a href="#Reducing-dimensionality-for-certain-models" data-toc-modified-id="Reducing-dimensionality-for-certain-models-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Reducing dimensionality for certain models</a></span></li><li><span><a href="#Model-0---SGDClassifier" data-toc-modified-id="Model-0---SGDClassifier-1.1.2"><span 
class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Model 0 - SGDClassifier</a></span></li><li><span><a href="#Model-1---Ridge" data-toc-modified-id="Model-1---Ridge-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Model 1 - Ridge</a></span></li><li><span><a href="#Model-2---Logistic" data-toc-modified-id="Model-2---Logistic-1.1.4"><span class="toc-item-num">1.1.4&nbsp;&nbsp;</span>Model 2 - Logistic</a></span></li><li><span><a href="#Model-3---Decision-Tree" data-toc-modified-id="Model-3---Decision-Tree-1.1.5"><span class="toc-item-num">1.1.5&nbsp;&nbsp;</span>Model 3 - Decision Tree</a></span></li><li><span><a href="#Model-4---KNN" data-toc-modified-id="Model-4---KNN-1.1.6"><span class="toc-item-num">1.1.6&nbsp;&nbsp;</span>Model 4 - KNN</a></span></li><li><span><a href="#Model-5---Gradient-Boost" data-toc-modified-id="Model-5---Gradient-Boost-1.1.7"><span class="toc-item-num">1.1.7&nbsp;&nbsp;</span>Model 5 - Gradient Boost</a></span></li><li><span><a href="#Model-6---XGBoost" data-toc-modified-id="Model-6---XGBoost-1.1.8"><span class="toc-item-num">1.1.8&nbsp;&nbsp;</span>Model 6 - XGBoost</a></span></li><li><span><a href="#Voting-Ensemble" data-toc-modified-id="Voting-Ensemble-1.1.9"><span class="toc-item-num">1.1.9&nbsp;&nbsp;</span>Voting Ensemble</a></span></li><li><span><a href="#Stack-Ensemble" data-toc-modified-id="Stack-Ensemble-1.1.10"><span class="toc-item-num">1.1.10&nbsp;&nbsp;</span>Stack Ensemble</a></span></li></ul></li><li><span><a href="#Observations" data-toc-modified-id="Observations-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Observations</a></span></li></ul></li><li><span><a href="#Exporting-the-best-performing-model" data-toc-modified-id="Exporting-the-best-performing-model-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exporting the best performing model</a></span></li></ul></div>

# Preprocessing the data

In [1]:
import pandas as pd
import imblearn
import numpy as np
import sklearn

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    f1_score,
    precision_score,
    recall_score,
    average_precision_score,
)

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    VotingClassifier,
    StackingClassifier,
    GradientBoostingClassifier,
)
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier

from imblearn.under_sampling import (
    CondensedNearestNeighbour,
    RandomUnderSampler,
    TomekLinks,
    EditedNearestNeighbours,
)
from feature_engine.encoding import PRatioEncoder, RareLabelEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate

from imblearn.pipeline import Pipeline

import warnings

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("student.csv")
df

Unnamed: 0,student_id,registered_month,registered_day,country,asked_question,engaged_lessons,engaged_quiz,engaged_exams,total_minutes_watched,average_rating_given,rating_count,quizzes_taken,quiz_questions_answered,correct_quiz_answers,subscribed
0,283159,7,30,IT,0,0,0,0,0.0,4.75,0,0,0,0,0
1,285343,8,14,US,0,0,0,0,0.0,4.75,0,0,0,0,0
2,288729,8,30,TR,0,0,0,0,0.0,4.75,0,0,0,0,0
3,288975,9,1,IN,0,0,0,0,0.0,4.75,0,0,0,0,0
4,258899,1,1,SA,0,0,0,0,0.0,4.75,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35225,278187,6,17,BD,0,0,0,0,0.0,4.75,0,0,0,0,0
35226,278304,6,18,IN,0,1,1,0,91.2,5.00,1,6,9,5,0
35227,279125,6,24,GB,0,0,0,0,0.0,4.75,0,4,7,7,0
35228,280000,7,1,US,0,1,1,0,91.6,4.75,0,5,8,8,1


In [3]:
df.isnull().any()

student_id                 False
registered_month           False
registered_day             False
country                     True
asked_question             False
engaged_lessons            False
engaged_quiz               False
engaged_exams              False
total_minutes_watched      False
average_rating_given       False
rating_count               False
quizzes_taken              False
quiz_questions_answered    False
correct_quiz_answers       False
subscribed                 False
dtype: bool

pandas treats the string "NA" as a missing value by default when reading a .csv file, so these entries were most likely parsed as NaN.
Here "NA" is the ISO country code of Namibia, a country in southern Africa.

In [4]:
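# Assign the result back instead of calling inplace=True on an attribute access,
# which newer pandas versions warn about and may not apply to the original frame
df["country"] = df["country"].fillna("NA")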

In [5]:
df[df.country.isnull()]

Unnamed: 0,student_id,registered_month,registered_day,country,asked_question,engaged_lessons,engaged_quiz,engaged_exams,total_minutes_watched,average_rating_given,rating_count,quizzes_taken,quiz_questions_answered,correct_quiz_answers,subscribed


All fixed!
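The misread described above can also be avoided when the file is loaded. The following is a minimal sketch, assuming empty fields are the only genuine missing-value marker in the file; the inline CSV is made-up illustration data, not the contents of student.csv.

```python
import io
import pandas as pd

# Toy CSV (hypothetical rows) with Namibia's code "NA" and one truly empty field
csv_text = "student_id,country\n1,NA\n2,US\n3,\n"

# Default parsing: both "NA" and the empty field become NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# Only empty fields are treated as missing, so "NA" survives as a string
strict_df = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=[""]
)

print(default_df["country"].isnull().sum())
print(strict_df["country"].tolist())
```

With this approach no `fillna` step is needed, and rows with a genuinely missing country stay distinguishable from Namibian users.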

In [6]:
df.subscribed.value_counts()

0    33095
1     2135
Name: subscribed, dtype: int64

As expected, the target values are heavily imbalanced: only 2,135 of the 35,230 users (about 6%) subscribed.
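To make the imbalance concrete, the short calculation below uses the two counts from the `value_counts()` output. It also shows the accuracy a model would get by always predicting "not subscribed", which is why accuracy alone is a weak yardstick here.

```python
# Class counts taken from the value_counts() output above
n_free, n_paid = 33095, 2135
total = n_free + n_paid

minority_share = n_paid / total      # fraction of users who subscribed
imbalance_ratio = n_free / n_paid    # free users per subscriber
baseline_accuracy = n_free / total   # accuracy of always predicting 0

print(f"minority share:    {minority_share:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} : 1")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```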

In [7]:
X = df.drop(["student_id", "subscribed"], axis=1)
X

Unnamed: 0,registered_month,registered_day,country,asked_question,engaged_lessons,engaged_quiz,engaged_exams,total_minutes_watched,average_rating_given,rating_count,quizzes_taken,quiz_questions_answered,correct_quiz_answers
0,7,30,IT,0,0,0,0,0.0,4.75,0,0,0,0
1,8,14,US,0,0,0,0,0.0,4.75,0,0,0,0
2,8,30,TR,0,0,0,0,0.0,4.75,0,0,0,0
3,9,1,IN,0,0,0,0,0.0,4.75,0,0,0,0
4,1,1,SA,0,0,0,0,0.0,4.75,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
35225,6,17,BD,0,0,0,0,0.0,4.75,0,0,0,0
35226,6,18,IN,0,1,1,0,91.2,5.00,1,6,9,5
35227,6,24,GB,0,0,0,0,0.0,4.75,0,4,7,7
35228,7,1,US,0,1,1,0,91.6,4.75,0,5,8,8


In [8]:
y = df["subscribed"].values
y

array([0, 0, 0, ..., 0, 1, 0])

### Split your data into training and testing sets.

In [9]:
# https://datascience.stackexchange.com/a/53161
from sklearn.model_selection import train_test_split

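train_ratio = 0.80
test_ratio = 0.20

# No random_state is set, so the split (and every score below) changes between runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_ratio, test_size=test_ratio, shuffle=True
)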

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(28184, 13)
(28184,)
(7046, 13)
(7046,)


In [10]:
# Rare label encoding groups every category seen in fewer than 3% of rows (tol=0.03)
# under the single label "RARE"
rle = RareLabelEncoder(tol=0.03, replace_with="RARE")

In [11]:
X_train_rle = rle.fit_transform(X_train)
X_test_rle = rle.transform(X_test)

# set up a probability ratio encoder: each category becomes P(subscribed=1) / P(subscribed=0)
pre = PRatioEncoder(encoding_method="ratio")
pre.fit(X_train_rle, y_train)

# transform
X_train_pre = pre.transform(X_train_rle)
X_test_pre = pre.transform(X_test_rle)

In [12]:
pre.encoder_dict_

{'country': {'CA': 0.12817412333736397,
  'EG': 0.00929839391377853,
  'GB': 0.08262548262548262,
  'IN': 0.022467771639042358,
  'NG': 0.028252788104089217,
  'RARE': 0.05932910547396529,
  'US': 0.18975155279503103}}
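The values above are consistent with a probability ratio per category, that is, the share of subscribers divided by the share of non-subscribers among users from that country. A minimal sketch of that calculation on toy data follows; `probability_ratio` is a hypothetical helper written for illustration, not part of feature_engine.

```python
from collections import defaultdict

def probability_ratio(categories, targets):
    """Map each category to P(target=1) / P(target=0) within that category."""
    pos = defaultdict(int)
    neg = defaultdict(int)
    for cat, t in zip(categories, targets):
        if t == 1:
            pos[cat] += 1
        else:
            neg[cat] += 1
    # P(1)/P(0) simplifies to count(1)/count(0); categories without any
    # zeros are skipped because the ratio is undefined for them
    return {cat: pos[cat] / neg[cat] for cat in neg}

# Hypothetical toy data: 1 of 5 US users subscribed, 0 of 5 IN users did
countries = ["US"] * 5 + ["IN"] * 5
subscribed = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

ratios = probability_ratio(countries, subscribed)
print(ratios)
```

Because the ratio is computed from the target, it has to be fitted on the training split only, as the cell above does.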

### SMOTE (Synthetic Minority Oversampling) 
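SMOTE creates new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours, as opposed to the direct duplication of data points used in Random Oversampling.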
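The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration on made-up points, not imblearn's implementation; `smote_like_sample` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(46)

# Toy minority-class points in two dimensions (hypothetical data)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    # Euclidean distances from the chosen point to every minority point
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself, so skip it
    neighbours = np.argsort(dists)[1 : k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.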

In [13]:
from imblearn.over_sampling import SMOTE

over = SMOTE(random_state=46, sampling_strategy="minority")

# X_train_smote, y_train_smote = over.fit_resample(X_train_pre,y_train)

### Undersampling Oversampled Data

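Applying an under-sampling technique right after over-sampling often improves the model's performance, because it removes noisy or overlapping points that over-sampling can introduce.

EditedNearestNeighbours

Under-samples using the edited nearest neighbours method.

This method cleans the dataset by removing samples whose nearest neighbours mostly belong to a different class, which are typically the samples close to the decision boundary.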
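A simplified version of the editing rule is sketched below: a sample of the class being cleaned is dropped when most of its k nearest neighbours carry a different label. imblearn's EditedNearestNeighbours offers more options (which classes are cleaned, and whether all or only most neighbours must agree), so treat `edited_nn_mask` as a hypothetical illustration only.

```python
import numpy as np

def edited_nn_mask(X, y, k, clean_class):
    """Return a boolean mask; False marks samples of `clean_class` to drop."""
    keep = np.ones(len(X), dtype=bool)
    for i in np.flatnonzero(y == clean_class):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf  # exclude the sample itself
        neighbours = np.argsort(dists)[:k]
        # Drop the sample if most of its neighbours belong to another class
        if np.mean(y[neighbours] == clean_class) < 0.5:
            keep[i] = False
    return keep

# Toy 1-D data (hypothetical): the last majority point (label 0)
# sits inside the minority cluster around 5.0
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.15]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])

mask = edited_nn_mask(X, y, k=3, clean_class=0)
print(X[mask].ravel(), y[mask])
```

Only the stray majority point at 5.15 is removed; the well-separated majority cluster is left untouched.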

In [14]:
# cnn = CondensedNearestNeighbour(random_state=46, sampling_strategy='auto', n_jobs=-1)
under = EditedNearestNeighbours(sampling_strategy="auto", n_neighbors=4)

# X_resampled, y_resampled = cnn.fit_resample(X_train_smote,y_train_smote)

The goal here is to choose an approach that handles the imbalance of the data. 
The over-sampling method comes from the imblearn library. 
Its parameters are worth experimenting with, most importantly sampling_strategy, 
which controls how many data points each class has after resampling.
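As a rough guide to what sampling_strategy does for an over-sampler on a binary problem: the string "minority" raises the minority class to the size of the majority class, while a float is read as the desired minority-to-majority ratio after resampling. The helper below only does that arithmetic; `oversampled_counts` and the class counts are hypothetical, and the behaviour is my reading of the imblearn documentation rather than a call into the library.

```python
def oversampled_counts(n_majority, n_minority, sampling_strategy):
    """Class counts after over-sampling a binary problem (hypothetical helper)."""
    if sampling_strategy == "minority":
        # The minority class is raised to the size of the majority class
        target = n_majority
    else:
        # A float is the desired minority/majority ratio after resampling
        target = int(sampling_strategy * n_majority)
    # Over-sampling only adds rows, so a target below the current
    # minority count is not a valid request
    if target < n_minority:
        raise ValueError("ratio would require removing minority samples")
    return {"majority": n_majority, "minority": target}

# Hypothetical class counts, not the real training split
for strategy in ("minority", 0.5, 0.25):
    print(strategy, oversampled_counts(1000, 60, strategy))
```

Lower ratios generate fewer synthetic points, which trades some recall for fewer artificial samples near the class boundary.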

### Oversampling -> Undersampling Pipeline

In [15]:
osus_pipeline = Pipeline(steps=[("o", over), ("u", under)])
X_resample, y_resample = osus_pipeline.fit_resample(X_train_pre, y_train)

## Beginning model testing

#### Print Functions

In [42]:
def results(y_test, y_pred):
    # y_pred holds hard 0/1 labels, so the "PR-AUC" and "ROC-AUC" lines below are
    # single-threshold values rather than areas under the full curves
    print(confusion_matrix(y_test, y_pred))
    print("----------")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"PR-AUC: {average_precision_score(y_test, y_pred)}")
    print("\n")
    print(f"F1-Score: {f1_score(y_test, y_pred)}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_pred)}")
    
# Reference: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
def results_proba(y_test, y_pred):
    # apply threshold to positive probabilities to create labels   
    def to_labels(pos_probs, threshold):
        return (pos_probs >= threshold).astype('int')
    
    # keep probabilities for the positive outcome only
    probs = y_pred[:, 1]

    thresholds = np.arange(0, 1, 0.001)
    # evaluate each threshold

    f1_scores = [f1_score(y_test, to_labels(probs, t)) for t in thresholds]
    roc_scores = [roc_auc_score(y_test, to_labels(probs, t)) for t in thresholds]
    acc_scores = [accuracy_score(y_test, to_labels(probs, t)) for t in thresholds]

    # get best threshold
    # (selected on the same data it is scored on, so these peaks are optimistic)
    best_f1 = np.argmax(f1_scores)
    best_roc = np.argmax(roc_scores)
    best_acc = np.argmax(acc_scores)

    print('Threshold=%.3f, F-Score=%.5f' % (thresholds[best_f1], f1_scores[best_f1]))
    print('Threshold=%.3f, ROC_AUC=%.5f' % (thresholds[best_roc], roc_scores[best_roc]))
    print('Threshold=%.3f, Accuracy=%.5f' % (thresholds[best_acc], acc_scores[best_acc]))

### Reducing dimensionality for certain models

In [17]:
less_significant_columns = ["quizzes_taken", "average_rating_given", "rating_count"]
# Bottom three columns in terms of significance
# Significance levels determined through coef attribute from logistic model. 
# Bottom three values were selected to be dropped

X_train_resampled_lower_dim = X_resample.drop(less_significant_columns, axis=1)
X_test_resampled_lower_dim = X_test_pre.drop(less_significant_columns, axis=1)

### Model 0 - SGDClassifier

In [18]:
parameters = {
    "penalty": ["l1", "l2"],
    "learning_rate": ["optimal", "adaptive"],
    "alpha": [0.0001, 0.001, 0.01],
}
sgd_gs = GridSearchCV(
    SGDClassifier(random_state=46, max_iter=30000),
    parameters,
    scoring="roc_auc",
    cv=5,
    verbose=1,
    n_jobs=-1,
)

sgd_gs.fit(X_train_resampled_lower_dim, y_resample)

print(sgd_gs.best_estimator_, sgd_gs.best_score_)
sgd_best = sgd_gs.best_estimator_

y_pred = sgd_best.predict(X_test_resampled_lower_dim)
results(y_test, y_pred)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
SGDClassifier(alpha=0.01, max_iter=30000, random_state=46) 0.9639801228989887
[[6232  396]
 [  68  350]]
----------
Precision: 0.4691689008042895
Recall: 0.8373205741626795
PR-AUC: 0.4024956391401476


F1-Score: 0.6013745704467355
Accuracy: 0.9341470337780301
ROC-AUC: 0.8887870221447074
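
The scores printed above can be reproduced by hand from the confusion matrix. The sketch below does so for Model 0 and also shows what the "ROC-AUC" and "PR-AUC" lines reduce to when the metric functions receive hard 0/1 labels instead of probabilities.

```python
# Confusion matrix reported for Model 0, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 6232, 396, 68, 350
n = tn + fp + fn + tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n

# With hard labels the ROC curve has a single interior point, so its
# area is the mean of recall and specificity (balanced accuracy)
roc_auc_hard = (recall + specificity) / 2

# Average precision likewise collapses to two steps of the PR curve
prevalence = (tp + fn) / n
pr_auc_hard = recall * precision + (1 - recall) * prevalence

print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"ROC-AUC (hard labels): {roc_auc_hard:.4f}")
print(f"PR-AUC (hard labels):  {pr_auc_hard:.4f}")
```

Passing `predict_proba` or `decision_function` scores to `roc_auc_score` and `average_precision_score` would give the area under the full curves instead.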


### Model 1 - Ridge

In [19]:
parameters = {
    "alpha": np.arange(0.1, 1.01, 0.1),
    "solver": ["svd", "cholesky", "lsqr", "sparse_cg"],
}

rc = RidgeClassifier()

rc_gs = GridSearchCV(rc, parameters, scoring="roc_auc", cv=5, verbose=1, n_jobs=-1)
rc_gs.fit(X_train_resampled_lower_dim, y_resample)

print(rc_gs.best_estimator_, rc_gs.best_score_)
rc_best = rc_gs.best_estimator_



Fitting 5 folds for each of 40 candidates, totalling 200 fits
RidgeClassifier(solver='svd') 0.966305147024689


In [34]:
y_pred = rc_best.predict(X_test_resampled_lower_dim)
results(y_test,y_pred)

[[6000  628]
 [  53  365]]
----------
Precision: 0.3675730110775428
Recall: 0.8732057416267942
PR-AUC: 0.3284888620368656


F1-Score: 0.517363571934798
Accuracy: 0.9033494181095657
ROC-AUC: 0.8892280971260103


### Model 2 - Logistic

In [20]:
# Reference: https://stackoverflow.com/a/66981081
parameters = {
    "penalty": ("l1", "l2"),
    "C": np.arange(0.5, 5, 0.5),
}

lo_gs = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    parameters,
    scoring="roc_auc",
    cv=5,
    verbose=2,
    n_jobs=-1,
)
lo_gs.fit(X_train_resampled_lower_dim, y_resample)

print(lo_gs.best_estimator_, lo_gs.best_score_)
lo_best = lo_gs.best_estimator_

Fitting 5 folds for each of 18 candidates, totalling 90 fits
LogisticRegression(penalty='l1', solver='liblinear') 0.9767700087525245


In [21]:
y_pred = lo_best.predict(X_test_resampled_lower_dim)
results(y_test, y_pred)

# Calculate feature importances
feature_importances = lo_best.coef_[0]

# Save the results inside a DataFrame using feature_list as an index
relative_importances = pd.DataFrame(
    index=lo_gs.feature_names_in_, data=feature_importances, columns=["importance"]
)

[[6311  317]
 [  71  347]]
----------
Precision: 0.5225903614457831
Recall: 0.8301435406698564
PR-AUC: 0.4439016521984731


F1-Score: 0.6414048059149722
Accuracy: 0.944933295486801
ROC-AUC: 0.8911580708780785


In [22]:
# Sort values to learn most important features
relative_importances.sort_values(by="importance", ascending=False)

Unnamed: 0,importance
country,16.069168
engaged_lessons,2.76613
correct_quiz_answers,0.092555
registered_day,0.038064
quiz_questions_answered,0.024679
total_minutes_watched,0.005964
engaged_quiz,0.0
registered_month,-0.028325
engaged_exams,-0.149296
asked_question,-2.599894
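
Sorting by the signed coefficient puts strongly negative features at the bottom even though they influence the prediction as much as strongly positive ones. A small sketch that ranks the same coefficients by absolute value follows. The magnitudes also depend on each feature's scale, so this is only a rough ranking unless the features are standardized first.

```python
# Coefficients copied from the table above
coefs = {
    "country": 16.069168,
    "engaged_lessons": 2.76613,
    "correct_quiz_answers": 0.092555,
    "registered_day": 0.038064,
    "quiz_questions_answered": 0.024679,
    "total_minutes_watched": 0.005964,
    "engaged_quiz": 0.0,
    "registered_month": -0.028325,
    "engaged_exams": -0.149296,
    "asked_question": -2.599894,
}

# Rank by magnitude so large negative effects are not mistaken for weak ones
ranked = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<26}{value:>10.6f}")
```

Ranked this way, asked_question moves from last place to third.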


### Model 3 - Decision Tree

In [23]:
parameters = {
    "criterion": ["gini", "entropy"],
    "splitter": ["random", "best"],
    "max_features": ["sqrt", "log2"],
    "max_depth": [4, 5, 6, 7, 8],
}

dt_gs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=46),
    parameters,
    scoring="roc_auc",
    verbose=1,
    cv=5,
    n_iter=50,
    n_jobs=-1,
)
dt_gs.fit(X_train_resampled_lower_dim, y_resample)

print(dt_gs.best_estimator_, dt_gs.best_score_)
dt_best = dt_gs.best_estimator_

y_pred = dt_best.predict(X_test_resampled_lower_dim)
results(y_test, y_pred)

# # Calculate feature importances
# feature_importances = dt_gs.best_estimator_.feature_importances_

# # Save the results inside a DataFrame using feature_list as an index
# relative_importances = pd.DataFrame(index=dt_gs.feature_names_in_,data=feature_importances, columns=["importance"])

# # Sort values to learn most important features
# relative_importances.sort_values(by="importance", ascending=False)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
DecisionTreeClassifier(criterion='entropy', max_depth=8, max_features='sqrt',
                       random_state=46) 0.9892084593326402
[[6338  290]
 [  80  338]]
----------
Precision: 0.5382165605095541
Recall: 0.8086124401913876
PR-AUC: 0.44656256603848926


F1-Score: 0.6462715105162524
Accuracy: 0.9474879364178257
ROC-AUC: 0.8824293341572508


### Model 4 - KNN

In [24]:
parameters = {
    "weights": ("uniform", "distance"),
    "n_neighbors": (50, 100, 150, 200),
    "algorithm": ("auto", "ball_tree", "kd_tree", "brute"),
    "leaf_size": (15, 30, 45, 60),
    "p": (1, 2),
}

knn_gs = RandomizedSearchCV(
    KNeighborsClassifier(),
    parameters,
    scoring="roc_auc",
    cv=5,
    verbose=1,
    n_jobs=-1,
    n_iter=30,
)
knn_gs.fit(X_train_resampled_lower_dim, y_resample)

print(knn_gs.best_estimator_, knn_gs.best_score_)
knn_best = knn_gs.best_estimator_

y_pred = knn_best.predict(X_test_resampled_lower_dim)
results(y_test,y_pred)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
KNeighborsClassifier(algorithm='ball_tree', leaf_size=15, n_neighbors=50, p=1,
                     weights='distance') 0.9930210694557399
[[6267  361]
 [  71  347]]
----------
Precision: 0.4901129943502825
Recall: 0.8301435406698564
PR-AUC: 0.41694077568617954


F1-Score: 0.6163410301953818
Accuracy: 0.9386886176554073
ROC-AUC: 0.8878388192184528


### Model 5 - Gradient Boost

In [25]:
parameters = {
    "learning_rate": [0.25, 0.3],
    "n_estimators": [50, 100, 200, 300],
    "max_features": ["sqrt", "log2"],
    "max_depth": [3, 4, 5],
    "min_weight_fraction_leaf": np.arange(0, 0.50, 0.1),
}


gb_gs = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=46),
    parameters,
    scoring="roc_auc",
    cv=5,
    verbose=1,
    n_iter=65,
    n_jobs=-1,
    error_score="raise",
)
gb_gs.fit(X_train_resampled_lower_dim, y_resample)

print(gb_gs.best_estimator_, gb_gs.best_score_)
print('--------')
gb_best = gb_gs.best_estimator_

y_pred = gb_best.predict(X_test_resampled_lower_dim)
results(y_test, y_pred)

Fitting 5 folds for each of 65 candidates, totalling 325 fits
GradientBoostingClassifier(learning_rate=0.25, max_depth=5, max_features='log2',
                           n_estimators=200, random_state=46) 0.9976904060325771
--------
[[6499  129]
 [ 119  299]]
----------
Precision: 0.6985981308411215
Recall: 0.715311004784689
PR-AUC: 0.5166039459566649


F1-Score: 0.706855791962175
Accuracy: 0.9648027249503264
ROC-AUC: 0.8479240600266232


### Model 6 - XGBoost

In [26]:
parameters = {
    "booster": ["gbtree", "dart"],
    "gamma": [0, 0.25],
    "learning_rate": [0.25, 0.3],
    "max_depth": [7, 8],
    "min_child_weight": [1, 2, 3],
    "max_delta_step": [0, 1, 2],
}

xgb_gs = RandomizedSearchCV(
    XGBClassifier(objective="binary:logistic"),
    parameters,
    cv=5,
    scoring="roc_auc",
    n_iter=65,
    n_jobs=-1,
    verbose=1,
)

xgb_gs.fit(X_train_resampled_lower_dim, y_resample)

print(xgb_gs.best_estimator_, xgb_gs.best_score_)
print('--------')
xgb_best = xgb_gs.best_estimator_


y_pred = xgb_best.predict(X_test_resampled_lower_dim)
results(y_test, y_pred)

Fitting 5 folds for each of 65 candidates, totalling 325 fits
XGBClassifier(base_score=None, booster='dart', callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.3, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=0,
              max_depth=8, max_leaves=None, min_child_weight=2, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...) 0.9983991266050353
--------
[[6505  123]
 [ 126  292]]
----------
Precision: 0.7036144578313253
Recall: 0.6985645933014354
PR-AUC: 0.5094026340931226


F1-Score: 0.7010804321728692
Accuracy: 0.9646608004541584
R

### Voting Ensemble

In [45]:
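vc = VotingClassifier(
    estimators=[
        # soft voting needs predict_proba, so SGD is re-created with a loss that
        # supports it; RidgeClassifier has no predict_proba and is left out
        ("sgd_best", SGDClassifier(loss="modified_huber", alpha=0.001, random_state=46)),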
        ("xgb_best", xgb_best),
        ("gb_best", gb_best),
        ("lo_best", lo_best),
        ("dt_best", dt_best),
        ("knn_best", knn_best),
    ],
    voting="soft",
    weights=[1, 1, 1, 2, 1, 1],
    verbose=True,
)

vc.fit(X_train_resampled_lower_dim, y_resample)
print(vc.score(X_train_resampled_lower_dim, y_resample))

# y_pred = vc.predict(X_test_resampled_lower_dim)
# results(y_test, y_pred)

y_pred = vc.predict_proba(X_test_resampled_lower_dim)
results_proba(y_test, y_pred)

[Voting] ................. (1 of 6) Processing sgd_best, total=   0.2s
[Voting] ................. (2 of 6) Processing xgb_best, total=   3.8s
[Voting] .................. (3 of 6) Processing gb_best, total=   3.8s
[Voting] .................. (4 of 6) Processing lo_best, total=   0.1s
[Voting] .................. (5 of 6) Processing dt_best, total=   0.0s
[Voting] ................. (6 of 6) Processing knn_best, total=   0.0s
0.9821829823619479
Threshold=0.746, F-Score=0.74166
Threshold=0.403, ROC_AUC=0.90661
Threshold=0.822, Accuracy=0.97105
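
Soft voting averages the class probabilities of the member models, using the weights when given, and then applies a threshold. The sketch below uses the same weights as the ensemble above on made-up probabilities for three users; the probability values are hypothetical.

```python
import numpy as np

# Hypothetical positive-class probabilities: one row per model, one column per user
probs = np.array([
    [0.9, 0.2, 0.6],  # sgd_best
    [0.8, 0.1, 0.4],  # xgb_best
    [0.7, 0.3, 0.4],  # gb_best
    [0.6, 0.2, 0.7],  # lo_best
    [1.0, 0.0, 0.3],  # dt_best
    [0.8, 0.1, 0.5],  # knn_best
])
weights = np.array([1, 1, 1, 2, 1, 1])  # logistic regression counts twice

# Weighted average of the probabilities across models
avg = np.average(probs, axis=0, weights=weights)

labels_default = (avg >= 0.5).astype(int)    # default decision threshold
labels_tuned = (avg >= 0.746).astype(int)    # F-Score threshold reported above

print(avg.round(3))
print(labels_default, labels_tuned)
```

The third user flips from 1 to 0 under the higher threshold, which is how threshold moving trades recall for precision.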


### Stack Ensemble

In [43]:
sc = StackingClassifier(
    estimators=[
        ("sgd_best", sgd_best),
        ("xgb_best", xgb_best),
        ("gb_best", gb_best),
        ("rc_best", rc_best),
        ("lo_best", lo_best),
        ("dt_best", dt_best),
        ("knn_best", knn_best),
    ],
    final_estimator=lo_best,
    n_jobs=-1,
    verbose=True,
)

sc.fit(X_train_resampled_lower_dim, y_resample)
print(sc.score(X_train_resampled_lower_dim, y_resample))

# y_pred = sc.predict(X_test_resampled_lower_dim)
# results(y_test, y_pred)

y_pred = sc.predict_proba(X_test_resampled_lower_dim)
results_proba(y_test, y_pred)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.1s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.4s finished
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.5s finished
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.9s finished
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.6s finished
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    6.4s finished


0.9993636779414982
Threshold=0.561, F-Score=0.72970
Threshold=0.013, ROC_AUC=0.89146
Threshold=0.941, Accuracy=0.97034


## Observations

* From the models tested, the voting ensemble had the best *accuracy* peaking at approximately **97.1%**
    * It also had the best *F1-Score* out of any model **(0.741 peak)**
    * It also had the best *ROC-AUC* score peaking at **90.66%**


* Across the individual models, accuracy ranges from about 90% to 96.5% and the reported ROC-AUC from about 85% to 89%. F1-Score is the most volatile, ranging from about 52% to 71%.


* Most models have a higher recall than precision. 

* Overall, every model with a reported score does well at differentiating the target labels (> 0.80 ROC-AUC). In other words, the models do not seem to struggle much in telling a subscriber from a free student.


* Among the non-ensemble models, Logistic Regression performs the best overall, with the highest reported ROC-AUC (0.891).

# Exporting the best performing model

In [52]:
import pickle

file_name = "models/model_file.p"

# Save the voting ensemble; the context manager closes the file handle
with open(file_name, "wb") as f:
    pickle.dump({"model": vc}, f)

# Load it back to confirm the round trip
with open(file_name, "rb") as pickled:
    data = pickle.load(pickled)
    model = data["model"]

In [53]:
model

VotingClassifier(estimators=[('sgd_best',
                              SGDClassifier(alpha=0.001, loss='modified_huber',
                                            random_state=46)),
                             ('xgb_best',
                              XGBClassifier(base_score=None, booster='dart',
                                            callbacks=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric=None,
                                            feature_types=None, gamma=0,
                                            gpu_id=N...
                                                         n_estimators=200,
                          