In [None]:
# <p style="font-family: Arial; font-size:2.0em; color:#3323C0; font-style:bold">
# Summary: Compare model results and final model selection</p>
# <br>

### Compare model results and final model selection

1. Evaluate the saved models, both logistic regression and random forest, on the validation set
2. Select the best model based on performance on the validation set
3. Evaluate that model on the holdout test set

**Last update: December 18, 2019**

### Read in Data

In [11]:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix
# from sklearn.metrics import plot_precision_recall_curve
from time import time


test_features = pd.read_csv('X_test_df_2.csv')
test_labels = pd.read_csv('y_test_df_2.csv')

In [12]:
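# Drop the raw view count: the class labels are derived from it,
# so keeping it as a feature would leak the target.
test_features.drop(columns=["views"], inplace=True)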

In [13]:
test_features.shape

(1213, 8)

In [14]:
test_labels.shape

(1213, 1)

### Read in Models: logistic regression and random forest

In [15]:
models = {}

for mdl in ['LR_2', 'RF_3']:
    models[mdl] = joblib.load('{}.pkl'.format(mdl))

In [16]:
models

{'LR_2': LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='multinomial',
           n_jobs=None, penalty='l2', random_state=42, solver='newton-cg',
           tol=0.0001, verbose=0, warm_start=False),
 'RF_3': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=4, min_samples_split=5,
             min_weight_fraction_leaf=0.0, n_estimators=700, n_jobs=None,
             oob_score=True, random_state=42, verbose=0, warm_start=False)}

<p style="font-family: Arial; font-size:2.0em; color:#3323C0; font-style:bold">
Evaluate models on the validation set </p>
<br>

In [17]:
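def evaluate_model(name, model, features, labels):
    """Print accuracy, micro-averaged precision/recall, and a per-class report."""
    start = time()
    pred = model.predict(features)
    end = time()  # prediction time is measured but not reported below
    accuracy = round(accuracy_score(labels, pred), 3)
    # With average='micro' on single-label multiclass data, precision and
    # recall both equal accuracy; per-class values are in the report below.
    precision = round(precision_score(labels, pred, average='micro'), 3)
    recall = round(recall_score(labels, pred, average='micro'), 3)

    print('{} -- Accuracy: {} / Precision: {} / Recall: {} /'.format(name, accuracy, precision, recall))

    print()
    print("Classification report for ", name)
    print()
    print(classification_report(labels, pred))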

    
#     print()
#     print("Confusion matrix for ", name)
#     print()
#     print(confusion_matrix(labels, pred)) 
#     print()

In [18]:
for name, mdl in models.items():
    evaluate_model(name, mdl, test_features, test_labels)

LR_2 -- Accuracy: 0.594 / Precision: 0.594 / Recall: 0.594 /

Classification report for  LR_2

              precision    recall  f1-score   support

           0       0.75      0.66      0.70       302
           1       0.48      0.43      0.46       295
           2       0.64      0.79      0.71       305
           3       0.50      0.49      0.49       311

   micro avg       0.59      0.59      0.59      1213
   macro avg       0.59      0.59      0.59      1213
weighted avg       0.59      0.59      0.59      1213

RF_3 -- Accuracy: 0.742 / Precision: 0.742 / Recall: 0.742 /

Classification report for  RF_3

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       302
           1       0.74      0.61      0.67       295
           2       0.76      0.82      0.79       305
           3       0.65      0.67      0.66       311

   micro avg       0.74      0.74      0.74      1213
   macro avg       0.74      0.74      0.74      
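
The summary lines above report identical accuracy, precision, and recall for each model. That is expected: with `average='micro'` on a single-label multiclass problem, every misclassification counts as one false positive and one false negative, so both metrics reduce to accuracy. The sketch below illustrates this on a small made-up label vector (the values are illustrative and are not taken from the YouTube data) and contrasts it with macro averaging.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy single-label, four-class example (illustrative values only).
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 2, 2, 0, 3, 2, 3]

accuracy = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average='micro')
micro_recall = recall_score(y_true, y_pred, average='micro')

# Macro averaging takes the unweighted mean of the per-class precisions,
# so it can differ from accuracy.
macro_precision = precision_score(y_true, y_pred, average='macro')

print(accuracy, micro_precision, micro_recall, macro_precision)
```

To compare models on something other than overall accuracy, the per-class rows of the classification report or a macro average are more informative than the micro-averaged values.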

In [None]:
# <p style="font-family: Arial; font-size:2.0em; color:#3323C0; font-style:bold">
# Class encoding -- 0:extreme, 1:high, 2:low, 3:medium </p>
# <br>
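
The class codes follow alphabetical order (extreme, high, low, medium) rather than view-count order, which is what scikit-learn's `LabelEncoder` produces for string labels. The sketch below shows one way such labels could arise. It assumes the classes are quartile bins built with `pd.qcut` and encoded with `LabelEncoder`; neither step is shown in this notebook, and the view counts are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical view counts; the real dataset is not shown in this notebook.
views = pd.Series([120, 950, 4300, 15000, 62000, 310000, 1200000, 8000000])

# Assumption: the four classes are the quartile bins of the view count.
view_class = pd.qcut(views, q=4, labels=['Low', 'Medium', 'High', 'Extreme'])

# LabelEncoder sorts the class names alphabetically before assigning integers.
encoder = LabelEncoder()
codes = encoder.fit_transform(view_class.astype(str))

mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
```

Because the integer codes are not ordered by view count, they should be read as nominal class identifiers, not as a ranking.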

### Conclusions:

* YouTube videos in the US are classified by their number of views.
* Videos above the 75th percentile of views are labeled "Extreme". Videos between the 50th and 75th percentiles are labeled "High", videos between the 25th and 50th percentiles are labeled "Medium", and videos below the 25th percentile are labeled "Low".
* Label encoding maps these classes to Extreme:0, High:1, Low:2, Medium:3.
* Two machine learning models, logistic regression and random forest, are used.
* The dataset is randomly split into two segments: a training set (80% of the dataset) and a test set (20% of the dataset).
* Model parameters are chosen on the training set with five-fold cross-validation.
* For logistic regression, the regularization parameter C is chosen by grid search.
* For random forest, hyperparameters such as the number of trees, the maximum number of features, and the maximum depth of each tree are chosen by random search. Random search is used instead of grid search to reduce the time needed to find these hyperparameters.
* Results demonstrate that random forest, with an accuracy of 74%, outperforms logistic regression, with an accuracy of 59%.
* In terms of precision, random forest reaches 83% for class 0 (extremely viewed videos), while logistic regression reaches 75%. Precision is the fraction of videos predicted to be in a class that actually belong to it.
* Likely reasons that random forest performs better are:
    * Random forest is a well-known ensemble method that combines a large number of decision trees, each trained on a random bootstrap sample of the dataset.
    * Averaging over many trees makes random forest less prone to overfitting than a single decision tree.
    * Logistic regression is a linear model and works best when the relationship is linear. In the YouTube dataset, the relationship between input features and output labels appears to be non-linear, so random forest performs better than logistic regression.
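
The tuning procedure described in the conclusions (80/20 split, five-fold cross-validation, grid search for C, random search for the forest) can be sketched roughly as follows. This is a minimal illustration, not the code used to produce `LR_2.pkl` or `RF_3.pkl`: the data is synthetic, and the parameter grids and `n_iter` are hypothetical values kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic four-class data standing in for the YouTube features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, random_state=42)

# 80% training / 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid search over the regularization parameter C (hypothetical grid).
lr_search = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={'C': [0.01, 1, 100]}, cv=5)
lr_search.fit(X_train, y_train)

# Random search samples n_iter combinations instead of trying all of them
# (hypothetical ranges).
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': [20, 50, 100],
                         'max_features': ['sqrt', 'log2'],
                         'max_depth': [5, 10, None]},
    n_iter=5, cv=5, random_state=42)
rf_search.fit(X_train, y_train)

print(lr_search.best_params_, rf_search.best_params_)
# The test set is touched only once, after the hyperparameters are fixed.
print(lr_search.score(X_test, y_test), rf_search.score(X_test, y_test))
```

Here the random search fits 5 of the 18 possible forest configurations, which is the running-time saving the conclusions refer to.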
