# Modeling

Models tested in this notebook:
- kNN 
- SVM 
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting

Observations (test set):
- KNN, Decision Tree and Random Forest achieved near-perfect results (macro F1 between 0.973 and 0.994, AUC ROC of at least 0.996).
- Support Vector Classifier, multi-class Logistic Regression and Gradient Boosting achieved perfect results (AUC ROC and macro F1 of 1.0).

Logistic Regression is the model that best combines performance and interpretability.

One remaining concern is interpretability. Although logistic regression is generally easy to interpret, its input features have been scaled and one-hot encoded, so its coefficients should be read in terms of the original features rather than the transformed ones.
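
One way to express the coefficients in original units is sketched below. It assumes the `scaler__` columns come from a `StandardScaler` (an assumption; the preprocessing step is not shown in this notebook) and uses the iris dataset as stand-in data. Because standardization is an affine map, the weights learned on scaled inputs can be folded back into weights on the raw inputs. The one-hot `encoder__` columns are already 0/1 indicators and would need no conversion.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data: the notebook's raw features and fitted scaler are not available here.
X_raw, y = load_iris(return_X_y=True)

scaler = StandardScaler().fit(X_raw)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_raw), y)

# With z = (x - mean) / scale:
#   w . z + b  ==  (w / scale) . x + (b - sum(w * mean / scale))
coef_original = model.coef_ / scaler.scale_
intercept_original = model.intercept_ - (model.coef_ * scaler.mean_ / scaler.scale_).sum(axis=1)

# Both parameterisations produce the same per-class decision scores.
scores_scaled = model.decision_function(scaler.transform(X_raw))
scores_original = X_raw @ coef_original.T + intercept_original
print(np.allclose(scores_scaled, scores_original))
```

Each entry of `coef_original` is then the change in a class's log-odds score per one raw unit of the feature (for example, per minute of app usage), which is usually easier to communicate than a change per standard deviation.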

In [1]:
from constants import ModelSelectionConstants, RandomStateConstants
import pandas as pd

from pathlib import Path
import pickle

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score, f1_score

In [2]:
data_folder_path = Path(ModelSelectionConstants.DATA_FOLDER)
train_processed = pd.read_csv(data_folder_path / 'train_processed.csv')
test_processed = pd.read_csv(data_folder_path / 'test_processed.csv')
train_processed.head()


| | `encoder__Device Model_OnePlus 9` | `encoder__Device Model_Samsung Galaxy S21` | `encoder__Device Model_Xiaomi Mi 11` | `encoder__Device Model_iPhone 12` | `encoder__Gender_Male` | `scaler__App Usage Time (min/day)` | `scaler__Screen On Time (hours/day)` | `scaler__Battery Drain (mAh/day)` | `scaler__Number of Apps Installed` | `scaler__Data Usage (MB/day)` | `scaler__Age` | `target__User Behavior Class` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.920848 | -1.162831 | -1.168926 | -0.981126 | -1.227309 | 0.901454 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -0.035817 | 0.180012 | -0.165417 | -0.154074 | -0.077195 | 0.025581 | 3.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.326381 | 1.0906 | 0.932397 | 1.782212 | 1.126788 | 0.981126 | 5.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.932612 | -1.067571 | -0.964457 | -1.162831 | -0.925421 | -0.596232 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | -5.199338 | -1.544575 | -1.220421 | -1.162831 | -1.067571 | -0.180012 | 1.0 |


In [3]:
target_name = 'target__User Behavior Class'
X_train, y_train = train_processed.drop(columns=target_name), train_processed[target_name]
X_test, y_test = test_processed.drop(columns=target_name), test_processed[target_name]

In [4]:
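def grid_search_cv_factory(estimator, param_grid):
    """Build a 5-fold grid search scored by one-vs-one macro ROC AUC."""
    return GridSearchCV(
        estimator=estimator,
        param_grid=param_grid,
        n_jobs=-1,              # use all available cores
        scoring='roc_auc_ovo',  # needs estimators that expose predict_proba
        cv=5,
    )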
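
The `'roc_auc_ovo'` scorer averages a binary AUC over every pair of classes. The sketch below recomputes it by hand on stand-in data (iris, not the notebook's dataset) and compares it with `roc_auc_score`. It assumes integer labels that match the column order of `predict_proba`, which holds for iris but would need a lookup through `classes_` for labels such as 1 to 5.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data so the sketch runs on its own.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

pair_aucs = []
for a, b in combinations(np.unique(y_te), 2):
    # Keep only the samples that belong to one of the two classes.
    mask = np.isin(y_te, [a, b])
    # Score the pair twice, once with each class as the positive one.
    auc_a = roc_auc_score(y_te[mask] == a, proba[mask, a])
    auc_b = roc_auc_score(y_te[mask] == b, proba[mask, b])
    pair_aucs.append((auc_a + auc_b) / 2)

manual_ovo = float(np.mean(pair_aucs))
sklearn_ovo = roc_auc_score(y_te, proba, multi_class='ovo', average='macro')
print(manual_ovo, sklearn_ovo)
```

Since every pair carries equal weight, this macro average does not depend on how many samples each class has.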

In [5]:
knn_cv = grid_search_cv_factory(
    KNeighborsClassifier(),
    param_grid={
        'n_neighbors': [3, 5, 7, 9, 11]
    }
)
knn_cv.fit(X_train, y_train)
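
A fitted search keeps the score of every candidate in `cv_results_`, which helps when checking how close the alternatives were. The sketch below is a self-contained illustration on stand-in data (iris); in this notebook the same lines could be applied to `knn_cv` directly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data so the sketch runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
    scoring='roc_auc_ovo',
    cv=5,
).fit(X_demo, y_demo)

# One row per candidate, best rank first.
results = pd.DataFrame(demo_search.cv_results_)
columns = ['param_n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
cv_table = results[columns].sort_values('rank_test_score')

print(cv_table)
print(demo_search.best_params_, round(demo_search.best_score_, 4))
```

When several candidates share rank 1, current scikit-learn versions appear to report the first of them in grid order. On an easily separable dataset, that may be why a search returns the first values listed in its grid.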

In [6]:
svm_cv = grid_search_cv_factory(
    SVC(probability=True, random_state=RandomStateConstants.SVC_SEED),
    param_grid={
        'C': [1, 0.1, 0.01, 0.001],
        'kernel': ['linear', 'rbf', 'sigmoid']
    }
)

svm_cv.fit(X_train, y_train)

In [7]:
log_reg_cv = grid_search_cv_factory(
    LogisticRegression(random_state=RandomStateConstants.LOG_REG_SEED), 
    param_grid={
        'C': [1, 0.1, 0.01, 0.001]
    }
)

log_reg_cv.fit(X_train, y_train)

In [8]:
tree_cv = grid_search_cv_factory(
    DecisionTreeClassifier(random_state=RandomStateConstants.DECISION_TREE_SEED),
    param_grid={
        'max_depth': range(1, 10)
    }
)

tree_cv.fit(X_train, y_train)


In [9]:
random_forest_cv = grid_search_cv_factory(
    RandomForestClassifier(random_state=RandomStateConstants.RANDOM_FOREST_SEED, n_jobs=-1),
    param_grid={
        'n_estimators': [20, 50, 100, 200],
        'max_depth': range(1, 10),
    }
)

random_forest_cv.fit(X_train, y_train)


In [10]:
model_folder_path = Path(ModelSelectionConstants.MODEL_FOLDER)
gradient_boosting_cv_path = model_folder_path / 'gradient_boosting_cv.pkl'

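# Reuse the cached search when it exists: the grid below has 240 candidates, each fitted on 5 folds.
if gradient_boosting_cv_path.exists():
    with open(gradient_boosting_cv_path, 'rb') as file:
        gradient_boosting_cv = pickle.load(file)

else: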
    gradient_boosting_cv = grid_search_cv_factory(
        GradientBoostingClassifier(random_state=RandomStateConstants.GRADIENT_BOOSTING_SEED), 
        param_grid = {
            'learning_rate': [0.2, 0.1, 0.01, 0.001],
            'n_estimators': [20, 50, 100, 200],
            'subsample': [1, 0.7, 0.5],
            'max_depth': [1, 2, 3, 4, 5]
        }
    )

    gradient_boosting_cv.fit(X_train, y_train)

    # Cache the fitted search so later runs can skip the fit.
    model_folder_path.mkdir(exist_ok=True, parents=True)
    with open(gradient_boosting_cv_path, 'wb') as file:
        pickle.dump(gradient_boosting_cv, file)


gradient_boosting_cv
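
The load-or-fit pattern used for the gradient boosting search can be factored into a small helper so that other slow searches reuse it. This is a minimal sketch: `load_or_fit` and `fake_fit` are hypothetical names, not part of this notebook or of scikit-learn, and the demo caches a plain dictionary in a temporary folder instead of a real model.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_fit(path, fit_fn):
    """Return the pickled object at `path`, or call `fit_fn()` and cache its result."""
    path = Path(path)
    if path.exists():
        with path.open('rb') as file:
            return pickle.load(file)

    result = fit_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as file:
        pickle.dump(result, file)
    return result


# Demo with a stand-in "fit" that records how often it runs.
calls = []

def fake_fit():
    calls.append(1)
    return {'best_params': {'max_depth': 1}}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / 'models' / 'demo.pkl'
    first = load_or_fit(cache_path, fake_fit)   # runs fake_fit and writes the cache
    second = load_or_fit(cache_path, fake_fit)  # reads the cache, no refit

print(len(calls), first == second)
```

A cache like this does not notice when the training data or the parameter grid changes, so the pickle has to be deleted by hand after such changes. Pickle files should also only be loaded from trusted sources.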

In [11]:
estimators = [knn_cv, svm_cv, log_reg_cv, tree_cv, random_forest_cv, gradient_boosting_cv]
metrics = ['AUC ROC', 'f1-score']

# Score each tuned model on the held-out test set; rows are labelled by the best estimator's repr.
scores = pd.DataFrame(
    [
        {
            'AUC ROC': roc_auc_score(y_test, estimator.predict_proba(X_test), multi_class='ovo', average='macro'),
            'f1-score': f1_score(y_test, estimator.predict(X_test), average='macro'),
        }
        for estimator in estimators
    ],
    index=[str(estimator.best_estimator_) for estimator in estimators],
)

In [12]:
scores.sort_values(by=metrics, ascending=False)

| Best estimator | AUC ROC | f1-score |
|---|---|---|
| `SVC(C=1, kernel='linear', probability=True, random_state=789012)` | 1.0 | 1.0 |
| `LogisticRegression(C=1, random_state=345678)` | 1.0 | 1.0 |
| `GradientBoostingClassifier(learning_rate=0.2, max_depth=1, n_estimators=20, random_state=110797, subsample=1)` | 1.0 | 1.0 |
| `RandomForestClassifier(max_depth=1, n_estimators=20, n_jobs=-1, random_state=456789)` | 1.0 | 0.993328 |
| `DecisionTreeClassifier(max_depth=3, random_state=890123)` | 0.996429 | 0.993712 |
| `KNeighborsClassifier(n_neighbors=9)` | 0.996146 | 0.973377 |
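
Aggregate scores this close to 1.0 do not show where the few remaining errors fall. A per-class breakdown is a cheap follow-up check. The sketch below uses stand-in data (iris) and an untuned logistic regression, so its numbers say nothing about this notebook's models.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data so the sketch runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

y_pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 for each class, plus macro and weighted averages.
print(classification_report(y_te, y_pred, digits=3))
```

In this notebook the equivalent call would be, for example, `confusion_matrix(y_test, knn_cv.predict(X_test))`, which would show which behavior classes KNN confuses.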

