# Choosing a training model

I have a prepared training dataset with features and a dependent (target) variable. The features show little multicollinearity, so the dataset suits my needs. The next step is to choose a model. This is a binary classification task, so the candidate models are a decision tree, a random forest, XGBoost, logistic regression, and an SVM.

## STEP 1. Import data and libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import tqdm
train_data = pd.read_csv(r"prepared_dataframe.csv")
train_data = train_data.set_index('user_id')
train_data

user_id,passed,viewed,wrong,unique_correct,correct_ratio,is_passed_course,use_time,passed_hard_step
1,0,1,0.0,0.0,0.00,False,0.00,0
2,9,9,0.0,2.0,1.00,False,0.05,0
3,15,20,4.0,4.0,0.50,False,0.31,0
5,1,1,2.0,2.0,0.50,False,0.00,0
7,1,1,0.0,0.0,0.00,False,0.00,0
...,...,...,...,...,...,...,...,...
26790,2,2,0.0,1.0,1.00,False,0.01,0
26793,0,1,0.0,0.0,0.00,False,0.00,0
26794,50,90,7.0,22.0,0.76,False,30.29,1
26797,10,10,0.0,2.0,1.00,False,0.20,0


Now we split the data into a dataframe with the features (X) and the target variable (y):

In [3]:
X = train_data.drop('is_passed_course',axis=1)
y = train_data['is_passed_course']
X

user_id,passed,viewed,wrong,unique_correct,correct_ratio,use_time,passed_hard_step
1,0,1,0.0,0.0,0.00,0.00,0
2,9,9,0.0,2.0,1.00,0.05,0
3,15,20,4.0,4.0,0.50,0.31,0
5,1,1,2.0,2.0,0.50,0.00,0
7,1,1,0.0,0.0,0.00,0.00,0
...,...,...,...,...,...,...,...
26790,2,2,0.0,1.0,1.00,0.01,0
26793,0,1,0.0,0.0,0.00,0.00,0
26794,50,90,7.0,22.0,0.76,30.29,1
26797,10,10,0.0,2.0,1.00,0.20,0


In [4]:
y

user_id
1        False
2        False
3        False
5        False
7        False
         ...  
26790    False
26793    False
26794    False
26797    False
26798    False
Name: is_passed_course, Length: 19234, dtype: bool

## STEP 2. Decision Tree

A decision tree is a sequence of if/else conditions on feature values. Each object is passed down from the root, following the branch whose condition it satisfies, until it reaches a leaf that assigns its class. Training a decision tree requires choosing several hyperparameters; I use GridSearchCV to search over them with cross-validation.

In [6]:
clf = tree.DecisionTreeClassifier()
parametrs = {'criterion':['gini','entropy'],
             'max_depth':range(3,10),
             'min_samples_split':range(2,10),
             'min_samples_leaf':range(1,10)}
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)
grid_dt = GridSearchCV(clf,parametrs,n_jobs=-1,cv=5)
grid_dt.fit(X_train,y_train)
best_dt = grid_dt.best_estimator_
y_pred = best_dt.predict(X_test)
accuracy_score(y_test,y_pred)
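
To make the two ideas above concrete, here is a minimal sketch on synthetic data: a small grid search followed by a printout of the conditions the winning tree learned. The dataset, the names `X_demo`, `demo_grid` and `f0`..`f3`, and the reduced grid are all assumptions chosen for illustration, not the course data or the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the course data (assumption: 4 numeric features).
X_demo, y_demo = make_classification(n_samples=500, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# A deliberately small grid so the example runs in a few seconds.
demo_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         {'criterion': ['gini', 'entropy'],
                          'max_depth': range(2, 5)},
                         cv=5)
demo_grid.fit(X_tr, y_tr)

print(demo_grid.best_params_)                  # winning combination
print(round(demo_grid.best_score_, 3))         # mean cross-validated accuracy
print(round(demo_grid.score(X_te, y_te), 3))   # accuracy on the hold-out part

# Each printed line is one threshold test on a feature; a sample follows
# the branch whose condition it satisfies until it reaches a leaf.
rules = export_text(demo_grid.best_estimator_,
                    feature_names=['f0', 'f1', 'f2', 'f3'])
print(rules)
```

Comparing `best_score_` (cross-validated) with the hold-out score is a quick way to see whether the selected hyperparameters generalize beyond the folds they were tuned on.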

## STEP 3. Random Forest

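A random forest builds on the same idea, but it combines many decision trees, each trained on a random bootstrap sample of the rows and considering only a random subset of the features at each split. Individual trees can therefore grow deep, while combining their predictions reduces overfitting.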

In [42]:
rf = RandomForestClassifier(random_state=0)
parametrs={'n_estimators':[10,20,30,40,50],
           'max_depth' : range(1,12,2),
           'min_samples_leaf':range(1,7),
           'min_samples_split':range(2,9,2)}
grid_rf = GridSearchCV(rf,parametrs,n_jobs=-1,cv=5)
grid_rf.fit(X_train,y_train)
best_rf = grid_rf.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.9062175088375962

Neither model is guaranteed to beat the other. Depending on the data, a single decision tree may be more accurate than a random forest, as happened in this case.
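
The mechanism behind a random forest can be sketched by hand: train several trees on bootstrap samples and let them vote. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally (scikit-learn averages class probabilities rather than counting hard votes). The names `X_demo`, `trees` and `forest_pred` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the course data).
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap: draw len(X_tr) rows with replacement, so each tree
    # sees a different, incomplete view of the training set.
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features='sqrt' makes each split look at a random feature subset.
    tree_i = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree_i.fit(X_tr[rows], y_tr[rows]))

# Majority vote across the 25 trees (an odd count, so no ties).
votes = np.stack([t.predict(X_te) for t in trees])   # shape (25, n_test)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print('single tree accuracy:', (single_tree.predict(X_te) == y_te).mean())
print('voting trees accuracy:', (forest_pred == y_te).mean())
```

Which of the two printed scores is higher depends on the data and the split, which is the point made above.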

## STEP 4. XGBoost

XGBoost uses decision trees as base learners. The trees are built sequentially, and each new tree is trained to correct the errors left by the previous trees.

In [44]:
model_xgb = xgb.XGBClassifier()
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 500, 1000]
}
grid_search_xgb = GridSearchCV(estimator=model_xgb, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search_xgb.fit(X_train, y_train)
best_params_xgb = grid_search_xgb.best_params_
best_model_xgb = grid_search_xgb.best_estimator_
y_pred = best_model_xgb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.908240187158825
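
The "each new tree corrects the previous errors" idea can be illustrated with a stripped-down boosting loop for squared error, where every tree is fitted to the current residuals. This is only a sketch of the principle: XGBoost itself works with gradients and second derivatives of the loss and adds regularization. The data and the names `prediction`, `small_tree` and `errors` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve as a toy regression target (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant
errors = []
for _ in range(100):
    residual = y_demo - prediction                 # what is still unexplained
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    small_tree.fit(X_demo, residual)               # next tree targets the residual
    prediction += learning_rate * small_tree.predict(X_demo)
    errors.append(np.mean((y_demo - prediction) ** 2))

# Training error after the first and after the last tree.
print(round(errors[0], 4), round(errors[-1], 4))
```

The `learning_rate` here plays the same role as the `learning_rate` entry in the grid above: smaller values need more trees (`n_estimators`) to reach the same fit.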


## STEP 5. Logistic Regression

The basic idea of logistic regression is to use the logistic function (also known as the sigmoid function) to model the probability that an object belongs to a class.

In [46]:
logic_regr = LogisticRegression()
param_grid = {
    'C': [0.1, 1.0, 10.0],
}
grid_search_lr = GridSearchCV(estimator=logic_regr, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search_lr.fit(X_train, y_train)
best_params = grid_search_lr.best_params_
best_model_lr = grid_search_lr.best_estimator_
y_pred = best_model_lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.905900701845594
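
As a small check of the sigmoid idea, the sketch below recomputes the class probabilities of a fitted binary `LogisticRegression` by hand from its coefficients. The synthetic data and the names `lr_demo`, `sigmoid` and `manual_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption, not the course data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
lr_demo = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_demo)

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear score w.x + b, then squashed into a probability of class 1.
z = X_demo @ lr_demo.coef_.ravel() + lr_demo.intercept_[0]
manual_proba = sigmoid(z)
sklearn_proba = lr_demo.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

The parameter `C` searched in the grid above is the inverse regularization strength: it changes the fitted coefficients, not the sigmoid itself.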


## STEP 6. SVM

An SVM constructs a separating hyperplane that is as far as possible from the nearest objects of each class. For linearly separable data, this boundary is a straight line in two dimensions and a hyperplane in a higher-dimensional feature space.

In [None]:
model = SVC()
param_grid = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
# verbose=1 makes GridSearchCV report its progress; tqdm cannot wrap a single fit() call
grid_search_SVC = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy', verbose=1)
grid_search_SVC.fit(X_train, y_train)
best_params_SVC = grid_search_SVC.best_params_
best_model_SVC = grid_search_SVC.best_estimator_
y_pred = best_model_SVC.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
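
The "maximally distant" property can be checked numerically on a toy problem. The sketch below fits a linear SVM to two well-separated synthetic clusters and compares the margin width, 2 / ||w||, with the distance from the closest points to the hyperplane. The data, the large `C` (used to approximate a hard margin), and the names `svm_demo`, `margin_width` and `distances` are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2D clusters (assumption: linearly separable toy data).
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[-3, -3], scale=0.5, size=(30, 2))
class_b = rng.normal(loc=[3, 3], scale=0.5, size=(30, 2))
X_demo = np.vstack([class_a, class_b])
y_demo = np.array([0] * 30 + [1] * 30)

# A large C leaves almost no room for margin violations.
svm_demo = SVC(kernel='linear', C=1000.0).fit(X_demo, y_demo)

w = svm_demo.coef_[0]          # normal vector of the separating line
b = svm_demo.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

# Distance of every point to the hyperplane w.x + b = 0.
distances = np.abs(X_demo @ w + b) / np.linalg.norm(w)

print('support vectors:', len(svm_demo.support_vectors_))
print('margin width   :', round(margin_width, 3))
print('closest point  :', round(distances.min(), 3))
```

The closest points sit at about half the margin width from the boundary; those points are the support vectors that define it.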

Overall, the accuracy of the different models differs very little: the recorded scores range from 0.906 to 0.908 for the random forest, XGBoost, and logistic regression. Since the object of study is user behavior, 100% accuracy is not achievable, because people's behavior varies greatly. Still, the scores suggest that the models are suitable for predicting whether new users will complete the course. These figures were measured on a hold-out split of the training data, though. To check the quality for 2018-2019 users, we need to transform the test dataset in the same way and score the predictions on it.
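
One way to judge how much an accuracy figure means is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below does this on synthetic data with roughly 90% negatives; that class balance is an assumption chosen for illustration, since the class balance of the real dataset is not shown here. The names `baseline`, `model`, `baseline_acc` and `model_acc` are likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: about 90% of samples in class 0 (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=6,
                                     n_informative=4, weights=[0.9, 0.1],
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

# Baseline: ignore the features and always predict the majority class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
print('majority-class baseline:', round(baseline_acc, 3))
print('random forest          :', round(model_acc, 3))
```

On the real data, the same comparison would use `X_train`, `X_test`, `y_train` and `y_test` from the cells above; only the gap between the two numbers shows what the model has learned.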

## STEP 7. Apply the model to the test data

In [27]:
events_test = pd.read_csv(r"events_data_test.csv")
submissions_test = pd.read_csv(r"submission_data_test.csv")
users_data = events_test.pivot_table(index='user_id',
                        columns='action',
                        values='step_id',
                        aggfunc='count',
                        fill_value=0).reset_index()
users_corrects = submissions_test.pivot_table(index='user_id',
                        columns='submission_status',
                        values='step_id',
                        aggfunc='count',
                        fill_value=0).reset_index()
users_data = users_data.merge(users_corrects,how='outer')
# Count the distinct steps each user solved correctly.
sub = submissions_test[['step_id','submission_status','user_id']]
sub = sub.loc[sub.submission_status == 'correct']
sub = sub.drop_duplicates()
sub = sub.groupby('user_id').agg({'submission_status':'count'})
sub = sub.rename(columns={'submission_status':'unique_correct'}).reset_index()
users_data = users_data.merge(sub, how = 'outer')
users_data['correct_ratio'] = (users_data.unique_correct / (users_data.unique_correct + users_data.wrong)).round(2)
succses_students = submissions_test.loc[submissions_test.step_id == 31978 ]
succses_students = succses_students.drop(['timestamp'],axis=1)
succses_students = succses_students.loc[succses_students.submission_status == 'correct']
succses_students = succses_students.drop_duplicates()
succses_students = succses_students.drop('submission_status',axis=1)
succses_students = succses_students.sort_values('user_id').reset_index()
succses_students = succses_students.drop('index',axis=1)
# Every remaining row is a correct submission of the hard step, so flag it with 1.
succses_students['passed_hard_step'] = 1
succses_students = succses_students.drop('step_id',axis=1)
users_data = users_data.merge(succses_students, how = 'outer')
users_data = users_data.fillna(0)
users_data = users_data.drop(['discovered','started_attempt','correct'],axis=1)
users_data = users_data.set_index('user_id')
users_data['first_timestamp'] = events_test.groupby('user_id').agg({'timestamp':'min'})
users_data['last_timestamp'] = events_test.groupby('user_id').agg({'timestamp':'max'})
users_data['use_time'] = ((users_data.last_timestamp - users_data.first_timestamp) / (60*60)).round(2)
users_data = users_data.drop(['first_timestamp','last_timestamp'],axis=1)
users_data = users_data[['passed','viewed','wrong','unique_correct','correct_ratio','use_time','passed_hard_step']]
# Class probabilities from the tuned random forest (column 1 is the positive class).
y_pred = pd.DataFrame(best_rf.predict_proba(users_data))
y_pred = y_pred.drop(0,axis=1)
y_pred = y_pred.rename(columns={1:'is_gone'})
y_pred['is_gone'] = y_pred.is_gone.round(2)
itog  = users_data
itog = itog.reset_index()
itog = itog.join(y_pred)
itog = itog[['user_id','is_gone']]
itog.to_csv(r"my_answer.csv")
print('Success!')

Unnamed: 0,user_id,is_gone
0,4,0.00
1,6,0.00
2,10,0.00
3,12,0.05
4,13,0.44
...,...,...
6179,26791,0.00
6180,26795,0.00
6181,26796,0.04
6182,26799,0.07


# Your ROC score is 0.8873277731673863
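
The score reported above comes from the external checker and is presumably the area under the ROC curve, which is computed from predicted probabilities rather than hard labels. A minimal sketch of the same metric on a local hold-out split, using synthetic data and the hypothetical names `rf_demo`, `proba_pos` and `auc`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, not the course data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=6,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# ROC AUC needs a score per sample: the probability of the positive class.
proba_pos = rf_demo.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba_pos)
print('ROC AUC on the hold-out split:', round(auc, 3))
```

With the notebook's own objects, the equivalent call would be `roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1])`, which gives a local estimate to compare against the checker's number.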