## Amazon.com - Employee Access Challenge from Kaggle
The original challenge is here: https://www.kaggle.com/c/amazon-employee-access-challenge/data
### Build an ML model to predict whether the access is granted (1) or not (0).

About CatBoost:

* A gradient boosting method that focuses on handling categorical features and on boosting trees with an “ordering principle”.
* The main takeaway is that the ordering principle is applied in two places:
 * Target encoding of categorical features
 * Boosting trees
* Both uses aim to fight a prediction shift caused by a special kind of target leakage.
* https://arxiv.org/pdf/1706.09516.pdf
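
To make the ordering principle concrete for target encoding, here is a minimal sketch of ordered target statistics in plain Python. It is an illustration under stated assumptions, not CatBoost's actual implementation: the function name `ordered_target_stats`, the smoothing weight `a`, the `prior` value, and the toy data are all hypothetical. Each row is encoded using only the targets of rows that come before it in a random permutation, so a row's own label never enters its own encoding.

```python
import random


def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode one categorical column with ordered target statistics.

    Hypothetical helper for illustration only. For each row, the encoding is
    (sum of earlier targets in the same category + a * prior) / (count + a),
    where "earlier" refers to a random permutation of the rows.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of row indices

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        k = counts.get(cat, 0)
        # Use only the history; the current row's target is not included.
        encoded[idx] = (s + a * prior) / (k + a)
        # Update the history after the row has been encoded.
        sums[cat] = s + targets[idx]
        counts[cat] = k + 1
    return encoded


# Toy data (hypothetical): one categorical column and a binary target
cats = ["a", "a", "b", "a", "b"]
ys = [1, 0, 1, 1, 0]
print(ordered_target_stats(cats, ys, prior=0.6))
```

Because the encoding of a row is computed before its target is added to the running totals, changing that row's label changes the encodings of later rows in the permutation but never its own. This is the property that a plain per-category mean of the target lacks.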

In [1]:
! pip install -q -U catboost==0.26.1

In [2]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [3]:
data = pd.read_csv("../final_project/kaggle_amazon_employee_dataset.csv")

In [4]:
data.head(10)

|   | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |
| 5 | 0 | 45333 | 14561 | 117951 | 117952 | 118008 | 118568 | 118568 | 19721 | 118570 |
| 6 | 1 | 25993 | 17227 | 117961 | 118343 | 123476 | 118980 | 301534 | 118295 | 118982 |
| 7 | 1 | 19666 | 4209 | 117961 | 117969 | 118910 | 126820 | 269034 | 118638 | 126822 |
| 8 | 1 | 31246 | 783 | 117961 | 118413 | 120584 | 128230 | 302830 | 4673 | 128231 |
| 9 | 1 | 78766 | 56683 | 118079 | 118080 | 117878 | 117879 | 304519 | 19721 | 117880 |


In [5]:
data["ACTION"].value_counts()

1    30872
0     1897
Name: ACTION, dtype: int64

In [6]:
y = data["ACTION"]
X = data.drop(columns="ACTION")

In [7]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.15, random_state=136, stratify=y)
# stratify=y keeps the class proportions of y (about 94% granted, 6% denied)
# the same in the training and validation splits
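
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical toy labels (`y_toy`, `X_toy`) with roughly the same imbalance as this dataset and compares the share of class 0 in the two splits; it assumes `numpy` and `scikit-learn` are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 6% zeros, 94% ones
y_toy = np.array([0] * 60 + [1] * 940)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split settings as in the notebook, applied to the toy data
_, _, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=136, stratify=y_toy)

# Share of the minority class in each split
train_rate = (y_tr == 0).mean()
valid_rate = (y_va == 0).mean()
print(f"class-0 share: train={train_rate:.3f}, valid={valid_rate:.3f}")
```

With stratification the two shares should be nearly identical. Without it, a small minority class can end up noticeably over- or under-represented in the validation split purely by chance.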

In [8]:
X.columns

Index(['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME',
       'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE'],
      dtype='object')

In [9]:
# CatBoostClassifier

# The dataset is imbalanced, so weight each class by the inverse of its
# frequency in the training set before fitting the tree classifier.
n_train = len(y_train)
class_weight_0 = n_train / (y_train == 0).sum()
class_weight_1 = n_train / (y_train == 1).sum()

params = {
    "loss_function": "Logloss",  # Some others: CrossEntropy
    "eval_metric": "F1",  # Some others: Accuracy, Precision, Recall, AUC
    "verbose": 200,  # print training progress every 200 iterations
    "random_seed": 136,
    "iterations": 200,
    "class_weights": [class_weight_0, class_weight_1],
}

# All input features are categorical
cat_features = [0, 1, 2, 3, 4, 5, 6, 7, 8] # categorical columns indices
cb_classifier = CatBoostClassifier(**params)
cb_classifier.fit(
    X_train,
    y_train,
    eval_set=(X_valid, y_valid),  # data to validate on
    use_best_model=True,
    cat_features=cat_features,
)

Learning rate set to 0.145159
0:	learn: 0.7106759	test: 0.7534392	best: 0.7534392 (0)	total: 88.5ms	remaining: 17.6s
199:	learn: 0.8829115	test: 0.8481463	best: 0.8540151 (106)	total: 7.4s	remaining: 0us

bestTest = 0.8540151035
bestIteration = 106

Shrink model to first 107 iterations.


<catboost.core.CatBoostClassifier at 0x15111ad90>
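
The class weights used above follow an inverse-frequency rule: the total number of rows divided by the number of rows in each class. The sketch below reproduces that rule in plain Python with the class counts reported by `value_counts()` for the full dataset. The notebook computes its weights on `y_train` only, so its exact values differ slightly, and `inverse_frequency_weights` is a hypothetical helper name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {label: total_count / label_count} (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / count for label, count in sorted(counts.items())}


# Class counts taken from data["ACTION"].value_counts() on the full dataset
labels = [1] * 30872 + [0] * 1897
weights = inverse_frequency_weights(labels)
print(weights)  # roughly 17.3 for class 0 and 1.06 for class 1
```

Under this rule, weight times class count equals the total number of rows for every class, so both classes contribute the same total weight to the loss even though class 0 is far rarer.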

In [10]:
# make predictions; predict() returns class labels (0 or 1) by default,
# so no rounding is needed
y_pred = cb_classifier.predict(X_valid)

print(classification_report(y_valid, y_pred))

              precision    recall  f1-score   support

           0       0.33      0.79      0.47       285
           1       0.99      0.90      0.94      4631

    accuracy                           0.90      4916
   macro avg       0.66      0.85      0.71      4916
weighted avg       0.95      0.90      0.91      4916



In [11]:
# get feature importance, sorted from most to least important
pd.DataFrame({
    "feature_importance": cb_classifier.get_feature_importance(),
    "feature_names": X_train.columns,
}).sort_values(by="feature_importance", ascending=False)

|   | feature_importance | feature_names |
|---|---|---|
| 0 | 20.4036 | RESOURCE |
| 1 | 20.328335 | MGR_ID |
| 4 | 13.009229 | ROLE_DEPTNAME |
| 6 | 10.37407 | ROLE_FAMILY_DESC |
| 3 | 10.15134 | ROLE_ROLLUP_2 |
| 7 | 7.272918 | ROLE_FAMILY |
| 5 | 6.318493 | ROLE_TITLE |
| 2 | 6.095671 | ROLE_ROLLUP_1 |
| 8 | 6.046343 | ROLE_CODE |


In [12]:
cb_classifier.get_feature_importance()

array([20.4036    , 20.32833546,  6.09567139, 10.15134001, 13.00922937,
        6.31849265, 10.3740701 ,  7.27291828,  6.04634273])