# Catboost & Hyperopt : Amazon employees dataset

* **Information from [Kaggle](https://www.kaggle.com/c/amazon-employee-access-challenge/overview)**
    - When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role.
    - **Given data about current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company.** These auto-access models seek to minimize the human involvement required to grant or revoke employee access.
    
    
* **Catboost**: 
    - **Yandex, the developers of CatBoost, claim that default CatBoost provides a ~20% logloss improvement over LightGBM & XGBoost, and that tuning improves the model further.**
    - **I will be testing these claims.**
    - CatBoost uses gradient boosted trees. It works well on categorical data and on mixed data (with both categorical and numerical features).
    - Data is quantized into bins. The algorithm decides the bin 'borders' (we can set our own values too). This quantization makes the data easier to process in parallel.
    - Symmetric gradient boosted trees are built; each subsequent tree improves on the performance of the previous set of trees.
    - Categorical preprocessing steps like One-Hot-Encoding, and text preprocessing steps like tokenization and Bag of Words models, can be performed within the CatBoost algorithm (no need for additional preprocessing).
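
To make the idea of quantizing a feature into bins concrete, here is a minimal sketch in NumPy. It is not CatBoost's own border-selection method: the toy values and the use of plain quartiles as borders are assumptions made purely for illustration.

```python
import numpy as np

# Toy numerical feature; the values are made up for illustration.
values = np.array([0.5, 1.2, 1.9, 3.3, 4.8, 5.1, 7.4, 9.0])

# Choose 3 borders at the quartiles, which gives 4 bins.
# (Plain quantiles are an assumption here, not CatBoost's border-selection method.)
borders = np.quantile(values, [0.25, 0.5, 0.75])

# Replace each value by the index of the bin it falls into.
bin_index = np.digitize(values, borders)

print(borders)
print(bin_index)
```

After this step the model only needs to compare small integer bin indices instead of raw floating point values.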
    
    
* **RESULT**:
    - **One of the columns duplicated the information in another column. After removing it, the default algorithm reached the best loss publicised by Yandex (~0.137). A Kaggle submission scored an AUC of 90%.**
    - Hyperopt tuning did not improve scores.
    - Yandex's claims held up on this dataset: CatBoost had the lowest loss among the boosting models, as shown in the table below.
    
    
<table>
  <tr>
    <th>Model</th>
    <th>Logloss from default</th>
  </tr>
  <tr>
    <td>Catboost</td>
    <td>0.13516505504697254</td>
  </tr>
  <tr>
    <td>Xgboost</td>
    <td>0.1554555542790197</td>
  </tr>
  <tr>
    <td>LightGBM</td>
    <td>0.16383632381872779</td>
  </tr>
</table>
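
The table can be turned into relative improvements with a few lines of arithmetic. This is a small sketch using only the three logloss values listed above; the helper name `relative_improvement` is hypothetical.

```python
# Default-model logloss values copied from the table above.
logloss = {
    "CatBoost": 0.13516505504697254,
    "XGBoost": 0.1554555542790197,
    "LightGBM": 0.16383632381872779,
}


def relative_improvement(baseline, candidate):
    """Fraction by which `candidate` lowers the loss relative to `baseline`."""
    return (baseline - candidate) / baseline


for name in ("XGBoost", "LightGBM"):
    gain = relative_improvement(logloss[name], logloss["CatBoost"])
    print(f"CatBoost vs {name}: {gain:.1%} lower logloss")
```

On this validation split the gain works out to about 13% over XGBoost and about 17.5% over LightGBM, which is in the direction of the ~20% figure but somewhat below it.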
    
    
    
## Table of Contents
* [Imports & Read in file](#first)
* [Explore data](#second)
* [Preprocessing](#third)
* [Baseline Model](#fifth)
* [Test set performance](#eighth)
* [Hyperparameter tuning](#sixth)
* [Model validation](#seventh)
* [Other Boosting Algorithms](#ninth)

<img src="https://images.freeimages.com/images/large-previews/753/go-to-work-1189863.jpg" alt="Drawing" style="width: 300px;"/>


* **Description of Features:**
    
<table>
  <tr>
    <th>Label</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>ACTION</td>
    <td>ACTION is 1 if the resource was approved, 0 if the resource was not</td>
  </tr>
  <tr>
    <td>RESOURCE</td>
    <td>An ID for each resource</td>
  </tr>
  <tr>
    <td>MGR_ID</td>
    <td>The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time</td>
  </tr>
  <tr>
    <td>ROLE_ROLLUP_1</td>
    <td>Company role grouping category id 1 (e.g. US Engineering)</td>
  </tr>
  <tr>
    <td>ROLE_ROLLUP_2</td>
    <td>Company role grouping category id 2 (e.g. US Retail)</td>
  </tr>
  <tr>
    <td>ROLE_DEPTNAME</td>
    <td>Company role department description (e.g. Retail)</td>
  </tr>
  <tr>
    <td>ROLE_TITLE</td>
    <td>Company role business title description (e.g. Senior Engineering Retail Manager)</td>
  </tr>
  <tr>
    <td>ROLE_FAMILY_DESC</td>
    <td>Company role family extended description (e.g. Retail Manager, Software Engineering)</td>
  </tr>
  <tr>
    <td>ROLE_FAMILY</td>
    <td>Company role family description (e.g. Retail Manager)</td>
  </tr>
  <tr>
    <td>ROLE_CODE</td>
    <td>Company role code; this code is unique to each role (e.g. Manager)</td>
  </tr>
</table>

## Imports & Read in file <a class="anchor" id="first"></a>

In [None]:
import shap
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix, precision_recall_curve, roc_curve, roc_auc_score, log_loss
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, cv, Pool
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from itertools import combinations

%matplotlib inline
sns.set(style='ticks')
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

In [None]:
# Read in data
test = pd.read_csv("../input/amazon-employee-access-challenge/test.csv")
train = pd.read_csv("../input/amazon-employee-access-challenge/train.csv")

### Functions

In [None]:
def performance(model, X_test, y_test):
    
    """
    Accepts a fitted model and an evaluation dataset as input.
    Prints the confusion matrix, AUC score, logloss & classification_report.
    Also displays the Precision-Recall curve & ROC curve.
    """
    
    # Make predictions on test set
    y_pred=model.predict(X_test)
    y_pred=np.round(y_pred)
    
    # Confusion matrix
    print(confusion_matrix(y_test, y_pred))
    
    # AUC score
    y_pred_prob = model.predict_proba(X_test)
    print("AUC score: ", roc_auc_score(y_test, y_pred_prob[:,1]))
    
    # Logloss
    print("Logloss : ", log_loss(y_test, y_pred_prob))

    # Accuracy, Precision, Recall, F1 score
    print(classification_report(y_test, y_pred))
    
    # Precision-Recall curve (built from probabilities so that every threshold is evaluated)
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_prob[:,1])
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)
    plt.show()

    # ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:,1])
    plt.plot([0, 1], [0, 1],'k--')
    plt.plot(fpr, tpr, label='Model')
    plt.legend()
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.show()
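
A precision-recall curve is only informative when it is computed from predicted probabilities rather than from hard 0/1 labels. The sketch below illustrates the difference on made-up labels and scores; none of these numbers come from the Amazon dataset.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.20, 0.30, 0.60, 0.70, 0.40, 0.80, 0.90, 0.95])
y_hard = (y_prob >= 0.5).astype(int)  # labels after thresholding at 0.5

# Probabilities give one candidate threshold per distinct score ...
_, _, thr_prob = precision_recall_curve(y_true, y_prob)
# ... whereas hard labels only contain the values 0 and 1.
_, _, thr_hard = precision_recall_curve(y_true, y_hard)

print(len(thr_prob), len(thr_hard))
```

With hard labels the curve collapses to a couple of points, so the trade-off between precision and recall cannot be read from it.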


## Explore data <a class="anchor" id="second"></a>

In [None]:
print("Train shape: {}, Test shape: {}".format(train.shape, test.shape))

In [None]:
train.head()



* Target: ACTION
* 9 categorical features (represented as numbers for privacy reasons)



### Null values

In [None]:
print(train.isnull().any()) 
print(test.isnull().any())

In [None]:
# Compare number of Unique Categorical labels for train and test

unique_train= pd.DataFrame([(col,train[col].nunique()) for col in train.columns], 
                           columns=['Columns', 'Unique categories'])
unique_test=pd.DataFrame([(col,test[col].nunique()) for col in test.columns],
                columns=['Columns', 'Unique categories'])
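# Drop the first row of each summary (ACTION for train, id for test) so only the features are compared
unique_train=unique_train[1:]
unique_test=unique_test[1:]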

fig, ax = plt.subplots(2, 1, sharex=True, sharey=True)
ax[0].bar(unique_train.Columns, unique_train['Unique categories'])
ax[1].bar(unique_test.Columns, unique_test['Unique categories'])
plt.xticks(rotation=90)

* The training and test sets differ in the number of unique categories per feature, so they do not cover the same set of category labels.
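
Comparing counts shows that the two sets differ, but not which labels are missing. A minimal sketch of a direct check follows; the toy frames and the helper name `unseen_categories` are hypothetical and only stand in for `train` and `test`.

```python
import pandas as pd

# Tiny made-up frames standing in for train and test.
train_toy = pd.DataFrame({"RESOURCE": [1, 2, 3, 3], "ROLE_TITLE": [10, 10, 11, 12]})
test_toy = pd.DataFrame({"RESOURCE": [2, 3, 4], "ROLE_TITLE": [10, 13, 13]})


def unseen_categories(train_df, test_df):
    """Return, per shared column, the labels present in test but absent from train."""
    shared = [c for c in test_df.columns if c in train_df.columns]
    return {c: set(test_df[c].unique()) - set(train_df[c].unique()) for c in shared}


print(unseen_categories(train_toy, test_toy))
```

Labels that only occur in the test set are the ones a model has never seen during training.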

### Balance of target labels

In [None]:
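# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.countplot(x=train.ACTION)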

* The classes are imbalanced: the 0 label (access denied) has far fewer rows than the 1 label.
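
The count plot shows the imbalance visually; the same information can be read off as numbers. This is a small sketch on a made-up target column, so the proportions printed here are not those of the real `ACTION` column.

```python
import pandas as pd

# Made-up target column; the real proportions come from train['ACTION'].
action = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], name="ACTION")

# Share of each label, and the ratio of negatives to positives.
shares = action.value_counts(normalize=True)
neg_to_pos = (action == 0).sum() / (action == 1).sum()

print(shares)
print(neg_to_pos)
```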

### Duplicated rows and columns

In [None]:
# Check for duplicated rows

if (sum(train.duplicated()), sum(test.duplicated())) == (0,0):
    print('No duplicated rows')
else: 
    print('train: ',sum(train.duplicated()))
    print('test: ',sum(test.duplicated()))

In [None]:
# Check for duplicated columns
# Two columns are flagged when they have the same number of unique values and
# the number of distinct value pairs equals the number of unique values in one of them.

for col1,col2 in combinations(train.columns, 2):
    n_pairs=len(train.groupby([col1,col2]).size())
    condition1=n_pairs==len(train.groupby([col1]).size())
    condition2=n_pairs==len(train.groupby([col2]).size())
    condition3=(train[col1].nunique()==train[col2].nunique())
    if (condition1 or condition2) and condition3:
        print(col1,col2)
        print('Potential Categorical column duplication')

In [None]:
train.groupby(['ROLE_TITLE', 'ROLE_CODE']).mean()

* ROLE_TITLE and ROLE_CODE represent the same data. One of the two features can be dropped. 
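
One way to confirm that two columns carry the same information is to test whether they map one-to-one onto each other. The sketch below does this on a made-up frame; the helper `is_one_to_one` is hypothetical and the toy values are not taken from the dataset.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """True if every value of col_a pairs with exactly one value of col_b and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


toy = pd.DataFrame({
    "ROLE_TITLE": [10, 10, 11, 12],
    "ROLE_CODE": [7, 7, 8, 9],
    "ROLE_FAMILY": [1, 1, 1, 2],
})

print(is_one_to_one(toy, "ROLE_TITLE", "ROLE_CODE"))
print(is_one_to_one(toy, "ROLE_TITLE", "ROLE_FAMILY"))
```

In the toy frame the first pair is a pure relabelling of each other, while the second pair is not, because one family groups several titles.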

## Preprocessing <a class="anchor" id="third"></a>

### Set random seed

In [None]:
np.random.seed(123)

In [None]:
# Drop duplicated column
train.drop('ROLE_CODE', axis=1, inplace=True)
test.drop('ROLE_CODE', axis=1, inplace=True)

In [None]:
# Split into features and target
y = train['ACTION']
X = train.drop('ACTION', axis=1)

# Split into train & validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8)

## Baseline Model <a class="anchor" id="fifth"></a>

* It is important to tell CatBoost which columns are categorical and which ones are text. If no information is provided - CatBoost assumes all features are numerical. 


* Default values of CatBoostClassifier() parameters depend on the type of input data, so CatBoost picks suitable settings automatically. CatBoostClassifier distinguishes between binary and multiclass problems: it assigns 'Logloss' as the 'loss_function' for binary problems and 'MultiClass' for multiclass problems. For regression problems, CatBoostRegressor uses 'RMSE' by default.


* The default number of 'iterations' is 1000. For the first run I set 'early_stopping_rounds' to 100 (for boosting algorithms, high patience values tend to give the best models) and 'use_best_model'=True, so that predictions come from the best iteration on the validation set rather than from the potentially worse model left in memory at the end of training. 'custom_metric' adds an extra plot to monitor while CatBoost fits (it does not change training performance). 'eval_metric' is the metric used for overfitting detection and 'best model' selection.


* 'eval_set' is optional when fitting. However, 'use_best_model' and early stopping rely on 'eval_metric' being computed on a validation set, so 'eval_set' must be provided when they are used.


* AUC is the metric of choice, as the Kaggle competition is scored on it. Logloss is also tracked to test Yandex's claims about CatBoost's superior performance.


* Input data can be in many different tabular forms. 
    - If only a dataframe is provided, first column is assumed to be the target. Rest of the columns are assumed to be features. 
    - We can provide a dataframe of features and a dataframe/array of target values, as we do in Sklearn. 
    - The Pool() class is specific to CatBoost. 
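
Since CatBoost needs to be told which columns are categorical, it can be safer to derive the column indices from the column names instead of hard-coding them. The following is a minimal sketch; the column names and their order are assumed to match the Kaggle files after ROLE_CODE has been dropped, and `X_toy` is a hypothetical stand-in for the feature frame.

```python
import pandas as pd

# Made-up, empty feature frame with the column order assumed in this notebook
# after ROLE_CODE has been dropped.
X_toy = pd.DataFrame(columns=[
    "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2",
    "ROLE_DEPTNAME", "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY",
])

# Every feature is categorical here, so list all column positions.
cat_feature_indices = [X_toy.columns.get_loc(name) for name in X_toy.columns]

print(cat_feature_indices)
```

Building the list this way keeps it correct if a column is added or removed later.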

In [None]:
# All 8 remaining features are categorical, so pass every column index
cat_features = list(range(8))

In [None]:
model = CatBoostClassifier(custom_metric=['TotalF1'], early_stopping_rounds=100, eval_metric='AUC')

model.fit(X_train, y_train, cat_features=cat_features,
          eval_set=(X_val, y_val), plot=True, verbose=False, use_best_model=True)
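
The interplay between early stopping and best-model selection can be sketched in plain Python. This is a simplified illustration of the idea, not CatBoost's implementation; the function name and the validation scores are made up.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, stop_index) for a metric where higher is better.

    Training stops once `patience` iterations have passed without beating the
    best score so far; the model is then rolled back to `best_index`.
    """
    best_index, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(scores) - 1


# Made-up validation AUC per iteration.
val_auc = [0.70, 0.80, 0.85, 0.84, 0.86, 0.85, 0.85, 0.84, 0.83]
print(best_iteration_with_patience(val_auc, patience=3))
```

The stopping point and the best iteration are different: training runs a little past the peak, and the model from the peak is the one that is kept.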

In [None]:
performance(model, X_val, y_val)

In [None]:
feat_imp=model.get_feature_importance(prettified=True)
plt.bar(feat_imp['Feature Id'], feat_imp['Importances'])
plt.xlabel('Features')
plt.ylabel('Feature Importance')
plt.xticks(rotation=90)

* The best model has a **loss of 0.137** and an **AUC score of 89.7%**. 


* RESOURCE & ROLE_DEPTNAME are the most important features. 


* Let us now look at model performance on the test set by making a submission to Kaggle. 


## Test set performance <a class="anchor" id="eighth"></a>

In [None]:
sub=pd.read_csv("../input/amazon-employee-access-challenge/sampleSubmission.csv")
# Check that the ids in the sample submission line up with the ids in the test set
sum(test.id==sub.Id), test.shape

In [None]:
y_pred=model.predict_proba(test.drop('id', axis=1))
sub.Action=y_pred[:,1]

In [None]:
sub.to_csv('amazon1.csv', index=False, header=True)
sub.head()

* The default model gives an **AUC score of 0.90373** on the Kaggle test set. Not bad for the first model!! 
* The winning submission has a score of 92%. Let's see if we can tune the model to squeeze out the last 2%. 


* We will study the parameters of the default model and try to provide a sensible range for hyperopt to tune over. 

In [None]:
model.get_all_params()


## Hyperparameter tuning <a class="anchor" id="sixth"></a>

In [None]:
"""
Commented out as it takes too long to run. 
Under construction, some things can be improved.
space = {
    'depth': hp.quniform("depth", 1, 16, 1),
    'border_count': hp.quniform('border_count', 32, 255, 1),
    'l2_leaf_reg': hp.uniform('l2_leaf_reg', 3, 8),
    #'rsm': hp.uniform('rsm', 0.1, 1), # use only when task_type is default CPU
    'scale_pos_weight': hp.uniform('scale_pos_weight', 0.06, 1), # Can be set only when loss_function is default Logloss
    #'loss_function' : hp.choice('loss_function', ['Logloss', 'CrossEntropy'])
}


def hyperparameter_tuning(space):
    # hp.quniform returns floats, so integer parameters are cast explicitly
    model = CatBoostClassifier(depth=int(space['depth']),
                               border_count=int(space['border_count']),
                               l2_leaf_reg=space['l2_leaf_reg'],
                               #rsm=space['rsm'],
                               scale_pos_weight=space['scale_pos_weight'],
                               #loss_function=space['loss_function'],
                               task_type='GPU', # change to CPU when working on personal system
                               eval_metric='AUC',
                               early_stopping_rounds=100,
                               thread_count=-1)

    model.fit(X_train, y_train, cat_features=cat_features,use_best_model=True,
              verbose=False, eval_set=(X_val, y_val))

    preds_class = model.predict_proba(X_val)
    #score = classification_report(y_val, preds_class, output_dict=True)['0']['f1-score']
    score = roc_auc_score(y_val, preds_class[:,1])
    return{'loss': 1-score, 'status': STATUS_OK}


best = fmin(fn=hyperparameter_tuning,
            space=space,
            algo=tpe.suggest,
            max_evals=50)

print(best)
"""

## Model Validation <a class="anchor" id="seventh"></a>

* Some of the different parameter sets returned as 'best' by hyperopt: 
* {'border_count': 209.0, 'depth': 8.0, 'l2_leaf_reg': 7.476976878626717, 'loss_function': 0, 'rsm': 0.7556557996868841}
* {'border_count': 248.0, 'depth': 4.0, 'l2_leaf_reg': 4.830204209625978, 'scale_pos_weight': 0.4107081177319144}
* {'border_count': 129.0, 'depth': 10.0, 'l2_leaf_reg': 4.450385969436819, 'scale_pos_weight': 0.1034646048953394}
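
Hyperopt reports values sampled with `hp.quniform` as floats (for example `'depth': 4.0`), so they need casting before being handed to a model that expects integers. A minimal sketch, using one of the parameter sets listed above; the helper `to_model_params` and the `INTEGER_PARAMS` tuple are hypothetical names introduced for illustration.

```python
# One of the parameter sets reported above, exactly as hyperopt printed it.
best = {'border_count': 248.0, 'depth': 4.0,
        'l2_leaf_reg': 4.830204209625978,
        'scale_pos_weight': 0.4107081177319144}

# Parameters that should be integers for the model constructor (assumed list).
INTEGER_PARAMS = ('border_count', 'depth')


def to_model_params(raw):
    """Copy `raw`, casting the integer-valued entries from float to int."""
    return {k: int(v) if k in INTEGER_PARAMS else v for k, v in raw.items()}


params = to_model_params(best)
print(params)
```

Note also that for a `hp.choice` dimension `fmin` returns the index of the chosen option rather than the option itself, which is why the first set above shows `'loss_function': 0`.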

In [None]:
# Best of the tuned models
model = CatBoostClassifier(border_count=248, depth=4, l2_leaf_reg=4.830204209625978,
                           scale_pos_weight=0.4107081177319144, 
                           eval_metric='AUC',
                           use_best_model=True,
                          early_stopping_rounds=100)
best=model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_val, y_val), use_best_model=True,
          verbose=False, plot=False)

In [None]:
performance(model, X_val, y_val)

### Shap

* Refer to https://www.kaggle.com/dansbecker/shap-values for explanation of SHAP. 
* "SHAP values interpret the impact of having a certain value for a given feature in comparison to the prediction we'd make if that feature took some baseline value."

In [None]:
model = CatBoostClassifier(border_count=248, depth=4, l2_leaf_reg=4.830204209625978,
                           scale_pos_weight=0.4107081177319144, iterations = 400)
model.fit(X_train, y_train, cat_features=cat_features,
          verbose=False, plot=False)
shap.initjs()
explainer = shap.TreeExplainer(model)

* We will look at SHAP values for a single row of the dataset (we arbitrarily chose row 2). For context, we'll look at the raw predictions before looking at the SHAP values. 

In [None]:
print('Probability of class 1 = {:.4f}'.format(model.predict_proba(X_train.iloc[2:3])[0][1]))
#print('Formula raw prediction = {:.4f}'.format(model.predict(X_train.iloc[0:1], prediction_type='RawFormulaVal')[0]))

In [None]:
shap_values = explainer.shap_values(Pool(X_train, y_train, cat_features=cat_features))
shap.force_plot(explainer.expected_value, shap_values[2,:], X_train.iloc[2,:])

* There is a 97% likelihood of the positive label for this instance.


* From the plot we see that the base value is 2.638. 
* The SHAP values of all features sum up to explain why our prediction (+3.46, a positive value) differs from the base value.
* The contribution of each feature to the change from the base value is shown in the plot. 
* 'RESOURCE' pushes the prediction towards a more positive value. All the other features push the prediction for row 2 towards a more negative value. 
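
The numbers in the force plot can be related to the 97% probability with a short calculation. This sketch assumes that the plot is in log-odds units and that 3.46 is the model output for row 2; both are readings of the plot, not values computed here.

```python
import math


def sigmoid(raw):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-raw))


base_value = 2.638   # base value read from the force plot
row_output = 3.46    # model output for row 2 read from the force plot (assumed log-odds)

print(f"Baseline probability: {sigmoid(base_value):.3f}")
print(f"Row 2 probability:    {sigmoid(row_output):.3f}")
print(f"Sum of SHAP values:   {row_output - base_value:.3f}")
```

Under these assumptions the row's output maps to a probability of roughly 0.97, and the SHAP values for the row add up to the gap between the output and the base value.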

In [None]:
shap.summary_plot(shap_values, X_train)

The red/blue dots show the contribution of every feature, for every instance, towards the predictions. 

### Cross validation to check for overfitting

* Explanation of what the cv() function in CatBoost does: "The dataset is split into N folds. N–1 folds are used for training, and one fold is used for model performance estimation. N models are updated on each iteration K. Each model is evaluated on its own validation dataset on each iteration. This produces N metric values on each iteration K."
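
The fold logic described in the quote can be illustrated with scikit-learn's `KFold`. This is a stand-in for the splitting step only, not CatBoost's own `cv()` implementation, and the six samples are made up.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)  # six made-up samples

n_folds = 3
times_validated = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=n_folds).split(X_toy)):
    # N-1 folds train the model, the remaining fold scores it.
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    times_validated[val_idx] += 1
```

Each sample ends up in a validation fold exactly once, which is what makes the averaged metric an estimate over the whole training set.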

In [None]:
model = CatBoostClassifier(border_count=248, depth=4, l2_leaf_reg=4.830204209625978,
                           scale_pos_weight=0.4107081177319144,
                           loss_function='Logloss',
                           eval_metric='AUC',
                           use_best_model=True,
                          early_stopping_rounds=100)
cv_data = cv(Pool(X_train, y_train, cat_features=cat_features), params=model.get_params(),
             verbose=False)

In [None]:
score = np.max(cv_data['test-AUC-mean'])
print('AUC score from cross-validation: ', score)

In [None]:
cv_data['test-AUC-mean'].plot()
plt.xlabel('Iterations')
plt.ylabel('test-AUC-Mean')

## Other Boosting Algorithms <a class="anchor" id="ninth"></a>

### Xgboost

In [None]:
clf = xgb.XGBClassifier()
clf.fit(X_train, y_train)

In [None]:
performance(clf, X_val, y_val)

### LightGBM

In [None]:
train_data = lgb.Dataset(X_train, label=y_train)

In [None]:
param = {'objective': 'binary'}
param['metric'] = 'auc'
bst = lgb.train(train_set=train_data, params=param)

In [None]:
y_pred_prob=bst.predict(X_val)
y_pred=np.round(y_pred_prob)

# Confusion matrix
print(confusion_matrix(y_val, y_pred))

In [None]:
# AUC score
print("AUC score: ", roc_auc_score(y_val, y_pred_prob))

In [None]:
# Logloss
print("Logloss : ", log_loss(y_val, y_pred_prob))

In [None]:
# Accuracy, Precision, Recall, F1 score
print(classification_report(y_val, y_pred))
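
Because the comparison between the three libraries rests on logloss, it may help to see how the metric is computed. The sketch below evaluates binary logloss by hand on made-up labels and probabilities and compares it with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([1, 1, 0, 1])
p_pos = np.array([0.9, 0.8, 0.3, 0.6])

# Binary logloss: mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

print(manual, log_loss(y_true, p_pos))
```

Lower values mean the predicted probabilities sit closer to the true labels, which is why the model with the smallest logloss is preferred in the comparison.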