<a href="https://colab.research.google.com/github/RM-RAMASAMY/decision_trees/blob/main/gbm_classifier_techniques/catboost_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [25]:
from google.colab import output
output.enable_custom_widget_manager()

In [1]:
# prompt: pull data from https://www.kaggle.com/competitions/amazon-employee-access-challenge/data

!pip install kaggle

# Upload your kaggle.json file (API key)
from google.colab import files
files.upload()

# Create the .kaggle directory and move the JSON file
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
!kaggle competitions download -c amazon-employee-access-challenge

# Unzip the downloaded files
!unzip amazon-employee-access-challenge.zip



Saving kaggle.json to kaggle.json
Downloading amazon-employee-access-challenge.zip to /content
  0% 0.00/1.53M [00:00<?, ?B/s]
100% 1.53M/1.53M [00:00<00:00, 90.7MB/s]
Archive:  amazon-employee-access-challenge.zip
  inflating: sampleSubmission.csv    
  inflating: test.csv                
  inflating: train.csv               


<a class="anchor" id="0"></a>
# **CatBoost Classifier in Python**

In this kernel, we will discuss **CatBoost**, an open-source library developed and contributed by Yandex. CatBoost can use categorical features directly and scales well to large datasets.

# **1. Introduction to CatBoost** <a class="anchor" id="1"></a>


- The CatBoost documentation says:

  **"CatBoost is a high-performance open source library for gradient boosting on decision trees."**
  
  
- So, CatBoost is an algorithm for gradient boosting on decision trees.


- It provides a ready-made classifier that follows scikit-learn conventions and handles categorical features automatically.


- Trained models can be exported to formats such as Apple’s Core ML, and CatBoost can be used alongside deep learning frameworks like Google’s TensorFlow.


- It can work with diverse data types to help solve a wide range of problems (described later) that businesses face today.


- It is developed by Yandex researchers and engineers and, according to its developers, is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction and many other tasks.


- Its developers report accuracy that is competitive with other leading gradient boosting libraries on standard benchmarks.


- It is especially powerful in two ways:


  1. It yields strong results without the extensive training data typically required by other machine learning methods, and

  2. It provides out-of-the-box support for the more descriptive data formats that accompany many business problems.


- **“CatBoost”** name comes from two words - **“Category”** and **“Boosting”**.


- **“Category”** refers to the library's native handling of categorical features. It also accepts numerical and text features.


- **“Boost”** comes from the gradient boosting machine learning algorithm, on which the library is based. Gradient boosting is widely applied to business problems such as fraud detection, item recommendation and forecasting, and it performs well on them. It can also return very good results with relatively little data, unlike deep learning models, which typically need to learn from a massive amount of data.


- It is open source and can be used by anyone.
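
To make the phrase "gradient boosting on decision trees" concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: it uses squared-error regression on a single numeric feature, and each tree is a one-split stump. The function names `fit_stump` and `boost` are hypothetical. CatBoost itself uses symmetric (oblivious) trees and ordered boosting, which are not shown here.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


# Toy data (hypothetical): a step from 1 to 5 between x=3 and x=4.
model = boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
print(round(model(2), 3), round(model(5), 3))
```

Each new tree corrects part of what the previous trees got wrong, and the learning rate controls how large each correction is.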

# **2. Advantages of CatBoost library** <a class="anchor" id="2"></a>




The advantages of the CatBoost library are as follows:


- **Performance**: CatBoost is competitive with leading machine learning algorithms and, in its developers' benchmarks, often achieves state-of-the-art results.


- **Handling Categorical features automatically**: We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features.


- **Robust**: It reduces the need for extensive hyper-parameter tuning and lowers the chance of overfitting, which leads to more generalized models. That said, CatBoost still has multiple parameters to tune, such as the number of trees, learning rate, regularization, tree depth, fold size, bagging temperature and others.


- **Easy-to-use**: We can use CatBoost from the command line or through user-friendly APIs for Python and R.
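
The "statistics" used to turn categories into numbers can be illustrated with a simplified sketch of an ordered target statistic. The assumptions are ours: each row is encoded as `(count_in_class + prior) / (total_count + 1)` using only rows that come earlier in a random order, so a row never sees its own label. The function name, the `prior` value and the toy data are hypothetical, and CatBoost's real implementation is considerably more involved.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each row's category from the targets of earlier rows only."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)   # random visiting order
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        # Use only what has been seen so far for this category.
        encoded[i] = (sums.get(c, 0) + prior) / (counts.get(c, 0) + 1)
        # Only now add this row's own target to the running totals.
        sums[c] = sums.get(c, 0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded


# Toy data (hypothetical): two manager IDs and whether access was granted.
encoded = ordered_target_stats(["m1", "m2", "m1", "m1", "m2"], [1, 0, 1, 0, 1])
print(encoded)
```

The first row visited for any category always receives the prior, because no earlier targets exist for it yet.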


# **3. Comparison of CatBoost and other Boosting algorithms** <a class="anchor" id="3"></a>




- We have multiple boosting libraries like XGBoost, H2O and LightGBM, and all of these perform well on a variety of problems.

- The CatBoost developers have compared its performance with these competitors on standard ML datasets.

- This comparison is depicted in the following diagram:



![Comparison of CatBoost and other Boosting algorithms](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/08/13153401/Screen-Shot-2017-08-13-at-3.33.33-PM-768x443.png)

- The comparison above shows the log-loss on test data, which is lowest for CatBoost in most cases. In this benchmark, CatBoost mostly performs better with both tuned and default parameters.

- In addition, CatBoost does not require converting the dataset into a library-specific format, as XGBoost and LightGBM traditionally do.
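
Since the benchmark is reported in log-loss, a short sketch of how binary log-loss is computed may help when reading it. This is a plain-Python illustration and not the benchmark's own code. The helper name and the toy probabilities are hypothetical.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy predictions (hypothetical) for the labels [1, 0, 1].
confident = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = log_loss([1, 0, 1], [0.6, 0.4, 0.5])
print(confident, hesitant)
```

Predictions that are both correct and confident give a lower log-loss, which is why a lower bar in the diagram indicates a better model.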

# **4. Implementation of CatBoost in Python** <a class="anchor" id="4"></a>




- Now, we will present an implementation of CatBoost in Python.

- The first step is to load the required libraries.

### **4.1 Load libraries** <a class="anchor" id="4.1"></a>



In [4]:
# Core libraries used throughout this notebook

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

### **4.2 Read dataset** <a class="anchor" id="4.2"></a>



In [5]:
train_df = pd.read_csv('train.csv')

In [6]:
test_df = pd.read_csv('test.csv')

### **4.3 EDA** <a class="anchor" id="4.3"></a>




- Now that we have loaded our dataset, it's time to gain some insights about our data.

- Let's preview the dataset.

### **Preview the dataset**

In [7]:
train_df.head()

|   | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |


In [8]:
test_df.head()

|   | id | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 78766 | 72734 | 118079 | 118080 | 117878 | 117879 | 118177 | 19721 | 117880 |
| 1 | 2 | 40644 | 4378 | 117961 | 118327 | 118507 | 118863 | 122008 | 118398 | 118865 |
| 2 | 3 | 75443 | 2395 | 117961 | 118300 | 119488 | 118172 | 301534 | 249618 | 118175 |
| 3 | 4 | 43219 | 19986 | 117961 | 118225 | 118403 | 120773 | 136187 | 118960 | 120774 |
| 4 | 5 | 42093 | 50015 | 117961 | 118343 | 119598 | 118422 | 300136 | 118424 | 118425 |


### **View summary of dataframe**

In [9]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32769 entries, 0 to 32768
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ACTION            32769 non-null  int64
 1   RESOURCE          32769 non-null  int64
 2   MGR_ID            32769 non-null  int64
 3   ROLE_ROLLUP_1     32769 non-null  int64
 4   ROLE_ROLLUP_2     32769 non-null  int64
 5   ROLE_DEPTNAME     32769 non-null  int64
 6   ROLE_TITLE        32769 non-null  int64
 7   ROLE_FAMILY_DESC  32769 non-null  int64
 8   ROLE_FAMILY       32769 non-null  int64
 9   ROLE_CODE         32769 non-null  int64
dtypes: int64(10)
memory usage: 2.5 MB


In [10]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58921 entries, 0 to 58920
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   id                58921 non-null  int64
 1   RESOURCE          58921 non-null  int64
 2   MGR_ID            58921 non-null  int64
 3   ROLE_ROLLUP_1     58921 non-null  int64
 4   ROLE_ROLLUP_2     58921 non-null  int64
 5   ROLE_DEPTNAME     58921 non-null  int64
 6   ROLE_TITLE        58921 non-null  int64
 7   ROLE_FAMILY_DESC  58921 non-null  int64
 8   ROLE_FAMILY       58921 non-null  int64
 9   ROLE_CODE         58921 non-null  int64
dtypes: int64(10)
memory usage: 4.5 MB


We can see that there are no missing values in the dataset.

### **View unique values in dataset**

In [11]:
train_df.nunique()

| Column | Unique values |
|---|---|
| ACTION | 2 |
| RESOURCE | 7518 |
| MGR_ID | 4243 |
| ROLE_ROLLUP_1 | 128 |
| ROLE_ROLLUP_2 | 177 |
| ROLE_DEPTNAME | 449 |
| ROLE_TITLE | 343 |
| ROLE_FAMILY_DESC | 2358 |
| ROLE_FAMILY | 67 |
| ROLE_CODE | 343 |


In [12]:
test_df.nunique()

| Column | Unique values |
|---|---|
| id | 58921 |
| RESOURCE | 4971 |
| MGR_ID | 4689 |
| ROLE_ROLLUP_1 | 126 |
| ROLE_ROLLUP_2 | 177 |
| ROLE_DEPTNAME | 466 |
| ROLE_TITLE | 351 |
| ROLE_FAMILY_DESC | 2749 |
| ROLE_FAMILY | 68 |
| ROLE_CODE | 351 |


### **Findings of EDA**

- All the features are categorical: although they are stored as `int64`, the values are identifiers rather than quantities.

- The categorical features have many unique values, so we won't use one-hot encoding. Depending on the dataset, it may still be a good idea to adjust `one_hot_max_size`.

- There are no missing values in the dataset.
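
To see what adjusting `one_hot_max_size` would mean for this dataset, the sketch below lists the features that would fall under a given threshold. It assumes, as we read the CatBoost documentation, that a categorical feature is one-hot encoded when its number of distinct values is less than or equal to `one_hot_max_size`. The helper name is hypothetical, and the cardinalities are copied from the `train_df.nunique()` output above.

```python
# Unique-value counts of the feature columns in train.csv (from the EDA above).
train_cardinality = {
    "RESOURCE": 7518,
    "MGR_ID": 4243,
    "ROLE_ROLLUP_1": 128,
    "ROLE_ROLLUP_2": 177,
    "ROLE_DEPTNAME": 449,
    "ROLE_TITLE": 343,
    "ROLE_FAMILY_DESC": 2358,
    "ROLE_FAMILY": 67,
    "ROLE_CODE": 343,
}


def one_hot_candidates(cardinality, one_hot_max_size):
    """Return features whose cardinality does not exceed the threshold."""
    return [name for name, n in cardinality.items() if n <= one_hot_max_size]


print(one_hot_candidates(train_cardinality, 2))
print(one_hot_candidates(train_cardinality, 128))
```

With a small threshold no feature qualifies, which matches the decision above not to rely on one-hot encoding here.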

### **4.4 Data Preparation** <a class="anchor" id="4.4"></a>




### **Declare feature vector and target variable**

In [13]:
X = train_df.drop("ACTION", axis=1)
y = train_df["ACTION"]


### **Declare categorical features**

In [14]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


### **4.5 Split data into train and validation set** <a class="anchor" id="4.5"></a>




In [15]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
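
As a rough picture of what an 80/20 holdout split does to the 32769 training rows, here is a plain-Python sketch. It is not scikit-learn's implementation: the shuffling differs, and we assume the validation size is rounded up. The function name `holdout_split` is hypothetical.

```python
import math
import random


def holdout_split(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a fraction for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(test_size * n_rows)   # assumption: round up
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = holdout_split(32769)
print(len(train_idx), len(val_idx))   # 26215 6554
```

Every row lands in exactly one of the two sets, so the validation metrics are computed on data the model has not been trained on.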

### **4.6 CatBoost implementation** <a class="anchor" id="4.6"></a>




In [17]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [18]:
from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=5,
    learning_rate=0.1,
    #loss_function='CrossEntropy'
)


clf.fit(X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_val, y_val),
        verbose=False
)

print('CatBoost model is fitted: ' + str(clf.is_fitted()))
print('CatBoost model parameters:')
print(clf.get_params())

CatBoost model is fitted: True
CatBoost model parameters:
{'iterations': 5, 'learning_rate': 0.1}


### **4.7 Stdout of the training**  <a class="anchor" id="4.7"></a>



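- **Stdout** is the standard output stream. Unless `verbose` is turned off, CatBoost prints the learn and validation loss for each iteration to it during training.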

In [19]:
from catboost import CatBoostClassifier
clf = CatBoostClassifier(
    iterations=10,
#     verbose=5,
)

clf.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_val, y_val),
)

Learning rate set to 0.5
0:	learn: 0.3971379	test: 0.3960691	best: 0.3960691 (0)	total: 42.4ms	remaining: 382ms
1:	learn: 0.2948071	test: 0.2924021	best: 0.2924021 (1)	total: 96ms	remaining: 384ms
2:	learn: 0.2485317	test: 0.2454599	best: 0.2454599 (2)	total: 141ms	remaining: 329ms
3:	learn: 0.2234301	test: 0.2191836	best: 0.2191836 (3)	total: 200ms	remaining: 300ms
4:	learn: 0.1999100	test: 0.1935203	best: 0.1935203 (4)	total: 246ms	remaining: 246ms
5:	learn: 0.1911956	test: 0.1831193	best: 0.1831193 (5)	total: 278ms	remaining: 185ms
6:	learn: 0.1854231	test: 0.1763719	best: 0.1763719 (6)	total: 335ms	remaining: 143ms
7:	learn: 0.1818392	test: 0.1732419	best: 0.1732419 (7)	total: 383ms	remaining: 95.7ms
8:	learn: 0.1789139	test: 0.1679366	best: 0.1679366 (8)	total: 413ms	remaining: 45.9ms
9:	learn: 0.1777264	test: 0.1656210	best: 0.1656210 (9)	total: 453ms	remaining: 0us

bestTest = 0.1656210392
bestIteration = 9



<catboost.core.CatBoostClassifier at 0x7cba8a551de0>
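
The `bestTest` and `bestIteration` lines at the end of the log can be recovered from the per-iteration `test` column. The sketch below copies those ten values from the log above and picks the smallest one. It illustrates how to read the log and is not CatBoost's internal logic.

```python
# Validation ("test") loss per iteration, copied from the training log above.
test_loss = [
    0.3960691, 0.2924021, 0.2454599, 0.2191836, 0.1935203,
    0.1831193, 0.1763719, 0.1732419, 0.1679366, 0.1656210,
]

# The best iteration is the index with the lowest validation loss.
best_iteration = min(range(len(test_loss)), key=test_loss.__getitem__)
best_test = test_loss[best_iteration]

print(best_iteration, best_test)   # 9 0.165621
```

Here the best iteration is also the last one and the loss is still falling, which suggests that more iterations might lower it further.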

### **4.8 Model predictions** <a class="anchor" id="4.8"></a>



In [21]:
print(clf.predict_proba(X_val))

[[0.02985352 0.97014648]
 [0.01955598 0.98044402]
 [0.01780024 0.98219976]
 ...
 [0.04551625 0.95448375]
 [0.11208538 0.88791462]
 [0.02216136 0.97783864]]


In [22]:
print(clf.predict(X_val))

[1 1 1 ... 1 1 1]
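
The two outputs above are related: each row of `predict_proba` holds the probabilities of class 0 and class 1, and `predict` returns a class label. The sketch below uses the six rows printed above and assumes a 0.5 decision threshold for binary classification.

```python
# Rows printed by predict_proba above: [P(class 0), P(class 1)].
probabilities = [
    [0.02985352, 0.97014648],
    [0.01955598, 0.98044402],
    [0.01780024, 0.98219976],
    [0.04551625, 0.95448375],
    [0.11208538, 0.88791462],
    [0.02216136, 0.97783864],
]

# Assumption: a row is labelled 1 when P(class 1) exceeds 0.5.
labels = [int(p1 > 0.5) for _, p1 in probabilities]
print(labels)   # [1, 1, 1, 1, 1, 1]
```

All six rows shown are predicted as class 1, which is consistent with the `predict` output.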


### **4.9 Metrics calculation and graph plotting** <a class="anchor" id="4.9"></a>



In [37]:
from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=50,
    random_seed=42,
    learning_rate=0.5,
    custom_loss=['AUC', 'Accuracy']
)

clf.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_val, y_val),
    verbose=False,
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7cba89966fb0>

In [29]:
# Get the loss values for the training and validation sets
train_loss = clf.get_evals_result()['learn']['Logloss']
val_loss = clf.get_evals_result()['validation']['Logloss']

# Print or plot the loss values
print(f"Training Loss: {train_loss}")
print(f"Validation Loss: {val_loss}")

Training Loss: [0.3969304953585906, 0.2933474640919105, 0.2466417254928546, 0.2142165322757052, 0.19715630795232608, 0.1893637183915301, 0.18341997994238496, 0.1817675692763587, 0.17987449455823548, 0.17714750364382495, 0.17559690034521233, 0.1752024413949654, 0.17486013331381933, 0.17430905063379457, 0.17297524131762965, 0.17263610461822354, 0.1723376425128527, 0.17174387245019673, 0.17173211485514986, 0.17169646287068024, 0.17092017933808246, 0.17027199077710994, 0.1699475401302213, 0.16925895225983895, 0.16916669100103138, 0.16912002912139862, 0.16908944814589058, 0.16843075766985982, 0.16840942928023023, 0.1683841412807354, 0.1683656323432173, 0.16816294741859408, 0.1681544779213337, 0.16753862143426373, 0.16740396979709143, 0.1673925451926624, 0.16733981453165303, 0.16724781494857371, 0.16708782249710655, 0.16682288616147933, 0.16675542599798168, 0.16665589736241704, 0.16595666754805458, 0.1659128586473469, 0.1659056553687471, 0.16585992482681677, 0.16582494186163207, 0.1657871579

In [30]:
from sklearn.metrics import accuracy_score

# Make predictions on the validation set
y_pred = clf.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9476655477570949
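
Accuracy is easier to interpret next to a simple baseline, because the predictions above are dominated by class 1. The sketch below computes accuracy and the score of always predicting the most frequent class. The labels are toy values invented for illustration, not taken from the dataset, and both helper names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels (hypothetical): 8 of 10 requests are approved.
y_true = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
y_pred = [1] * 10   # a model that approves everything

print(accuracy(y_true, y_pred), majority_baseline(y_true))   # 0.8 0.8
```

If a model's accuracy is close to this baseline, metrics such as AUC or log-loss say more about its quality than accuracy does.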


In [31]:
# Get feature importance
feature_importance = clf.get_feature_importance()

# Print feature importance
for feature, importance in zip(X_train.columns, feature_importance):
    print(f"{feature}: {importance}")

RESOURCE: 18.808315295952255
MGR_ID: 38.705149889202616
ROLE_ROLLUP_1: 2.7552496039364525
ROLE_ROLLUP_2: 8.574247237441977
ROLE_DEPTNAME: 6.044169933388906
ROLE_TITLE: 4.632810464339024
ROLE_FAMILY_DESC: 10.821455465487272
ROLE_FAMILY: 3.459510158798528
ROLE_CODE: 6.199091951452958
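
The importances are easier to read when sorted. The sketch below copies the values printed above, ranks them, and checks that they add up to roughly 100, as they do in this output.

```python
# Feature importances copied from the output above.
feature_importance = {
    "RESOURCE": 18.808315295952255,
    "MGR_ID": 38.705149889202616,
    "ROLE_ROLLUP_1": 2.7552496039364525,
    "ROLE_ROLLUP_2": 8.574247237441977,
    "ROLE_DEPTNAME": 6.044169933388906,
    "ROLE_TITLE": 4.632810464339024,
    "ROLE_FAMILY_DESC": 10.821455465487272,
    "ROLE_FAMILY": 3.459510158798528,
    "ROLE_CODE": 6.199091951452958,
}

# Rank features from most to least important.
ranked = sorted(feature_importance.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.2f}")

total = sum(feature_importance.values())
```

In this run, `MGR_ID` and `RESOURCE` together account for more than half of the total importance.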


# **5. Results and Conclusion**   <a class="anchor" id="5"></a>




- In this kernel, we have discussed **CatBoost**, which is a high-performance open source library for gradient boosting on decision trees.

- We have also discussed the advantages of the **CatBoost** library.

- We have also presented the comparison between **CatBoost** and other **Boosting** algorithms.

- We have also presented a baseline implementation of CatBoost in Python.

So, we have now come to the end of this kernel.

I hope you find this kernel useful and enjoyable.

Your comments and feedback are most welcome.

Thank you


[Go to Top](#0)

---

