# Solving classification problems with CatBoost

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/classification/classification_tutorial.ipynb)

In this tutorial we will use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition for our experiments. The data can be downloaded [here](https://www.kaggle.com/c/amazon-employee-access-challenge/data).

Link to the [YouTube video](https://youtu.be/xl1fwCza9C8?t=644).

## Libraries installation

In [1]:
#!pip install --user --upgrade catboost
#!pip install --user --upgrade ipywidgets
#!pip install shap
#!pip install scikit-learn
#!pip install --upgrade numpy
#!jupyter nbextension enable --py widgetsnbextension

In [1]:
import catboost
print(catboost.__version__)
!python --version

0.26.1
Python 3.9.12


## Reading the data

In [2]:
import pandas as pd
import os
import numpy as np
np.set_printoptions(precision=4)
import catboost
from catboost import *
from catboost import datasets

Every column in this dataset contains numbers, but these numbers are hashes of strings, so they should be treated as categorical variables.

In [3]:
(train_df, test_df) = catboost.datasets.amazon()
train_df.head()

| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |
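
One quick way to see that these numbers behave like identifiers rather than quantities is to count the distinct values in each column. The sketch below does this in plain Python on three columns of the five rows shown above, copied by hand; on the full frame, `train_df.nunique()` gives the same kind of summary.

```python
# Three columns from the five rows displayed above, copied by hand.
rows = [
    {"RESOURCE": 39353, "MGR_ID": 85475, "ROLE_ROLLUP_1": 117961},
    {"RESOURCE": 17183, "MGR_ID": 1540, "ROLE_ROLLUP_1": 117961},
    {"RESOURCE": 36724, "MGR_ID": 14457, "ROLE_ROLLUP_1": 118219},
    {"RESOURCE": 36135, "MGR_ID": 5396, "ROLE_ROLLUP_1": 117961},
    {"RESOURCE": 42680, "MGR_ID": 5905, "ROLE_ROLLUP_1": 117929},
]

# Count the distinct values in each column.
distinct_counts = {
    column: len({row[column] for row in rows})
    for column in rows[0]
}
print(distinct_counts)
```

Values that repeat across rows, as in `ROLE_ROLLUP_1`, are what one would expect from a shared category label.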


## Preparing your data

Declaring the categorical features: every column of `X` is categorical, so we list all column indices.

In [4]:
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]
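
When only some columns are categorical, it can be less error-prone to name them and derive the indices from the names. A minimal sketch, using the column names of this dataset and a hypothetical choice of categorical columns made up purely for illustration:

```python
# Feature columns of the Amazon dataset, in the order they appear in X.
feature_columns = [
    "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2", "ROLE_DEPTNAME",
    "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY", "ROLE_CODE",
]

# Hypothetical subset chosen only for illustration; in this tutorial
# every column is categorical.
categorical_names = ["MGR_ID", "ROLE_TITLE", "ROLE_CODE"]

# Translate the names into zero-based positions within the feature columns.
cat_feature_indices = [feature_columns.index(name) for name in categorical_names]
print(cat_feature_indices)
```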


Looking at the label balance in the dataset

In [5]:
print('Labels: {}'.format(set(y)))
print('Zero count = {}, One count = {}'.format(len(y) - sum(y), sum(y)))

Labels: {0, 1}
Zero count = 1897, One count = 30872
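
The counts above show a strongly imbalanced dataset. The short sketch below turns them into shares, which makes it easier to see what a trivial "always predict 1" baseline would score. The variable names are illustrative only.

```python
# Counts reported by the cell above.
zero_count = 1897
one_count = 30872
total = zero_count + one_count

positive_share = one_count / total
negative_share = zero_count / total

# A model that always predicts the majority class would already reach
# this accuracy, so accuracy alone says little on this dataset.
majority_baseline_accuracy = max(positive_share, negative_share)

print('total = {}'.format(total))
print('share of ones  = {:.4f}'.format(positive_share))
print('share of zeros = {:.4f}'.format(negative_share))
```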


There are several ways to create a `Pool`. CatBoost can read delimited text files (`.tsv`, `.csv`), with or without a header.

In [6]:
# Create the folder where we will store the data.
dataset_dir = './amazon'
os.makedirs(dataset_dir, exist_ok=True)

# Create train/test files.

# First create .tsv files, where the separator is \t and there is no header.
train_df.to_csv(
    os.path.join(dataset_dir, 'train.tsv'),
    index=False, sep='\t', header=False
)
test_df.to_csv(
    os.path.join(dataset_dir, 'test.tsv'),
    index=False, sep='\t', header=False
)

# We can also create .csv files, where the separator is a comma and a header is present.
train_df.to_csv(
    os.path.join(dataset_dir, 'train.csv'),
    index=False, sep=',', header=True
)
test_df.to_csv(
    os.path.join(dataset_dir, 'test.csv'),
    index=False, sep=',', header=True
)

In [7]:
!head amazon/train.csv

ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,17183,1540,117961,118343,123125,118536,118536,308574,118539
1,36724,14457,118219,118220,117884,117879,267952,19721,117880
1,36135,5396,117961,118343,119993,118321,240983,290919,118322
1,42680,5905,117929,117930,119569,119323,123932,19793,119325
0,45333,14561,117951,117952,118008,118568,118568,19721,118570
1,25993,17227,117961,118343,123476,118980,301534,118295,118982
1,19666,4209,117961,117969,118910,126820,269034,118638,126822
1,31246,783,117961,118413,120584,128230,302830,4673,128231
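
The two layouts need slightly different handling when they are read back. As a small sketch with pandas, using two rows trimmed to three columns and inlined as strings so that it runs on its own:

```python
import io

import pandas as pd

# Two rows from the files written above, trimmed to three columns.
tsv_text = "1\t39353\t85475\n0\t45333\t14561\n"
csv_text = "ACTION,RESOURCE,MGR_ID\n1,39353,85475\n0,45333,14561\n"

# No header in the file: tell pandas not to consume the first data row.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t', header=None)

# Header present: the first line becomes the column names.
from_csv = pd.read_csv(io.StringIO(csv_text), sep=',')

print(from_tsv.columns.tolist())
print(from_csv.columns.tolist())
```

Both frames hold the same values; only the column labels differ, which is why a headerless file needs the column names supplied separately.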


Create a column description file for CatBoost, so that it knows which column holds the target, which columns hold features, and so on. Here we specify the column types and names.

In [11]:
from catboost.utils import create_cd

feature_names = dict()
for i, column_name in enumerate(train_df.columns):
    # skip first column, the target column
    if i != 0:
        feature_names[i-1] = column_name

# Note that feature_names is keyed by zero-based feature index: the target (column 0 of the file) is not counted.
feature_names

{0: 'RESOURCE',
 1: 'MGR_ID',
 2: 'ROLE_ROLLUP_1',
 3: 'ROLE_ROLLUP_2',
 4: 'ROLE_DEPTNAME',
 5: 'ROLE_TITLE',
 6: 'ROLE_FAMILY_DESC',
 7: 'ROLE_FAMILY',
 8: 'ROLE_CODE'}
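
The same mapping can be built in one line. A minimal equivalent sketch, with the column list written out so that it runs without the dataset:

```python
# Column order of train_df: the target first, then the nine features.
columns = [
    "ACTION", "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2",
    "ROLE_DEPTNAME", "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY", "ROLE_CODE",
]

# Drop the target column, then number the remaining names from zero.
feature_names = dict(enumerate(columns[1:]))
print(feature_names)
```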

In [9]:
create_cd(
    label=0,  # A zero-based index of the target variable. 
    cat_features=list(range(1, train_df.columns.shape[0])),
    feature_names=feature_names,
    output_path=os.path.join(dataset_dir, 'train.cd')
)

!cat amazon/train.cd

0	Label	
1	Categ	RESOURCE
2	Categ	MGR_ID
3	Categ	ROLE_ROLLUP_1
4	Categ	ROLE_ROLLUP_2
5	Categ	ROLE_DEPTNAME
6	Categ	ROLE_TITLE
7	Categ	ROLE_FAMILY_DESC
8	Categ	ROLE_FAMILY
9	Categ	ROLE_CODE
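
The file itself is plain text: one line per column, holding the zero-based column index, the column type and an optional name, separated by tabs. As a rough illustration of the format (not of how `create_cd` is implemented), the hypothetical helper `format_cd` below produces the same lines with the standard library only. Unlike `feature_names` above, its `names_by_column` argument is keyed by file column index.

```python
def format_cd(label_index, cat_columns, names_by_column):
    """Return column-description text: index<TAB>type<TAB>name per line.

    Hypothetical helper for illustration; all indices are zero-based
    positions of columns in the data file.
    """
    types = {label_index: 'Label'}
    for column in cat_columns:
        types[column] = 'Categ'
    lines = [
        '{}\t{}\t{}'.format(column, types[column], names_by_column.get(column, ''))
        for column in sorted(types)
    ]
    return '\n'.join(lines) + '\n'


feature_columns = [
    "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2", "ROLE_DEPTNAME",
    "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY", "ROLE_CODE",
]

# Column 0 is the label, so feature i lives in file column i + 1.
names_by_column = {i + 1: name for i, name in enumerate(feature_columns)}

cd_text = format_cd(0, range(1, 10), names_by_column)
print(cd_text)
```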


Now suppose one of the features is numerical. Columns that are not listed in the `.cd` file are treated as numerical by default, so the numerical column is simply left out of the file.

In [10]:
# Pretend that feature 0 (column 1 in the file) is numerical, just as an example.
if 0 in feature_names:
    del feature_names[0]

create_cd(
    label=0,  # A zero-based index of the target variable. 
    cat_features=[2, 3, 4, 5, 6, 7, 8, 9],
    feature_names=feature_names,
    output_path=os.path.join(dataset_dir, 'tmp.cd')
)

!cat amazon/tmp.cd

0	Label	
2	Categ	MGR_ID
3	Categ	ROLE_ROLLUP_1
4	Categ	ROLE_ROLLUP_2
5	Categ	ROLE_DEPTNAME
6	Categ	ROLE_TITLE
7	Categ	ROLE_FAMILY_DESC
8	Categ	ROLE_FAMILY
9	Categ	ROLE_CODE
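
To make the default explicit, the sketch below reads the text of `tmp.cd` and reports a type for every file column, falling back to `Num` for columns the file does not mention. `column_types` is a hypothetical helper that mimics only this documented default; it is not CatBoost's own parser.

```python
def column_types(cd_text, n_columns):
    """Return the type of every file column, defaulting to 'Num'.

    Hypothetical helper: it mimics only the default for unlisted
    columns, not CatBoost's actual parser.
    """
    types = ['Num'] * n_columns
    for line in cd_text.splitlines():
        if not line.strip():
            continue
        fields = line.split('\t')
        types[int(fields[0])] = fields[1]
    return types


# Contents of tmp.cd as printed above.
tmp_cd = (
    "0\tLabel\t\n"
    "2\tCateg\tMGR_ID\n"
    "3\tCateg\tROLE_ROLLUP_1\n"
    "4\tCateg\tROLE_ROLLUP_2\n"
    "5\tCateg\tROLE_DEPTNAME\n"
    "6\tCateg\tROLE_TITLE\n"
    "7\tCateg\tROLE_FAMILY_DESC\n"
    "8\tCateg\tROLE_FAMILY\n"
    "9\tCateg\tROLE_CODE\n"
)

types = column_types(tmp_cd, n_columns=10)
print(types)
```

Column 1 (`RESOURCE`) is the only one missing from the file, so it is the only one reported as `Num`.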


Now create a dataset for training a CatBoost model.

In [12]:
# 1. Create dataset from pandas dataframe and list of categorical features.
pool1 = Pool(data=X, label=y, cat_features=cat_features)

# 2. Create dataset from a file. We need to specify the delimiter, the header flag and the column description file.
pool2 = Pool(
    data=os.path.join(dataset_dir, 'train.csv'), 
    delimiter=',', 
    column_description=os.path.join(dataset_dir, 'train.cd'),
    has_header=True
)

# 3. Dataset without target, for example for testing. Create from pandas dataframe
pool3 = Pool(data=X, cat_features=cat_features)


# 4. From the FeaturesData class.
# The fastest way to create a Pool is from numpy arrays wrapped in FeaturesData.
# Creating a Pool from this representation is much faster than from a
# generic numpy.ndarray, pandas.DataFrame or pandas.Series
# if the dataset contains both numerical and categorical features,
# most of which are numerical.

# This method has zero memory overhead.

# FeaturesData takes numpy arrays.
# For the FeaturesData class, categorical features must have type str.
X_prepared = X.values.astype(str).astype(object)
data = FeaturesData(
    num_feature_data=None,  # dataset does not contain numerical features
    cat_feature_data=X_prepared,
    cat_feature_names=X.columns.values.tolist(),
)

pool4 = Pool(
    data=data,
    label=y.values
)

print('Dataset shape')
print('dataset 1:' + str(pool1.shape) +
      '\ndataset 2:' + str(pool2.shape) + 
      '\ndataset 3:' + str(pool3.shape) +
      '\ndataset 4:' + str(pool4.shape))

print('\n')
print('Column names')
print('dataset 1:')
print(pool1.get_feature_names()) 
print('\ndataset 2:')
print(pool2.get_feature_names())
print('\ndataset 3:')
print(pool3.get_feature_names())
print('\ndataset 4:')
print(pool4.get_feature_names())

Dataset shape
dataset 1:(32769, 9)
dataset 2:(32769, 9)
dataset 3:(32769, 9)
dataset 4:(32769, 9)


Column names
dataset 1:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 2:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 3:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 4:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']


In [13]:
pool1.get_label()

array([1, 1, 1, ..., 1, 1, 1])