# CatBoost Classifier

Most machine learning projects have to deal with categorical data at some point. Most scikit-learn estimators expect numeric input, so categories must first be converted to numbers with preprocessing methods such as “label encoding” or “one-hot encoding”.
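To make the two encodings concrete, here is a minimal pure-Python sketch on a made-up column. The `departments` values are hypothetical and only illustrate the idea; in practice a library encoder would normally do this work.

```python
# Toy column of categorical values (made up for illustration).
departments = ["Retail", "Engineering", "Retail", "Sales"]

# Label encoding: map each distinct category to an integer code.
categories = sorted(set(departments))  # ['Engineering', 'Retail', 'Sales']
label_of = {cat: code for code, cat in enumerate(categories)}
label_encoded = [label_of[d] for d in departments]

# One-hot encoding: one binary indicator column per category.
one_hot_encoded = [[int(d == cat) for cat in categories] for d in departments]

print(label_encoded)    # [1, 0, 1, 2]
print(one_hot_encoded)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one column per category.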


In this notebook, we will use **CatBoost**, an open-source library developed by Yandex. CatBoost can use categorical features directly, without manual encoding, and is scalable.

# Install CatBoost

> `$ pip install catboost`

# Dataset
The dataset contains real historical records collected in 2010 and 2011. Over time, employees are manually allowed or denied access to resources.

**ACTION**: 1 if access to the resource was approved, 0 if it was not **(label)**<br>
**RESOURCE**: An ID for each resource<br>
**MGR_ID**: The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time<br>
**ROLE_ROLLUP_1**: Company role grouping category id 1 (e.g. US Engineering)<br>
**ROLE_ROLLUP_2**: Company role grouping category id 2 (e.g. US Retail)<br>
**ROLE_DEPTNAME**: Company role department description (e.g. Retail)<br>
**ROLE_TITLE**: Company role business title description (e.g. Senior Engineering Retail Manager)<br>
**ROLE_FAMILY_DESC**: Company role family extended description (e.g. Retail Manager, Software Engineering)<br>
**ROLE_FAMILY**: Company role family description (e.g. Retail Manager)<br>
**ROLE_CODE**: Company role code; this code is unique to each role (e.g. Manager)<br>

The task is to build a model, trained on this historical data, that determines an employee's access needs so that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. The model takes an employee's role information and a resource code and returns whether or not access should be granted.

## Load Packages

In [None]:
# import numpy as np 
import pandas as pd 

## Read dataset

In [None]:
df = pd.read_csv('employee_data.csv')

# EDA

## Preview the dataset

In [None]:
df.shape

In [None]:
df.head()

## View summary of dataframe

In [None]:
df.info()

We can see that there are no missing values in the dataset.

## Describe Dataframe

**EXERCISE**

## Unique values in dataset

**EXERCISE**
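One possible way to approach this exercise is pandas' `DataFrame.nunique()`, sketched below on a toy frame. The values are made up; in the notebook the same call would be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up values; the notebook would use `df` from read_csv.
toy = pd.DataFrame({
    "ACTION": [1, 1, 0, 1],
    "RESOURCE": [101, 102, 101, 103],
    "ROLE_CODE": [7, 7, 7, 8],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()
print(unique_counts)
```

Columns with few distinct values relative to the number of rows are usually good candidates for categorical treatment.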

# Data Preparation

## Separate feature vector and target variable

**EXERCISE**

In [None]:
X = 
y = 

## Categorical features declaration

**EXERCISE:** Identify the categorical features among all the features and put them in the list below.

In [None]:
cat_features = [ ]
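The two preparation steps above could look roughly like the sketch below. It uses a toy frame with made-up values and assumes that every feature column is an ID or category code, as the column descriptions suggest; the `_toy` names are hypothetical and chosen so they do not clash with the exercise variables.

```python
import pandas as pd

# Toy frame with made-up values, using a few of the column names listed above.
toy = pd.DataFrame({
    "ACTION": [1, 1, 0, 1],
    "RESOURCE": [101, 102, 101, 103],
    "MGR_ID": [5, 6, 5, 7],
    "ROLE_CODE": [7, 7, 7, 8],
})

# Target: the ACTION label. Features: every other column.
y_toy = toy["ACTION"]
X_toy = toy.drop(columns=["ACTION"])

# Assumption: every remaining column is an ID or category code,
# so all of them are declared categorical.
cat_features_toy = list(X_toy.columns)

print(cat_features_toy)  # ['RESOURCE', 'MGR_ID', 'ROLE_CODE']
```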

## Split data into train and validation set

In [None]:
from sklearn.model_selection import train_test_split

**EXERCISE:** Choose a train-test split ratio (for example 80-20 or 70-30) and set `test_size` accordingly.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= , random_state=0)
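As an illustration of how `test_size` maps to a ratio, the sketch below performs an 80-20 split on toy data. The data and the `stratify` argument are assumptions added for the example, not part of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with made-up values: 20 rows, 16 approved and 4 denied.
X_toy = pd.DataFrame({"RESOURCE": list(range(20))})
y_toy = pd.Series([1] * 16 + [0] * 4, name="ACTION")

# test_size=0.2 gives an 80-20 split; stratify keeps the class ratio
# similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_val))  # 16 4
```

A fixed `random_state` makes the split reproducible across runs.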

# CatBoost Model Training

In [None]:
from catboost import CatBoostClassifier

**EXERCISE:** Define the CatBoost Model

In [None]:
clf = CatBoostClassifier(task_type=, iterations=,
                         random_state=,
                         eval_metric=)

In [None]:
clf.fit(X_train, y_train, 
        cat_features=, 
        eval_set=(X_test, y_test), 
        plot=False
)

# Predictions

In [None]:
from utils import predict_and_evaluate

In [None]:
res = predict_and_evaluate(clf, X_test, y_test)