In [None]:
'''
CatBoost
    - The name "CatBoost" combines two words: "Category" and "Boosting".
    - It is a machine learning library that handles categorical (CAT) data automatically.
    - The library works well with multiple categories of data, such as audio, text, and image data, as well as historical data.
    - "Boost" comes from the gradient boosting machine learning algorithm, on which the library is based.
    - CatBoost is another member of the family of gradient boosting techniques on decision trees.
    - In addition to regression and classification, CatBoost can be used in ranking, recommendation systems, forecasting,
      and even personal assistants.

In [None]:
'''
Categorical Features Support:
    With most other machine learning algorithms, after preprocessing and cleaning your data, the data has to be converted into
    numerical features so that the model can process it and make predictions.
    
    This is similar to text-related models, where the text data is converted into numerical data using techniques known as
    word embeddings.
    
    This encoding or conversion step is time-consuming. CatBoost supports working with non-numeric factors directly,
    which saves time and can improve your training results.
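
To make the conversion step concrete, here is a minimal sketch of the kind of manual encoding that other libraries typically need before training. It uses only the Python standard library; the sample column and the helper name `one_hot_encode` are illustrative assumptions, not part of CatBoost. With CatBoost this step can be skipped by passing the categorical column indices through `cat_features`, as the example later in this notebook does.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector (hypothetical helper for illustration)."""
    categories = sorted(set(values))            # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                   # mark the column for this category
        encoded.append(row)
    return categories, encoded


# Toy categorical column (assumed sample data)
embarked = ['S', 'C', 'S', 'Q']
categories, encoded = one_hot_encode(embarked)
print(categories)  # ['C', 'Q', 'S']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Every new category adds a column, which is one reason this preprocessing becomes costly on columns with many distinct values.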
    

In [None]:
'''
Missing Values Handling
    CatBoost supports three modes for processing missing values: "Forbidden", "Min", and "Max".

    For "Forbidden", CatBoost treats missing values as not supported, and their presence is interpreted as an error.

    For "Min", missing values are processed as the minimum value for a feature (smaller than all other values).
    With this method, the split that separates missing values from all other values is considered when selecting splits.

    "Max" works the same way as "Min", except that missing values are processed as the maximum value for a feature.

    LightGBM and XGBoost handle missing values similarly to each other: in each split, the missing values are allocated
    to the side that reduces the loss.
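
The three modes can be illustrated with a small standard-library sketch. This is not CatBoost's internal implementation: the helper `apply_nan_mode` is hypothetical and only mimics the described behavior by placing missing values (represented here as `None`) below or above all observed values. In CatBoost itself the mode is selected with the `nan_mode` training parameter.

```python
def apply_nan_mode(values, mode):
    """Replace None according to a CatBoost-style missing-value mode (illustrative only)."""
    present = [v for v in values if v is not None]
    if mode == 'Forbidden':
        if len(present) != len(values):
            raise ValueError('missing values are not supported in Forbidden mode')
        return list(values)
    if mode == 'Min':
        fill = min(present) - 1   # any value below every observed value
    elif mode == 'Max':
        fill = max(present) + 1   # any value above every observed value
    else:
        raise ValueError(f'unknown mode: {mode}')
    return [fill if v is None else v for v in values]


# Toy feature column with one missing entry (assumed sample data)
fares = [7.25, None, 71.5, 8.0]
print(apply_nan_mode(fares, 'Min'))  # [7.25, 6.25, 71.5, 8.0]
print(apply_nan_mode(fares, 'Max'))  # [7.25, 72.5, 71.5, 8.0]
```

Because the filled value sits strictly outside the observed range, a threshold placed between it and the nearest observed value separates the missing rows from all others, which is the split the text describes.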

In [None]:
'''
When to use CatBoost
    - When you need a short training time on robust data
    - When you are working on a small dataset
    - When you are working on a categorical dataset

### CatBoost Example

In [2]:
## import the libraries needed
import pandas as pd
import numpy as np

# Here we import our dataset from the CatBoost dataset library
from catboost.datasets import titanic

titanic_train, titanic_test = titanic()

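## Reorder the columns so that the target, "Survived", comes last
## (this selection also leaves "PassengerId" out of train)
column_sort = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
               'Fare', 'Cabin', 'Embarked', 'Survived']

train = titanic_train[column_sort]
train.set_index('Pclass') ## Not necessary; the result is not assigned, so train keeps its default index

test = titanic_test.copy() ## copy so the placeholder column below is not added to the original DataFrame
train.head()

## Give the test set a placeholder 'Survived' value of 2 so that it has the same columns as train
## and we can concatenate the DataFrames now and separate them again later
test['Survived'] = 2  ## pandas broadcasts the scalar to every row
test.sample(5) ## shows five random rows in the dataset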

df = pd.concat([train,test],ignore_index = False)

## We treat some features (Name and Age) as irrelevant for this example, so we drop them
df = df.drop(['Name', 'Age'], axis=1)

## The data is not clean, so we count the missing values in every column
df.isnull().sum(axis=0)

## "Fare", "Cabin", "Embarked", and "PassengerId" have missing values, so we fill them in
## ("PassengerId" is missing for the train rows, which were selected without that column)
df['Embarked'] = df['Embarked'].fillna('S')
df['Cabin'] = df['Cabin'].fillna('Undefined')
df.fillna(-999, inplace=True) ## every remaining missing value becomes the sentinel -999

## Now that the data looks good, we separate the train rows from the test rows again
train = df[df.Survived != 2].copy() ## copy so that pop() below works on an independent DataFrame

test = df[df.Survived == 2]
test = test.drop(['Survived'], axis=1) ## drop the placeholder we created earlier in the test set

## Pop the target variable out of the training features
target = train.pop('Survived')
target.head()

## Indices of the non-float columns, which are passed to CatBoost as categorical features
cat_features_index = np.where(train.dtypes != float)[0]

## Split the training data into a training set and a held-out validation set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train, target,
                                                    train_size=0.85, random_state=1234)

## Import the CatBoostClassifier and create the model, using accuracy as the evaluation metric
## and keeping the iteration that scores best on the evaluation set
from catboost import CatBoostClassifier
model = CatBoostClassifier(eval_metric='Accuracy',
                           use_best_model=True, random_seed=42)

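## Fit the model, evaluating on the held-out set after every iteration
model.fit(X_train, y_train, cat_features=cat_features_index,
          eval_set=(X_test, y_test))

from sklearn.metrics import accuracy_score

## Accuracy on the held-out split (the same data that was passed as eval_set)
print('the test accuracy is :{:.6f}'.format(accuracy_score(
    y_test, model.predict(X_test))))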

Learning rate set to 0.029583
0:	learn: 0.8031704	test: 0.8059701	best: 0.8059701 (0)	total: 208ms	remaining: 3m 28s
1:	learn: 0.8137384	test: 0.8059701	best: 0.8059701 (0)	total: 276ms	remaining: 2m 17s
2:	learn: 0.8150594	test: 0.8059701	best: 0.8059701 (0)	total: 330ms	remaining: 1m 49s
3:	learn: 0.8177015	test: 0.8059701	best: 0.8059701 (0)	total: 394ms	remaining: 1m 38s
4:	learn: 0.8163804	test: 0.8059701	best: 0.8059701 (0)	total: 471ms	remaining: 1m 33s
5:	learn: 0.8163804	test: 0.8059701	best: 0.8059701 (0)	total: 500ms	remaining: 1m 22s
6:	learn: 0.8150594	test: 0.8059701	best: 0.8059701 (0)	total: 540ms	remaining: 1m 16s
7:	learn: 0.8163804	test: 0.8059701	best: 0.8059701 (0)	total: 564ms	remaining: 1m 9s
8:	learn: 0.8110964	test: 0.8059701	best: 0.8059701 (0)	total: 600ms	remaining: 1m 6s
9:	learn: 0.8137384	test: 0.8059701	best: 0.8059701 (0)	total: 643ms	remaining: 1m 3s
10:	learn: 0.8150594	test: 0.8059701	best: 0.8059701 (0)	total: 687ms	remaining: 1m 1s
11:	learn: 0.816

95:	learn: 0.8520476	test: 0.8059701	best: 0.8134328 (56)	total: 5.45s	remaining: 51.3s
96:	learn: 0.8520476	test: 0.8059701	best: 0.8134328 (56)	total: 5.52s	remaining: 51.4s
97:	learn: 0.8507266	test: 0.8059701	best: 0.8134328 (56)	total: 5.55s	remaining: 51.1s
98:	learn: 0.8507266	test: 0.8059701	best: 0.8134328 (56)	total: 5.61s	remaining: 51.1s
99:	learn: 0.8507266	test: 0.8059701	best: 0.8134328 (56)	total: 5.65s	remaining: 50.8s
100:	learn: 0.8494055	test: 0.8059701	best: 0.8134328 (56)	total: 5.69s	remaining: 50.6s
101:	learn: 0.8520476	test: 0.8059701	best: 0.8134328 (56)	total: 5.73s	remaining: 50.4s
102:	learn: 0.8520476	test: 0.8059701	best: 0.8134328 (56)	total: 5.76s	remaining: 50.2s
103:	learn: 0.8507266	test: 0.8059701	best: 0.8134328 (56)	total: 5.81s	remaining: 50s
104:	learn: 0.8520476	test: 0.8059701	best: 0.8134328 (56)	total: 5.86s	remaining: 49.9s
105:	learn: 0.8520476	test: 0.8059701	best: 0.8134328 (56)	total: 5.91s	remaining: 49.9s
106:	learn: 0.8520476	test: 

188:	learn: 0.8705416	test: 0.8059701	best: 0.8134328 (56)	total: 10.6s	remaining: 45.7s
189:	learn: 0.8705416	test: 0.8059701	best: 0.8134328 (56)	total: 10.7s	remaining: 45.7s
190:	learn: 0.8705416	test: 0.8059701	best: 0.8134328 (56)	total: 10.8s	remaining: 45.7s
191:	learn: 0.8705416	test: 0.8059701	best: 0.8134328 (56)	total: 10.8s	remaining: 45.6s
192:	learn: 0.8705416	test: 0.8059701	best: 0.8134328 (56)	total: 10.9s	remaining: 45.6s
193:	learn: 0.8705416	test: 0.8059701	best: 0.8134328 (56)	total: 11s	remaining: 45.6s
194:	learn: 0.8718626	test: 0.8059701	best: 0.8134328 (56)	total: 11s	remaining: 45.5s
195:	learn: 0.8718626	test: 0.8059701	best: 0.8134328 (56)	total: 11.1s	remaining: 45.5s
196:	learn: 0.8718626	test: 0.8059701	best: 0.8134328 (56)	total: 11.2s	remaining: 45.5s
197:	learn: 0.8718626	test: 0.8059701	best: 0.8134328 (56)	total: 11.2s	remaining: 45.4s
198:	learn: 0.8718626	test: 0.8059701	best: 0.8134328 (56)	total: 11.3s	remaining: 45.4s
199:	learn: 0.8718626	tes

283:	learn: 0.8916777	test: 0.8134328	best: 0.8208955 (248)	total: 16.3s	remaining: 41s
284:	learn: 0.8916777	test: 0.8134328	best: 0.8208955 (248)	total: 16.3s	remaining: 41s
285:	learn: 0.8916777	test: 0.8134328	best: 0.8208955 (248)	total: 16.4s	remaining: 41s
286:	learn: 0.8916777	test: 0.8134328	best: 0.8208955 (248)	total: 16.5s	remaining: 40.9s
287:	learn: 0.8916777	test: 0.8134328	best: 0.8208955 (248)	total: 16.5s	remaining: 40.9s
288:	learn: 0.8929987	test: 0.8208955	best: 0.8208955 (248)	total: 16.6s	remaining: 40.9s
289:	learn: 0.8929987	test: 0.8208955	best: 0.8208955 (248)	total: 16.7s	remaining: 40.9s
290:	learn: 0.8929987	test: 0.8208955	best: 0.8208955 (248)	total: 16.8s	remaining: 40.9s
291:	learn: 0.8929987	test: 0.8208955	best: 0.8208955 (248)	total: 16.8s	remaining: 40.8s
292:	learn: 0.8929987	test: 0.8134328	best: 0.8208955 (248)	total: 16.9s	remaining: 40.8s
293:	learn: 0.8943197	test: 0.8134328	best: 0.8208955 (248)	total: 17s	remaining: 40.9s
294:	learn: 0.8929

378:	learn: 0.9022457	test: 0.8208955	best: 0.8283582 (334)	total: 21.9s	remaining: 36s
379:	learn: 0.9022457	test: 0.8208955	best: 0.8283582 (334)	total: 22s	remaining: 35.9s
380:	learn: 0.9022457	test: 0.8208955	best: 0.8283582 (334)	total: 22s	remaining: 35.8s
381:	learn: 0.9022457	test: 0.8208955	best: 0.8283582 (334)	total: 22.1s	remaining: 35.7s
382:	learn: 0.9022457	test: 0.8208955	best: 0.8283582 (334)	total: 22.1s	remaining: 35.6s
383:	learn: 0.9022457	test: 0.8283582	best: 0.8283582 (334)	total: 22.1s	remaining: 35.5s
384:	learn: 0.9022457	test: 0.8283582	best: 0.8283582 (334)	total: 22.2s	remaining: 35.5s
385:	learn: 0.9022457	test: 0.8283582	best: 0.8283582 (334)	total: 22.2s	remaining: 35.4s
386:	learn: 0.9022457	test: 0.8283582	best: 0.8283582 (334)	total: 22.3s	remaining: 35.3s
387:	learn: 0.9022457	test: 0.8283582	best: 0.8283582 (334)	total: 22.3s	remaining: 35.2s
388:	learn: 0.9022457	test: 0.8208955	best: 0.8283582 (334)	total: 22.4s	remaining: 35.1s
389:	learn: 0.90

472:	learn: 0.9101717	test: 0.8134328	best: 0.8283582 (334)	total: 27.1s	remaining: 30.1s
473:	learn: 0.9101717	test: 0.8134328	best: 0.8283582 (334)	total: 27.1s	remaining: 30.1s
474:	learn: 0.9114927	test: 0.8134328	best: 0.8283582 (334)	total: 27.1s	remaining: 30s
475:	learn: 0.9101717	test: 0.8134328	best: 0.8283582 (334)	total: 27.2s	remaining: 29.9s
476:	learn: 0.9101717	test: 0.8134328	best: 0.8283582 (334)	total: 27.2s	remaining: 29.9s
477:	learn: 0.9101717	test: 0.8134328	best: 0.8283582 (334)	total: 27.3s	remaining: 29.8s
478:	learn: 0.9114927	test: 0.8134328	best: 0.8283582 (334)	total: 27.4s	remaining: 29.8s
479:	learn: 0.9114927	test: 0.8134328	best: 0.8283582 (334)	total: 27.4s	remaining: 29.7s
480:	learn: 0.9114927	test: 0.8134328	best: 0.8283582 (334)	total: 27.5s	remaining: 29.7s
481:	learn: 0.9114927	test: 0.8134328	best: 0.8283582 (334)	total: 27.6s	remaining: 29.6s
482:	learn: 0.9114927	test: 0.8134328	best: 0.8283582 (334)	total: 27.6s	remaining: 29.6s
483:	learn: 

565:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 31.9s	remaining: 24.5s
566:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32s	remaining: 24.4s
567:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.1s	remaining: 24.4s
568:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.1s	remaining: 24.3s
569:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.2s	remaining: 24.3s
570:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.2s	remaining: 24.2s
571:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.3s	remaining: 24.2s
572:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.3s	remaining: 24.1s
573:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.4s	remaining: 24s
574:	learn: 0.9207398	test: 0.8134328	best: 0.8283582 (334)	total: 32.4s	remaining: 23.9s
575:	learn: 0.9220608	test: 0.8134328	best: 0.8283582 (334)	total: 32.5s	remaining: 23.9s
576:	learn: 0.

659:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37s	remaining: 19.1s
660:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37s	remaining: 19s
661:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.1s	remaining: 18.9s
662:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.1s	remaining: 18.9s
663:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.2s	remaining: 18.8s
664:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.2s	remaining: 18.8s
665:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.3s	remaining: 18.7s
666:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.4s	remaining: 18.6s
667:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.4s	remaining: 18.6s
668:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.4s	remaining: 18.5s
669:	learn: 0.9273448	test: 0.8134328	best: 0.8283582 (334)	total: 37.5s	remaining: 18.5s
670:	learn: 0.92

752:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 42.6s	remaining: 14s
753:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 42.7s	remaining: 13.9s
754:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 42.7s	remaining: 13.9s
755:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 42.8s	remaining: 13.8s
756:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 42.9s	remaining: 13.8s
757:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 43s	remaining: 13.7s
758:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 43s	remaining: 13.7s
759:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 43.1s	remaining: 13.6s
760:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 43.2s	remaining: 13.6s
761:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 43.2s	remaining: 13.5s
762:	learn: 0.9326288	test: 0.8134328	best: 0.8283582 (334)	total: 43.3s	remaining: 13.4s
763:	learn: 0.93

847:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.5s	remaining: 8.52s
848:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.6s	remaining: 8.46s
849:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.7s	remaining: 8.41s
850:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.7s	remaining: 8.35s
851:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.8s	remaining: 8.3s
852:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.9s	remaining: 8.25s
853:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 47.9s	remaining: 8.19s
854:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 48s	remaining: 8.14s
855:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 48.1s	remaining: 8.08s
856:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 48.1s	remaining: 8.03s
857:	learn: 0.9339498	test: 0.8208955	best: 0.8283582 (334)	total: 48.2s	remaining: 7.97s
858:	learn: 0

943:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 52.8s	remaining: 3.13s
944:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 52.9s	remaining: 3.08s
945:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 52.9s	remaining: 3.02s
946:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 53s	remaining: 2.96s
947:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 53s	remaining: 2.91s
948:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 53s	remaining: 2.85s
949:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 53.1s	remaining: 2.79s
950:	learn: 0.9339498	test: 0.8283582	best: 0.8283582 (334)	total: 53.1s	remaining: 2.74s
951:	learn: 0.9352708	test: 0.8283582	best: 0.8283582 (334)	total: 53.2s	remaining: 2.68s
952:	learn: 0.9352708	test: 0.8283582	best: 0.8283582 (334)	total: 53.2s	remaining: 2.62s
953:	learn: 0.9352708	test: 0.8283582	best: 0.8283582 (334)	total: 53.3s	remaining: 2.57s
954:	learn: 0.93
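
In the log above, the `best:` column reports the highest evaluation accuracy seen so far and, in parentheses, the iteration where it was first reached. With `use_best_model=True`, no trees after that iteration are kept in the final model. The sketch below reproduces that bookkeeping with the standard library only; the helper `track_best` is a hypothetical name, and the short score list is an assumed sample built from values that appear in the log, not the full run.

```python
def track_best(test_scores):
    """Return (best_score, best_iteration), keeping the earliest iteration on ties."""
    best_score, best_iter = float('-inf'), -1
    for i, score in enumerate(test_scores):
        if score > best_score:          # strict '>' keeps the first occurrence of a tie
            best_score, best_iter = score, i
    return best_score, best_iter


# Assumed sample of evaluation accuracies, one per iteration
scores = [0.8059701, 0.8134328, 0.8059701, 0.8283582, 0.8208955, 0.8283582]
print(track_best(scores))  # (0.8283582, 3)
```

The tie-breaking matches what the log shows: iteration 383 reaches 0.8283582 again, yet the reported best iteration stays at 334.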