
# **CatBoost Classification & Regression**

## **1. CatBoost Classification**

### **Data Preprocessing**

#### **Importing the libraries**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### **Importing the dataset**

In [2]:
ds = pd.read_csv('../input/churn-modelling/Churn_Modelling.csv')
ds.head()

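| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |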


In [3]:
X = ds.iloc[:, 3:-1].values
y = ds.iloc[:, -1].values
print(X)
print(y)

[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]
[1 0 1 ... 1 1 0]


#### **Encoding categorical data**

In [4]:
# Label Encoding the "Gender" column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])
print(X)

[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]


In [5]:
# One Hot Encoding the "Geography" column
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]
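
The two encodings above can be mimicked in plain Python to make the mapping explicit. This is a minimal sketch, assuming (as scikit-learn's encoders do) that categories are sorted before being indexed; the helper names `label_encode` and `one_hot_encode` are hypothetical and not part of any library.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of classes."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Return one 0/1 column per sorted category, plus the category order."""
    codes, classes = label_encode(values)
    rows = [[1.0 if code == j else 0.0 for j in range(len(classes))] for code in codes]
    return rows, classes


# Illustrative values taken from the Gender and Geography columns
gender_codes, gender_classes = label_encode(["Female", "Female", "Male"])
geo_rows, geo_classes = one_hot_encode(["France", "Spain", "Germany"])

print(gender_classes, gender_codes)
print(geo_classes)
for row in geo_rows:
    print(row)
```

With this ordering, France, Germany and Spain occupy the first, second and third dummy columns, which agrees with the leading three values of each row printed above.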


#### **Splitting the dataset into the Training set and Test set**

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### **1A. Training XGBoost on the Training set**

In [7]:
from xgboost import XGBClassifier
classifier = XGBClassifier(learning_rate=0.05, n_estimators=200, objective='binary:logistic', 
                           use_label_encoder=False, disable_default_eval_metric = True)
#classifier = XGBClassifier()
classifier.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1,
              disable_default_eval_metric=True, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.05, max_delta_step=0,
              max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

#### **Predicting the Test set results using XGBoost Classification model**

In [8]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

#### **Making the Confusion Matrix & Evaluation Metrics of XGBoost Classification model**

In [9]:
# Evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
print('Accuracy of XGBoost Model is ', accuracy_score(y_test, y_pred))
print('\n', '\n','Confusion Matrix of XGBoost Model:' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for XGBoost Model:' '\n',classification_report(y_test, y_pred))

Accuracy of XGBoost Model is  0.8685

 
 Confusion Matrix of XGBoost Model:
 [[1524   71]
 [ 192  213]]

 
 Classification Report for XGBoost Model:
               precision    recall  f1-score   support

           0       0.89      0.96      0.92      1595
           1       0.75      0.53      0.62       405

    accuracy                           0.87      2000
   macro avg       0.82      0.74      0.77      2000
weighted avg       0.86      0.87      0.86      2000
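
The figures in the report can be re-derived by hand from the confusion matrix. The following is a minimal sketch, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions); `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(cm):
    """Derive metrics for the positive class from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)   # share of predicted churners that really churned
    recall = tp / (tp + fn)      # share of actual churners that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrix of the XGBoost model, copied from the output above
xgb_cm = [[1524, 71], [192, 213]]
m = binary_metrics(xgb_cm)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The recall of about 0.53 for class 1 shows that the overall accuracy is carried mostly by the majority class: roughly half of the customers who actually exited are missed.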



#### **Applying k-Fold Cross Validation on XGBoost Classification model**

In [10]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy of XG Boost Classification: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation of XG Boost Classification: {:.2f} %".format(accuracies.std()*100))

Accuracy of XG Boost Classification: 86.10 %
Standard Deviation of XG Boost Classification: 0.98 %
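
To illustrate what `cv = 10` does, here is a simplified sketch of k-fold splitting in plain Python. It assumes contiguous, unshuffled folds and 8000 training rows (inferred from the 2000-row test support at `test_size = 0.2`); scikit-learn stratifies the folds by class for classifiers, which this sketch does not do. The function name `k_fold_indices` and the per-fold scores are hypothetical.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=8000, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))

# Hypothetical per-fold accuracies, only to show how the summary is formed
fold_scores = [0.85, 0.87, 0.86, 0.88, 0.84]
print("Mean: {:.2f} %".format(mean(fold_scores) * 100))
# pstdev matches NumPy's default .std() (population standard deviation)
print("Std:  {:.2f} %".format(pstdev(fold_scores) * 100))
```

Each sample is used for validation exactly once, so the mean is a less optimistic estimate than a single train/test split, and the standard deviation indicates how sensitive the model is to the particular split.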


### **1B. Building CatBoost Classification Model**

In [11]:
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X_train, y_train)

Learning rate set to 0.025035
0:	learn: 0.6722585	total: 55.3ms	remaining: 55.2s
1:	learn: 0.6530471	total: 58.4ms	remaining: 29.1s
2:	learn: 0.6342740	total: 62ms	remaining: 20.6s
3:	learn: 0.6174746	total: 65.5ms	remaining: 16.3s
4:	learn: 0.6017924	total: 68.9ms	remaining: 13.7s
5:	learn: 0.5864114	total: 71.5ms	remaining: 11.8s
6:	learn: 0.5757775	total: 74.7ms	remaining: 10.6s
7:	learn: 0.5625984	total: 77.4ms	remaining: 9.59s
8:	learn: 0.5492111	total: 80ms	remaining: 8.81s
9:	learn: 0.5382108	total: 82.7ms	remaining: 8.19s
10:	learn: 0.5272909	total: 85.5ms	remaining: 7.68s
11:	learn: 0.5178895	total: 88.2ms	remaining: 7.26s
12:	learn: 0.5090407	total: 90.9ms	remaining: 6.9s
13:	learn: 0.5009555	total: 93.7ms	remaining: 6.6s
14:	learn: 0.4923603	total: 96.3ms	remaining: 6.33s
15:	learn: 0.4837924	total: 99ms	remaining: 6.09s
16:	learn: 0.4780842	total: 102ms	remaining: 5.88s
17:	learn: 0.4713495	total: 105ms	remaining: 5.71s
18:	learn: 0.4648379	total: 108ms	remaining: 5.57s
19:

<catboost.core.CatBoostClassifier at 0x7f881f8eb8d0>

#### **Predicting the Test set results using CatBoost Classification model**

In [12]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

#### **Making the Confusion Matrix & Evaluation Metrics of CatBoost Classification model**

In [13]:
# y_pred now holds the CatBoost predictions from the previous cell
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print('Accuracy of CatBoost Classification Model is ', accuracy_score(y_test, y_pred))
print('\n', '\n','Confusion Matrix of CatBoost Classification Model:' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for CatBoost Classification Model:' '\n',classification_report(y_test, y_pred))

Accuracy of CatBoost Classification Model is  0.8645

 
 Confusion Matrix of CatBoost Classification Model:
 [[1514   81]
 [ 190  215]]

 
 Classification Report for CatBoost Classification Model:
               precision    recall  f1-score   support

           0       0.89      0.95      0.92      1595
           1       0.73      0.53      0.61       405

    accuracy                           0.86      2000
   macro avg       0.81      0.74      0.77      2000
weighted avg       0.86      0.86      0.86      2000


#### **Making the Confusion Matrix of CatBoost Classification model**

In [14]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1514   81]
 [ 190  215]]


#### **Applying k-Fold Cross Validation on CatBoost Classification Model**

In [15]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print('\n', '\n',"Mean Accuracy of CatBoost Classification: {:.2f} %".format(accuracies.mean()*100))
print('\n', "Standard Deviation of CatBoost Classification: {:.2f} %".format(accuracies.std()*100))

Learning rate set to 0.023934
0:	learn: 0.6731310	total: 2.97ms	remaining: 2.97s
1:	learn: 0.6557754	total: 5.66ms	remaining: 2.83s
2:	learn: 0.6377491	total: 8.55ms	remaining: 2.84s
3:	learn: 0.6215203	total: 11.2ms	remaining: 2.79s
4:	learn: 0.6062518	total: 13.6ms	remaining: 2.7s
5:	learn: 0.5914063	total: 16.3ms	remaining: 2.69s
6:	learn: 0.5779694	total: 19.7ms	remaining: 2.8s
7:	learn: 0.5653432	total: 22.8ms	remaining: 2.83s
8:	learn: 0.5529386	total: 26.6ms	remaining: 2.92s
9:	learn: 0.5421471	total: 29.8ms	remaining: 2.95s
10:	learn: 0.5315658	total: 32.8ms	remaining: 2.95s
11:	learn: 0.5223779	total: 35.6ms	remaining: 2.93s
12:	learn: 0.5137582	total: 38.1ms	remaining: 2.89s
13:	learn: 0.5058847	total: 40.7ms	remaining: 2.87s
14:	learn: 0.4974732	total: 43.6ms	remaining: 2.86s
15:	learn: 0.4891213	total: 46.8ms	remaining: 2.88s
16:	learn: 0.4835104	total: 50.1ms	remaining: 2.9s
17:	learn: 0.4758838	total: 53.5ms	remaining: 2.92s
18:	learn: 0.4702250	total: 56.9ms	remaining: 2

## **2. CatBoost Regression**

### **Data Preprocessing**

#### **Importing the dataset**

In [16]:
df = pd.read_csv('../input/combined-cycle-power-plant-data-set-uci-data/Power Plant Data.csv')
print(df.head())
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

      AT      V       AP     RH      PE
0  14.96  41.76  1024.07  73.17  463.26
1  25.18  62.96  1020.04  59.08  444.37
2   5.11  39.40  1012.16  92.14  488.56
3  20.86  57.32  1010.24  76.64  446.48
4  10.82  37.50  1009.23  96.62  473.90


### **Data Set Information:**

The dataset contains 9,568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work at full load. The features are hourly averages of the ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant.

**Attribute Information:**

The features are hourly averages of the ambient variables below; PE is the target:
* Temperature (AT) in the range 1.81-37.11 °C
* Ambient Pressure (AP) in the range 992.89-1033.30 millibar
* Relative Humidity (RH) in the range 25.56-100.16 %
* Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
* Net hourly electrical energy output (PE) in the range 420.26-495.76 MW

#### **Splitting the dataset into the Training set and Test set**

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### **2A. Define a Multiple Linear Regression model**

In [18]:
#Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predicting the Test set results
y_pred = regressor.predict(X_test).round(2)
# OR, np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[431.43 431.23]
 [458.56 460.01]
 [462.75 461.14]
 ...
 [469.52 473.26]
 [442.42 438.  ]
 [461.88 463.28]]


#### **Evaluating the Model Performance of Multiple Linear Regression model**

In [19]:
# k = number of features, n = number of test samples (both needed for adjusted R2)
k = X_test.shape[1]
n = len(X_test)

from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score, r2_score
print('Mean Absolute Error(MAE) of multiple linear regression:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE) of multiple linear regression:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE) of multiple linear regression:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Explained Variance Score (EVS) of multiple linear regression:', explained_variance_score(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print('R2 of multiple linear regression:', r2)
print('R2 rounded of multiple linear regression:', round(r2, 2))
adjusted_r2 = round(1 - (1 - r2) * (n - 1) / (n - k - 1), 3)
print('Adjusted_r2 of multiple linear regression: ', adjusted_r2)
# regressor.score returns R2 (not a classification accuracy), computed on fresh, unrounded predictions
score = regressor.score(X_test, y_test)
print("R2 from regressor.score of multiple linear regression: {}".format(score))

Mean Absolute Error(MAE) of multiple linear regression: 3.5666300940438864
Mean Squared Error(MSE) of multiple linear regression: 19.73409942528735
Root Mean Squared Error (RMSE) of multiple linear regression: 4.442307894021682
Explained Variance Score (EVS) of multiple linear regression: 0.9325314369725277
R2 of multiple linear regression: 0.9325301874814955
R2 rounded of multiple linear regression: 0.93
Adjusted_r2 of multiple linear regression:  0.932
R2 from regressor.score of multiple linear regression: 0.9325315554761303
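
For reference, the regression metrics above can be written out from their definitions. This is a minimal sketch in plain Python: the helper names `regression_metrics` and `adjusted_r2` are hypothetical, the toy inputs are the six prediction/target pairs printed earlier for the linear model, and n = 1914 is an assumption (20 % of 9,568 rows, rounded up) with k = 4 features.

```python
from math import sqrt


def adjusted_r2(r2, n, k):
    """Penalise R2 for the number of features k given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - (mse * n) / ss_tot                      # 1 - SS_res / SS_tot
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse), "r2": r2}


# Six (prediction, target) pairs shown in the linear regression output
y_pred_toy = [431.43, 458.56, 462.75, 469.52, 442.42, 461.88]
y_true_toy = [431.23, 460.01, 461.14, 473.26, 438.0, 463.28]
toy = regression_metrics(y_true_toy, y_pred_toy)
print(toy)

# Adjusted R2 from the reported R2, assuming n = 1914 test rows and k = 4 features
adj = adjusted_r2(0.9325301874814955, n=1914, k=4)
print(round(adj, 3))
```

Because n is much larger than k here, the adjustment barely changes the value. The small gap between the R2 computed from `y_pred` and the one returned by `regressor.score` arises because `y_pred` was rounded to two decimals before scoring.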


### **2B. Define the XGBoost regressor model**

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from xgboost import XGBRegressor
regressor = XGBRegressor(objective ='reg:squarederror', learning_rate = 0.1, max_depth = 5, n_estimators = 100)
regressor.fit(X_train, y_train)

#Predicting the Test set results
y_pred = regressor.predict(X_test).round(2)
# OR, np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Score the trained model on the test set; for regressors, score() returns R2, not accuracy
result = regressor.score(X_test, y_test)
print("R2 score : {}".format(result))

[[434.23999023 431.23      ]
 [457.26998901 460.01      ]
 [463.36999512 461.14      ]
 ...
 [470.88000488 473.26      ]
 [438.86999512 438.        ]
 [463.04000854 463.28      ]]
R2 score : 0.9591821869823342


#### **Evaluating the Model Performance of XGBoost Regression Model**

In [21]:
# k = number of features, n = number of test samples (both needed for adjusted R2)
k = X_test.shape[1]
n = len(X_test)

from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score, r2_score
print('Mean Absolute Error(MAE) of XG Boost regression:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE) of XG Boost regression:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE) of XG Boost regression:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Explained Variance Score (EVS) of XG Boost regression:', explained_variance_score(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print('R2 of XG Boost regression:', r2)
print('R2 rounded of XG Boost regression:', round(r2, 2))
adjusted_r2 = round(1 - (1 - r2) * (n - 1) / (n - k - 1), 3)
print('Adjusted_r2 of XG Boost regression: ', adjusted_r2)
# regressor.score returns R2 (not a classification accuracy), computed on fresh, unrounded predictions
score = regressor.score(X_test, y_test)
print("R2 from regressor.score of XG Boost regression: {}".format(score))

Mean Absolute Error(MAE) of XG Boost regression: 2.639874472304198
Mean Squared Error(MSE) of XG Boost regression: 11.93954488691334
Root Mean Squared Error (RMSE) of XG Boost regression: 3.455364653247663
Explained Variance Score (EVS) of XG Boost regression: 0.9591917782765134
R2 of XG Boost regression: 0.9591793454712169
R2 rounded of XG Boost regression: 0.96
Adjusted_r2 of XG Boost regression:  0.959
R2 from regressor.score of XG Boost regression: 0.9591821869823342


### **2C. Define the CatBoost regressor model using CatBoostRegressor class**

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from catboost import CatBoostRegressor
regressor = CatBoostRegressor(loss_function='RMSE', learning_rate = 0.1, max_depth = 5, n_estimators = 100)
regressor.fit(X_train, y_train)

#Predicting the Test set results
y_pred = regressor.predict(X_test).round(2)
# OR, np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Score the trained model on the test set; for regressors, score() returns R2, not accuracy
result = regressor.score(X_test, y_test)
print("R2 score : {}".format(result))

0:	learn: 15.6106973	total: 1.65ms	remaining: 164ms
1:	learn: 14.2567769	total: 3.21ms	remaining: 157ms
2:	learn: 13.0914175	total: 4.63ms	remaining: 150ms
3:	learn: 12.0216303	total: 6.32ms	remaining: 152ms
4:	learn: 11.0669715	total: 8.19ms	remaining: 156ms
5:	learn: 10.2068337	total: 10.1ms	remaining: 157ms
6:	learn: 9.4534428	total: 11.8ms	remaining: 157ms
7:	learn: 8.7739114	total: 13.2ms	remaining: 152ms
8:	learn: 8.1764425	total: 15ms	remaining: 151ms
9:	learn: 7.6492095	total: 16.7ms	remaining: 150ms
10:	learn: 7.1955004	total: 18.4ms	remaining: 149ms
11:	learn: 6.7917389	total: 20ms	remaining: 147ms
12:	learn: 6.4324705	total: 21.6ms	remaining: 144ms
13:	learn: 6.1140818	total: 23ms	remaining: 142ms
14:	learn: 5.8414508	total: 24.8ms	remaining: 140ms
15:	learn: 5.6083921	total: 26.3ms	remaining: 138ms
16:	learn: 5.4018868	total: 27.8ms	remaining: 136ms
17:	learn: 5.2341144	total: 29.5ms	remaining: 134ms
18:	learn: 5.0715047	total: 31.1ms	remaining: 132ms
19:	learn: 4.9356017	t


#### **Evaluating the Model Performance of CatBoost Regression Model**

In [23]:
# k = number of features, n = number of test samples (both needed for adjusted R2)
k = X_test.shape[1]
n = len(X_test)

from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score, r2_score
print('Mean Absolute Error(MAE) of CatBoost regression:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE) of CatBoost regression:', mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE) of CatBoost regression:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Explained Variance Score (EVS) of CatBoost regression:', explained_variance_score(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print('R2 of CatBoost regression:', r2)
print('R2 rounded of CatBoost regression:', round(r2, 2))
adjusted_r2 = round(1 - (1 - r2) * (n - 1) / (n - k - 1), 3)
print('Adjusted_r2 of CatBoost regression: ', adjusted_r2)
# regressor.score returns R2 (not a classification accuracy), computed on fresh, unrounded predictions
score = regressor.score(X_test, y_test)
print("R2 from regressor.score of CatBoost regression: {}".format(score))

Mean Absolute Error(MAE) of CatBoost regression: 2.9016718913270627
Mean Squared Error(MSE) of CatBoost regression: 13.99408652037617
Root Mean Squared Error (RMSE) of CatBoost regression: 3.7408670813564293
Explained Variance Score (EVS) of CatBoost regression: 0.9521996419968053
R2 of CatBoost regression: 0.9521549793811398
R2 rounded of CatBoost regression: 0.95
Adjusted_r2 of CatBoost regression:  0.952
R2 from regressor.score of CatBoost regression: 0.952153413155339
