<a href="https://www.kaggle.com/code/rinichristy/xgboost-classification-regression?scriptVersionId=91354696" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **XGBoost Classification & Regression**

## **XGBoost Classification**
### **Data Preprocessing**
#### **Importing the libraries**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### **Importing the dataset**

In [2]:
ds = pd.read_csv('../input/churn-modelling/Churn_Modelling.csv')
ds.head()

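| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |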


In [3]:
X = ds.iloc[:, 3:-1].values
y = ds.iloc[:, -1].values
print(X)
print(y)

[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]
[1 0 1 ... 1 1 0]


#### **Encoding categorical data**

In [4]:
# Label Encoding the "Gender" column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])
print(X)

[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]


In [5]:
# One Hot Encoding the "Geography" column
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]
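
The first three columns of the printed array are the one-hot encoding of `Geography`. As a rough illustration of how those columns are laid out, the sketch below hand-rolls the encoding in plain Python. It is not scikit-learn's implementation, and the helper name `one_hot` is hypothetical. It assumes the categories are ordered alphabetically, which is consistent with the output above (France is `[1, 0, 0]`, Germany is `[0, 1, 0]`, Spain is `[0, 0, 1]`).

```python
def one_hot(values):
    """Return (categories, rows), with categories sorted alphabetically.

    Hypothetical helper for illustration only; it mimics the column
    layout seen in the notebook output, not scikit-learn internals.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0  # set the column of this row's category
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(["France", "Spain", "France", "Germany"])
print(categories)
for row in encoded:
    print(row)
```

Because the encoded columns come first and the remaining features are passed through after them, every feature index shifts once this step has run.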


#### **Splitting the dataset into the Training set and Test set**

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#### **Training XGBoost on the Training set**

In [7]:
from xgboost import XGBClassifier
classifier = XGBClassifier(learning_rate=0.05, n_estimators=200, objective='binary:logistic', 
                           use_label_encoder=False, disable_default_eval_metric = True)
#classifier = XGBClassifier()
classifier.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1,
              disable_default_eval_metric=True, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.05, max_delta_step=0,
              max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

#### **Predicting the Test set results**

In [8]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

#### **Making the Confusion Matrix**

In [9]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1524   71]
 [ 192  213]]
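
The confusion matrix can be turned into the usual summary metrics by hand. The sketch below uses only the four counts printed above and assumes scikit-learn's layout, where rows are the true classes and columns are the predicted classes, so the second row and column correspond to `Exited = 1`.

```python
# Counts copied from the confusion matrix printed above.
cm = [[1524, 71],
      [192, 213]]

# Rows are true classes, columns are predicted classes.
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of real churners the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

On this split the test accuracy works out to 1737 / 2000 = 86.85 %, but recall for the churned class is only about 53 %, so accuracy alone overstates how well churners are detected.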


#### **Applying k-Fold Cross Validation**

In [10]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy of XG Boost Classification: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation of XG Boost Classification: {:.2f} %".format(accuracies.std()*100))

Accuracy of XG Boost Classification: 86.10 %
Standard Deviation of XG Boost Classification: 0.98 %
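
To make the `cv = 10` argument concrete, the sketch below shows how 10-fold cross-validation partitions the training rows. It is a simplified, unshuffled and unstratified illustration with a hypothetical helper name (`kfold_indices`). For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, so the real fold membership differs. The figure of 8,000 training rows is an assumption inferred from the 2,000 test rows in the confusion matrix and the 80/20 split.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Assumed training-set size: 8,000 rows (80 % of 10,000).
folds = list(kfold_indices(8000, 10))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} validate={len(val_idx)}")
```

Each model is fitted on nine folds and scored on the remaining one. The reported accuracy is the mean of the ten scores, and the standard deviation indicates how much that score varies from fold to fold.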


## **XGBoost Regression**
### **Data Preprocessing**
**Data Set Information:**

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. The features are hourly averages of four ambient variables, Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant.

**Attribute Information:**

Each record holds hourly averages of the following variables:
* Temperature (AT): 1.81°C to 37.11°C
* Ambient Pressure (AP): 992.89 to 1033.30 millibar
* Relative Humidity (RH): 25.56% to 100.16%
* Exhaust Vacuum (V): 25.36 to 81.56 cm Hg
* Net hourly electrical energy output (PE), the target variable: 420.26 to 495.76 MW
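
The documented ranges can double as a quick sanity check on loaded rows. The sketch below is a minimal illustration in plain Python. The `RANGES` dictionary and the `out_of_range` helper are hypothetical names, and the column keys assume the CSV headers `AT`, `V`, `AP`, `RH` and `PE`. The sample row is the first record of the dataset.

```python
# Documented (min, max) range for each column, taken from the list above.
RANGES = {
    "AT": (1.81, 37.11),      # degrees Celsius
    "V": (25.36, 81.56),      # cm Hg
    "AP": (992.89, 1033.30),  # millibar
    "RH": (25.56, 100.16),    # percent
    "PE": (420.26, 495.76),   # MW
}


def out_of_range(row):
    """Return the names of columns whose value falls outside its range."""
    return [name for name, (low, high) in RANGES.items()
            if not low <= row[name] <= high]


sample = {"AT": 14.96, "V": 41.76, "AP": 1024.07, "RH": 73.17, "PE": 463.26}
print(out_of_range(sample))  # an empty list means every value is in range
```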

#### **Importing the dataset**

In [11]:
df = pd.read_csv('../input/combined-cycle-power-plant-data-set-uci-data/Power Plant Data.csv')
print(df.head())
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

      AT      V       AP     RH      PE
0  14.96  41.76  1024.07  73.17  463.26
1  25.18  62.96  1020.04  59.08  444.37
2   5.11  39.40  1012.16  92.14  488.56
3  20.86  57.32  1010.24  76.64  446.48
4  10.82  37.50  1009.23  96.62  473.90


#### **Splitting the dataset into the Training set and Test set**

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#### **Multiple Linear Regression**

In [13]:
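# Train the Multiple Linear Regression model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the test set results, rounded to 2 decimals for display
# (alternatively, keep full precision and call np.set_printoptions(precision=2)).
# Note: the metrics in the next cell are computed on these rounded predictions.
y_pred = regressor.predict(X_test).round(2)
# Print predictions (left column) next to the actual values (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))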

[[431.43 431.23]
 [458.56 460.01]
 [462.75 461.14]
 ...
 [469.52 473.26]
 [442.42 438.  ]
 [461.88 463.28]]


In [14]:
# Evaluate model performance on the test set
from sklearn import metrics
from sklearn.metrics import explained_variance_score, r2_score

k = X_test.shape[1]  # number of features
n = len(X_test)      # number of test samples

print('Mean Absolute Error(MAE) of multiple linear regression:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE) of multiple linear regression:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE) of multiple linear regression:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('Explained Variance Score (EVS) of multiple linear regression:',explained_variance_score(y_test, y_pred))
print('R2 of multiple linear regression:',metrics.r2_score(y_test, y_pred))
print('R2 rounded of multiple linear regression:',(metrics.r2_score(y_test, y_pred)).round(2))
r2 = r2_score(y_test, y_pred)
# Adjusted R2 penalizes R2 for the number of features k, given n samples
adjusted_r2 = (1- (1-r2)*(n-1)/(n-k-1)).round(3)
print('Adjusted_r2 of multiple linear regression: ', adjusted_r2)
# For a regressor, score() returns R2, not a classification accuracy. It predicts
# afresh without rounding, so it differs slightly from the R2 printed above.
accuracy = regressor.score(X_test, y_test)
print("Accuracy of multiple linear regression: {}".format(accuracy))

Mean Absolute Error(MAE) of multiple linear regression: 3.5666300940438864
Mean Squared Error(MSE) of multiple linear regression: 19.73409942528735
Root Mean Squared Error (RMSE) of multiple linear regression: 4.442307894021682
Explained Variance Score (EVS) of multiple linear regression: 0.9325314369725277
R2 of multiple linear regression: 0.9325301874814955
R2 rounded of multiple linear regression: 0.93
Adjusted_r2 of multiple linear regression:  0.932
Accuracy of multiple linear regression: 0.9325315554761303
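
The adjusted R2 in the cell above follows the formula `1 - (1 - R2) * (n - 1) / (n - k - 1)`, where `n` is the number of test samples and `k` the number of features. The sketch below reproduces the printed value in plain Python. The function name `adjusted_r2_score` is hypothetical, and `n = 1914` is an assumption (20 % of 9568 rows, rounded up), since the notebook does not print `n`.

```python
def adjusted_r2_score(r2, n_samples, n_features):
    """Adjusted R2: penalizes R2 for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


r2 = 0.9325301874814955  # R2 reported above for multiple linear regression
n = 1914                 # assumed test-set size (20 % of 9568, rounded up)
k = 4                    # features: AT, V, AP, RH

print(round(adjusted_r2_score(r2, n, k), 3))
```

With roughly 1,900 samples and only four features the penalty is tiny, which is why the adjusted value (0.932) is almost identical to the plain R2.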


#### **Train an XGBoost regressor model**

In [15]:
from xgboost import XGBRegressor
regressor = XGBRegressor(objective ='reg:squarederror', learning_rate = 0.1, max_depth = 5, n_estimators = 100)
regressor.fit(X_train, y_train)

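# Predict the test set results, rounded to 2 decimals for display
# (alternatively, keep full precision and call np.set_printoptions(precision=2)).
# The XGBoost predictions are float32, which is why the printed values show extra digits.
y_pred = regressor.predict(X_test).round(2)
# Print predictions (left column) next to the actual values (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Score the trained model on the test set; for a regressor, score() returns R2,
# not a classification accuracy
result = regressor.score(X_test, y_test)
print("Accuracy : {}".format(result))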

[[434.23999023 431.23      ]
 [457.26998901 460.01      ]
 [463.36999512 461.14      ]
 ...
 [470.88000488 473.26      ]
 [438.86999512 438.        ]
 [463.04000854 463.28      ]]
Accuracy : 0.9591821869823342


In [16]:
# Evaluate model performance on the test set
from sklearn import metrics
from sklearn.metrics import explained_variance_score, r2_score

k = X_test.shape[1]  # number of features
n = len(X_test)      # number of test samples

print('Mean Absolute Error(MAE) of XG Boost regression:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE) of XG Boost regression:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE) of XG Boost regression:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('Explained Variance Score (EVS) of XG Boost regression:',explained_variance_score(y_test, y_pred))
print('R2 of XG Boost regression:',metrics.r2_score(y_test, y_pred))
print('R2 rounded of XG Boost regression:',(metrics.r2_score(y_test, y_pred)).round(2))
r2 = r2_score(y_test, y_pred)
# Adjusted R2 penalizes R2 for the number of features k, given n samples
adjusted_r2 = (1- (1-r2)*(n-1)/(n-k-1)).round(3)
print('Adjusted_r2 of XG Boost regression: ', adjusted_r2)
# For a regressor, score() returns R2, not a classification accuracy. It predicts
# afresh without rounding, so it differs slightly from the R2 printed above.
accuracy = regressor.score(X_test, y_test)
print("Accuracy of XG Boost regression: {}".format(accuracy))

Mean Absolute Error(MAE) of XG Boost regression: 2.639874472304198
Mean Squared Error(MSE) of XG Boost regression: 11.93954488691334
Root Mean Squared Error (RMSE) of XG Boost regression: 3.455364653247663
Explained Variance Score (EVS) of XG Boost regression: 0.9591917782765134
R2 of XG Boost regression: 0.9591793454712169
R2 rounded of XG Boost regression: 0.96
Adjusted_r2 of XG Boost regression:  0.959
Accuracy of XG Boost regression: 0.9591821869823342
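
To put the two sets of results side by side, the sketch below computes how much lower each error metric is for XGBoost than for multiple linear regression. It uses only the numbers printed above, and the helper name `relative_reduction` is hypothetical.

```python
import math

# Error metrics copied from the two evaluation cells above.
linear = {"MAE": 3.5666300940438864,
          "MSE": 19.73409942528735,
          "RMSE": 4.442307894021682}
xgb = {"MAE": 2.639874472304198,
       "MSE": 11.93954488691334,
       "RMSE": 3.455364653247663}


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


reductions = {name: relative_reduction(linear[name], xgb[name]) for name in linear}
for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower with XGBoost")

# RMSE is the square root of MSE, so the two metrics tell the same story.
assert math.isclose(math.sqrt(linear["MSE"]), linear["RMSE"])
assert math.isclose(math.sqrt(xgb["MSE"]), xgb["RMSE"])
```

On this single train/test split, XGBoost reduces RMSE by roughly 22 % and MAE by roughly 26 %. The comparison rests on one random split and untuned models, so the gap should be read as indicative, not as a general ranking of the two methods.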
