# XGBoost Classifier

## Part 1 - Data Preprocessing

### Importing the dataset

In [2]:
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')

In [3]:
dataset.head()

,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [None]:
# This dataset describes bank customers for churn modeling. Each row holds a customer's credit score,
# geography, gender, age, tenure, balance, number of products, whether they have a credit card,
# whether they are an active member, their estimated salary, and whether they exited the
# service (churned).
#
# Our goal is to build a model with the XGBoost Classifier that predicts whether a customer will
# churn based on these features. We will preprocess the data, encode the categorical variables,
# train the model, and then evaluate its performance.


### Checking missing data

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB
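
The `info()` output above reports 10000 non-null values in every column, so nothing is missing here. As a minimal sketch of a more direct check, `isnull().sum()` counts missing values per column. The small `toy` frame below is hypothetical and only stands in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the churn dataset (not the real data)
toy = pd.DataFrame({
    'CreditScore': [619, 608, np.nan],
    'Geography': ['France', None, 'France'],
    'Exited': [1, 0, 1],
})

# Count missing values per column; all zeros would mean no missing data
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single boolean answer: is anything missing anywhere in the frame?
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real `dataset`, the same two calls would be expected to report zeros and `False`, in line with the `info()` summary.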


### Handling categorical variables

CustomerId and Surname columns

In [5]:
dataset.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [6]:
dataset.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Geography column

In [7]:
dataset['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [8]:
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first = True)

In [9]:
geography_dummies

,Germany,Spain
0,False,False
1,False,True
2,False,False
3,False,False
4,False,True
...,...,...
9995,False,False
9996,False,False
9997,False,False
9998,True,False


In [10]:
dataset = pd.concat([geography_dummies, dataset], axis = 1)

In [11]:
dataset.head()

,Germany,Spain,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [12]:
dataset.drop(['Geography'], axis = 1, inplace = True)

In [13]:
dataset.head()

,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Female,43,2,125510.82,1,1,1,79084.1,0
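
The Geography steps above (build dummies, concatenate, drop the original column) can also be done in one call. The sketch below is an alternative, not the notebook's method, and runs on a hypothetical `toy` frame. Note two differences: `pd.get_dummies(..., columns=[...])` prefixes the new columns with `Geography_` and appends them after the remaining columns, so the feature order would differ from the one used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy rows using the same three Geography values as the dataset
toy = pd.DataFrame({
    'CreditScore': [619, 608, 376],
    'Geography': ['France', 'Spain', 'Germany'],
})

# Encode and drop the original column in one call; drop_first=True removes
# the first category in sorted order ('France'), which becomes the baseline
encoded = pd.get_dummies(toy, columns=['Geography'], drop_first=True)
print(encoded.columns.tolist())
# ['CreditScore', 'Geography_Germany', 'Geography_Spain']
```

A row with both dummy columns set to `False` therefore represents France, exactly as in the two-column output shown earlier.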


Gender column

In [14]:
dataset['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [15]:
dataset['Gender'] = dataset['Gender'].apply(lambda x: 0 if x == 'Female' else 1)

In [16]:
dataset.head(10)

,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,0,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,0,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,0,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,0,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,0,43,2,125510.82,1,1,1,79084.1,0
5,False,True,645,1,44,8,113755.78,2,1,0,149756.71,1
6,False,False,822,1,50,7,0.0,2,1,1,10062.8,0
7,True,False,376,0,29,4,115046.74,4,1,0,119346.88,1
8,False,False,501,1,44,4,142051.07,2,0,1,74940.5,0
9,False,False,684,1,27,2,134603.88,1,1,1,71725.73,0
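
As a hedged alternative to the `apply` with a lambda used above, the same Female/Male encoding can be written with `Series.map` and an explicit dictionary. The `toy` frame and the name `gender_codes` are illustrative only.

```python
import pandas as pd

# Hypothetical toy column with the two values found by unique() above
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female']})

# Explicit mapping: same codes as the notebook (Female -> 0, Male -> 1)
gender_codes = {'Female': 0, 'Male': 1}
toy['Gender'] = toy['Gender'].map(gender_codes)
print(toy['Gender'].tolist())
# [0, 1, 0]
```

One behavioural difference: `map` yields `NaN` for any value missing from the dictionary, whereas the lambda encodes every value other than `'Female'` as 1. With only two categories in this dataset, both give the same result.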


### Creating the Training Set and the Test Set

Getting the inputs and output

In [17]:
X = dataset.iloc[:, :-1].values

In [18]:
y = dataset.iloc[:, -1].values

In [19]:
X

array([[False, False, 619, ..., 1, 1, 101348.88],
       [False, True, 608, ..., 0, 1, 112542.58],
       [False, False, 502, ..., 1, 0, 113931.57],
       ...,
       [False, False, 709, ..., 0, 1, 42085.58],
       [True, False, 772, ..., 1, 0, 92888.52],
       [False, False, 792, ..., 1, 0, 38190.78]],
      shape=(10000, 11), dtype=object)

In [20]:
y

array([1, 0, 1, ..., 1, 1, 0], shape=(10000,))

Getting the Training Set and the Test Set

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
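
With `test_size = 0.2`, 10000 rows split into 8000 for training and 2000 for testing. The sketch below checks this on synthetic arrays of the same shape (`X_demo` and `y_demo` are made up, not the real data). It also passes `stratify`, which the notebook does not use; that is an optional addition shown here because the classes are imbalanced, and it keeps the churn rate equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's X and y
X_demo = np.zeros((10000, 11))
y_demo = np.array([1] * 2000 + [0] * 8000)  # assumed 20% positives

# stratify=y_demo is an assumption of this sketch, not part of the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)    # (8000, 11) (2000, 11)
print(y_tr.mean(), y_te.mean())  # positive rate in each subset
```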

## Part 2 - Building and training the model

### Building the model

In [22]:
import xgboost
model = xgboost.XGBClassifier(max_depth = 4, learning_rate = 0.1, n_estimators = 100)

### Training the model

In [23]:
model.fit(X_train, y_train)

### Inference

In [24]:
y_pred = model.predict(X_test)

In [25]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], shape=(2000,))

In [26]:
y_test

array([0, 1, 0, ..., 0, 0, 0], shape=(2000,))

### Predicting the result of a single observation

**Homework**

Use our model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: \$ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: \$ 50000

So, should we say goodbye to that customer?

In [27]:
model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])

array([0])

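**Solution**

The model returns `array([0])`, i.e. class 0 (`Exited` = 0): it predicts that this customer stays with the bank.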
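
The row passed to `model.predict` has to follow the column order of `X` and use the same encodings as the preprocessing steps. The sketch below shows one way to derive that row from the raw customer description; `encode_customer` and `FEATURE_ORDER` are hypothetical names introduced for illustration, with the column order taken from the `dataset.head()` output.

```python
# Column order of X in this notebook, taken from the dataset.head() output
FEATURE_ORDER = ['Germany', 'Spain', 'CreditScore', 'Gender', 'Age', 'Tenure',
                 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
                 'EstimatedSalary']

def encode_customer(customer):
    """Hypothetical helper: turn a raw customer dict into one model input row."""
    row = dict(customer)
    geography = row.pop('Geography')
    # France is the dropped baseline category, so both dummies are 0 for it
    row['Germany'] = int(geography == 'Germany')
    row['Spain'] = int(geography == 'Spain')
    # Same encoding as the notebook: Female -> 0, anything else -> 1
    row['Gender'] = 0 if row['Gender'] == 'Female' else 1
    return [row[name] for name in FEATURE_ORDER]

customer = {
    'Geography': 'France', 'CreditScore': 600, 'Gender': 'Male', 'Age': 40,
    'Tenure': 3, 'Balance': 60000, 'NumOfProducts': 2, 'HasCrCard': 1,
    'IsActiveMember': 1, 'EstimatedSalary': 50000,
}

# predict expects a 2D input, hence the outer list
observation = [encode_customer(customer)]
print(observation)
# [[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]
```

The printed row matches the literal list used in the prediction cell, so `model.predict(observation)` would be an equivalent call.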

## Part 3 - Evaluating the model

### Making the Confusion Matrix

In [28]:
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()

# Print the confusion matrix and its components
print("Confusion Matrix:\n", cm)
print(f"True Negatives (TN): {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Positives (TP): {tp}")

Confusion Matrix:
 [[1520   75]
 [ 190  215]]
True Negatives (TN): 1520
False Positives (FP): 75
False Negatives (FN): 190
True Positives (TP): 215
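
The four counts above also give precision, recall, and F1 for the churn class, which say more than accuracy alone when the classes are imbalanced. The sketch below simply plugs in the numbers printed above; it is plain arithmetic, not an extra model run.

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 1520, 75, 190, 215

# Precision: of the customers predicted to churn, how many actually did
precision = tp / (tp + fp)
# Recall: of the customers who actually churned, how many were caught
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")  # 0.741
print(f"Recall:    {recall:.3f}")     # 0.531
print(f"F1 score:  {f1:.3f}")         # 0.619
```

A recall of about 0.53 means the model misses 190 of the 405 actual churners in the test set, a weakness that an overall accuracy figure does not reveal.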


### Accuracy

In [29]:
(1520+215)/(1520+215+75+190)

0.8675

In [30]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8675

### k-Fold Cross Validation

In [32]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = 10)
print(f"Average Accuracy: {accuracies.mean()*100} %")
print(f"Standard Deviation: {accuracies.std()*100} %")

Average Accuracy: 86.38999999999999 %
Standard Deviation: 0.8607554821202136 %
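
For a classifier and an integer `cv`, scikit-learn's `cross_val_score` is documented to use stratified folds, so each of the 10 folds should hold out a tenth of the rows with roughly the same churn rate. The sketch below illustrates that splitting on synthetic labels (`y_demo` with an assumed 20% positives, not the real data); it shows only the fold construction, not model training.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels only: 10000 rows with 20% positives (not the real data)
y_demo = np.array([1] * 2000 + [0] * 8000)
X_demo = np.zeros((len(y_demo), 1))  # feature values are irrelevant for splitting

skf = StratifiedKFold(n_splits=10)
fold_sizes = []
fold_positives = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_sizes.append(len(test_idx))                    # rows held out in this fold
    fold_positives.append(int(y_demo[test_idx].sum()))  # positives held out

print(fold_sizes)      # ten folds of 1000 rows each
print(fold_positives)  # 200 positives per fold
```

Each row is held out exactly once, so the reported average is computed over predictions for the whole dataset, and the standard deviation indicates how much accuracy varies from fold to fold.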
