# LightGBM Classifier

## Part 1 - Data Preprocessing

### Importing the dataset

In [4]:
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')

In [5]:
dataset.head()

,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [6]:
# The dataset contains customer information for a bank, including features such as CreditScore, Geography, Gender, Age, Tenure, Balance, 
# NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary. The target variable is 'Exited', which indicates whether a customer 
# has exited the bank. The goal is to predict the 'Exited' status based on the given features.


### Checking missing data

In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB
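`dataset.info()` reports 10000 non-null values for every column, so nothing needs to be imputed here. A more direct check is a per-column count of missing values. The sketch below uses a small hypothetical `toy` frame as a stand-in for the churn data, so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells (not the churn data)
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan],
    "Geography": ["France", None, "Spain"],
    "Age": [42, 41, 42],
})

# Count missing values per column; a complete column shows 0
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

On the real data, `dataset.isnull().sum()` would be expected to show 0 for every column, in line with the `info()` output above.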


### Handling categorical variables

CustomerId and Surname columns

In [8]:
dataset.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [9]:
dataset.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Geography column

In [10]:
dataset['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [11]:
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first = True)

In [12]:
geography_dummies

,Germany,Spain
0,False,False
1,False,True
2,False,False
3,False,False
4,False,True
...,...,...
9995,False,False
9996,False,False
9997,False,False
9998,True,False
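
The output has only two columns for three countries because `drop_first = True` drops the first category in sorted order (France), which then acts as the baseline. A minimal sketch on a hypothetical four-row series illustrates the difference:

```python
import pandas as pd

# Hypothetical sample of the Geography column
geo = pd.Series(["France", "Spain", "Germany", "France"], name="Geography")

# Full one-hot encoding: one column per category, in sorted order
full = pd.get_dummies(geo)

# drop_first=True removes the first column (France), which becomes the baseline
reduced = pd.get_dummies(geo, drop_first=True)

print(list(full.columns))     # ['France', 'Germany', 'Spain']
print(list(reduced.columns))  # ['Germany', 'Spain']

# A France row has no flag set in the reduced encoding
print(reduced.iloc[0].tolist())
```

A customer from France is therefore represented by `Germany` and `Spain` both being false.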


In [13]:
dataset = pd.concat([geography_dummies, dataset], axis = 1)

In [14]:
dataset.head()

,Germany,Spain,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [15]:
dataset.drop(['Geography'], axis = 1, inplace = True)

In [16]:
dataset.head()

,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Female,43,2,125510.82,1,1,1,79084.1,0


Gender column

In [17]:
dataset['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [18]:
dataset['Gender'] = dataset['Gender'].apply(lambda x: 0 if x == 'Female' else 1)

In [19]:
dataset.head()

,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,0,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,0,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,0,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,0,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,0,43,2,125510.82,1,1,1,79084.1,0
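
The lambda above encodes anything that is not exactly `'Female'` as 1. That is fine here because `unique()` shows only two values. As an alternative sketch, an explicit mapping turns unexpected labels into NaN so they can be spotted. The sample series, including the lowercase `'female'` entry, is hypothetical.

```python
import pandas as pd

# Hypothetical sample with one inconsistently spelled label
gender = pd.Series(["Female", "Male", "Male", "female"], name="Gender")

# Explicit mapping: labels missing from the dict become NaN instead of 1
encoded = gender.map({"Female": 0, "Male": 1})
print(encoded.tolist())

# List the raw labels that could not be encoded
unexpected = gender[encoded.isna()].unique().tolist()
print(unexpected)
```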


### Creating the Training Set and the Test Set

Getting the inputs and output

In [20]:
X = dataset.iloc[:, :-1]

In [21]:
y = dataset.iloc[:, -1]

In [22]:
X

,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,False,False,619,0,42,2,0.00,1,1,1,101348.88
1,False,True,608,0,41,1,83807.86,1,0,1,112542.58
2,False,False,502,0,42,8,159660.80,3,1,0,113931.57
3,False,False,699,0,39,1,0.00,2,0,0,93826.63
4,False,True,850,0,43,2,125510.82,1,1,1,79084.10
...,...,...,...,...,...,...,...,...,...,...,...
9995,False,False,771,1,39,5,0.00,2,1,0,96270.64
9996,False,False,516,1,35,10,57369.61,1,1,1,101699.77
9997,False,False,709,0,36,7,0.00,1,0,1,42085.58
9998,True,False,772,1,42,3,75075.31,2,1,0,92888.52


In [23]:
y

0       1
1       0
2       1
3       0
4       0
       ..
9995    0
9996    0
9997    1
9998    1
9999    0
Name: Exited, Length: 10000, dtype: int64

Getting the Training Set and the Test Set

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
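
Only about one customer in five has exited, so the classes are imbalanced. The plain random split above does not guarantee the same class ratio in both sets. A possible variant, sketched on hypothetical labels with 20% positives, passes `stratify` to keep the ratio identical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 200 positives, 800 negatives
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Share of positives in the training and test subsets
print(y_tr.mean(), y_te.mean())
```

In this notebook the equivalent call would add `stratify = y` to the existing `train_test_split` line.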

## Part 2 - Building and training the model

### Building the model

In [29]:
# Install LightGBM into the active kernel environment (only needed once)
%pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-win_amd64.whl.metadata (17 kB)
Downloading lightgbm-4.6.0-py3-none-win_amd64.whl (1.5 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.6.0
Note: you may need to restart the kernel to use updated packages.


In [30]:
import lightgbm as lgb
model = lgb.LGBMClassifier()

### Training the model

In [31]:
model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 1632, number of negative: 6368
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000971 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 856
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.204000 -> initscore=-1.361479
[LightGBM] [Info] Start training from score -1.361479
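
The `pavg` and `initscore` values in the log can be checked by hand. Assuming the starting score is the log-odds of the positive class rate (an assumption that reproduces the logged numbers), a short calculation gives:

```python
import math

# Class counts taken from the training log above
positives, negatives = 1632, 6368

# Average positive rate in the training set
pavg = positives / (positives + negatives)

# Log-odds of the positive rate, used as the starting score
init_score = math.log(pavg / (1 - pavg))

print(f"pavg={pavg:.6f} -> initscore={init_score:.6f}")
# pavg=0.204000 -> initscore=-1.361479
```

Boosting therefore starts from a constant prediction of about 20.4% churn probability and each tree corrects it.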


### Inference

In [32]:
y_pred = model.predict(X_test)

### Predicting the result of a single observation

**Homework**

Use our model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: \$ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: \$ 50000

So, should we say goodbye to that customer?

**Solution**

In [33]:
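# Values follow the column order of X: Germany, Spain, CreditScore, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary
print(model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]))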

[0]
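
Typing the eleven values by hand is error-prone, since they must match the preprocessing and the column order of `X`. One way to make this explicit is sketched below. The `customer` dict and the `feature_order` list are illustrative names, and the encoding steps mirror the ones applied earlier in the notebook.

```python
import pandas as pd

# Column order of X after preprocessing in this notebook
feature_order = ["Germany", "Spain", "CreditScore", "Gender", "Age", "Tenure",
                 "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
                 "EstimatedSalary"]

# Raw description of the customer from the homework statement
customer = {"Geography": "France", "CreditScore": 600, "Gender": "Male",
            "Age": 40, "Tenure": 3, "Balance": 60000, "NumOfProducts": 2,
            "HasCrCard": 1, "IsActiveMember": 1, "EstimatedSalary": 50000}

# Apply the same encodings as in Part 1
row = dict(customer)
geography = row.pop("Geography")
row["Germany"] = int(geography == "Germany")  # France is the dropped baseline
row["Spain"] = int(geography == "Spain")
row["Gender"] = 0 if row["Gender"] == "Female" else 1

# One-row frame with columns in training order
single = pd.DataFrame([row])[feature_order]
print(single.values.tolist())
# [[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]
```

With the fitted model from this notebook, `model.predict(single)` would then take the place of the hand-written list.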




## Part 3 - Evaluating the model

### Making the Confusion Matrix

In [34]:
from sklearn.metrics import confusion_matrix

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = cm.ravel()

# Print the confusion matrix and its components
print("Confusion Matrix:")
print(cm)
print(f"True Negatives (TN): {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Positives (TP): {tp}")

Confusion Matrix:
[[1506   89]
 [ 184  221]]
True Negatives (TN): 1506
False Positives (FP): 89
False Negatives (FN): 184
True Positives (TP): 221
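
The four counts also give precision, recall and F1 for the churn class, which are often more informative than accuracy on imbalanced data. A short worked calculation from the numbers printed above:

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 1506, 89, 184, 221

# Of the customers predicted to leave, how many actually left
precision = tp / (tp + fp)

# Of the customers who actually left, how many were caught
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")  # Precision: 0.7129
print(f"Recall:    {recall:.4f}")     # Recall:    0.5457
print(f"F1 score:  {f1:.4f}")         # F1 score:  0.6182
```

The model finds roughly 55% of the customers who actually exited, so the 184 false negatives are the main weakness here.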


### Accuracy

In [35]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8635
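
This score can be reproduced from the confusion matrix and compared with a naive baseline that always predicts "stays". The baseline is added only for illustration and is not part of the original evaluation.

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 1506, 89, 184, 221
total = tn + fp + fn + tp

# Accuracy is the share of correct predictions
accuracy = (tn + tp) / total

# Baseline: always predict the majority class (customer stays)
baseline = max(tn + fp, fn + tp) / total

print(f"Model accuracy:    {accuracy:.4f}")  # Model accuracy:    0.8635
print(f"Majority baseline: {baseline:.4f}")  # Majority baseline: 0.7975
```

The model improves on the baseline by about 6.6 percentage points, which puts the 86.35% figure into context.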

### k-Fold Cross Validation

In [36]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = 10)
print(f"Accuracy: {accuracies.mean()*100:.2f} %")
print(f"Standard Deviation: {accuracies.std()*100:.2f} %")

[LightGBM] [Info] Number of positive: 1833, number of negative: 7167
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001545 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 856
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203667 -> initscore=-1.363533
[LightGBM] [Info] Start training from score -1.363533
[LightGBM] [Info] Number of positive: 1833, number of negative: 7167
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000135 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 857
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203667 -> initscore=-1.363533
[LightGBM]
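
With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling. If the rows had some ordering, an explicit splitter with shuffling could be passed instead. The sketch below shows the pattern on synthetic data with a decision tree as a stand-in estimator. Neither is part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Stand-in estimator; in the notebook this would be the LightGBM model
stand_in = DecisionTreeClassifier(max_depth=3, random_state=0)

# Explicit 10-fold stratified splitter with reproducible shuffling
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(stand_in, X_demo, y_demo, scoring="accuracy", cv=cv)
print(f"Accuracy: {scores.mean()*100:.2f} %")
print(f"Standard Deviation: {scores.std()*100:.2f} %")
```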

### Grid Search (might take some time to run)

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = [{'num_leaves': [29, 30, 31, 32, 33],
               'learning_rate': [0.08, 0.09, 0.1, 0.11, 0.12],
               'n_estimators': [80, 90, 100, 110, 120]}]
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
grid_search.fit(X, y)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print(f"Best Accuracy: {best_accuracy*100:.2f} %")
print("Best Parameters:", best_parameters)
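
The warning about run time follows from the size of the grid. A quick count of the model fits, assuming the default `refit = True` (one extra fit on the full data at the end):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
parameters = [{'num_leaves': [29, 30, 31, 32, 33],
               'learning_rate': [0.08, 0.09, 0.1, 0.11, 0.12],
               'n_estimators': [80, 90, 100, 110, 120]}]

# Number of distinct parameter combinations: 5 * 5 * 5
n_candidates = len(ParameterGrid(parameters))

# Each combination is trained once per fold, plus one final refit
n_folds = 10
n_fits = n_candidates * n_folds + 1

print(n_candidates)  # 125
print(n_fits)        # 1251
```

Passing `n_jobs = -1` to `GridSearchCV` runs these fits in parallel across the available cores, which may shorten the wait.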

In [None]:
# Explanation of GridSearchCV parameters:
# 'num_leaves': This parameter controls the maximum number of leaves in one tree. 
#               Increasing this value can improve the model's learning capacity but may lead to overfitting.
# 'learning_rate': This parameter determines the step size at each iteration while moving toward a minimum of a loss function.
#                  A smaller learning rate requires more boosting rounds but can lead to better accuracy.
# 'n_estimators': This parameter specifies the number of boosting rounds or trees in the model.
#                 More trees can improve the model's performance but also increase the risk of overfitting.
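
To get a feel for the `num_leaves` values in the grid: a binary tree of depth d has at most 2^d leaves, so each leaf count implies a minimum depth. The sketch below computes that lower bound. The actual depth can be larger, because LightGBM grows trees leaf-wise rather than level by level.

```python
import math

# num_leaves values searched in the grid above
grid_num_leaves = [29, 30, 31, 32, 33]

# A tree with L leaves needs a depth of at least ceil(log2(L))
min_depths = {n: math.ceil(math.log2(n)) for n in grid_num_leaves}

for n, depth in min_depths.items():
    print(f"num_leaves={n}: depth >= {depth}")
```

All values up to 32 fit in a tree of depth 5, while 33 leaves need at least depth 6.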
