### Customer Churn Prediction
#### Predicting Bank Customer Churn
Churn prediction means identifying which customers are likely to leave or unsubscribe from a service. For many companies this is an important prediction, because acquiring new customers often costs more than retaining existing ones. Once you have identified customers at risk of churn, you need to decide which retention efforts to direct at each of them to maximize their likelihood of staying.

**Churn tells you how many existing customers are leaving your business, so lowering churn has a big positive impact on your revenue streams.**

In this project, you will evaluate a dataset containing details of a bank's customers. The target is a binary variable indicating whether the customer left the bank (closed their account) or remains a customer.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import xgboost as xgb

In [3]:
df = pd.read_csv('Churn_Modelling.csv')

In [4]:
df.head()

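| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |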


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [6]:
df.shape

(10000, 14)

In [7]:
df.describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |
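The `mean` of `Exited` in the summary above is 0.2037. Because the target is coded 0/1, its mean is the share of customers who left, so roughly one customer in five churned and the two classes are imbalanced. The sketch below shows that relationship on a small made-up series (the values are hypothetical and are not taken from `Churn_Modelling.csv`).

```python
import pandas as pd

# Toy target column (hypothetical values): 1 = customer left, 0 = customer stayed.
exited = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], name="Exited")

# For a 0/1 column, the mean equals the fraction of 1s, i.e. the churn rate.
churn_rate = exited.mean()

# Share of each class, useful for spotting class imbalance.
class_share = exited.value_counts(normalize=True)

print(f"Churn rate: {churn_rate:.2%}")
print(class_share)
```

On the real data, the same two calls on `df['Exited']` would give the class balance to keep in mind when reading accuracy figures later.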


##### Activity 1
Separate features and target in X and y respectively. Use only relevant features in X.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('Churn_Modelling.csv')

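# Separate features and target.
# Note: X still contains the identifier columns (RowNumber, CustomerId, Surname)
# and the unencoded categorical columns; both are handled in Activity 2.
X = data.drop(columns=['Exited'])  # Features
y = data['Exited']  # Target

# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)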

# Print shapes to verify
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (7000, 13)
X_test shape: (3000, 13)
y_train shape: (7000,)
y_test shape: (3000,)
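The activity asks for only relevant features in `X`. One possible reading, sketched below on a made-up frame, is to drop identifier-like columns together with the target. Treating `RowNumber`, `CustomerId` and `Surname` as non-predictive is an assumption here, and the toy values are hypothetical.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the dataset (values are made up).
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "CreditScore": [619, 608, 502, 699],
    "Age": [42, 41, 42, 39],
    "Exited": [1, 0, 1, 0],
})

# Assumption: these columns only identify a row or a person and carry no signal.
id_columns = ["RowNumber", "CustomerId", "Surname"]

# Features: everything except the identifiers and the target.
X_toy = toy.drop(columns=id_columns + ["Exited"])
y_toy = toy["Exited"]

print(X_toy.columns.tolist())
print(X_toy.shape, y_toy.shape)
```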


##### Activity 2 


**XGBoost** is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

It is one of the most widely used algorithms in applied machine learning competitions and has gained popularity through winning solutions on structured, tabular data.

XGBoost has a `scikit-learn` API, which is useful if you want to use scikit-learn classes and methods with an XGBoost model (e.g., `fit()`, `predict()`). In this section, we'll try the API out with the `xgboost.XGBClassifier()` class and get a baseline accuracy for the rest of our work. To make the results reproducible, we'll set `random_state`.
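To make the idea of "combining the estimates of simpler, weaker models" concrete, here is a minimal sketch of gradient boosting for squared-error regression, built from depth-1 scikit-learn trees. It is a simplified illustration and not how XGBoost is implemented (XGBoost adds regularization and other refinements). The synthetic data, `learning_rate` and `n_rounds` are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3  # how much of each weak model's estimate is added
n_rounds = 20        # number of weak models in the ensemble

# Start from a constant prediction, then repeatedly fit a weak model
# to the current residuals and add a shrunken copy of its output.
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - prediction  # negative gradient of squared error
    stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, residuals)
    prediction += learning_rate * stump.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each stump on its own is a poor model, but the running sum of their shrunken predictions reduces the training error round by round.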



In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
data = pd.read_csv('Churn_Modelling.csv')

# Drop unnecessary columns
data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

# Encode 'Gender' column (Male: 1, Female: 0)
label_encoder = LabelEncoder()
data['Gender'] = label_encoder.fit_transform(data['Gender'])

# One-Hot Encoding for 'Geography' column
data = pd.get_dummies(data, columns=['Geography'], drop_first=True)

# Separate features and target
X = data.drop(columns=['Exited'])  # Features
y = data['Exited']  # Target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Initialize the XGBoost Classifier
model = XGBClassifier(random_state=123)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Calculate precision, recall, and F1-score
test_precision = precision_score(y_test, y_test_pred, average='macro')
test_recall = recall_score(y_test, y_test_pred, average='macro')
test_f1score = f1_score(y_test, y_test_pred, average='macro')

# Print results
print(f"Training Accuracy: {train_accuracy}")
print(f"Testing Accuracy: {test_accuracy}")
print(f"Test Precision: {test_precision}")
print(f"Test Recall: {test_recall}")
print(f"Test F1-Score: {test_f1score}")

Training Accuracy: 0.9601428571428572
Testing Accuracy: 0.855
Test Precision: 0.79050031570402
Test Recall: 0.7354356659417244
Test F1-Score: 0.7570273553684387
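The precision, recall and F1 figures above use `average='macro'`, which computes the metric separately for each class and then takes the unweighted mean, so the minority (churned) class counts as much as the majority class. The sketch below checks this on a small made-up label vector; the arrays are hypothetical and unrelated to the model's predictions.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels: 6 negatives and 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Precision for each class separately (order follows the sorted labels: 0, 1).
per_class = precision_score(y_true, y_pred, average=None)

# Macro average as reported by scikit-learn ...
macro = precision_score(y_true, y_pred, average="macro")

# ... and the same quantity computed by hand as an unweighted mean.
manual = per_class.mean()

print("Per-class precision:", per_class)
print("Macro precision:", macro, "| manual mean:", manual)
```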



##### Activity 3

**Hyperparameter tuning** is a vital part of improving the overall behavior and performance of a machine learning model. A hyperparameter is a parameter that is set before the learning process starts and is not learned from the data.

Without tuning, the model may be trained with settings that leave the loss function far from its minimum, which often leads to inaccurate results.

> The goal is that our model produces as few errors as possible.
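A randomized search evaluates only a random sample of the possible hyperparameter combinations instead of all of them. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` to count the combinations in the grid used in this activity and to draw five candidates from it. It only illustrates the sampling step: no model is trained, and the candidates printed here are not claimed to match any particular search run.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the grid in this activity.
grid = {
    "max_depth": list(range(3, 12)),
    "alpha": [0, 0.001, 0.01, 0.1, 1],
    "subsample": [0.5, 0.75, 1],
    "learning_rate": np.linspace(0.01, 0.5, 10).tolist(),
    "n_estimators": [10, 25, 40],
}

# Size of the exhaustive grid: the product of the list lengths.
n_combinations = len(ParameterGrid(grid))

# Draw 5 random combinations, as a search with n_iter=5 would evaluate.
candidates = list(ParameterSampler(grid, n_iter=5, random_state=123))

print(f"Full grid: {n_combinations} combinations")
for params in candidates:
    print(params)
```

With `cv=3`, five candidates cost 15 model fits, whereas an exhaustive search over the full grid would cost three fits per combination.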

In [13]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Define the parameter grid
rs_param_grid = {
    'max_depth': list(range(3, 12)),  # Maximum depth of a tree
    'alpha': [0, 0.001, 0.01, 0.1, 1],  # L1 regularization term
    'subsample': [0.5, 0.75, 1],  # Fraction of training rows sampled per tree
    'learning_rate': np.linspace(0.01, 0.5, 10),  # Step size shrinkage
    'n_estimators': [10, 25, 40]  # Number of boosting rounds
}

# Initialize the XGBoost Classifier.
# Named xgb_clf so it does not shadow the `xgb` module alias imported earlier.
xgb_clf = XGBClassifier(random_state=123)

# Create the RandomizedSearchCV object
xgb_rs = RandomizedSearchCV(
    estimator=xgb_clf,
    param_distributions=rs_param_grid,
    cv=3,  # 3-fold cross-validation
    n_iter=5,  # Number of parameter combinations to sample
    verbose=2,
    random_state=123,
    scoring='f1_macro'
)

# Fit the RandomizedSearchCV object to the training data
xgb_rs.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", xgb_rs.best_params_)

# Initialize the XGBoost Classifier with the best parameters
best_model = XGBClassifier(**xgb_rs.best_params_, random_state=123)

# Train the model
best_model.fit(X_train, y_train)

# Make predictions
y_test_pred_tuned = best_model.predict(X_test)

# Calculate evaluation metrics
from sklearn.metrics import precision_score, recall_score, f1_score

test_precision_tuned = precision_score(y_test, y_test_pred_tuned, average='macro')
test_recall_tuned = recall_score(y_test, y_test_pred_tuned, average='macro')
test_f1score_tuned = f1_score(y_test, y_test_pred_tuned, average='macro')

# Print results
print(f"Tuned Test Precision: {test_precision_tuned}")
print(f"Tuned Test Recall: {test_recall_tuned}")
print(f"Tuned Test F1-Score: {test_f1score_tuned}")

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] END alpha=1, learning_rate=0.22777777777777777, max_depth=5, n_estimators=10, subsample=0.5; total time=   0.0s
[CV] END alpha=1, learning_rate=0.22777777777777777, max_depth=5, n_estimators=10, subsample=0.5; total time=   0.0s
[CV] END alpha=1, learning_rate=0.22777777777777777, max_depth=5, n_estimators=10, subsample=0.5; total time=   0.0s
[CV] END alpha=1, learning_rate=0.11888888888888888, max_depth=6, n_estimators=40, subsample=1; total time=   0.0s
[CV] END alpha=1, learning_rate=0.11888888888888888, max_depth=6, n_estimators=40, subsample=1; total time=   0.0s
[CV] END alpha=1, learning_rate=0.11888888888888888, max_depth=6, n_estimators=40, subsample=1; total time=   0.0s
[CV] END alpha=1, learning_rate=0.11888888888888888, max_depth=8, n_estimators=40, subsample=0.75; total time=   0.0s
[CV] END alpha=1, learning_rate=0.11888888888888888, max_depth=8, n_estimators=40, subsample=0.75; total time=   0.0s
[CV] END