## Week 6 - Bank Customer Retention Prediction

In this assignment, you will build a Machine Learning model to predict whether a bank customer is likely to CHURN (i.e., exit) or not based on various features such as credit score, age, tenure, and more. 

The dataset contains information about the customers, including their demographics, banking behaviors, and whether they have exited the bank (the label to be predicted).


In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
%matplotlib inline


**Question 1:** Load the bank customer retention dataset into a variable called `customer_retention_df`. Next, write a function called `check_data` to check if the data has been loaded successfully.

**Question 1.1:** Remove the `RowNumber` column. Then, explore the dataset to understand the features, data types, and potential missing values.

In [2]:
# load the customer retention dataset
customer_retention_df = pd.read_csv('bank_customer_retention_prediction.csv')

# write a function called `check_data` to check data loading is successful
def check_data():
    # Check if the DataFrame is empty
    if customer_retention_df.empty:
        print("Data loading unsuccessful. DataFrame is empty.")
    else:
        # Remove the 'RowNumber' column
        customer_retention_df.drop(columns=['RowNumber'], inplace=True, errors='ignore')

        # Display the first few rows of the DataFrame
        print("Data loading successful. Here are the first few rows of the dataset:")
        print(customer_retention_df.head())

# Call the check_data function
check_data()

Data loading successful. Here are the first few rows of the dataset:
   CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure  \
0    15634602  Hargrave          619    France  Female   42       2   
1    15647311      Hill          608     Spain  Female   41       1   
2    15619304      Onio          502    France  Female   42       8   
3    15701354      Boni          699    France  Female   39       1   
4    15737888  Mitchell          850     Spain  Female   43       2   

     Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  \
0       0.00              1          1               1        101348.88   
1   83807.86              1          0               1        112542.58   
2  159660.80              3          1               0        113931.57   
3       0.00              2          0               0         93826.63   
4  125510.82              1          1               1         79084.10   

   Exited  
0       1  
1       0  
2       1  
3    
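
Question 1.1 also asks for a look at data types and missing values. A minimal sketch of such a check follows. It assumes a hypothetical three-row sample copied from the output above rather than the full CSV, and `sample_df` and `summarize_columns` are illustrative names that are not part of the assignment.

```python
import pandas as pd

# Hypothetical sample: a few columns from the first three rows printed above.
sample_df = pd.DataFrame({
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Female"],
    "Age": [42, 41, 42],
    "Balance": [0.00, 83807.86, 159660.80],
    "Exited": [1, 0, 1],
})


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


summary = summarize_columns(sample_df)
print(summary)
```

On the real data, the same function could be called as `summarize_columns(customer_retention_df)`.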

**Question 2:** Preprocess the data by handling missing values, converting categorical variables, and scaling numerical features (if needed).

**Note**: assign your final preprocessed dataset to a variable called `processed_customer_retention_df`. Failure to do this might result in you not getting a score for this question.


In [3]:
# Drop unnecessary columns
customer_retention_df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True, errors='ignore')

# Separate features and target variable
X = customer_retention_df.drop(columns=['Exited'])
y = customer_retention_df['Exited']

# Identify categorical and numerical features
categorical_features = ['Geography', 'Gender']
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Create transformers for numerical and categorical features
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')

# Apply transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Build the preprocessing pipeline
preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Apply the preprocessing pipeline to the features
processed_features = preprocessing_pipeline.fit_transform(X)

# Concatenate the preprocessed features with the target variable
processed_customer_retention_df = pd.concat([pd.DataFrame(processed_features), y], axis=1)

# Display the first few rows of the processed DataFrame
print("Processed DataFrame:")
processed_customer_retention_df.head()

Processed DataFrame:


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.326221 | 0.293517 | -1.04176 | -1.225848 | -0.911583 | 0.646092 | 0.970243 | 0.021886 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | -0.440036 | 0.198164 | -1.387538 | 0.11735 | -0.911583 | -1.547768 | 0.970243 | 0.216534 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | -1.536794 | 0.293517 | 1.032908 | 1.333053 | 2.527057 | 0.646092 | -1.03067 | 0.240687 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 0.501521 | 0.007457 | -1.387538 | -1.225848 | 0.807737 | -1.547768 | -1.03067 | -0.108918 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 2.063884 | 0.388871 | -1.04176 | 0.785728 | -0.911583 | 0.646092 | 0.970243 | -0.365276 | 0.0 | 1.0 | 0.0 | 0 |


**Question 3:** Split your processed dataset into training and testing sets using an `80:20` split ratio. You can use the **X** and **y** variables to store your split dataset.

**Question 3.1:** Train an ML model using `LogisticRegression` to predict customer churn/retention. 

**Note**: Assign your model to a variable called `customer_retention_model`. Failure to do this might result in you not getting a score for this question.

In [4]:
# Prepare the data for training

# Independent variables (features)
X = processed_customer_retention_df.drop('Exited', axis=1)

# Dependent variable (target)
y = processed_customer_retention_df['Exited']

# split your dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
customer_retention_model = LogisticRegression(random_state=42)
customer_retention_model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = customer_retention_model.predict(X_test)

# Print model evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.8115
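
In the cells above, the scaler and encoder are fitted on all rows before the split, so statistics from the test rows influence the training features. A possible alternative is to split first and put preprocessing and the classifier in one `Pipeline`, so that everything is fitted on the training rows only. The sketch below assumes randomly generated data with a subset of the original column names. `demo_X`, `demo_y`, and `demo_pipeline` are hypothetical names, and the resulting score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption, not the bank dataset).
rng = np.random.default_rng(42)
n_rows = 200
demo_X = pd.DataFrame({
    "CreditScore": rng.integers(350, 851, size=n_rows),
    "Age": rng.integers(18, 80, size=n_rows),
    "Geography": rng.choice(["France", "Spain", "Germany"], size=n_rows),
    "Gender": rng.choice(["Female", "Male"], size=n_rows),
})
demo_y = pd.Series(rng.integers(0, 2, size=n_rows), name="Exited")

# Split the raw features first.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["CreditScore", "Age"]),
        ("cat", OneHotEncoder(drop="first"), ["Geography", "Gender"]),
    ])),
    ("classifier", LogisticRegression(random_state=42)),
])

# fit() learns scaling statistics and categories from the training rows only.
demo_pipeline.fit(X_tr, y_tr)
demo_accuracy = demo_pipeline.score(X_te, y_te)
print(f"Accuracy on synthetic data: {demo_accuracy:.4f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, which then refits the preprocessing inside each cross-validation fold.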


**Question 4:** Predict using the developed model and evaluate the model using accuracy. The evaluation is already completed for you.

**Note**: Assign your prediction to a variable called `prediction`. Failure to do this might result in you not getting a score for this question.

In [5]:

# predict using the model
prediction = customer_retention_model.predict(X_test)

# evaluate the model using accuracy
accuracy = customer_retention_model.score(X_test, y_test)

# Display the accuracy
print(f"Accuracy: {accuracy:.4f}")

prediction

Accuracy: 0.8115


array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

Evaluate the model's performance on the test data using precision and recall:

In [6]:
y_pred = customer_retention_model.predict(X_test)

# Calculate precision and recall
precision = precision_score(y_test, y_pred, zero_division=1)
recall = recall_score(y_test, y_pred, zero_division=1)

# Display precision & recall
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")


Precision: 0.5563
Recall: 0.2010
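
To make the two metrics concrete, here is a small worked example in plain Python. The labels are hypothetical and were chosen so that the numbers resemble the pattern above, with high accuracy but low recall. `precision_recall` is an illustrative helper, not a scikit-learn function, and it returns 0.0 for an empty denominator, unlike the `zero_division=1` setting used in the cell above.

```python
def precision_recall(true_labels, predicted_labels):
    """Compute precision and recall for the positive class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # exits correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed exits
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 5 customers exit, 20 stay.
toy_true = [1] * 5 + [0] * 20
# The model flags only 2 customers: 1 real exit and 1 false alarm.
toy_pred = [1, 0, 0, 0, 0] + [1] + [0] * 19

toy_precision, toy_recall = precision_recall(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(f"Precision: {toy_precision:.2f}")  # 1 / (1 + 1) = 0.50
print(f"Recall: {toy_recall:.2f}")        # 1 / (1 + 4) = 0.20
print(f"Accuracy: {toy_accuracy:.2f}")    # (1 + 19) / 25 = 0.80
```

Accuracy stays at 0.80 here even though four of the five exits are missed, because most correct predictions are for customers who stay.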


Fine-tune the model using grid search to find optimal hyperparameters:

In [7]:
# Define hyperparameters to search
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}
# Create logistic regression model
logistic_regression = LogisticRegression(random_state=42)

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(logistic_regression, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Use the best model for predictions
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test)

# Evaluate the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f"Tuned Model Accuracy: {accuracy_tuned:.4f}")


Best Hyperparameters: {'C': 100, 'penalty': 'l2'}
Tuned Model Accuracy: 0.8110
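
The search above selects `C` by accuracy. If catching more exiting customers matters more, one possible variation is to select by recall and to reweight the classes. The sketch below is not part of the assignment. It assumes synthetic, imbalanced data from `make_classification`, and `imbalanced_X`, `imbalanced_y`, and `recall_search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with roughly 80% negatives and 20% positives (an assumption).
imbalanced_X, imbalanced_y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=42
)
iX_tr, iX_te, iy_tr, iy_te = train_test_split(
    imbalanced_X, imbalanced_y, test_size=0.2, random_state=42, stratify=imbalanced_y
)

recall_search = GridSearchCV(
    # class_weight="balanced" gives the rarer class more weight during fitting.
    LogisticRegression(class_weight="balanced", random_state=42, max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="recall",  # pick the C with the best cross-validated recall
)
recall_search.fit(iX_tr, iy_tr)

demo_recall = recall_score(iy_te, recall_search.predict(iX_te))
print("Best C by recall:", recall_search.best_params_["C"])
print(f"Test recall on synthetic data: {demo_recall:.4f}")
```

Selecting by recall usually costs some precision, so both metrics are worth reporting together.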


Precision of the fine-tuned model:

In [8]:
y_pred_tuned = best_model.predict(X_test)

# Calculate precision for the fine-tuned model
precision_tuned = precision_score(y_test, y_pred_tuned, zero_division=1)

# Display precision for the fine-tuned model
print(f"Tuned Model Precision: {precision_tuned:.4f}")

Tuned Model Precision: 0.5524


<!-- BEGIN QUESTION -->

**Question 5:** What insight can you derive from this data?

Model Performance:

The baseline model has a precision of 0.5563 and the fine-tuned model 0.5524, so when either model predicts that a customer will exit, it is correct about 55% of the time. Put differently, roughly 45% of predicted exits are false positives, which is only a moderate level of precision.

However, recall is low: the baseline model scores 0.2010, meaning it captures only about 20% of the customers who actually exit (recall was not computed for the fine-tuned model). The main weakness is therefore missed churners, i.e., false negatives.

Implications:

Both models are cautious about predicting churn. They flag few customers, and those flags are correct a little over half the time, but about four out of five actual exits go undetected. Nothing in the training setup explicitly targets precision; the grid search optimizes accuracy, so this behavior reflects the precision-recall trade-off at the default decision rule.

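Similar Accuracy: The initial and fine-tuned models have nearly identical accuracy (0.8115 and 0.8110, respectively), so tuning `C` did not improve test performance. Given the low recall, this accuracy mostly reflects correct predictions for customers who stay, so accuracy alone overstates how well the models identify churners.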

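Threshold Impact: `predict` labels a customer as exiting only when the estimated exit probability exceeds the default threshold of 0.5. Lowering that threshold would flag more customers, which can only keep recall the same or raise it, typically at the cost of lower precision.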
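
The effect of the decision threshold can be sketched with `predict_proba`. The example below assumes synthetic, imbalanced data from `make_classification`. `predict_with_threshold`, `synth_model`, and the 0.3 threshold are hypothetical choices for illustration, not values tuned for the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 80% negatives and 20% positives (an assumption).
synth_X, synth_y = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=0
)
sX_tr, sX_te, sy_tr, sy_te = train_test_split(
    synth_X, synth_y, test_size=0.2, random_state=42, stratify=synth_y
)
synth_model = LogisticRegression(random_state=42, max_iter=1000).fit(sX_tr, sy_tr)


def predict_with_threshold(model, features, threshold):
    """Label a row 1 when its predicted probability of class 1 is >= threshold."""
    positive_probability = model.predict_proba(features)[:, 1]
    return (positive_probability >= threshold).astype(int)


recall_by_threshold = {}
for threshold in (0.5, 0.3):
    labels = predict_with_threshold(synth_model, sX_te, threshold)
    precision = precision_score(sy_te, labels, zero_division=0)
    recall_by_threshold[threshold] = recall_score(sy_te, labels, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.4f}, "
          f"recall={recall_by_threshold[threshold]:.4f}")
```

Every row flagged at 0.5 is also flagged at 0.3, so recall cannot drop when the threshold is lowered; precision usually does.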

<!-- END QUESTION -->

