This notebook builds, tunes, and evaluates three scikit-learn classifiers (Random Forest, Logistic Regression, and Gradient Boosting) for predicting bank customer churn.

Dataset: https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Step 1: Data Exploration
- Loads the dataset from the file `Churn_Modelling.csv`
- Displays the first few rows of the dataset
- Provides information about the dataset (e.g., column data types, non-null counts)
- Describes the statistical summary of the numerical features in the dataset

In [2]:
# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/Neuronexus Innovations - Machine Learning/Customer Churn Prediction/Churn_Modelling.csv')

# Exploring the data
print('\n', data.head())
data.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()
print('\n', data.describe())


    RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0       2       0.00              1          1               1   
1       1   83807.86              1          0               1   
2       8  159660.80              3          1               0   
3       1       0.00              2          0               0   
4       2  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       0  
4         7

### Step 2: Feature Engineering
- Identifies important features that could potentially influence customer churn
- No specific feature engineering steps are added in this code

In [3]:
# Step 2: Feature Engineering (if required)
# Identify important features that could potentially influence customer churn
# Add feature engineering steps if necessary based on the dataset and problem at hand
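As a hedged illustration of what such a step could look like, the sketch below derives two extra columns from `Balance` and `EstimatedSalary`. The function name `add_ratio_features` and the new column names `BalanceSalaryRatio` and `HasZeroBalance` are hypothetical and are not part of the original notebook; the three sample rows are copied from the `data.head()` output above.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative (hypothetical) features added."""
    out = df.copy()
    # Balance relative to salary; assumes EstimatedSalary is never zero
    out['BalanceSalaryRatio'] = out['Balance'] / out['EstimatedSalary']
    # Binary flag for customers who hold no balance at all
    out['HasZeroBalance'] = (out['Balance'] == 0).astype(int)
    return out


# First three rows of the dataset, as printed by data.head()
sample = pd.DataFrame({
    'Balance': [0.00, 83807.86, 159660.80],
    'EstimatedSalary': [101348.88, 112542.58, 113931.57],
})
engineered = add_ratio_features(sample)
print(engineered)
```

Any column added this way would also have to be listed in the `ColumnTransformer` of Step 4, because columns that are not named there are dropped by default.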

### Step 3: Data Preprocessing
- Drops the identifier columns `RowNumber`, `CustomerId`, and `Surname`, then separates the features (X) from the target variable (y, the `Exited` column)
- Splits the data into training and testing sets using an 80/20 split ratio

In [4]:
X = data.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])  # Features
y = data['Exited']  # Target variable

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
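The split above is purely random. Because churners are typically the minority class, one optional variation is to pass `stratify=y` so that both subsets keep the same churn rate. The sketch below shows the effect on a small synthetic frame; the `toy` data and its 20% positive rate are assumptions made for illustration, not properties of the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 20 positives out of 100 rows (assumed ratio)
toy = pd.DataFrame({'feature': range(100), 'Exited': [1] * 20 + [0] * 80})
X_toy = toy.drop(columns=['Exited'])
y_toy = toy['Exited']

# stratify=y_toy keeps the class proportions identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("Train churn rate:", y_tr.mean())
print("Test churn rate:", y_te.mean())
```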

### Step 4: Model Building
- Defines three pipelines for three different classifiers: Random Forest, Logistic Regression, and Gradient Boosting
- Each pipeline includes a preprocessing step using `ColumnTransformer` to scale numerical features and one-hot encode categorical features
- The classifiers used are RandomForestClassifier, LogisticRegression, and GradientBoostingClassifier

In [5]:
# Using Random Forest, Logistic Regression, and Gradient Boosting classifiers
model_rf = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']),
        ('cat', OneHotEncoder(), ['Geography', 'Gender'])
    ])),
    ('classifier', RandomForestClassifier())
])

model_lr = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']),
        ('cat', OneHotEncoder(), ['Geography', 'Gender'])
    ])),
    ('classifier', LogisticRegression())
])

model_gb = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']),
        ('cat', OneHotEncoder(), ['Geography', 'Gender'])
    ])),
    ('classifier', GradientBoostingClassifier())
])
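The three pipelines repeat the same preprocessing definition. A possible refactoring, sketched below, builds them from one helper so the column lists live in a single place. The helper name `make_churn_pipeline`, the `random_state=42` seeds, `max_iter=1000`, and `handle_unknown='ignore'` are assumptions added for illustration; the original cell uses default constructor arguments, so results from this variant may differ slightly from the outputs recorded in this notebook.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_FEATURES = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts',
                    'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
CATEGORICAL_FEATURES = ['Geography', 'Gender']


def make_churn_pipeline(classifier):
    """Wrap a classifier in a pipeline with its own fresh preprocessor."""
    preprocessor = ColumnTransformer(transformers=[
        ('num', StandardScaler(), NUMERIC_FEATURES),
        # handle_unknown='ignore' avoids errors on categories unseen during fitting
        ('cat', OneHotEncoder(handle_unknown='ignore'), CATEGORICAL_FEATURES),
    ])
    return Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])


models = {
    'rf': make_churn_pipeline(RandomForestClassifier(random_state=42)),
    'lr': make_churn_pipeline(LogisticRegression(max_iter=1000)),
    'gb': make_churn_pipeline(GradientBoostingClassifier(random_state=42)),
}
print(list(models))
```

Keeping the step names `preprocessor` and `classifier` means parameter grids written as `classifier__...` continue to work unchanged.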

### Step 5: Hyperparameter Tuning
- Tunes each classifier with `GridSearchCV` (5-fold cross-validation), selecting the hyperparameters that give the best cross-validated accuracy
- Prints the best hyperparameters for each classifier
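Accuracy can look good on imbalanced data even when few churners are found, so it may be worth tracking other metrics during the search. The sketch below is one way to do that with `GridSearchCV`'s multi-metric `scoring` and `refit` options; it runs on synthetic data from `make_classification`, and the choice of `refit='f1'` is an assumption for illustration rather than the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 80% negatives, 20% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
grid = {'classifier__C': [0.01, 0.1, 1, 10]}

# Track several metrics per candidate; refit='f1' picks the final model by F1
search = GridSearchCV(
    pipe, grid, cv=5,
    scoring={'accuracy': 'accuracy', 'f1': 'f1', 'roc_auc': 'roc_auc'},
    refit='f1',
)
search.fit(X_demo, y_demo)

print("Best parameters by F1:", search.best_params_)
print("Mean CV accuracy per candidate:", search.cv_results_['mean_test_accuracy'])
```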

### Step 6: Model Evaluation
- Evaluates each tuned classifier on the held-out test set using Accuracy, Precision, Recall, F1 Score, and ROC AUC (here ROC AUC is computed from predicted class labels, not predicted probabilities)
- Prints the evaluation results for each classifier

In [6]:
# Step 5: Hyperparameter Tuning for Random Forest Classifier
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [10, 20, 30, None]
}
grid_search_rf = GridSearchCV(model_rf, param_grid_rf, cv=5, scoring='accuracy').fit(X_train, y_train)
best_params_rf = grid_search_rf.best_params_
print("\nBest Hyperparameters for Random Forest Classifier:", best_params_rf)

# Step 6: Model Evaluation for Random Forest Classifier
best_model_rf = grid_search_rf.best_estimator_
y_pred_rf = best_model_rf.predict(X_test)
metrics_rf = {
    'Accuracy RF': accuracy_score,
    'Precision RF': precision_score,
    'Recall RF': recall_score,
    'F1 Score RF': f1_score,
    'ROC AUC RF': roc_auc_score
}
results_rf = {metric: score(y_test, y_pred_rf) for metric, score in metrics_rf.items()}
print("\n", results_rf)


Best Hyperparameters for Random Forest Classifier: {'classifier__max_depth': 30, 'classifier__n_estimators': 300}

 {'Accuracy RF': 0.8655, 'Precision RF': 0.748, 'Recall RF': 0.4758269720101781, 'F1 Score RF': 0.5816485225505444, 'ROC AUC RF': 0.718311743627989}


In [7]:
# Step 5: Hyperparameter Tuning for Logistic Regression
param_grid_lr = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_lr = GridSearchCV(model_lr, param_grid_lr, cv=5, scoring='accuracy').fit(X_train, y_train)
best_params_lr = grid_search_lr.best_params_
print("\nBest Hyperparameters for Logistic Regression:", best_params_lr)

# Step 6: Model Evaluation for Logistic Regression
best_model_lr = grid_search_lr.best_estimator_
y_pred_lr = best_model_lr.predict(X_test)
metrics_lr = {
    'Accuracy LR': accuracy_score,
    'Precision LR': precision_score,
    'Recall LR': recall_score,
    'F1 Score LR': f1_score,
    'ROC AUC LR': roc_auc_score
}
results_lr = {metric: score(y_test, y_pred_lr) for metric, score in metrics_lr.items()}
print("\n", results_lr)


Best Hyperparameters for Logistic Regression: {'classifier__C': 10}

 {'Accuracy LR': 0.811, 'Precision LR': 0.5524475524475524, 'Recall LR': 0.2010178117048346, 'F1 Score LR': 0.2947761194029851, 'ROC AUC LR': 0.5805960247074267}


In [8]:
# Step 5: Hyperparameter Tuning for Gradient Boosting Classifier
param_grid_gb = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.05, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5]
}
grid_search_gb = GridSearchCV(model_gb, param_grid_gb, cv=5, scoring='accuracy').fit(X_train, y_train)
best_params_gb = grid_search_gb.best_params_
print("\nBest Hyperparameters for Gradient Boosting Classifier:", best_params_gb)

# Step 6: Model Evaluation for Gradient Boosting Classifier
best_model_gb = grid_search_gb.best_estimator_
y_pred_gb = best_model_gb.predict(X_test)
metrics_gb = {
    'Accuracy GB': accuracy_score,
    'Precision GB': precision_score,
    'Recall GB': recall_score,
    'F1 Score GB': f1_score,
    'ROC AUC GB': roc_auc_score
}
results_gb = {metric: score(y_test, y_pred_gb) for metric, score in metrics_gb.items()}
print("\n", results_gb)


Best Hyperparameters for Gradient Boosting Classifier: {'classifier__learning_rate': 0.05, 'classifier__max_depth': 5, 'classifier__n_estimators': 100}

 {'Accuracy GB': 0.867, 'Precision GB': 0.7509881422924901, 'Recall GB': 0.48346055979643765, 'F1 Score GB': 0.588235294117647, 'ROC AUC GB': 0.7221285375211186}
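The ROC AUC values reported above are computed from hard 0/1 predictions. In that case the score reduces to the average of sensitivity and specificity (balanced accuracy), which is usually lower than the AUC obtained from predicted probabilities. The sketch below compares the two on synthetic data; the dataset and the variable names are illustrative assumptions, and no claim is made about what the probability-based AUC would be for the models in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data
X_syn, y_syn = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

clf = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)

hard_labels = clf.predict(X_eval)                 # 0/1 predictions
churn_scores = clf.predict_proba(X_eval)[:, 1]    # probability of the positive class

auc_from_labels = roc_auc_score(y_eval, hard_labels)
auc_from_scores = roc_auc_score(y_eval, churn_scores)
balanced_acc = balanced_accuracy_score(y_eval, hard_labels)

print("ROC AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:         ", balanced_acc)
print("ROC AUC from probabilities:", auc_from_scores)
```

Applied to this notebook, the equivalent change would be to score `best_model_gb.predict_proba(X_test)[:, 1]` with `roc_auc_score` instead of `y_pred_gb`.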


### Step 7: Deployment
- Placeholder comment for future deployment of the best model to make predictions on new customer data

In [9]:
# Step 7: Deployment
# Deploy the best model to make predictions on new customer data
# Include deployment steps as per your specific deployment requirements
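One minimal way to make a fitted pipeline reusable is to persist it with `joblib` and reload it where predictions are needed. The sketch below is an assumption-laden illustration, not the notebook's deployment: it trains a small stand-in pipeline on synthetic rows with a subset of the dataset's columns, and the file name `churn_model.joblib` is hypothetical. In practice the object to save would be the chosen `best_model_*` estimator.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training rows (illustrative only, not the real data)
rng = np.random.default_rng(0)
n = 60
train_df = pd.DataFrame({
    'CreditScore': rng.integers(400, 850, n),
    'Age': rng.integers(18, 80, n),
    'Balance': rng.uniform(0, 200000, n),
    'Geography': rng.choice(['France', 'Spain'], n),
    'Gender': rng.choice(['Female', 'Male'], n),
})
labels = np.arange(n) % 2  # alternating 0/1 so both classes are present

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['CreditScore', 'Age', 'Balance']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Geography', 'Gender']),
    ])),
    ('classifier', LogisticRegression(max_iter=1000)),
]).fit(train_df, labels)

# Persist the whole pipeline so preprocessing travels with the model
model_path = os.path.join(tempfile.mkdtemp(), 'churn_model.joblib')
joblib.dump(pipeline, model_path)

# Later, possibly in another process: reload and score a new customer
loaded = joblib.load(model_path)
new_customer = pd.DataFrame([{
    'CreditScore': 619, 'Age': 42, 'Balance': 0.0,
    'Geography': 'France', 'Gender': 'Female',
}])
prediction = loaded.predict(new_customer)
churn_probability = loaded.predict_proba(new_customer)[:, 1]
print("Predicted class:", prediction[0])
print("Churn probability:", churn_probability[0])
```

Saving the full pipeline rather than only the classifier ensures new data goes through the same scaling and encoding that was fitted on the training set.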

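The code demonstrates a structured approach to building, tuning, and evaluating machine learning models for predicting customer churn. On the held-out test set, Gradient Boosting (accuracy 0.867, F1 0.588) and Random Forest (accuracy 0.8655, F1 0.582) perform similarly, and both clearly outperform Logistic Regression (accuracy 0.811, F1 0.295). Recall stays below 0.5 for all three models, so each one misses more than half of the customers who actually churn.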