# Customer Churn Prediction for Beta Bank

## Introduction
Customer churn, also known as customer attrition, refers to the phenomenon of customers leaving a company. It's a critical metric in many businesses, as it's often less expensive to retain existing customers than to attract new ones. Moreover, reducing customer churn is directly related to increasing customer lifetime value.

In this project, we are working with Beta Bank. The bank has been noticing an increase in customer churn, and its bankers have concluded that retaining existing customers is cheaper than attracting new ones.

We have been provided with a dataset containing data on clients’ past behavior and termination of contracts with the bank. Our task is to predict whether a customer will leave the bank soon. The goal is to create a machine learning model to predict customer churn so that the bank can proactively address the issue and improve customer retention.

The project involves the following steps:

1. **Data Preparation:** Download and prepare the data, explaining each step of the procedure.
2. **Examine the balance of classes:** Check if our target variable 'Exited' is balanced or imbalanced. A class imbalance could affect the performance of our model.
3. **Train the model without taking into account the imbalance:** We will first train our model without considering the possible class imbalance. This will serve as a baseline for comparison.
4. **Improve the quality of the model:** If there is a class imbalance, we will use at least two approaches to fix it and improve the quality of our model. We will use the training set to pick the best parameters.
5. **Perform the final testing:** Finally, we will evaluate the performance of our model on the test set.

## Data Preparation
In this step, we will download and prepare the data for analysis. The data preparation process involves the following steps:

1. **Downloading the data:** We will download the data from the provided URL.
2. **Loading the data:** We will load the data into a pandas DataFrame, a 2-dimensional labeled data structure whose columns can hold different types.
3. **Inspecting the data:** We will inspect the data to understand its structure and the types of data it contains. This includes checking the number of rows and columns, the types of variables, and the number of missing values.
4. **Handling missing values:** If there are any missing values in the data, we will decide how to handle them. This could involve removing rows or columns with missing values, or filling in the missing values with a specific value.
5. **Encoding categorical variables:** If there are any categorical variables in the data, we will encode them. Machine learning models require input to be in numerical format, so we need to convert categorical variables into a suitable numerical format.

In [23]:
import pandas as pd

data = pd.read_csv('Churn.csv')
data.head()

|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


## Initial Data Analysis
In this step, we check the number of rows and columns, the type of each variable, and the number of missing values. The results determine which preprocessing steps are needed: missing values must be handled, and categorical variables must be encoded.

Let's start by checking the number of rows and columns in the data.

In [24]:
# Checking the number of rows and columns in the data
print(f'The data has {data.shape[0]} rows and {data.shape[1]} columns.')

The data has 10000 rows and 14 columns.


In [25]:
# Checking the data types of each column
data.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [26]:
# Checking for missing values in each column
data.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

## Handling Missing Values
In the data inspection step, we found that the 'Tenure' column has 909 missing values. Missing values can affect the performance of a machine learning model, and it's important to handle them before training the model.

There are several strategies to handle missing values, including removing rows with missing values and imputing missing values. Removing rows with missing values is the simplest strategy, but it can lead to loss of information if many rows have missing values. Imputing missing values involves filling in the missing values with a specific value. The value could be a central tendency measure like the mean or median (for numerical variables) or the mode (for categorical variables). Alternatively, it could be a value estimated by a machine learning model.
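
The two main strategies described above can be compared on a small, made-up frame. This is only a sketch: the 'toy' DataFrame and its values are invented for illustration and are not taken from the bank dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from Churn.csv) with two missing 'Tenure' values
toy = pd.DataFrame({
    'Tenure': [1.0, 2.0, np.nan, 8.0, np.nan, 3.0],
    'Exited': [0, 1, 0, 1, 0, 0],
})

# Strategy 1: drop every row that has a missing value
dropped = toy.dropna()

# Strategy 2: fill missing values with the column median
median_value = toy['Tenure'].median()
imputed = toy.fillna({'Tenure': median_value})

print(len(toy), len(dropped), len(imputed))
print(median_value)
```

Dropping shrinks the frame from six rows to four, while imputation keeps all six rows at the cost of inserting an estimated value.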

In our case, since 'Tenure' is a numerical variable, we will impute the missing values with the median 'Tenure'. The median is a robust measure of central tendency that is far less affected by outliers than the mean, making it a suitable choice for imputation.

We will also drop the Surname, RowNumber, and CustomerId columns as they are not relevant for our analysis. RowNumber and CustomerId are arbitrary identifiers and Surname is simply a name, so none of them should carry information about a customer's churn probability.

Let's proceed with imputing the missing values in the 'Tenure' column.


In [27]:
# Calculating the median 'Tenure'
median_tenure = data['Tenure'].median()

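# Imputing the missing values in the 'Tenure' column with the median 'Tenure'
# (assignment is used instead of inplace=True on a column selection, which may not update 'data' in newer pandas versions)
data['Tenure'] = data['Tenure'].fillna(median_tenure)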

# Drop 'Surname', 'RowNumber', and 'CustomerId' columns
data = data.drop(['Surname', 'RowNumber', 'CustomerId'], axis=1)

# Checking for missing values in each column to verify
data.isnull().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

## Examining the Balance of Classes
In this step, we will check if our target variable 'Exited' is balanced or imbalanced. A class imbalance could affect the performance of our model.

Class imbalance refers to a situation where the classes in the target variable are not represented equally. For example, in a binary classification problem, if 90% of the samples belong to Class A and only 10% belong to Class B, we have a severe class imbalance.

Class imbalance can lead to a misleadingly high accuracy rate. For example, a model that always predicts Class A in the above scenario will be 90% accurate, even though it's not identifying Class B at all. Therefore, it's important to check for class imbalance and take it into account when training the model and evaluating its performance.
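
The 90/10 scenario above can be reproduced in a few lines. The labels below are hypothetical and exist only to illustrate why accuracy is misleading here, while the F1 score exposes the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 90 samples of Class A (0) and 10 of Class B (1)
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class
y_always_a = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_always_a)
# zero_division=0 avoids a warning: with no positive predictions, precision is undefined
f1 = f1_score(y_true, y_always_a, zero_division=0)

print(accuracy, f1)
```

The accuracy comes out at 0.9 while the F1 score for Class B is 0, which is why F1 is the more informative metric for this project.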

Let's check the balance of classes in the 'Exited' column.

In [28]:
# Examine the balance of classes in the 'Exited' column
class_counts = data['Exited'].value_counts()
class_counts

0    7963
1    2037
Name: Exited, dtype: int64

The 'Exited' column, which is our target variable, is imbalanced. There are 7,963 customers (about 79.63%) who have not exited the bank (class 0) and 2,037 customers (about 20.37%) who have exited the bank (class 1).
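
The percentages quoted above follow directly from the counts. As a small sketch, the series below is rebuilt from those two counts (it is not loaded from the dataset), and the name 'imbalance_ratio' is introduced here only for illustration.

```python
import pandas as pd

# Rebuild a target column with the class counts reported above
exited = pd.Series([0] * 7963 + [1] * 2037, name='Exited')

# normalize=True returns class shares instead of raw counts
class_share = exited.value_counts(normalize=True)
print((class_share * 100).round(2))

# Ratio of majority to minority class
imbalance_ratio = (exited == 0).sum() / (exited == 1).sum()
print(round(imbalance_ratio, 2))
```

In the notebook itself, the same shares could be obtained with data['Exited'].value_counts(normalize=True).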

This imbalance in the classes can lead to a bias in the model towards predicting the majority class. Therefore, we need to take this into account when training our model.

But before we handle the class imbalance, let's first train a model without taking into account the imbalance and see how it performs. This will give us a baseline performance that we can compare against after we handle the class imbalance.

## Encoding Categorical Variables
Machine learning models require input to be in numerical format. However, our data contains categorical variables, specifically the 'Geography' and 'Gender' columns. We need to convert these categorical variables into a suitable numerical format.

There are several strategies to encode categorical variables, including label encoding and one-hot encoding. Label encoding involves assigning each unique category in a categorical variable with an integer. One-hot encoding involves creating a new binary column for each unique category in a categorical variable.
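
The difference between the two encodings can be seen on a tiny example. This sketch uses plain pandas rather than the scikit-learn encoders used in this notebook, and the 'toy' frame is hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with three unordered categories
toy = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany', 'France']})

# Label encoding: each category becomes an integer code (alphabetical order here)
label_encoded = toy['Geography'].astype('category').cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(toy['Geography'], prefix='Geography', dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Label encoding implies an ordering (France < Germany < Spain) that has no real meaning for a nominal variable, which is the usual argument for one-hot encoding in that case.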

We will use one-hot encoding for 'Geography' since it is a nominal variable (i.e., there is no order in the categories). For 'Gender', we will use label encoding since it is a binary variable.

In [29]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label encode 'Gender' column
le = LabelEncoder()
data['Gender'] = le.fit_transform(data['Gender'])

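# One-hot encode 'Geography' (column index 1 after the drops above)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
# fit_transform returns a NumPy array, so the original column names are lost here
data = pd.DataFrame(ct.fit_transform(data))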

# Display the first few rows of the DataFrame
data.head()

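|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 619.0 | 0.0 | 42.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 101348.88 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 | 608.0 | 0.0 | 41.0 | 1.0 | 83807.86 | 1.0 | 0.0 | 1.0 | 112542.58 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 502.0 | 0.0 | 42.0 | 8.0 | 159660.8 | 3.0 | 1.0 | 0.0 | 113931.57 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 699.0 | 0.0 | 39.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 93826.63 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 850.0 | 0.0 | 43.0 | 2.0 | 125510.82 | 1.0 | 1.0 | 1.0 | 79084.1 | 0.0 |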


In [30]:
# Assign appropriate column names after one-hot encoding
data.columns = ['Geography_France', 'Geography_Germany', 'Geography_Spain', 'CreditScore', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']

# Display the first few rows of the DataFrame
data.head()

|  | Geography_France | Geography_Germany | Geography_Spain | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 619.0 | 0.0 | 42.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 101348.88 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 | 608.0 | 0.0 | 41.0 | 1.0 | 83807.86 | 1.0 | 0.0 | 1.0 | 112542.58 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 502.0 | 0.0 | 42.0 | 8.0 | 159660.8 | 3.0 | 1.0 | 0.0 | 113931.57 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 699.0 | 0.0 | 39.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 93826.63 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 850.0 | 0.0 | 43.0 | 2.0 | 125510.82 | 1.0 | 1.0 | 1.0 | 79084.1 | 0.0 |


The 'Geography' and 'Gender' columns have been successfully encoded. The 'Geography' column has been one-hot encoded into three columns (one for each country), and the 'Gender' column has been label encoded (with 'Female' as 0 and 'Male' as 1).


## Training the Model without Taking into Account the Imbalance
In this step, we will train a machine learning model without considering the class imbalance in the 'Exited' column. This will serve as a baseline for comparison when we later address the class imbalance and improve the model.

Next, let's split the data into features (X) and the target variable (y), and then split these into training and test sets. We will use 80% of the data for training and 20% for testing.

In [31]:
from sklearn.model_selection import train_test_split

# Separate the features and target variable
X = data.drop('Exited', axis=1)
y = data['Exited']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and test sets
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

X_train shape: (8000, 12)
y_train shape: (8000,)
X_test shape: (2000, 12)
y_test shape: (2000,)


The data has been successfully split into training and test sets. The training set contains 8,000 samples and the test set contains 2,000 samples.
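
The split above is purely random, so the share of churned customers can differ slightly between the two subsets. One optional variation, not used in this notebook, is a stratified split. The sketch below shows the idea on hypothetical data with the same rough 80/20 class mix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target with an 80/20 class mix
rng = np.random.default_rng(42)
y_toy = np.array([0] * 800 + [1] * 200)
X_toy = rng.normal(size=(1000, 3))

# stratify=y_toy keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Both subsets end up with a 20% positive share, which makes test metrics on the minority class a little less dependent on the luck of the split.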

Next, we will scale the features. Feature scaling brings the independent variables onto a comparable range so that no single feature dominates the others merely because of its units. In our case, we will use StandardScaler from sklearn, which standardizes each column of X individually so that it has mean = 0 and standard deviation = 1.

Note that for cross-validation it's better to apply scaling in each fold separately to avoid potential data leakage. The easiest way to achieve this is using pipelines.
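
As a minimal sketch of that idea, the pipeline below is passed to cross_val_score so that the scaler is refitted inside every fold. The data is synthetic (generated with make_classification) and only stands in for the bank dataset. The names 'cv_pipeline' and 'fold_scores' are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=0
)

# The scaler lives inside the pipeline, so each fold fits it on that fold's training part only
cv_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

fold_scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=5, scoring='f1')
print(fold_scores.mean())
```

Scaling the whole dataset once before cross-validation would instead let statistics from the validation part of each fold influence the training part.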

In [32]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Create a pipeline with a scaler and a logistic regression model
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

## Training a Baseline Model

The pipeline fitted above is a logistic regression model trained without taking into account the imbalance in the classes. It serves as our baseline model. We will evaluate it on the test set using the F1 score and the AUC-ROC metric.

In [33]:
from sklearn.metrics import f1_score, roc_auc_score

# Make predictions on the testing set
y_pred = pipeline.predict(X_test)

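# Calculate the F1 score and AUC-ROC score
# Note: roc_auc_score receives hard class labels here rather than predicted probabilities,
# so the value equals the balanced accuracy at the default 0.5 threshold
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)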

f1, roc_auc

(0.29532710280373836, 0.5809071634753171)

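The logistic regression model, trained without taking into account the class imbalance, has an F1 score of 0.295 and an AUC-ROC score of 0.58 on the test set. The F1 score is quite low, which is not surprising given the imbalance in the classes. The AUC-ROC score is only slightly above the 0.5 expected from random guessing. Note that it was computed from hard class predictions (pipeline.predict) rather than predicted probabilities, so it reflects a single decision threshold rather than the full ROC curve.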
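
AUC-ROC is usually computed from predicted probabilities rather than from 0/1 predictions. The sketch below contrasts the two on synthetic data (not the bank dataset). With hard labels, the result coincides with balanced accuracy, while the probability-based value summarizes all thresholds and is typically different.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one threshold (0.5) is represented
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from predicted probabilities of the positive class: all thresholds are swept
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

balanced_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
print(auc_from_labels, auc_from_scores, balanced_acc)
```

Applied to this notebook, the equivalent call would be roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]). The scores reported in this notebook were not computed that way.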

Next, let's try to improve the quality of the model by handling the class imbalance. We will use two approaches to handle the class imbalance: oversampling the minority class and undersampling the majority class.

## Handling Class Imbalance

We will handle the class imbalance using two approaches:

1. **Oversampling the minority class:** This involves randomly duplicating examples in the minority class to increase its proportion.
2. **Undersampling the majority class:** This involves randomly deleting examples in the majority class to decrease its proportion.

As we're doing cross-validation, we need to apply over/undersampling in each fold separately (using imblearn pipelines) to avoid overly optimistic cross-validation metrics.
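
To make the two approaches concrete, here is a hand-rolled NumPy sketch of random oversampling and undersampling on hypothetical labels. It only illustrates the idea and is not imblearn's implementation. The notebook itself relies on RandomOverSampler and RandomUnderSampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
majority_idx = np.flatnonzero(y_toy == 0)
minority_idx = np.flatnonzero(y_toy == 1)

# Oversampling: draw minority rows with replacement until the classes match
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])

# Undersampling: keep a random subset of majority rows, without replacement
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])

over_counts = np.bincount(y_toy[over_idx])
under_counts = np.bincount(y_toy[under_idx])
print(over_counts, under_counts)
```

Oversampling ends with 80 rows per class and undersampling with 20 per class. In both cases only the training rows should be resampled, never the validation or test rows.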

## Model Training with Oversampling and Undersampling

Let's train Logistic Regression, Decision Tree, and Random Forest models with both oversampling and undersampling. We will use GridSearchCV to find the best hyperparameters for these models. After training, we will evaluate these models.

In [34]:
from imblearn.pipeline import Pipeline as imPipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the models and their hyperparameters
models = [
    {
        'name': 'Logistic Regression',
        'model': LogisticRegression(),
        'params': {}
    },
    {
        'name': 'Decision Tree',
        'model': DecisionTreeClassifier(),
        'params': {
            'model__max_depth': [5, 10, 15, 20],
            'model__min_samples_split': [2, 5, 10]
        }
    },
    {
        'name': 'Random Forest',
        'model': RandomForestClassifier(),
        'params': {
            'model__n_estimators': [100, 200, 300],
            'model__max_depth': [5, 10, 15],
            'model__min_samples_split': [2, 5, 10]
        }
    }
]

# For each model, perform grid search with oversampling and print the best parameters and scores
for model in models:
    pipeline = imPipeline([('oversampler', RandomOverSampler()), ('scaler', StandardScaler()), ('model', model['model'])])
    grid_search = GridSearchCV(pipeline, model['params'], cv=5, scoring='f1')
    grid_search.fit(X_train, y_train)
    print('Oversampling -', model['name'])
    print('Best parameters:', grid_search.best_params_)
    print('Best F1 score:', grid_search.best_score_)
    print()

# For each model, perform grid search with undersampling and print the best parameters and scores
for model in models:
    pipeline = imPipeline([('undersampler', RandomUnderSampler()), ('scaler', StandardScaler()), ('model', model['model'])])
    grid_search = GridSearchCV(pipeline, model['params'], cv=5, scoring='f1')
    grid_search.fit(X_train, y_train)
    print('Undersampling -', model['name'])
    print('Best parameters:', grid_search.best_params_)
    print('Best F1 score:', grid_search.best_score_)
    print()

Oversampling - Logistic Regression
Best parameters: {}
Best F1 score: 0.4925518701318299

Oversampling - Decision Tree
Best parameters: {'model__max_depth': 5, 'model__min_samples_split': 10}
Best F1 score: 0.5562103679638798

Oversampling - Random Forest
Best parameters: {'model__max_depth': 15, 'model__min_samples_split': 10, 'model__n_estimators': 200}
Best F1 score: 0.6182742823761718

Undersampling - Logistic Regression
Best parameters: {}
Best F1 score: 0.49353512153242224

Undersampling - Decision Tree
Best parameters: {'model__max_depth': 5, 'model__min_samples_split': 5}
Best F1 score: 0.5586581072824639

Undersampling - Random Forest
Best parameters: {'model__max_depth': 10, 'model__min_samples_split': 10, 'model__n_estimators': 200}
Best F1 score: 0.5983632607699225
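
The 'model__' prefix in the parameter names above routes each value to the pipeline step called 'model'. The sketch below shows this on synthetic data with a plain scikit-learn Pipeline (no sampler step, to keep it dependency-light), and also shows that the fitted search exposes the winning pipeline through best_estimator_. All variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=0)),
])

# The 'model__' prefix routes each parameter to the pipeline step named 'model'
param_grid = {'model__max_depth': [3, 5], 'model__min_samples_split': [2, 10]}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring='f1')
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
best_pipeline = search.best_estimator_
test_f1 = f1_score(y_te, best_pipeline.predict(X_te))
print(search.best_params_, test_f1)
```

Keeping a reference to each fitted search would avoid retyping the best hyperparameters by hand when moving on to the test set.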



## Model Evaluation

Here are the best F1 scores and parameters for each model with oversampling and undersampling:

### Oversampling

- Logistic Regression: F1 score is approximately 0.49. No hyperparameters were tuned for this model (the parameter grid was empty), so the default settings were used.
- Decision Tree: F1 score is approximately 0.556, achieved with a max depth of 5 and a min samples split of 10.
- Random Forest: F1 score is approximately 0.618, achieved with a max depth of 15, a min samples split of 10, and 200 estimators.

### Undersampling

- Logistic Regression: F1 score is approximately 0.49. No hyperparameters were tuned for this model (the parameter grid was empty), so the default settings were used.
- Decision Tree: F1 score is approximately 0.558, achieved with a max depth of 5 and a min samples split of 5.
- Random Forest: F1 score is approximately 0.598, achieved with a max depth of 10, a min samples split of 10, and 200 estimators.

Next, let's evaluate the performance of these models on the test set.

In [36]:
from sklearn.metrics import f1_score, roc_auc_score

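# Define the models with the best hyperparameters found by the oversampling grid search
# (the same settings are reused for the undersampling runs below)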
best_models = [
    {
        'name': 'Logistic Regression',
        'model': LogisticRegression(),
        'oversampler': RandomOverSampler(),
        'undersampler': RandomUnderSampler()
    },
    {
        'name': 'Decision Tree',
        'model': DecisionTreeClassifier(max_depth=5, min_samples_split=10),
        'oversampler': RandomOverSampler(),
        'undersampler': RandomUnderSampler()
    },
    {
        'name': 'Random Forest',
        'model': RandomForestClassifier(max_depth=15, min_samples_split=10, n_estimators=200),
        'oversampler': RandomOverSampler(),
        'undersampler': RandomUnderSampler()
    }
]

# For each model, perform evaluation with oversampling and print the F1 score and AUC-ROC score
for model in best_models:
    pipeline = imPipeline([('oversampler', model['oversampler']), ('scaler', StandardScaler()), ('model', model['model'])])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print('Oversampling -', model['name'])
    print('F1 score:', f1_score(y_test, y_pred))
    print('AUC-ROC score:', roc_auc_score(y_test, y_pred))
    print()

# For each model, perform evaluation with undersampling and print the F1 score and AUC-ROC score
for model in best_models:
    pipeline = imPipeline([('undersampler', model['undersampler']), ('scaler', StandardScaler()), ('model', model['model'])])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print('Undersampling -', model['name'])
    print('F1 score:', f1_score(y_test, y_pred))
    print('AUC-ROC score:', roc_auc_score(y_test, y_pred))
    print()

Oversampling - Logistic Regression
F1 score: 0.49601417183348095
AUC-ROC score: 0.7143548185340536

Oversampling - Decision Tree
F1 score: 0.5595134665508253
AUC-ROC score: 0.7740127083956799

Oversampling - Random Forest
F1 score: 0.6219974715549936
AUC-ROC score: 0.7656840065172883

Undersampling - Logistic Regression
F1 score: 0.5044247787610618
AUC-ROC score: 0.7219606967608316

Undersampling - Decision Tree
F1 score: 0.5572232645403377
AUC-ROC score: 0.7608744186930272

Undersampling - Random Forest
F1 score: 0.5951219512195123
AUC-ROC score: 0.7862983353680066



## Model Evaluation Results

Here are the F1 scores and AUC-ROC scores for each model with oversampling and undersampling on the test set:

### Oversampling

- Logistic Regression: F1 score is approximately 0.50 and AUC-ROC score is approximately 0.71.
- Decision Tree: F1 score is approximately 0.56 and AUC-ROC score is approximately 0.77.
- Random Forest: F1 score is approximately 0.62 and AUC-ROC score is approximately 0.77.

### Undersampling

- Logistic Regression: F1 score is approximately 0.50 and AUC-ROC score is approximately 0.72.
- Decision Tree: F1 score is approximately 0.56 and AUC-ROC score is approximately 0.76.
- Random Forest: F1 score is approximately 0.60 and AUC-ROC score is approximately 0.79.


The AUC-ROC scores are higher than the F1 scores for all models. The two metrics are not on a common scale: AUC-ROC has a chance level of 0.5 and rewards correct ranking of both classes, while the F1 score considers only precision and recall on the positive (minority) class and is therefore more sensitive to the imbalance of the classes. AUC-ROC is normally computed from predicted probabilities and is then independent of the classification threshold. Here, however, it was computed from hard predictions, so both metrics reflect the default threshold of 0.5.
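
The threshold dependence of the F1 score can be illustrated with a handful of hypothetical labels and predicted probabilities (invented for this sketch, not produced by any model in this notebook). AUC-ROC is computed once from the scores, while F1 is recomputed for each threshold.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.42, 0.6, 0.9, 0.7])

# AUC-ROC is computed once from the scores; no threshold is involved
auc = roc_auc_score(y_true, y_prob)

# F1 needs hard labels, so it changes with the threshold chosen
f1_by_threshold = {
    t: f1_score(y_true, (y_prob >= t).astype(int))
    for t in (0.3, 0.5, 0.8)
}

print(auc)
print(f1_by_threshold)
```

Here the F1 score moves from about 0.55 (threshold 0.3) to about 0.67 (threshold 0.5) and down to 0.50 (threshold 0.8), while the AUC-ROC stays at about 0.86 throughout.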

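From these results, we can see that the Random Forest models perform best overall: with oversampling, Random Forest achieves the highest F1 score (about 0.62), and with undersampling it achieves the highest AUC-ROC score (about 0.79). Compared with the baseline logistic regression (F1 score of 0.295), logistic regression with resampling reaches an F1 score of about 0.50, so handling the class imbalance substantially improves the performance of the model.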

## Conclusion

In this project, we aimed to predict whether a customer will leave the bank soon using the data on clients’ past behavior and termination of contracts with the bank. The goal was to build a model with the maximum possible F1 score, with a target of at least 0.59, and to measure the AUC-ROC metric and compare it with the F1.

We started by preparing the data, which included inspecting the data, handling missing values, and encoding categorical variables. We then examined the balance of classes and found that there was a significant imbalance in the classes. We initially trained the model without taking into account the imbalance, and then improved the quality of the model by handling class imbalance using two approaches: oversampling and undersampling.

We trained Logistic Regression, Decision Tree, and Random Forest models with both oversampling and undersampling, and evaluated these models using both F1 score and AUC-ROC score. The Random Forest model with oversampling achieved the best F1 score, approximately 0.62, with an AUC-ROC score of approximately 0.77. The Random Forest model with undersampling achieved the best AUC-ROC score, approximately 0.79, with an F1 score of approximately 0.60.

### Project Result

The project was successful in achieving its goal. The best model by F1 score (Random Forest with oversampling) reached approximately 0.62 on the test set, exceeding the target F1 score of 0.59. Its AUC-ROC score of approximately 0.77 is well above the 0.5 chance level, indicating that the model separates customers who will leave the bank soon from those who will not considerably better than random guessing, although there is still room for improvement.

### Business Aspects

From a business perspective, this model can be very useful for the bank. By predicting whether a customer will leave the bank soon, the bank can take proactive measures to retain the customer. This could include offering special promotions or discounts, improving customer service, or addressing any issues or concerns that the customer may have. Since it is often cheaper to retain existing customers than to attract new ones, this model can help the bank to reduce costs and increase customer satisfaction.