| ![DB_logo.png](attachment:DB_logo.png) | 
|----------------------------------------|

# Deutsche Bank Customer Churn Prediction: End-to-End Analysis and Modeling

## **Introduction**

**Deutsche Bank**, a leading multinational financial services provider, has brought me on board as a data analytics professional to address an urgent business concern: customer churn. As part of the bank’s customer analytics team, I am tasked with analyzing customer behavior trends to help leadership improve retention strategies, enhance client relationships, and reduce account closures.

In this project, I will apply a full suite of data analytics techniques, including exploratory data analysis, feature engineering, model development, and performance evaluation. Using Python, I will build and evaluate several supervised machine learning models, Naive Bayes, Decision Tree, Random Forest, and XGBoost, to predict whether a customer is likely to churn. My ultimate goal is to identify a champion model based on cross-validated performance and deliver actionable insights that help Deutsche Bank proactively manage churn risk and strengthen long-term customer loyalty.

### **Company Background**

**Deutsche Bank** is a major European global banking and financial services institution headquartered in Frankfurt, Germany. Serving millions of individual and corporate clients worldwide, the bank offers services ranging from retail banking to asset management and investment banking. Renowned for its expansive international footprint and innovative financial products, **Deutsche Bank** has a long history of helping clients manage their wealth, investments, and daily banking needs.

In a highly competitive European financial market, customer acquisition and retention are essential for long-term profitability. However, rising churn rates have emerged as a significant operational and financial challenge for the bank. In response, **Deutsche Bank** is investing in advanced data analytics to better understand the drivers of customer attrition and develop targeted strategies to enhance client satisfaction and retention.


### **The Deutsche Bank Scenario**

As part of **Deutsche Bank’s** customer analytics initiative, I have been tasked with addressing a growing concern: the high rate of customer churn. Churn at **Deutsche Bank** reflects customers choosing to close their accounts or discontinue their relationship with the bank. Leadership is particularly concerned because of the substantial financial costs, operational disruptions, and long-term value loss associated with losing customers.

To better understand and mitigate this issue, the bank has compiled customer data capturing financial behavior, demographics, and account activity. My role is to analyze this data, build predictive models, and identify the factors most strongly associated with churn. By proactively identifying at-risk customers, **Deutsche Bank** hopes to implement timely retention strategies, such as personalized offers, account management interventions, and loyalty incentives, to reduce churn rates, improve client satisfaction, and protect long-term revenue streams.

### **Project Scope**

This project focuses on developing a predictive model to help **Deutsche Bank** better understand the factors influencing customer churn. The goal is to predict whether a current customer is likely to close their account based on various customer characteristics, financial behaviors, and account activity indicators.

The project will follow a complete data analytics workflow, starting with exploratory data analysis (EDA) to identify trends, patterns, and anomalies in the dataset. I will then proceed to develop and evaluate multiple supervised machine learning models, including Naive Bayes, Decision Tree, Random Forest, and XGBoost. Each model will be assessed using cross-validation and relevant performance metrics to identify the champion model with the strongest predictive accuracy and reliability.

By integrating these supervised machine learning techniques, this approach ensures a comprehensive understanding of customer churn dynamics. The final model will equip **Deutsche Bank** with actionable, data-driven insights to proactively manage at-risk customers and implement targeted retention strategies.

### **Business Impact**

An effective predictive model will enable **Deutsche Bank** to identify customers at high risk of churn and uncover the key factors contributing to their decision to leave. By understanding these drivers, the bank can take timely, personalized actions such as offering tailored promotions, loyalty incentives, or specialized account services to improve customer satisfaction and reduce attrition.

Reducing customer churn directly impacts the bank’s bottom line by preserving long-term customer value, minimizing acquisition and onboarding costs, and safeguarding revenue streams. Furthermore, proactive retention efforts contribute to stronger client relationships, higher brand loyalty, and a competitive advantage in the European banking market. By integrating predictive analytics into its customer management strategy, **Deutsche Bank** can optimize operational efficiency and build a more resilient, loyal customer base.

## **Overview**

This project investigates the factors influencing customer churn in the European banking industry, using historical customer data from **Deutsche Bank**. The primary objective is to predict whether a current customer is likely to close their account based on various customer characteristics, financial behaviors, and account activity indicators. In doing so, the project aims to uncover actionable insights that can guide customer retention strategies and operational decision-making.

To achieve this, I followed a **structured, modeling-focused data analytics workflow** that moved in sequence from exploratory data analysis through model development and evaluation to insight generation. This structure allowed for a **comparative study of multiple supervised machine learning models**, ensuring that each stage aligned with best practices in predictive modeling and contributed meaningfully to selecting the most effective solution.

The process included:

* **Exploratory Data Analysis (EDA):** I conducted data profiling to examine variable types and missing values, analyzed the target variable distribution, and explored relationships between customer attributes and churn behavior. Visualizations and descriptive statistics were used to detect outliers, identify patterns, and prepare the dataset for modeling.

* **Data Preprocessing:** I managed missing values, encoded categorical variables, scaled numerical features where necessary, and prepared the dataset for reliable model development.

* **Predictive Modeling:** I developed and assessed four supervised machine learning models—**Naive Bayes, Decision Tree, Random Forest, and XGBoost**—to predict customer churn. Each model was evaluated for its predictive capability, interpretability, and practical application.

* **Model Evaluation:** The models were compared using relevant performance metrics such as accuracy, precision, recall, F1-score, and confusion matrices to identify the champion model with the strongest predictive accuracy and business utility.

* **Insight Generation:** I translated model outcomes into actionable business recommendations to inform retention strategies, customer experience initiatives, and operational improvements aimed at reducing customer attrition.

By emphasizing **a modeling-focused approach at every stage** and conducting performance comparisons after each modeling phase, I critically evaluated how predictive performance evolved with the introduction of increasingly advanced techniques. This comparative study offers a comprehensive perspective on how each model’s unique strengths contribute to predictive accuracy and decision support.

This initiative demonstrates the value of predictive analytics in addressing customer churn, equipping **Deutsche Bank** with a reliable framework to anticipate customer attrition risks, implement timely interventions, and strengthen long-term customer loyalty.

### **Dataset Structure**

This dataset, titled **Churn\_Modelling.csv**, contains information on **10,000 customers** of Deutsche Bank. Each row represents a unique customer, capturing their **demographic and financial details**. The target variable, `Exited`, indicates whether a customer has churned (i.e., left the bank), with two possible values: **1** (churned) and **0** (did not churn).

The dataset includes a range of **demographic details**, **financial attributes**, and **account activity indicators**. These features cover **customer age**, **geography**, **account balance**, **tenure with the bank**, and **product usage patterns**, such as credit card ownership and active membership status, all of which may influence a customer’s likelihood of churning.

| Column Name     | Type    | Description                                                          |
| :-------------- | :------ | :------------------------------------------------------------------- |
| RowNumber       | int64   | Unique row index for each observation                                |
| CustomerId      | int64   | Unique customer identifier                                           |
| Surname         | object  | Customer’s last name                                                 |
| CreditScore     | int64   | Customer’s credit score                                              |
| Geography       | object  | Customer’s country of residence (France, Spain, or Germany)          |
| Gender          | object  | Customer’s gender (Male or Female)                                   |
| Age             | int64   | Customer’s age in years                                              |
| Tenure          | int64   | Number of years the customer has been with the bank                  |
| Balance         | float64 | Customer’s bank account balance                                      |
| NumOfProducts   | int64   | Number of bank products the customer uses                            |
| HasCrCard       | int64   | Whether the customer has a credit card (1 = Yes, 0 = No)             |
| IsActiveMember  | int64   | Whether the customer is an active bank member (1 = Yes, 0 = No)      |
| EstimatedSalary | float64 | Customer’s estimated annual salary                                   |
| Exited          | int64   | Target variable indicating if the customer churned (1 = Yes, 0 = No) |

This dataset provides a broad view of each customer's demographic profile, financial position, and account activity, serving as a foundation for analysis and predictive modeling to uncover key drivers of customer churn.

### **Library & Package Versions**

The following versions of the libraries and packages were used for this project:

* **Python Version:** 3.12.4
* **NumPy Version:** 2.2.5
* **Pandas Version:** 2.2.3
* **Matplotlib Version:** 3.10.1
* **Seaborn Version:** 0.13.2
* **Scikit-learn Version:** 1.6.1
* **XGBoost Version:** 3.0.0

Maintaining consistent package versions ensures reproducibility and compatibility throughout the workflow. This allows the analysis, models, and results to be reliably replicated, supporting transparency and accuracy in data processing and predictive modeling.
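
To make the version list above checkable, the following is a minimal sketch that compares installed packages against it using only the standard library. The helper name `check_versions` is my own, not part of any library, and the distribution names (for example `scikit-learn`) are assumed to match how the packages were installed.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by distribution name.
EXPECTED_VERSIONS = {
    "numpy": "2.2.5",
    "pandas": "2.2.3",
    "matplotlib": "3.10.1",
    "seaborn": "0.13.2",
    "scikit-learn": "1.6.1",
    "xgboost": "3.0.0",
}


def check_versions(expected):
    """Return {package: (expected, installed)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


# Print one line per package that differs from the list above.
for package, (wanted, installed) in check_versions(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {installed}")
```

An empty result means the environment matches the list; any printed line points to a package worth pinning before re-running the notebook.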

### **Libraries and Tools Used**

* **pandas** for data manipulation, cleaning, and analysis
* **numpy** for numerical computations and array operations
* **matplotlib** and **seaborn** for visualizing data distributions, trends, and model diagnostics
* **scikit-learn** for feature engineering, model building, evaluation, and tuning, including:

  * **MinMaxScaler** for feature scaling
  * **GaussianNB** for Naive Bayes classification
  * **DecisionTreeClassifier** for decision tree models
  * **RandomForestClassifier** for random forest ensemble modeling
  * **train\_test\_split** and **PredefinedSplit** for managing train-test datasets
  * **GridSearchCV** for hyperparameter optimization
  * **metrics** and individual scoring functions like **accuracy\_score**, **precision\_score**, **recall\_score**, and **f1\_score** for performance evaluation
  * **ConfusionMatrixDisplay** and **RocCurveDisplay** for model performance visualization
* **xgboost** for gradient boosting classification with **XGBClassifier**
* **pickle** for saving and loading trained models for future deployment and reuse

This toolset ensures a complete, scalable workflow—from data ingestion and preprocessing to exploratory analysis, model training, performance evaluation, and model preservation.
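
As a rough illustration of the model preservation step, the sketch below round-trips an object through `pickle`. The dictionary is a hypothetical stand-in for a fitted model, and an in-memory buffer replaces a real file so the example has no side effects.

```python
import io
import pickle

# Hypothetical stand-in for a fitted model; any picklable object works.
fitted_model = {"name": "placeholder_model", "params": {"max_depth": 4}}

buffer = io.BytesIO()  # in-memory file; a real run would use open(path, "wb")
pickle.dump(fitted_model, buffer)

buffer.seek(0)  # rewind before reading
restored_model = pickle.load(buffer)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.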

In [56]:
# Standard operational package imports
import numpy as np
import pandas as pd

# Visualization package imports
import matplotlib.pyplot as plt
import seaborn as sns

# For feature engineering
from sklearn.preprocessing import MinMaxScaler

# For Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# For Decision Tree
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import GridSearchCV

# For Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit

# For XGBoost
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import RocCurveDisplay

# Common sklearn imports (the metric functions are already imported above)
from sklearn.model_selection import train_test_split

# For saving models
import pickle

## Data Exploration, Analysis and Feature Engineering

### **Introduction**

In this section, I will perform essential exploratory data analysis and feature engineering on the dataset. The objective is to prepare the data for the supervised classification modeling stages by transforming and refining the raw dataset into a format suitable for predictive modeling. This process is critical for improving model performance, interpretability, and fairness, ensuring the resulting models are both effective and ethical. The focus will be on selecting relevant features, creating new derived variables, and transforming existing data to enhance its value for classification tasks.

### **Task**

In this section, I will explore, clean, and engineer features within the customer dataset to prepare it for future modeling. The workflow involves importing and inspecting the data, selecting and transforming key variables, and generating a final, optimized dataset for classification models. The target variable is `Exited`, indicating whether a customer churned (1) or remained with the bank (0).

### **Overview**

In this section, I will focus on preparing the Deutsche Bank customer churn dataset for predictive modeling by performing key activities such as **feature selection**, **feature extraction**, and **feature transformation**. These actions will result in a processed dataset that will serve as the foundation for subsequent predictive modeling and performance comparisons. By preparing the data effectively, I aim to enhance the quality and predictive power of the final models.

Topics of focus in this section include:

* **Feature selection**

  * Removing uninformative features
* **Feature extraction**

  * Creating new features from existing features
* **Feature transformation**

  * Modifying existing features to better suit our objectives
  * Encoding of categorical features as dummies

### **Project Background**

The ultimate goal of this project is to develop robust predictions of customer churn, enabling the bank to proactively identify customers who are at risk of leaving. This early phase of feature engineering and data preparation will establish a foundation for building predictive models that can detect churn behavior with high accuracy. By focusing on creating useful features and transforming the data to suit the needs of machine learning models, this stage ensures that the subsequent modeling process is both efficient and effective. The insights gained from these activities will guide the development of more advanced models in the upcoming modeling approach stages, ensuring alignment with business objectives and maximizing the potential for customer retention.

### **Libraries and Packages Used in This Data Preparation Stage**

* **Standard operational packages**:

  * **Pandas** — utilized for data manipulation, cleaning, and analysis, making it easy to load, inspect, and transform tabular datasets.
  * **NumPy** — provides numerical operations support, particularly for working with arrays and performing mathematical computations essential for data preprocessing.

These core libraries ensure efficient and reliable data handling as we prepare the dataset for subsequent modeling stages.

## Target variable

The column called `Exited` is a binary indicator, stored as an integer, of whether or not a customer left the bank (0 = did not leave, 1 = did leave). This will be my target variable. In other words, for each customer, my model should predict whether they should have a 0 or a 1 in the `Exited` column.

This is a supervised learning classification task because I will predict on a binary class. Therefore, this notebook will prepare the data for a classification model.

### Load dataset

`Pandas` is used to read the dataset. To begin, I will read in the data from a `.csv` file and briefly examine it to better understand what it is telling me.

In [57]:
churn_df_original=pd.read_csv(r"C:\Users\saswa\Documents\GitHub\Deutsche-Bank-Customer-Churn-Prediction-End-to-End-Analysis-and-Modeling\Data\Churn_Modelling.csv")
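
The absolute path above only resolves on the machine where the notebook was written. A possible portable alternative is sketched below, assuming the notebook runs from the repository root and the CSV sits in a `Data/` folder, as the path above suggests. The helper `churn_csv_path` is a hypothetical name introduced for illustration.

```python
from pathlib import Path


def churn_csv_path(repo_root="."):
    """Build the CSV path relative to an assumed repository root."""
    return Path(repo_root) / "Data" / "Churn_Modelling.csv"


csv_path = churn_csv_path()
# pd.read_csv(csv_path) would then load the file wherever the repository is cloned.
```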

### Output the first 10 rows

To get a quick look at the dataset, I output the first 10 rows of data.

In [58]:
churn_df_original.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0


### Explore the data

I checked the **data type of each column** in the dataset using the `.dtypes` attribute. This step is important because **the scikit-learn and XGBoost estimators used in this project generally expect predictor variables to be numeric**, so any text columns must be dropped or encoded first.

In [59]:
churn_df_original.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

#### **Observations on Variable Data Types and Implications for Preprocessing**

The dataset contains three types of variables: **integer (int64), floating-point (float64), and categorical (object).**

**Observations on Data Types:**

* **Categorical Variables (object type)**
  The variables `Surname`, `Geography`, and `Gender` are stored as **objects**. These hold text values, and any of them retained as a predictor will need to be **encoded** (e.g., one-hot encoding) before being used in machine learning models.

* **Numerical Variables (int64 & float64)**
  Most numerical features, such as `CreditScore`, `Age`, `Tenure`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, and `Exited` are stored as **int64** (whole numbers).
  `Balance` and `EstimatedSalary` are stored as **float64**, indicating these features contain continuous numerical values with potential decimals.

Recognizing these data type distinctions ensures that appropriate **data preprocessing** techniques—such as encoding for categorical variables and scaling for numerical variables—are applied prior to model development.
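
The split between text and numeric columns described above can also be obtained programmatically. This is a small sketch on a toy frame with illustrative values rather than the full dataset; `toy_df` and the two list names are my own.

```python
import pandas as pd

# Toy frame mirroring a few columns of the churn data (values are illustrative).
toy_df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "CreditScore": [619, 608, 502],
    "Balance": [0.0, 83807.86, 159660.80],
})

# Numeric columns cover both int64 and float64; everything else needs encoding.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
```

Applying the same two calls to `churn_df_original` would list the columns that need encoding or removal before modeling.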

### Check the number of rows and columns in the dataset

In [60]:
churn_df_original.shape

(10000, 14)

#### **Importance of Inspecting Dataset Dimensions for Data Quality and Completeness**

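Checking the number of rows and columns confirms that the file loaded as expected: 10,000 rows (one per customer) and 14 columns, matching the dataset description. The row count also serves as the baseline for judging completeness. If later checks show that only a small number of values are missing relative to this total, the affected rows can often be safely removed. However, if many values are missing, alternative strategies like imputation or further investigation into the data source may be necessary.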

### **Gather basic information about the data**

In [61]:
churn_df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


#### **Initial Data Exploration and Observations**

The **Deutsche Bank customer churn dataset** includes **10,000 entries and 14 columns** covering customer demographics, financials, and activity details.

Key **continuous variables** include `CreditScore`, `Age`, `Balance`, and `EstimatedSalary`, which may directly influence churn and could require scaling before modeling. The **categorical variable** `Geography` will be one-hot encoded for compatibility with machine learning algorithms.

**Binary variables** like `HasCrCard`, `IsActiveMember`, and `Exited` (the target) are already properly formatted.

Finally, **identifier fields** (`RowNumber`, `CustomerId`, `Surname`) and `Gender` — which raises ethical concerns — will be removed during feature selection to simplify the model and avoid bias.

### Gather descriptive statistics about the data

In [62]:
churn_df_original.describe(include='all')

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000,10000.0,10000,10000,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
unique,,,2932,,3,2,,,,,,,,
top,,,Smith,,France,Male,,,,,,,,
freq,,,32,,5014,5457,,,,,,,,
mean,5000.5,15690940.0,,650.5288,,,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,,96.653299,,,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,,350.0,,,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,,584.0,,,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,,652.0,,,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,,718.0,,,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0


#### **Overall Interpretation of Summary Statistics**

The **Deutsche Bank customer churn dataset** contains a combination of numerical, categorical, and identifier fields. The numerical variables display reasonable ranges and distributions. For example, `CreditScore` ranges from **350 to 850** with a mean around **650**, while `Age` spans **18 to 92**, averaging **39 years**.

The `Balance` and `EstimatedSalary` fields show considerable variability, reflecting the financial diversity of the customer base. Binary fields like `HasCrCard`, `IsActiveMember`, and `Exited` appear well-suited for modeling. The close alignment of mean and median values for `CreditScore`, `Tenure`, and `EstimatedSalary` suggests largely symmetrical distributions without strong skew. `Balance` is the exception: at least a quarter of customers hold a zero balance (the 25th percentile is 0), which pulls the mean (about 76,486) well below the median (about 97,199). The categorical variable `Geography` will need encoding before use in machine learning models. Every column reports a count of 10,000, so no values appear to be missing, and the dataset is ready for preprocessing.
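
One rough way to judge symmetry from the summary table is to express the gap between mean and median in units of standard deviation. The sketch below does this with figures copied from the `describe()` output above. The helper `mean_median_gap` is a name I introduce here, and the measure is a quick heuristic rather than a formal skewness test.

```python
# (mean, median, std) copied from the describe() output above.
summary = {
    "CreditScore": (650.5288, 652.0, 96.653299),
    "Tenure": (5.0128, 5.0, 2.892174),
    "Balance": (76485.889288, 97198.54, 62397.405202),
    "NumOfProducts": (1.5302, 1.0, 0.581654),
    "EstimatedSalary": (100090.239881, 100193.915, 57510.492818),
}


def mean_median_gap(mean, median, std):
    """Gap between mean and median, in units of standard deviation."""
    return (mean - median) / std


gaps = {name: mean_median_gap(*stats) for name, stats in summary.items()}
for name, gap in gaps.items():
    print(f"{name:>15}: {gap:+.3f}")
```

Values near zero point to a roughly symmetric distribution. `Balance` and `NumOfProducts` stand out with clearly larger gaps than the other three fields.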

### Check for missing values

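A practical requirement for models such as Naive Bayes is that the dataset must not contain missing values: scikit-learn's `GaussianNB` rejects inputs containing NaN. Native handling of missing values in scikit-learn's decision trees and random forests was added only in recent releases and is not available in older ones, so complete data remains the safest common baseline for all of the models compared here. XGBoost, on the other hand, can handle missing values internally by learning a default split direction for observations with missing features.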

Before proceeding with feature engineering and model building, it's essential to check for any missing values in the Deutsche Bank customer churn dataset. Identifying and addressing any missing data is a key step in ensuring the dataset is clean and ready for training the models.

In [63]:
churn_df_original.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

There are no missing values in the data.

### Check duplicates

Then, I check for any duplicate entries in the data.

In [64]:
churn_df_original.duplicated().sum()

np.int64(0)

There are no duplicate rows in the data.
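
Because both checks come back clean on this dataset, it is hard to see what a problem would look like. The sketch below runs the same two checks on a toy frame that deliberately contains one missing value and one repeated row. All values are illustrative, and dropping rows is only one of several possible remedies.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Balance and one repeated row (illustrative values).
toy_df = pd.DataFrame({
    "CustomerId": [1, 2, 2, 3],
    "Balance": [0.0, 83807.86, 83807.86, np.nan],
})

missing_per_column = toy_df.isnull().sum()  # count of NaN values per column
duplicate_rows = toy_df.duplicated().sum()  # rows identical to an earlier row

# One simple remedy: drop exact repeats, then drop incomplete rows.
cleaned_df = toy_df.drop_duplicates().dropna()
```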

When modeling, a best practice is to perform a rigorous examination of the data before beginning feature engineering and feature selection. Not only does this process help me understand my data, what it is telling me, and what it is *not* telling me, but it also can give me clues that help me create new features.

### Feature Selection

Feature selection is the process of identifying which variables to include in a predictive model. In practice, this happens at multiple stages of a data analytics project. While a data professional may sometimes receive a prepared dataset with a defined target, more often they start with a question or business problem. From there, they need to:

* Assess what data is available
* Define an appropriate target variable
* Decide on a modeling approach
* Assemble a set of features likely to be useful for predicting that target

This initial selection happens during the **planning stage**.

Feature selection is a continuous, layered process that happens both before and during model development. It begins during exploratory data analysis (EDA), where variables are assessed for appropriateness — for instance, identifying features with excessive null values or those unlikely to hold predictive value. Decisions like dropping irrelevant or problematic variables at this stage are also part of feature selection. 
Later, during modeling, it's common to build preliminary models, review feature importances or coefficients, and iteratively remove low-value predictors. This informal, model-based selection complements more structured techniques like Forward or Backward Elimination. Together, these steps help streamline models, improve performance, and focus on the most meaningful predictors.

Returning to the bank data, I notice that the first column is called `RowNumber`, and it just enumerates the rows. I should drop this feature, because row number shouldn't have any correlation with whether or not a customer churned.

The same is true for `CustomerId`, which appears to be a number assigned to the customer for administrative purposes, and `Surname`, which is the customer's last name. Since these cannot be expected to have any influence over the target variable, I can remove them from the modeling dataset.

Finally, for ethical reasons, I should remove the `Gender` column. The reason for doing this is that I don't want my model making predictions (and therefore, offering promotions or financial incentives) based on a person's gender.


In [65]:
churn_df = churn_df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], 
                            axis=1)

In [66]:
churn_df.head()

Unnamed: 0,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,41,1,83807.86,1,0,1,112542.58,0
2,502,France,42,8,159660.8,3,1,0,113931.57,1
3,699,France,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,43,2,125510.82,1,1,1,79084.1,0



### **Feature Extraction**

Beyond using existing variables, it’s often valuable to create new features derived from the data. Well-designed derived features can capture patterns or relationships not explicitly represented in the raw data, sometimes becoming some of the model’s most predictive inputs.

Effective feature extraction typically relies on a mix of domain knowledge and understanding of the dataset. For example, if a data professional knew that a system glitch caused many credit card transactions to fail in October, it might be reasonable to suspect those affected customers are more likely to churn. If the dataset contained monthly credit card transaction counts, one could engineer a feature like `OctUseRatio`:

$\text{OctUseRatio} = \frac{\text{Number of October transactions}}{\text{Average monthly transactions}}$

This new feature could help flag customers likely affected by the glitch and improve model performance. In this way, domain-aware feature creation can be a powerful tool for enriching a predictive model.
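
The `OctUseRatio` idea can be sketched for a single customer as follows. The monthly counts are hypothetical (the churn dataset contains no transaction data), and I assume the average is taken over all months supplied, October included. The function name `oct_use_ratio` is illustrative.

```python
# Hypothetical monthly credit card transaction counts for one customer.
monthly_transactions = {
    "Jul": 20, "Aug": 22, "Sep": 18, "Oct": 5, "Nov": 19, "Dec": 24,
}


def oct_use_ratio(counts):
    """October transactions divided by the average monthly transactions."""
    average = sum(counts.values()) / len(counts)
    if average == 0:
        return 0.0  # assumption: customers with no activity get a ratio of 0
    return counts["Oct"] / average


ratio = oct_use_ratio(monthly_transactions)
```

A ratio well below 1, as in this example, would mark a customer whose October activity dropped sharply relative to their usual level.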

I’ll create a `Loyalty` feature that represents the proportion of each customer’s life during which they have been a customer, computed by dividing `Tenure` by `Age`:

$$
\text{Loyalty} = \frac{\text{Tenure}}{\text{Age}}
$$

The intuition here is that people who have been customers for a greater proportion of their lives might be less likely to churn.


In [67]:
churn_df['Loyalty'] = churn_df['Tenure'] / churn_df['Age']

In [68]:
churn_df.head()

Unnamed: 0,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Loyalty
0,619,France,42,2,0.0,1,1,1,101348.88,1,0.047619
1,608,Spain,41,1,83807.86,1,0,1,112542.58,0,0.02439
2,502,France,42,8,159660.8,3,1,0,113931.57,1,0.190476
3,699,France,39,1,0.0,2,0,0,93826.63,0,0.025641
4,850,Spain,43,2,125510.82,1,1,1,79084.1,0,0.046512


The new variable appears as the last column in the updated dataframe.
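
As a quick sanity check, the displayed `Loyalty` values can be recomputed by hand from the `Tenure` and `Age` columns of the same five rows. The sketch below does so in plain Python; the variable names are my own. Since the minimum `Age` in the summary statistics is 18, the division never encounters a zero denominator.

```python
# (Tenure, Age, Loyalty) triples read from the five rows displayed above.
displayed_rows = [
    (2, 42, 0.047619),
    (1, 41, 0.02439),
    (8, 42, 0.190476),
    (1, 39, 0.025641),
    (2, 43, 0.046512),
]

# Recompute Tenure / Age and round to the six decimals shown in the output.
recomputed = [round(tenure / age, 6) for tenure, age, _ in displayed_rows]

all_match = all(
    abs(value - shown) < 1e-6
    for value, (_, _, shown) in zip(recomputed, displayed_rows)
)
```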

### Feature transformation

The next step is to transform the features to get them ready for modeling. Different models have different requirements for how the data should be prepared, and different assumptions about distributions, independence, and so on.
The models I will be building with this data are all classification models, and the implementations used here generally need categorical variables to be encoded numerically. I will check how many categories appear in the data for the `Geography` feature.

In [69]:
churn_df['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

There are three unique values: France, Spain, and Germany. To encode this data for machine learning, I will use the `pd.get_dummies()` function, which will replace the `Geography` column with three new Boolean columns—one for each category.

When I specify `drop_first=True`, only two new columns are created instead of three: `Geography_Germany` and `Geography_Spain`. The first category in alphabetical order (France) is dropped because it can be inferred: if both remaining columns are 0, the customer is from France. Dropping one category removes this redundancy and prevents perfect multicollinearity among the dummy columns. Multicollinearity distorts coefficients in linear models and makes them hard to interpret. The tree-based models used later are far less sensitive to it, but the smaller encoding loses no information.
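
To make the baseline category concrete, here is a minimal sketch on a four-row toy frame (not the real data) showing which columns `drop_first=True` produces and how France can be recovered from them. The names `toy_df`, `encoded`, and `is_france` are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

# Categories are sorted alphabetically, so France is the dropped first level.
encoded = pd.get_dummies(toy_df, drop_first=True)

# A row with no dummy set belongs to the dropped baseline category (France).
is_france = ~encoded.any(axis=1)
```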

In [70]:
churn_df = pd.get_dummies(churn_df, drop_first=True)

In [71]:
churn_df.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Loyalty,Geography_Germany,Geography_Spain
0,619,42,2,0.0,1,1,1,101348.88,1,0.047619,False,False
1,608,41,1,83807.86,1,0,1,112542.58,0,0.02439,False,True
2,502,42,8,159660.8,3,1,0,113931.57,1,0.190476,False,False
3,699,39,1,0.0,2,0,0,93826.63,0,0.025641,False,False
4,850,43,2,125510.82,1,1,1,79084.1,0,0.046512,False,True


I can now use my new dataset to build a model.

#### **Key Takeaways**

The dataset is now well-prepared for modeling, with no missing values or duplicates and carefully selected relevant features. Ethical considerations prompted the removal of `Gender`, while feature engineering introduced a valuable `Loyalty` feature, capturing the relationship between customer tenure and age. The `Geography` variable was efficiently encoded to avoid multicollinearity, ensuring compatibility with machine learning models. Irrelevant identifiers were also dropped, leaving a streamlined set of predictors ready for analysis.

#### **Key Findings to Share with the Data Team**

The dataset has undergone thorough preprocessing and initial exploration, with key adjustments including:

* Removal of non-predictive and ethically sensitive fields such as `RowNumber`, `CustomerId`, `Surname`, and `Gender`.
* Encoding of the `Geography` variable to prevent multicollinearity and facilitate seamless model integration.
* Creation of a derived `Loyalty` feature from `Tenure` and `Age`, which may add predictive strength in identifying churn-prone customers.

Additionally, the target variable (`Exited`) exhibits class imbalance: its mean is 0.2037, so roughly 20% of customers churned and 80% stayed. Techniques like resampling or class weighting may therefore be necessary to support balanced and reliable model performance.
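
To give a sense of scale for class weighting, the sketch below derives candidate weights from the churn rate reported in the summary statistics. The variable names are my own, and the weights are starting points that would still need tuning. They are not settings taken from the models built later.

```python
n_customers = 10_000
churn_rate = 0.2037  # mean of Exited from the describe() output

n_churned = round(n_customers * churn_rate)  # 2037
n_retained = n_customers - n_churned         # 7963

# Ratio of negatives to positives, a common starting value for
# XGBoost's scale_pos_weight parameter.
neg_to_pos_ratio = n_retained / n_churned

# "Balanced" weights: n_samples / (n_classes * class_count),
# the formula behind scikit-learn's class_weight="balanced".
balanced_weights = {
    0: n_customers / (2 * n_retained),
    1: n_customers / (2 * n_churned),
}
```

With these figures, each churned customer would count roughly four times as much as a retained one, which offsets the 80/20 split during training.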

#### **Recommendations for Stakeholders**

* The dataset has been cleaned and transformed, and is now ready for modeling. I will begin with the **Naive Bayes** model as the first approach, and insights will be communicated progressively as the analysis unfolds.

* **Leverage `Loyalty` Feature**: The engineered `Loyalty` feature, representing the tenure-to-age ratio, will be incorporated into the modeling stages. Customers with lower loyalty scores may represent a higher churn risk, and this insight can help shape targeted retention initiatives.

#### **Conclusion**

The dataset has been effectively prepared for predictive modeling through systematic feature selection, ethical filtering, and thoughtful feature engineering. Key actions included the removal of irrelevant and sensitive identifiers, creation of a meaningful `Loyalty` feature, and encoding of categorical variables for model readiness. With the dataset now optimized, the next stage will involve applying classification models (**Naive Bayes**, Decision Tree, Random Forest, and XGBoost) while carefully evaluating their performance to deliver actionable, fair, and unbiased insights for reducing customer churn.