# 📘 Customer Churn Prediction
A machine learning project to predict customer exit status from features such as credit score, age, tenure, account balance, and product count.



## 🧠 Objective
The task is to build a predictive model that determines whether a customer of a financial institution will exit or stay (exit_status), based on demographic and transactional features.



## 1. 📦 Importing Libraries
We begin by importing essential libraries for data manipulation, visualization, and modeling.

In [None]:
!pip install scikit-learn==0.24.2
!pip install imbalanced-learn==0.8.0
!pip install -U xgboost

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler,OneHotEncoder,MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split,RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import precision_score,f1_score,recall_score
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from scipy.stats import randint, uniform
import warnings
warnings.filterwarnings('ignore')

## 2. 🗂️ Loading Data
We load the dataset from a single CSV file; it is split into training, validation, and test sets later in the notebook.

In [None]:
data = pd.read_csv('/kaggle/input/customer-exit-status-data/customer_exit_status_data.csv')

## 3. 📊 Exploratory Data Analysis (EDA)
In this section, we explore the dataset to understand its structure, uncover patterns, and identify any issues such as missing values, duplicates, or outliers.

### 📐 Shape and Structure of Data

Data Shape: (90000 rows × 14 columns)

This gives us an idea of dataset size and dimensionality.

In [None]:
#Shape of data

data_shape = data.shape
print(f'Shape of data : {data_shape}')

### 🧾  Column Names and Data Types

We inspect the column names and check data types to identify categorical vs. numerical features. This also helps in planning preprocessing and encoding steps.

* Numerical columns: id, customer_id, credit_score, age, tenure, acc_balance, prod_count, has_card, is_active, estimated_salary, exit_status

* Categorical columns : last_name, country, gender

In [None]:
#Information on data and column datatypes

# info() prints its summary and returns None, so there is nothing to assign
data.info()

###  📊   Descriptive Statistics

Using `.describe()`, we look at the count, mean, standard deviation, minimum, maximum, and percentiles of numerical features such as `id`, `credit_score`, and `age`. The median appears in the 50% row.

In [None]:
# Descriptive Statistics

data.describe()

### ❗ Missing Values

We check for missing values using `.isnull().sum()`. Handling these is essential before modeling.


In [None]:
#Missing values

missing_values = data.isnull().sum()
print(f'Number of missing values corresponding to each column :\n {missing_values}')

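Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both, using a small made-up DataFrame (`toy`) rather than the real dataset; the column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (values are made up).
toy = pd.DataFrame({
    "credit_score": [650, np.nan, 720, 580],
    "country": ["France", "Spain", None, "Germany"],
    "age": [34, 41, 29, 52],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
})
print(missing_summary)
```

The same two expressions can be applied to `data` directly.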
### 🔁 Duplicate Records

We count duplicate rows **excluding the `id` column**, which is unique by design.

- Number of duplicated rows found: **0**

Duplicates can distort model training, but since none were found, no rows need to be dropped.


In [None]:
#Duplicates

number_of_duplicates = data.duplicated(subset=[col for col in data.columns if col != 'id']).sum()
print(f'Number of Duplicated Rows : {number_of_duplicates}')

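If duplicates had been present, they could be removed with `drop_duplicates` using the same column subset. This is a minimal sketch on a made-up frame (`toy`), not a step the notebook performs.

```python
import pandas as pd

# Toy frame: rows with id 1 and 2 are identical apart from the id column.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "credit_score": [650, 650, 700],
    "age": [30, 30, 45],
})

# Compare rows on every column except the identifier.
feature_cols = [col for col in toy.columns if col != "id"]
n_dupes = toy.duplicated(subset=feature_cols).sum()

# Keep the first occurrence of each duplicated row.
deduped = toy.drop_duplicates(subset=feature_cols, keep="first").reset_index(drop=True)
print(f"Duplicates found: {n_dupes}, rows kept: {len(deduped)}")
```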
### 📊 Boxplot Analysis and Insights
Boxplots help visualize the distribution of numeric variables and detect outliers. Here's a summary of insights from the boxplots:

✅ Columns with Potential Outliers:

1. `credit_score`:

* Outliers present on the lower end (below ~500).

* Most scores lie between 600–750.


2. `age`

* Significant right-skew with many outliers beyond ~60 years.

* Median age is in the early 30s.


3. `prod_count`

* Outliers present above 3.5.

* Most users have 1–3 products.


🟡 Columns with No Major Outliers (Fairly Balanced Distributions):

1. `acc_balance`

* Large spread but no extreme outliers.

* Distribution is positively skewed.


2. `estimated_salary`

* Broad range, no major outliers.

* Almost symmetric distribution.

3. `tenure`

* Fairly uniform distribution from 0–10.

* No significant outliers.

In [None]:
#Boxplot

numeric_cols = data.select_dtypes(include=np.number).columns
plt.figure(figsize=(15,10))

for i,col in enumerate(numeric_cols,1):
    plt.subplot(len(numeric_cols) // 3 + 1, 3, i)
    sns.boxplot(x=data[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()    

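The visual reading of the boxplots can be backed by a count of points outside the whiskers. The sketch below applies the usual 1.5 × IQR rule per column on made-up values; the helper name `iqr_outlier_count` is hypothetical.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy data: one extreme age, evenly spread tenure.
toy = pd.DataFrame({
    "age": [30, 32, 35, 31, 33, 34, 90],
    "tenure": [1, 2, 3, 4, 5, 6, 7],
})

counts = {col: iqr_outlier_count(toy[col]) for col in toy.columns}
print(counts)
```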
### 🔍 Correlation Heatmap Analysis
The correlation heatmap helps identify relationships between numerical features. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). Here's an interpretation of the key findings:

✅ Key Correlation Insights:
1. `age` vs `exit_status` : 0.34

* Moderate positive correlation.

* Older customers are more likely to exit.

* Strongest correlation with the target → important predictive feature.

2. `acc_balance` vs `prod_count` : -0.36

* Negative correlation: customers with higher balance tend to have fewer products.

* Might suggest passive users keeping money without using products.

3. `prod_count` vs `exit_status` : -0.21

* Weak negative correlation.

* Customers with more products are less likely to exit, which is expected.

4. `acc_balance` vs `exit_status` : 0.13

* Slight positive correlation.

* Users with higher account balances may also exit, possibly due to dissatisfaction despite high holdings.

Other features like credit_score, tenure, has_card, is_active, and estimated_salary have very low correlations with exit_status (close to 0), suggesting:

* They are weak individual predictors.

* However, they may still contribute value in multivariate models (e.g., tree-based models).



In [None]:
# Correlation HeatMap

numeric_cols = data.select_dtypes(include=np.number)
corr_matrix = numeric_cols.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix,cmap='Reds',annot=True)
plt.show()

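To read the target column of the heatmap more directly, the correlations with `exit_status` can be pulled out and ranked by absolute value. This sketch uses a tiny made-up frame, so the numbers it prints are not the ones reported above.

```python
import pandas as pd

# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "prod_count": [3, 2, 2, 1, 1, 1],
    "exit_status": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the target.
corr_with_target = toy.corr()["exit_status"].drop("exit_status")

# Rank features by strength of correlation, regardless of sign.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```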
### 📊 Target Variable Distribution: exit_status
- Label 0: Represents customers who stayed.

- Label 1: Represents customers who exited.

From the bar plot:

* About 70,000 customers stayed.

* Only 20,000 customers exited.

⚠️ Class Imbalance Detected

* The dataset is imbalanced, with roughly 3.5 times as many examples of class 0 as class 1 (about 78% vs. 22%).

* This imbalance can cause most models to be biased toward predicting the majority class (0), reducing performance for predicting churners (1).

In [None]:
# Visualizing the distribution of the target variable (e.g., 'exit_status')

sns.countplot(x='exit_status', data=data, palette='Set1')
plt.title('Distribution of Target Variable')
plt.xlabel('Exit Status')
plt.ylabel('Count')
plt.show()

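The imbalance can also be quantified rather than read off the plot. The sketch below rebuilds a label series from the approximate counts quoted above (70,000 and 20,000, both rounded), so the resulting figures are approximations too.

```python
import pandas as pd

# Hypothetical labels built from the rounded counts read off the bar plot.
y_approx = pd.Series([0] * 70_000 + [1] * 20_000, name="exit_status")

# Share of each class and the majority-to-minority ratio.
proportions = y_approx.value_counts(normalize=True)
imbalance_ratio = (y_approx == 0).sum() / (y_approx == 1).sum()

print(proportions.round(3))
print(f"Majority/minority ratio: {imbalance_ratio:.1f}")
```

On the real data, `data['exit_status'].value_counts(normalize=True)` gives the exact shares.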

## 3. 🧹 Data Preprocessing

### ✅ Dropping Irrelevant Columns

* The `id`, `customer_id`, and `last_name` columns were dropped from the dataset because they provide no predictive value.

In [None]:
#Dropping irrelevant columns

irr_cols = ['id','customer_id','last_name']
data.drop(irr_cols,inplace=True,axis=1)
print(f'Data columns after dropping :\n{data.columns}')

### 📈 Outlier Handling
* We capped outliers in `credit_score` using the IQR method, with bounds at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.
* Values outside these bounds were replaced with the nearest bound to minimize distortion while preserving the overall data distribution.

In [None]:
print("Clipping outliers in the dataset")
col='credit_score'
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_count = ((data[col] < lower_bound) | (data[col] > upper_bound)).sum()
data[col] = np.clip(data[col], lower_bound, upper_bound)

print(f"  {col}: {outliers_count} outliers capped between {lower_bound:.2f} and {upper_bound:.2f}")

plt.figure(figsize=(8,3))
data[col].plot(kind='box')
plt.title("Plot after outlier capping")
plt.show()

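The cell above computes the bounds on the full dataset, before the train/validation/test split. One alternative, sketched here on made-up numbers, is to derive the bounds from the training portion only and reuse them on held-out data, so that held-out rows do not influence preprocessing. The helper `fit_iqr_bounds` and the toy values are hypothetical.

```python
import pandas as pd

def fit_iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) capping bounds learned from one series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy credit scores (made up): bounds come from the training values only.
train_scores = pd.Series([600, 620, 650, 680, 700, 350])
holdout_scores = pd.Series([300, 640, 900])

lower, upper = fit_iqr_bounds(train_scores)

# Apply the same bounds to both portions.
train_capped = train_scores.clip(lower, upper)
holdout_capped = holdout_scores.clip(lower, upper)
print(holdout_capped.tolist())
```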
### 🎯 Feature & Target Separation 

The target variable for this task is `exit_status`. All other relevant columns were selected as features.

In [None]:
#Feature-Target Separation

X = data.drop('exit_status',axis=1)
y = data['exit_status']

### 🪓 Train-Test-Validation Split

The dataset was split into training (70%), validation (15%), and test (15%) sets to evaluate model generalization.

In [None]:
#Train-Test-Validation Split

# First split: Train (70%) and Temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Second split: Validation (15%) and Test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

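Because the target is imbalanced, a plain random split can leave the three sets with slightly different churn rates. The sketch below shows the same 70/15/15 split with `stratify`, which the notebook does not use; the synthetic arrays and the 22% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with about 22% positives (made up).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 780 + [1] * 220)

# First split: 70% train, 30% temporary, preserving the class ratio.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Second split: halve the temporary set into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(labels)}, positive rate={labels.mean():.3f}")
```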
### 🧼 Data Preprocessing Pipeline

To prepare the data for model training, we built a preprocessing pipeline that handles different types of features appropriately:

1. Numerical Columns (Mean Impute + StandardScaler)


* For credit_score, acc_balance, and estimated_salary, missing values are imputed using the mean, followed by standard scaling to normalize the distribution.

2. Numerical Columns (Median Impute + MinMaxScaler)


* For prod_count, age, and tenure, missing values are filled using the median, then scaled to a 0–1 range using MinMaxScaler.

3. Categorical Columns

* For country and gender, missing values are imputed using the most frequent category, and then One-Hot Encoded to convert them into numeric form.

ColumnTransformer is used to combine all the above steps and apply them to the appropriate columns.
Any remaining columns (remainder='passthrough') are included without modification.



In [None]:
#Pipeline

numeric_mean_cols = ['credit_score','acc_balance','estimated_salary']
numeric_median_cols = ['prod_count','age','tenure']
cat_cols = ['country','gender']


numeric_mean_standardscaler = Pipeline([
    ('mean_impute',SimpleImputer(strategy='mean')),
    ('standard_Scaler',StandardScaler())
])

numeric_median_minmax_scaler = Pipeline([
    ('median_impute',SimpleImputer(strategy='median')),
    ('minmax_scaler',MinMaxScaler())
])

cat_pipeline = Pipeline([
    ('cat_impute',SimpleImputer(strategy='most_frequent')),
    ('cat_encode',OneHotEncoder())
])

preprocessor = ColumnTransformer([
    ('numeric_mean',numeric_mean_standardscaler,numeric_mean_cols),
    ('numeric_median',numeric_median_minmax_scaler,numeric_median_cols),
    ('cat',cat_pipeline,cat_cols)
],remainder='passthrough')

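A quick way to sanity-check a `ColumnTransformer` like this one is to fit it on a few rows and inspect the output shape. The sketch below is a reduced version on a made-up frame: it uses scikit-learn's own `Pipeline` instead of the imbalanced-learn one, keeps only one numeric and one categorical column, and adds `handle_unknown="ignore"` to the encoder, which is not set in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing value per imputed column (values are made up).
toy = pd.DataFrame({
    "credit_score": [600.0, np.nan, 700.0, 650.0],
    "gender": ["Male", "Female", np.nan, "Female"],
    "has_card": [1, 0, 1, 1],
})

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # handle_unknown="ignore" is an assumption, not part of the notebook
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer([
    ("num", numeric_pipe, ["credit_score"]),
    ("cat", categorical_pipe, ["gender"]),
], remainder="passthrough")

# 1 scaled numeric + 2 one-hot columns + 1 passthrough column
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
```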

## 4. 🧠 Model Building, Tuning, and Evaluation

### 🏗️  Model Building
- A total of **7 different classification models** were trained on the data:
  - Logistic Regression
  - AdaBoost
  - Gaussian Naive Bayes  
  - Random Forest  
  - Gradient Boosting  
  - XGBoost  
  - LightGBM  


In [None]:
#Models

models={
    "RandomForest":RandomForestClassifier(random_state=42),
    "GradientBoost":GradientBoostingClassifier(),
    "XGBoost":XGBClassifier(),
    "LogisticRegression":LogisticRegression(),
    "LightGBM":LGBMClassifier(random_state=42,verbosity=-1),
    "AdaBoost":AdaBoostClassifier(),
    "GaussianNaiveBayes":GaussianNB(),
}

### 📊 Model Evaluation

Each classification model was trained using a pipeline that included preprocessing and SMOTE to handle class imbalance. The models were evaluated on the validation set using key metrics:

* F1 Score

* Precision

* Recall

All results were stored, ranked by F1 Score, and displayed in a comparison table to identify the best-performing model.

In [None]:
#Results

results = []

for name,model in models.items():
    print(f"\n>>> Training model: {name}")
    print(f"Model type: {type(model)}")
    final_pipeline = Pipeline([
        ('preprocess',preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('model', model)
    ])
    final_pipeline.fit(X_train,y_train)
    pred = final_pipeline.predict(X_val)
    results.append({
        "Model":name,
        "F1 Score":f1_score(y_val,pred),
        "Precision Score":precision_score(y_val,pred),
        "Recall Score":recall_score(y_val,pred)
    })

results_df = pd.DataFrame(results).sort_values(by="F1 Score", ascending=False)
display(results_df.reset_index(drop=True))

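For reference, the three metrics in the table follow directly from the counts of true positives, false positives, and false negatives for the positive class (exited = 1). This is a small hand-rolled sketch on made-up labels; the notebook itself relies on the scikit-learn functions, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, which is why it is a reasonable ranking metric when the positive class is the minority.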
### 🛠️ Hyperparameter Tuning

To improve model performance, RandomizedSearchCV was used to tune key hyperparameters for three models:

* Gradient Boosting

* XGBoost

* LightGBM

Each model was wrapped in a pipeline with preprocessing and SMOTE. Randomized search was run with 3-fold cross-validation, optimizing for F1 Score. The best parameters were selected and used to update the models for final evaluation.



In [None]:
# Hyperparameter Tuning

param_distributions = {
    "GradientBoost": {
        "model__n_estimators": randint(50, 300),
        "model__max_depth": randint(3, 10),
        "model__learning_rate": uniform(0.01, 0.3),
        "model__subsample": uniform(0.7, 0.3)
    },
    "XGBoost": {
        "model__n_estimators": randint(50, 300),
        "model__max_depth": randint(3, 10),
        "model__learning_rate": uniform(0.01, 0.3),
        "model__subsample": uniform(0.7, 0.3),
        "model__colsample_bytree": uniform(0.7, 0.3)
    },
    "LightGBM": {
        'model__n_estimators': randint(100, 500),
        'model__max_depth': randint(3, 10),
        'model__learning_rate': uniform(0.01, 0.3),
        'model__num_leaves': randint(20, 100),
        'model__min_child_samples': randint(10, 100),
        'model__subsample': uniform(0.7, 0.3),
        'model__colsample_bytree': uniform(0.7, 0.3)
    }
}


models_to_tune = ["GradientBoost", "XGBoost", "LightGBM"]

for name in models_to_tune:
    print(f"Tuning {name}...")
    
    pipeline = Pipeline([
        ('preprocess', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('model', models[name])
    ])
    
    search = RandomizedSearchCV(
        pipeline,
        param_distributions[name],
        n_iter=20,
        cv=3,
        scoring='f1',
        verbose=1,
        random_state=42,
        n_jobs=-1
    )
    
    search.fit(X_train, y_train)
    best_pipeline = search.best_estimator_

    models[name] = best_pipeline.named_steps['model']

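The search spaces above use SciPy's parameterization, which is easy to misread: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The sketch below checks the ranges implied by three of the distributions used in the tuning cell.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) spans [loc, loc + scale]
lr_dist = uniform(0.01, 0.3)         # learning_rate: 0.01 to 0.31
subsample_dist = uniform(0.7, 0.3)   # subsample: 0.7 to 1.0

# randint(low, high) draws integers from low up to high - 1
depth_dist = randint(3, 10)          # max_depth: 3 to 9

lr_bounds = (lr_dist.ppf(0.0), lr_dist.ppf(1.0))
subsample_bounds = (subsample_dist.ppf(0.0), subsample_dist.ppf(1.0))
depth_samples = depth_dist.rvs(size=1000, random_state=42)

print(lr_bounds, subsample_bounds)
print(depth_samples.min(), depth_samples.max())
```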

### 🧪 Tuned Model Evaluation
After tuning, each optimized model was retrained on the training data and evaluated on the validation set using a consistent pipeline (preprocessing + SMOTE + model).

For each tuned model (Gradient Boosting, XGBoost, LightGBM), the following metrics were computed:

* Precision

* Recall

* F1 Score

The results were tabulated and sorted by F1 Score to identify the best-performing model after hyperparameter optimization.

In [None]:
tuned_results = []

for name in models_to_tune:
    
    tuned_model = models[name]

    tuned_pipeline = Pipeline([
        ('preprocess', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('model', tuned_model)
    ])

    tuned_pipeline.fit(X_train, y_train)
    y_val_pred = tuned_pipeline.predict(X_val)

    precision = precision_score(y_val, y_val_pred)
    recall = recall_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)

    tuned_results.append({
        "Model": name,
        "F1 Score": f1,
        "Precision": precision,
        "Recall": recall,
    })

results_df = pd.DataFrame(tuned_results).sort_values(by="F1 Score", ascending=False).reset_index(drop=True)
display(results_df)

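The SMOTE step used in these pipelines creates new minority-class rows by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step on two made-up points; it is a simplified illustration, not the imbalanced-learn implementation, and the helper name `interpolate_minority` is hypothetical.

```python
import numpy as np

def interpolate_minority(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)

rng = np.random.default_rng(42)

# Two made-up minority-class points in a 2-D feature space.
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

synthetic = interpolate_minority(x, neighbor, rng)
print(synthetic)
```

Inside an imbalanced-learn `Pipeline`, resampling is applied only while fitting, so validation and test rows are predicted without any synthetic samples.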
### 🧾 Final Test Prediction
To generate predictions for the test set:

* Training Data Recombined: The training and validation sets (`X_train`, `X_val`, `y_train`, `y_val`) were merged into a single dataset (`X_full`, `y_full`) to utilize all available labeled data.

* Best Model Selection: The model with the highest F1 Score from the tuning phase was selected as the final model.

* Pipeline Assembly: A complete pipeline was built with:

  - Preprocessing (`preprocessor`)

  - SMOTE for handling class imbalance

  - The best-performing tuned model

* Model Training & Prediction:

  - The pipeline was trained on the full data (`X_full`, `y_full`).

  - Predictions were generated on the unseen test dataset (`X_test`).

In [None]:
#Test Prediction

final_results = []

X_full = pd.concat([X_train, X_val], axis=0).reset_index(drop=True)
y_full = pd.concat([y_train, y_val], axis=0).reset_index(drop=True)


best_model_name = results_df.loc[0, 'Model']
print(f"\n Best tuned model: {best_model_name}")

best_model = models[best_model_name]
best_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('model', best_model)
])

best_pipeline.fit(X_full, y_full)
y_test_pred = best_pipeline.predict(X_test)

final_results.append({
        "Model": best_model_name,
        "F1 Score": f1_score(y_test, y_test_pred),
        "Precision": precision_score(y_test, y_test_pred),
        "Recall": recall_score(y_test, y_test_pred),
})

final_results_df = pd.DataFrame(final_results)
display(final_results_df)