# 🚀 Step-by-Step Guide to Train a Churn Prediction Model

This notebook will guide you through training a churn prediction model using customer data. We will prepare the data, train a classification model, evaluate it, and save the model for later use.

---

## Step 1: Import Libraries

Import necessary Python libraries such as pandas, numpy, scikit-learn, and joblib.

---

## Step 2: Load and Explore Data

Load your cleaned and encoded dataset, then perform basic exploration to understand the data structure and the target distribution.
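
A minimal sketch of the exploration step. The small in-memory frame is made-up stand-in data so the snippet runs on its own; in the notebook, `df` would instead come from `pd.read_csv` on the encoded dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; replace with the real encoded dataset.
df = pd.DataFrame({
    "Attrition_Flag": [0, 0, 1, 0, 1, 0],
    "Customer_Age": [45.0, 49.0, 51.0, 40.0, 40.0, 44.0],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0, 4010.0],
})

n_rows, n_cols = df.shape                  # size of the dataset
missing_per_column = df.isna().sum()       # missing values per column
# Share of each class in the target, to check how imbalanced churn is.
class_share = df["Attrition_Flag"].value_counts(normalize=True)

print(df.dtypes)
print(class_share)
```

Checking the class share early matters because a strongly imbalanced target makes plain accuracy misleading later on.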

---

## Step 3: Prepare Features and Target Variable

Separate the dataset into feature variables (`X`) and the target variable (`y`), which is the churn flag.
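
The separation can be sketched as follows. The rows are hypothetical stand-ins; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical stand-in rows with a few of the documented columns.
df = pd.DataFrame({
    "Attrition_Flag": [0, 1, 0, 1],
    "Customer_Age": [45.0, 49.0, 51.0, 40.0],
    "Total_Trans_Amt": [1144, 1291, 1887, 1171],
    "Months_Inactive_12_mon": [1.0, 1.0, 1.0, 4.0],
})

target_col = "Attrition_Flag"
X = df.drop(columns=[target_col])   # every column except the churn flag
y = df[target_col]                  # 0 = stayed, 1 = churned
```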

---

## Step 4: Split Data into Training and Testing Sets

Use `train_test_split` to create training and testing datasets, typically 70-80% for training and 20-30% for testing.
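
A sketch of an 80/20 split on synthetic data. The `stratify=y` and `random_state=42` arguments are assumptions added here, not settings confirmed by the notebook; stratifying keeps the churn rate the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 rows, 20% churners.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Customer_Age": rng.integers(26, 70, size=100),
    "Total_Trans_Amt": rng.integers(500, 15000, size=100),
})
y = pd.Series([1] * 20 + [0] * 80, name="Attrition_Flag")

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 80% train / 20% test
    stratify=y,         # keep the churn rate equal in both splits
    random_state=42,    # reproducible split
)
```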

---

## Step 5: Choose and Train the Model

Select a classification algorithm (here, Logistic Regression) and train it on the training data.
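
A minimal training sketch on synthetic data. Wrapping `StandardScaler` and `LogisticRegression` in a pipeline is an assumption made for illustration (the notebook imports both, but how it combines them is not shown here), as is `max_iter=1000`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic training data: churners have larger feature values.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 1.0, size=(80, 3)),   # 80 customers who stayed
    rng.normal(2.0, 1.0, size=(40, 3)),   # 40 customers who churned
])
y_train = np.array([0] * 80 + [1] * 40)

# Scale the features, then fit logistic regression on the scaled values.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1 (churn).
churn_proba = model.predict_proba(X_train)[:, 1]
```

Using a pipeline means the scaler is fitted on the training data only and is reapplied automatically at prediction time.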

---

## Step 6: Evaluate Model Performance

Evaluate the model on the test set using accuracy, precision, recall, F1-score, and ROC-AUC metrics.
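
The listed metrics can be computed as sketched below. The labels and scores are hypothetical values for eight test customers. Note that ROC-AUC is computed from predicted probabilities, while the other metrics use hard 0/1 predictions.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical true labels and model outputs for eight test customers.
y_test = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]                    # hard 0/1 predictions
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.2]   # predicted P(churn)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),   # needs scores, not labels
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```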

---

## Step 7: Save the Trained Model

Save the trained model using `joblib` or `pickle` for deployment or integration in the Streamlit app.
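
A sketch of a save-and-reload round trip with `joblib`. The tiny model and the file name `churn_model.joblib` are illustrative assumptions; the notebook's actual path may differ.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Hypothetical tiny model so there is something to save.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Illustrative file name; a temporary directory keeps the demo self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, model_path)

# Later (for example inside the app), load the model and predict with it.
loaded = joblib.load(model_path)
same_predictions = bool((loaded.predict(X) == model.predict(X)).all())
```

A saved model should be loaded with the same scikit-learn version that produced it, since the serialized format is not guaranteed to be stable across versions.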

---

## Step 8: Conclusion and Next Steps

Summarize findings and discuss how the model can be used for real-time churn prediction and customer retention strategies.

---


# Churn Prediction Model Features

## Target Variable
- **Attrition_Flag**  
  The target variable we want to predict.  
  Typically:  
  - `0` = Customer did **not** churn  
  - `1` = Customer **churned**

## Feature Variables (Inputs for the Model)
The following columns are used as features to predict churn:

| Feature Name             | Description                            |
|-------------------------|------------------------------------|
| `Customer_Age`           | Age of the customer                  |
| `Income_Category`        | Income group category                |
| `Card_Category`          | Credit card type/category            |
| `Months_Inactive_12_mon` | Number of months inactive in last 12 months |
| `Avg_Utilization_Ratio`  | Average credit utilization ratio    |
| `Total_Trans_Amt`        | Total transaction amount             |
| `Credit_Limit`           | Credit limit on the card             |

# 📊 Data Cleaning & Preparation for Churn Prediction

## 🎯 Goal

Prepare the dataset for machine learning by:

- Removing unnecessary columns
- Keeping only important features
- Preparing target column for prediction
- Ensuring correct data types for ML models

---

## 🔎 Understanding the Dataset

The original dataset includes:

- Customer demographic and account information
- Behavior & transaction metrics
- Some columns already encoded
- Some unnecessary columns (IDs, model probabilities)

---

## 🔧 Why Do We Have Extra Columns After Nominal Encoding?

We applied **Nominal Encoding** (One-Hot Encoding) on some categorical columns.

### 🔍 Example:

Original column: `Card_Category` → `['Blue', 'Gold', 'Platinum', 'Silver']`


After one-hot encoding, we get 4 new columns:

- `Card_Blue`
- `Card_Gold`
- `Card_Platinum`
- `Card_Silver`

✅ This allows ML models to work with categorical variables as numerical binary columns.
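
One way to produce these columns is `pd.get_dummies`, sketched below on hypothetical rows. Whether the notebook's dataset was encoded this way is an assumption; the `prefix="Card"` argument is chosen to reproduce the column names listed above.

```python
import pandas as pd

# Hypothetical rows using the card categories named above.
raw = pd.DataFrame(
    {"Card_Category": ["Blue", "Gold", "Platinum", "Silver", "Blue"]}
)

# prefix="Card" yields Card_Blue, Card_Gold, Card_Platinum, Card_Silver.
encoded = pd.get_dummies(raw, columns=["Card_Category"], prefix="Card")

print(encoded)
```

Each row has exactly one of the four columns set. Depending on the pandas version, the new columns come back as booleans or as 0/1 integers.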

---

## 🚫 Columns to Remove

| Column               | Reason                               |
|----------------------|------------------------------------|
| `Customer_ID`        | Unique ID — no predictive power    |
| `NB_Stay_Probability`| Model output — causes data leakage |
| `NB_Churn_Probability`| Model output — causes data leakage |
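
Dropping these columns can be sketched as follows. The two rows are stand-ins that mirror the column names above; `errors="ignore"` is an optional choice made here so the call does not fail if a column was already removed.

```python
import pandas as pd

# Stand-in rows; only the column names matter for this sketch.
df = pd.DataFrame({
    "Customer_ID": [768805383, 818770008],
    "Attrition_Flag": [0, 0],
    "Customer_Age": [45.0, 49.0],
    "NB_Stay_Probability": [9.3e-05, 5.7e-05],
    "NB_Churn_Probability": [0.99991, 0.99994],
})

cols_to_drop = ["Customer_ID", "NB_Stay_Probability", "NB_Churn_Probability"]
# errors="ignore" skips names that are absent instead of raising KeyError.
df_clean = df.drop(columns=cols_to_drop, errors="ignore")
```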

---

## ✅ Columns to Keep

| Column Group          | Columns                                                                                  |
|-----------------------|------------------------------------------------------------------------------------------|
| **Demographics**      | `Customer_Age`, `Gender`, `Dependent_count`, `Education_Level`, `Income_Category`         |
| **Account Info**      | `Tenure_Months`, `Products_Count`, `Months_Inactive_12_mon`, `Contacts_Count_12_mon`      |
| **Credit Info**       | `Credit_Limit`, `Total_Revolving_Bal`, `Available_Credit`                                |
| **Behavior**          | `Total_Amt_Chng_Q4_Q1`, `Total_Trans_Amt`, `Total_Trans_Ct`, `Total_Ct_Chng_Q4_Q1`, `Avg_Utilization_Ratio` |
| **One-Hot Encoded**   | `Marital_Divorced`, `Marital_Married`, `Marital_Single`, `Marital_Unknown`, `Card_Blue`, `Card_Gold`, `Card_Platinum`, `Card_Silver` |

---

## 🎯 Target Column

| Column          | Description                    |
|-----------------|-------------------------------|
| `Attrition_Flag`| Churn label (0 = Stay, 1 = Churn) |

---

## 🔢 Data Type Adjustment

The one-hot encoded columns are currently stored as `True` / `False`.  
✅ We will convert them into numerical `0` / `1` for machine learning.
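
A minimal sketch of the conversion, assuming the one-hot columns have pandas `bool` dtype. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with two one-hot columns stored as booleans.
df = pd.DataFrame({
    "Customer_Age": [45.0, 49.0, 51.0],
    "Marital_Married": [True, False, True],
    "Card_Blue": [True, True, True],
})

# Find every True/False column, then cast each one to integers.
bool_cols = df.select_dtypes(include="bool").columns
df = df.astype({col: int for col in bool_cols})   # True -> 1, False -> 0
```

Selecting by dtype avoids listing the eight one-hot columns by hand and leaves numeric columns such as `Customer_Age` untouched.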

---


In [1]:
# Step 1: Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [3]:
# Step 2: Load and Explore Data
 

df = pd.read_csv("../data/cleaned_data/encoded_data.csv")


In [4]:
df.head()

| Unnamed: 0 | Customer_ID | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Income_Category | Tenure_Months | Products_Count | Months_Inactive_12_mon | ... | NB_Stay_Probability | NB_Churn_Probability | Marital_Divorced | Marital_Married | Marital_Single | Marital_Unknown | Card_Blue | Card_Gold | Card_Platinum | Card_Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | 0 | 45.0 | 0 | 3 | 3 | 2 | 39.0 | 5 | 1.0 | ... | 9.3e-05 | 0.99991 | False | True | False | False | True | False | False | False |
| 1 | 818770008 | 0 | 49.0 | 1 | 5 | 2 | 4 | 44.0 | 6 | 1.0 | ... | 5.7e-05 | 0.99994 | False | False | True | False | True | False | False | False |
| 2 | 713982108 | 0 | 51.0 | 0 | 3 | 2 | 3 | 36.0 | 4 | 1.0 | ... | 2.1e-05 | 0.99998 | False | True | False | False | True | False | False | False |
| 3 | 769911858 | 0 | 40.0 | 1 | 4 | 3 | 4 | 34.0 | 3 | 4.0 | ... | 0.000134 | 0.99987 | False | False | False | True | True | False | False | False |
| 4 | 709106358 | 0 | 40.0 | 0 | 3 | 5 | 2 | 21.0 | 5 | 1.0 | ... | 2.2e-05 | 0.99998 | False | True | False | False | True | False | False | False |
