# Notebook 1: ETL Pipeline

## ETL Pipeline for Credit Card Churn Analysis

- **Extract:** Load the credit card customer data from the dataset.
- **Transform:** Clean the data by removing duplicates, checking for and handling missing values, standardizing categorical variables, encoding categories into numeric form, and creating useful features like credit utilization ratio.
- **Load:** Save the cleaned and processed dataset for further analysis and predictive modeling to identify customers likely to churn.

The goal of this pipeline is a clean, consistent dataset that the later churn prediction notebooks can build on.
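The Transform step listed above can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's actual implementation: the `transform` function, the `CATEGORICAL_COLS` list, the `Churn` and `Utilization` column names, and the `"Attrited Customer"` label are all assumptions made for the example.

```python
import pandas as pd

# Assumed subset of categorical columns, for illustration only.
CATEGORICAL_COLS = ["Attrition_Flag", "Gender"]


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform step: standardize, deduplicate, encode, derive."""
    out = df.copy()

    # Standardize categorical text so that stray whitespace does not hide duplicates.
    for col in CATEGORICAL_COLS:
        out[col] = out[col].str.strip()

    # Remove exact duplicate rows.
    out = out.drop_duplicates()

    # Encode the target as 0/1 (assumes churned customers are labeled "Attrited Customer").
    out["Churn"] = (out["Attrition_Flag"] == "Attrited Customer").astype(int)

    # Credit utilization: revolving balance as a share of the credit limit.
    out["Utilization"] = out["Total_Revolving_Bal"] / out["Credit_Limit"]
    return out


# Toy data: row 1 duplicates row 0, row 2 has a trailing space in its label.
sample = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Existing Customer", "Attrited Customer "],
    "Gender": ["M", "M", "F"],
    "Credit_Limit": [12691.0, 12691.0, 4000.0],
    "Total_Revolving_Bal": [777, 777, 1000],
})

clean = transform(sample)
print(clean)
```

Stripping whitespace before `drop_duplicates()` is a deliberate ordering choice here: rows that differ only by padding would otherwise survive deduplication.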

### Column Descriptions

| Column                       | Description                                                  |
| ---------------------------- | ------------------------------------------------------------ |
| **CLIENTNUM**                | Unique customer ID number                                    |
| **Attrition_Flag**           | Whether customer left the service or stayed (TARGET VARIABLE)|
| **Customer_Age**             | Age of the customer in years                                 |
| **Gender**                   | Male or Female                                               |
| **Dependent_count**          | Number of dependents (family members)                        |
| **Education_Level**          | Customer's education level                                   |
| **Marital_Status**           | Single, Married, Divorced, or Unknown                        |
| **Income_Category**          | Annual income range category                                 |
| **Card_Category**            | Type of credit card (Blue, Silver, Gold, Platinum)           |
| **Months_on_book**           | How long customer has been with the company                  |
| **Total_Relationship_Count** | Number of products customer has with the company             |
| **Months_Inactive_12_mon**   | Number of months customer was inactive in last 12 months     |
| **Contacts_Count_12_mon**    | Number of times customer contacted company in last 12 months |
| **Credit_Limit**             | Maximum credit limit allowed                                 |
| **Total_Revolving_Bal**      | Outstanding balance on the card                              |
| **Avg_Open_To_Buy**          | Average available credit remaining                           |
| **Total_Amt_Chng_Q4_Q1**     | Change in transaction amount (Q4 over Q1)                    |
| **Total_Trans_Amt**          | Total transaction amount in last 12 months                   |
| **Total_Trans_Ct**           | Total transaction count in last 12 months                    |
| **Total_Ct_Chng_Q4_Q1**      | Change in transaction count (Q4 over Q1)                     |
| **Avg_Utilization_Ratio**    | How much of credit limit is being used                       |
| **NB_Stay_Probability**      | Probability customer will stay                               |
| **NB_Churn_Probability**     | Probability customer will leave                              |

# 📥 Step 1: Extract

In this step, we will load (extract) the raw dataset into our working environment for further processing.

---

### ✅ 1️⃣ Import Required Libraries

We start by importing the necessary libraries for data handling.


In [8]:
import pandas as pd
import numpy as np

In [12]:
# Extract: Load the data
df = pd.read_csv("../data/raw_data/BankChurners.csv")
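If the relative path is wrong, `pd.read_csv` raises an error that does not show where the notebook was actually looking. A small wrapper can fail earlier with a clearer message. This is only a sketch; the `load_raw` helper is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_raw(path: str) -> pd.DataFrame:
    """Hypothetical helper: read the raw CSV and fail early with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Show the resolved absolute path to make a wrong working directory obvious.
        raise FileNotFoundError(f"Raw data not found at {csv_path.resolve()}")
    df = pd.read_csv(csv_path)
    if df.empty:
        raise ValueError(f"{csv_path} contains no rows")
    return df


# Usage in this notebook would look like:
# df = load_raw("../data/raw_data/BankChurners.csv")
```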


In [13]:
df.head()

|   | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | \$60K - \$80K | Blue | 39 | ... | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 9.3e-05 | 0.99991 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than \$40K | Blue | 44 | ... | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 5.7e-05 | 0.99994 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | \$80K - \$120K | Blue | 36 | ... | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.0 | 2.1e-05 | 0.99998 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than \$40K | Blue | 34 | ... | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.76 | 0.000134 | 0.99987 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | \$60K - \$80K | Blue | 21 | ... | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.5 | 0.0 | 2.2e-05 | 0.99998 |
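The five rows shown by `df.head()` suggest two relationships between the credit columns: `Avg_Open_To_Buy` looks like `Credit_Limit - Total_Revolving_Bal`, and `Avg_Utilization_Ratio` looks like `Total_Revolving_Bal / Credit_Limit` rounded to three decimals. The sketch below checks this on those five rows only (values copied from the output above), so it is an observation about the sample, not a guarantee for the full dataset.

```python
import pandas as pd

# Values copied from the five rows of the df.head() output.
rows = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
    "Total_Revolving_Bal": [777, 864, 0, 2517, 0],
    "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
    "Avg_Utilization_Ratio": [0.061, 0.105, 0.0, 0.76, 0.0],
})

# Available credit = limit minus outstanding balance.
open_to_buy = rows["Credit_Limit"] - rows["Total_Revolving_Bal"]
open_to_buy_matches = open_to_buy.eq(rows["Avg_Open_To_Buy"]).all()

# Utilization = balance / limit; compare within rounding tolerance (3 decimals).
utilization = rows["Total_Revolving_Bal"] / rows["Credit_Limit"]
utilization_matches = (utilization - rows["Avg_Utilization_Ratio"]).abs().lt(0.0005).all()

print("Open-to-buy matches:", open_to_buy_matches)
print("Utilization matches:", utilization_matches)
```

If the same checks hold on the full DataFrame, a newly engineered utilization feature would largely duplicate `Avg_Utilization_Ratio`, which is worth knowing before adding it.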


### Key Column Renamings for Better Understanding

The following columns are renamed:

- `CLIENTNUM` → **Customer_ID**
- `Months_on_book` → **Tenure_Months**
- `Total_Relationship_Count` → **Products_Count**
- `Avg_Open_To_Buy` → **Available_Credit**
- `Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1` → **NB_Stay_Probability**
- `Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2` → **NB_Churn_Probability**

In [14]:
# Rename specified columns
# NOTE: in the df.head() sample, the `_1` column is close to 0 and the `_2` column
# is close to 1 for "Existing Customer" rows, so verify which suffix actually holds
# the churn probability before relying on the two NB_* names below.
df = df.rename(columns={
    'CLIENTNUM': 'Customer_ID',
    'Months_on_book': 'Tenure_Months',
    'Total_Relationship_Count': 'Products_Count',
    'Avg_Open_To_Buy': 'Available_Credit',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1': 'NB_Stay_Probability',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2': 'NB_Churn_Probability'
})
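By default, `DataFrame.rename` silently skips keys that do not match any column, so a misspelled source name leaves the column unchanged without warning. Passing `errors="raise"` makes such a mismatch visible. The sketch below uses a toy frame and a deliberately misspelled key (`Avg_open_To_Buy`, chosen for illustration).

```python
import pandas as pd

df_demo = pd.DataFrame({"CLIENTNUM": [768805383], "Avg_Open_To_Buy": [11914.0]})

# A key with the wrong letter case is ignored by default: nothing is renamed.
silent = df_demo.rename(columns={"Avg_open_To_Buy": "Available_Credit"})
print(list(silent.columns))

# With errors="raise", the same typo becomes a KeyError.
try:
    df_demo.rename(columns={"Avg_open_To_Buy": "Available_Credit"}, errors="raise")
    typo_caught = False
except KeyError as exc:
    typo_caught = True
    print("Rename failed:", exc)
```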


In [15]:
# Check all column names after renaming
print(df.columns)


Index(['Customer_ID', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Tenure_Months', 'Products_Count',
       'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
       'Total_Revolving_Bal', 'Available_Credit', 'Total_Amt_Chng_Q4_Q1',
       'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
       'Avg_Utilization_Ratio', 'NB_Stay_Probability', 'NB_Churn_Probability'],
      dtype='object')


In [16]:
# Display the basic information about the DataFrame:
# - Shows the total number of entries (rows)
# - Lists all column names, data types (int, float, object, etc.)
# - Shows how many non-null (non-missing) values each column has
# - Helps to quickly check for missing data and verify data types before further processing
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer_ID             10127 non-null  int64  
 1   Attrition_Flag          10127 non-null  object 
 2   Customer_Age            10127 non-null  int64  
 3   Gender                  10127 non-null  object 
 4   Dependent_count         10127 non-null  int64  
 5   Education_Level         10127 non-null  object 
 6   Marital_Status          10127 non-null  object 
 7   Income_Category         10127 non-null  object 
 8   Card_Category           10127 non-null  object 
 9   Tenure_Months           10127 non-null  int64  
 10  Products_Count          10127 non-null  int64  
 11  Months_Inactive_12_mon  10127 non-null  int64  
 12  Contacts_Count_12_mon   10127 non-null  int64  
 13  Credit_Limit            10127 non-null  float64
 14  Total_Revolving_Bal     10127 non-null

In [17]:
# Check for missing values in each column:
# df.isna() creates a boolean DataFrame where True marks a missing (NaN) value.
# .sum() then counts the missing values per column.
# The result shows which columns have missing data, and how much,
# so you can decide how to handle them during data cleaning.
df.isna().sum()

Customer_ID               0
Attrition_Flag            0
Customer_Age              0
Gender                    0
Dependent_count           0
Education_Level           0
Marital_Status            0
Income_Category           0
Card_Category             0
Tenure_Months             0
Products_Count            0
Months_Inactive_12_mon    0
Contacts_Count_12_mon     0
Credit_Limit              0
Total_Revolving_Bal       0
Available_Credit          0
Total_Amt_Chng_Q4_Q1      0
Total_Trans_Amt           0
Total_Trans_Ct            0
Total_Ct_Chng_Q4_Q1       0
Avg_Utilization_Ratio     0
NB_Stay_Probability       0
NB_Churn_Probability      0
dtype: int64
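A count of zero from `isna()` only means there are no `NaN` values. Placeholder strings are not counted, and the `df.head()` output above already contains one: `Marital_Status` is `Unknown` in row 3. The sketch below, on a toy frame built from those sample values, shows one way to count such placeholders. Treating `"Unknown"` as missing is an assumption about how the data should be interpreted, not something the dataset states.

```python
import pandas as pd

# Two categorical columns, with values copied from the df.head() sample.
demo = pd.DataFrame({
    "Marital_Status": ["Married", "Single", "Married", "Unknown", "Married"],
    "Education_Level": ["High School", "Graduate", "Graduate", "High School", "Uneducated"],
})

# isna() finds nothing, because "Unknown" is an ordinary string.
print(demo.isna().sum())

# Count the placeholder explicitly, per column.
unknown_counts = (demo == "Unknown").sum()
print(unknown_counts)

# Optionally mask placeholders so that isna() reflects them.
as_missing = demo.mask(demo == "Unknown").isna().sum()
print(as_missing)
```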