# **ETL for Credit Card Churn Analysis**

## Objectives

* This notebook loads the dataset through the Kaggle API and applies typical data cleansing methods to produce ready-to-use clean data file(s).

## Inputs

* Kaggle-hosted credit-card-customers dataset (sakshigoyal7/credit-card-customers)

## Outputs

* Cleansed file of current and churned credit card customers and their attributes.

## Additional Comments

* This notebook contains only the ETL elements of the process. Please see the other notebook for visualisation details.



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

print("Current Working Directory:", current_dir)   
  


Current Working Directory: c:\Users\ryan_\VS-code-projects\CreditCardChurn\jupyter_notebooks


We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))

print("You set a new current directory")


You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\ryan_\\VS-code-projects\\CreditCardChurn'
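
The directory change above works when the notebook is started from its own subfolder, but running that cell a second time would climb one more level. A minimal sketch of a repeat-safe variant follows, assuming the notebooks folder is named `jupyter_notebooks` as in the output above. The helper name `project_root` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` only when `cwd` is the notebooks folder."""
    # If we are already at the project root, return the path unchanged,
    # so running the cell twice does not climb a second level.
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Relative paths are used here purely for illustration.
root = project_root(Path("CreditCardChurn") / "jupyter_notebooks")
same = project_root(root)
print(root, same)
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))`.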

# Define data paths

In [4]:
dwpath = os.path.join(current_dir, 'dataFiles', 'rawdata')
outpath = os.path.join(current_dir,'dataFiles', 'cleandata')
print("Data Download Path:", dwpath)    
print("Data Cleansed Path:", outpath)

Data Download Path: c:\Users\ryan_\VS-code-projects\CreditCardChurn\dataFiles\rawdata
Data Cleansed Path: c:\Users\ryan_\VS-code-projects\CreditCardChurn\dataFiles\cleandata
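
The two paths above are only strings at this point; nothing in the cell creates the folders. As a hedged sketch, the snippet below shows one way to make sure both folders exist before anything is written to them. A temporary folder stands in for the project root, so the `project_dir` name is an assumption made for illustration.

```python
import os
import tempfile

# A temporary folder stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as project_dir:
    dwpath = os.path.join(project_dir, "dataFiles", "rawdata")
    outpath = os.path.join(project_dir, "dataFiles", "cleandata")
    for folder in (dwpath, outpath):
        # exist_ok=True makes the call safe to repeat on later runs.
        os.makedirs(folder, exist_ok=True)
    created = [os.path.isdir(dwpath), os.path.isdir(outpath)]

print(created)
```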


# Section 1 - Pull the dataset to local system

This section pulls the dataset from Kaggle via its API.
(A Kaggle account was created beforehand, and the API JSON token file was generated and saved on my system.)

In [5]:

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

api.dataset_download_files('sakshigoyal7/credit-card-customers', path = dwpath, unzip= True)



Dataset URL: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers
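
The authentication step above relies on credentials already being present on the machine. As a rough sketch, the check below looks in the places the Kaggle client is generally documented to read from: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file under `~/.kaggle` (or under the folder named by `KAGGLE_CONFIG_DIR`). The function name `kaggle_credentials_available` is hypothetical and is not part of the kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None) -> bool:
    """Best-effort check for Kaggle credentials before calling authenticate()."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: username and key supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file, by default under ~/.kaggle,
    # or under the folder named by KAGGLE_CONFIG_DIR if that is set.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


print("Kaggle credentials found:", kaggle_credentials_available())
```

Running a check like this first gives a clearer message than a failed download when the token file is missing.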


In [5]:
# Check that the downloaded file is accessible and has records
import pandas as pd
import numpy as np
df_raw = pd.read_csv(os.path.join(dwpath,'BankChurners.csv'))

df_raw.head()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,9.3e-05,0.99991
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,5.7e-05,0.99994
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,2.1e-05,0.99998
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,...,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,0.000134,0.99987
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,...,4716.0,0,4716.0,2.175,816,28,2.5,0.0,2.2e-05,0.99998


---

# Section 2 - Initial data inspection


This section evaluates the dataset structure and quality before cleaning:
- shape and column names
- data types
- missing values
- duplicate rows
- target distribution (churn vs non-churn)

To do:
- Inspect shape, column names, and dtypes
- Check missing values
- Check duplicates
- Confirm the target variable (Attrition_Flag) distribution, as noted in the ideas brainstorm
- Justify the reasons for the steps taken at the end
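
The data type check listed above can be sketched as a one-table overview. This is an illustrative example on a small toy frame, not the notebook's own cell: the `toy` and `overview` names are assumptions, and the values are copied from the first rows of the sample output only to make the sketch readable.

```python
import pandas as pd

# Toy frame standing in for df_raw; three columns are enough to show the idea.
toy = pd.DataFrame(
    {
        "CLIENTNUM": [768805383, 818770008, 713982108],
        "Attrition_Flag": ["Existing Customer", "Existing Customer", "Attrited Customer"],
        "Credit_Limit": [12691.0, 8256.0, 3418.0],
    }
)

# One row per column: dtype, non-null count, and number of distinct values.
overview = pd.DataFrame(
    {
        "dtype": toy.dtypes.astype(str),
        "non_null": toy.notna().sum(),
        "n_unique": toy.nunique(),
    }
)
print(overview)
```

Applied to `df_raw`, the same three expressions would show at a glance which columns are numeric and which are categorical text.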

In [18]:
# Basic structure
print("Shape:", df_raw.shape)
display(df_raw.head(3))
display(df_raw.tail(3))


Shape: (10127, 21)


Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,...,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,...,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,...,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0


Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
10124,716506083,Attrited Customer,44,F,1,High School,Married,Less than $40K,Blue,36,...,3,4,5409.0,0,5409.0,0.819,10291,60,0.818,0.0
10125,717406983,Attrited Customer,30,M,2,Graduate,Unknown,$40K - $60K,Blue,36,...,3,3,5281.0,0,5281.0,0.535,8395,62,0.722,0.0
10126,714337233,Attrited Customer,43,F,2,Graduate,Married,Less than $40K,Silver,25,...,2,4,10388.0,1961,8427.0,0.703,10294,61,0.649,0.189


In [19]:

df_raw.columns 


Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')

Looking at the columns in the raw file (visible in the df_raw.head() output in Section 1), I have decided to remove the following two columns from the dataset:
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'

They are not business-related columns; they are artifacts of a Naive Bayes classifier, which uses them as model helper/output fields. I am removing them because they could confuse the EDA down the line and do not describe the customer.

In [20]:
# Dropping the two odd columns 
ODD_COLS = [
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2",
]

# Dropping only if present
df_raw = df_raw.drop(columns=[c for c in ODD_COLS if c in df_raw.columns])

#checking columns after column drop
df_raw.columns 

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')
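
An alternative to listing the two long column names in full is to match on their shared prefix. The sketch below is illustrative only: it uses a toy frame, and the two `Naive_Bayes_Classifier_example_*` column names are shortened, hypothetical stand-ins for the real ones.

```python
import pandas as pd

PREFIX = "Naive_Bayes_Classifier_"  # shared prefix of the two helper columns

toy = pd.DataFrame(
    {
        "CLIENTNUM": [768805383],
        "Attrition_Flag": ["Existing Customer"],
        "Naive_Bayes_Classifier_example_1": [0.000093],  # hypothetical short name
        "Naive_Bayes_Classifier_example_2": [0.99991],   # hypothetical short name
    }
)

# Keep every column whose name does not start with the prefix.
cleaned = toy.loc[:, ~toy.columns.str.startswith(PREFIX)]
print(list(cleaned.columns))
```

Prefix matching is shorter to write, but the explicit list used in the cell above is safer if other columns could ever share the prefix.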

In [21]:
# Missing values check
missing = df_raw.isna().sum().sort_values(ascending=False)
missing = missing[missing > 0]
missing


Series([], dtype: int64)

In [25]:
# Duplicate rows check
dup_count = df_raw.duplicated().sum()
print("Duplicate rows:", dup_count)


Duplicate rows: 0


In [None]:
# Target distribution check
TARGET_COL = "Attrition_Flag"
target_counts = df_raw[TARGET_COL].value_counts(dropna=False)
target_pct = df_raw[TARGET_COL].value_counts(normalize=True, dropna=False).round(4) * 100  # convert to percentages

display(pd.DataFrame({"count": target_counts, "pct": target_pct}))


Unnamed: 0_level_0,count,pct
Attrition_Flag,Unnamed: 1_level_1,Unnamed: 2_level_1
Existing Customer,8500,83.93
Attrited Customer,1627,16.07


In [24]:
display(df_raw.describe(include="number").T)
display(df_raw.describe(include="object").T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CLIENTNUM,10127.0,739177600.0,36903780.0,708082083.0,713036800.0,717926400.0,773143500.0,828343100.0
Customer_Age,10127.0,46.32596,8.016814,26.0,41.0,46.0,52.0,73.0
Dependent_count,10127.0,2.346203,1.298908,0.0,1.0,2.0,3.0,5.0
Months_on_book,10127.0,35.92841,7.986416,13.0,31.0,36.0,40.0,56.0
Total_Relationship_Count,10127.0,3.81258,1.554408,1.0,3.0,4.0,5.0,6.0
Months_Inactive_12_mon,10127.0,2.341167,1.010622,0.0,2.0,2.0,3.0,6.0
Contacts_Count_12_mon,10127.0,2.455317,1.106225,0.0,2.0,2.0,3.0,6.0
Credit_Limit,10127.0,8631.954,9088.777,1438.3,2555.0,4549.0,11067.5,34516.0
Total_Revolving_Bal,10127.0,1162.814,814.9873,0.0,359.0,1276.0,1784.0,2517.0
Avg_Open_To_Buy,10127.0,7469.14,9090.685,3.0,1324.5,3474.0,9859.0,34516.0


Unnamed: 0,count,unique,top,freq
Attrition_Flag,10127,2,Existing Customer,8500
Gender,10127,2,F,5358
Education_Level,10127,7,Graduate,3128
Marital_Status,10127,4,Married,4687
Income_Category,10127,6,Less than $40K,3561
Card_Category,10127,4,Blue,9436


### Data Quality Summary (as of 21/01/2026)
- Missing values: none reported
- Duplicates: none reported
- Target balance: attrited customers represent 16.07% of the dataset (1,627 of 10,127 rows); the remaining 8,500 rows are existing customers.
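
The target balance can be checked by hand from the two counts in the target distribution output. The denominator is the full row count (existing plus attrited customers), not the existing-customer count alone. A small worked example, using only those two counts:

```python
# Counts taken from the target distribution output above.
existing, attrited = 8500, 1627
total = existing + attrited

attrited_pct = round(attrited / total * 100, 2)
existing_pct = round(existing / total * 100, 2)
print(total, attrited_pct, existing_pct)  # 10127 16.07 83.93
```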




---
