# 03_clean_BNPL.ipynb

## Purpose and Relevance

This notebook prepares and cleans the dataset `1_datasets/raw_data/BNPL.csv`, which comes from Kaggle and consists of 1,000 customer records mimicking real-world BNPL (Buy Now Pay Later) user profiles. Each row represents a unique customer and includes 15 columns covering behavioral, financial, and demographic features that support the analysis of ethical-risk classification in BNPL.

## Output

The final cleaned data will be saved in:
- `../1_datasets/processed_datasets/BNPL_cleaned.csv`


In [4]:
# Import library for data cleaning
import pandas as pd

In [5]:
# Load the raw dataset from the 1_datasets/raw_data folder
df = pd.read_csv("../1_datasets/raw_data/BNPL.csv")

# View the top 5 rows to understand structure
df.head()

,CustomerID,failed_traditional_credit,bnpl_usage_frequency,over_indebtedness_flag,financial_stress_score,external_repayment_loans,credit_card_interest_incidence,credit_limit_utilisation,payment_delinquency_count,impulsive_buying_score,financial_literacy_assessment,debt_accumulation_metric,return_dispute_incidents,demographic_risk_factor,bnpl_debt_ratio
0,CUST_0001,0,15,1,8,0,0,48,2,9,8,1.61,2,3,0.03
1,CUST_0002,1,12,1,10,1,0,13,5,6,1,2.2,3,2,0.31
2,CUST_0003,1,14,1,4,0,0,39,5,3,6,4.17,3,4,0.84
3,CUST_0004,0,8,0,2,0,0,39,4,6,2,4.65,3,5,0.14
4,CUST_0005,0,3,0,2,1,0,31,2,10,2,3.14,1,1,1.02


In [8]:
# Check column names
df.columns

Index(['CustomerID', 'failed_traditional_credit', 'bnpl_usage_frequency',
       'over_indebtedness_flag', 'financial_stress_score',
       'external_repayment_loans', 'credit_card_interest_incidence',
       'credit_limit_utilisation', 'payment_delinquency_count',
       'impulsive_buying_score', 'financial_literacy_assessment',
       'debt_accumulation_metric', 'return_dispute_incidents',
       'demographic_risk_factor', 'bnpl_debt_ratio'],
      dtype='object')

## 🔎 Variable Dictionary

| Feature              | Description                                      |
|---------------------|--------------------------------------------------|
| CustomerID              | Unique identifier for each customer |
| failed_traditional_credit           | Whether the customer was denied traditional credit (e.g., loan/credit card rejection)       |
| bnpl_usage_frequency    | How often the customer uses Buy Now Pay Later services                 |
| over_indebtedness_flag  | Indicates if the customer is considered over-indebted (likely a binary flag)              |
| financial_stress_score  | Numeric score representing the customer’s financial stress level              |
| external_repayment_loans  | Number of loans the customer is repaying outside of BNPL              |
| credit_card_interest_incidence| Whether the customer has incurred credit card interest charges (likely binary)            |
| credit_limit_utilisation   | Share of the credit limit currently used by the customer (sample values such as 48 and 13 suggest a percentage)                          |
| payment_delinquency_count   | Number of times the customer has missed or delayed payments                          |
| impulsive_buying_score   | Score reflecting the customer’s tendency for impulsive purchases                          |
| financial_literacy_assessment   | Score or category indicating the customer’s financial literacy level                          |
| debt_accumulation_metric   | Metric indicating how much debt the customer is accumulating                          |
| return_dispute_incidents   | Number of times the customer has disputed a return or transaction                          |
| demographic_risk_factor   | Risk factor based on demographic information (e.g., age, income, location)                          |
| bnpl_debt_ratio   | Ratio of BNPL debt to total debt or income                         |
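
Several entries in the dictionary are described as yes/no indicators, some only as "likely" binary. A quick way to test that assumption is sketched below. The `sample` frame, the `flag_columns` list, and the helper `non_binary_columns` are hypothetical stand-ins for illustration; in the notebook the check would run on `df`, and the choice of which columns to treat as flags is an assumption based on the descriptions above.

```python
import pandas as pd

# Small stand-in frame (hypothetical values); in the notebook this would be `df`.
sample = pd.DataFrame({
    "failed_traditional_credit": [0, 1, 1, 0],
    "over_indebtedness_flag": [1, 1, 0, 0],
    "credit_card_interest_incidence": [0, 0, 0, 2],  # 2 is a deliberately invalid value
})

# Columns assumed to be 0/1 flags, based on the dictionary descriptions.
flag_columns = [
    "failed_traditional_credit",
    "over_indebtedness_flag",
    "credit_card_interest_incidence",
]


def non_binary_columns(frame, columns):
    """Return the columns that contain any value other than 0 or 1."""
    return [col for col in columns if not frame[col].isin([0, 1]).all()]


bad_flags = non_binary_columns(sample, flag_columns)
print("Columns with non-binary values:", bad_flags)
```

An empty list would support treating those columns as binary flags; any column that is reported needs a closer look before modeling.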

In [None]:
# Check for missing values
missing = df[df.isnull().any(axis=1)]
print("Rows with missing values:", len(missing))

# Check for duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

Rows with missing values: 0
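
The check above counts whole rows. Two finer-grained views can be useful as a complement: missing values per column, and repeated `CustomerID` values, since `df.duplicated()` only flags rows that match in every column. The following is a minimal sketch on a hypothetical `sample` frame, not on the real dataset.

```python
import pandas as pd

# Hypothetical stand-in data with one missing value and one repeated ID.
sample = pd.DataFrame({
    "CustomerID": ["CUST_0001", "CUST_0002", "CUST_0002", "CUST_0004"],
    "bnpl_usage_frequency": [15, 12, 12, None],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# IDs that appear more than once, regardless of the other columns.
repeated_mask = sample["CustomerID"].duplicated(keep=False)
duplicate_ids = sample.loc[repeated_mask, "CustomerID"].unique().tolist()

print(missing_per_column)
print("Repeated CustomerID values:", duplicate_ids)
```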


In [12]:
# Check for data types
data_types = df.dtypes
print("Data types of each column:\n", data_types)

Data types of each column:
 CustomerID                         object
failed_traditional_credit           int64
bnpl_usage_frequency                int64
over_indebtedness_flag              int64
financial_stress_score              int64
external_repayment_loans            int64
credit_card_interest_incidence      int64
credit_limit_utilisation            int64
payment_delinquency_count           int64
impulsive_buying_score              int64
financial_literacy_assessment       int64
debt_accumulation_metric          float64
return_dispute_incidents            int64
demographic_risk_factor             int64
bnpl_debt_ratio                   float64
dtype: object
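
Every feature other than `CustomerID` is numeric in the output above. If the raw file changes, that expectation could be checked programmatically rather than by eye. The sketch below uses `pandas.api.types.is_numeric_dtype`; the helper name `non_numeric_features` and the `sample` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in data; one feature is deliberately stored as text.
sample = pd.DataFrame({
    "CustomerID": ["CUST_0001", "CUST_0002"],
    "financial_stress_score": [8, 10],
    "bnpl_debt_ratio": ["0.03", "0.31"],
})


def non_numeric_features(frame, id_column="CustomerID"):
    """List feature columns (everything except the ID) that are not numeric."""
    return [
        col
        for col in frame.columns
        if col != id_column and not is_numeric_dtype(frame[col])
    ]


unexpected = non_numeric_features(sample)
print("Non-numeric feature columns:", unexpected)
```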


## Save Cleaned Data

We now export the cleaned dataset to the `../1_datasets/processed_datasets` folder. This cleaned version will be used in Milestone 4 (MS4) for further exploration and comparative modeling.

In [13]:
# Save cleaned CSV for use in MS4
df.to_csv("../1_datasets/processed_datasets/BNPL_cleaned.csv", index=False)

print("Cleaned file saved in /1_datasets/processed_datasets/")

Cleaned file saved in /1_datasets/processed_datasets/
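
A simple way to confirm the export is to read the file back and compare it with the frame that was written. The sketch below does this with a hypothetical `sample` frame and a temporary directory so that it does not touch the project folders; in the notebook the same comparison could be made between `df` and the saved `BNPL_cleaned.csv`. It also shows why `index=False` matters: without it, the row index would be written as an extra column that pandas names `Unnamed: 0` on reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
sample = pd.DataFrame({
    "CustomerID": ["CUST_0001", "CUST_0002"],
    "bnpl_debt_ratio": [0.03, 0.31],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "BNPL_cleaned.csv"
    # index=False keeps the row index out of the file.
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)
print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```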
