# Notebook 01: Data Ingestion & Initial Exploration

## Project: AI-Powered Credit Risk Intelligence System

**Purpose of this notebook**
- Load raw weekly credit application data (Parquet)
- Perform initial exploration and sanity checks
- Validate data structure before downstream processing

This notebook represents the **Raw → Bronze boundary** in the data pipeline: it reads the raw snapshot, inspects it, and writes it unchanged to the bronze layer.


In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", "{:,.4f}".format)


In [3]:
DATA_PATH = "../data/raw/credit_applications_snapshot_20251221.parquet"

df = pd.read_parquet(DATA_PATH)

print("Shape:", df.shape)
df.head()


Shape: (1000000, 39)


index,application_id,customer_id,application_date,product_type,region,age,education_level,employment_status,time_in_job_months,residential_status,time_at_address_months,annual_income,other_income,total_income,monthly_income,loan_amount,term_months,monthly_debt_payments,dti,savings_balance,credit_history_months,num_open_accounts,num_closed_accounts,num_delinquencies_12m,num_delinquencies_24m,utilization_rate,hard_inquiries_6m,hard_inquiries_12m,bankruptcies,collections,bureau_score,apr,device_trust_score,email_domain_risk,ip_risk,kyc_match,p_default_true,pd_bucket,default_12m
0,10000001,1803258,2024-06-25,credit_card,London,22,college,employed,22,family,48,19685.42,1757.82,21443.24,1786.94,382.06,0,689.09,0.3856,1225.11,130,9,7,1,2,0.6005,3,5,0,0,608,0.4189,61.9,low,low,1,0.2709,E,0
1,10000002,7965604,2025-10-14,credit_card,London,39,bachelors,retired,35,mortgage,53,37811.12,2707.9,40519.02,3376.58,3673.99,0,1243.76,0.3683,192.49,50,7,5,0,0,0.2184,0,0,0,0,613,0.3535,86.2,low,medium,1,0.0517,C,0
2,10000003,6891143,2025-04-01,credit_card,Midlands,38,bachelors,student,40,mortgage,43,29183.89,921.76,30105.65,2508.8,4243.84,0,2717.92,1.0834,1789.04,144,6,5,3,5,0.4918,1,1,0,0,571,0.4689,62.7,medium,low,1,0.5697,E,1
3,10000004,4949905,2024-04-29,mortgage_topup,North,45,bachelors,self_employed,10,mortgage,15,27936.71,1785.8,29722.51,2476.88,7897.96,84,642.1,0.2592,708.74,191,4,13,0,0,0.4486,1,2,0,1,600,0.2124,75.2,low,low,1,0.1094,D,0
4,10000005,4897137,2025-12-01,personal_loan,North,35,bachelors,employed,26,mortgage,24,19960.2,940.78,20900.98,1741.75,1687.41,18,1019.26,0.5852,2875.86,116,3,5,1,1,0.4223,0,1,0,0,632,0.2901,95.5,medium,low,1,0.0957,C,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 39 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   application_id          1000000 non-null  int64         
 1   customer_id             1000000 non-null  int64         
 2   application_date        1000000 non-null  datetime64[ns]
 3   product_type            1000000 non-null  object        
 4   region                  1000000 non-null  object        
 5   age                     1000000 non-null  int64         
 6   education_level         1000000 non-null  object        
 7   employment_status       1000000 non-null  object        
 8   time_in_job_months      1000000 non-null  int64         
 9   residential_status      1000000 non-null  object        
 10  time_at_address_months  1000000 non-null  int64         
 11  annual_income           1000000 non-null  float64       
 12  other_income   
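
The dtypes reported by `df.info()` can also be asserted programmatically, so that a changed upstream schema fails loudly at ingestion time. The following is a minimal sketch and not part of the original notebook: `EXPECTED_DTYPES` and `dtype_mismatches` are hypothetical names, the expected types cover only the columns visible in the output above, and the demo frame is toy data.

```python
import pandas as pd

# Expected dtypes, copied from the df.info() output above (first 12 columns only).
EXPECTED_DTYPES = {
    "application_id": "int64",
    "customer_id": "int64",
    "application_date": "datetime64[ns]",
    "product_type": "object",
    "region": "object",
    "age": "int64",
    "education_level": "object",
    "employment_status": "object",
    "time_in_job_months": "int64",
    "residential_status": "object",
    "time_at_address_months": "int64",
    "annual_income": "float64",
}


def dtype_mismatches(frame: pd.DataFrame, expected: dict) -> dict:
    """Return {column: (expected, actual)} for missing or mistyped columns."""
    problems = {}
    for col, want in expected.items():
        got = str(frame[col].dtype) if col in frame.columns else "missing"
        if got != want:
            problems[col] = (want, got)
    return problems


# Toy frame (not the real dataset): age is deliberately stored as float,
# and customer_id is deliberately absent.
toy = pd.DataFrame(
    {
        "application_id": pd.Series([10000001, 10000002], dtype="int64"),
        "age": pd.Series([22.0, 39.0], dtype="float64"),
        "annual_income": pd.Series([19685.42, 37811.12], dtype="float64"),
    }
)
subset = {
    k: EXPECTED_DTYPES[k]
    for k in ("application_id", "customer_id", "age", "annual_income")
}
toy_problems = dtype_mismatches(toy, subset)
print(toy_problems)
```

In the notebook itself the call would be `dtype_mismatches(df, EXPECTED_DTYPES)`, where an empty dict means the listed columns match. String columns may load under a different dtype depending on the pandas version and Parquet engine, so the `"object"` entries are an assumption tied to the output shown here.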

In [5]:
df.columns.tolist()

['application_id',
 'customer_id',
 'application_date',
 'product_type',
 'region',
 'age',
 'education_level',
 'employment_status',
 'time_in_job_months',
 'residential_status',
 'time_at_address_months',
 'annual_income',
 'other_income',
 'total_income',
 'monthly_income',
 'loan_amount',
 'term_months',
 'monthly_debt_payments',
 'dti',
 'savings_balance',
 'credit_history_months',
 'num_open_accounts',
 'num_closed_accounts',
 'num_delinquencies_12m',
 'num_delinquencies_24m',
 'utilization_rate',
 'hard_inquiries_6m',
 'hard_inquiries_12m',
 'bankruptcies',
 'collections',
 'bureau_score',
 'apr',
 'device_trust_score',
 'email_domain_risk',
 'ip_risk',
 'kyc_match',
 'p_default_true',
 'pd_bucket',
 'default_12m']

### Column Groups

**Identifiers**
- application_id
- customer_id
- application_date

**Demographics & Stability**
- age
- region
- education_level
- employment_status
- time_in_job_months
- residential_status
- time_at_address_months

**Financials**
- annual_income
- other_income
- total_income
- monthly_income
- savings_balance
- monthly_debt_payments
- dti

**Credit History**
- credit_history_months
- num_open_accounts
- num_closed_accounts
- num_delinquencies_12m
- num_delinquencies_24m
- utilization_rate
- hard_inquiries_6m
- hard_inquiries_12m
- bankruptcies
- collections
- bureau_score

**Loan Terms**
- product_type
- loan_amount
- term_months
- apr

**Fraud / KYC Proxies**
- device_trust_score
- email_domain_risk
- ip_risk
- kyc_match

**Target / Labels**
- p_default_true
- pd_bucket
- default_12m
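
The grouping above can be kept as a dictionary so that later notebooks select columns by group rather than by hard-coded lists. This is a sketch under assumptions: `COLUMN_GROUPS` and `check_groups` are illustrative names that do not appear in the original notebook; the group contents are copied from the lists above.

```python
from collections import Counter

# Column groups exactly as listed above (dictionary keys are illustrative).
COLUMN_GROUPS = {
    "identifiers": ["application_id", "customer_id", "application_date"],
    "demographics_stability": [
        "age", "region", "education_level", "employment_status",
        "time_in_job_months", "residential_status", "time_at_address_months",
    ],
    "financials": [
        "annual_income", "other_income", "total_income", "monthly_income",
        "savings_balance", "monthly_debt_payments", "dti",
    ],
    "credit_history": [
        "credit_history_months", "num_open_accounts", "num_closed_accounts",
        "num_delinquencies_12m", "num_delinquencies_24m", "utilization_rate",
        "hard_inquiries_6m", "hard_inquiries_12m", "bankruptcies",
        "collections", "bureau_score",
    ],
    "loan_terms": ["product_type", "loan_amount", "term_months", "apr"],
    "fraud_kyc": ["device_trust_score", "email_domain_risk", "ip_risk", "kyc_match"],
    "target": ["p_default_true", "pd_bucket", "default_12m"],
}


def check_groups(columns, groups):
    """Return (missing, unassigned, duplicated) column names.

    missing:    grouped columns that are not present in `columns`
    unassigned: columns that belong to no group
    duplicated: columns listed in more than one group (or twice in one)
    """
    assigned = [c for cols in groups.values() for c in cols]
    counts = Counter(assigned)
    duplicated = sorted(c for c, n in counts.items() if n > 1)
    missing = sorted(set(assigned) - set(columns))
    unassigned = sorted(set(columns) - set(assigned))
    return missing, unassigned, duplicated


all_grouped = [c for cols in COLUMN_GROUPS.values() for c in cols]
print(len(all_grouped), "columns assigned to", len(COLUMN_GROUPS), "groups")
```

Calling `check_groups(df.columns, COLUMN_GROUPS)` in the notebook should return three empty lists if the groups cover all 39 columns exactly once.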


In [8]:
df.describe().T


column,count,mean,min,25%,50%,75%,max,std
application_id,1000000.0,10500000.5000,10000001.0000,10250000.7500,10500000.5000,10750000.2500,11000000.0000,288675.2789
customer_id,1000000.0,5497218.4923,1000000.0000,3247334.7500,5498659.0000,7744528.2500,9999992.0000,2596953.0145
application_date,1000000.0,2024-12-20 12:11:15.907200256,2023-12-22 00:00:00,2024-06-21 00:00:00,2024-12-20 00:00:00,2025-06-21 00:00:00,2025-12-20 00:00:00,
age,1000000.0,36.1538,18.0000,29.0000,36.0000,43.0000,75.0000,9.6869
time_in_job_months,1000000.0,29.9958,0.0000,9.0000,21.0000,42.0000,240.0000,29.9723
time_at_address_months,1000000.0,35.9274,0.0000,10.0000,25.0000,50.0000,240.0000,35.6266
annual_income,1000000.0,36246.1657,8000.0000,22825.1300,31993.7650,44798.2125,250000.0000,19307.6981
other_income,1000000.0,3742.2981,23.9000,1361.4800,2500.2300,4580.1900,80000.0000,4160.7434
total_income,1000000.0,39988.4638,8023.9000,26234.0075,35796.0850,48944.1550,255533.6900,19748.3695
monthly_income,1000000.0,3332.3720,668.6600,2186.1700,2983.0050,4078.6825,21294.4700,1645.6975


In [9]:
missing = df.isna().mean().sort_values(ascending=False)
missing[missing > 0]


Series([], dtype: float64)

In [10]:
df["default_12m"].value_counts(normalize=True)


default_12m
0   0.8985
1   0.1015
Name: proportion, dtype: float64
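
The proportions above describe a moderately imbalanced target. As a rough worked example (not from the original notebook), the figures can be turned into an imbalance ratio and a trivial accuracy baseline; the variable names are illustrative and the inputs are the rounded proportions shown in the output.

```python
# Proportions copied from the value_counts output above (rounded to 4 decimals).
p_non_default = 0.8985
p_default = 0.1015

# Number of non-defaulted applications per defaulted application.
imbalance_ratio = p_non_default / p_default

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline_accuracy = max(p_non_default, p_default)

print(f"non-defaults per default: {imbalance_ratio:.2f}")
print(f"always-predict-0 accuracy: {majority_baseline_accuracy:.4f}")
```

Because predicting "no default" for every application is already right about 90% of the time, plain accuracy is a weak yardstick for any model built on this target.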

In [11]:
df["product_type"].value_counts(normalize=True)


product_type
personal_loan    0.4509
credit_card      0.2991
auto_loan        0.1201
payday_like      0.0799
mortgage_topup   0.0500
Name: proportion, dtype: float64

In [12]:
df["application_date"].min(), df["application_date"].max()


(Timestamp('2023-12-22 00:00:00'), Timestamp('2025-12-20 00:00:00'))
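
The min/max pair gives the covered window but says nothing about gaps inside it. The sketch below computes the span from the two timestamps above and shows one way to list calendar days with no applications; `missing_days` is a hypothetical helper and the demo series is toy data, not the real dataset.

```python
import pandas as pd

# Bounds copied from the output above.
start = pd.Timestamp("2023-12-22")
end = pd.Timestamp("2025-12-20")
span_days = (end - start).days
print("span in days:", span_days)


def missing_days(dates: pd.Series) -> pd.DatetimeIndex:
    """Return calendar days between min and max that have no rows."""
    days = pd.DatetimeIndex(dates.dt.normalize().unique())
    full_range = pd.date_range(days.min(), days.max(), freq="D")
    return full_range.difference(days)


# Toy series (not the real data) with a deliberate gap on 2024-01-03.
toy_dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]))
gaps = missing_days(toy_dates)
print(list(gaps))
```

In the notebook, `missing_days(df["application_date"])` would return an empty index if every day in the window has at least one application.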

In [13]:
df.groupby("default_12m")[
    ["dti", "utilization_rate", "bureau_score", "num_delinquencies_12m"]
].mean()


default_12m,dti,utilization_rate,bureau_score,num_delinquencies_12m
0,0.2831,0.3283,652.8086,0.5703
1,0.5252,0.3761,635.4665,0.8497
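
The four variables sit on very different scales (ratios, a score in the hundreds, a count), so raw mean gaps are hard to compare. As an illustrative sketch that is not part of the original notebook, the group means from the output above can be re-expressed as relative changes between defaulters and non-defaulters; the `means` frame is typed in by hand from that output.

```python
import pandas as pd

# Group means copied from the groupby output above.
means = pd.DataFrame(
    {
        "dti": [0.2831, 0.5252],
        "utilization_rate": [0.3283, 0.3761],
        "bureau_score": [652.8086, 635.4665],
        "num_delinquencies_12m": [0.5703, 0.8497],
    },
    index=pd.Index([0, 1], name="default_12m"),
)

# Relative change of the defaulter mean versus the non-defaulter mean.
relative_change = (means.loc[1] / means.loc[0] - 1).sort_values(ascending=False)
print(relative_change.round(3))
```

In the notebook, `means` would come directly from the `groupby(...).mean()` call above. Relative change is only one possible normalisation; a standardised mean difference would also account for each variable's spread.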


In [15]:
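from pathlib import Path
BRONZE_PATH = Path("../data/bronze/credit_applications_20251221.parquet")
BRONZE_PATH.parent.mkdir(parents=True, exist_ok=True)  # ensure the bronze directory exists
df.to_parquet(BRONZE_PATH, index=False)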


## Summary & Next Steps

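✔ Data successfully ingested from Parquet and written to the bronze layer  
✔ Dataset contains 1,000,000 credit applications across 39 columns, dated 2023-12-22 to 2025-12-20  
✔ No missing values detected in any column  
✔ Default rate is 10.15%; defaulters show higher mean DTI (0.53 vs 0.28), utilization, and 12-month delinquencies, and a lower mean bureau score  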

**Next notebook:**  
➡️ `02_data_quality_checks_and_validation.ipynb`
