# 01 - EDA & Insights

**LoanGuardian** — Exploratory Data Analysis (UAE synthetic dataset)

This notebook performs a professional, recruiter-friendly EDA using the synthetic UAE dataset located at `../data/loan_guardian_uae.csv`.



## 1. Setup & Imports

Standard imports. Plots use `matplotlib` (no seaborn) so visuals render reliably in varied environments.


In [1]:
# Setup imports and plotting defaults
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Create directories for saving images/reports
os.makedirs('../docs/reports/eda_images', exist_ok=True)
os.makedirs('../docs/reports', exist_ok=True)

# Helper: save a figure under the EDA images directory (it does not display the figure)
def save_fig(fig, name):
    path = os.path.join('../docs/reports/eda_images', name)
    fig.savefig(path, bbox_inches='tight', dpi=150)
    print('Saved:', path)


## 2. Load dataset

Make sure `loan_guardian_uae.csv` exists under `../data/`. The generator script `data/generate_uae_synthetic_data.py` writes it there.


In [3]:
# Load data (adjust path if needed)
DATA_PATH = '../data/loan_guardian_uae.csv'
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATA_PATH}. Run data/generate_uae_synthetic_data.py first.")
df = pd.read_csv(DATA_PATH)
print('Loaded dataset with shape:', df.shape)
# show a small sample
df.head()


Loaded dataset with shape: (50000, 15)


|   | LoanID | LoanAmount | Income | Age | CreditScore | ExistingEMI | EmploymentStatus | Tenure | MaritalStatus | Dependents | RepaymentHistory | Purpose | Stage | DelinquencyCount12M | DefaultStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 126958 | 26125 | 41 | 782 | 16638 | Employed | 22 | Widowed | 2 | Good | Car Loan | Disbursement | 1 | 0 |
| 1 | 2 | 676155 | 7655 | 31 | 830 | 11937 | Unemployed | 55 | Single | 0 | Good | Car Loan | Disbursement | 1 | 0 |
| 2 | 3 | 136932 | 4448 | 23 | 826 | 2112 | Retired | 39 | Widowed | 0 | Good | Home Loan | File | 2 | 0 |
| 3 | 4 | 370838 | 16185 | 29 | 478 | 3683 | Self-Employed | 11 | Widowed | 3 | Poor | Personal Loan | Lead | 0 | 1 |
| 4 | 5 | 264178 | 19211 | 63 | 827 | 10679 | Unemployed | 11 | Married | 1 | Good | Business Loan | Disbursement | 0 | 0 |


## 3. Quick data overview

Show basic info, datatypes, missing values, and a numeric summary.


In [4]:
# Basic info
print('Columns and dtypes:\n')
print(df.dtypes)
print('\nMissing values (counts):\n')
print(df.isnull().sum())

# Numeric summary
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print('\nNumeric summary:')
display(df[num_cols].describe().T)


Columns and dtypes:

LoanID                  int64
LoanAmount              int64
Income                  int64
Age                     int64
CreditScore             int64
ExistingEMI             int64
EmploymentStatus       object
Tenure                  int64
MaritalStatus          object
Dependents              int64
RepaymentHistory       object
Purpose                object
Stage                  object
DelinquencyCount12M     int64
DefaultStatus           int64
dtype: object

Missing values (counts):

LoanID                 0
LoanAmount             0
Income                 0
Age                    0
CreditScore            0
ExistingEMI            0
EmploymentStatus       0
Tenure                 0
MaritalStatus          0
Dependents             0
RepaymentHistory       0
Purpose                0
Stage                  0
DelinquencyCount12M    0
DefaultStatus          0
dtype: int64

Numeric summary:


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| LoanID | 50000.0 | 25000.5 | 14433.901067 | 1.0 | 12500.75 | 25000.5 | 37500.25 | 50000.0 |
| LoanAmount | 50000.0 | 502617.78514 | 287379.778371 | 5005.0 | 254058.5 | 503893.5 | 750825.5 | 999969.0 |
| Income | 50000.0 | 21051.2134 | 10937.772891 | 2000.0 | 11603.25 | 21086.5 | 30520.0 | 39999.0 |
| Age | 50000.0 | 42.479 | 12.698653 | 21.0 | 32.0 | 42.0 | 53.0 | 64.0 |
| CreditScore | 50000.0 | 574.10456 | 159.088784 | 300.0 | 435.0 | 573.0 | 713.0 | 849.0 |
| ExistingEMI | 50000.0 | 10010.2389 | 5763.563525 | 0.0 | 5056.0 | 9988.0 | 14998.0 | 19999.0 |
| Tenure | 50000.0 | 32.50252 | 15.604904 | 6.0 | 19.0 | 32.0 | 46.0 | 59.0 |
| Dependents | 50000.0 | 2.01248 | 1.415388 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| DelinquencyCount12M | 50000.0 | 2.00584 | 1.410945 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| DefaultStatus | 50000.0 | 0.04524 | 0.207832 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## 4. Missing values report

Report features with missingness and percentage missing.


In [5]:
# Missingness report
missing = df.isnull().mean().sort_values(ascending=False)
missing = missing[missing>0]
if len(missing)>0:
    print((missing*100).round(2).astype(str) + ' %')
else:
    print('No missing values detected')


No missing values detected


## 5. Target balance & basic counts

Inspect `DefaultStatus` class distribution and key categorical features.


In [6]:
# Target distribution
target_counts = df['DefaultStatus'].value_counts()
print('DefaultStatus counts:\n', target_counts)
print('\nDefault rate: {:.2f}%'.format(100*target_counts.get(1,0)/len(df)))

# Categorical overview
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
print('\nCategorical columns:', cat_cols)


DefaultStatus counts:
 0    47738
1     2262
Name: DefaultStatus, dtype: int64

Default rate: 4.52%

Categorical columns: ['EmploymentStatus', 'MaritalStatus', 'RepaymentHistory', 'Purpose', 'Stage']
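
A 4.52% default rate means the classes are heavily imbalanced, which matters for any model trained later. As a rough illustration (not part of the original notebook), the sketch below turns the counts printed above into an imbalance ratio and per-class weights using the common "balanced" heuristic `n_samples / (n_classes * class_count)`, the same formula scikit-learn applies for `class_weight='balanced'`. The variable names are illustrative.

```python
# Counts copied from the DefaultStatus output above
counts = {0: 47738, 1: 2262}

n_total = sum(counts.values())
n_classes = len(counts)

# How many non-defaults there are for every default
imbalance_ratio = counts[0] / counts[1]

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {label: n_total / (n_classes * c) for label, c in counts.items()}

print(f"Non-default : default = {imbalance_ratio:.1f} : 1")
print({label: round(w, 3) for label, w in class_weights.items()})
```

Under this heuristic the ratio between the two weights equals the imbalance ratio, so a misclassified default costs roughly as much as 21 misclassified non-defaults.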


## 6. Univariate analysis — Numerical

Histograms and boxplots for important numeric features: LoanAmount, Income, CreditScore, ExistingEMI, Age, DelinquencyCount12M.


In [7]:
# Choose features for univariate plots
features = ['LoanAmount','Income','CreditScore','ExistingEMI','Age','DelinquencyCount12M']
for feat in features:
    if feat not in df.columns:
        continue
    # Histogram
    fig = plt.figure(figsize=(7,4))
    plt.hist(df[feat].dropna(), bins=60)
    plt.title(f'Histogram — {feat}')
    plt.xlabel(feat)
    plt.ylabel('Count')
    save_fig(fig, f'hist_{feat}.png')
    plt.close(fig)
    
    # Boxplot
    fig = plt.figure(figsize=(6,3))
    plt.boxplot(df[feat].dropna(), vert=False)
    plt.title(f'Boxplot — {feat}')
    save_fig(fig, f'box_{feat}.png')
    plt.close(fig)


Saved: ../docs/reports/eda_images\hist_LoanAmount.png
Saved: ../docs/reports/eda_images\box_LoanAmount.png
Saved: ../docs/reports/eda_images\hist_Income.png
Saved: ../docs/reports/eda_images\box_Income.png
Saved: ../docs/reports/eda_images\hist_CreditScore.png
Saved: ../docs/reports/eda_images\box_CreditScore.png
Saved: ../docs/reports/eda_images\hist_ExistingEMI.png
Saved: ../docs/reports/eda_images\box_ExistingEMI.png
Saved: ../docs/reports/eda_images\hist_Age.png
Saved: ../docs/reports/eda_images\box_Age.png
Saved: ../docs/reports/eda_images\hist_DelinquencyCount12M.png
Saved: ../docs/reports/eda_images\box_DelinquencyCount12M.png


## 7. Univariate analysis — Categorical

Bar charts for EmploymentStatus, Purpose, Stage, RepaymentHistory, MaritalStatus.


In [8]:
cat_features = ['EmploymentStatus','Purpose','Stage','RepaymentHistory','MaritalStatus']
for col in cat_features:
    if col not in df.columns: 
        continue
    counts = df[col].value_counts(normalize=True).sort_values(ascending=False)
    fig = plt.figure(figsize=(8,4))
    plt.bar(counts.index.astype(str), counts.values)
    plt.title(f'Distribution — {col} (proportion)')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Proportion')
    save_fig(fig, f'bar_{col}.png')
    plt.close(fig)


Saved: ../docs/reports/eda_images\bar_EmploymentStatus.png
Saved: ../docs/reports/eda_images\bar_Purpose.png
Saved: ../docs/reports/eda_images\bar_Stage.png
Saved: ../docs/reports/eda_images\bar_RepaymentHistory.png
Saved: ../docs/reports/eda_images\bar_MaritalStatus.png


## 8. Bivariate analysis — Numeric vs Target

Boxplots for numeric features against `DefaultStatus` to examine distributional differences.


In [9]:
# Boxplots by DefaultStatus
for feat in ['CreditScore','Income','LoanAmount','ExistingEMI','DelinquencyCount12M']:
    if feat not in df.columns: 
        continue
    data0 = df[df['DefaultStatus']==0][feat].dropna()
    data1 = df[df['DefaultStatus']==1][feat].dropna()
    fig = plt.figure(figsize=(8,4))
    plt.boxplot([data0, data1], labels=['No Default','Default'], showfliers=False)
    plt.title(f'{feat} by DefaultStatus (boxplot)')
    save_fig(fig, f'box_{feat}_by_target.png')
    plt.close(fig)


Saved: ../docs/reports/eda_images\box_CreditScore_by_target.png
Saved: ../docs/reports/eda_images\box_Income_by_target.png
Saved: ../docs/reports/eda_images\box_LoanAmount_by_target.png
Saved: ../docs/reports/eda_images\box_ExistingEMI_by_target.png
Saved: ../docs/reports/eda_images\box_DelinquencyCount12M_by_target.png


## 9. Correlation matrix (numeric features)

Compute pairwise Pearson correlations with pandas' `corr()` (the `LoanID` identifier is dropped first) and save a heatmap drawn with matplotlib's `imshow`.


In [10]:
num = df.select_dtypes(include=[np.number]).drop(columns=['LoanID'], errors='ignore')
corr = num.corr()
fig = plt.figure(figsize=(10,8))
plt.imshow(corr, interpolation='nearest', cmap='viridis')
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Correlation matrix (numeric features)')
save_fig(fig, 'corr_matrix.png')
plt.close(fig)


Saved: ../docs/reports/eda_images\corr_matrix.png
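
A heatmap is hard to read precisely, so it can help to list each feature's correlation with the target as well. The sketch below is one way to do that. `rank_by_target_corr` is a hypothetical helper and the toy frame is invented for illustration; in the notebook it would be called as `rank_by_target_corr(num, 'DefaultStatus')`.

```python
import pandas as pd

def rank_by_target_corr(frame, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    # Order by absolute value so strong negative relationships are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data (invented): one feature falls as the target rises, one is only loosely related
toy = pd.DataFrame({
    "x_neg":  [6, 5, 4, 3, 2, 1],
    "x_weak": [2, 5, 1, 4, 3, 6],
    "target": [0, 0, 0, 1, 1, 1],
})

ranked = rank_by_target_corr(toy, "target")
print(ranked)
```

Pearson correlation against a binary target captures only linear association, so a low value here does not rule out a useful non-linear relationship.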


## 10. Scatter matrix (pairwise) — selected features

Use pandas' scatter_matrix for a quick pairwise view. Keep the selection small to remain readable.


In [11]:
selection = ['CreditScore','Income','LoanAmount','ExistingEMI','DelinquencyCount12M']
selection = [s for s in selection if s in df.columns]
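if len(selection) >= 2:
    # Sample for performance; cap n at the number of rows left after dropna()
    subset = df[selection].dropna()
    sample = subset.sample(n=min(2000, len(subset)), random_state=42)
    # scatter_matrix returns an array of Axes, not a Figure
    axes = scatter_matrix(sample, figsize=(12,12), diagonal='hist')
    fig = axes[0, 0].get_figure()
    save_fig(fig, 'scatter_matrix_selection.png')
    plt.close(fig)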


Saved: ../docs/reports/eda_images\scatter_matrix_selection.png


## 11. Outlier detection (IQR method)

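Count values outside the 1.5×IQR fences (`Q1 - 1.5*IQR`, `Q3 + 1.5*IQR`) for each numeric column. The binary `DefaultStatus` column has an IQR of 0, so every default (value 1) is counted as an outlier; that count reflects class imbalance, not anomalous data.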


In [12]:
outlier_summary = {}
for col in num.columns:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5*iqr
    upper = q3 + 1.5*iqr
    mask = (df[col] < lower) | (df[col] > upper)
    outlier_summary[col] = int(mask.sum())
    
outlier_df = pd.Series(outlier_summary).sort_values(ascending=False).to_frame('outlier_count')
outlier_df.head(10)


|   | outlier_count |
|---|---|
| DefaultStatus | 2262 |
| LoanAmount | 0 |
| Income | 0 |
| Age | 0 |
| CreditScore | 0 |
| ExistingEMI | 0 |
| Tenure | 0 |
| Dependents | 0 |
| DelinquencyCount12M | 0 |
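
The only non-zero count belongs to the binary target, whose first and third quartiles are both 0. A variant that skips binary columns avoids that misleading row. The sketch below is an assumption about how one might restructure the check, not the notebook's own code; `iqr_outlier_counts` and the toy frame are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame, skip_binary=True):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    counts = {}
    for col in frame.select_dtypes(include="number").columns:
        series = frame[col].dropna()
        # Binary flags have no meaningful IQR fences, so optionally skip them
        if skip_binary and series.nunique() <= 2:
            continue
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
        counts[col] = int(mask.sum())
    return pd.Series(counts, name="outlier_count", dtype="int64")

# Toy data (invented): one extreme amount and a rare binary flag
toy = pd.DataFrame({
    "amount": [10, 12, 11, 13, 12, 11, 10, 500],
    "flag":   [0, 0, 0, 0, 0, 0, 0, 1],
})

with_flag = iqr_outlier_counts(toy, skip_binary=False)  # flag's single 1 is counted
without_flag = iqr_outlier_counts(toy)                  # flag column is skipped
print(with_flag, without_flag, sep="\n\n")
```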


## 12. Feature relationships — categorical vs target

Compute default rates by category for key categorical features (Stage, EmploymentStatus, Purpose, RepaymentHistory).


In [13]:
def default_rate_by(col):
    rates = df.groupby(col)['DefaultStatus'].agg(['count','sum'])
    rates['default_rate'] = (rates['sum'] / rates['count']).round(4)
    rates = rates.sort_values('default_rate', ascending=False)
    return rates

for col in ['Stage','EmploymentStatus','Purpose','RepaymentHistory']:
    if col in df.columns:
        print('\nDefault rate by', col)
        display(default_rate_by(col).head(8))



Default rate by Stage


| Stage | count | sum | default_rate |
|---|---|---|---|
| Disbursement | 12765 | 608 | 0.0476 |
| File | 12410 | 570 | 0.0459 |
| Lead | 12495 | 571 | 0.0457 |
| Sanction | 12330 | 513 | 0.0416 |



Default rate by EmploymentStatus


| EmploymentStatus | count | sum | default_rate |
|---|---|---|---|
| Employed | 12438 | 584 | 0.047 |
| Self-Employed | 12376 | 562 | 0.0454 |
| Retired | 12597 | 570 | 0.0452 |
| Unemployed | 12589 | 546 | 0.0434 |



Default rate by Purpose


| Purpose | count | sum | default_rate |
|---|---|---|---|
| Personal Loan | 12383 | 596 | 0.0481 |
| Home Loan | 12639 | 570 | 0.0451 |
| Business Loan | 12489 | 562 | 0.045 |
| Car Loan | 12489 | 534 | 0.0428 |



Default rate by RepaymentHistory


| RepaymentHistory | count | sum | default_rate |
|---|---|---|---|
| Poor | 5019 | 2262 | 0.4507 |
| Average | 9894 | 0 | 0.0 |
| Good | 35087 | 0 | 0.0 |
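
Every default in the RepaymentHistory table sits in a single category, which is worth flagging automatically because a feature like that may encode the target. The sketch below shows one possible check. `separation_report` is a hypothetical helper and the toy frame is invented; the column names simply mirror the dataset.

```python
import pandas as pd

def separation_report(frame, feature, target):
    """Per-category counts and default rate, with a flag for categories that have no defaults."""
    table = frame.groupby(feature)[target].agg(count="count", defaults="sum")
    table["default_rate"] = table["defaults"] / table["count"]
    table["no_defaults"] = table["defaults"] == 0
    return table

# Toy data (invented) with the same shape of result: defaults occur only under "Poor"
toy = pd.DataFrame({
    "RepaymentHistory": ["Poor", "Poor", "Poor", "Poor", "Average",
                         "Average", "Good", "Good", "Good", "Good"],
    "DefaultStatus":    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

report = separation_report(toy, "RepaymentHistory", "DefaultStatus")
# True when all positive cases fall in exactly one category
single_category = bool((report["defaults"] > 0).sum() == 1)
print(report)
print("All defaults in one category:", single_category)
```

When this flag is true, it is worth checking how the feature was generated or recorded before using it in a model.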


## 13. Save a small JSON report with key EDA metrics

This makes the key metrics easy to include in the docs and README.


In [14]:
# Save summary metrics
summary = {
    'n_rows': int(len(df)),
    'n_defaults': int(df['DefaultStatus'].sum()),
    'default_rate': float(df['DefaultStatus'].mean()),
    'num_numeric_features': int(len(num.columns)),
    'num_categorical_features': int(len(cat_cols))
}
import json
with open('../docs/reports/eda_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)
print('Saved summary report to ../docs/reports/eda_summary.json')
summary


Saved summary report to ../docs/reports/eda_summary.json


{'n_rows': 50000,
 'n_defaults': 2262,
 'default_rate': 0.04524,
 'num_numeric_features': 9,
 'num_categorical_features': 5}

## 14. Short Insights & Next Steps (for README)

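- RepaymentHistory: all 2,262 defaults fall in the `Poor` category (45.07% default rate); `Average` and `Good` have none. A feature that isolates the target this cleanly should be checked for leakage before modelling.
- Candidate predictors to quantify: CreditScore, DelinquencyCount12M, ExistingEMI/Income ratio. Their plots are saved to disk rather than summarised here, so their strength still needs to be measured.
- Cohorts: default rates vary little across Stage (4.16% to 4.76%), EmploymentStatus (4.34% to 4.70%), and Purpose (4.28% to 4.81%). Disbursement, not Lead, has the highest Stage rate.
- Data quality: no missing values in any of the 15 columns, and the IQR check flags no outliers in the continuous features.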

**Next steps:** Feature engineering (WOE, DTI ratio), handle class imbalance (e.g. SMOTE), train models with cross-validation, and produce SHAP explanations for a production-ready model.
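
The DTI ratio named in the next steps could be derived from two columns already in the dataset. The following is a minimal sketch, assuming `ExistingEMI` and `Income` are measured over the same period (the dataset does not state units). `add_dti` and the `DTI` column name are hypothetical. The first two rows reuse values from the `df.head()` sample; the zero-income row is invented to show the guard.

```python
import pandas as pd

def add_dti(frame, emi_col="ExistingEMI", income_col="Income"):
    """Return a copy of `frame` with a DTI column (existing EMI divided by income)."""
    out = frame.copy()
    # Non-positive income becomes NaN so the ratio is undefined rather than infinite
    income = out[income_col].where(out[income_col] > 0)
    out["DTI"] = out[emi_col] / income
    return out

toy = pd.DataFrame({
    "ExistingEMI": [16638, 11937, 5000],
    "Income":      [26125, 7655, 0],
})

with_dti = add_dti(toy)
print(with_dti)
```

Ratios above 1, as in the second row, mean the recorded EMI exceeds the recorded income; in a synthetic dataset this is likely a generation artefact and may call for capping or a separate flag.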


----

**How to run this notebook locally**

1. Ensure dataset is at `data/loan_guardian_uae.csv` relative to repository root.
2. Launch Jupyter Lab / Notebook from repository root (`LoanGuardian/`).
3. Open `notebooks/01_EDA_and_insights.ipynb` and run all cells.

**Files created by this notebook:**
- `../docs/reports/eda_images/` (images)
- `../docs/reports/eda_summary.json` (summary)
