# Exploratory Data Analysis for Credit Scoring

This notebook performs EDA on a credit dataset to uncover patterns, assess data quality, and identify key predictors of creditworthiness. The analysis includes data cleaning, univariate and bivariate analysis, visualizations, and fairness checks.

## Dataset
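This notebook assumes a credit dataset (e.g., the Kaggle Credit Risk Dataset) with columns such as `loan_amnt`, `int_rate`, `grade`, and `default` (the target). Place the file at `data/raw/credit_data.csv` under the project root. The cells below read it through the relative path `../data/raw/credit_data.csv`, so run the notebook from a directory one level below the root.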

## Steps
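1. Load and clean the dataset.
2. Summarize the data (statistics, missing values, data types).
3. Analyze and plot the distributions of key numerical features.
4. Analyze and plot relationships between features and the target.
5. Check sensitive attributes for potential biases.
6. Record key findings and next steps.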

In [None]:
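# Import libraries
import os
import sys

import numpy as np
import pandas as pd

# Make the project root importable when the notebook runs from a subdirectory
# (the relative paths below assume the root is one level up)
sys.path.insert(0, os.path.abspath('..'))

# Import custom modules from src/
from src.data_preprocessing import load_data, clean_data, save_cleaned_data
from src.visualizations import plot_distribution, plot_correlation_heatmap, plot_default_by_category
from src.analysis import summarize_data, correlation_analysis, chi_square_test

# Create output directories for figures and the cleaned dataset
os.makedirs('../reports/figures', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)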

## 1. Load and Clean Data

Load the raw dataset and apply cleaning steps (handle missing values, encode categorical variables, remove outliers).

In [None]:
# Load dataset
data_path = '../data/raw/credit_data.csv'
df = load_data(data_path)

# Display first few rows
print("Raw Data Preview:")
print(df.head())

# Clean dataset
df_clean = clean_data(df)

# Save cleaned dataset
save_cleaned_data(df_clean, '../data/processed/credit_data_cleaned.csv')

# Display cleaned data preview
print("\nCleaned Data Preview:")
print(df_clean.head())
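
The implementation of `clean_data` lives in `src/data_preprocessing.py` and is not shown in this notebook. As a rough sketch of what the three cleaning steps named above could look like in pandas, the hypothetical `clean_data_sketch` below imputes missing values (median for numeric columns, most frequent value otherwise), drops rows outside 1.5 times the IQR on numeric features, and encodes categorical columns as integer codes. The function name and every one of these choices are assumptions for illustration, not the project's actual logic.

```python
import numpy as np
import pandas as pd


def clean_data_sketch(df: pd.DataFrame, target: str = "default") -> pd.DataFrame:
    """Hypothetical cleaning routine; not the project's clean_data."""
    out = df.copy()
    numeric_cols = [c for c in out.select_dtypes(include="number").columns if c != target]
    other_cols = [c for c in out.columns if c not in numeric_cols and c != target]

    # 1. Handle missing values: median for numeric, most frequent value otherwise.
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in other_cols:
        if out[col].isna().any():
            out[col] = out[col].fillna(out[col].mode().iloc[0])

    # 2. Remove outliers: keep rows within 1.5 * IQR on every numeric feature.
    keep = pd.Series(True, index=out.index)
    for col in numeric_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[keep].reset_index(drop=True)

    # 3. Encode categorical variables as integer codes (sorted category order).
    for col in other_cols:
        out[col] = out[col].astype("category").cat.codes

    return out


# Toy data (made up for this example): one missing amount, one missing grade,
# and one extreme loan amount.
demo = pd.DataFrame({
    "loan_amnt": [5500, 7000, np.nan, 6500, 6000, 500000],
    "grade": ["A", "B", "B", None, "C", "A"],
    "default": [0, 0, 1, 0, 1, 1],
})
cleaned = clean_data_sketch(demo)
print(cleaned)
```

Integer codes impose an order on the categories, which suits an ordinal column like `grade` but not nominal ones; one-hot encoding (`pd.get_dummies`) is the usual alternative there.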

## 2. Data Summary

Generate summary statistics, check for missing values, and review data types.

In [None]:
# Summarize data
summary = summarize_data(df_clean)
print("Summary Statistics:")
print(summary['numerical'])
print("\nMissing Values:")
print(summary['missing'])
print("\nData Types:")
print(summary['data_types'])
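
For readers without access to `src/analysis.py`, here is a minimal sketch of a summary helper that returns the same three keys used in the cell above (`numerical`, `missing`, `data_types`). The name `summarize_data_sketch` and the exact contents of each entry are assumptions; the real `summarize_data` may report more or different information.

```python
import numpy as np
import pandas as pd


def summarize_data_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for summarize_data; keys mirror those used above."""
    missing_count = df.isna().sum()
    missing = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
    })
    return {
        "numerical": df.describe(),  # count, mean, std, min, quartiles, max
        "missing": missing,
        "data_types": df.dtypes,
    }


# Toy data, made up for this example
demo = pd.DataFrame({
    "loan_amnt": [5000.0, 7000.0, np.nan, 6500.0],
    "grade": ["A", "B", "B", None],
})
summary_demo = summarize_data_sketch(demo)
print(summary_demo["missing"])
```

Reporting missingness as a percentage alongside the raw count makes it easier to compare columns when the dataset is large.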

## 3. Univariate Analysis

Analyze the distribution of key numerical features (e.g., loan amount, interest rate).

In [None]:
# Plot distributions for key numerical columns
numerical_cols = ['loan_amnt', 'int_rate']  # Adjust based on your dataset
for col in numerical_cols:
    if col in df_clean.columns:
        plot_distribution(df_clean, col, '../reports/figures')
        print(f"Distribution plot for {col} saved.")
    else:
        print(f"Column {col} not found in dataset.")
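
The body of `plot_distribution` is defined in `src/visualizations.py` and not reproduced here. A plausible minimal version, assuming it draws a histogram with matplotlib and writes a PNG into the given directory, might look like the following. The function name `plot_distribution_sketch`, the `<col>_distribution.png` file name, and the mean/median markers are all assumptions made for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_distribution_sketch(df: pd.DataFrame, col: str, out_dir: str) -> str:
    """Hypothetical stand-in for plot_distribution; returns the saved file path."""
    values = df[col].dropna()
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(values, bins=30, edgecolor="black")
    # Mark mean and median: a visible gap between them signals skew
    ax.axvline(values.mean(), linestyle="--", color="tab:red",
               label=f"mean = {values.mean():.1f}")
    ax.axvline(values.median(), linestyle=":", color="tab:green",
               label=f"median = {values.median():.1f}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
    ax.set_title(f"Distribution of {col}")
    ax.legend()

    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{col}_distribution.png")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # free the figure so loops over many columns do not leak memory
    return path


# Synthetic right-skewed values, generated only for this example
rng = np.random.default_rng(0)
demo = pd.DataFrame({"loan_amnt": rng.lognormal(mean=9.0, sigma=0.5, size=500)})
saved_path = plot_distribution_sketch(demo, "loan_amnt", tempfile.mkdtemp())
print(saved_path)
```

Inside a notebook, the `matplotlib.use("Agg")` line can be dropped so that figures also render inline.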

## 4. Bivariate and Multivariate Analysis

Analyze relationships between features and the target variable (`default`).

In [None]:
# Correlation analysis with target variable
target = 'default'  # Adjust based on your target column
if target in df_clean.columns:
    correlations = correlation_analysis(df_clean, target)
    print("Correlations with Default:")
    print(correlations)
    plot_correlation_heatmap(df_clean, '../reports/figures')
    print("Correlation heatmap saved.")
else:
    print(f"Target column {target} not found in dataset.")
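
As an illustration of what a correlation-with-target step typically computes, the sketch below takes the Pearson correlation of each numeric column with the target and ranks the results by absolute value. With a 0/1 target this is the point-biserial correlation. The helper name `correlation_with_target_sketch` is hypothetical, and the real `correlation_analysis` in `src/analysis.py` may use a different method or output format.

```python
import pandas as pd


def correlation_with_target_sketch(df: pd.DataFrame, target: str) -> pd.Series:
    """Hypothetical stand-in for correlation_analysis."""
    numeric = df.select_dtypes(include="number")
    if target not in numeric.columns:
        raise ValueError(f"Target column {target!r} must be numeric (e.g., 0/1).")
    corr = numeric.corr()[target].drop(target)
    # Rank by absolute strength while keeping the sign of each correlation
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Toy data, made up for this example
demo = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000, 6000],
    "int_rate": [5.0, 6.0, 12.0, 7.0, 15.0, 14.0],
    "grade": ["A", "A", "C", "B", "D", "D"],
    "default": [0, 0, 1, 0, 1, 1],
})
target_corr = correlation_with_target_sketch(demo, "default")
print(target_corr)
```

Non-numeric columns such as `grade` are skipped here, which is why categorical features are examined separately with a chi-square test.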

In [None]:
# Analyze default rates by categorical variables
categorical_cols = ['grade']  # Adjust based on your dataset
for col in categorical_cols:
    # Both the feature and the target (defined in the previous cell) must exist
    if col in df_clean.columns and target in df_clean.columns:
        plot_default_by_category(df_clean, col, target, '../reports/figures')
        print(f"Default rate plot for {col} saved.")
        # Perform chi-square test of independence between the feature and the target
        chi2_result = chi_square_test(df_clean, col, target)
        print(f"Chi-Square Test for {col}:")
        print(chi2_result)
    else:
        print(f"Column {col} or target {target} not found in dataset.")
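
The chi-square test of independence used above can be reproduced with `pandas.crosstab` and `scipy.stats.chi2_contingency`. The sketch below is one way to do it; the helper name `chi_square_sketch` and the returned dictionary keys are assumptions, and the project's `chi_square_test` may return something different.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_sketch(df: pd.DataFrame, col: str, target: str) -> dict:
    """Hypothetical stand-in for chi_square_test."""
    # Rows: categories of `col`; columns: target classes; cells: observed counts
    contingency = pd.crosstab(df[col], df[target])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        # Small expected counts (below about 5) make the approximation unreliable
        "min_expected": float(expected.min()),
    }


# Toy data, made up for this example: 50 loans per grade with rising default counts
demo = pd.DataFrame({
    "grade": ["A"] * 50 + ["B"] * 50 + ["C"] * 50,
    "default": [0] * 40 + [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20 + [1] * 30,
})
chi2_demo = chi_square_sketch(demo, "grade", "default")
print(chi2_demo)
```

A small p-value indicates that default rates differ across the categories, but it says nothing about the size of the difference, so it is best read alongside the default rate plot.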

## 5. Fairness and Bias Check

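Compare default rates across groups defined by sensitive attributes (e.g., gender, age group). A difference in default rates between groups does not by itself establish bias, but it flags attributes that need closer review before modeling.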

In [None]:
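# Example: Default rates by a sensitive attribute (e.g., age group)
sensitive_col = 'age'  # Adjust based on your dataset
if sensitive_col in df_clean.columns and target in df_clean.columns:
    df_fair = df_clean.copy()
    # A continuous attribute such as raw age must be grouped before it can be
    # treated as categorical; quartile bins are an illustrative choice
    if pd.api.types.is_numeric_dtype(df_fair[sensitive_col]) and df_fair[sensitive_col].nunique() > 10:
        df_fair[sensitive_col] = pd.qcut(df_fair[sensitive_col], q=4, duplicates='drop').astype(str)
    plot_default_by_category(df_fair, sensitive_col, target, '../reports/figures')
    print(f"Default rate plot for {sensitive_col} saved.")
    chi2_result = chi_square_test(df_fair, sensitive_col, target)
    print(f"Chi-Square Test for {sensitive_col}:")
    print(chi2_result)
else:
    print(f"Column {sensitive_col} or target {target} not found in dataset.")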
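
To complement the plot and the chi-square test, it can help to tabulate the default rate and sample size of each group directly. The sketch below does this for an already-grouped column. The helper `default_rate_by_group_sketch`, the `age_group` column, and its labels are hypothetical and not part of the project code.

```python
import pandas as pd


def default_rate_by_group_sketch(df: pd.DataFrame, group_col: str, target: str) -> pd.DataFrame:
    """Hypothetical helper: per-group size, default rate, and gap to the overall rate."""
    rates = df.groupby(group_col)[target].agg(n="count", default_rate="mean")
    rates["diff_vs_overall"] = rates["default_rate"] - df[target].mean()
    return rates.sort_values("default_rate", ascending=False)


# Toy data, made up for this example
demo = pd.DataFrame({
    "age_group": ["18-30"] * 4 + ["31-50"] * 4 + ["51+"] * 2,
    "default": [1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
})
group_rates = default_rate_by_group_sketch(demo, "age_group", "default")
print(group_rates)
```

Keeping the group size `n` next to each rate matters because rates computed on very small groups are noisy and should not be over-interpreted.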

## 6. Key Findings and Next Steps

- **Key Predictors**: Based on correlations and chi-square tests, [e.g., loan_amnt, int_rate, grade] are likely strong predictors.
- **Data Quality**: [Summarize missing values, outliers handled].
- **Potential Biases**: [Note any concerning patterns in sensitive attributes].
- **Next Steps**: Use cleaned dataset (`data/processed/credit_data_cleaned.csv`) for feature engineering and model development.