# 1. Data Preparation

# 1.1 Generate Synthetic Data

To simulate a realistic scenario, we create a synthetic dataset of loan applications. Real-world data would be ideal, but a synthetic dataset is a useful proxy for demonstration and learning purposes. An alternative would be one of the sanitized credit analysis datasets available on sites like Kaggle. These contain real-world data, but most are stripped of context and come with an abundance of community solutions, which makes them less suitable for this demonstration exercise.

The dataset includes features typically used in credit default models: Income, Loan_Amount, Credit_Score, Employment_Status, Debt_to_Income, Loan_Term, Age, Home_Ownership, and the target variable Default.

We begin the synthetic data creation by asking OpenAI's GPT-4 to propose explicit mean/mode and skewness assumptions for each feature, based on figures from reputable organizations such as Statistics Finland and the Bank of Finland. AI-generated content carries an inherent risk of hallucination and inaccuracy. For this demonstration exercise, we accept a quick inspection and leave more detailed expert verification as a point for further improvement.

For real-life datasets, conducting thorough due diligence, including consulting multiple data sources and verifying the accuracy of the assumptions, is essential. This synthetic dataset serves as an illustrative example and should not be used as a substitute for actual financial data in professional applications.
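Part of the quick inspection mentioned above can be automated by comparing each stated mean with the mean implied by the stated distribution parameters. The following is a minimal sketch, not part of the original workflow, that checks two of the assumptions listed in the table in this section (Debt_to_Income and Loan_Term). The variable names are illustrative.

```python
from scipy import stats

# Debt_to_Income: Beta(alpha = 2, beta = 5) scaled to [0, 1.2], as listed in the table.
# scipy's loc/scale arguments shift and stretch the standard Beta on [0, 1].
dti_dist = stats.beta(a=2, b=5, loc=0, scale=1.2)
implied_dti_mean = dti_dist.mean()  # equals 1.2 * alpha / (alpha + beta)

# Loan_Term: discrete probabilities as listed in the table.
loan_term_p = [0.1, 0.2, 0.3, 0.1, 0.3]
p_sum = sum(loan_term_p)                          # should be 1 for a valid distribution
n_modes = loan_term_p.count(max(loan_term_p))     # how many entries share the top probability

print(f"Implied Debt_to_Income mean: {implied_dti_mean:.3f} (stated mean: 0.6)")
print(f"Loan_Term probabilities sum to {p_sum:.2f}, with {n_modes} tied mode(s)")
```

A Beta(2, 5) distribution has mean 2/7, so scaling it to [0, 1.2] gives an implied mean of roughly 0.343, which does not match the stated mean of 0.6. The Loan_Term probabilities also contain two entries tied at 0.3, so the stated single mode is ambiguous. Discrepancies like these are the kind of issue a more detailed verification would need to resolve.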

The table below summarizes the skewness and mean/mode assumptions used to generate each feature in the dataset:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats
from scipy.stats import lognorm, norm, beta, truncnorm
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
import pickle


# Assumptions for each feature, collected in a dictionary of equal-length columns
metadata = {
    'Feature': [
        'Income', 
        'Loan_Amount', 
        'Credit_Score', 
        'Employment_Status', 
        'Debt_to_Income', 
        'Loan_Term', 
        'Age', 
        'Home_Ownership', 
        'Default'
    ],
    'Skewness Description': [
        'Positively skewed. Higher concentration of individuals with lower incomes, with fewer high-income earners.',
        'Positively skewed. Loan amounts are concentrated towards lower to moderate values, with fewer large loan amounts.',
        'Negatively skewed. Most people have good to excellent credit scores, with fewer individuals having very low credit scores.',
        'Binary distribution. Higher proportion of the population being employed, heavily weighted towards employment.',
        'Positively skewed. More individuals with lower ratios, but significant instances of high debt relative to income.',
        'Discrete, non-uniform. Distributed across specific intervals with peaks at common loan terms like 20 and 30 years.',
        'Approximately normal. Centered around peak working ages (30-45 years), with fewer young and old applicants.',
        'Binary distribution. Higher proportion of the population owning homes, heavily weighted towards ownership.',
        'Positively skewed. Default rates are typically low, with a small percentage representing defaults.'
    ],
    'Mean/Mode Assumptions and Distribution Details': [
        'Mean: €3994, Distribution: Log-normal with σ = 0.5',
        'Mean: €175,000, Distribution: Log-normal with σ = 0.5',
        'Mean: 700, Mode: 750, Distribution: Normal (reversed) with μ = 100 and σ = 50',
        'Mode: Employed, Distribution: Binary with p = 0.9',
        'Mean: 0.6, Distribution: Beta (α = 2, β = 5) scaled to [0, 1.2]',
        'Mode: 20 years, Distribution: Discrete with p = [0.1, 0.2, 0.3, 0.1, 0.3]',
        'Mean: 35, Distribution: Normal with μ = 35 and σ = 10, clipped to [18, 75]',
        'Mode: Own, Distribution: Binary with p = 0.7',
        'Mean: 0.15, Distribution: Top 15% of risk scores'  # Seems high for Finland, but we keep it high because the dataset lacks other features.
    ]
}

# Build the metadata DataFrame from the dictionary
metadata_df = pd.DataFrame(metadata)

# Display the metadata table (rendered as the notebook cell output)
metadata_df
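
To illustrate how the assumptions in the table could translate into samples, here is a minimal sketch for a subset of the features. It is not the generation code used later in this notebook. It assumes that the stated "Mean" for the log-normal features refers to the arithmetic mean, so the log-scale parameter is derived as μ = ln(mean) − σ²/2. The sample size, seed, and variable names are arbitrary choices for illustration. Credit_Score, Loan_Term, and Default are left out because the table does not fully specify how they are constructed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n = 10_000                       # hypothetical sample size

# Log-normal features: E[X] = exp(mu + sigma^2 / 2), so mu = ln(mean) - sigma^2 / 2.
sigma = 0.5
income = rng.lognormal(mean=np.log(3994) - sigma**2 / 2, sigma=sigma, size=n)
loan_amount = rng.lognormal(mean=np.log(175_000) - sigma**2 / 2, sigma=sigma, size=n)

# Binary features: 1 = employed / owns home, using the probabilities from the table.
employed = rng.binomial(1, 0.9, size=n)
home_owner = rng.binomial(1, 0.7, size=n)

# Age: Normal(35, 10), clipped to [18, 75] and rounded to whole years.
age = np.clip(rng.normal(35, 10, size=n), 18, 75).round().astype(int)

# Debt_to_Income: Beta(2, 5) on [0, 1], stretched to [0, 1.2].
debt_to_income = rng.beta(2, 5, size=n) * 1.2

sample_df = pd.DataFrame({
    'Income': income,
    'Loan_Amount': loan_amount,
    'Employment_Status': employed,
    'Age': age,
    'Home_Ownership': home_owner,
    'Debt_to_Income': debt_to_income,
})

# Summary statistics to compare against the stated assumptions.
print(sample_df.describe().round(2))
```

Comparing the summary statistics with the table gives a first indication of whether each sampled feature behaves as the assumptions intend.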
