# Approach

- I started by identifying all available parameters related to credit behavior.
- From these, I selected 5 high-impact parameters that strongly influence credit score dynamics.
- I manually labeled data based on these selected parameters by assigning meaningful weights and defining clear thresholds for classification.
- For the remaining parameters, I used realistic statistical distributions and applied rule-based labeling logic to generate diverse samples.
- I also injected outliers intentionally to capture rare but critical edge cases.
- Finally, I added controlled noise to the dataset to simulate real-world variability and reduce the risk of model overfitting.


# Final Feature Selection for Credit Score Modeling

This table outlines the selected features used to simulate and classify credit score behavior. Each parameter was retained based on its direct impact on financial risk, repayment ability, or behavioral pattern, contributing to a 3-class classification model: likely to increase, decrease, or remain stable.

| Parameter                                             | Reason                                                            |
| ----------------------------------------------------- | ----------------------------------------------------------------- |
| `age`                                                 | Credit behavior often varies by age group                         |
| `monthly_income`                                      | Key for affordability and risk ratios                             |
| `monthly_emi_ratio`                                   | Strong indicator of credit burden                                 |
| `current_total_outstanding`                           | Core to risk exposure                                             |
| `credit_utilization_ratio`                            | One of the most critical predictors of score                      |
| `num_open_loans`                                      | Captures current liabilities                                      |
| `repayment_history_score (0–100)`                     | Encapsulates past behavior — very predictive                      |
| `dpd_last_3_months`                                   | Measures immediate delinquency risk                               |
| `num_hard_inquiries_last_6m`                          | Recent credit-seeking behavior — indicative of potential distress |
| `recent_credit_card_usage (last 3 months)`            | Shows current behavior — high relevance                           |
| `recent_loan_disbursed_amount`                        | Adds context to inquiries and credit behavior                     |
| `total_credit_limit`                                  | Needed to interpret utilization and exposure                      |
| `months_since_last_default`                           | Signals time passed since negative behavior                       |
| `Recent_lon_type 3 months`                            | Captures product-level risk or trend changes                      |
| `Amount owed to deliquency`                           | Adds depth to `Delinquency` info                                  |
| `Time since default`                                  | Add temporal relevance to negative flags                          |
| `Credit history`                                      | Age of credit file — important for maturity and experience        |
| `num_of_opened_accounts recently`                     | Can signal aggressive credit-seeking                              |
| `mum of different credit accounts`                    | Diversity in credit mix — relevant for scoring models             |
| `percentage of outstanding debt installment/rotation` | Breaks down the nature of debt — useful in risk modeling          |
| `installment_type` (e.g. personal, educational, home) | Personal loans carry higher default risk than secured loans       |

This final list ensures a balance between behavioral, financial, and credit history variables to simulate realistic and diverse credit profiles.


In [1]:
import numpy as np
import pandas as pd
from tabulate import tabulate


In [2]:
# Parameter ranges for data generation
parameter_ranges = {
    'age': {
        'min': 21,
        'max': 65,
        'type': 'int',
        'effect': 'positive'
    },
    'monthly_income': {
        'min': 10000,
        'max': 500000,
        'type': 'float',
        'effect': 'positive'
    },
    'monthly_emi_ratio': {
        'min': 0,
        'max': 1,
        'type': 'float',
        'effect': 'negative'
        
    },
    'current_total_outstanding': {
        'min': 0,
        'max': 5000000,
        'type': 'float',
        'effect': 'negative'
    },
    'credit_utilization_ratio': {
        'min': 0.0,
        'max': 1.0,  
        'type': 'float',
        'effect': 'negative'
    },

    'num_open_loans': {
        'min': 0,
        'max': 20.0,  
        'type': 'int',
        'effect': 'negative'
    },
    'repayment_score': {
        'min': 0,
        'max': 100,
        'type': 'float',
        'effect': 'positive'
    },
    'dpd_last_3m': {
        'min': 0,
        'max': 90,
        'type': 'int',
        'effect': 'positive'
    },
    'num_hard_inquiries_last_6m': {
        'min': 0,
        'max': 30,
        'type': 'int',
        'effect': 'negative'
    },
    'recent_credit_card_usage': {
        'min': 0,
        'max': 90,
        'type': 'float',
        'effect': 'positive'
    },
    'recent_loan_disbursed_amount': {
        'min': 0,
        'max': 5000000,
        'type': 'float',
        'effect': 'positive'
    },
    'total_credit_limit': {
        'min': 50000,
        'max': 2000000,
        'type': 'float',
        'effect': 'positive'
    },
    'time_since_default': {
        'min': 0,
        'max': 999,  # 999 represents "never defaulted"
        'type': 'int',
        'effect': 'positive'
    },
    'credit_history':{
        'min': 0,
        'max': 44,
        'type':float,
        'effect':'increase'
    }
}
df = pd.DataFrame()

Using normalize because the max values might be very large.

In [3]:
def normalize_log(x, min_val, max_val, epsilon=1e-5):
    """
    Normalizes a value using log-scaling and min-max to bring result between 0 and 1.

    Args:
        x (float): The raw value to normalize.
        min_val (float): Minimum possible value of the feature.
        max_val (float): Maximum possible value of the feature.
        epsilon (float): Small value to prevent log(0).

    Returns:
        float: Normalized value in the range [0, 1].
    """
    x_log = np.log1p(x + epsilon)         # log(1 + x) for stability
    min_log = np.log1p(min_val + epsilon)
    max_log = np.log1p(max_val + epsilon)
    
    return (x_log - min_log) / (max_log - min_log)


# Labeling function
def assign_label(score):
    if score <= 0.45:
        return "Decrease"
    elif score >= 0.55:
        return "Increase"
    else:
        return "Stable"

## Writing for high impact factors


In [4]:

high_impact = {'dpd_last_3m':0.30,
          'current_utilization_ratio':0.25,
          'repayment_score':0.20,
          'monthly_emi_ratio':0.15,
          'time_since_default':0.10
          }


# Data generation
n_increase_samples = 2500
data = []

for _ in range(n_increase_samples):
    sample = {}
    final_score = 0
    
    for param, meta in parameter_ranges.items():
        if param not in high_impact:
            if meta['effect'] == 'positive':
                low = meta['min']
                high = meta['min'] + 0.1 * (meta['max'] - meta['min'])  # low values
            if meta['effect'] == 'negative':
                low = meta['min'] + 0.9 * (meta['max'] - meta['min'])  # high values
                high = meta['max']


            value = np.round(np.random.uniform(low, high), 4)
            sample[param] = value
            
        else :
            # For increase-bias:
            if meta['effect'] == 'positive':
                low = meta['min'] + 0.9 * (meta['max'] - meta['min'])
                high = meta['max']
            if meta['effect'] == 'negative':
                low = meta['min'] 
                high = meta['min'] +  0.1 * (meta['max'] - meta['min'])
            value = np.round(np.random.uniform(low, high), 4)

            
            sample[param] = value
            norm_value = normalize_log(value, meta['min'], meta['max'])
            
            # Invert scoring logic for negative indicators
            if meta['effect'] == 'negative':
                norm_value = 1 - norm_value
            final_score += norm_value * high_impact[param]

    sample['impact_score'] = round(final_score, 4)
    sample['label'] = assign_label(final_score)
    data.append(sample)

# Convert to DataFrame
df1 = pd.DataFrame(data)

print(tabulate(df1.head(100), headers='keys', tablefmt='grid'))
df = pd.concat([df,df1],ignore_index = True)
print("df.shape  = ", df.shape)
# print(tabulate(df.head(100), headers='keys', tablefmt='grid'))



+----+---------+------------------+---------------------+-----------------------------+----------------------------+------------------+-------------------+---------------+------------------------------+----------------------------+--------------------------------+----------------------+----------------------+------------------+----------------+----------+
|    |     age |   monthly_income |   monthly_emi_ratio |   current_total_outstanding |   credit_utilization_ratio |   num_open_loans |   repayment_score |   dpd_last_3m |   num_hard_inquiries_last_6m |   recent_credit_card_usage |   recent_loan_disbursed_amount |   total_credit_limit |   time_since_default |   credit_history |   impact_score | label    |
|  0 | 22.1478 |          57949.7 |              0.0008 |                 4.63279e+06 |                     0.9819 |          18.9231 |           94.5163 |       86.1824 |                      28.127  |                     6.2984 |                      215416    |             193099 

In [5]:
n_decrease_samples = 2500
data = []

for _ in range(n_decrease_samples):
    sample = {}
    final_score = 0
    
    for param, meta in parameter_ranges.items():
        if param not in high_impact:
            if meta['effect'] == 'positive':
                low = meta['min'] 
                high = meta['min'] +  0.1 * (meta['max'] - meta['min'])        
            if meta['effect'] == 'negative':
                low = meta['min'] + 0.9 * (meta['max'] - meta['min'])
                high = meta['max']
            value = np.round(np.random.uniform(low, high), 4)
            sample[param] = value
            
        else :
            # For decrease-bias:
            if meta['effect'] == 'positive':
                low = meta['min'] 
                high = meta['min'] +  0.1 * (meta['max'] - meta['min'])
            if meta['effect'] == 'negative':
                low = meta['min'] + 0.9 * (meta['max'] - meta['min'])
                high = meta['max']
            value = np.round(np.random.uniform(low, high), 4)

            
            sample[param] = value
            norm_value = normalize_log(value, meta['min'], meta['max'])
            
            # Invert scoring logic for negative indicators
            if meta['effect'] == 'negative':
                norm_value = 1 - norm_value
            final_score += norm_value * high_impact[param]

    sample['impact_score'] = round(final_score, 4)
    sample['label'] = assign_label(final_score)
    data.append(sample)

# Convert to DataFrame
df2 = pd.DataFrame(data)

print(tabulate(df2.head(100), headers='keys', tablefmt='grid'))
df = pd.concat([df,df2],ignore_index = True)
print("df.shape  = ", df.shape)

+----+---------+------------------+---------------------+-----------------------------+----------------------------+------------------+-------------------+---------------+------------------------------+----------------------------+--------------------------------+----------------------+----------------------+------------------+----------------+----------+
|    |     age |   monthly_income |   monthly_emi_ratio |   current_total_outstanding |   credit_utilization_ratio |   num_open_loans |   repayment_score |   dpd_last_3m |   num_hard_inquiries_last_6m |   recent_credit_card_usage |   recent_loan_disbursed_amount |   total_credit_limit |   time_since_default |   credit_history |   impact_score | label    |
|  0 | 21.6787 |          16118.5 |              0.9939 |                 4.61659e+06 |                     0.9635 |          18.9592 |            1.7698 |        4.1773 |                      28.5182 |                     3.6099 |                      230191    |              71193.

## Adding other 15000 values.
- Defining distributions for each parameter and then sampling from those distributions.
# Parameter Distributions for Credit Score Dataset

The following table outlines the statistical distributions used to generate realistic values for each parameter. These distributions were chosen to match real-world credit behavior patterns and ensure the generated data maintains realistic statistical properties.

# Recommended Parameter Distributions

| Parameter |  Distribution | Reason |
|-----------|------------------------|--------|
| `age` | Truncated Normal (mean=32, std=8, range 21–65) | Most borrowers are working age, slightly younger demographic for Indian context |
| `monthly_income` | Log-normal (μ=10.3, σ=0.65) | Income is positively skewed, gives realistic median ~₹30K for Indian market |
| `monthly_emi_ratio` | Beta (α=2, β=5) scaled to [0,1] | Most people have low-to-moderate EMI burdens, bounded ratio |
| `current_total_outstanding` | Gamma (shape=2, scale=100000) | Heavy right tail — debt can vary drastically |
| `credit_utilization_ratio` | Beta (α=2, β=5) scaled to [0,1] | Utilization typically skews low, spikes when risky |
| `num_open_loans` | Poisson (λ=2) clipped to [0, 20] | Count data, most people have < 5 loans |
| `repayment_score` | Beta (α=7, β=3) scaled to [0,100] | Most borrowers score well, more realistic skew toward higher scores |
| `dpd_last_3m` | Zero-inflated Poisson (λ=0.5) | Most borrowers have 0 DPD, occasional short delays |
| `num_hard_inquiries_last_6m` | Poisson (λ=1) | Rarely exceed 3–4, fits count nature |
| `recent_credit_card_usage` | Beta (α=1.5, β=3) scaled to [0,100] | Bimodal usage patterns, many low users with some heavy users |
| `recent_loan_disbursed_amount` | Gamma (shape=2, scale=200000) | Loan amounts vary widely, right-skewed |
| `total_credit_limit` | Log-normal (μ=12.8, σ=0.8) | Credit limits are right-skewed, median ~₹3.6L realistic |
| `time_since_default` | Mixture: 70% never (999) + 30% Exp(scale=180) | Most never defaulted, realistic timing for those who did |
| `credit_history` | Gamma (shape=2, scale=3) clipped to [0, 25] | Right-skewed, younger borrowers have shorter history |



## Distribution Parameters

- **Truncated Normal**: Good for parameters that follow a normal distribution but have strict bounds
- **Log-normal**: Good for financial parameters that are positively skewed
- **Beta**: Good for ratios and proportions that are bounded between 0 and 1
- **Gamma**: Good for parameters with heavy right tails
- **Poisson**: Good for count data with rare occurrences
- **Exponential**: Good for time-based parameters with decreasing probability

I have taken inspiration of how FICO score is calculater.A FICO credit score is calculated based on five key factors: payment history (35%), amount owed (30%), length of credit history (15%), new credit (10%), and credit mix (10%). Source : https://www.myfico.com/credit-education/whats-in-your-credit-score

In [6]:
import scipy.stats as stats
impact = {
    'credit_utilization_ratio': 0.20,
    'dpd_last_3m': 0.15,
    'repayment_score': 0.15,
    'current_total_outstanding': 0.10,
    'num_open_loans': 0.05,
    'credit_history': 0.10,
    'age': 0.05,
    'num_hard_inquiries_last_6m': 0.04,
    'recent_credit_card_usage': 0.03,
    'recent_loan_disbursed_amount': 0.03,
    'monthly_emi_ratio': 0.04,
    'total_credit_limit': 0.03,
    'monthly_income': 0.02,
    'time_since_default': 0.01
}

class ZeroInflatedPoisson:
    def __init__(self, lam, zero_prob):
        self.lam = lam
        self.zero_prob = zero_prob
        self.poisson = stats.poisson(lam)
    
    def rvs(self, size=1):
        is_zero = np.random.random(size) < self.zero_prob
        poisson_values = self.poisson.rvs(size)
        return np.where(is_zero, 0, poisson_values)

distribution = {
    'credit_utilization_ratio': stats.beta(2, 5),
    'dpd_last_3m': ZeroInflatedPoisson(0.5, 0.7), # 70% people have 0 dpd
    'repayment_score': stats.beta(7, 3),
    'current_total_outstanding': stats.gamma(2, scale=100000),
    'num_open_loans': stats.poisson(2),
    'credit_history': stats.gamma(2, scale=3),
    'age': stats.truncnorm((21-32)/8, (65-32)/8, loc=32, scale=8),
    'num_hard_inquiries_last_6m': stats.poisson(1),
    'recent_credit_card_usage': stats.beta(1.5, 3),
    'recent_loan_disbursed_amount': stats.gamma(2, scale=200000),
    'monthly_emi_ratio': stats.beta(2, 5),
    'total_credit_limit': stats.lognorm(s=0.8, scale=np.exp(12.8)),
    'monthly_income': stats.lognorm(s=0.65, scale=np.exp(10.3)),
    'time_since_default': stats.gamma(2, scale=3)
}

n_samples = 15000
data = []
for _ in range(n_samples):
    sample = {}
    final_score = 0
    
    for param, meta in parameter_ranges.items():
            if param == 'dpd_last_3m':
                value = distribution[param].rvs()
            elif param == 'repayment_score':
                value = distribution[param].rvs() * 100
            elif param == 'recent_credit_card_usage':
                value = distribution[param].rvs() * 90
            elif param == 'num_open_loans':
                value = np.clip(distribution[param].rvs(), 0, 20)
            elif param == 'credit_history':
                value = np.clip(distribution[param].rvs(), 0, 44)
            else:
                value = distribution[param].rvs()
            
            sample[param] = value
            norm_value = normalize_log(value, meta['min'], meta['max'])
            
            # Invert scoring logic for negative indicators
            if meta['effect'] == 'negative':
                norm_value = 1 - norm_value
            final_score += norm_value * impact[param]

    sample['impact_score'] = round(float(final_score), 4)
    sample['label'] = assign_label(final_score)
    data.append(sample)

df3 = pd.DataFrame(data)
print(tabulate(df3.head(100), headers='keys', tablefmt='grid'))
df = pd.concat([df,df3],ignore_index = True)
print("df.shape  = ", df.shape)

  sample['impact_score'] = round(float(final_score), 4)


+----+---------+------------------+---------------------+-----------------------------+----------------------------+------------------+-------------------+---------------+------------------------------+----------------------------+--------------------------------+----------------------+----------------------+------------------+----------------+----------+
|    |     age |   monthly_income |   monthly_emi_ratio |   current_total_outstanding |   credit_utilization_ratio |   num_open_loans |   repayment_score |   dpd_last_3m |   num_hard_inquiries_last_6m |   recent_credit_card_usage |   recent_loan_disbursed_amount |   total_credit_limit |   time_since_default |   credit_history |   impact_score | label    |
|  0 | 38.9034 |         48462.6  |           0.367077  |                   275817    |                  0.145939  |                1 |           75.5864 |             0 |                            2 |                  27.2111   |               157340           |     346436         

## Generating outlier values

In [7]:
import random
def calculate_outlier_val(meta, bias):

    min_val = meta['min']
    max_val = meta['max']

    # Ensure min_val is not greater than max_val for calculation stability
    if min_val > max_val:
        min_val, max_val = max_val, min_val

    q1_percentile = random.uniform(0.01, 0.24)
    q3_percentile = random.uniform(0.76, 0.99)

    if hasattr(distribution[param], 'ppf'):
        # Use the distribution's ppf (percent point function) to get quartiles
        Q1 = distribution[param].ppf(q1_percentile)
        Q3 = distribution[param].ppf(q3_percentile)
    else:
        # Fallback to estimating Q1 and Q3 assuming a uniform distribution
        Q1 = min_val + q1_percentile * (max_val - min_val)
        Q3 = min_val + q3_percentile * (max_val - min_val)

    IQR = Q3 - Q1

    if bias == "increase":
        # Lower bound for random outlier: the 1.5 IQR upper fence
        lower_bound_outlier = Q3 + 1.5 * IQR
        # Upper bound for random outlier: extend by another IQR from the lower bound
        # This creates a sensible range for the random outlier beyond the fence.
        upper_bound_outlier = lower_bound_outlier + IQR
        # Ensure the bounds are reasonable, especially if IQR is very small
        if upper_bound_outlier <= lower_bound_outlier:
            upper_bound_outlier = lower_bound_outlier + 1.0 # Add a small fixed value if range is too small
        return random.uniform(lower_bound_outlier, upper_bound_outlier)
    elif bias == "decrease":
        # Upper bound for random outlier: the 1.5 IQR lower fence
        upper_bound_outlier = Q1 - 1.5 * IQR
        # Lower bound for random outlier: extend by another IQR below the upper bound
        # This creates a sensible range for the random outlier below the fence.
        lower_bound_outlier = upper_bound_outlier - IQR
        # Ensure the bounds are reasonable, especially if IQR is very small
        if lower_bound_outlier >= upper_bound_outlier:
            lower_bound_outlier = upper_bound_outlier - 1.0 # Subtract a small fixed value if range is too small
        return max(random.uniform(lower_bound_outlier, upper_bound_outlier), min_val)
# Generate outlier records
outlier_data = []
n_outliers = 5000
for i in range(n_outliers):
    sample = {}
    final_score = 0
    bias = "increase" if i < n_outliers // 2 else "decrease"
    for param, meta in parameter_ranges.items():
        for param, meta in parameter_ranges.items():
            val = calculate_outlier_val(meta, bias)        
            sample[param] = val
        norm_value = normalize_log(value, meta['min'], meta['max'])
            
            # Invert scoring logic for negative indicators
        if meta['effect'] == 'negative':
            norm_value = 1 - norm_value
        final_score += norm_value * impact[param]
    
    sample['impact_score'] = round(float(final_score), 4)
    sample['label'] = assign_label(final_score)
    outlier_data.append(sample)
df4 = pd.DataFrame(outlier_data)

print(tabulate(df4.head(100), headers='keys', tablefmt='grid'))
df = pd.concat([df,df4],ignore_index = True)
print("df.shape  = ", df.shape)

+----+----------+------------------+---------------------+-----------------------------+----------------------------+------------------+-------------------+---------------+------------------------------+----------------------------+--------------------------------+----------------------+----------------------+------------------+----------------+----------+
|    |      age |   monthly_income |   monthly_emi_ratio |   current_total_outstanding |   credit_utilization_ratio |   num_open_loans |   repayment_score |   dpd_last_3m |   num_hard_inquiries_last_6m |   recent_credit_card_usage |   recent_loan_disbursed_amount |   total_credit_limit |   time_since_default |   credit_history |   impact_score | label    |
|  0 | 100.687  |         138666   |            1.12873  |            776858           |                   1.28977  |          6.61945 |           1.50827 |       212.961 |                      6.74764 |                    1.34937 |                    1.28379e+06 |          4.97529

Gaussian noise is added to each value to introduce realism and variability, while keeping impact_score unchanged to preserve label logic and avoid boundary shifts.

In [8]:

df = df.sample(frac=1, random_state=42).reset_index(drop=True)

exclude_cols = ['age', 'impact_score', 'label']


for col in df.columns:
    if col not in exclude_cols and pd.api.types.is_numeric_dtype(df[col]):
        std_dev = df[col].std()
        noise = np.random.normal(0, 0.05 * std_dev, size=df.shape[0])
        df[col] = df[col] + noise

df['dpd_last_3m'] = df['dpd_last_3m'].astype(float)
df.drop('impact_score',axis = 1)
df.to_csv("dataset.csv", index=False)
