## Alternative Behavioral Dataset Creation

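Traditional credit datasets rarely include information on digital financial
behavior. To simulate the alternative data used in modern credit scoring
systems, a synthetic behavioral dataset was created, keyed on `SK_ID_CURR`.
It covers digital payment activity (UPI transaction counts, online purchase
frequency), mobile usage (recharge amount, data consumption), and bill
payment discipline (delayed payments). The feature choices are inspired by
publicly available behavioral datasets and industry research. Each feature
is sampled independently of the applicant's record, so the values
demonstrate the pipeline rather than carry real predictive signal.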


In [3]:
import pandas as pd
import numpy as np

app = pd.read_csv("../data/home-credit-default-risk/application_train_cleaned.csv")
np.random.seed(42)


In [4]:
alt_data = pd.DataFrame({
    "SK_ID_CURR": app["SK_ID_CURR"],
    
    # Monthly digital behavior
    "avg_monthly_recharge": np.random.normal(400, 150, len(app)).clip(50, 2000),
    "upi_txn_count": np.random.poisson(25, len(app)),
    "online_purchase_freq": np.random.poisson(6, len(app)),
    
    # Payment discipline
    "delayed_bill_payments": np.random.poisson(1.5, len(app)),
    
    # Mobile usage
    "avg_monthly_data_gb": np.random.normal(12, 5, len(app)).clip(1, 50)
})
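
The sampling step above relies on NumPy's global seed, so its results depend on every random call that ran earlier in the session. A minimal sketch of the same draws using a local `np.random.default_rng` generator is shown below. The helper name `make_alt_features` and the toy ID range are assumptions for illustration, and the generator produces different numbers than `np.random.seed(42)`, so the statistics reported later in this notebook would not be reproduced exactly.

```python
import numpy as np
import pandas as pd


def make_alt_features(ids, seed=42):
    """Draw synthetic behavioral features for the given applicant IDs.

    Hypothetical helper: same distributions as the notebook cell, but
    sampled from a local Generator instead of NumPy's global state.
    """
    rng = np.random.default_rng(seed)
    n = len(ids)
    return pd.DataFrame({
        "SK_ID_CURR": np.asarray(ids),
        # Monthly digital behavior
        "avg_monthly_recharge": rng.normal(400, 150, n).clip(50, 2000),
        "upi_txn_count": rng.poisson(25, n),
        "online_purchase_freq": rng.poisson(6, n),
        # Payment discipline
        "delayed_bill_payments": rng.poisson(1.5, n),
        # Mobile usage
        "avg_monthly_data_gb": rng.normal(12, 5, n).clip(1, 50),
    })


# Toy IDs for demonstration only; the notebook uses app["SK_ID_CURR"].
demo = make_alt_features(range(100001, 100011))
print(demo.shape)
```

Calling the helper twice with the same seed returns identical frames, regardless of what other random calls ran in between.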


In [5]:
alt_data["digital_engagement_score"] = (
    0.3 * alt_data["upi_txn_count"] +
    0.2 * alt_data["online_purchase_freq"] +
    0.2 * alt_data["avg_monthly_data_gb"] -
    0.3 * alt_data["delayed_bill_payments"]
)
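
Because the weights are applied to raw columns with different scales, the weights alone do not determine how much each feature moves the score. The sketch below, run on a small sample drawn from the same distributions, compares the spread of each weighted term and shows one possible alternative: standardizing the columns before weighting. The z-score variant is an assumption for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Small sample drawn from the same distributions as the notebook cell
sample = pd.DataFrame({
    "upi_txn_count": rng.poisson(25, n),
    "online_purchase_freq": rng.poisson(6, n),
    "avg_monthly_data_gb": rng.normal(12, 5, n).clip(1, 50),
    "delayed_bill_payments": rng.poisson(1.5, n),
})

# Weights from the score definition (negative for delayed payments)
weights = pd.Series({
    "upi_txn_count": 0.3,
    "online_purchase_freq": 0.2,
    "avg_monthly_data_gb": 0.2,
    "delayed_bill_payments": -0.3,
})

# Spread contributed by each weighted term on the raw scale
term_std = (sample * weights).std()
print(term_std.sort_values(ascending=False))

# Alternative (assumption): z-score each column so the weights
# reflect relative importance directly
z = (sample - sample.mean()) / sample.std()
balanced_score = (z * weights).sum(axis=1)
```

On the raw scale, `upi_txn_count` contributes the largest spread, so it dominates the score more than its 0.3 weight suggests.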


In [6]:
# Min-max scale the score to the range [0, 1]
alt_data["digital_engagement_score"] = (
    alt_data["digital_engagement_score"] - alt_data["digital_engagement_score"].min()
) / (
    alt_data["digital_engagement_score"].max() - alt_data["digital_engagement_score"].min()
)
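
The rescaling above is min-max normalization, which maps the smallest score to 0 and the largest to 1. A minimal sketch of the same operation as a reusable function follows. The name `min_max_scale` and the choice to return zeros for a constant column are assumptions for illustration.

```python
import pandas as pd


def min_max_scale(s: pd.Series) -> pd.Series:
    """Rescale a numeric Series to [0, 1] (hypothetical helper)."""
    lo, hi = s.min(), s.max()
    if hi == lo:
        # Constant column: avoid dividing by zero, return all zeros
        return pd.Series(0.0, index=s.index, name=s.name)
    return (s - lo) / (hi - lo)


scaled = min_max_scale(pd.Series([2.0, 4.0, 6.0]))
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

Note that min-max scaling is sensitive to outliers: a single extreme score compresses all other values toward one end of the range.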


In [7]:
# A cell renders only its last expression, so describe() is called on its own
alt_data.describe()


| | SK_ID_CURR | avg_monthly_recharge | upi_txn_count | online_purchase_freq | delayed_bill_payments | avg_monthly_data_gb | digital_engagement_score |
|---|---|---|---|---|---|---|---|
| count | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 |
| mean | 278180.518577 | 400.461376 | 24.996348 | 5.996403 | 1.499992 | 12.039888 | 0.470615 |
| std | 102790.175348 | 148.646945 | 4.99663 | 2.454982 | 1.224026 | 4.936201 | 0.116184 |
| min | 100002.0 | 50.0 | 6.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 189145.5 | 298.776925 | 22.0 | 4.0 | 1.0 | 8.642922 | 0.391257 |
| 50% | 278202.0 | 399.916117 | 25.0 | 6.0 | 1.0 | 12.018242 | 0.468577 |
| 75% | 367142.5 | 501.097427 | 28.0 | 8.0 | 2.0 | 15.390091 | 0.547576 |
| max | 456255.0 | 1084.317209 | 51.0 | 21.0 | 10.0 | 37.169026 | 1.0 |
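
The Poisson columns allow a quick sanity check on this summary: a Poisson variable with rate λ has standard deviation √λ. The sketch below compares the standard deviations reported above against that expectation. The 1% tolerance is an arbitrary choice for illustration.

```python
import math

# Poisson rates used when generating the columns
rates = {
    "upi_txn_count": 25,
    "online_purchase_freq": 6,
    "delayed_bill_payments": 1.5,
}

# Standard deviations copied from the describe() output above
observed_std = {
    "upi_txn_count": 4.99663,
    "online_purchase_freq": 2.454982,
    "delayed_bill_payments": 1.224026,
}

# Relative gap between observed std and the theoretical sqrt(rate)
rel_errors = {}
for name, lam in rates.items():
    expected = math.sqrt(lam)
    rel_errors[name] = abs(observed_std[name] - expected) / expected

all_close = all(err < 0.01 for err in rel_errors.values())
print(rel_errors)
print(all_close)
```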


In [8]:
alt_data.isnull().sum()


SK_ID_CURR                  0
avg_monthly_recharge        0
upi_txn_count               0
online_purchase_freq        0
delayed_bill_payments       0
avg_monthly_data_gb         0
digital_engagement_score    0
dtype: int64

In [9]:
alt_data.to_csv("../data/alt_data.csv", index=False)
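
Since the saved file shares the `SK_ID_CURR` key with the application table, a later step could join the two. The sketch below uses toy frames with made-up values; `app_demo`, `alt_demo`, and the `TARGET` column are assumptions for illustration. The `validate="one_to_one"` argument makes pandas raise a `MergeError` if either side contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the application table and the saved alternative data
app_demo = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],  # made-up labels
})
alt_demo = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "digital_engagement_score": [0.41, 0.52, 0.47],  # made-up scores
})

# Left join keeps every applicant; validate guards against duplicate IDs
merged = app_demo.merge(
    alt_demo, on="SK_ID_CURR", how="left", validate="one_to_one"
)

# Every applicant should have received a score
missing = merged["digital_engagement_score"].isna().sum()
print(merged.shape, missing)
```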
