## Data Integration

This notebook integrates cleaned application-level data, aggregated bureau
records, aggregated previous application features, and synthetic alternative
behavioral indicators into a single applicant-level dataset. Each dataset is
left-joined onto the application table on the common applicant identifier
(`SK_ID_CURR`), so every applicant is retained in the final modeling table.
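
To illustrate what merging on `SK_ID_CURR` does to applicants without a matching record, here is a minimal sketch on toy data. The frames `app_toy` and `bureau_toy` and the column `BUREAU_LOAN_COUNT` are hypothetical stand-ins, not columns taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values and column names).
app_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_toy = pd.DataFrame({"SK_ID_CURR": [1, 3], "BUREAU_LOAN_COUNT": [4, 2]})

# A left join keeps every row of the left table, matched or not.
merged_toy = app_toy.merge(bureau_toy, on="SK_ID_CURR", how="left")

# Applicant 2 has no bureau record, so its bureau feature is NaN.
print(merged_toy)
```

All three applicants survive the join, and the unmatched one carries a missing value in the bureau column, which a later step has to handle.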


In [4]:
import pandas as pd

# Load the cleaned application table and the pre-aggregated feature tables.
app = pd.read_csv("../data/home-credit-default-risk/application_train_cleaned.csv")
bureau = pd.read_csv("../data/bureau_aggregated.csv")
prev = pd.read_csv("../data/previous_application_aggregated.csv")
alt = pd.read_csv("../data/alt_data.csv")


In [5]:
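# Left joins (here and in the next two cells) keep every applicant in `app`,
# even when the joined table has no matching row.
df = app.merge(bureau, on="SK_ID_CURR", how="left")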


In [6]:
df = df.merge(prev, on="SK_ID_CURR", how="left")


In [7]:
df = df.merge(alt, on="SK_ID_CURR", how="left")
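
The left joins above preserve the applicant row count only if `SK_ID_CURR` is unique in each joined table. A minimal sketch of a guard for that assumption follows. The helper `safe_left_merge` and the toy frames are hypothetical and not part of the original notebook; `validate="many_to_one"` is a standard `DataFrame.merge` option that raises `MergeError` when the right-hand key is duplicated.

```python
import pandas as pd

def safe_left_merge(left, right, key="SK_ID_CURR"):
    """Left-merge `right` onto `left`, failing loudly if rows would be duplicated."""
    # pandas raises MergeError if `key` is not unique in `right`.
    merged = left.merge(right, on=key, how="left", validate="many_to_one")
    # With a unique right-hand key, a left join cannot change the row count.
    assert len(merged) == len(left), "row count changed during merge"
    return merged

# Toy data (hypothetical column names and values).
left = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
ok = pd.DataFrame({"SK_ID_CURR": [1, 2], "PREV_APP_COUNT": [3, 5]})
dup = pd.DataFrame({"SK_ID_CURR": [1, 1], "PREV_APP_COUNT": [3, 5]})

result = safe_left_merge(left, ok)

try:
    safe_left_merge(left, dup)
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(result.shape, duplicate_detected)
```

If the aggregated tables were built with one row per applicant, the check passes silently. If not, it fails before duplicated rows can skew the target distribution.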


In [8]:
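# Fill missing values in all int64/float64 columns with 0. This covers the
# gaps introduced by the left joins; non-numeric columns are left untouched.
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns
df[numeric_cols] = df[numeric_cols].fillna(0)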
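
As a small illustration of what this fill does and does not touch, the sketch below applies the same two lines to a toy frame. The columns `BUREAU_LOAN_COUNT` and `CONTRACT_TYPE` are hypothetical examples, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric gap and one non-numeric gap (hypothetical columns).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "BUREAU_LOAN_COUNT": [4.0, np.nan, 2.0],
    "CONTRACT_TYPE": ["Cash loans", None, "Revolving loans"],
})

# Same selection and fill as in the cell above.
numeric_cols = toy.select_dtypes(include=["int64", "float64"]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(0)

# The numeric gap is now 0; the non-numeric gap is still missing.
remaining_nulls = toy.isnull().sum()
print(remaining_nulls)
```

Any nulls that remain after this step therefore sit in non-numeric columns. Note also that a filled 0 can no longer be told apart from a genuine zero.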


In [9]:
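# Sanity checks: table size, remaining nulls, and class balance.
# A notebook cell displays only its last expression, so only the TARGET
# proportions appear below; wrap the first two in print() to show them too.
df.shape
df.isnull().sum().sort_values(ascending=False).head(10)
df["TARGET"].value_counts(normalize=True)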


TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64

In [10]:
df.to_csv("../data/final_dataset.csv", index=False)