In [1]:
import pandas as pd

In [2]:
df_app_train = pd.read_csv('../data/raw/application_train.csv')
df_app_test  = pd.read_csv('../data/raw/application_test.csv')

df_bureau = pd.read_csv('../data/raw/bureau.csv')
df_bureau_balance = pd.read_csv('../data/raw/bureau_balance.csv')

df_prev_app = pd.read_csv('../data/raw/previous_application.csv')
df_pos_cash = pd.read_csv('../data/raw/POS_CASH_balance.csv')
df_installments = pd.read_csv('../data/raw/installments_payments.csv')
df_cc_balance = pd.read_csv('../data/raw/credit_card_balance.csv')

In [4]:
df_app_train.shape

(307511, 122)

The application_train table contains 307,511 rows and 122 columns
(including the SK_ID_CURR identifier and the TARGET label), where each
row represents a unique loan application.

In [5]:
df_app_train['SK_ID_CURR'].is_unique

True

SK_ID_CURR is unique in application_train, confirming that the dataset
is already at the correct modeling granularity.

In [6]:
df_app_train.isna().mean().sort_values(ascending=False).head(10)

COMMONAREA_MEDI             0.698723
COMMONAREA_AVG              0.698723
COMMONAREA_MODE             0.698723
NONLIVINGAPARTMENTS_MODE    0.694330
NONLIVINGAPARTMENTS_AVG     0.694330
NONLIVINGAPARTMENTS_MEDI    0.694330
FONDKAPREMONT_MODE          0.683862
LIVINGAPARTMENTS_MODE       0.683550
LIVINGAPARTMENTS_AVG        0.683550
LIVINGAPARTMENTS_MEDI       0.683550
dtype: float64

The ten columns with the highest missing rates are all housing-related
(e.g. COMMONAREA, LIVINGAPARTMENTS), with roughly 68% to 70% of values missing.
The AVG, MODE, and MEDI variants of each attribute share identical rates,
which suggests the missingness is structural or business-related rather than
random: an applicant without registered housing information lacks the whole
block of attributes at once. Such missing values may therefore carry
predictive meaning and should be treated carefully during feature engineering.
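One way to preserve that signal is to add explicit missing-value indicators before any imputation. The sketch below uses a small made-up frame in place of df_app_train; the flag and count column names (`*_MISSING`, `HOUSING_MISSING_COUNT`) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_app_train; the values are made up for illustration.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "COMMONAREA_AVG": [0.02, np.nan, np.nan, 0.05],
    "LIVINGAPARTMENTS_AVG": [np.nan, np.nan, 0.10, 0.08],
})

housing_cols = ["COMMONAREA_AVG", "LIVINGAPARTMENTS_AVG"]

# One binary flag per column records whether the value was absent.
for col in housing_cols:
    toy[f"{col}_MISSING"] = toy[col].isna().astype("int8")

# A row-level count summarizes how much housing information is absent.
toy["HOUSING_MISSING_COUNT"] = toy[housing_cols].isna().sum(axis=1)

print(toy)
```

Because the AVG, MODE, and MEDI variants are missing together, a single flag per attribute family (or the row-level count alone) may be enough in practice.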

In [7]:
df_app_train['DAYS_EMPLOYED'].value_counts().head()

DAYS_EMPLOYED
 365243    55374
-200         156
-224         152
-230         151
-199         151
Name: count, dtype: int64

DAYS_EMPLOYED contains an abnormal value, 365243, that occurs 55,374 times
(about 18% of rows), far more often than any other value. It is positive
while genuine values are negative, and it corresponds to roughly 1,000 years,
so it cannot be a real employment duration. This value is a known placeholder
in the Home Credit dataset for applicants without a formal employment
history. It should therefore be handled as a separate category or missing
indicator during feature engineering.
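A minimal sketch of that treatment, on a toy frame rather than the full table: keep a flag recording that the placeholder was present, then blank the placeholder out. The flag name `DAYS_EMPLOYED_ANOM` is a hypothetical choice.

```python
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

# Toy stand-in for df_app_train['DAYS_EMPLOYED'].
toy = pd.DataFrame({"DAYS_EMPLOYED": [365243, -200, -224, 365243]})

is_sentinel = toy["DAYS_EMPLOYED"] == SENTINEL

# Keep the information that the placeholder was present...
toy["DAYS_EMPLOYED_ANOM"] = is_sentinel.astype("int8")

# ...then replace it with NaN so it cannot distort means or scaling.
# mask() sets entries to NaN where the condition is True.
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].mask(is_sentinel)

print(toy)
```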

In [8]:
df_app_train['DAYS_BIRTH'].describe()

count    307511.000000
mean     -16036.995067
std        4363.988632
min      -25229.000000
25%      -19682.000000
50%      -15750.000000
75%      -12413.000000
max       -7489.000000
Name: DAYS_BIRTH, dtype: float64

DAYS_BIRTH is recorded as a negative day count relative to the application
date, so more negative values mean older applicants. Dividing by 365.25
gives ages from about 20.5 years (max, -7489) to about 69 years
(min, -25229), a realistic range with no obvious anomalies.
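The age range can be checked by converting the day counts to years. The sketch below applies the conversion to the min, median, and max reported by describe(); the column name `AGE_YEARS` and the 365.25 divisor are illustrative choices.

```python
import pandas as pd

# Min, median, and max of DAYS_BIRTH taken from the describe() output above.
toy = pd.DataFrame({"DAYS_BIRTH": [-25229, -15750, -7489]})

# Flip the sign and convert days to (approximate) years.
toy["AGE_YEARS"] = -toy["DAYS_BIRTH"] / 365.25

print(toy.round(1))
```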

In [9]:
df_bureau.shape

(1716428, 17)

In [10]:
df_bureau['SK_ID_CURR'].nunique()

305811

## bureau.csv

This table contains applicants’ historical credit records from external
financial institutions as reported to the Credit Bureau. Its 1,716,428 rows
cover 305,811 distinct SK_ID_CURR values (about 5.6 records per applicant),
so it has a one-to-many relationship with the application table via
SK_ID_CURR. It must therefore be aggregated to one row per applicant before
joining with the main application data.
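The aggregate-then-join pattern could look like the sketch below. It runs on toy frames; `AMT_CREDIT_SUM` is assumed from the public Home Credit schema (it is not shown in this notebook), and the output feature names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_app_train and df_bureau.
app = pd.DataFrame({"SK_ID_CURR": [100, 101, 102]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101],
    "SK_ID_BUREAU": [1, 2, 3],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],  # assumed column name
})

# Collapse bureau records to one row per applicant.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_CREDIT_SUM_TOTAL=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# validate= makes pandas raise if either side has duplicate keys,
# which guards against accidental row duplication.
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left", validate="one_to_one")

print(merged)
```

Applicants with no bureau history (102 here) get NaN in the aggregated columns, which is itself informative and can be flagged or filled with 0 depending on the feature.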

In [11]:
df_bureau_balance.shape

(27299925, 3)

In [12]:
df_bureau_balance['SK_ID_BUREAU'].nunique()

817395

## bureau_balance.csv

This table provides monthly balance and status information for credits
reported in the Credit Bureau. Each credit record is observed over multiple
months (about 33 rows per SK_ID_BUREAU on average), forming a time-series
structure. It links to bureau.csv via SK_ID_BUREAU rather than directly to
the application table, so it must first be aggregated per credit and then
rolled up per applicant before integration into the modeling dataset.
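Because bureau_balance is keyed by SK_ID_BUREAU, reaching the applicant level takes two aggregation steps. The sketch below illustrates this on toy data; `MONTHS_BALANCE` is assumed from the public Home Credit schema, and the derived feature names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_bureau and df_bureau_balance.
bureau = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101],
    "SK_ID_BUREAU": [1, 2, 3],
})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0, 0],  # assumed column name
})

# Step 1: one row per credit (SK_ID_BUREAU).
per_credit = (
    bureau_balance.groupby("SK_ID_BUREAU")
    .agg(MONTHS_OBSERVED=("MONTHS_BALANCE", "size"))
    .reset_index()
)

# Step 2: attach to bureau, then roll up to one row per applicant.
per_applicant = (
    bureau.merge(per_credit, on="SK_ID_BUREAU", how="left")
    .groupby("SK_ID_CURR")
    .agg(BB_MONTHS_OBSERVED_MEAN=("MONTHS_OBSERVED", "mean"))
    .reset_index()
)

print(per_applicant)
```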

In [13]:
df_prev_app.shape

(1670214, 37)

In [14]:
df_prev_app['SK_ID_CURR'].nunique()

338857

## previous_application.csv

This table contains historical loan application records submitted by the
same applicants to Home Credit prior to the current loan. Since each
applicant may have multiple previous applications, the table exhibits a
one-to-many relationship with the application table through SK_ID_CURR.
Relevant information should be summarized at the applicant level before
being used for modeling.

In [15]:
df_pos_cash.shape

(10001358, 8)

In [16]:
df_pos_cash['SK_ID_PREV'].nunique()

936325

## POS_CASH_balance.csv

This table records monthly balance snapshots for applicants’ previous
POS and cash loans issued by Home Credit. Each loan is tracked across
multiple months, making this a behavioral time-series table linked via
SK_ID_PREV. Direct joining would cause row duplication, so aggregation is
necessary prior to feature construction.
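The row duplication can be seen on a toy example. The sketch assumes the table also carries SK_ID_CURR and MONTHS_BALANCE, as in the public Home Credit schema (neither is shown in this notebook for this table); the aggregated feature names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_app_train and df_pos_cash.
app = pd.DataFrame({"SK_ID_CURR": [100, 101]})
pos = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 11],
    "SK_ID_CURR": [100, 100, 100, 101],   # assumed column
    "MONTHS_BALANCE": [-3, -2, -1, -1],   # assumed column
})

# Naive join: applicant 100 is repeated once per monthly snapshot.
naive = app.merge(pos, on="SK_ID_CURR", how="left")

# Aggregating first keeps exactly one row per applicant.
pos_agg = (
    pos.groupby("SK_ID_CURR")
    .agg(
        POS_SNAPSHOT_COUNT=("MONTHS_BALANCE", "size"),
        POS_LOAN_COUNT=("SK_ID_PREV", "nunique"),
    )
    .reset_index()
)
safe = app.merge(pos_agg, on="SK_ID_CURR", how="left", validate="one_to_one")

print(len(app), len(naive), len(safe))
```

In the naive result, any statistic computed per row (including the target rate) would be weighted by how many monthly snapshots an applicant has.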

In [17]:
df_installments.shape

(13605401, 8)

In [18]:
df_installments['SK_ID_PREV'].nunique()

997752

## installments_payments.csv

This table captures historical installment payment behavior for previously
disbursed Home Credit loans. It includes both actual payments and missed
installments, providing detailed insight into repayment behavior. Each
previous loan (SK_ID_PREV) has many installment rows, and SK_ID_PREV
identifies the previous loan rather than the current application, so the
information must be aggregated to the applicant level before modeling.
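As an illustration of the kind of repayment feature this table supports, the sketch below derives payment lateness per loan. The columns `DAYS_INSTALMENT` (due date) and `DAYS_ENTRY_PAYMENT` (actual payment date) are assumed from the public Home Credit schema and are not shown in this notebook; the derived names are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_installments; both DAYS_* columns are assumed names.
inst = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11],
    "DAYS_INSTALMENT": [-60, -30, -30],      # when the installment was due
    "DAYS_ENTRY_PAYMENT": [-62, -25, -30],   # when it was actually paid
})

# Positive difference means the payment arrived after the due date;
# early or on-time payments are clipped to 0 days late.
inst["DAYS_LATE"] = (inst["DAYS_ENTRY_PAYMENT"] - inst["DAYS_INSTALMENT"]).clip(lower=0)
inst["IS_LATE"] = inst["DAYS_LATE"] > 0

# Summarize per previous loan; a further roll-up to SK_ID_CURR would follow.
per_loan = (
    inst.groupby("SK_ID_PREV")
    .agg(
        MAX_DAYS_LATE=("DAYS_LATE", "max"),
        LATE_SHARE=("IS_LATE", "mean"),
    )
    .reset_index()
)

print(per_loan)
```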

In [19]:
df_cc_balance.shape

(3840312, 23)

In [20]:
df_cc_balance['SK_ID_PREV'].nunique()

104307

## credit_card_balance.csv

This table contains monthly balance information for applicants’ credit card
accounts with Home Credit. As each credit card account is observed over
multiple months, the table represents behavioral time-series data linked
via SK_ID_PREV. Aggregation is required to avoid row duplication when
integrating with the application table.

## Overall Observations

- The application table defines the modeling granularity and contains
  several business-driven missing values and encoded anomalies.
- All auxiliary tables hold many rows per applicant, either directly via
  SK_ID_CURR or indirectly via SK_ID_BUREAU or SK_ID_PREV, and therefore
  require aggregation prior to modeling.
- Many missing values reflect the absence of specific products or behaviors
  rather than data quality issues.