In [1]:
import pandas as pd

### Simplify the dataset

**The issue**: The original dataset contains multiple rows per customer across different months. That structure makes this a time-series / sequence problem if the goal is to predict a customer's credit score in a *future* month (e.g., next month). Time-aware modeling requires sequence features, temporal cross-validation, and care to avoid leakage.

**Why we truncate**: For this project stage we simplify the problem to a cross-sectional classification task by keeping only the most recent (latest) month per customer. This produces a single row per customer and allows us to prototype feature engineering and classification models without implementing a time-series pipeline.

**Implications & next steps**:
- The truncated dataset is saved to `../data/raw/credit_score_truncated_raw.csv`.
- From the truncated dataset we create a development split (`train_full.csv`, 80%) and a locked holdout (`test_holdout.csv`, 20%), so that the final evaluation uses rows never seen during development.
- Pros: faster iteration, simpler modeling, easier baseline comparisons.
- Cons: loss of temporal dynamics; not suitable if the production task requires forecasting future credit scores or leveraging temporal patterns.

When moving beyond prototyping, restore the temporal structure and use time-aware validation and modeling approaches.
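
As a rough illustration of what time-aware validation could look like once the monthly rows are restored, here is a minimal sketch on made-up toy data. The `toy` frame and `cutoff_month` are hypothetical, and `month_num` is assumed to be a numeric month column; this is not the project's validation scheme, only one simple way to split by time.

```python
import pandas as pd

# Toy data standing in for the multi-month dataset (all values are made up).
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B", "B"],
    "month_num": [1, 2, 3, 1, 2, 3],
    "Credit_Score": ["Good", "Good", "Standard", "Poor", "Poor", "Standard"],
})

# Hypothetical cutoff: train on months before it, validate on the rest.
cutoff_month = 3
train_part = toy[toy["month_num"] < cutoff_month]
valid_part = toy[toy["month_num"] >= cutoff_month]

# Every validation row is later in time than every training row,
# so no future information leaks into training.
assert train_part["month_num"].max() < valid_part["month_num"].min()
```

Unlike a random split, this keeps all of a customer's earlier months in training and evaluates only on later months, which is closer to how a forecasting model would be used.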

In [2]:
file_path = "../data/raw/credit_score_raw.csv"

df = pd.read_csv(file_path)



In [3]:
df.sample(n=2, random_state=1).T

Unnamed: 0,43660,87278
ID,0x115d2,0x21564
Customer_ID,CUS_0x5af1,CUS_0x87be
Month,May,July
Name,,Novakz
Age,38,46
SSN,620-05-5524,268-75-5454
Occupation,Doctor,Doctor
Annual_Income,40026.12_,75868.8
Monthly_Inhand_Salary,,6074.4
Num_Bank_Accounts,6,6


In [4]:
# Convert month to numeric for sorting
month_map = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12
}
df['month_num'] = df['Month'].map(month_map)

# Sort by customer and month (ascending)
df_sorted = df.sort_values(['Customer_ID', 'month_num'])

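# Keep ONLY the last month's row for each customer.
# tail(1) keeps the actual last row; GroupBy.last() would instead return the
# last non-null value per column, mixing in values from earlier months.
df = df_sorted.groupby('Customer_ID').tail(1).reset_index(drop=True)
df = df.drop(columns=['month_num'])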
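
One detail worth checking when keeping the latest row per group: `GroupBy.last()` returns the last non-null value of each column, whereas `GroupBy.tail(1)` returns the last row as-is, missing values included. The sketch below shows the difference on made-up toy data; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: customer A has a missing salary in the latest month (values are made up).
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "B", "B"],
    "month_num": [1, 2, 1, 2],
    "Monthly_Inhand_Salary": [1000.0, np.nan, 2000.0, 2500.0],
})
toy = toy.sort_values(["Customer_ID", "month_num"])

# last(): per-column last non-null value, so A's salary comes from month 1.
last_non_null = toy.groupby("Customer_ID", as_index=False).last()

# tail(1): the actual last row per customer, so A's salary stays missing.
last_row = toy.groupby("Customer_ID").tail(1).reset_index(drop=True)

print(last_non_null)
print(last_row)
```

Which behavior is preferable depends on the goal: `tail(1)` gives a faithful snapshot of the latest month, while `last()` quietly fills gaps with older values.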

In [5]:
truncated_file_path = "../data/raw/credit_score_truncated_raw.csv"
df.to_csv(truncated_file_path, index=False)

### Notes about filename and splits

- The notebook filename `00_truncate_data.ipynb` is fine; no rename required.
- Next we create a reproducible 80/20 split from the truncated dataset: `train_full.csv` (80%) for development and `test_holdout.csv` (20%) as a locked holdout.

In [None]:
from sklearn.model_selection import train_test_split

# Stratify on the target column if present; otherwise fall back to a plain random split
target_col = next((c for c in ('credit_score', 'Credit_Score') if c in df.columns), None)
stratify = df[target_col] if target_col is not None else None

train_df, test_df = train_test_split(df, test_size=0.2, stratify=stratify, random_state=42)

train_path = "../data/raw/train_full.csv"
test_path = "../data/raw/test_holdout.csv"
train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)
print(f"Saved: {train_path} ({len(train_df)} rows), {test_path} ({len(test_df)} rows)")
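
A quick sanity check for a stratified split is to compare class shares between the two parts. The sketch below does this on made-up toy data; the `toy` frame, its class counts, and the `Credit_Score` column name are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced three-class target (counts are made up).
toy = pd.DataFrame({
    "feature": range(100),
    "Credit_Score": ["Good"] * 20 + ["Standard"] * 50 + ["Poor"] * 30,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["Credit_Score"], random_state=42
)

# Compare class shares in each part; stratification should keep them close.
train_share = toy_train["Credit_Score"].value_counts(normalize=True).sort_index()
test_share = toy_test["Credit_Score"].value_counts(normalize=True).sort_index()
max_gap = (train_share - test_share).abs().max()

print(pd.DataFrame({"train": train_share, "test": test_share}))
```

On the real data the shares will rarely match exactly, but a large gap would suggest the split was not stratified on the intended column.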