# PD Sampling Pipeline (Stratified, 10%, Train/Val/OOT)

This notebook builds a representative PD training extract from large Freddie-style performance files using:
- two-pass stratified sampling,
- loan-level deterministic inclusion,
- deterministic train/val/OOT split,
- inverse-probability sample weights,
- manifest + QA diagnostics.
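
The deterministic inclusion and split steps listed above can be sketched with a salted hash of the loan ID, so that the same loan always lands in the same sample and the same partition regardless of run order or random seed. This is a minimal sketch, not the notebook's actual implementation: the function names, the salts, the 20% validation share, and the choice of `2022Q2` as the out-of-time vintage are all illustrative assumptions.

```python
import hashlib


def loan_hash_unit(loan_id: str, salt: str = "pd-sample-v1") -> float:
    """Map a loan ID to a reproducible number in [0, 1).

    The salt is a hypothetical constant; changing it yields a different
    but equally reproducible sample.
    """
    digest = hashlib.sha256(f"{salt}:{loan_id}".encode("utf-8")).digest()
    # Keep the top 53 bits so the result is exactly representable and < 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def is_sampled(loan_id: str, rate: float = 0.10) -> bool:
    """Include a loan if its hash value falls below the sampling rate."""
    return loan_hash_unit(loan_id) < rate


def assign_split(
    loan_id: str,
    vintage: str,
    oot_vintages: frozenset = frozenset({"2022Q2"}),  # assumed OOT vintage
    val_share: float = 0.20,                           # assumed validation share
) -> str:
    """Assign a loan to train, val, or oot without any random state."""
    if vintage in oot_vintages:
        return "oot"
    # A separate salt keeps the split independent of the inclusion decision.
    u = loan_hash_unit(loan_id, salt="pd-split-v1")
    return "val" if u < val_share else "train"
```

Because the decision depends only on the loan ID, every monthly record of a loan receives the same inclusion flag and split label, which avoids leaking a loan's history across train and validation.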


In [4]:
import pandas as pd
import glob
import numpy as np

# Vintage files to sample (one pipe-delimited file per quarter, in the working directory)
file_list = ["2007Q1.csv", "2009Q1.csv", "2016Q1.csv", "2022Q2.csv"]
sample_rate = 0.1  # 10% sample of loan IDs per file
final_sample = []

for file in file_list:
    print(f"Processing {file}...")
    
    # 1. Get unique IDs first to sample consistently
    # We only need the ID column (index 0 or 1 depending on file version)
    ids = pd.read_csv(file, sep='|', usecols=[1], header=None)[1].unique()
    
    # 2. Randomly select 10% of those IDs (sample_rate), without replacement
    sampled_ids = np.random.choice(ids, size=int(len(ids) * sample_rate), replace=False)
    sampled_ids_set = set(sampled_ids)
    
    # 3. Read the file in chunks and keep only the sampled IDs
    # This prevents loading the 5GB+ file into RAM all at once
    chunks = pd.read_csv(file, sep='|', header=None, chunksize=100000, 
                         usecols=[1,2,11,15,17,19,22,23,39,43])
    
    for chunk in chunks:
        # Filter for our pre-selected IDs
        filtered_chunk = chunk[chunk[1].isin(sampled_ids_set)]
        final_sample.append(filtered_chunk)

# Combine all sampled vintages into one DataFrame
df_all = pd.concat(final_sample, ignore_index=True)
print(f"Sampling complete. Total rows: {len(df_all)}")


Processing 2007Q1.csv...
Processing 2009Q1.csv...
Processing 2016Q1.csv...
Processing 2022Q2.csv...
Sampling complete. Total rows: 9877075


In [5]:
df_all.to_csv("sampled_PD_data_10pct.csv", index=False)
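
The inverse-probability sample weights mentioned in the overview are not computed in the cells above. One way to derive them, sketched here under stated assumptions, is to divide each stratum's population loan count by its sampled loan count. The column names `loan_id` and `vintage`, the helper `add_sample_weights`, and the `population_counts` mapping are hypothetical: `df_all` as built above has integer column labels and no vintage column, so both would need to be added during sampling.

```python
import pandas as pd


def add_sample_weights(
    df: pd.DataFrame,
    population_counts: dict,
    id_col: str = "loan_id",       # assumed column name
    stratum_col: str = "vintage",  # assumed column name
) -> pd.DataFrame:
    """Attach weight = (loans in population) / (loans in sample) per stratum."""
    # Count distinct loans, not rows: each loan has many monthly records.
    sampled_counts = df.groupby(stratum_col)[id_col].nunique()
    weights = pd.Series(population_counts) / sampled_counts  # aligned by stratum label
    out = df.copy()
    out["sample_weight"] = out[stratum_col].map(weights)
    return out


# Toy data: 2 of 20 loans sampled in one stratum, 1 of 5 in the other.
toy = pd.DataFrame(
    {
        "loan_id": ["a", "a", "b", "c"],
        "vintage": ["2007Q1", "2007Q1", "2007Q1", "2009Q1"],
    }
)
weighted = add_sample_weights(toy, {"2007Q1": 20, "2009Q1": 5})
```

Using realized counts rather than the nominal `sample_rate` accounts for the rounding in `int(len(ids) * sample_rate)`, so the weights of the sampled loans sum back to the population loan count in each stratum.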