🔄 Train-Test-Validation (OOT) Split Strategy
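Since this is time-series data, the split is chronological rather than random:

Train: roughly the earliest 70% of the timeline (in this notebook, 2019-01-01 through 2020-01-31).

Test: roughly the next 15% (2020-02-01 through 2020-04-15).

OOT (Out-of-Time validation): the remaining period (2020-04-16 through 2020-06-21).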
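
The cutoff dates in this notebook are set by hand. As a minimal sketch of how they could instead be derived from the target fractions, the hypothetical helper `time_cutoffs` below converts a 70/15/15 plan into calendar dates. The helper name, its parameters, and the toy timeline are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def time_cutoffs(dates, train_frac=0.70, test_frac=0.15):
    """Return (train_end, test_end) as calendar dates from a datetime Series."""
    start = dates.min().normalize()
    end = dates.max().normalize()
    n_days = (end - start).days + 1              # inclusive number of calendar days
    train_days = int(round(n_days * train_frac))
    test_days = int(round(n_days * test_frac))
    train_end = start + pd.Timedelta(days=train_days - 1)
    test_end = train_end + pd.Timedelta(days=test_days)
    return train_end, test_end

# Toy timeline: 100 consecutive days starting 2019-01-01
toy_dates = pd.Series(pd.date_range("2019-01-01", periods=100, freq="D"))
train_end, test_end = time_cutoffs(toy_dates)
print(train_end.date(), test_end.date())  # 2019-03-11 2019-03-26
```

Everything up to and including `train_end` would go to train, everything after `test_end` to OOT, and the days in between to test.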

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')

# Load DataFrames from exported CSVs
daily_aggregate = pd.read_csv('data_exports/daily_aggregate.csv', parse_dates=['trans_dt'])
weekly_aggregate = pd.read_csv('data_exports/weekly_aggregate.csv', parse_dates=['trans_dt'])
monthly_aggregate = pd.read_csv('data_exports/monthly_aggregate.csv', parse_dates=['trans_dt'])


transaction_features = pd.read_csv('data_exports/transaction_features.csv', parse_dates=['trans_dt'])
per_card_summary = pd.read_csv('data_exports/per_card_summary.csv')
per_card_daily_panel = pd.read_csv('data_exports/per_card_daily_panel.csv', parse_dates=['trans_dt'])

# Sanity checks
print("Daily Aggregate shape:", daily_aggregate.shape)
print("Weekly Aggregate shape:", weekly_aggregate.shape)
print("Monthly Aggregate shape:", monthly_aggregate.shape)
print("Transaction Features shape:", transaction_features.shape)
print("Per-card Summary shape:", per_card_summary.shape)
print("Per-card Daily Panel shape:", per_card_daily_panel.shape)


Daily Aggregate shape: (538, 2)
Weekly Aggregate shape: (77, 2)
Monthly Aggregate shape: (18, 2)
Transaction Features shape: (1296675, 55)
Per-card Summary shape: (983, 42)
Per-card Daily Panel shape: (488031, 3)


### Loading in and Checking the Transaction Level + Per Card Time Series Features

In [None]:
# The exports were already loaded in the previous cell


# Print the column lists
print("=== transaction_features columns ===")
print(transaction_features.columns.tolist(), end="\n\n")

print("=== per_card_summary columns ===")
print(per_card_summary.columns.tolist(), end="\n\n")

print("=== per_card_daily_panel columns ===")
print(per_card_daily_panel.columns.tolist())

=== transaction_features columns ===
['cc_num', 'merchant', 'category', 'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time', 'merch_lat', 'merch_long', 'is_fraud', 'amt_outlier', 'trans_dt', 'hour', 'weekday', 'month', 'spend_entertainment', 'spend_food_dining', 'spend_gas_transport', 'spend_grocery_net', 'spend_grocery_pos', 'spend_health_fitness', 'spend_home', 'spend_kids_pets', 'spend_misc_net', 'spend_misc_pos', 'spend_personal_care', 'spend_shopping_net', 'spend_shopping_pos', 'spend_travel', 'frac_spend_entertainment', 'frac_spend_food_dining', 'frac_spend_gas_transport', 'frac_spend_grocery_net', 'frac_spend_grocery_pos', 'frac_spend_health_fitness', 'frac_spend_home', 'frac_spend_kids_pets', 'frac_spend_misc_net', 'frac_spend_misc_pos', 'frac_spend_personal_care', 'frac_spend_shopping_net', 'frac_spend_shopping_pos', 'frac_spend_travel', 'is_top300_merchant']

=== per_card_summary columns ===
['t

## SPLITTING INTO COHORTS TRAIN, TEST, AND OOT

### Per-Card Daily Panel Splitting

In [None]:
panel = per_card_daily_panel.copy()

# First ensure dates are sorted
panel = panel.sort_values(by=['cc_num', 'trans_dt'])

# Check range
print(panel['trans_dt'].min(), panel['trans_dt'].max())

# Define split dates
train_end = '2020-01-31'
test_end = '2020-04-15'

# Create splits
daily_train_panel = panel[panel['trans_dt'] <= train_end]
daily_test_panel = panel[(panel['trans_dt'] > train_end) & (panel['trans_dt'] <= test_end)]
daily_oot_panel = panel[panel['trans_dt'] > test_end]

# Sanity checks
print("Train period:", daily_train_panel['trans_dt'].min(), "to", daily_train_panel['trans_dt'].max())
print("Test period:", daily_test_panel['trans_dt'].min(), "to", daily_test_panel['trans_dt'].max())
print("OOT period:", daily_oot_panel['trans_dt'].min(), "to", daily_oot_panel['trans_dt'].max())

print("Train size:", len(daily_train_panel))
print("Test size:", len(daily_test_panel))
print("OOT size:", len(daily_oot_panel))

2019-01-01 00:00:00 2020-06-21 00:00:00
Train period: 2019-01-01 00:00:00 to 2020-01-31 00:00:00
Test period: 2020-02-01 00:00:00 to 2020-04-15 00:00:00
OOT period: 2020-04-16 00:00:00 to 2020-06-21 00:00:00
Train size: 359402
Test size: 68114
OOT size: 60515
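
The three row counts above add up to 488,031, which matches the full panel's shape, and correspond to roughly 74% / 14% / 12% of the rows. A check like this can be automated. The sketch below uses a hypothetical helper, `check_time_partition`, on toy data; the helper name and the toy frame are assumptions for illustration.

```python
import pandas as pd

def check_time_partition(full, parts, date_col="trans_dt"):
    """Verify that date-based splits are ordered, disjoint, and cover `full`."""
    total = sum(len(p) for p in parts.values())
    assert total == len(full), "splits do not cover every row exactly once"
    ordered = list(parts.values())
    for earlier, later in zip(ordered, ordered[1:]):
        assert earlier[date_col].max() < later[date_col].min(), "periods overlap"
    # Share of rows that falls into each split
    return {name: len(p) / len(full) for name, p in parts.items()}

# Toy panel: 10 daily rows for one hypothetical card
toy = pd.DataFrame({
    "cc_num": 1,
    "trans_dt": pd.date_range("2020-01-01", periods=10, freq="D"),
    "amt": range(10),
})
shares = check_time_partition(toy, {
    "train": toy[toy["trans_dt"] <= "2020-01-07"],
    "test": toy[(toy["trans_dt"] > "2020-01-07") & (toy["trans_dt"] <= "2020-01-09")],
    "oot": toy[toy["trans_dt"] > "2020-01-09"],
})
print(shares)
```

The splits are passed in chronological order, so the overlap check compares each period only with the one that follows it.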


## Transaction Feature Store Splitting

In [5]:
transaction_features = transaction_features.sort_values('trans_dt')

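# Same cutoffs as the daily panel. Note: trans_dt carries a time of day here, so
# "<= '2020-01-31'" means "<= 2020-01-31 00:00:00" and most of Jan 31 lands in test
# (likewise Apr 15 lands in OOT), one day earlier than in the daily panel split.
train_end = '2020-01-31'
test_end = '2020-04-15'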

train_transactions = transaction_features[transaction_features['trans_dt'] <= train_end]
test_transactions = transaction_features[
    (transaction_features['trans_dt'] > train_end) & 
    (transaction_features['trans_dt'] <= test_end)
]
oot_transactions = transaction_features[transaction_features['trans_dt'] > test_end]

print("Transaction Features Splits:")
print("Train:", train_transactions['trans_dt'].min(), "to", train_transactions['trans_dt'].max())
print("Test:", test_transactions['trans_dt'].min(), "to", test_transactions['trans_dt'].max())
print("OOT:", oot_transactions['trans_dt'].min(), "to", oot_transactions['trans_dt'].max())

print("Sizes:")
print("Train size:", len(train_transactions))
print("Test size:", len(test_transactions))
print("OOT size:", len(oot_transactions))

Transaction Features Splits:
Train: 2019-01-01 00:00:18 to 2020-01-30 23:58:55
Test: 2020-01-31 00:01:33 to 2020-04-14 23:59:47
OOT: 2020-04-15 00:00:01 to 2020-06-21 12:13:37
Sizes:
Train size: 975662
Test size: 154141
OOT size: 166872
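
In the output above, the train period ends at 2020-01-30 23:58:55 even though the cutoff is 2020-01-31, whereas the daily panel's train period includes 2020-01-31. The reason is that a bare date string is compared as midnight of that day. The sketch below reproduces the effect on made-up timestamps and shows one possible way to align the two splits by using a half-open interval ending at the start of the next day. The variable names are illustrative only.

```python
import pandas as pd

# Toy transactions around a cutoff (timestamps are made up for illustration)
tx = pd.DataFrame({"trans_dt": pd.to_datetime([
    "2020-01-30 23:58:55",
    "2020-01-31 00:01:33",
    "2020-01-31 18:30:00",
    "2020-02-01 09:00:00",
])})
train_end = pd.Timestamp("2020-01-31")

# A bare date is midnight, so this keeps only rows up to 2020-01-31 00:00:00
naive_train = tx[tx["trans_dt"] <= train_end]

# Half-open interval: everything strictly before the start of the next day
next_day = train_end + pd.Timedelta(days=1)
inclusive_train = tx[tx["trans_dt"] < next_day]

print(len(naive_train), len(inclusive_train))  # 1 3
```

Comparing `tx["trans_dt"].dt.normalize()` against the cutoff would be an equivalent alternative. Whichever convention is chosen, using the same one for the panel and the transaction data keeps the cohorts consistent.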


## Per Card Summary

Strategy:
One option is to assign whole cards to cohorts by `first_txn`, keeping only cards with sufficient activity in the test and OOT periods. The alternative used here keeps every card and splits each card's rows by date instead of splitting by card cohort.

Recommended rigorous approach (Time-Series Per-Card Splitting):
Instead of splitting the per-card summary directly, it is often better to:

Use the full set of cards in all splits.

Split each card's transactions by date into train/test/OOT periods.

Each card can then appear in every cohort in which it is active, while the dates of the transactions are strictly separated. This is a common setup in time-series forecasting, where future behavior is predicted from each card's past behavior.

In [8]:
per_card_summary['first_txn'] = pd.to_datetime(per_card_summary['first_txn'])

# Cutoff dates (same as the earlier splits); cohorts are formed by row date, not by first_txn
train_end = '2020-01-31'
test_end = '2020-04-15'

# Per-card transaction splits:
train_panel = per_card_daily_panel[per_card_daily_panel['trans_dt'] <= train_end]
test_panel = per_card_daily_panel[
    (per_card_daily_panel['trans_dt'] > train_end) & 
    (per_card_daily_panel['trans_dt'] <= test_end)
]
oot_panel = per_card_daily_panel[per_card_daily_panel['trans_dt'] > test_end]

# Check per-card coverage:
train_cards = train_panel['cc_num'].nunique()
test_cards = test_panel['cc_num'].nunique()
oot_cards = oot_panel['cc_num'].nunique()

print("Transaction-based splits:")
print("Train period:", train_panel['trans_dt'].min(), "to", train_panel['trans_dt'].max())
print("Test period:", test_panel['trans_dt'].min(), "to", test_panel['trans_dt'].max())
print("OOT period:", oot_panel['trans_dt'].min(), "to", oot_panel['trans_dt'].max())

print("\nNumber of unique cards per period:")
print("Train cards:", train_cards)
print("Test cards:", test_cards)
print("OOT cards:", oot_cards)

print("\nSizes (number of rows):")
print("Train rows:", len(train_panel))
print("Test rows:", len(test_panel))
print("OOT rows:", len(oot_panel))

Transaction-based splits:
Train period: 2019-01-01 00:00:00 to 2020-01-31 00:00:00
Test period: 2020-02-01 00:00:00 to 2020-04-15 00:00:00
OOT period: 2020-04-16 00:00:00 to 2020-06-21 00:00:00

Number of unique cards per period:
Train cards: 963
Test cards: 915
OOT cards: 921

Sizes (number of rows):
Train rows: 359402
Test rows: 68114
OOT rows: 60515
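
The card counts above (963, 915, 921) are all below the 983 cards in `per_card_summary`, so not every card is active in every period. Before modeling, it can help to know which test or OOT cards have no training history at all. The following is a minimal sketch with a hypothetical helper, `cards_without_history`, and made-up card numbers.

```python
import pandas as pd

def cards_without_history(train_df, later_df, key="cc_num"):
    """Return the sorted card IDs present in `later_df` but absent from `train_df`."""
    train_ids = set(train_df[key].unique().tolist())
    later_ids = set(later_df[key].unique().tolist())
    return sorted(later_ids - train_ids)

# Toy frames with hypothetical card numbers
toy_train = pd.DataFrame({"cc_num": [101, 101, 102, 103]})
toy_test = pd.DataFrame({"cc_num": [102, 104, 104, 105]})
print(cards_without_history(toy_train, toy_test))  # [104, 105]
```

In the notebook this would be called as `cards_without_history(train_panel, test_panel)` and again with `oot_panel`.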


## Spend-Segment Encoding from Per-Card Training Totals, Based on AMT

Why use amt (spend amount) instead of transaction count (txn_count)?
You have two possible ways to define segments:

**Option A**: Transaction Count (txn_count)<br>
**Pros**: Measures frequency of card use.<br>
**Cons**: Ignores the monetary value of transactions.<br>
A card with many small transactions (e.g., coffee daily) may be considered "heavy," while a card with fewer high-value transactions (e.g., luxury purchases) would appear "light."



**Option B**: Total Amount (amt) <br>
**Pros**: Directly measures actual financial impact, more closely aligned to spending behaviors and profitability.<br>
**Cons**: Ignores transaction frequency—high spenders might spend infrequently but substantially.<br>



⚖️ Choosing between the two:
For predicting revenue or spending behaviors, total amount (amt) is typically preferred because it's directly correlated to business outcomes (profits, revenue, etc.).

Transaction frequency is useful if your business goal is engagement or activity levels (frequency of use).

Given the goal—"predicting future credit card spending based on historical behaviors"—segmenting based on amount spent (amt) is usually more practical and meaningful.

How does this prevent data leakage?
Segments are created solely using historical training-period data.

After establishing segments historically, you then apply these stable segments to the test and OOT periods.

No future data (test or OOT) influences the segment definitions.

This ensures segments are stable, interpretable, and leakage-free.



In [9]:
train_totals = train_panel.groupby('cc_num')['amt'].sum()
quantiles = train_totals.quantile([0.33, 0.66])

def spend_segment(amount):
    if amount <= quantiles.iloc[0]:
        return 'Light'
    elif amount <= quantiles.iloc[1]:
        return 'Medium'
    else:
        return 'High'

card_segments = train_totals.apply(spend_segment).rename('spender_segment')
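
As a possible vectorized alternative to the `apply` call above, the same cut points can be passed to `pd.cut`, whose default right-closed bins mirror the `<=` comparisons in `spend_segment`. This is only a sketch on made-up totals; the `toy_` names are assumptions, and on real data the result is a categorical Series rather than plain strings.

```python
import pandas as pd

# Toy per-card training totals (hypothetical card IDs and amounts)
toy_totals = pd.Series(
    [50.0, 120.0, 300.0, 450.0, 800.0, 1500.0],
    index=pd.Index([1, 2, 3, 4, 5, 6], name="cc_num"),
    name="amt",
)

# Cut points are learned once, from training totals only
q_lo, q_hi = toy_totals.quantile([0.33, 0.66])

# Open-ended outer edges so every total falls into exactly one bin
edges = [-float("inf"), q_lo, q_hi, float("inf")]

toy_segments = pd.cut(toy_totals, bins=edges, labels=["Light", "Medium", "High"])
toy_segments = toy_segments.rename("spender_segment")
print(toy_segments.tolist())
# ['Light', 'Light', 'Medium', 'Medium', 'High', 'High']
```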


In [10]:
# Merge segments into panel datasets
train_panel = train_panel.merge(card_segments, on='cc_num', how='left')
test_panel = test_panel.merge(card_segments, on='cc_num', how='left')
oot_panel = oot_panel.merge(card_segments, on='cc_num', how='left')

# Numeric encoding
segment_encoding = {'Light':0, 'Medium':1, 'High':2}
train_panel['seg_code'] = train_panel['spender_segment'].map(segment_encoding)
test_panel['seg_code'] = test_panel['spender_segment'].map(segment_encoding)
oot_panel['seg_code'] = oot_panel['spender_segment'].map(segment_encoding)
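
Because the segments come from training data only, any card that first appears in the test or OOT period has no entry in `card_segments`, and the left merge leaves its `spender_segment` and `seg_code` as NaN. One way to handle this, sketched below on toy data, is an explicit "Unknown" label with code -1. That convention is an assumption for illustration, not something the original notebook defines.

```python
import pandas as pd

segment_encoding = {"Light": 0, "Medium": 1, "High": 2}

# Segments learned from training data only (hypothetical card IDs)
toy_card_segments = pd.Series(
    ["Light", "High"],
    index=pd.Index([101, 102], name="cc_num"),
    name="spender_segment",
)

# Card 103 has no training history, so the left merge leaves its segment missing
toy_test_panel = pd.DataFrame({"cc_num": [101, 102, 103], "amt": [20.0, 500.0, 75.0]})
merged = toy_test_panel.merge(toy_card_segments, on="cc_num", how="left")

# Assumed convention: label unseen cards "Unknown" and encode them as -1
merged["spender_segment"] = merged["spender_segment"].fillna("Unknown")
merged["seg_code"] = merged["spender_segment"].map(segment_encoding).fillna(-1).astype(int)
print(merged)
```

Checking `test_panel['spender_segment'].isna().sum()` right after the merge shows how many rows are affected.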

In [13]:
# Count segment labels in training panel
print(train_panel['spender_segment'].value_counts())
print(train_panel['seg_code'].value_counts())


spender_segment
High      129878
Medium    125469
Light     104055
Name: count, dtype: int64
seg_code
2    129878
1    125469
0    104055
Name: count, dtype: int64


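Interpretation (the daily panel holds card-day rows, so these counts are rows, not individual transactions or cards):

~130k training rows come from high spenders

~125k from medium spenders

~104k from light spenders

Each segment holds roughly a third of the training cards by construction, so the higher row count for high spenders suggests they are active on more days.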



### 📌 Why Segmenting Is Useful for Modeling

| Purpose               | Benefit                                                                 |
|-----------------------|-------------------------------------------------------------------------|
| ✅ Model personalization | Train different models per segment (e.g., heavy spenders may behave differently) |
| ✅ Forecasting            | Segment can be an exogenous feature in ARIMA, Prophet, or LSTM          |
| ✅ Feature                | You can use `seg_code` as a categorical input in tree models or deep nets |
| ✅ Downstream analytics   | Track how each group behaves differently over time                      |
| ✅ Targeted strategy      | You might recommend different products or marketing to each group        |
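
To make the "model personalization" row concrete, here is a deliberately simple sketch: a per-segment baseline that predicts each row's spend as its segment's training mean, compared against a single global mean. The data is made up, and a mean is only a stand-in for whatever per-segment model would actually be trained.

```python
import pandas as pd

# Toy train/test panels with a precomputed seg_code (all values are made up)
toy_train = pd.DataFrame({
    "seg_code": [0, 0, 1, 1, 2, 2],
    "amt": [10.0, 20.0, 50.0, 70.0, 200.0, 400.0],
})
toy_test = pd.DataFrame({
    "seg_code": [0, 1, 2],
    "amt": [18.0, 55.0, 330.0],
})

# "Model" per segment: the mean spend observed in training
segment_mean = toy_train.groupby("seg_code")["amt"].mean()

# Predict each test row from its segment's training mean
toy_test["pred"] = toy_test["seg_code"].map(segment_mean)

# Absolute error per row, plus a global-mean baseline for comparison
toy_test["abs_err"] = (toy_test["amt"] - toy_test["pred"]).abs()
global_mae = (toy_test["amt"] - toy_train["amt"].mean()).abs().mean()

print(toy_test.groupby("seg_code")["abs_err"].mean())
print("segment MAE:", toy_test["abs_err"].mean(), "global MAE:", global_mae)
```

On this toy data the per-segment baseline has a much lower error than the global mean, which is the kind of gap that would justify carrying `seg_code` into a real model.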

---

### 🧠 Bonus: Tips for Using the Encoded Segment

#### 📈 Plot time series of average spend per segment:
```python
# Mean amt per day for each segment code, one column per segment
avg_spend = train_panel.groupby(['trans_dt', 'seg_code'])['amt'].mean().unstack()
avg_spend.plot(title="Avg Daily Spend by Segment")
```



---

## ❓ Do You Need to Add Segment Encodings to the Transaction Feature Store?

**Short answer**: ✅ **Yes, if** you're planning to train models using the `transaction_features` dataset.

### Here's Why:
- The `seg_code` gives your model access to historical **spending profile** context per transaction.
- Many of your predictive tasks (e.g., forecasting, fraud detection, behavior modeling) could benefit from **user segmentation** as an additional feature.

### How to Add It:
Since `transaction_features` has a `cc_num` column:
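```python
# Merge the training-derived segment labels into the full transaction feature dataset
transaction_features = transaction_features.merge(card_segments, on='cc_num', how='left')

# Add numeric encoding (cards with no training-period history get NaN)
transaction_features['seg_code'] = transaction_features['spender_segment'].map(segment_encoding)

# train_transactions, test_transactions, and oot_transactions were sliced before this
# merge, so re-create them (or merge into each one) to carry the new columns
```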
