# Target Analysis by Split and Effective Month (Pandas Version)

## Overview
This notebook analyzes all 33 prediction head targets using **pandas** for data processing.

### Pandas Optimizations for Large-Scale Data:
- **Chunked processing**: Load and process one chunk at a time
- **Incremental aggregation**: Build statistics incrementally
- **Memory management**: Explicit garbage collection after each chunk
- **Selective column loading**: Only load target columns needed

### Analysis Goals:
1. Per-head positive count and positive rate by split and month
2. Temporal trends in target distributions
3. Class imbalance assessment
4. Comparison across splits

### Outputs:
- Target statistics CSV files (by split, by month)
- Heatmaps and trend visualizations

---


In [None]:
%pip install --upgrade pandas==2 -i https://repo.td.com/repository/pypi-all/simple

In [None]:
dbutils.library.restartPython()

In [None]:
import pandas
print(pandas.__version__)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from io import BytesIO
from collections import defaultdict
import gc

pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")

print("✓ Libraries imported")


In [None]:
# Helper functions
def save_pandas_to_csv_adls(df_pandas, adls_path):
    csv_string = df_pandas.to_csv(index=False)
    dbutils.fs.put(adls_path, csv_string, overwrite=True)
    print(f"✓ Saved CSV to {adls_path}")

def save_plot_to_adls(fig, adls_path, dpi=150):
    """Render a matplotlib figure to a local PNG, then copy it to ADLS."""
    import tempfile, os
    # dbutils.fs.cp copies from a local file, so write the PNG to a temp file first
    with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
        tmp_path = tmp.name
    fig.savefig(tmp_path, format='png', dpi=dpi, bbox_inches='tight')
    dbutils.fs.cp(f"file:{tmp_path}", adls_path)
    os.remove(tmp_path)
    print(f"✓ Saved plot to {adls_path}")

print("✓ Helper functions defined")


In [None]:
# Configuration
DATA_PATH = "abfss://home@edaaaazepcalayelaye0001.dfs.core.windows.net/MD_Artifacts/money-out/data/"
TARGET_TRAIN_VAL_PATH = DATA_PATH + "target/cust/all_products_chunk_320/train_val/"
TARGET_TEST_PATH = DATA_PATH + "target/cust/all_products_chunk_320/test/"
OUTPUT_PATH = "abfss://home@edaaaazepcalayelaye0001.dfs.core.windows.net/MD_Artifacts/money-out/mv/eda_validation/target_analysis/"
dbutils.fs.mkdirs(OUTPUT_PATH)

TOTAL_CHUNKS = 320
OOT_START_DATE = '2024-01-01'
SAMPLING_RATIO = 1.0

# All 33 targets
TARGETS = [
    'cc_acq_2-3', 'cc-aei_acq_2-3', 'cc-aeip_acq_2-3', 'cc-aep_acq_2-3',
    'cc-cbe_acq_2-3', 'cc-cbi_acq_2-3', 'cc-fct_acq_2-3', 'cc-lr_acq_2-3',
    'cc-pt_acq_2-3', 'cc-rew_acq_2-3', 'cc_att_2-3', 'cc_boat_2-3',
    'cc_clip_2-3', 'cc_dg_2-3', 'cc_pap_2-3',
    'cc_2ndary_2-3', 'cc_upg_2-3',
    'ofi_resl_acq_4-5', 'resl_acq_4-5', 'heloc_att_2-7', 'heloc_cpr_2-3',
    'mtg_att_2-7', 'mtg_cpr_2-3',
    'ulon_acq_2-3', 'ulon-dcl_acq_2-3', 'ulon-mpl_acq_2-3', 'ulon-rsp_acq_2-3',
    'ulon_att_2-3', 'ulon_cpr_2-3',
    'uloc_acq_2-3', 'uloc_att_2-3', 'uloc_clip_2-3', 'uloc_cpr_2-3'
]

TARGET_CATEGORIES = {
    'Cards': TARGETS[:17],
    'RESL': TARGETS[17:23],
    'Unsecured_Loan': TARGETS[23:29],
    'Unsecured_LoC': TARGETS[29:33]
}

def get_category(target):
    for cat, targets in TARGET_CATEGORIES.items():
        if target in targets:
            return cat
    return 'Other'

print("✓ Configuration loaded")


## Processing Strategy: True Incremental (Constant Memory)

### Why This Approach?
This notebook uses **true incremental chunked processing** because the target statistics it reports (positive counts and positive rates) are built from sums and counts, which can be accumulated one chunk at a time.

### Memory Efficiency:
- **Memory usage**: ~800 MB constant (regardless of SAMPLING_RATIO)
- **Safe for 100% sampling**: Yes - memory doesn't increase with more data
- **Processing**: One chunk at a time, building statistics incrementally

### How It Works:
```
Initialize: target_stats = defaultdict(dict)

For each chunk (0 to 319):
  1. Load ['pid', 'pred_dt'] + all 33 target columns
  2. For each target and split:
     - Count non-null values for this target (target.notna().sum())
     - Count positive values (target.sum())
     - Increment non_null_count
     - Increment positive sum
  3. Free memory (del df; gc.collect())
  4. Move to next chunk

After all chunks:
  Calculate positive_rate = total_positives / total_non_null_count
```
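
The loop above can be sketched on in-memory data. This is a minimal illustration rather than the notebook's actual loader: the `accumulate` helper, the `chunks` list, and the toy target column `t` are hypothetical stand-ins for the 320 CSV chunks and the 33 targets.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def accumulate(stats, chunk, targets):
    """Add one chunk's non-null and positive counts to the running totals."""
    for split, part in chunk.groupby("split_label"):
        for target in targets:
            stats[split][target]["count"] += int(part[target].notna().sum())
            # Series.sum() skips NaN by default, so nulls add nothing here.
            stats[split][target]["positive"] += int(part[target].sum())
    return stats


# Hypothetical toy chunks standing in for the per-chunk CSV files.
chunks = [
    pd.DataFrame({"split_label": ["train", "train", "valid"], "t": [1.0, np.nan, 0.0]}),
    pd.DataFrame({"split_label": ["train", "valid", "valid"], "t": [0.0, 1.0, np.nan]}),
]

stats = defaultdict(lambda: defaultdict(lambda: {"count": 0, "positive": 0}))
for chunk in chunks:
    accumulate(stats, chunk, ["t"])  # only running totals are kept between chunks

# Rates are derived once, after every chunk has been folded in.
rates = {
    split: per_target["t"]["positive"] / per_target["t"]["count"]
    for split, per_target in stats.items()
}
```

Only the small `stats` dictionary persists across iterations, which is why memory stays roughly flat as the number of chunks grows.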

### Why Incremental Works Here:
- **Non-null counts**: sum(chunk1_non_null) + sum(chunk2_non_null) = total_non_null ✓
- **Sums**: sum(chunk1_positives) + sum(chunk2_positives) = total_positives ✓
- **Rates**: Calculated at the end from aggregated non-null counts and sums ✓
- **Important**: Only non-null values are counted in denominator (handles missing target values)

### Key Fix:
- **Positive rate calculation**: `positive / non_null_count` (not `positive / total_samples`)
- Some targets have null values for certain samples - these are excluded from rate calculation
- This ensures accurate positive rates that reflect the true proportion among valid observations
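
To see why the denominator matters, here is a small sketch on a made-up target column (the values are illustrative, not taken from the data). Dividing by all rows understates the rate whenever labels are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical target column: 2 positives, 2 negatives, 4 missing labels.
target = pd.Series([1, 0, np.nan, 1, np.nan, 0, np.nan, np.nan])

positives = target.sum()          # NaN is skipped by default
non_null = target.notna().sum()   # valid observations only
total = len(target)               # all rows, including nulls

rate_valid_only = positives / non_null  # the definition used in this notebook
rate_all_rows = positives / total       # dilutes the rate with missing labels
```

With these values the valid-only rate is 0.5 while the all-rows rate is 0.25, so the two definitions can differ by a factor of two when half the labels are null.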

---


## Load and Process Targets Chunk by Chunk


In [None]:
print("="*80)
print("PROCESSING TARGET DATA CHUNK BY CHUNK")
print("="*80)

# Incremental statistics
target_stats_by_split = defaultdict(lambda: defaultdict(lambda: {'count': 0, 'positive': 0}))
target_stats_by_month = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: {'count': 0, 'positive': 0})))

cols_to_load = ['pid', 'pred_dt'] + TARGETS

# Process both train_val and test
for source, base_path in [('train_val', TARGET_TRAIN_VAL_PATH), ('test', TARGET_TEST_PATH)]:
    print(f"\nProcessing {source} chunks...")
    
    for chunk_id in range(TOTAL_CHUNKS):
        chunk_path = f"{base_path}chunk={chunk_id}/"
        
        try:
            files = [f.path for f in dbutils.fs.ls(chunk_path) if f.name.endswith('.csv')]
            if not files:
                continue
            
            for file_path in files:
                import tempfile, os
                with tempfile.NamedTemporaryFile(mode='wb', suffix='.csv', delete=False) as tmp:
                    tmp_path = tmp.name
                dbutils.fs.cp(file_path, f"file:{tmp_path}")
                
                # Load only necessary columns
                available_cols = pd.read_csv(tmp_path, nrows=0).columns.tolist()
                cols_load = [c for c in cols_to_load if c in available_cols]
                df = pd.read_csv(tmp_path, usecols=cols_load)
                os.remove(tmp_path)
                
                if SAMPLING_RATIO < 1.0:
                    df = df.sample(frac=SAMPLING_RATIO, random_state=42)
                
                df = df.rename(columns={'pid': 'cust_id', 'pred_dt': 'efectv_dt'})
                
                # Parse efectv_dt and assign split labels (vectorized; row-wise apply is slow on large chunks)
                base_split = 'train' if chunk_id < 256 else 'valid'
                if 'efectv_dt' in df.columns:
                    df['efectv_dt'] = pd.to_datetime(df['efectv_dt'])
                    oot_date = pd.to_datetime(OOT_START_DATE)
                    df['split_label'] = np.where(df['efectv_dt'] >= oot_date, 'OOT', base_split)
                else:
                    df['split_label'] = base_split
                
                # Aggregate statistics
                for target in TARGETS:
                    if target in df.columns:
                        # By split
                        for split in df['split_label'].unique():
                            split_data = df[df['split_label'] == split]
                            # Denominator: non-null values only (missing labels are excluded)
                            non_null_count = split_data[target].notna().sum()
                            # Numerator: sum() skips NaN by default, so only valid labels contribute
                            positive_count = split_data[target].sum()
                            if non_null_count > 0:
                                target_stats_by_split[split][target]['count'] += non_null_count
                                target_stats_by_split[split][target]['positive'] += positive_count
                        
                        # By month and split
                        for (date, split), group in df.groupby(['efectv_dt', 'split_label']):
                            # Denominator: non-null values only (missing labels are excluded)
                            non_null_count = group[target].notna().sum()
                            # Numerator: sum() skips NaN by default, so only valid labels contribute
                            positive_count = group[target].sum()
                            if non_null_count > 0:
                                target_stats_by_month[date][split][target]['count'] += non_null_count
                                target_stats_by_month[date][split][target]['positive'] += positive_count
                
                del df
                gc.collect()
            
            if (chunk_id + 1) % 50 == 0:
                print(f"  Processed {chunk_id + 1}/{TOTAL_CHUNKS} chunks...")
        
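        except Exception as e:
            # Report the failure instead of silently dropping the chunk from the statistics
            print(f"  ⚠ Skipped {source} chunk {chunk_id}: {e}")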

print("\n✓ All chunks processed")


In [None]:
print("="*80)
print("CREATING SUMMARY DATAFRAMES")
print("="*80)

# Results by split
# 'count' holds the non-null count for each target, not the total number of rows
results_by_split = []
for split, targets_dict in target_stats_by_split.items():
    for target, stats in targets_dict.items():
        non_null_count = stats['count']
        positive_count = int(stats['positive'])
        positive_rate = positive_count / non_null_count if non_null_count > 0 else 0
        results_by_split.append({
            'split': split,
            'target': target,
            'non_null_count': non_null_count,
            'positive_count': positive_count,
            'positive_rate': positive_rate,  # positive / non_null_count (excludes nulls)
            'category': get_category(target)
        })

results_by_split_df = pd.DataFrame(results_by_split)
save_pandas_to_csv_adls(results_by_split_df, OUTPUT_PATH + "target_statistics_by_split.csv")

# Results by month
# 'count' holds the non-null count for each target, not the total number of rows
results_by_month = []
for date, splits_dict in target_stats_by_month.items():
    for split, targets_dict in splits_dict.items():
        for target, stats in targets_dict.items():
            non_null_count = stats['count']
            positive_count = int(stats['positive'])
            positive_rate = positive_count / non_null_count if non_null_count > 0 else 0
            results_by_month.append({
                'efectv_dt': date,
                'split': split,
                'target': target,
                'non_null_count': non_null_count,
                'positive_count': positive_count,
                'positive_rate': positive_rate,  # positive / non_null_count (excludes nulls)
                'category': get_category(target)
            })

results_by_month_df = pd.DataFrame(results_by_month)
save_pandas_to_csv_adls(results_by_month_df, OUTPUT_PATH + "target_statistics_by_month.csv")

print("✓ DataFrames created and saved")


## Visualizations and Analysis


## Success Criteria and Key Findings

### ✅ **Validation Passes If**:
- All 33 targets have statistics calculated
- Positive rates are reasonable (not all 0% or 100%)
- Class imbalance is quantified for each target
- Temporal trends show stability (no sudden jumps)
- All splits (train/valid/OOT) have data
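
The structural checks in this list could be automated along the following lines. This is a sketch under assumptions: `validation_report` is a hypothetical helper, and the input is assumed to have the `split`, `target`, and `positive_rate` columns of the by-split summary. The imbalance and temporal-stability criteria are not covered here.

```python
import pandas as pd


def validation_report(by_split, expected_targets, expected_splits=("train", "valid", "OOT")):
    """Return pass/fail flags for the structural validation checks."""
    seen_targets = set(by_split["target"])
    seen_splits = set(by_split["split"])
    rates = by_split["positive_rate"]
    return {
        "all_targets_present": seen_targets >= set(expected_targets),
        "all_splits_present": seen_splits >= set(expected_splits),
        # A rate of exactly 0 or 1 usually points to a degenerate label.
        "no_degenerate_rates": bool(((rates > 0) & (rates < 1)).all()),
    }


# Hypothetical two-target summary; target "b" has a 0% rate in train.
demo = pd.DataFrame({
    "split": ["train", "valid", "OOT", "train", "valid", "OOT"],
    "target": ["a", "a", "a", "b", "b", "b"],
    "positive_rate": [0.02, 0.021, 0.019, 0.0, 0.001, 0.001],
})
report = validation_report(demo, expected_targets=["a", "b"])
```

In this toy input the target and split checks pass, while `no_degenerate_rates` fails because target `b` has no positives in train.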

### 📊 **Key Metrics to Report**:
After running this notebook, document:
- **Highly imbalanced targets** (<1% positive rate): Count and list
- **Moderately imbalanced targets** (1-5% positive rate): Count and list
- **Distribution shifts**: Targets whose positive rate differs by more than 0.1 percentage points (pp) between train and OOT
- **Temporal anomalies**: Any targets with sudden rate changes over time
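
The imbalance buckets and the train-to-OOT shift flag can be derived from a target-by-split table of positive rates. The sketch below uses made-up rates and hypothetical target names. It assumes the buckets are half-open (`[0, 1%)`, `[1%, 5%)`, `5%+`), which the list above does not specify.

```python
import pandas as pd

# Hypothetical positive rates (fractions, not percentages) per target and split.
rates = pd.DataFrame(
    {"train": [0.004, 0.030, 0.080], "OOT": [0.006, 0.030, 0.0805]},
    index=["t_rare", "t_moderate", "t_common"],
)

# Bucket on the train rate; right=False makes each bin closed on the left.
rates["imbalance"] = pd.cut(
    rates["train"],
    bins=[0.0, 0.01, 0.05, float("inf")],
    labels=["high", "moderate", "mild"],
    right=False,
)

# Train-to-OOT shift in percentage points (1pp = 0.01 in rate units).
rates["oot_shift_pp"] = (rates["OOT"] - rates["train"]) * 100
rates["shifted"] = rates["oot_shift_pp"].abs() > 0.1
```

Counting and listing targets per bucket is then a matter of `rates.groupby("imbalance", observed=True).size()` and filtering on `rates["shifted"]`.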

### 📈 **Expected Findings**:
- Majority of targets will be highly imbalanced (<1% positive rate) - this is normal
- Cards targets typically have higher positive rates than RESL/ULON targets
- Sub-product targets (cc-aei, cc-cbe, etc.) usually have lower rates than core products
- OOT positive rates may differ slightly from in-time (temporal drift)

---


In [None]:
# Class imbalance analysis
imbalance_pivot = results_by_split_df.pivot(index='target', columns='split', values='positive_rate')
imbalance_pivot = imbalance_pivot[['train', 'valid', 'OOT']]
imbalance_pivot['category'] = imbalance_pivot.index.map(get_category)
imbalance_pivot['mean_positive_rate'] = imbalance_pivot[['train', 'valid', 'OOT']].mean(axis=1)
imbalance_pivot = imbalance_pivot.sort_values('mean_positive_rate')
# reset_index() keeps the target names; the helper writes with index=False
save_pandas_to_csv_adls(imbalance_pivot.reset_index(), OUTPUT_PATH + "class_imbalance_summary.csv")

# Create split comparison analysis
comparison = imbalance_pivot[['train', 'valid', 'OOT', 'category']].copy()
comparison['train_vs_valid_diff'] = comparison['valid'] - comparison['train']
comparison['train_vs_oot_diff'] = comparison['OOT'] - comparison['train']
comparison['valid_vs_oot_diff'] = comparison['OOT'] - comparison['valid']
comparison['train_vs_valid_pct'] = (comparison['train_vs_valid_diff'] / comparison['train']) * 100
comparison['train_vs_oot_pct'] = (comparison['train_vs_oot_diff'] / comparison['train']) * 100
comparison['valid_vs_oot_pct'] = (comparison['valid_vs_oot_diff'] / comparison['valid']) * 100
save_pandas_to_csv_adls(comparison.reset_index(), OUTPUT_PATH + "split_comparison_analysis.csv")

# Heatmap
fig, ax = plt.subplots(figsize=(10, 18))
heatmap_matrix = imbalance_pivot[['train', 'valid', 'OOT']]
sns.heatmap(heatmap_matrix, annot=True, fmt='.4f', cmap='YlOrRd', 
            cbar_kws={'label': 'Positive Rate'}, ax=ax, linewidths=0.5)
ax.set_title('Target Positive Rates by Split', fontsize=16, fontweight='bold', pad=20)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=9)
plt.tight_layout()
save_plot_to_adls(fig, OUTPUT_PATH + "positive_rate_heatmap_by_split.png", dpi=150)
plt.close(fig)

# Temporal trends by category
for category, targets in TARGET_CATEGORIES.items():
    cat_data = results_by_month_df[results_by_month_df['target'].isin(targets)]
    if len(cat_data) == 0:
        continue
    
    fig, axes = plt.subplots(3, 1, figsize=(16, 12))
    for idx, split in enumerate(['train', 'valid', 'OOT']):
        split_data = cat_data[cat_data['split'] == split]
        for target in targets:
            target_data = split_data[split_data['target'] == target].sort_values('efectv_dt')
            if len(target_data) > 0:
                axes[idx].plot(target_data['efectv_dt'], target_data['positive_rate'], 
                              marker='o', label=target, linewidth=2, markersize=4, alpha=0.7)
        axes[idx].set_xlabel('Effective Date', fontsize=11)
        axes[idx].set_ylabel('Positive Rate', fontsize=11)
        axes[idx].set_title(f'{category} - {split.upper()} Split', fontsize=12, fontweight='bold')
        axes[idx].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)
        axes[idx].grid(True, alpha=0.3)
        axes[idx].tick_params(axis='x', rotation=45)
    plt.tight_layout()
    save_plot_to_adls(fig, OUTPUT_PATH + f"temporal_trends_{category}.png", dpi=150)
    plt.close(fig)

print("✓ Visualizations complete")
print("✓ Target analysis complete")
