# Feature Profiling by Table (Pandas Version)

## Overview
Comprehensive feature profiling using **pandas** on a Spark-side sample of each table.

### Pandas Optimizations:
- **Spark-side sampling**: Each table is sampled in Spark (`SAMPLING_RATIO`) before conversion to pandas
- **Column pruning**: Only the features listed in the metadata (plus `efectv_dt`) are loaded
- **Memory efficient**: Process one table at a time with explicit cleanup (`del` + `gc.collect()`)

### Statistics Calculated:
- Data type, % zeros, n_unique
- Most frequent value and percentage
- Min, 1st percentile, median (50th percentile), 99th percentile, max, and mean

### Outputs:
- Feature profiling CSVs per table with **separate statistics for In-Time vs OOT**
  - Each feature has two rows: one for 'In-Time' period, one for 'OOT' period
  - Includes `time_period` column to distinguish periods
- Individual boxplots comparing OOT vs In-Time distributions (one PNG per numerical feature)
  - Saved in table-specific folders: `plots/{table_name}/`
  - Each file named: `{feature_name}.png`

---


In [None]:
%pip install --upgrade pandas==2 -i https://repo.td.com/repository/pypi-all/simple

In [None]:
dbutils.library.restartPython()

In [None]:
import pandas
print(pandas.__version__)

In [None]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from io import BytesIO
import gc

# Helper functions
def save_pandas_to_csv_adls(df_pandas, adls_path):
    csv_string = df_pandas.to_csv(index=False)
    dbutils.fs.put(adls_path, csv_string, overwrite=True)
    print(f"✓ Saved CSV to {adls_path}")

def save_plot_to_adls(fig, adls_path, dpi=150):
    """Render `fig` to a local temporary PNG, then copy it to ADLS."""
    import tempfile, os
    # Reserve a local temp path; the file is closed before savefig writes to it
    with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
        tmp_path = tmp.name
    try:
        fig.savefig(tmp_path, format='png', dpi=dpi, bbox_inches='tight')
        dbutils.fs.cp(f"file:{tmp_path}", adls_path)
    finally:
        # Remove the temp file even if rendering or the copy fails
        os.remove(tmp_path)
    print(f"✓ Saved plot to {adls_path}")

print("✓ Setup complete")


In [None]:
# Configuration
DATA_PATH = "abfss://home@edaaaazepcalayelaye0001.dfs.core.windows.net/MD_Artifacts/money-out/data/"
OUTPUT_PATH = "abfss://home@edaaaazepcalayelaye0001.dfs.core.windows.net/MD_Artifacts/money-out/mv/eda_validation/feature_profiling/"
PLOT_PATH = OUTPUT_PATH + "plots/"
dbutils.fs.mkdirs(OUTPUT_PATH)
dbutils.fs.mkdirs(PLOT_PATH)

SAMPLING_RATIO = 0.01
PLOT_SAMPLING_RATIO = 0.01
OOT_START_DATE = '2024-01-01'

# Feature tables to analyze
TABLES = [
    ("cust", "cust_basic_sumary", ''),
    ("cust", "batch_credit_bureau", ''),
    ("dem", "acct", 2438),
    ("cc", "acct", 2444),
    ("loc", "acct", 2442),
    ("loan", "acct", 2439),
    ("mtg", "acct", 2440),
    ("inv", "acct", 1331),
    ("dem", "acct_trans", 2438),
    ("cc", "acct_trans", 2444),
]

# Load metadata
feature_metadata_rows = spark.read.text(f"{DATA_PATH}/feature/feature_metadata.jsonl").collect()
feature_metadata = json.loads('\n'.join([row.value for row in feature_metadata_rows]))

print("✓ Config loaded")


## Processing Strategy: Sampled Full-Table (Accuracy Prioritized)

### Why This Approach?
This notebook calculates the **median and percentiles (p1, p99)**. Unlike the mean, min, or max, exact quantiles **cannot be combined from per-chunk results**: all values must be ranked together.

### Memory Efficiency:
- **Memory usage**: Scales with `SAMPLING_RATIO`, because all sampled rows of a table are held in pandas at once
- **Mitigation**: Process one table at a time (10 tables total), free memory between tables
- **If memory is tight**: Keep `SAMPLING_RATIO = 0.01` (1%), which keeps memory manageable while still giving a representative sample

### How It Works:
```
For each table (10 total):
  1. Load FULL table via Spark (efficient Parquet reading)
  2. Apply sampling at Spark level: .sample(fraction=SAMPLING_RATIO)
  3. Convert to pandas: .toPandas()
  4. Calculate accurate statistics:
     - median: df[col].median() ← requires sorted values
     - p99: df[col].quantile(0.99) ← requires percentile calculation
     - mean, min, max, n_unique, etc.
  5. Free memory before next table (del df; gc.collect())
```

### Why Incremental Doesn't Work Here:
- ❌ median(chunk1) + median(chunk2) ≠ median(all_data)
- ❌ p99(chunk1) combined with p99(chunk2) ≠ p99(all_data)
- ✅ Must see all sampled values together to calculate correct percentiles
- ✅ 1% sampling gives statistics that are exact for the sample and estimates for the full table
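
To make the point concrete, here is a minimal sketch with made-up chunk values (not data from these tables). It shows that per-chunk medians do not combine into the overall median, while the mean can be rebuilt from per-chunk sums and counts.

```python
import numpy as np

# Two illustrative chunks of one feature (hypothetical values).
chunk1 = np.array([1, 2, 3, 4, 100])
chunk2 = np.array([5, 6])
all_values = np.concatenate([chunk1, chunk2])

# Combining per-chunk medians does not recover the overall median.
median_of_medians = np.median([np.median(chunk1), np.median(chunk2)])
true_median = np.median(all_values)
print(median_of_medians)  # 4.25
print(true_median)        # 4.0

# The mean, by contrast, can be rebuilt from per-chunk sums and counts.
combined_mean = (chunk1.sum() + chunk2.sum()) / (len(chunk1) + len(chunk2))
print(np.isclose(combined_mean, all_values.mean()))  # True
```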

### Alternative Considered:
Could use approximate algorithms (T-Digest, Q-Digest) for streaming percentiles, but:
- ❌ Introduces approximation error
- ❌ Complex to implement and debug
- ✅ 1% sampling gives exact quantiles of the sample (only sampling error remains) with manageable memory
- ✅ Simpler code is easier to maintain
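
The size of the sampling error can be checked empirically. The sketch below uses synthetic lognormal data as a stand-in for a skewed feature (an assumption, not the real tables) and compares quantiles of a 1% sample with those of the full series.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed stand-in for a real feature column (assumption).
rng = np.random.default_rng(42)
full = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000))

# pandas analogue of the 1% sample the notebook draws in Spark.
sample = full.sample(frac=0.01, random_state=42)

quantiles = [0.01, 0.5, 0.99]
comparison = pd.DataFrame({
    "full": full.quantile(quantiles),
    "sample_1pct": sample.quantile(quantiles),
})
# Relative gap between the sample estimate and the full-data quantile.
comparison["rel_diff"] = (
    (comparison["sample_1pct"] - comparison["full"]).abs() / comparison["full"]
)
print(comparison)
```

Tail quantiles such as p99 typically show a larger relative gap than the median, because fewer sampled rows fall in the tails.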

---


## Success Criteria and Expected Results

### ✅ **Profiling Succeeds If**:
- All tables processed successfully
- Statistics calculated for **all features** (numerical + categorical) in metadata
- **Separate statistics** calculated for In-Time vs OOT periods
- No excessive missing values (>99.9%) unless expected
- Reasonable value ranges (no extreme outliers unless business-valid)
- Categorical features have reasonable cardinality
- **Time-period comparison** shows expected differences between In-Time and OOT

### 📊 **Statistics Calculated Per Feature (Per Time Period)**:
| Statistic | Numerical | Categorical | Notes |
|-----------|-----------|-------------|-------|
| time_period | ✓ | ✓ | 'In-Time' or 'OOT' |
| feature | ✓ | ✓ | Feature name |
| data_type | ✓ | ✓ | Identifies feature type |
| pct_zero | ✓ | ✓ | Fraction (0 to 1) of rows equal to 0; nulls count as non-zero |
| n_unique | ✓ | ✓ | Number of distinct non-null values |
| most_frequent_value | ✓ | ✓ | Mode |
| pct_most_frequent | ✓ | ✓ | Fraction (0 to 1) of non-null values equal to the mode |
| min | ✓ | ✓ | Minimum value |
| max | ✓ | ✓ | Maximum value |
| p1 | ✓ | ✗ | 1st percentile |
| median (p50) | ✓ | ✗ | 50th percentile |
| p99 | ✓ | ✗ | 99th percentile |
| mean | ✓ | ✗ | Average value |
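
As a compact illustration of the table above, the following sketch computes the same statistics for a single column. `profile_series` is a hypothetical helper written for this example, not a function defined in the notebook, and the input values are made up.

```python
import pandas as pd

def profile_series(s: pd.Series, is_categorical: bool) -> dict:
    """Hypothetical helper: statistics for one feature in one time period."""
    non_null = s.dropna()
    modes = non_null.mode()
    shares = non_null.value_counts(normalize=True)  # shares among non-null values
    stats = {
        "data_type": "categorical" if is_categorical else "numerical",
        "pct_zero": float((s == 0).mean()) if len(non_null) else None,
        "n_unique": int(non_null.nunique()),
        "most_frequent_value": modes.iloc[0] if len(modes) else None,
        "pct_most_frequent": float(shares.iloc[0]) if len(shares) else None,
    }
    if not is_categorical:
        # Coerce to numeric so Decimal or string-encoded numbers are handled.
        x = pd.to_numeric(non_null, errors="coerce").dropna()
        if len(x):
            stats.update(
                min=float(x.min()), p1=float(x.quantile(0.01)),
                median=float(x.median()), p99=float(x.quantile(0.99)),
                max=float(x.max()), mean=float(x.mean()),
            )
    return stats

amounts = pd.Series([0, 0, 5, 10, None])  # made-up values, one null
result = profile_series(amounts, is_categorical=False)
print(result)
```

Note that `pct_zero` is taken over all rows (2 of 5 here), while `pct_most_frequent` is taken over non-null rows only (2 of 4).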

### 📈 **Time-Period Analysis**:
Each feature has **two rows** in the output CSV:
- **Row 1**: Statistics for 'In-Time' period (before 2024-01-01)
- **Row 2**: Statistics for 'OOT' period (2024-01-01 and after)
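
A minimal sketch of the period assignment, using made-up dates. One detail worth knowing: a missing `efectv_dt` compares as false against the cutoff, so such rows land in 'In-Time'.

```python
import numpy as np
import pandas as pd

OOT_START_DATE = "2024-01-01"

# Made-up effective dates, including one missing value.
dates = pd.to_datetime(pd.Series(["2023-12-31", "2024-01-01", "2024-06-30", None]))

# On or after the cutoff is OOT; everything else (including NaT) is In-Time.
periods = np.where(dates >= pd.Timestamp(OOT_START_DATE), "OOT", "In-Time")
print(periods.tolist())  # ['In-Time', 'OOT', 'OOT', 'In-Time']
```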

**What to Compare**:
- **Distributions**: Compare median, mean, percentiles between periods
- **Sparsity**: Compare pct_zero (may increase/decrease over time)
- **Cardinality**: Compare n_unique for categorical features
- **Ranges**: Compare min/max values (may indicate data quality issues)
- **Mode**: Compare most_frequent_value (distribution shifts)
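
One way to do this comparison is to pivot the two rows per feature into side-by-side columns. The sketch below assumes the output CSV schema described above; the feature names and numbers are invented for illustration.

```python
import pandas as pd

# Stand-in for a loaded profiling CSV (hypothetical features and values).
profile = pd.DataFrame([
    {"time_period": "In-Time", "feature": "bal_avg", "median": 120.0, "pct_zero": 0.10},
    {"time_period": "OOT",     "feature": "bal_avg", "median": 180.0, "pct_zero": 0.12},
    {"time_period": "In-Time", "feature": "txn_cnt", "median": 4.0,   "pct_zero": 0.30},
    {"time_period": "OOT",     "feature": "txn_cnt", "median": 4.0,   "pct_zero": 0.55},
])

# One row per feature, with (statistic, period) as column levels.
wide = profile.pivot(index="feature", columns="time_period",
                     values=["median", "pct_zero"])

# OOT minus In-Time for every statistic.
delta = (wide.xs("OOT", axis=1, level="time_period")
         - wide.xs("In-Time", axis=1, level="time_period"))
print(delta)
```

Here `bal_avg` shows a median shift and `txn_cnt` shows a sparsity change, which are the two patterns the list above asks you to look for.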

### ⚠️ **Potential Issues to Flag**:
- **Features with >99% zeros** (may be redundant or sparse)
- **Features with only 1 unique value** (constant features - no information)
- **Features with extreme ranges** (may need normalization or clipping)
- **Categorical features with very high cardinality** (>1000 categories - may need bucketing)
- **Features missing from certain tables** (expected for table-specific features)
- **Large differences between In-Time and OOT**:
  - Significant shifts in median/mean (potential drift)
  - Large changes in pct_zero (sparsity changes)
  - Cardinality changes in categoricals (new categories appear/disappear)
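
Several of these checks can be automated against the profiling output. The sketch below is illustrative only: the rows are invented, and the thresholds are the ones quoted in the list above.

```python
import pandas as pd

# Stand-in for one period of a profiling CSV (hypothetical rows).
profile = pd.DataFrame([
    {"feature": "f_sparse", "data_type": "numerical",   "pct_zero": 0.995, "n_unique": 12},
    {"feature": "f_const",  "data_type": "numerical",   "pct_zero": 0.0,   "n_unique": 1},
    {"feature": "f_city",   "data_type": "categorical", "pct_zero": 0.0,   "n_unique": 4500},
    {"feature": "f_ok",     "data_type": "numerical",   "pct_zero": 0.10,  "n_unique": 300},
])

# pct_zero is stored as a fraction, so ">99% zeros" is pct_zero > 0.99.
flags = pd.DataFrame({
    "feature": profile["feature"],
    "mostly_zero": profile["pct_zero"] > 0.99,
    "constant": profile["n_unique"] <= 1,
    "high_cardinality": (profile["data_type"] == "categorical")
                        & (profile["n_unique"] > 1000),
})

# Keep only features that trip at least one check.
check_cols = ["mostly_zero", "constant", "high_cardinality"]
flagged = flags[flags[check_cols].any(axis=1)]
print(flagged)
```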

### 📊 **Boxplot Visualizations**:
- **One boxplot per numerical feature** comparing In-Time vs OOT
- Saved individually: `plots/{table_name}/{feature_name}.png`
- **What to look for**:
  - Distribution shifts (boxes at different positions)
  - Spread changes (box height differences)
  - Outlier patterns (fliers in different locations)
  - Median differences (horizontal line position)
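
For reference, the elements of each box can be reproduced numerically. The sketch below assumes Matplotlib's default whisker rule (1.5 × IQR); `box_summary` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def box_summary(values):
    """Hypothetical helper: box, whisker, and flier counts under the 1.5 * IQR rule."""
    x = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme points still inside the limits.
    inside = x[(x >= low_limit) & (x <= high_limit)]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "n_fliers": int(x.size - inside.size),
    }

summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
# q1=3.25, median=5.5, q3=7.75, whiskers at 1.0 and 9.0, one flier (100)
```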

---


In [None]:
print("="*80)
print("FEATURE PROFILING")
print("="*80)

for fam_name, table, fam in TABLES:
    print(f"\nProcessing: {fam_name}-{table}")
    
    table_path = f"{DATA_PATH}/feature/{table}/parquet" if not fam else f"{DATA_PATH}/feature/{table}_{fam}/parquet"
    table_meta_key = table if not fam else f"{table}_{fam}"
    
    if fam_name not in feature_metadata or table_meta_key not in feature_metadata[fam_name]:
        print(f"  Skipped: no metadata for {fam_name}/{table_meta_key}")
        continue
    
    num_features = feature_metadata[fam_name][table_meta_key].get("num_features", [])
    cat_features = list(feature_metadata[fam_name][table_meta_key].get("cat_features", {}).keys())
    
    # Load via spark then convert to pandas (more efficient for parquet)
    df_spark = spark.read.format("parquet").load(table_path)
    if SAMPLING_RATIO < 1.0:
        df_spark = df_spark.sample(fraction=SAMPLING_RATIO, withReplacement=False, seed=42)
    
    # Load with efectv_dt if available for time-period splitting
    cols_to_load = [c for c in (num_features + cat_features) if c in df_spark.columns]
    if 'efectv_dt' in df_spark.columns:
        cols_to_load = ['efectv_dt'] + cols_to_load
    
    df = df_spark.select(cols_to_load).toPandas()
    
    # Split into In-Time and OOT periods if efectv_dt is available
    if 'efectv_dt' in df.columns:
        df['efectv_dt'] = pd.to_datetime(df['efectv_dt'])
        # Vectorized period assignment; rows with a missing date fall into 'In-Time'
        df['time_period'] = np.where(
            df['efectv_dt'] >= pd.to_datetime(OOT_START_DATE), 'OOT', 'In-Time'
        )
        df_intime = df[df['time_period'] == 'In-Time'].copy()
        df_oot = df[df['time_period'] == 'OOT'].copy()
        
        # Profile features separately for In-Time and OOT
        all_stats = []
        
        for period_name, period_df in [('In-Time', df_intime), ('OOT', df_oot)]:
            for feature in num_features + cat_features:
                if feature not in period_df.columns:
                    continue
                is_cat = feature in cat_features
                
                # Skip if no data in this period
                if len(period_df) == 0:
                    continue
                
                stats = {
                    'time_period': period_name,
                    'feature': feature,
                    'data_type': 'categorical' if is_cat else 'numerical',
                    'pct_zero': (period_df[feature] == 0).mean() if len(period_df[feature].dropna()) > 0 else None,
                    'n_unique': period_df[feature].nunique(),
                    'most_frequent_value': period_df[feature].mode()[0] if len(period_df[feature].mode()) > 0 else None,
                    'pct_most_frequent': period_df[feature].value_counts(normalize=True).iloc[0] if len(period_df[feature].dropna()) > 0 else None,
                }
                
                if not is_cat:
                    feature_clean = period_df[feature].dropna()
                    if len(feature_clean) > 0:
                        # Convert to float to handle Decimal types
                        feature_clean = pd.to_numeric(feature_clean, errors='coerce').dropna()
                        if len(feature_clean) > 0:
                            stats.update({
                                'min': float(feature_clean.min()),
                                'p1': float(feature_clean.quantile(0.01)),
                                'median': float(feature_clean.median()),
                                'p99': float(feature_clean.quantile(0.99)),
                                'max': float(feature_clean.max()),
                                'mean': float(feature_clean.mean())
                            })
                        else:
                            stats.update({
                                'min': None, 'p1': None, 'median': None,
                                'p99': None, 'max': None, 'mean': None
                            })
                    else:
                        stats.update({
                            'min': None, 'p1': None, 'median': None,
                            'p99': None, 'max': None, 'mean': None
                        })
                else:
                    feature_clean = period_df[feature].dropna()
                    if len(feature_clean) > 0:
                        # Convert to numeric for min/max to handle Decimal types
                        min_val = max_val = None
                        try:
                            min_val = feature_clean.min()
                            max_val = feature_clean.max()
                            # Convert numeric-like values (e.g. Decimal) to float
                            if hasattr(min_val, '__float__'):
                                min_val = float(min_val)
                            if hasattr(max_val, '__float__'):
                                max_val = float(max_val)
                        except Exception:
                            # Mixed or unconvertible types: keep whatever was computed, as text
                            min_val = str(min_val) if min_val is not None else None
                            max_val = str(max_val) if max_val is not None else None
                        stats.update({
                            'min': min_val,
                            'max': max_val,
                            'p1': None, 'median': None, 'p99': None, 'mean': None
                        })
                    else:
                        stats.update({
                            'min': None, 'max': None,
                            'p1': None, 'median': None, 'p99': None, 'mean': None
                        })
                
                all_stats.append(stats)
        
        # Save profiling results with time_period column
        if all_stats:
            results_df = pd.DataFrame(all_stats)
            # Reorder columns: time_period first, then feature, then rest
            col_order = ['time_period', 'feature', 'data_type'] + [c for c in results_df.columns if c not in ['time_period', 'feature', 'data_type']]
            results_df = results_df[col_order]
            save_pandas_to_csv_adls(results_df, f"{OUTPUT_PATH}feature_profile_{fam_name}_{table}.csv")
    else:
        # No efectv_dt column - profile on all data together (fallback)
        all_stats = []
        for feature in num_features + cat_features:
            if feature not in df.columns:
                continue
            is_cat = feature in cat_features
            stats = {
                'time_period': 'All',
                'feature': feature,
                'data_type': 'categorical' if is_cat else 'numerical',
                'pct_zero': (df[feature] == 0).mean(),
                'n_unique': df[feature].nunique(),
                'most_frequent_value': df[feature].mode()[0] if len(df[feature].mode()) > 0 else None,
                'pct_most_frequent': df[feature].value_counts(normalize=True).iloc[0] if len(df[feature].dropna()) > 0 else None,
            }
            if not is_cat:
                # Convert to float to handle Decimal types
                feature_data = pd.to_numeric(df[feature], errors='coerce').dropna()
                if len(feature_data) > 0:
                    stats.update({
                        'min': float(feature_data.min()),
                        'p1': float(feature_data.quantile(0.01)),
                        'median': float(feature_data.median()),
                        'p99': float(feature_data.quantile(0.99)),
                        'max': float(feature_data.max()),
                        'mean': float(feature_data.mean())
                    })
                else:
                    stats.update({
                        'min': None, 'p1': None, 'median': None,
                        'p99': None, 'max': None, 'mean': None
                    })
            else:
                # For categorical, convert min/max to handle Decimal types
                non_null = df[feature].dropna()
                min_val = max_val = None
                try:
                    if len(non_null) > 0:
                        min_val = non_null.min()
                        max_val = non_null.max()
                    # Convert numeric-like values (e.g. Decimal) to float
                    if hasattr(min_val, '__float__'):
                        min_val = float(min_val)
                    if hasattr(max_val, '__float__'):
                        max_val = float(max_val)
                except Exception:
                    # Mixed or unconvertible types: keep whatever was computed, as text
                    min_val = str(min_val) if min_val is not None else None
                    max_val = str(max_val) if max_val is not None else None
                stats.update({
                    'min': min_val,
                    'max': max_val,
                    'p1': None, 'median': None, 'p99': None, 'mean': None
                })
            all_stats.append(stats)
        
        # Save profiling results
        if all_stats:
            results_df = pd.DataFrame(all_stats)
            col_order = ['time_period', 'feature', 'data_type'] + [c for c in results_df.columns if c not in ['time_period', 'feature', 'data_type']]
            results_df = results_df[col_order]
            save_pandas_to_csv_adls(results_df, f"{OUTPUT_PATH}feature_profile_{fam_name}_{table}.csv")
    
    # Create boxplots (OOT vs In-Time comparison) - One plot per feature
    if 'efectv_dt' in df.columns:
        # Ensure efectv_dt is datetime and convert OOT_START_DATE for comparison
        if not pd.api.types.is_datetime64_any_dtype(df['efectv_dt']):
            df['efectv_dt'] = pd.to_datetime(df['efectv_dt'])
        oot_date = pd.to_datetime(OOT_START_DATE)
        # Vectorized period assignment; rows with a missing date fall into 'In-Time'
        df['time_period'] = np.where(df['efectv_dt'] >= oot_date, 'OOT', 'In-Time')
        
        # Create folder for this table's plots
        table_folder_name = f"{fam_name}_{table}" if not fam else f"{fam_name}_{table}_{fam}"
        table_plot_folder = f"{PLOT_PATH}{table_folder_name}/"
        dbutils.fs.mkdirs(table_plot_folder)
        
        # Get ALL numerical features for plotting
        plot_features = [f for f in num_features if f in df.columns]
        
        if len(plot_features) > 0:
            print(f"  Creating {len(plot_features)} individual boxplots...")
            saved_count = 0
            failed_count = 0
            
            for feature in plot_features:
                fig = None
                try:
                    # Create individual figure for each feature
                    fig, ax = plt.subplots(figsize=(10, 6))
                    
                    # Prepare data for boxplot
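                    # Coerce to numeric so Decimal/object columns can be plotted
                    intime_data = pd.to_numeric(df.loc[df['time_period'] == 'In-Time', feature], errors='coerce').dropna()
                    oot_data = pd.to_numeric(df.loc[df['time_period'] == 'OOT', feature], errors='coerce').dropna()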
                    
                    if len(intime_data) > 0 and len(oot_data) > 0:
                        # Create boxplot data structure
                        plot_data = {
                            'In-Time': intime_data,
                            'OOT': oot_data
                        }
                        
                        # Create boxplot
                        bp = ax.boxplot([plot_data['In-Time'], plot_data['OOT']], 
                                       labels=['In-Time', 'OOT'], 
                                       vert=True, patch_artist=True,
                                       showmeans=False, showfliers=True)
                        
                        # Style the boxes with colors
                        colors = ['lightblue', 'lightcoral']
                        for patch, color in zip(bp['boxes'], colors):
                            patch.set_facecolor(color)
                            patch.set_alpha(0.6)
                        
                        ax.set_title(f'{feature}\n({table_folder_name})', 
                                   fontsize=12, fontweight='bold')
                        ax.set_ylabel('Value', fontsize=10)
                        ax.set_xlabel('Time Period', fontsize=10)
                        ax.grid(True, alpha=0.2, axis='y', linestyle='--')
                        
                        plt.tight_layout()
                        
                        # Save individual plot (no display - saved directly to ADLS)
                        plot_file = f"{table_plot_folder}{feature}.png"
                        save_plot_to_adls(fig, plot_file, dpi=150)
                        plt.close(fig)  # Explicitly close to free memory
                        fig = None  # Prevent double-close
                        saved_count += 1
                    else:
                        if fig is not None:
                            plt.close(fig)
                        failed_count += 1
                        print(f"    Warning: Skipped {feature} (insufficient data)")
                        
                except Exception as e:
                    if fig is not None:
                        plt.close(fig)
                    failed_count += 1
                    print(f"    Warning: Could not plot {feature}: {str(e)}")
            
            print(f"  ✓ Boxplots saved: {saved_count} successful, {failed_count} failed")
            print(f"    Location: {table_plot_folder}")
    
    del df, df_spark
    gc.collect()

print("\n✓ Feature profiling complete")
