# Amex 1st Place Pipeline - Step 1: Feature Engineering

This notebook replicates the core feature engineering from the 1st place solution (`S1_denoise.py` and `S2_manual_feature.py`). 

**Pipeline:**
1.  **Load Raw Data**: Load `train_optimized.parquet` and `test_optimized.parquet`; the `S1` denoising is applied inside this notebook rather than read from pre-denoised files.
2.  **Define Feature Types**: Identify `CAT_FEATURES` and `NUM_FEATURES`.
3.  **Create `lastk` Features**: Generate aggregate features for the **last 3** and **last 6** statements.
4.  **Create `rank` Features**: Generate within-customer percentile rank features.
5.  **Create `diff` Features**: Generate difference-from-lag features.
6.  **Create Main Aggregates**: Generate aggregates over the *entire* history.
7.  **Combine & Save**: Merge all feature sets and save to `train_fe.parquet` and `test_fe.parquet`.
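
As a rough illustration of the `lastk` idea in the list above, the sketch below keeps the `k` most recent statements per customer by ranking `S_2` in descending order and then aggregates only those rows. The toy frame and the helper name `last_k_statements` are made up for this example; only `customer_ID`, `S_2`, and the ranking approach come from the notebook.

```python
import pandas as pd

# Toy statement history: customer "a" has 4 statements, customer "b" has 2.
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime([
        "2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01",
        "2018-03-01", "2018-04-01",
    ]),
    "P_2": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
})

def last_k_statements(df, k):
    """Hypothetical helper: keep the k most recent rows per customer."""
    # Rank 1 is the most recent statement within each customer.
    recency = df.groupby("customer_ID")["S_2"].rank(ascending=False)
    return df.loc[recency <= k]

last3 = last_k_statements(toy, 3)
last3_mean = last3.groupby("customer_ID")["P_2"].mean()
```

Customers with fewer than `k` statements simply contribute all of their rows, so every customer still appears in the aggregated output.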

In [2]:
import pandas as pd
import numpy as np
import gc
import os
import time
import warnings
from tqdm.auto import tqdm

warnings.simplefilter('ignore')
pd.set_option('display.max_columns', 500)



## Step 1: Define Paths & Feature Lists

In [None]:
# --- Paths ---
# NOTE: The original pipeline reads the denoised parquet files written by
# S1_denoise.py. This notebook instead loads the 'optimized' files from our
# v3 notebook and applies the same denoising in Step 2.

PARQUET_DATA_DIR = '../data_parquet/'
CSV_DATA_DIR = '../data/'
FE_OUTPUT_DIR = '../data_fe/' # New directory to save our features

os.makedirs(FE_OUTPUT_DIR, exist_ok=True)

TRAIN_IN_PATH = os.path.join(PARQUET_DATA_DIR, 'train_optimized.parquet')
TEST_IN_PATH = os.path.join(PARQUET_DATA_DIR, 'test_optimized.parquet') 
LABELS_PATH = os.path.join(CSV_DATA_DIR, 'train_labels.csv')

TRAIN_OUT_PATH = os.path.join(FE_OUTPUT_DIR, 'train_fe.parquet')
TEST_OUT_PATH = os.path.join(FE_OUTPUT_DIR, 'test_fe.parquet')


# --- Features ---
CAT_FEATURES = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
# We will define NUM_FEATURES after loading



## Step 2: Denoising & Pre-Feature Engineering (S1 & S2 logic)

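This function applies the 1st place solution's key ideas *before* aggregation:
1.  **Denoising**: Scales each numeric feature by 100 and floors it (`np.floor(x * 100)`), discarding everything past the second decimal place.
2.  **Categorical Mapping**: Converts the string codes in `D_63`/`D_64` to small integers; missing or unknown values become `-1`.
3.  **Lag/Diff Features**: Calculates the per-customer first difference, `diff(1)`, of every numeric feature.
4.  **Rank Features**: Calculates the per-customer percentile rank, `rank(pct=True)`, of every numeric feature.

Both `diff(1)` and the later `last` aggregates depend on row order, so the input is assumed to be sorted by `S_2` within each customer.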
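
To make the three numeric transforms concrete, here is a small sketch on a made-up frame. The values are invented; `P_2` simply stands in for any numeric feature, and the calls match the ones used in the function.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "P_2": [0.1234, 0.1299, 0.4501, 0.9876, 0.5012],
})

# Denoise: scale by 100 and floor, so 0.1234 and 0.1299 both become 12.0.
toy["P_2"] = np.floor(toy["P_2"] * 100)

# First difference within each customer; each customer's first row is NaN.
toy["diff_P_2"] = toy.groupby("customer_ID")["P_2"].diff(1)

# Percentile rank within each customer; tied values share their average rank.
toy["rank_P_2"] = toy.groupby("customer_ID")["P_2"].rank(pct=True)
```

Note that the rank is computed on the already-floored values, so two statements that differed only past the second decimal place end up tied.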

In [4]:
def denoise_and_pre_feature_engineer(df, num_features):
    print("Starting denoising and pre-feature engineering...")
    
    # 1. Denoise (from S1_denoise.py)
    print("Applying 'denoising' (floor * 100)...")
    for col in tqdm(num_features):
        df[col] = np.floor(df[col] * 100)
    
    # 2. Categorical Mapping (from S1_denoise.py)
    print("Applying manual categorical mapping...")
    df['D_63'] = df['D_63'].apply(lambda t: {'CR':0, 'XZ':1, 'XM':2, 'CO':3, 'CL':4, 'XL':5, np.nan:-1}.get(t, -1)).astype(np.int8)
    df['D_64'] = df['D_64'].apply(lambda t: {np.nan:-1, 'O':0, '-1':1, 'R':2, 'U':3}.get(t, -1)).astype(np.int8)
    
    # 3. Lag/Diff Features (from S2_manual_feature.py)
    print("Creating lag/diff features...")
    df_diff = df.groupby('customer_ID')[num_features].diff(1).add_prefix('diff_')
    # diff() keeps df's index, so customer_ID can be assigned directly (index-aligned)
    df_diff['customer_ID'] = df['customer_ID']
    
    # 4. Rank Features (from S2_manual_feature.py)
    print("Creating within-customer rank features...")
    df_rank = df.groupby('customer_ID')[num_features].rank(pct=True).add_prefix('rank_')
    df_rank['customer_ID'] = df['customer_ID']  # Add customer_ID back for grouping later
    
    print("Pre-feature engineering complete.")
    return df, df_diff, df_rank

## Step 3: Aggregation Function (S2 Logic)

This function aggregates the pre-engineered features. It's designed to be called for the full dataset, the last 3 statements, and the last 6 statements.
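
The column handling in this function can be sketched in isolation. Passing a list of functions to `agg` yields two-level column labels such as `("P_2", "mean")`, which are joined into flat names; the prefix is then added to every column except the key. The toy data and the helper `prefix_except_key` are illustrative only and are not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "b"],
    "P_2": [12.0, 45.0, 98.0],
    "B_1": [1.0, 3.0, 5.0],
})

agg = toy.groupby("customer_ID")[["P_2", "B_1"]].agg(["mean", "last"])
# Columns are a MultiIndex such as ("P_2", "mean"); join the levels into flat names.
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()

def prefix_except_key(df, prefix, key="customer_ID"):
    """Hypothetical helper: prefix every column except the join key."""
    return df.rename(columns={c: f"{prefix}{c}" for c in df.columns if c != key})

agg_last3 = prefix_except_key(agg, "last3_")
```

Renaming only the non-key columns gives the same result as the function's `add_prefix` followed by renaming `customer_ID` back, in a single step.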

In [5]:
def process_and_aggregate(df, df_diff, df_rank, num_features, prefix=''):
    print(f"Starting aggregation with prefix: '{prefix}'...")
    
    # Define features for this aggregation
    diff_features = [f'diff_{col}' for col in num_features]
    rank_features = [f'rank_{col}' for col in num_features]

    # Group by customer_ID
    df_grouped = df.groupby("customer_ID")
    df_diff_grouped = df_diff.groupby("customer_ID")
    df_rank_grouped = df_rank.groupby("customer_ID")

    # 1. Numerical Aggregates (mean, std, min, max, sum, last)
    print("Aggregating numerical features...")
    num_agg = df_grouped[num_features].agg(['mean', 'std', 'min', 'max', 'sum', 'last'])
    num_agg.columns = ['_'.join(x) for x in num_agg.columns]
    
    # 2. Diff Aggregates (mean, std, min, max, sum, last)
    print("Aggregating diff features...")
    diff_agg = df_diff_grouped[diff_features].agg(['mean', 'std', 'min', 'max', 'sum', 'last'])
    diff_agg.columns = ['_'.join(x) for x in diff_agg.columns]
    
    # 3. Rank Aggregates (last)
    print("Aggregating rank features...")
    rank_agg = df_rank_grouped[rank_features].agg(['last'])
    rank_agg.columns = ['_'.join(x) for x in rank_agg.columns]
    
    # 4. Categorical Aggregates (last, nunique, count)
    print("Aggregating categorical features...")
    cat_agg = df_grouped[CAT_FEATURES].agg(['last', 'nunique', 'count'])
    cat_agg.columns = ['_'.join(x) for x in cat_agg.columns]
    
    # 5. Statement Count
    count_agg = df_grouped[['S_2']].agg(['count'])
    count_agg.columns = ['statement_count']

    # Combine all aggregates
    print("Combining aggregates...")
    df_agg = pd.concat([num_agg, diff_agg, rank_agg, cat_agg, count_agg], axis=1).reset_index()
    
    # Add prefix (e.g., 'last3_') to all columns except customer_ID
    if prefix != '':
        df_agg = df_agg.add_prefix(prefix)
        df_agg = df_agg.rename(columns={f'{prefix}customer_ID': 'customer_ID'}) 
    
    print(f"Aggregation for '{prefix}' complete. Shape: {df_agg.shape}")
    return df_agg

## Step 4: Process Full Training Data

This cell executes the full S1 -> S2 pipeline for the **training data**.
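
The cell joins the per-customer feature tables and the labels on `customer_ID`. One optional way to guard those joins, assuming every table holds exactly one row per customer, is the `validate` argument of `DataFrame.merge`. The tiny frames and the `target` column name below are assumptions made for this sketch.

```python
import pandas as pd

agg_all = pd.DataFrame({"customer_ID": ["a", "b"], "P_2_mean": [28.5, 98.0]})
agg_last3 = pd.DataFrame({"customer_ID": ["a", "b"], "last3_P_2_mean": [45.0, 98.0]})
labels = pd.DataFrame({"customer_ID": ["a", "b"], "target": [0, 1]})

# validate= raises pandas.errors.MergeError if a key is duplicated on either side.
features = agg_all.merge(agg_last3, on="customer_ID", how="inner", validate="one_to_one")
features = features.merge(labels, on="customer_ID", how="left", validate="one_to_one")

# An inner join silently drops unmatched customers, so compare row counts.
assert len(features) == len(agg_all)
```

The check costs little compared with the aggregations and catches duplicated keys before they inflate the row count.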

In [6]:
print(f"Loading full training data from {TRAIN_IN_PATH}...")
start_time = time.time()
train_df = pd.read_parquet(TRAIN_IN_PATH)
print(f"Train data loaded in {time.time() - start_time:.2f}s. Shape: {train_df.shape}")

# Define NUM_FEATURES now that we have the df
all_cols = [c for c in list(train_df.columns) if c not in ['customer_ID', 'S_2']]
NUM_FEATURES = [col for col in all_cols if col not in CAT_FEATURES]

# Apply S1 Denoising & S2 Pre-FE
train_denoised, train_diff, train_rank = denoise_and_pre_feature_engineer(train_df, NUM_FEATURES)
del train_df; gc.collect()

# --- S2 Aggregations ---
print("--- Starting All Aggregations for TRAIN ---")

# 1. Aggregate ALL statements (prefix='')
train_agg_all = process_and_aggregate(train_denoised, train_diff, train_rank, NUM_FEATURES, prefix='')

# 2. Aggregate LAST 3 statements (prefix='last3_')
# The same row mask is applied to the diff/rank frames (they share train_denoised's
# index); otherwise the 'last3_' diff aggregates would repeat the full-history ones.
train_denoised['rank'] = train_denoised.groupby('customer_ID')['S_2'].rank(ascending=False)
mask3 = train_denoised['rank'] <= 3
train_last3 = train_denoised.loc[mask3].reset_index(drop=True)
train_agg_last3 = process_and_aggregate(train_last3, train_diff.loc[mask3], train_rank.loc[mask3], NUM_FEATURES, prefix='last3_')
del train_last3, mask3; gc.collect()

# 3. Aggregate LAST 6 statements (prefix='last6_')
mask6 = train_denoised['rank'] <= 6
train_last6 = train_denoised.loc[mask6].reset_index(drop=True)
train_agg_last6 = process_and_aggregate(train_last6, train_diff.loc[mask6], train_rank.loc[mask6], NUM_FEATURES, prefix='last6_')
del train_last6, mask6, train_denoised, train_diff, train_rank; gc.collect()

# --- Combine all feature sets ---
print("Combining all TRAIN feature sets...")
X_train = train_agg_all.merge(train_agg_last3, how='inner', on='customer_ID')
X_train = X_train.merge(train_agg_last6, how='inner', on='customer_ID')

print("Loading and merging labels...")
train_labels = pd.read_csv(LABELS_PATH)
X_train = X_train.merge(train_labels, on='customer_ID', how='left')

print(f"Full training set ready. Shape: {X_train.shape}")

print(f"Saving processed train features to {TRAIN_OUT_PATH}...")
X_train.to_parquet(TRAIN_OUT_PATH, index=False)

del X_train, train_labels, train_agg_all, train_agg_last3, train_agg_last6
gc.collect()
print(f"Total time for train processing: {time.time() - start_time:.2f}s")

Loading full training data from ../data_parquet/train_optimized.parquet...
Train data loaded in 4.38s. Shape: (5531451, 190)
Starting denoising and pre-feature engineering...
Applying 'denoising' (floor * 100)...


100%|██████████| 177/177 [00:03<00:00, 57.41it/s]


Applying manual categorical mapping...
Creating lag/diff features...
Creating within-customer rank features...
Pre-feature engineering complete.
--- Starting All Aggregations for TRAIN ---
Starting aggregation with prefix: ''...
Aggregating numerical features...
Aggregating diff features...
Aggregating rank features...
Aggregating categorical features...
Combining aggregates...
Aggregation for '' complete. Shape: (458913, 2336)
Starting aggregation with prefix: 'last3_'...
Aggregating numerical features...
Aggregating diff features...
Aggregating rank features...
Aggregating categorical features...
Combining aggregates...
Aggregation for 'last3_' complete. Shape: (458913, 2336)
Starting aggregation with prefix: 'last6_'...
Aggregating numerical features...
Aggregating diff features...
Aggregating rank features...
Aggregating categorical features...
Combining aggregates...
Aggregation for 'last6_' complete. Shape: (458913, 2336)
Combining all TRAIN feature sets...
Loading and merging la

## Step 5: Process Test Data

This cell executes the full S1 -> S2 pipeline for the **test data**.
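
Because the test features are built with the `NUM_FEATURES` list derived from the training set, the two output files should end up with the same feature columns, apart from the label. A minimal sketch of such a check follows; the helper name and the `target` label column are assumptions, not part of the original pipeline.

```python
import pandas as pd

def check_feature_parity(train_cols, test_cols, label_col="target"):
    """Hypothetical helper: return (missing_in_test, extra_in_test), ignoring the label."""
    train_set = set(train_cols) - {label_col}
    test_set = set(test_cols)
    return sorted(train_set - test_set), sorted(test_set - train_set)

# Empty frames are enough here, since only the column names are compared.
train_fe = pd.DataFrame(columns=["customer_ID", "P_2_mean", "last3_P_2_mean", "target"])
test_fe = pd.DataFrame(columns=["customer_ID", "P_2_mean", "last3_P_2_mean"])

missing, extra = check_feature_parity(train_fe.columns, test_fe.columns)
```

Two empty lists mean a model trained on the train file can score the test file without any column realignment.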

In [7]:
print(f"Loading full test data from {TEST_IN_PATH}...")
start_time = time.time()
test_df = pd.read_parquet(TEST_IN_PATH)
print(f"Test data loaded in {time.time() - start_time:.2f}s. Shape: {test_df.shape}")

# Apply S1 Denoising & S2 Pre-FE
# Note: We use the *same* NUM_FEATURES list defined from the train set
test_denoised, test_diff, test_rank = denoise_and_pre_feature_engineer(test_df, NUM_FEATURES)
del test_df; gc.collect()

# --- S2 Aggregations ---
print("--- Starting All Aggregations for TEST ---")

# 1. Aggregate ALL statements (prefix='')
test_agg_all = process_and_aggregate(test_denoised, test_diff, test_rank, NUM_FEATURES, prefix='')

# 2. Aggregate LAST 3 statements (prefix='last3_')
# The same row mask is applied to the diff/rank frames (they share test_denoised's
# index); otherwise the 'last3_' diff aggregates would repeat the full-history ones.
test_denoised['rank'] = test_denoised.groupby('customer_ID')['S_2'].rank(ascending=False)
mask3 = test_denoised['rank'] <= 3
test_last3 = test_denoised.loc[mask3].reset_index(drop=True)
test_agg_last3 = process_and_aggregate(test_last3, test_diff.loc[mask3], test_rank.loc[mask3], NUM_FEATURES, prefix='last3_')
del test_last3, mask3; gc.collect()

# 3. Aggregate LAST 6 statements (prefix='last6_')
mask6 = test_denoised['rank'] <= 6
test_last6 = test_denoised.loc[mask6].reset_index(drop=True)
test_agg_last6 = process_and_aggregate(test_last6, test_diff.loc[mask6], test_rank.loc[mask6], NUM_FEATURES, prefix='last6_')
del test_last6, mask6, test_denoised, test_diff, test_rank; gc.collect()

# --- Combine all feature sets ---
print("Combining all TEST feature sets...")
X_test = test_agg_all.merge(test_agg_last3, how='inner', on='customer_ID')
X_test = X_test.merge(test_agg_last6, how='inner', on='customer_ID')

print(f"Full test set ready. Shape: {X_test.shape}")

print(f"Saving processed test features to {TEST_OUT_PATH}...")
X_test.to_parquet(TEST_OUT_PATH, index=False)

del X_test, test_agg_all, test_agg_last3, test_agg_last6
gc.collect()
print(f"Total time for test processing: {time.time() - start_time:.2f}s")

Loading full test data from ../data_parquet/test_optimized.parquet...
Test data loaded in 9.21s. Shape: (11363762, 190)
Starting denoising and pre-feature engineering...
Applying 'denoising' (floor * 100)...


100%|██████████| 177/177 [00:06<00:00, 27.01it/s]


Applying manual categorical mapping...
Creating lag/diff features...
Creating within-customer rank features...
Pre-feature engineering complete.
--- Starting All Aggregations for TEST ---
Starting aggregation with prefix: ''...
Aggregating numerical features...
Aggregating diff features...
Aggregating rank features...
Aggregating categorical features...
Combining aggregates...
Aggregation for '' complete. Shape: (924621, 2336)
Starting aggregation with prefix: 'last3_'...
Aggregating numerical features...
Aggregating diff features...
Aggregating rank features...
Aggregating categorical features...
Combining aggregates...
Aggregation for 'last3_' complete. Shape: (924621, 2336)
Starting aggregation with prefix: 'last6_'...
Aggregating numerical features...
Aggregating diff features...
Aggregating rank features...
Aggregating categorical features...
Combining aggregates...
Aggregation for 'last6_' complete. Shape: (924621, 2336)
Combining all TEST feature sets...
Full test set ready. Sha

## Feature Engineering Complete

This notebook has created and saved `train_fe.parquet` and `test_fe.parquet` in the `../data_fe/` directory.

You can now proceed to the **`v4_LGBM_Training.ipynb`** notebook to load these features and train the final model.