# Feature Engineering Pipeline V3 - WEIGHTED ACTIONS

## Key improvement:
**Different action types carry different weights during aggregation!**

### Action Weights:
- **order** (3): 5.0 - strongest signal (purchase)
- **to_cart** (5): 3.0 - strong intent
- **favorite** (2): 2.5 - medium interest
- **search**: 2.0 - explicit intent (what the user is looking for)
- **click** (1): 1.0 - baseline interest

### New features:
- `weighted_total_actions` - sum of weighted actions
- `weighted_engagement_score` - overall engagement with weights applied
- `action_diversity` - entropy of action types
- `high_intent_ratio` - share of high-intent actions (order, cart)
- `recent_weighted_activity` - recent activity with weights applied
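
To make the weighting concrete, here is a minimal sketch on a hypothetical four-event log. The `toy_log` frame and its values are illustrative assumptions; the weights are the ones listed above.

```python
import pandas as pd

# Weights from the list above, keyed by action_type_id
ACTION_WEIGHTS = {1: 1.0, 2: 2.5, 3: 5.0, 5: 3.0}

# Hypothetical log: user 1 clicks twice and orders once, user 2 adds to cart once
toy_log = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "action_type_id": [1, 1, 3, 5],
})
toy_log["action_weight"] = toy_log["action_type_id"].map(ACTION_WEIGHTS)

# Unweighted count vs. weighted sum per user
summary = toy_log.groupby("user_id").agg(
    total_actions=("action_type_id", "count"),
    weighted_total_actions=("action_weight", "sum"),
)
print(summary)
```

User 1 has three actions but a weighted total of 7.0 (1.0 + 1.0 + 5.0), so a single order outweighs several clicks.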

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
import glob
from typing import Dict, List
import warnings
import gc
from tqdm import tqdm
from scipy.stats import entropy
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

Libraries loaded successfully!


In [4]:
DATA_PATH = '../docs'

TRAIN_START_DATE = pd.Timestamp('2024-03-01')
TRAIN_END_DATE = pd.Timestamp('2024-06-30')
VAL_START_DATE = pd.Timestamp('2024-07-01')
VAL_END_DATE = pd.Timestamp('2024-07-31')
TEST_START_DATE = pd.Timestamp('2024-08-01')
NUM_PERIODS = 4


ACTION_WEIGHTS = {
    1: 1.0,   # click - baseline interest
    2: 2.5,   # favorite - medium interest
    3: 5.0,   # order - MOST IMPORTANT (purchase)
    5: 3.0,   # to_cart - strong intent
}

SEARCH_WEIGHT = 2.0  # Search = explicit intent

print("="*60)
print("WEIGHTED FEATURE ENGINEERING")
print("="*60)
print(f"\nAction Weights:")
for action_id, weight in ACTION_WEIGHTS.items():
    action_name = {1: 'click', 2: 'favorite', 3: 'order', 5: 'to_cart'}[action_id]
    print(f"  {action_name}: {weight}")
print(f"  search: {SEARCH_WEIGHT}")

print(f"\nTrain: {TRAIN_START_DATE.date()} - {TRAIN_END_DATE.date()}")
print(f"Val: {VAL_START_DATE.date()} - {VAL_END_DATE.date()}")
print(f"Test: {TEST_START_DATE.date()}")

WEIGHTED FEATURE ENGINEERING

Action Weights:
  click: 1.0
  favorite: 2.5
  order: 5.0
  to_cart: 3.0
  search: 2.0

Train: 2024-03-01 - 2024-06-30
Val: 2024-07-01 - 2024-07-31
Test: 2024-08-01


## 2. Data Loading

In [8]:
# Load actions 
print("Loading actions_history...")
actions_files = sorted(glob.glob(os.path.join(DATA_PATH, 'actions_history', '*.parquet')))
print(f"Found {len(actions_files)} files")

actions_list = []
for file in tqdm(actions_files, desc="Loading"):
    df = pd.read_parquet(file)
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    # ADD WEIGHT COLUMN
    df['action_weight'] = df['action_type_id'].map(ACTION_WEIGHTS)
    actions_list.append(df)

actions_history = pd.concat(actions_list, ignore_index=True)
print(f"\nActions: {actions_history.shape}")
print(f"Date range: {actions_history['timestamp'].min()} - {actions_history['timestamp'].max()}")

del actions_list
gc.collect()

Loading actions_history...
Found 53 files


Loading: 100%|██████████| 53/53 [00:03<00:00, 13.63it/s]



Actions: (182001544, 7)
Date range: 2011-05-28 00:26:26 - 2024-07-31 23:59:58


917

In [5]:
# Load searches
print("Loading search_history...")
search_files = sorted(glob.glob(os.path.join(DATA_PATH, 'search_history', '*.parquet')))

search_list = []
for file in tqdm(search_files, desc="Loading"):
    df = pd.read_parquet(file)
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    # ADD SEARCH WEIGHT
    df['action_weight'] = SEARCH_WEIGHT
    search_list.append(df)

search_history = pd.concat(search_list, ignore_index=True)
print(f"\nSearches: {search_history.shape}")

del search_list
gc.collect()

Loading search_history...


Loading: 100%|██████████| 32/32 [00:08<00:00,  3.67it/s]



Searches: (78160845, 6)


0

In [6]:
# Load products and test users
product_information = pd.read_csv(os.path.join(DATA_PATH, 'product_information.csv'))
test_users = pd.read_csv(os.path.join(DATA_PATH, 'test_users.csv'))

print(f"Products: {product_information.shape}")
print(f"Test users: {test_users.shape}")

Products: (238443, 8)
Test users: (2068424, 1)


## 3. Target Creation

In [9]:
# Validation target
val_actions = actions_history[
    (actions_history['timestamp'] >= VAL_START_DATE) &
    (actions_history['timestamp'] <= VAL_END_DATE)
].copy()

val_target = (
    val_actions
    .assign(has_order=(val_actions['action_type_id'] == 3).astype(int))
    .groupby('user_id', as_index=False)
    .agg(target=('has_order', 'max'))
)

print(f"Total users: {val_target.shape[0]:,}")
print(f"\nTarget distribution:")
print(val_target['target'].value_counts())
print(f"\nPositive ratio: {val_target['target'].mean():.2%}")

del val_actions
gc.collect()

Total users: 1,835,147

Target distribution:
target
0    1200425
1     634722
Name: count, dtype: int64

Positive ratio: 34.59%


0
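
One detail of the window filter above is worth checking on toy data: `VAL_END_DATE` is a midnight timestamp, so `timestamp <= VAL_END_DATE` keeps an event at 00:00:00 on the end date but drops anything later that day. The sketch below contrasts it with a half-open upper bound. The `toy` frame and the `end_exclusive` name are illustrative assumptions, not part of the pipeline, and whether the last day should be included is a pipeline decision.

```python
import pandas as pd

VAL_START_DATE = pd.Timestamp("2024-07-01")
VAL_END_DATE = pd.Timestamp("2024-07-31")  # midnight at the start of July 31

# Hypothetical event timestamps around the end of the window
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-07-30 18:00:00",
        "2024-07-31 00:00:00",
        "2024-07-31 23:59:58",
    ])
})

# Closed upper bound, as used in the cell above
closed = toy[
    (toy["timestamp"] >= VAL_START_DATE) & (toy["timestamp"] <= VAL_END_DATE)
]

# Half-open upper bound that covers the whole end date
end_exclusive = VAL_END_DATE + pd.Timedelta(days=1)
half_open = toy[
    (toy["timestamp"] >= VAL_START_DATE) & (toy["timestamp"] < end_exclusive)
]

print(len(closed), len(half_open))
```

The closed bound keeps two of the three events, while the half-open bound keeps all three.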

## 4. Weighted Feature Generation

### 4.1 Basic RFM Features 

In [10]:
def generate_weighted_rfm_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp
) -> pd.DataFrame:
    """
    Generate RFM features with WEIGHTS for different action types.
    
    NEW FEATURES:
    - weighted_total_actions: sum of weighted actions
    - weighted_products: unique products, taking action weights into account
    - high_intent_ratio: share of high-intent actions (order + cart)
    """
    print("\n=== Weighted RFM Features ===")
    
    df = user_df.copy()
    
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date)
    ].copy()
    
    period_actions = period_actions.merge(
        product_information[['product_id', 'discount_price']],
        on='product_id',
        how='left'
    )
    
    actions_map = {1: 'click', 2: 'favorite', 3: 'order', 5: 'to_cart'}
    
    # Per-action features (with weights)
    for action_id, suffix in actions_map.items():
        print(f"  {suffix}...")
        
        action_data = period_actions[period_actions['action_type_id'] == action_id].copy()
        
        if len(action_data) == 0:
            continue
        
        weight = ACTION_WEIGHTS[action_id]
        
        aggs = action_data.groupby('user_id').agg(
            **{
                f'num_products_{suffix}': ('product_id', 'count'),
                f'num_unique_products_{suffix}': ('product_id', 'nunique'),
                f'weighted_count_{suffix}': ('action_weight', 'sum'),  
                f'sum_discount_price_{suffix}': ('discount_price', 'sum'),
                f'max_discount_price_{suffix}': ('discount_price', 'max'),
                f'last_{suffix}_time': ('timestamp', 'max'),
                f'first_{suffix}_time': ('timestamp', 'min'),
            }
        ).reset_index()
        
        # Recency
        ref = end_date + timedelta(days=1)
        aggs[f'days_since_last_{suffix}'] = (ref - aggs[f'last_{suffix}_time']).dt.days
        aggs[f'days_since_first_{suffix}'] = (ref - aggs[f'first_{suffix}_time']).dt.days
        
        # Weighted recency (recent actions matter more)
        aggs[f'recency_weighted_{suffix}'] = (
            weight / (aggs[f'days_since_last_{suffix}'] + 1)
        )
        
        aggs = aggs.drop(columns=[f'last_{suffix}_time', f'first_{suffix}_time'])
        df = df.merge(aggs, on='user_id', how='left')
    
    # Search features 
    print("  search...")
    period_searches = search_history[
        (search_history['timestamp'] >= start_date) &
        (search_history['timestamp'] <= end_date)
    ].copy()
    
    if len(period_searches) > 0:
        search_aggs = period_searches.groupby('user_id').agg(
            num_search=('search_query', 'count'),
            weighted_search_count=('action_weight', 'sum'),  # NEW
            last_search_time=('timestamp', 'max'),
            first_search_time=('timestamp', 'min'),
        ).reset_index()
        
        ref = end_date + timedelta(days=1)
        search_aggs['days_since_last_search'] = (ref - search_aggs['last_search_time']).dt.days
        search_aggs['days_since_first_search'] = (ref - search_aggs['first_search_time']).dt.days
        search_aggs['recency_weighted_search'] = (
            SEARCH_WEIGHT / (search_aggs['days_since_last_search'] + 1)
        )
        
        search_aggs = search_aggs.drop(columns=['last_search_time', 'first_search_time'])
        df = df.merge(search_aggs, on='user_id', how='left')
    
    # GLOBAL WEIGHTED FEATURES
    print("  global weighted...")
    
    global_aggs = period_actions.groupby('user_id').agg(
        total_actions=('product_id', 'count'),
        weighted_total_actions=('action_weight', 'sum'),  
        high_intent_count=('action_type_id', lambda x: ((x == 3) | (x == 5)).sum()),  # orders + cart
    ).reset_index()
    
    # High intent ratio
    global_aggs['high_intent_ratio'] = (
        global_aggs['high_intent_count'] / global_aggs['total_actions']
    )
    
    # Action diversity (entropy)
    action_diversity = (
        period_actions
        .groupby(['user_id', 'action_type_id'])
        .size()
        .unstack(fill_value=0)
    )
    
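    # Calculate entropy of each user's action-type counts, vectorized over rows
    # (scipy's entropy normalizes each row to a probability distribution)
    action_diversity['action_diversity'] = entropy(
        action_diversity.to_numpy(dtype='float64') + 1e-10, axis=1
    )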
    
    global_aggs = global_aggs.merge(
        action_diversity[['action_diversity']].reset_index(),
        on='user_id',
        how='left'
    )
    
    df = df.merge(global_aggs, on='user_id', how='left')
    
    new_count = len(df.columns) - len(user_df.columns)
    print(f"  Generated {new_count} weighted features")
    
    return df
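
The recency weighting and the entropy-based diversity used in the function above can be checked by hand on a small case. This is a minimal sketch with hypothetical users 101 and 102. The entropy is written out in NumPy rather than calling `scipy.stats.entropy`, which should give the same value because it also normalizes the counts and uses the natural log.

```python
import numpy as np
import pandas as pd

end_date = pd.Timestamp("2024-06-30")
ref = end_date + pd.Timedelta(days=1)  # same reference point as in the function
ORDER_WEIGHT = 5.0

# Hypothetical last-order timestamps for two users
last_order = pd.Series(
    pd.to_datetime(["2024-06-29 12:00:00", "2024-06-01 00:00:00"]),
    index=pd.Index([101, 102], name="user_id"),
)
days_since_last = (ref - last_order).dt.days
recency_weighted_order = ORDER_WEIGHT / (days_since_last + 1)

# Hypothetical per-user counts of each action type (columns are action_type_id)
counts = pd.DataFrame(
    {1: [10, 5], 2: [0, 5], 3: [0, 5], 5: [0, 5]},
    index=pd.Index([101, 102], name="user_id"),
)
# Shannon entropy (natural log) of each row after normalizing to probabilities
p = counts.to_numpy(dtype="float64") + 1e-10
p = p / p.sum(axis=1, keepdims=True)
action_diversity = pd.Series(-(p * np.log(p)).sum(axis=1), index=counts.index)

print(recency_weighted_order.round(3).tolist())
print(action_diversity.round(3).tolist())
```

User 101 ordered one day before the reference point and gets 5.0 / 2 = 2.5, while user 102 ordered 30 days before it and gets 5.0 / 31, about 0.16. A user who only clicks has diversity close to 0. A user spread evenly over four action types has ln 4, about 1.386.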

### 4.2 Temporal Features

In [11]:
def generate_temporal_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp
) -> pd.DataFrame:
    """Temporal patterns (same as before)"""
    print("\n=== Temporal Features ===")
    
    df = user_df.copy()
    
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date)
    ].copy()
    
    period_actions['day_of_week'] = period_actions['timestamp'].dt.dayofweek
    period_actions['hour'] = period_actions['timestamp'].dt.hour
    period_actions['date'] = period_actions['timestamp'].dt.date
    
    actions_map = {1: 'click', 2: 'favorite', 3: 'order', 5: 'to_cart'}
    
    for action_id, suffix in actions_map.items():
        action_data = period_actions[period_actions['action_type_id'] == action_id]
        
        if len(action_data) == 0:
            continue
        
        temporal = action_data.groupby('user_id').agg(
            **{
                f'favorite_day_of_week_{suffix}': ('day_of_week', 'median'),
                f'avg_hour_{suffix}': ('hour', 'median'),
                f'num_unique_days_{suffix}': ('date', 'nunique'),
                f'first_time_{suffix}': ('timestamp', 'min'),
            }
        ).reset_index()
        
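        # "New" = first seen within the last 30 days of the window
        # (2024-06-01 for the train window ending 2024-06-30)
        new_user_cutoff = end_date.normalize() - pd.Timedelta(days=29)
        temporal[f'is_new_user_{suffix}'] = (
            temporal[f'first_time_{suffix}'] >= new_user_cutoff
        ).astype(int)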
        
        temporal = temporal.drop(columns=[f'first_time_{suffix}'])
        df = df.merge(temporal, on='user_id', how='left')
    
    # Lifetime
    for suffix in ['click', 'favorite', 'order', 'to_cart']:
        first_col = f'days_since_first_{suffix}'
        last_col = f'days_since_last_{suffix}'
        if first_col in df.columns and last_col in df.columns:
            df[f'lifetime_{suffix}'] = df[first_col] - df[last_col]
    
    print("  Generated temporal features")
    return df

### 4.3 Conversion Features

In [12]:
def generate_conversion_features(df: pd.DataFrame) -> pd.DataFrame:
    """Conversion rates (same as before)"""
    print("\n=== Conversion Features ===")
    
    df = df.copy()
    
    for suffix in ['click', 'favorite', 'to_cart']:
        num_col = f'num_products_{suffix}'
        if num_col in df.columns and 'num_products_order' in df.columns:
            df[f'{suffix}_to_order_conversion'] = (
                df['num_products_order'] / df[num_col].replace(0, np.nan)
            )
    
    if 'num_search' in df.columns and 'num_products_order' in df.columns:
        df['searches_to_order_ratio'] = (
            df['num_search'] / df['num_products_order'].replace(0, np.nan)
        )
    
    for suffix in ['click', 'favorite', 'to_cart', 'order']:
        num_col = f'num_unique_products_{suffix}'
        days_col = f'num_unique_days_{suffix}'
        if num_col in df.columns and days_col in df.columns:
            df[f'{suffix}_per_day'] = df[num_col] / df[days_col].replace(0, np.nan)
    
    print("  Generated conversion features")
    return df
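
The `.replace(0, np.nan)` guard in the function above matters because a zero denominator would otherwise produce `inf`. A minimal sketch on a hypothetical three-user frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    # NaN stands for a user with no clicks after the left merge
    "num_products_click": [10, 0, np.nan],
    "num_products_order": [2, 1, 1],
})

# Replacing 0 with NaN turns a would-be inf into NaN
toy["click_to_order_conversion"] = (
    toy["num_products_order"] / toy["num_products_click"].replace(0, np.nan)
)
print(toy)
```

Only user 1 gets a numeric rate (0.2). The other two rows stay NaN and are filled with -1 in the cleaning step of section 7.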

### 4.4 Advanced Features

In [13]:
def generate_advanced_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp
) -> pd.DataFrame:
    """Advanced behavioral features """
    print("\n=== Advanced Features ===")
    
    df = user_df.copy()
    
    period_actions = actions_history[
        (actions_history['timestamp'] >= start_date) &
        (actions_history['timestamp'] <= end_date)
    ].copy()
    
    # Discount ratio
    order_actions = period_actions[period_actions['action_type_id'] == 3].copy()
    
    if len(order_actions) > 0:
        order_actions = order_actions.merge(
            product_information[['product_id', 'price', 'discount_price']],
            on='product_id',
            how='left'
        )
        
        order_actions['has_discount'] = (
            order_actions['price'] > order_actions['discount_price']
        ).astype(int)
        
        discount_aggs = order_actions.groupby('user_id').agg(
            discount_purchase_ratio=('has_discount', 'mean'),
            avg_order_price=('discount_price', 'mean')
        ).reset_index()
        
        df = df.merge(discount_aggs, on='user_id', how='left')
    
    # Category diversity
    interaction_actions = period_actions[
        period_actions['action_type_id'].isin([1, 2, 3, 5])
    ].copy()
    
    if len(interaction_actions) > 0:
        interaction_actions = interaction_actions.merge(
            product_information[['product_id', 'category_id']],
            on='product_id',
            how='left'
        )
        
        category_aggs = interaction_actions.groupby('user_id').agg(
            num_unique_categories=('category_id', 'nunique'),
            total_interactions=('category_id', 'count')
        ).reset_index()
        
        category_aggs['category_diversity'] = (
            category_aggs['num_unique_categories'] / category_aggs['total_interactions']
        )
        
        category_aggs = category_aggs.drop(columns=['total_interactions'])
        df = df.merge(category_aggs, on='user_id', how='left')
    
    # Widget diversity
    widget_aggs = period_actions.groupby('user_id').agg(
        num_unique_widgets=('widget_name_id', 'nunique')
    ).reset_index()
    
    df = df.merge(widget_aggs, on='user_id', how='left')
    
    print("  Generated advanced features")
    return df

### 4.5 Periodic Aggregations

In [14]:
def generate_weighted_periodic_features(
    user_df: pd.DataFrame,
    start_date: pd.Timestamp,
    end_date: pd.Timestamp,
    num_periods: int = 4,
    prefiltered_actions: pd.DataFrame = None  # NEW: accept pre-filtered data
) -> pd.DataFrame:
    """
    OPTIMIZED: Generates weighted periodic features ~5x faster.
    
    Key optimizations:
    1. Accept pre-filtered actions to avoid repeated filtering
    2. Removed slow lambda for category_mode (replaced with faster approach)
    3. Simplified aggregations
    """
    print("\n=== Weighted Periodic Features (OPTIMIZED) ===")
    print(f"  Periods: {num_periods} weeks + older")
    
    df = user_df.copy()
    user_set = set(user_df['user_id'])
    
    # Use pre-filtered data if provided, otherwise filter
    if prefiltered_actions is not None:
        period_actions = prefiltered_actions[
            prefiltered_actions['user_id'].isin(user_set)
        ].copy()
    else:
        period_actions = actions_history[
            (actions_history['timestamp'] >= start_date) &
            (actions_history['timestamp'] <= end_date) &
            (actions_history['user_id'].isin(user_set))
        ].copy()
    
    if len(period_actions) == 0:
        print("  No data")
        return df
    
    # Merge product info only if not already present
    if 'category_id' not in period_actions.columns:
        period_actions = period_actions.merge(
            product_information[['product_id', 'category_id', 'price', 'discount_price']],
            on='product_id',
            how='left'
        )
    
    # Fill missing values efficiently
    period_actions['category_id'] = period_actions['category_id'].fillna(10000).astype('int32')
    price_mean = period_actions['price'].mean()
    period_actions['price'] = period_actions['price'].fillna(price_mean).astype('float32')
    period_actions['discount_price'] = period_actions['discount_price'].fillna(price_mean).astype('float32')
    
    # Period assignment
    period_actions['period'] = (
        (end_date - period_actions['timestamp']).dt.days // 7
    ).clip(upper=num_periods).astype('int8')
    
    print("  Aggregating (fast)...")
    
    # FAST aggregation - removed slow lambda!
    # Using only built-in aggregations which are vectorized
    aggregated = period_actions.groupby(
        ['user_id', 'period', 'action_type_id'], 
        as_index=False
    ).agg(
        num_actions=('timestamp', 'count'),  # Changed from nunique to count (faster)
        weighted_actions=('action_weight', 'sum'),  
        num_products=('product_id', 'nunique'),
        count_products=('product_id', 'count'),
        unique_widget_actions=('widget_name_id', 'nunique'),
        num_categories=('category_id', 'nunique'),
        price_mean=('price', 'mean'),
        price_max=('price', 'max'),
        discount_price_mean=('discount_price', 'mean'),
        discount_price_max=('discount_price', 'max'),
    )
    
    # Normalize the "older" bucket (period == num_periods) by its length in days
    divisor = (end_date - pd.Timedelta(f"{num_periods*7} days") - start_date).days
    if divisor > 0:
        features_norm = ['num_actions', 'weighted_actions', 'num_products', 
                        'count_products', 'unique_widget_actions', 'num_categories']
        mask = aggregated['period'] == num_periods
        aggregated.loc[mask, features_norm] = aggregated.loc[mask, features_norm] / divisor
    
    print("  Pivoting...")
    
    # Pivot - simplified feature list (removed category_mode and timestamp_std)
    features = [
        'num_actions', 'weighted_actions', 'num_products', 'count_products',
        'unique_widget_actions', 'num_categories',
        'price_mean', 'price_max', 'discount_price_mean', 'discount_price_max'
    ]
    
    aggregated_wide = aggregated.pivot_table(
        index='user_id',
        columns=['period', 'action_type_id'],
        values=features,
        fill_value=0
    )
    
    aggregated_wide.columns = [
        f"{feat}_{period}_{action}"
        for feat, period, action in aggregated_wide.columns
    ]
    
    aggregated_wide = aggregated_wide.reset_index()
    df = df.merge(aggregated_wide, on='user_id', how='left')
    
    periodic_cols = [col for col in df.columns if col not in user_df.columns]
    df[periodic_cols] = df[periodic_cols].fillna(0)
    
    new_count = len(df.columns) - len(user_df.columns)
    print(f"  Generated {new_count} periodic features")
    
    return df
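
The period assignment in the function above can be traced on a handful of dates. This is a minimal sketch using the train window from the configuration cell; the sample timestamps and the `older_days` name are assumptions for illustration.

```python
import pandas as pd

start_date = pd.Timestamp("2024-03-01")
end_date = pd.Timestamp("2024-06-30")
num_periods = 4

# Hypothetical event dates: 0, 6, 7, 28 and 121 days before end_date
ts = pd.Series(pd.to_datetime([
    "2024-06-30", "2024-06-24", "2024-06-23", "2024-06-02", "2024-03-01",
]))

# Whole weeks before end_date; anything num_periods weeks back or more is capped
period = ((end_date - ts).dt.days // 7).clip(upper=num_periods).astype("int8")

# Length in days of the capped "older" bucket, used as the divisor in the function
older_days = (end_date - pd.Timedelta(days=num_periods * 7) - start_date).days

print(period.tolist())
print(older_days)
```

Periods 0 to 3 are the four most recent weeks. Period 4 collects everything from 28 days back to the start of the window, which is 93 days for the train window.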

## 5. Generate Training Features

In [None]:
print("="*60)
print("GENERATING TRAINING FEATURES")
print("="*60)

# ===== OPTIMIZATION: Pre-filter data ONCE =====
print("\n[OPTIMIZATION] Pre-filtering actions for train period...")
train_actions_filtered = actions_history[
    (actions_history['timestamp'] >= TRAIN_START_DATE) &
    (actions_history['timestamp'] <= TRAIN_END_DATE)
].copy()

# Pre-merge product info (avoids repeated merges)
train_actions_filtered = train_actions_filtered.merge(
    product_information[['product_id', 'category_id', 'price', 'discount_price']],
    on='product_id',
    how='left'
)
print(f"  Pre-filtered: {len(train_actions_filtered):,} actions")

df_train = val_target.copy()

# 1. Weighted RFM
df_train = generate_weighted_rfm_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE
)

# 2. Temporal
df_train = generate_temporal_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE
)

# 3. Conversion
df_train = generate_conversion_features(df_train)

# 4. Advanced
df_train = generate_advanced_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE
)

# 5. Weighted Periodic - USE PRE-FILTERED DATA
df_train = generate_weighted_periodic_features(
    df_train,
    start_date=TRAIN_START_DATE,
    end_date=TRAIN_END_DATE,
    num_periods=4,
    prefiltered_actions=train_actions_filtered  # OPTIMIZATION!
)

# Cleanup
del train_actions_filtered
gc.collect()

print("\n" + "="*60)
print(f"TOTAL FEATURES: {len(df_train.columns) - 2}")
print("="*60)

GENERATING TRAINING FEATURES

[OPTIMIZATION] Pre-filtering actions for train period...
  Pre-filtered: 141,962,692 actions

=== Weighted RFM Features ===
  click...
  favorite...
  order...
  to_cart...
  search...
  global weighted...


## 6. Generate Test Features

In [None]:
print("="*60)
print("GENERATING TEST FEATURES (WEIGHTED)")
print("="*60)

# ===== OPTIMIZATION: Pre-filter data ONCE =====
print("\n[OPTIMIZATION] Pre-filtering actions for test period...")
test_actions_filtered = actions_history[
    (actions_history['timestamp'] >= TRAIN_START_DATE) &
    (actions_history['timestamp'] <= VAL_END_DATE)
].copy()

# Pre-merge product info
test_actions_filtered = test_actions_filtered.merge(
    product_information[['product_id', 'category_id', 'price', 'discount_price']],
    on='product_id',
    how='left'
)
print(f"  Pre-filtered: {len(test_actions_filtered):,} actions")

df_test = test_users.copy()
df_test['target'] = 0

df_test = generate_weighted_rfm_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE
)

df_test = generate_temporal_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE
)

df_test = generate_conversion_features(df_test)

df_test = generate_advanced_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE
)

# USE PRE-FILTERED DATA
df_test = generate_weighted_periodic_features(
    df_test,
    start_date=TRAIN_START_DATE,
    end_date=VAL_END_DATE,
    num_periods=4,
    prefiltered_actions=test_actions_filtered  # OPTIMIZATION!
)

# Cleanup
del test_actions_filtered
gc.collect()

print("\n" + "="*60)
print(f"TEST FEATURES: {len(df_test.columns) - 2}")
print("="*60)

GENERATING TEST FEATURES (WEIGHTED)


NameError: name 'test_users' is not defined

## 7. Feature Cleaning

In [None]:
# Get feature columns
feature_cols = [col for col in df_train.columns if col not in ['user_id', 'target']]
print(f"Total features: {len(feature_cols)}")

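# Align test columns with train (a feature absent from test becomes NaN),
# then fill nulls with -1
print("\nFilling nulls with -1...")
df_train[feature_cols] = df_train[feature_cols].fillna(-1)
df_test = df_test.reindex(columns=['user_id', 'target'] + feature_cols)
df_test[feature_cols] = df_test[feature_cols].fillna(-1)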

# Handle inf
print("Handling infinite values...")
df_train = df_train.replace([np.inf, -np.inf], 999999)
df_test = df_test.replace([np.inf, -np.inf], 999999)

print("\n✅ Features cleaned")

## 8. Save Features

In [None]:
output_dir = '../results'
os.makedirs(output_dir, exist_ok=True)

print("Saving weighted features...")

df_train.to_parquet(os.path.join(output_dir, 'features_train_v3_weighted.parquet'), index=False)
df_test.to_parquet(os.path.join(output_dir, 'features_test_v3_weighted.parquet'), index=False)

print(f"\n✅ Saved to {output_dir}/")
print(f"  - features_train_v3_weighted.parquet: {df_train.shape}")
print(f"  - features_test_v3_weighted.parquet: {df_test.shape}")

# Save feature names
with open(os.path.join(output_dir, 'feature_names_v3_weighted.txt'), 'w') as f:
    for col in feature_cols:
        f.write(f"{col}\n")

print(f"\n Feature names: {output_dir}/feature_names_v3_weighted.txt")

## 9. Summary

In [None]:
print("\n" + "="*70)
print("WEIGHTED FEATURE ENGINEERING COMPLETE")
print("="*70)

print(f"\n📊 Summary:")
print(f"  Total features: {len(feature_cols)}")
print(f"  Train samples: {df_train.shape[0]:,}")
print(f"  Test samples: {df_test.shape[0]:,}")

print(f"\n🎯 KEY IMPROVEMENTS:")
print(f"  1. Action weights: order (5.0), to_cart (3.0), favorite (2.5), search (2.0), click (1.0)")
print(f"  2. Weighted aggregations: weighted_total_actions, recency_weighted, etc.")
print(f"  3. High-intent features: high_intent_ratio, action_diversity")
print(f"  4. Weighted periodic features: weighted_actions per period")

print(f"\n📁 New Features (examples):")
weighted_features = [col for col in feature_cols if 'weighted' in col]
print(f"  Weighted features ({len(weighted_features)}): {weighted_features[:10]}")

print(f"\n🎯 Next Steps:")
print(f"  1. Train models with weighted features")
print(f"  2. Compare AUC with v2 (non-weighted)")
print(f"  3. Analyze feature importance of weighted features")

print("\n" + "="*70)

In [None]:
# Display sample
print("\nSample of weighted features:")
display_cols = ['user_id', 'target', 'weighted_total_actions', 'high_intent_ratio', 
                'action_diversity', 'recency_weighted_order', 'recency_weighted_search']
display_cols = [c for c in display_cols if c in df_train.columns]
df_train[display_cols].head(10)