# Feature Engineering

## Objective
Create the essential features for the recommendation models:

1. **User Features**: Core behavior metrics (ratings, activity, preferences)
2. **Item Features**: Essential product characteristics and popularity
3. **Interaction Matrix**: Customer-merchant collaborative filtering matrix
4. **Temporal Validation**: Prevent data leakage with time splits

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings

from scipy.sparse import issparse
from datetime import datetime

# Import utils
sys.path.append('../utils')
from db_loader import load_amazon_data_k_core
from preprocessing import clean_reviews_data, clean_products_data
from feature_engineering import load_and_create_features

warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Data Loading & Merchant Extraction

In [2]:
reviews_raw, products_raw, features = load_and_create_features()

2025-10-19 10:13:36,103 | Loading data...


Loading Amazon data with 5-core filtering...
Applying 5-core filtering...
Original dataset: 4,000,000 reviews
Original users: 708,183
Original items: 85,124
Iteration 1: 2,943,852 reviews, 346,998 users, 63,126 items
Iteration 2: 2,894,141 reviews, 339,516 users, 57,095 items
Iteration 3: 2,876,440 reviews, 335,444 users, 56,691 items
Iteration 4: 2,874,287 reviews, 335,121 users, 56,472 items
Iteration 5: 2,873,490 reviews, 334,940 users, 56,453 items
Iteration 6: 2,873,422 reviews, 334,931 users, 56,445 items
Iteration 7

2025-10-19 10:16:44,337 | Starting customer-merchant feature creation...


Cleaned products data: 56,370 records (4,109 removed)


2025-10-19 10:16:45,071 | Building user features...
2025-10-19 10:18:15,499 | Building item features...
2025-10-19 10:18:34,319 | Building interaction features...
2025-10-19 10:18:38,441 | Creating sparse customer-merchant interaction matrix...
2025-10-19 10:18:41,391 | Original data: 2,721,300 interactions
2025-10-19 10:18:41,392 | With merchant info: 2,718,509 interactions
2025-10-19 10:18:43,401 | Customers with >=5 interactions: 305,699
2025-10-19 10:18:43,403 | Merchants with >=5 interactions: 7,089
2025-10-19 10:18:46,733 | After 
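
The log above shows the 5-core filter running over several iterations, with the review, user, and item counts shrinking each time. The internals of `load_amazon_data_k_core` are not shown in this notebook, so the following is only a minimal sketch of iterative k-core filtering. The function name `k_core_filter`, its parameters, and the toy data are hypothetical; the column names `reviewerid` and `asin` are taken from the dataframe columns printed later.

```python
import pandas as pd


def k_core_filter(df, user_col="reviewerid", item_col="asin", k=5, max_iter=50):
    """Repeatedly drop users and items with fewer than k rows until stable."""
    for _ in range(max_iter):
        n_before = len(df)
        # Keep only rows whose user appears at least k times
        user_counts = df[user_col].map(df[user_col].value_counts())
        df = df[user_counts >= k]
        # Keep only rows whose item appears at least k times
        item_counts = df[item_col].map(df[item_col].value_counts())
        df = df[item_counts >= k]
        # Removing items can push users below k again, so loop until no change
        if len(df) == n_before:
            break
    return df


# Toy data (hypothetical): u3 survives the first user pass but not the second
toy_reviews = pd.DataFrame({
    "reviewerid": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "asin":       ["a",  "b",  "a",  "b",  "a",  "c"],
})
toy_core = k_core_filter(toy_reviews, k=2)
```

A single pass is not enough because dropping a sparse item can leave one of its reviewers below the threshold, which is why the counts in the log keep changing across iterations.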

## 2. Explore Feature Creation

In [3]:
# Extract individual feature components
user_features = features['user_features']
item_features = features['item_features'] 
interaction_features = features['interaction_features']
interaction_matrix = features['interaction_matrix']  

## 3. Customer-Merchant Interaction Matrix


In [4]:


# Merge reviews with products to get merchant (brand) information
reviews_with_merchants = reviews_raw.merge(
    products_raw[['asin', 'brand']], on='asin', how='inner'
)

# Clean merchant data - remove missing brands
reviews_with_merchants = reviews_with_merchants[
    (reviews_with_merchants['brand'].notna()) & 
    (reviews_with_merchants['brand'] != 'Unknown')
].copy()

print(f"Reviews with valid merchant data: {len(reviews_with_merchants):,}")
print(f"Unique customers: {reviews_with_merchants['reviewerid'].nunique():,}")
print(f"Unique merchants: {reviews_with_merchants['brand'].nunique():,}")

Reviews with valid merchant data: 3,087,654
Unique customers: 334,924
Unique merchants: 7,139
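
The merge above reports 3,087,654 rows, which is more than the 2,721,300 interactions logged by the feature pipeline. One possible explanation (an assumption, not verified here) is that `products_raw` contains more than one row for some `asin` values, so an inner merge repeats the matching reviews. The sketch below uses hypothetical toy frames to show how such duplication arises and how `drop_duplicates` plus the `validate` argument of `DataFrame.merge` guard against it.

```python
import pandas as pd

# Toy frames (hypothetical): product A1 is listed twice
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2"],
    "reviewerid": ["u1", "u2", "u1"],
})
products = pd.DataFrame({
    "asin": ["A1", "A1", "A2"],
    "brand": ["Acme", "Acme", "Bolt"],
})

# Naive merge: every A1 review is matched to both A1 product rows
naive = reviews.merge(products[["asin", "brand"]], on="asin", how="inner")

# Safer merge: one brand per asin, and pandas raises MergeError
# if the right-hand keys are still not unique
products_unique = products[["asin", "brand"]].drop_duplicates(subset="asin", keep="first")
safe = reviews.merge(products_unique, on="asin", how="inner", validate="many_to_one")

print(f"naive rows: {len(naive)}, safe rows: {len(safe)}, reviews: {len(reviews)}")
```

Comparing `len(merged)` with `len(reviews)` after the merge is a cheap check that review counts were not inflated.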


In [5]:
# Create customer and merchant mappings
unique_customers = reviews_with_merchants['reviewerid'].unique()
unique_merchants = reviews_with_merchants['brand'].unique()

customer_to_idx = {customer: idx for idx, customer in enumerate(unique_customers)}
merchant_to_idx = {merchant: idx for idx, merchant in enumerate(unique_merchants)}

# Reverse mappings
idx_to_customer = {idx: customer for customer, idx in customer_to_idx.items()}
idx_to_merchant = {idx: merchant for merchant, idx in merchant_to_idx.items()}

print(f"Created mappings:")
print(f"  Customers: {len(customer_to_idx):,}")
print(f"  Merchants: {len(merchant_to_idx):,}")

# Add indices to dataframe
reviews_with_merchants['customer_idx'] = reviews_with_merchants['reviewerid'].map(customer_to_idx)
reviews_with_merchants['merchant_idx'] = reviews_with_merchants['brand'].map(merchant_to_idx)

Created mappings:
  Customers: 334,924
  Merchants: 7,139
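
As an aside, the same contiguous indices can be obtained with `pd.factorize`, which returns the integer codes and the unique values in order of first appearance. The sketch below uses a hypothetical toy series to show that it agrees with the dictionary-based mapping used above; note that `factorize` assigns `-1` to missing values, which is not an issue here because missing brands were already filtered out.

```python
import pandas as pd

# Toy IDs (hypothetical)
ids = pd.Series(["u3", "u1", "u3", "u2"])

# Dictionary approach, as in the cell above
to_idx = {value: idx for idx, value in enumerate(ids.unique())}
mapped = ids.map(to_idx)

# factorize approach: codes are the indices, uniques is the reverse lookup
codes, uniques = pd.factorize(ids)

print(mapped.tolist(), codes.tolist(), list(uniques))
```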


In [6]:
# Create aggregated customer-merchant interactions
print("Aggregating customer-merchant interactions...")

# Check verified column and convert to numeric
if 'verified' in reviews_with_merchants.columns:
    print(f"Verified column dtype: {reviews_with_merchants['verified'].dtype}")
    print(f"Verified unique values: {reviews_with_merchants['verified'].unique()[:10]}")
    
    # Handle different data types for verified column
    if reviews_with_merchants['verified'].dtype == 'bool':
        reviews_with_merchants['verified_numeric'] = reviews_with_merchants['verified'].astype(int)
    else:
        # Convert to string first, then check for True/true values
        reviews_with_merchants['verified_numeric'] = (
            reviews_with_merchants['verified'].astype(str).str.lower() == 'true'
        ).astype(int)
else:
    reviews_with_merchants['verified_numeric'] = 0

# Group by customer-merchant pairs and aggregate ratings
print("Starting aggregation...")
print(f"Columns before aggregation: {reviews_with_merchants.columns.tolist()}")
print(f"Data types: {reviews_with_merchants.dtypes}")

# Select only the columns we need for aggregation
agg_columns = ['customer_idx', 'merchant_idx', 'overall', 'verified_numeric']
reviews_for_agg = reviews_with_merchants[agg_columns].copy()

# Ensure all columns are proper numeric types
reviews_for_agg['overall'] = pd.to_numeric(reviews_for_agg['overall'], errors='coerce')
reviews_for_agg['verified_numeric'] = pd.to_numeric(reviews_for_agg['verified_numeric'], errors='coerce')

customer_merchant_agg = reviews_for_agg.groupby(['customer_idx', 'merchant_idx']).agg({
    'overall': ['mean', 'count'],
    'verified_numeric': 'mean'
}).round(2)

customer_merchant_agg.columns = ['avg_rating', 'interaction_count', 'verified_pct']
customer_merchant_agg = customer_merchant_agg.reset_index()

# Create interaction score (weighted by frequency and verification)
customer_merchant_agg['interaction_score'] = (
    customer_merchant_agg['avg_rating'] * 0.7 +
    np.log1p(customer_merchant_agg['interaction_count']) * 0.2 +
    customer_merchant_agg['verified_pct'] * 0.1
)

print(f"Customer-merchant pairs: {len(customer_merchant_agg):,}")
print(f"Average interactions per pair: {customer_merchant_agg['interaction_count'].mean():.1f}")

Aggregating customer-merchant interactions...
Verified column dtype: object
Verified unique values: ['True' 'False']
Starting aggregation...
Columns before aggregation: ['id', 'asin', 'overall', 'reviewtext', 'reviewtime', 'reviewerid', 'reviewername', 'style', 'summary', 'unixreviewtime', 'verified', 'vote', 'brand', 'customer_idx', 'merchant_idx', 'verified_numeric']
Data types: id                   int64
asin                object
overall             object
reviewtext          object
reviewtime          object
reviewerid          object
reviewername        object
style               object
summary             object
unixreviewtime      object
verified            object
vote                object
brand               object
customer_idx         int64
merchant_idx         int64
verified_numeric     int64
dtype: object
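
To make the weighting in the interaction score concrete, here is a small worked example using the same formula as the cell above. The helper name `interaction_score` and the sample inputs are hypothetical.

```python
import numpy as np


def interaction_score(avg_rating, interaction_count, verified_pct):
    """Same weights as the notebook cell: rating 0.7, log-count 0.2, verified 0.1."""
    return 0.7 * avg_rating + 0.2 * np.log1p(interaction_count) + 0.1 * verified_pct


# One unverified 1-star review: 0.7 * 1 + 0.2 * ln(2) + 0, about 0.84
low = interaction_score(1.0, 1, 0.0)

# Same 4-star average, verified, with 1 versus 10 interactions
single = interaction_score(4.0, 1, 1.0)
repeat = interaction_score(4.0, 10, 1.0)

print(f"low={low:.2f}, single={single:.2f}, repeat={repeat:.2f}")
```

Because the count enters through `log1p`, ten interactions add only about 0.34 over a single one, so the average rating dominates the score.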

In [7]:
# Create sparse interaction matrix
print("Creating sparse customer-merchant matrix...")

n_customers = len(customer_to_idx)
n_merchants = len(merchant_to_idx)

from scipy.sparse import csr_matrix
# Use interaction scores as matrix values
customer_merchant_matrix = csr_matrix(
    (customer_merchant_agg['interaction_score'],
     (customer_merchant_agg['customer_idx'], customer_merchant_agg['merchant_idx'])),
    shape=(n_customers, n_merchants)
)

# Matrix statistics
total_possible = n_customers * n_merchants
sparsity = (1 - customer_merchant_matrix.nnz / total_possible) * 100
memory_mb = (customer_merchant_matrix.data.nbytes + customer_merchant_matrix.indices.nbytes + customer_merchant_matrix.indptr.nbytes) / 1024**2

print(f"\nCustomer-Merchant Interaction Matrix:")
print(f"  Shape: {customer_merchant_matrix.shape[0]:,} x {customer_merchant_matrix.shape[1]:,}")
print(f"  Non-zero entries: {customer_merchant_matrix.nnz:,}")
print(f"  Sparsity: {sparsity:.2f}%")
print(f"  Memory usage: {memory_mb:.1f} MB")
print(f"  Score range: [{customer_merchant_matrix.data.min():.2f}, {customer_merchant_matrix.data.max():.2f}]")

Creating sparse customer-merchant matrix...

Customer-Merchant Interaction Matrix:
  Shape: 334,924 x 7,139
  Non-zero entries: 2,398,431
  Sparsity: 99.90%
  Memory usage: 28.7 MB
  Score range: [0.84, 4.35]
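
The following toy example mirrors the construction above on a 3 x 4 matrix, so the sparsity formula and the CSR layout can be checked by hand. The data and the helper `merchants_for_customer` are hypothetical and only illustrate how one customer's row can be read from `indptr`, `indices`, and `data`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Four (customer, merchant, score) triplets in a 3 x 4 matrix
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
scores = np.array([3.5, 1.2, 4.0, 2.8])
matrix = csr_matrix((scores, (rows, cols)), shape=(3, 4))

# Same sparsity formula as the notebook cell: 4 of 12 cells are filled
sparsity = (1 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])) * 100


def merchants_for_customer(m, customer_idx):
    """Return {merchant_idx: score} for one row of a CSR matrix."""
    start, end = m.indptr[customer_idx], m.indptr[customer_idx + 1]
    return dict(zip(m.indices[start:end].tolist(), m.data[start:end].tolist()))


print(f"sparsity: {sparsity:.2f}%")
print(merchants_for_customer(matrix, 0))
```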


## 4. Temporal Split & Save Results

In [8]:
# Create temporal split for validation
print("Creating train-test split...")

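# Sort by review time if available
if 'unixreviewtime' in reviews_with_merchants.columns:
    # unixreviewtime is stored as object (string), so sort on its numeric value
    # to get chronological rather than lexicographic order
    reviews_with_merchants = reviews_with_merchants.sort_values(
        'unixreviewtime', key=lambda s: pd.to_numeric(s, errors='coerce')
    )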
    
    # Use 80% for training, 20% for testing (temporal split)
    split_idx = int(len(reviews_with_merchants) * 0.8)
    train_data = reviews_with_merchants.iloc[:split_idx].copy()
    test_data = reviews_with_merchants.iloc[split_idx:].copy()
    
    print(f"Temporal split based on review time")
else:
    # Random split if no time data
    train_data = reviews_with_merchants.sample(frac=0.8, random_state=42)
    test_data = reviews_with_merchants.drop(train_data.index)
    
    print(f"Random split (no temporal data)")

print(f"Train data: {len(train_data):,} interactions")
print(f"Test data: {len(test_data):,} interactions")

# Create test matrix (for evaluation)
# Ensure overall column is numeric before aggregation
test_data_clean = test_data.copy()
test_data_clean['overall'] = pd.to_numeric(test_data_clean['overall'], errors='coerce')

test_agg = test_data_clean.groupby(['customer_idx', 'merchant_idx']).agg({
    'overall': 'mean'
}).round(2)
test_agg.columns = ['avg_rating']
test_agg = test_agg.reset_index()

test_customer_merchant_matrix = csr_matrix(
    (test_agg['avg_rating'],
     (test_agg['customer_idx'], test_agg['merchant_idx'])),
    shape=(n_customers, n_merchants)
)

print(f"Test matrix: {test_customer_merchant_matrix.nnz:,} interactions")

Creating train-test split...
Temporal split based on review time
Train data: 2,470,123 interactions
Test data: 617,531 interactions
Test matrix: 509,432 interactions
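
Note that `customer_merchant_matrix` from the earlier cell is aggregated over all reviews, including those that end up in `test_data`. If the goal is to keep later interactions out of training, one option is to build the training matrix from `train_data` only. The sketch below shows this on toy data; `build_rating_matrix` and the toy frame are hypothetical, and it uses the mean rating as the matrix value for brevity rather than the interaction score.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def build_rating_matrix(df, n_customers, n_merchants):
    """Aggregate mean rating per (customer, merchant) pair into a CSR matrix."""
    agg = df.groupby(["customer_idx", "merchant_idx"], as_index=False)["overall"].mean()
    return csr_matrix(
        (agg["overall"].to_numpy(),
         (agg["customer_idx"].to_numpy(), agg["merchant_idx"].to_numpy())),
        shape=(n_customers, n_merchants),
    )


# Toy interactions (hypothetical), already in chronological order
toy = pd.DataFrame({
    "customer_idx":   [0, 1, 0, 2, 1],
    "merchant_idx":   [0, 1, 1, 0, 1],
    "overall":        [5.0, 3.0, 4.0, 2.0, 1.0],
    "unixreviewtime": [100, 200, 300, 400, 500],
}).sort_values("unixreviewtime")

split_idx = int(len(toy) * 0.8)
train, test = toy.iloc[:split_idx], toy.iloc[split_idx:]

# Both matrices share one shape so indices stay comparable
train_matrix = build_rating_matrix(train, n_customers=3, n_merchants=2)
test_matrix = build_rating_matrix(test, n_customers=3, n_merchants=2)

# Customers that appear only in the test period cannot be scored by a
# collaborative filtering model trained on train_matrix
cold_customers = set(test["customer_idx"]) - set(train["customer_idx"])
```

In the toy data, the pair (customer 1, merchant 1) has a rating of 3.0 in the training period and 1.0 in the test period; the training matrix keeps 3.0 and is unaffected by the later review.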


In [10]:
# Create features directory
import os 
features_dir = '../data/features'
os.makedirs(features_dir, exist_ok=True)

# Save interaction matrices
from scipy.sparse import save_npz

save_npz(f'{features_dir}/customer_merchant_matrix.npz', customer_merchant_matrix)
save_npz(f'{features_dir}/test_customer_merchant_matrix.npz', test_customer_merchant_matrix)

# Save mappings
mappings = {
    'customer_to_idx': customer_to_idx,
    'merchant_to_idx': merchant_to_idx,
    'idx_to_customer': idx_to_customer,
    'idx_to_merchant': idx_to_merchant
}

import pickle
with open(f'{features_dir}/merchant_mappings.pickle', 'wb') as f:
    pickle.dump(mappings, f)

# Save processed data
train_data.to_parquet(f'{features_dir}/train_merchant_data.parquet')
test_data.to_parquet(f'{features_dir}/test_merchant_data.parquet')
customer_merchant_agg.to_parquet(f'{features_dir}/customer_merchant_features.parquet')

# Save merchant summary
# Ensure overall column is numeric before aggregation
reviews_with_merchants['overall'] = pd.to_numeric(reviews_with_merchants['overall'], errors='coerce')

merchant_summary = reviews_with_merchants.groupby('merchant_idx').agg({
    'overall': ['count', 'mean'],
    'reviewerid': 'nunique',
    'brand': 'first'
}).round(2)
merchant_summary.columns = ['review_count', 'avg_rating', 'customer_count', 'merchant_name']
merchant_summary.to_parquet(f'{features_dir}/merchant_features.parquet')


# List the saved files (header first, then one line per file)
print(f"\nSaved to {features_dir}/:")
for filename in sorted(os.listdir(features_dir)):
    print(f"  - {filename}")


Saved to ../data/features/:
  - .gitkeep
  - customer_merchant_features.parquet
  - customer_merchant_matrix.npz
  - merchant_features.parquet
  - merchant_mappings.pickle
  - test_customer_merchant_matrix.npz
  - test_merchant_data.parquet
  - train_merchant_data.parquet
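
For downstream notebooks, the saved artifacts can be read back with `scipy.sparse.load_npz` and `pickle.load`. The sketch below performs a round trip in a temporary directory with toy data (hypothetical) and checks that the matrix shape agrees with the mapping sizes, which is a useful sanity check after loading.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

with tempfile.TemporaryDirectory() as tmp_dir:
    matrix_path = os.path.join(tmp_dir, "customer_merchant_matrix.npz")
    mappings_path = os.path.join(tmp_dir, "merchant_mappings.pickle")

    # Toy artifacts (hypothetical) saved the same way as in the cell above
    toy_matrix = csr_matrix(np.array([[0.0, 2.5], [1.5, 0.0]]))
    toy_mappings = {
        "customer_to_idx": {"u1": 0, "u2": 1},
        "merchant_to_idx": {"Acme": 0, "Bolt": 1},
    }
    save_npz(matrix_path, toy_matrix)
    with open(mappings_path, "wb") as f:
        pickle.dump(toy_mappings, f)

    # Load them back
    loaded_matrix = load_npz(matrix_path)
    with open(mappings_path, "rb") as f:
        loaded_mappings = pickle.load(f)

# Sanity checks: shape must agree with the mapping sizes
expected_shape = (
    len(loaded_mappings["customer_to_idx"]),
    len(loaded_mappings["merchant_to_idx"]),
)
shape_ok = loaded_matrix.shape == expected_shape
values_ok = np.array_equal(loaded_matrix.toarray(), toy_matrix.toarray())
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.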
