# Week 3: Feature Engineering

# We will perform feature engineering for the getINNOtized recommendation system, creating features for both collaborative and content-based filtering:
# - **User-Item Interaction Matrix**: A weighted matrix of user-item interactions (view=1, addtocart=3, transaction=5), summed per user-item pair and capped at 5.
# - **Item Features**: The most recent `categoryid` for each item, taken from `item_properties_part1.csv` and `item_properties_part2.csv`.
# - **Category Features**: The depth of each category in `category_tree.csv`, to enhance diversity in recommendations.

# These features will support modeling to answer the seven business questions, which cover personalization, conversion optimization, seasonal trends, popularity, diversity, and algorithm performance.

In [9]:
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
import joblib
import os

# Define paths
preprocessed_data_dir = "../data/preprocessed_data/"
raw_data_dir = "../data/"

# Load raw data (events, item properties, category tree) and preprocessed data (user behavior)
events = pd.read_csv(os.path.join(raw_data_dir, "events.csv"))
user_behavior = pd.read_csv(os.path.join(preprocessed_data_dir, "user_behavior.csv"))
item_properties_part1 = pd.read_csv(os.path.join(raw_data_dir, "item_properties_part1.csv"))
item_properties_part2 = pd.read_csv(os.path.join(raw_data_dir, "item_properties_part2.csv"))
category_tree = pd.read_csv(os.path.join(raw_data_dir, "category_tree.csv"))

# Concatenate item properties
item_properties = pd.concat([item_properties_part1, item_properties_part2])

# Load user and item ID mappings
user_ids = pd.read_csv(os.path.join(preprocessed_data_dir, "user_ids.csv"))
item_ids = pd.read_csv(os.path.join(preprocessed_data_dir, "item_ids.csv"))

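# Convert to dictionaries of the form {matrix index: original ID}
# (assumes column "0" holds the original IDs and the row position is the index)
user_ids = user_ids["0"].to_dict()
item_ids = item_ids["0"].to_dict()

# Reverse the mappings for lookup: {original ID: matrix index}
user_id_to_idx = {v: k for k, v in user_ids.items()}
item_id_to_idx = {v: k for k, v in item_ids.items()}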
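
# The later cells look up a matrix index from an original `visitorid` or `itemid`, so the lookup dictionary has to be keyed by the original ID. The sketch below illustrates the direction of each dictionary on toy data. It assumes the mapping CSVs have a single column named `"0"` holding the original IDs, with the row position serving as the index; the IDs shown are invented for illustration.

```python
import pandas as pd

# Toy stand-in for user_ids.csv (hypothetical IDs, not taken from the dataset)
toy_ids = pd.DataFrame({"0": [257597, 992329, 111016]})

# Series.to_dict() is keyed by the row index: {row position: original ID}
idx_to_id = toy_ids["0"].to_dict()

# Swapping keys and values gives the lookup direction: {original ID: row position}
id_to_idx = {original_id: idx for idx, original_id in idx_to_id.items()}

print(idx_to_id)
print(id_to_idx)
```

# A comprehension of the form `{k: v for k, v in d.items()}` only copies the dictionary; the swap `{v: k ...}` is what reverses it.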

## User-Item Interaction Matrix

# We will create a sparse user-item interaction matrix where rows are users (visitorid), columns are items (itemid), and values are weighted interaction scores:
# - View: 1
# - Add to Cart: 3
# - Transaction: 5

# This matrix will be used for collaborative filtering to personalize recommendations.

In [10]:
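# Apply anomaly filtering: keep only users with at most 100 events per day
normal_users = user_behavior[user_behavior['events_per_day'] <= 100]['visitorid'].to_list()
# .copy() avoids a SettingWithCopyWarning when the 'rating' column is added below
events = events[events['visitorid'].isin(normal_users)].copy()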

# Map event types to weights
event_weights = {'view': 1, 'addtocart': 3, 'transaction': 5}
events['rating'] = events['event'].map(event_weights)

# Aggregate ratings by user and item
interactions = events.groupby(['visitorid', 'itemid'])['rating'].sum().reset_index()

# Map to indices
interactions['user_idx'] = interactions['visitorid'].map(user_id_to_idx)
interactions['item_idx'] = interactions['itemid'].map(item_id_to_idx)

# Remove unmapped entries
interactions = interactions.dropna(subset=['user_idx', 'item_idx'])
interactions['user_idx'] = interactions['user_idx'].astype(int)
interactions['item_idx'] = interactions['item_idx'].astype(int)

# Cap summed ratings at 5 (the weight of a single transaction)
interactions['rating'] = np.minimum(interactions['rating'], 5)

# Create sparse matrix
rows = interactions['user_idx'].values
cols = interactions['item_idx'].values
data = interactions['rating'].values
user_item_matrix = coo_matrix((data, (rows, cols)), shape=(len(user_ids), len(item_ids)))

# Save the sparse matrix
with open(os.path.join(preprocessed_data_dir, "user_item_sparse.pkl"), "wb") as f:
    joblib.dump(user_item_matrix, f)

print("Sparse matrix saved. Rating distribution:")
print(interactions['rating'].value_counts())

Sparse matrix saved. Rating distribution:
rating
1    719082
2     48417
5     18494
3     17191
4      8315
Name: count, dtype: int64
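
# The values 2 and 4 in this distribution are not event weights. They appear because the weights are summed per user-item pair before the cap is applied. A minimal sketch on an invented event log, using the same weights and the same sum-then-cap steps as the cell above:

```python
import numpy as np
import pandas as pd

event_weights = {"view": 1, "addtocart": 3, "transaction": 5}

# Hypothetical event log: three users interacting with the same item
toy_events = pd.DataFrame({
    "visitorid": [1, 1, 2, 2, 3, 3, 3],
    "itemid": [10, 10, 10, 10, 10, 10, 10],
    "event": ["view", "view",
              "view", "addtocart",
              "view", "addtocart", "transaction"],
})
toy_events["rating"] = toy_events["event"].map(event_weights)

# Sum the weights per (user, item) pair, then cap at 5
toy = toy_events.groupby(["visitorid", "itemid"])["rating"].sum().reset_index()
toy["capped"] = np.minimum(toy["rating"], 5)

print(toy)
```

# Here user 1 (two views) ends at 2, user 2 (view plus add to cart) at 4, and user 3 (view, add to cart, transaction) sums to 9 and is capped at 5. One consequence is that a capped 5 does not always imply a purchase: five views of the same item also reach 5.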


## Item Features

# I will extract item features from `item_properties_part1.csv` and `item_properties_part2.csv`:

# - **Category**: The most recent `categoryid` value recorded for each item.

# These features will be used for content-based filtering to optimize conversion rates.

In [11]:
# Filter properties to get categoryid
category_properties = item_properties[item_properties['property'] == 'categoryid'].copy()

# Convert value to numeric
category_properties['value'] = pd.to_numeric(category_properties['value'], errors='coerce')

# Drop rows where value is NaN
category_properties = category_properties.dropna(subset=['value'])

# Convert timestamp to datetime
category_properties['timestamp'] = pd.to_datetime(category_properties['timestamp'], unit='ms')

# Sort by timestamp and keep the most recent record for each item
category_properties = category_properties.sort_values('timestamp').groupby('itemid').last().reset_index()

# Map itemid to item_idx
category_properties['item_idx'] = category_properties['itemid'].map(item_id_to_idx)

# Drop rows where item_idx is NaN
category_properties = category_properties.dropna(subset=['item_idx'])
category_properties['item_idx'] = category_properties['item_idx'].astype(int)

# Select relevant columns
item_features = category_properties[['item_idx', 'value']].copy()
item_features.columns = ['item_idx', 'categoryid']

# Save item features
item_features.to_csv(os.path.join(preprocessed_data_dir, "item_features.csv"), index=False)
print("Item features saved:", item_features.head())

Item features saved:    item_idx  categoryid
0         0         209
1         1        1114
2         2        1305
3         3        1171
4         4        1038
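
# The "most recent record per item" step can also be written with `drop_duplicates`, which may be easier to read. The sketch below uses invented property rows. One difference worth knowing: `groupby(...).last()` takes the last non-null value in each column separately, whereas `drop_duplicates(keep="last")` keeps the whole last row. Since rows with a missing `value` were already dropped, the two should agree on the `value` column here.

```python
import pandas as pd

# Hypothetical property history: two records per item, not in time order
toy_props = pd.DataFrame({
    "itemid": [7, 7, 8, 8],
    "timestamp": pd.to_datetime([1000, 3000, 2000, 500], unit="ms"),
    "value": [100, 200, 300, 400],
})

# Sort by time, then keep the last (most recent) row for each item
latest = (
    toy_props.sort_values("timestamp")
    .drop_duplicates(subset="itemid", keep="last")
    .sort_values("itemid")
    .reset_index(drop=True)
)

print(latest)
```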


## Category Features

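# I will compute category depths from `category_tree.csv`, to enhance diversity in recommendations. Root categories (those with no `parentid`) have depth 0, and each child is one level deeper than its parent.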

In [12]:
# Function to calculate category depth
def calculate_depth(category_tree):
    depth = {}
    
    def traverse(node, current_depth):
        depth[node] = current_depth
        children = category_tree[category_tree['parentid'] == node]['categoryid']
        for child in children:
            traverse(child, current_depth + 1)
    
    # Start with root nodes (where parentid is NaN)
    root_nodes = category_tree[category_tree['parentid'].isna()]['categoryid']
    for root in root_nodes:
        traverse(root, 0)
    
    return depth

# Calculate depths
category_depths = calculate_depth(category_tree)

# Convert to DataFrame
category_depths_df = pd.DataFrame(list(category_depths.items()), columns=['categoryid', 'depth'])

# Save category features
category_depths_df.to_csv(os.path.join(preprocessed_data_dir, "category_features.csv"), index=False)
print("Category features saved:", category_depths_df.head())

Category features saved:    categoryid  depth
0         231      0
1         791      0
2         587      1
3         769      2
4        1680      2
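
# The recursive function above filters the whole DataFrame once per node, which can become slow on large trees, and deep trees could reach Python's recursion limit. A possible alternative, sketched below on an invented tree, builds a parent-to-children lookup once and walks it breadth-first. The function name `calculate_depth_bfs` is hypothetical. Like the original, it only assigns depths to categories reachable from a root.

```python
from collections import defaultdict, deque

import pandas as pd


def calculate_depth_bfs(tree):
    """Return {categoryid: depth}, with root categories at depth 0."""
    children = defaultdict(list)
    roots = []
    for cat, parent in zip(tree["categoryid"].tolist(), tree["parentid"].tolist()):
        if pd.isna(parent):
            roots.append(cat)
        else:
            children[parent].append(cat)

    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:  # skip already-visited nodes (guards against cycles)
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Hypothetical tree: 1 is the root, 2 and 4 are its children, 3 is a child of 2
toy_tree = pd.DataFrame({
    "categoryid": [1, 2, 3, 4],
    "parentid": [float("nan"), 1, 2, 1],
})
toy_depths = calculate_depth_bfs(toy_tree)
print(toy_depths)
```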


## Feature Engineering Summary

# We have created the following features:
# - **User-Item Matrix**: A sparse matrix with shape (number of users, number of items), saved as `user_item_sparse.pkl`. The user and item index mappings are read from `user_ids.csv` and `item_ids.csv`.
# - **Item Features**: The most recent `categoryid` for each item, saved as `item_features.csv`.
# - **Category Features**: Computed category depths, saved as `category_features.csv`.

# These features are ready for collaborative filtering, content-based filtering, and modeling to address all seven business questions.
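
# One way the item and category features might be combined downstream is a left join on `categoryid`, which attaches a depth to each item. This is a sketch on invented rows that mimic the column layout of `item_features.csv` and `category_features.csv`; it is not part of the pipeline above.

```python
import pandas as pd

# Toy stand-ins for the two saved feature files (values are invented)
toy_item_features = pd.DataFrame({
    "item_idx": [0, 1, 2],
    "categoryid": [209, 1114, 999],
})
toy_category_features = pd.DataFrame({
    "categoryid": [209, 1114],
    "depth": [2, 3],
})

# A left join keeps every item; items whose category has no computed depth get NaN
item_with_depth = toy_item_features.merge(
    toy_category_features, on="categoryid", how="left"
)

print(item_with_depth)
```

# A left join is used so that items in categories missing from the depth table are kept and can be handled explicitly rather than dropped without notice.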

## Next Steps

# - **Modeling**: Build and evaluate a baseline collaborative filtering model using the user-item interaction matrix.
# - **Further Analysis**: Explore content-based filtering approaches using the item and category features.