# **[Group 52] - RETAIL AND ECOMMERCE – RECOMMENDATION ENGINE FOR PERSONALIZED SHOPPING**

### **Project Overview**
This notebook documents the development of an AI-driven recommendation system for an e-commerce platform. The goal is to address declining revenue and low conversion rates by providing personalized product recommendations.

We will leverage the **Retailrocket dataset** to build a **Wide & Deep learning model**. This model architecture is ideal for this task as it can both **memorize** simple, direct interaction patterns (e.g., "popular items are frequently bought") and **generalize** complex, latent user preferences (e.g., "users who like this brand of shoes might also like this brand of apparel").

**The project follows these key stages:**
1.  **Business & Data Understanding:** Exploring the dataset to understand user behavior and data characteristics.
2.  **Data Preparation:** Cleaning, transforming, and engineering features for the model.
3.  **Model Development:** Designing and building the Wide & Deep architecture using TensorFlow/Keras.
4.  **Training & Evaluation:** Training the model and assessing its performance using metrics like accuracy, precision, and recall.
5.  **Inference:** Creating a function to generate real-time recommendations for a given user.

---

## **1. Setup and Environment**

First, let's import the necessary libraries and set up the environment.

In [None]:
# Install necessary libraries (if not already installed)
# !pip install tensorflow pandas scikit-learn seaborn

# --- Core Libraries ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from collections import defaultdict

# --- TensorFlow and Keras ---
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate, Dropout, Add
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

# --- Scikit-learn for Preprocessing and Metrics ---
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix

# --- Notebook-specific settings ---
from IPython.display import Image
import warnings
warnings.filterwarnings("ignore")

# Print TensorFlow version to ensure compatibility
print(f"TensorFlow version: {tf.__version__}")

## **2. Data Loading**

We will use the Retailrocket dataset from Kaggle. The easiest way to access this in Colab is to upload your `kaggle.json` API key.

In [None]:
# Step 1: Upload your kaggle.json file
from google.colab import files
print("Please upload your kaggle.json file")
files.upload()

# Step 2: Set up Kaggle directory and permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Step 3: Download and unzip the dataset
!kaggle datasets download -d retailrocket/ecommerce-dataset
!unzip ecommerce-dataset.zip

Now, let's load the datasets into Pandas DataFrames.

In [None]:
# Define file paths
events_path = "events.csv"
item_props1_path = "item_properties_part1.csv"
item_props2_path = "item_properties_part2.csv"
category_tree_path = "category_tree.csv"

# Load CSVs
events = pd.read_csv(events_path)
item_props1 = pd.read_csv(item_props1_path)
item_props2 = pd.read_csv(item_props2_path)
category_tree = pd.read_csv(category_tree_path)

# Combine item properties for convenience
item_props = pd.concat([item_props1, item_props2], axis=0, ignore_index=True)

print("--- Dataset Shapes ---")
print("Events shape:", events.shape)
print("Merged item_props shape:", item_props.shape)
print("Category tree shape:", category_tree.shape)

## **3. Business and Data Understanding (EDA)**

In this section, we'll perform Exploratory Data Analysis (EDA) to understand the data's structure, identify patterns, and uncover insights that will inform our modeling strategy.

### **3.1 A Quick Glance at the Data**

Let's look at the first few rows of each DataFrame.

In [None]:
print("\nEvents head:")
display(events.head())
print("\nItem Props head:")
display(item_props.head())
print("\nCategory Tree head:")
display(category_tree.head())

### **3.2 Deep Dive: `events.csv`**

This is our primary dataset, containing user interactions.

In [None]:
# Basic info, nulls, and unique values
print("=== EVENTS DATASET EXPLORATION ===\n")
print("1. Basic Info")
events.info()

print("\n2. Null Values per Column (%):")
print((events.isnull().sum() / events.shape[0]) * 100)

print("\n3. Unique Value Counts per Column:")
for col_name in events.columns:
    unique_vals = events[col_name].nunique()
    print(f"  {col_name}: {unique_vals}")

**Inferences for `events.csv`:**
*   **High Nulls in `transactionid`**: Over 99% of `transactionid` values are null. This is expected, as this ID only exists for 'transaction' events, which are rare.
*   **Large Scale**: The dataset contains over 2.7 million events from 1.4 million unique visitors and 235k unique items.
*   **Data Sparsity**: With about 2.7 million events spread across 1.4 million visitors, the average visitor generates fewer than two events. Most users therefore interact with only a few items, a classic "long-tail" distribution.
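
To make the sparsity concrete, here is a back-of-the-envelope sketch. It uses the rounded figures quoted in the bullets above rather than exact dataset counts, and `interaction_density` is a hypothetical helper name. The result is an upper bound, since repeated events on the same user-item pair are counted separately.

```python
def interaction_density(n_events, n_users, n_items):
    """Upper bound on the fraction of user-item pairs with at least one event."""
    return n_events / (n_users * n_items)

# Rounded figures from the inferences above (assumed, not exact counts)
n_events, n_users, n_items = 2_700_000, 1_400_000, 235_000

density = interaction_density(n_events, n_users, n_items)
events_per_user = n_events / n_users

print(f"Density upper bound: {density:.2e}")
print(f"Average events per visitor: {events_per_user:.2f}")
```

With the real data, the same helper could be fed `len(events)`, `events['visitorid'].nunique()`, and `events['itemid'].nunique()`.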

#### **Distribution of Event Types**
This shows the e-commerce conversion funnel.

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='event', data=events, order=['view', 'addtocart', 'transaction'])
plt.title("Event Type Distribution (Conversion Funnel)", fontsize=14)
plt.xlabel("Event Type", fontsize=12)
plt.ylabel("Count (in millions)", fontsize=12)
plt.show()

**Inference:** There's a massive drop-off from `view` to `addtocart`, and another significant drop to `transaction`. Our model's goal is to predict the events on the right side of this funnel, which are valuable but rare.
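
The drop-off between funnel stages can also be expressed as stage-to-stage ratios. The sketch below is a minimal illustration on a made-up event list; `funnel_rates` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def funnel_rates(event_types, stages=("view", "addtocart", "transaction")):
    """Return each stage's count and its ratio to the previous stage's count."""
    counts = Counter(event_types)
    rates = {}
    prev = None
    for stage in stages:
        n = counts.get(stage, 0)
        # The first stage (or a stage following an empty one) has no ratio
        ratio = None if not prev else n / prev
        rates[stage] = (n, ratio)
        prev = n
    return rates

# Toy event log (made up for illustration, not taken from the dataset)
toy_events = ["view"] * 100 + ["addtocart"] * 5 + ["transaction"] * 2
toy_rates = funnel_rates(toy_events)
for stage, (n, ratio) in toy_rates.items():
    print(stage, n, "-" if ratio is None else f"{ratio:.1%}")
```

On the notebook's data, the call would presumably be `funnel_rates(events['event'])`.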

### **3.3 User and Item-Level Analysis**

Let's examine how interactions are distributed across users and items.

In [None]:
# User-level analysis
user_event_counts = events.groupby('visitorid')['event'].count()
print("=== Distribution of events per user ===")
print(user_event_counts.describe())

# Item-level analysis
item_event_counts = events.groupby('itemid')['event'].count()
print("\n=== Distribution of events per item ===")
print(item_event_counts.describe())

**Inference:** Both user and item interactions are heavily skewed. The median user has only 1 event, and the median item has only 3 events. However, a small number of "power users" and "popular items" have thousands of interactions. This long-tail distribution is a key challenge for personalization.
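
One way to quantify this skew is the share of all events contributed by the most active fraction of users or items. The sketch below assumes a hypothetical `top_share` helper and uses made-up counts, so the printed number says nothing about the real dataset.

```python
import numpy as np

def top_share(counts, frac=0.01):
    """Share of all events contributed by the most active `frac` of entities."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]  # descending order
    k = max(1, int(np.ceil(frac * len(counts))))  # always keep at least one entity
    return counts[:k].sum() / counts.sum()

# Toy counts (made up): 99 users with 1 event each and one user with 901
toy_counts = [1] * 99 + [901]
toy_share = top_share(toy_counts, frac=0.01)
print(f"Top 1% of users account for {toy_share:.1%} of events")
```

In the notebook, `top_share(user_event_counts.values)` and `top_share(item_event_counts.values)` would give the corresponding figures for users and items.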

---

## **4. Data Preparation & Feature Engineering**

Here, we prepare the data for our Wide & Deep model. This involves creating a target variable, engineering features for the "wide" part, and encoding IDs for the "deep" part.

### **4.1 Creating the Target Variable**

Our goal is to predict high-intent actions. We'll define our target `y=1` for `addtocart` or `transaction` events and `y=0` for `view` events.

In [None]:
df = events.copy()
# Vectorized equivalent of a row-wise lambda: 1 for high-intent events, 0 otherwise
df['target'] = df['event'].isin(['addtocart', 'transaction']).astype(int)

print("Target variable distribution:")
print(df['target'].value_counts(normalize=True) * 100)

**Observation:** The positive class (target=1) is only about 3.3% of the dataset. This is a highly imbalanced classification problem.
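
A useful reference point for any imbalanced problem is the accuracy of a classifier that always predicts the majority class. The sketch below computes it from the approximate positive rate quoted above; `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(positive_rate):
    """Accuracy of a classifier that always predicts the more frequent class."""
    return max(positive_rate, 1.0 - positive_rate)

positive_rate = 0.033  # approximate share of target=1 quoted above
baseline_acc = majority_baseline_accuracy(positive_rate)
print(f"Always-predict-0 accuracy: {baseline_acc:.1%}")
```

Any accuracy figure reported later should be read against this baseline.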

### **4.2 Feature Engineering (for the Wide Branch)**

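The "wide" branch of our model will learn from simple, interpretable features. We will engineer two count features, plus a constant placeholder for item availability:
1.  **`user_event_count`**: How active is this user?
2.  **`item_event_count`**: How popular is this item?
3.  **`item_available`**: A placeholder set to 1 for every item seen in `events`, so it carries no signal yet.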

In [None]:
# Compute user and item interaction counts
user_event_count_dict = df.groupby('visitorid')['event'].count().to_dict()
item_event_count_dict = df.groupby('itemid')['event'].count().to_dict()

# A placeholder for item availability (can be enhanced with item_props data)
item_available_dict = {item_id: 1 for item_id in df['itemid'].unique()}

print(f"Created user count dict for {len(user_event_count_dict)} users.")
print(f"Created item count dict for {len(item_event_count_dict)} items.")

### **4.3 Data Transformation**

We need to transform our features into a format the model can accept.

#### **A. Normalizing Wide Features**
We scale the count features to a [0, 1] range to prevent features with large values from dominating the model.

In [None]:
# Create arrays for scaling
user_ids_arr = list(user_event_count_dict.keys())
user_counts_arr = np.array(list(user_event_count_dict.values()), dtype=float).reshape(-1, 1)

item_ids_arr = list(item_event_count_dict.keys())
item_counts_arr = np.array(list(item_event_count_dict.values()), dtype=float).reshape(-1, 1)

# Scale using MinMaxScaler
scaler_user = MinMaxScaler()
user_counts_scaled = scaler_user.fit_transform(user_counts_arr)

scaler_item = MinMaxScaler()
item_counts_scaled = scaler_item.fit_transform(item_counts_arr)

# Update dictionaries with scaled values
for i, uid in enumerate(user_ids_arr):
    user_event_count_dict[uid] = user_counts_scaled[i][0]
for i, iid in enumerate(item_ids_arr):
    item_event_count_dict[iid] = item_counts_scaled[i][0]

print("User and item counts have been scaled.")
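
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The NumPy sketch below reimplements that formula in a hypothetical `min_max_scale` helper on made-up counts. It also shows a side effect worth keeping in mind with long-tailed counts: a single large value pushes all other scaled values close to 0.

```python
import numpy as np

def min_max_scale(x):
    """Scale values to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)  # constant input: avoid division by zero
    return (x - x.min()) / span

# Toy event counts (made up) with one heavy outlier, as in a long-tail distribution
toy = [1, 1, 2, 3, 1001]
scaled = min_max_scale(toy)
print(scaled)
```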

#### **B. Encoding Deep Features**
The "deep" branch needs integer indices for its embedding layers. We use `LabelEncoder` for `visitorid` and `itemid`.

In [None]:
visitor_encoder = LabelEncoder()
item_encoder = LabelEncoder()

df['visitor_enc'] = visitor_encoder.fit_transform(df['visitorid'])
df['item_enc'] = item_encoder.fit_transform(df['itemid'])

print("Visitor and Item IDs have been label encoded.")
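
Conceptually, label encoding assigns each distinct ID a contiguous integer in sorted order. The sketch below mimics that with a plain dictionary and adds one assumption that is not part of `LabelEncoder`: IDs unseen when the index was built share a single extra slot equal to the number of known IDs. `build_index` and `encode` are hypothetical helpers, and the IDs are toy values.

```python
import numpy as np

def build_index(ids):
    """Map each distinct ID to a contiguous integer, in sorted order."""
    classes = np.unique(ids)  # sorted unique values
    return {raw_id: idx for idx, raw_id in enumerate(classes.tolist())}

def encode(raw_id, index):
    """Return the ID's index, or len(index) as a shared slot for unseen IDs."""
    return index.get(raw_id, len(index))

toy_index = build_index([257597, 992329, 257597, 111016])
print(toy_index)
print(encode(992329, toy_index), encode(42, toy_index))
```

If such a shared slot is used, any embedding table indexed by these values needs `len(index) + 1` rows so that the extra index stays in range.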

### **4.4 Assembling the Final Datasets**

Now, we construct the final input arrays for our model: one for the wide branch and one for the deep branch.

In [None]:
# Assemble wide features with vectorized lookups
# (same values as a row-wise apply, but far faster on millions of rows)
wide_features = np.column_stack([
    df['visitorid'].map(user_event_count_dict).fillna(0.0),
    df['itemid'].map(item_event_count_dict).fillna(0.0),
    df['itemid'].map(item_available_dict).fillna(0)
]).astype(float)

# Assemble deep features
deep_features = df[['visitor_enc', 'item_enc']].values

# Labels
labels = df['target'].values

# Train/Validation Split
(
    wide_features_train, wide_features_val,
    deep_features_train, deep_features_val,
    y_train, y_val
) = train_test_split(
    wide_features,
    deep_features,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels  # Important for imbalanced data
)

print("--- Final Dataset Shapes ---")
print("Wide Features Train Shape:", wide_features_train.shape)
print("Deep Features Train Shape:", deep_features_train.shape)
print("Labels Train Shape:", y_train.shape)
print("Wide Features Val Shape:", wide_features_val.shape)
print("Deep Features Val Shape:", deep_features_val.shape)
print("Labels Val Shape:", y_val.shape)
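
The effect of `stratify=labels` is that both splits keep roughly the same positive rate. The NumPy sketch below illustrates the idea with a hand-rolled per-class split on made-up labels. It is a simplified stand-in, not scikit-learn's actual algorithm, and `stratified_split_indices` is a hypothetical helper.

```python
import numpy as np

def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then carve off the validation share
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(test_size * len(cls_idx)))
        val_idx.extend(cls_idx[:n_val])
        train_idx.extend(cls_idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Toy labels (made up): 5% positives
toy_labels = np.array([1] * 50 + [0] * 950)
tr, va = stratified_split_indices(toy_labels)
print(toy_labels[tr].mean(), toy_labels[va].mean())
```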

### **4.5 Save Preprocessing Objects for Inference**

It's crucial to save the encoders and dictionaries so we can apply the exact same transformations during prediction time.

In [None]:
# Save dictionaries
with open('wide_features_dicts.pkl', 'wb') as f:
    pickle.dump({
        'user_event_count_dict': user_event_count_dict,
        'item_event_count_dict': item_event_count_dict,
        'item_available_dict': item_available_dict
    }, f)

# Save encoders
with open('visitor_encoder.pkl', 'wb') as f:
    pickle.dump(visitor_encoder, f)
with open('item_encoder.pkl', 'wb') as f:
    pickle.dump(item_encoder, f)

print("Preprocessing objects saved successfully.")

---
## **5. Model Development (Wide & Deep)**

Now we define our model architecture.

### **5.1 Defining Callbacks**

Callbacks help manage the training process by saving the best model, stopping training early if there's no improvement, and adjusting the learning rate.

In [None]:
# EarlyStopping to prevent overfitting
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True,
    verbose=1
)

# ModelCheckpoint to save the best model
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)

# ReduceLROnPlateau to adjust learning rate
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    min_lr=1e-6,
    verbose=1
)

callbacks = [early_stop, model_ckpt, reduce_lr]
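
To build intuition for the `patience` setting, the sketch below simulates a simplified version of the early-stopping rule on a made-up list of validation losses: training stops once the loss has failed to improve for `patience` consecutive epochs. This is an approximation for illustration (for example, it ignores `min_delta`), and `simulate_early_stopping` is a hypothetical helper, not a Keras API.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

toy_losses = [1.00, 0.90, 0.80, 0.85, 0.86, 0.87, 0.88, 0.89, 0.70]
stop_epoch, best_epoch = simulate_early_stopping(toy_losses, patience=5)
print(stop_epoch, best_epoch)
```

In this toy run the later, lower loss of 0.70 is never reached, which shows the trade-off: a small `patience` saves time but can stop before a late improvement.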

### **5.2 Building the Model Architecture**

We'll build the two branches and then combine them.

In [None]:
# --- Parameters ---
wide_dim = wide_features_train.shape[1]
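# Reserve one extra embedding row per table for IDs unseen during training.
# The inference fallback uses index == number of known classes, which would
# otherwise fall outside the embedding table.
max_visitor_id = df['visitor_enc'].max() + 2
max_item_id = df['item_enc'].max() + 2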
embedding_dim = 8

# --- Wide Branch (Memorization) ---
wide_input = Input(shape=(wide_dim,), name="wide_input")
wide_output = Dense(1, activation="sigmoid", name="wide_output")(wide_input)

# --- Deep Branch (Generalization) ---
deep_input = Input(shape=(2,), name="deep_input")

# Embeddings for categorical features
visitor_ids = deep_input[:, 0]
item_ids = deep_input[:, 1]

visitor_embed = Embedding(input_dim=max_visitor_id, output_dim=embedding_dim, name="visitor_embedding")(visitor_ids)
item_embed = Embedding(input_dim=max_item_id, output_dim=embedding_dim, name="item_embedding")(item_ids)

# Flatten and concatenate embeddings
visitor_flat = Flatten()(visitor_embed)
item_flat = Flatten()(item_embed)
deep_concat = Concatenate(name="deep_concat")([visitor_flat, item_flat])

# Dense layers with dropout for regularization
deep_dense1 = Dense(64, activation="relu", name="deep_dense1")(deep_concat)
drop1 = Dropout(0.3, name="dropout1")(deep_dense1)
deep_dense2 = Dense(32, activation="relu", name="deep_dense2")(drop1)
drop2 = Dropout(0.3, name="dropout2")(deep_dense2)
deep_output = Dense(1, activation="sigmoid", name="deep_output")(drop2)

# --- Combine Branches ---
combined = Add(name="wide_deep_merge")([wide_output, deep_output])
final_output = Dense(1, activation="sigmoid", name="final_output")(combined)

# --- Create and Compile Model ---
model = Model(inputs=[wide_input, deep_input], outputs=final_output)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')]
)

model.summary()
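
For comparison, the original Wide & Deep formulation sums the wide and deep contributions at the logit level and applies a single sigmoid, whereas the model above applies a sigmoid in each branch and again after the merge. The NumPy sketch below illustrates only the logit-sum variant with made-up numbers. It is not the notebook's model, and `wide_deep_probability` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_deep_probability(wide_x, wide_w, deep_logit, bias=0.0):
    """Sum the wide (linear) logit and the deep logit, then apply one sigmoid."""
    wide_logit = np.dot(wide_x, wide_w)
    return sigmoid(wide_logit + deep_logit + bias)

# Toy numbers (made up): three wide features and a deep-branch logit
wide_x = np.array([0.2, 0.1, 1.0])
wide_w = np.array([1.0, -2.0, 0.0])
p = wide_deep_probability(wide_x, wide_w, deep_logit=0.0)
print(p)
```

In Keras terms, this variant would roughly correspond to linear (`activation=None`) outputs for both branches followed by one sigmoid after the `Add` layer. The metrics reported in this notebook were obtained with the architecture as defined above, so switching would require retraining and re-evaluation.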

Let's visualize the architecture.

In [None]:
plot_model(model, to_file="wide_deep_model.png", show_shapes=True, show_layer_names=True)
Image("wide_deep_model.png")
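
Most of this model's parameters sit in the two embedding tables; the dense layers contribute only a few thousand weights. The sketch below estimates the embedding size from the rounded entity counts in Section 3.2 (an assumption; `model.summary()` gives the exact numbers). `embedding_params` is a hypothetical helper.

```python
def embedding_params(n_rows, dim):
    """Number of trainable weights in an embedding table."""
    return n_rows * dim

# Rounded entity counts from Section 3.2 (assumed, not exact)
n_visitors, n_items, embedding_dim = 1_400_000, 235_000, 8

visitor_params = embedding_params(n_visitors, embedding_dim)
item_params = embedding_params(n_items, embedding_dim)
total_embedding_params = visitor_params + item_params
print(f"{total_embedding_params:,} embedding weights")
```

Because most visitors have only one or two events, many rows of the visitor table receive very few gradient updates.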

---
## **6. Model Training**

We're now ready to train the model using our prepared datasets.

In [None]:
history = model.fit(
    x=[wide_features_train, deep_features_train],
    y=y_train,
    validation_data=([wide_features_val, deep_features_val], y_val),
    epochs=50,  # A high number, early stopping will handle the rest
    batch_size=1024, # Increased batch size for faster training
    callbacks=callbacks,
    verbose=1
)

### **Visualizing Training History**

Let's plot the training and validation metrics to check for overfitting and see how the model learned.

In [None]:
def plot_training_history(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

    # Plot Accuracy
    ax1.plot(history.history['accuracy'], label='Train Accuracy')
    ax1.plot(history.history['val_accuracy'], label='Val Accuracy')
    ax1.set_title("Model Accuracy over Epochs")
    ax1.set_xlabel("Epoch")
    ax1.set_ylabel("Accuracy")
    ax1.legend()

    # Plot Loss
    ax2.plot(history.history['loss'], label='Train Loss')
    ax2.plot(history.history['val_loss'], label='Val Loss')
    ax2.set_title("Model Loss over Epochs")
    ax2.set_xlabel("Epoch")
    ax2.set_ylabel("Loss")
    ax2.legend()

    plt.tight_layout()
    plt.show()

plot_training_history(history)

---
## **7. Model Evaluation**

Now, we'll load the best-performing model (saved by `ModelCheckpoint`) and evaluate it on our validation set.

### **7.1 Load Best Model and Make Predictions**

In [None]:
# Load the best model saved during training
best_model = tf.keras.models.load_model('best_model.keras')

# Generate predictions (probabilities)
val_probs = best_model.predict([wide_features_val, deep_features_val]).ravel()

# Convert probabilities to binary predictions (0 or 1)
val_preds = (val_probs > 0.5).astype(int)

### **7.2 Performance Assessment**

We'll use a classification report and confusion matrix to get a detailed view of performance.

In [None]:
print("--- Classification Report ---\n")
print(classification_report(y_val, val_preds, target_names=['View (0)', 'Purchase Intent (1)']))

print("\n--- Confusion Matrix ---")
cm = confusion_matrix(y_val, val_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix', fontsize=16)
plt.ylabel('Actual Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.show()

**Evaluation Insights:**
*   **Accuracy (96%)**: This figure is not informative on its own. With only about 3.3% positives, a model that always predicts '0' would already reach roughly 96.7% accuracy.
*   **Precision (71%)**: This is our key success metric. When the model predicts that a user has purchase intent, it is correct 71% of the time, so the items it flags are likely to be relevant.
*   **Low Recall (14%)**: The model misses about 86% of the actual purchase intents. We accept this trade-off at the 0.5 threshold because showing fewer, more reliable recommendations is preferable to showing many irrelevant ones.
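
Precision and recall can be summarized in a single number with the F1 score, their harmonic mean. The sketch below applies the standard formula to the two values quoted above; `f1_score_from` is a hypothetical helper, and the classification report printed earlier remains the authoritative source for the exact figure.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.71, 0.14  # values quoted in the insights above
f1 = f1_score_from(precision, recall)
print(f"F1 for the positive class: {f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller value, the low recall dominates the result.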

---

## **8. Recommendation Generation (Inference)**

This final section simulates how the trained model would be used to generate recommendations in a live system. We'll create a function that takes a user ID and a list of candidate items, and returns the top 5 recommended items.

### **8.1 Load Saved Preprocessing Objects**

First, load the objects we saved earlier.

In [None]:
# Load dictionaries
with open('wide_features_dicts.pkl', 'rb') as f:
    dicts = pickle.load(f)
    user_event_count_dict = dicts['user_event_count_dict']
    item_event_count_dict = dicts['item_event_count_dict']
    item_available_dict = dicts['item_available_dict']

# Load encoders
with open('visitor_encoder.pkl', 'rb') as f:
    visitor_encoder_loaded = pickle.load(f)
with open('item_encoder.pkl', 'rb') as f:
    item_encoder_loaded = pickle.load(f)

print("Inference objects loaded.")

### **8.2 Recommendation Function**

This function encapsulates the entire prediction pipeline.

In [None]:
def get_top_n_recommendations(user_id, candidate_items, n=5):
    """
    Generates top N recommendations for a given user from a list of candidate items.

    Args:
        user_id (int): The original visitorid.
        candidate_items (list): A list of original itemids to score.
        n (int): The number of recommendations to return.

    Returns:
        list: A list of tuples, where each tuple is (item_id, probability_score).
    """
    wide_list = []
    deep_list = []

    # Encode user ID (handle unseen users)
    try:
        user_enc = visitor_encoder_loaded.transform([user_id])[0]
    except ValueError:
        user_enc = len(visitor_encoder_loaded.classes_) # Fallback for new user

    for item_id in candidate_items:
        # --- Prepare Deep Features ---
        try:
            item_enc = item_encoder_loaded.transform([item_id])[0]
        except ValueError:
            item_enc = len(item_encoder_loaded.classes_) # Fallback for new item
        deep_list.append([user_enc, item_enc])

        # --- Prepare Wide Features ---
        u_count = user_event_count_dict.get(user_id, 0.0)
        i_count = item_event_count_dict.get(item_id, 0.0)
        avail = item_available_dict.get(item_id, 0) # Assumes 0 if not in dict
        wide_list.append([u_count, i_count, avail])

    # Convert to numpy arrays for the model
    wide_array = np.array(wide_list, dtype=np.float32)
    deep_array = np.array(deep_list, dtype=np.int32)  # embedding indices must be integers

    # Predict probabilities
    pred_probs = best_model.predict([wide_array, deep_array]).ravel()

    # Get top N items
    # Negating the scores makes argsort return indices in descending order of probability
    sorted_indices = np.argsort(-pred_probs)
    top_n_indices = sorted_indices[:n]

    # Create the final recommendation list
    recommendations = [
        (candidate_items[i], pred_probs[i]) for i in top_n_indices
    ]

    return recommendations
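
The ranking step at the end of the function can be isolated and checked without a trained model. The sketch below reproduces it on made-up scores; `top_n` is a hypothetical helper used only for illustration.

```python
import numpy as np

def top_n(items, scores, n=5):
    """Return the n (item, score) pairs with the highest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)[:n]  # negate so the largest score sorts first
    return [(items[i], float(scores[i])) for i in order]

toy_items = [101, 102, 103, 104]
toy_scores = [0.10, 0.80, 0.30, 0.55]
top2 = top_n(toy_items, toy_scores, n=2)
print(top2)
```

When `n` exceeds the number of candidates, the slice simply returns all of them in ranked order.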


### **8.3 Example Usage**

Let's test our function with a sample user and some candidate items.

In [None]:
# Example: Recommend items for a user
# Let's pick a real user and some real items from the dataset
example_user_id = events['visitorid'].value_counts().index[100] # The 101st most active user
candidate_items_example = events['itemid'].value_counts().index[100:110].tolist() # Items ranked 101-110 by event count

print(f"Generating recommendations for User ID: {example_user_id}")
print(f"Scoring these candidate items: {candidate_items_example}\n")

recommendations = get_top_n_recommendations(example_user_id, candidate_items_example, n=5)

print("--- Top 5 Recommendations ---")
for item_id, prob in recommendations:
    print(f"Item ID: {item_id:<10} | Predicted Purchase Probability: {prob:.4f}")

---

## **9. Conclusion and Next Steps**

### **Conclusion**
This project developed a Wide & Deep recommendation model that predicts user purchase intent with **71% precision** and 14% recall at a 0.5 threshold on the validation set. Its **96% accuracy** is close to what a majority-class baseline would achieve on data this imbalanced, so precision is the more meaningful figure. We built an end-to-end pipeline from data exploration and feature engineering to model training and a functional inference prototype. The model's precision suggests that its recommendations are likely to be relevant, which supports the business goal of improving conversion rates.

### **Future Enhancements**
1.  **Enrich Features:** Integrate data from the item properties files (`item_properties_part1.csv` and `item_properties_part2.csv`, e.g., category, brand) and `category_tree.csv` to give the model more context about the items, improving its generalization ability.
2.  **Handle Cold Start:** Implement strategies for new users and items, such as using content-based features for new items or defaulting to "most popular" recommendations for new users.
3.  **Explore Session-Based Models:** For more dynamic recommendations, explore sequence-aware models like LSTMs, GRUs, or Transformers that can understand the user's actions within their current session.
4.  **Address Class Imbalance:** Experiment with techniques like SMOTE (oversampling) or using class weights during training to potentially improve the model's recall without sacrificing too much precision.
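
As a starting point for the class-weight idea in item 4, the sketch below computes "balanced" weights with the common heuristic `n_samples / (n_classes * class_count)` on made-up labels. `balanced_class_weights` is a hypothetical helper, and whether these weights actually improve recall here would need to be tested.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (made up): 4% positives
toy_labels = [1] * 4 + [0] * 96
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary like this, built from `y_train.tolist()`, could then be passed to `model.fit` through its `class_weight` argument.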