<a href="https://colab.research.google.com/github/Codewiz19/WWT_Team_CTRL-Z-Mandal-/blob/main/Main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-End Solution for WWT Competition 2025

This notebook provides an end-to-end implementation of a hybrid recommendation engine designed to increase average order value for Wings R Us. It follows the strategic blueprint outlined in the project documentation and combines several recommendation techniques to deliver personalized, relevant, and novel suggestions.

**System Architecture:**

The core of this solution is a hybrid model that combines five algorithms, all trained on context-aware data:

- 🛒 **Market Basket Analysis (Apriori):** Uncovers 'if-then' purchasing patterns between specific items (e.g., customers who buy "10 pc Spicy Wings" also buy "Ranch Dip - Regular").
- 📝 **Content-Based Filtering (TF-IDF):** Recommends items by matching a detailed profile of the current order, including the specific items, customer type, order occasion (ToGo/Delivery), and city.
- 🧠 **Factorization Machines (FM):** A latent-factor model, implemented here with user and item embeddings plus bias terms, that learns personalized tastes from each user's purchase history.
- 🤖 **Neural Collaborative Filtering (NCF):** Combines an element-wise matrix-factorization branch with a multi-layer perceptron to capture non-linear user preferences. In the current configuration its blending weight is set to 0 because it was overfitting.
- 📈 **Popularity-Based Fallback:** A safety net that ensures all users receive sensible, crowd-pleasing recommendations when other models cannot find a strong signal.
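To make the 'if-then' idea behind Market Basket Analysis concrete, the sketch below computes support, confidence, and lift for a single rule. The baskets and the `support` helper are made up for illustration and are not taken from the competition data; the notebook itself relies on `mlxtend` for this step.

```python
# Toy baskets (hypothetical, for illustration only)
baskets = [
    {"10 pc spicy wings", "ranch dip regular"},
    {"10 pc spicy wings", "ranch dip regular", "fries"},
    {"10 pc spicy wings"},
    {"fries"},
    {"ranch dip regular", "fries"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent = {"10 pc spicy wings"}
consequent = {"ranch dip regular"}

supp_a = support(antecedent)                  # share of baskets with the antecedent
supp_c = support(consequent)                  # share of baskets with the consequent
supp_both = support(antecedent | consequent)  # share of baskets with both

confidence = supp_both / supp_a   # estimate of P(consequent | antecedent)
lift = confidence / supp_c        # > 1 means bought together more often than chance

print(f"support={supp_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Lift above 1 indicates the two items co-occur more often than they would if purchased independently, which is why the rules are later filtered and ranked by lift.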

**Key Innovations Implemented:**

- **Context-Rich Feature Engineering:** We've moved beyond simple item names. The model now learns from a detailed profile for each transaction that includes the customer type, order occasion (ToGo/Delivery), and the specific city of the store.
- **Specific Item Recommendations:** The models are trained directly on orderable item names (e.g., "10 pc Spicy Wings Combo"), allowing them to make specific, actionable recommendations without a secondary mapping step.
- **Location-Aware Suggestions:** By incorporating the city into the item profiles, the system can learn and recommend items based on local and regional taste preferences.
- **Full Cold-Start Handling:** The hybrid design ensures effective recommendations for new customers and newly added menu items by gracefully falling back from personalized models to cart-based and popularity-based logic.

## Part 1: Setup and Data Ingestion

### 1.1: Install and Import Libraries
First, we install and import all the necessary Python libraries. This includes standard data manipulation tools, machine learning libraries, and specific packages for association rule mining and deep learning.

In [1]:
!pip install pandas numpy scikit-learn mlxtend torch -q

In [2]:


import pandas as pd
import numpy as np
import json
import re
import time
from collections import Counter
from itertools import chain

# Scikit-learn for content-based filtering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# MLxtend for Market Basket Analysis
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# PyTorch for Deep Learning Models (FM and NCF)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Set device to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


### 1.2: Load Datasets
We now load all the provided data files into pandas DataFrames. We'll also perform an initial inspection (`.info()`, `.head()`) to understand their structure and identify any immediate data quality issues.

In [4]:
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')

# --- 1. Load Data ---
file_path = '/content/drive/MyDrive/Colab Notebooks/Dataset/'

order_data = pd.read_csv(file_path + 'order_data.csv')
customer_data = pd.read_csv(file_path + 'customer_data.csv')
store_data = pd.read_csv(file_path + 'store_data.csv')
test_data = pd.read_csv(file_path + 'test_data_question.csv')


print("\n--- Order Data Info ---")
order_data.info()
print("\n--- Customer Data Info ---")
customer_data.info()
print("\n--- Store Data Info ---")
store_data.info()
print("\n--- Test Data Info ---")
test_data.info()

Mounted at /content/drive

--- Order Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 8 columns):
 #   Column                 Non-Null Count    Dtype 
---  ------                 --------------    ----- 
 0   CUSTOMER_ID            1048575 non-null  int64 
 1   STORE_NUMBER           1048575 non-null  int64 
 2   ORDER_CREATED_DATE     1048575 non-null  object
 3   ORDER_ID               1048575 non-null  int64 
 4   ORDERS                 1048575 non-null  object
 5   ORDER_CHANNEL_NAME     1048575 non-null  object
 6   ORDER_SUBCHANNEL_NAME  1048575 non-null  object
 7   ORDER_OCCASION_NAME    1048575 non-null  object
dtypes: int64(3), object(5)
memory usage: 64.0+ MB

--- Customer Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 563346 entries, 0 to 563345
Data columns (total 2 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   CUSTOMER_ID    563346 non

## Part 2: Data Preprocessing and Feature Engineering

This is the most critical stage. The quality of our recommendations depends entirely on the quality of our data. The following steps are not just a simple cleaning process; they are a direct result of insights gained during our initial Exploratory Data Analysis (EDA).

Our EDA revealed several key challenges and opportunities:

1. The raw order data is nested within complex JSON strings, requiring careful parsing.
2. The data contains significant noise, including non-food items like 'memos' and 'tips' that must be filtered out.
3. Rich contextual information, such as the store's city, the customer's type, and the order occasion (ToGo/Delivery), is available and crucial for building a nuanced model.

Based on these findings, we will implement an advanced preprocessing strategy to create a clean, structured, and intelligent dataset ready for our hybrid recommendation engine.
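As a rough illustration of findings 1 and 2, the sketch below parses a hypothetical payload with the same nesting that `parse_order_json` in the next cell expects (`orders` → first entry → `item_details` → `item_name`). The sample payload, the `extract_items` helper, and the whole-word keyword match are assumptions made for this example; the notebook's own filter uses a plain substring test, which is broader and would also flag a name such as "cabbage slaw" because it contains "bag".

```python
import json
import re

# Hypothetical payload mirroring the nesting the parser expects
sample = json.dumps({
    "orders": [{
        "item_details": [
            {"item_name": "10 pc Spicy Wings Combo"},
            {"item_name": "Ranch Dip - Regular"},
            {"item_name": "Order Memo ASAP"},
            {"item_name": "Delivery Fee"},
        ]
    }]
})

# Subset of the system keywords, matched as whole words (an assumption)
SYSTEM_WORDS = re.compile(r"\b(tip|fee|bag|delivery|memo|asap|tax)\b")

def extract_items(orders_json):
    """Return cleaned, lowercase product names from one ORDERS string."""
    details = json.loads(orders_json)["orders"][0]["item_details"]
    items = []
    for entry in details:
        name = entry["item_name"].lower()
        if SYSTEM_WORDS.search(name):
            continue  # skip non-product lines such as memos and fees
        name = re.sub(r"[^a-z0-9\s]", "", name)   # drop punctuation
        items.append(re.sub(r"\s+", " ", name).strip())
    return items

items = extract_items(sample)
print(items)
```

Whether whole-word or substring matching is the better trade-off depends on the actual menu; it is worth checking which product names each variant removes before settling on one.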

In [5]:

# Assume order_data, customer_data, and store_data are already loaded

def clean_item_name(name):
    """Standardizes item names by converting to lowercase and removing special characters."""
    if not isinstance(name, str):
        return ""
    name = name.lower()
    name = re.sub(r'[^a-z0-9\s]', '', name)
    name = re.sub(r'\s+', ' ', name).strip()
    return name

def is_system_item(item):
    """Identifies and filters out non-product items."""
    system_keywords = [
        'tip', 'fee', 'bag', 'delivery', 'service', 'charge',
        'memo', 'blankline', 'asap', 'paid', 'subtotal', 'tax'
    ]
    name_to_check = ""
    if isinstance(item, dict):
        name_to_check = item.get('item_name', '').lower()
    elif isinstance(item, str):
        name_to_check = item.lower()
    if not name_to_check:
        return False
    return any(keyword in name_to_check for keyword in system_keywords)

def parse_order_json(orders_json):
    """Parses the nested JSON string to extract a list of clean item names."""
    if not isinstance(orders_json, str):
        return []
    try:
        data = json.loads(orders_json)
        if 'orders' in data and isinstance(data['orders'], list) and data['orders']:
            if 'item_details' in data['orders'][0] and isinstance(data['orders'][0]['item_details'], list):
                item_list = data['orders'][0]['item_details']
            else:
                return []
        else:
            return []
        cleaned_items = []
        for item in item_list:
            if is_system_item(item):
                continue
            name_to_clean = None
            if isinstance(item, dict) and 'item_name' in item:
                name_to_clean = item['item_name']
            elif isinstance(item, str):
                name_to_clean = item
            if name_to_clean:
                cleaned_name = clean_item_name(name_to_clean)
                if cleaned_name:
                    cleaned_items.append(cleaned_name)
        return cleaned_items
    except (json.JSONDecodeError, TypeError, KeyError, IndexError):
        return []

# --- Apply Preprocessing ---
print("Applying new preprocessing strategy...")

# 1. Merge all data sources into one DataFrame
print("Merging order, customer, and store data...")
merged_data = pd.merge(order_data, customer_data, on='CUSTOMER_ID', how='left')
merged_data = pd.merge(merged_data, store_data, on='STORE_NUMBER', how='left')

# Handle missing values for contextual features
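# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions.
merged_data['CUSTOMER_TYPE'] = merged_data['CUSTOMER_TYPE'].fillna('Guest')
merged_data['CITY'] = merged_data['CITY'].fillna('Unknown')
merged_data['ORDER_OCCASION_NAME'] = merged_data['ORDER_OCCASION_NAME'].fillna('Unknown')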

# 2. Parse the 'ORDERS' JSON column
print("Parsing order items from JSON...")
merged_data['parsed_orders'] = merged_data['ORDERS'].apply(parse_order_json)

# 3. Explode the DataFrame to have one row per item
# This structure is perfect for creating item profiles AND for Market Basket Analysis
full_data = merged_data.explode('parsed_orders').rename(columns={'parsed_orders': 'item_name'}).dropna(subset=['item_name'])

# This check prevents errors if the dataframe is empty
if not full_data.empty:
    # 4. Create the specific, trainable item name.
    # Names were already cleaned during parsing; clean_item_name is idempotent,
    # so this second pass is only a safeguard.
    full_data['specific_item_name'] = full_data['item_name'].apply(clean_item_name)

    # 5. Create the rich 'item_profile' for the Content-Based model.
    # The profile combines item name, customer type, order occasion, and city.
    print("Creating rich item profiles with context...")
    full_data['item_profile'] = full_data['specific_item_name'] + ' | ' + \
                                full_data['CUSTOMER_TYPE'].str.lower() + ' | ' + \
                                full_data['ORDER_OCCASION_NAME'].str.lower() + ' | ' + \
                                full_data['CITY'].str.lower().str.replace(' ', '')

    print("\n✅ Preprocessing complete!")
    print(f"Processed item rows: {len(full_data)}")

    # Display the key new columns. This is the data ready for the models.
    display_cols = [
        'specific_item_name',
        'CUSTOMER_TYPE',
        'ORDER_OCCASION_NAME',
        'CITY',
        'item_profile'
    ]
    print(full_data[display_cols].head())

else:
    print("\n❌ CRITICAL ERROR: The 'full_data' DataFrame is empty after processing.")

Applying new preprocessing strategy...
Merging order, customer, and store data...
Parsing order items from JSON...
Creating rich item profiles with context...

✅ Preprocessing complete!
Processed item rows: 2084311
          specific_item_name CUSTOMER_TYPE ORDER_OCCASION_NAME          CITY  \
0  10 pc grilled wings combo    Registered                ToGo     GRAPEVINE   
0   8 pc grilled wings combo    Registered                ToGo     GRAPEVINE   
0     8 pc spicy wings combo    Registered                ToGo     GRAPEVINE   
1          ranch dip regular    Registered                ToGo  HUNTERSVILLE   
1        50 pc grilled wings    Registered                ToGo  HUNTERSVILLE   

                                        item_profile  
0  10 pc grilled wings combo | registered | togo ...  
0  8 pc grilled wings combo | registered | togo |...  
0  8 pc spicy wings combo | registered | togo | g...  
1  ranch dip regular | registered | togo | hunter...  
1  50 pc grilled wings | regi

## Part 3: Building the Hybrid Recommendation System Class

With our data now clean and feature-rich, we can build the core of our solution. We will encapsulate all our recommendation logic into a single, well-organized Python class: `WingsRUsRecommendationSystem`.

This object-oriented approach helps manage the complexity of our hybrid strategy. The class houses the training methods for all five of our models, from the straightforward Popularity and Market Basket models to the embedding-based Factorization Machine and NCF models trained in PyTorch.

Finally, it will contain the most important method: a master recommend function that intelligently orchestrates the outputs from all five models, weighting their scores to produce a single, optimized list of recommendations.

In [6]:
# --- PyTorch Dataset and Model Definitions  ---
class FMDataset(Dataset):
    def __init__(self, user_item_pairs, labels):
        self.user_item_pairs = user_item_pairs
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return self.user_item_pairs[idx, 0], self.user_item_pairs[idx, 1], self.labels[idx]

class NCFDatasetWithNegativeSampling(Dataset):
    def __init__(self, users, items, labels):
        self.users = users
        self.items = items
        self.labels = labels
    def __len__(self):
        return len(self.users)
    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]

class FactorizationMachine(nn.Module):
    def __init__(self, num_users, num_items, k):
        super(FactorizationMachine, self).__init__()
        self.user_embed = nn.Embedding(num_users, k)
        self.item_embed = nn.Embedding(num_items, k)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))
    def forward(self, users, items):
        user_v = self.user_embed(users)
        item_v = self.item_embed(items)
        dot_product = (user_v * item_v).sum(1)
        # squeeze(-1) drops only the trailing dimension, so a batch of size 1 keeps shape (1,)
        user_b = self.user_bias(users).squeeze(-1)
        item_b = self.item_bias(items).squeeze(-1)
        return dot_product + user_b + item_b + self.global_bias

class NeuralCollaborativeFiltering(nn.Module):
    def __init__(self, num_users, num_items, mf_dim, layers):
        super(NeuralCollaborativeFiltering, self).__init__()
        self.mf_user_embed = nn.Embedding(num_users, mf_dim)
        self.mf_item_embed = nn.Embedding(num_items, mf_dim)
        self.mlp_user_embed = nn.Embedding(num_users, layers[0] // 2)
        self.mlp_item_embed = nn.Embedding(num_items, layers[0] // 2)
        self.mlp_layers = nn.ModuleList()
        for i in range(len(layers) - 1):
            self.mlp_layers.append(nn.Linear(layers[i], layers[i+1]))
            self.mlp_layers.append(nn.ReLU())
        self.final_layer = nn.Linear(mf_dim + layers[-1], 1)
    def forward(self, users, items):
        mf_user_v = self.mf_user_embed(users)
        mf_item_v = self.mf_item_embed(items)
        gmf_out = mf_user_v * mf_item_v
        mlp_user_v = self.mlp_user_embed(users)
        mlp_item_v = self.mlp_item_embed(items)
        mlp_in = torch.cat([mlp_user_v, mlp_item_v], dim=-1)
        mlp_out = mlp_in
        for layer in self.mlp_layers:
            mlp_out = layer(mlp_out)
        fusion = torch.cat([gmf_out, mlp_out], dim=-1)
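        # squeeze(-1) keeps the batch dimension even when the batch holds one sample,
        # so the output shape always matches the label shape expected by the loss
        return self.final_layer(fusion).squeeze(-1)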
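
Reading the two `forward` methods above, the scores can be written compactly. The symbols are introduced here only for illustration: $w_0$ is `global_bias`, $b_u$ and $b_i$ are the user and item biases, and $\mathbf{p}_u$, $\mathbf{q}_i$ are the embedding vectors of dimension $k$. With only a user index and an item index as inputs, the `FactorizationMachine` reduces to biased matrix factorization:

$$
\hat{y}_{ui}^{\text{FM}} = w_0 + b_u + b_i + \sum_{f=1}^{k} p_{uf}\, q_{if}
$$

For `NeuralCollaborativeFiltering`, the element-wise product of the MF embeddings is concatenated with the MLP output and passed through the final linear layer (weights $\mathbf{h}$, bias $c$), giving a logit rather than a probability; $\mathbf{p}'_u$ and $\mathbf{q}'_i$ denote the separate MLP embeddings:

$$
\hat{y}_{ui}^{\text{NCF}} = \mathbf{h}^{\top} \big[\, \mathbf{p}_u \odot \mathbf{q}_i \;;\; \mathrm{MLP}\big([\mathbf{p}'_u ; \mathbf{q}'_i]\big) \,\big] + c
$$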

### The `WingsRUsRecommendationSystem` Class: The Conductor 🎼

Think of this class as the conductor of an orchestra. There are five musicians (the five models), each playing a different instrument with its own strength. The `WingsRUsRecommendationSystem` class is responsible for:

1. Training each musician individually (the `_train_*` methods).
2. Listening to all of them at once when a recommendation is needed.
3. Deciding whose instrument should be loudest (the dynamic weights) to produce the most effective final recommendation.

Encapsulating everything in a single class keeps the code organized, reusable, and easy to manage.

### The Dynamic Weighting Strategy: Personalization vs. Generalization

The core idea behind using different weights for Registered and Guest users is to play to each model's strengths given the data available for that specific customer.

**For Registered Users: A Personalized Experience**

1. What we have: For a registered user, there is a purchase history. We know what they have bought in the past, which gives strong clues about their personal tastes.
2. The strategy: The weights for registered users favor the Factorization Machine (FM) model (40% weight). With NCF currently disabled (weight 0, because it was overfitting), FM is the only active model that learns a per-user "taste profile". Giving it the highest weight tells the system to prioritize what we know about this customer's own history. Market Basket Analysis (30%), Content-Based (20%), and Popularity (10%) supply complementary suggestions.

**For Guest Users: A Smart, General Approach**

1. What we have: For a guest user, there is no purchase history. We only know what is currently in their cart.
2. The strategy: The weights for guest users turn off the personalized models (FM and NCF have 0% weight) because there is no data to personalize from. Instead, the system relies on the non-personalized models:
   - Market Basket Analysis (40%): This becomes the most important model. It doesn't know the user, but it knows that "people who buy X also buy Y," so it can suggest complementary items based on the guest's cart.
   - Content-Based and Popularity (30% each): These models provide similar items and safe, popular choices, which suit a user we know nothing about.

This dynamic approach is designed so that every recommendation draws on the most relevant information available for that specific customer.
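
The weighting scheme described above can be sketched in a few lines of plain Python. The weight values match those in the `recommend` method of the class that follows; the `blend` helper and the candidate scores are hypothetical and only illustrate the max-normalized weighted sum, not the full class.

```python
from collections import Counter

# Per-customer-type weights, as used in the recommend() method
WEIGHTS = {
    "Registered": {"popularity": 0.1, "market_basket": 0.3, "content": 0.2, "fm": 0.4, "ncf": 0.0},
    "Guest": {"popularity": 0.3, "market_basket": 0.4, "content": 0.3, "fm": 0.0, "ncf": 0.0},
}

def blend(model_scores, customer_type):
    """Rank items by a weighted sum of each model's max-normalized scores."""
    weights = WEIGHTS.get(customer_type, WEIGHTS["Guest"])  # unknown types fall back to Guest
    combined = Counter()
    for model_name, recs in model_scores.items():
        weight = weights.get(model_name, 0.0)
        if weight == 0 or not recs:
            continue
        top = max(recs.values())
        if top <= 0:
            continue  # nothing meaningful to normalize against
        for item, score in recs.items():
            combined[item] += weight * score / top
    return [item for item, _ in combined.most_common()]

# Hypothetical candidate scores from each model
scores = {
    "popularity": {"fries": 1.0},
    "market_basket": {"ranch dip regular": 3.0, "veggie sticks": 1.5},
    "content": {"8 pc spicy wings combo": 0.9},
    "fm": {"lemonade": 1.0},
}

guest_ranking = blend(scores, "Guest")
registered_ranking = blend(scores, "Registered")
print(guest_ranking[0], "|", registered_ranking[0])
```

In this toy case the FM-only candidate tops the registered ranking but never appears for a guest, which is the intended effect of zeroing the personalized weights.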

In [7]:
class WingsRUsRecommendationSystem:
    def __init__(self):
        # Blending weights are chosen per customer type inside recommend(),
        # so only the device and a customer-type lookup are stored here.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

        self.customer_types = {} # To store a map of CUSTOMER_ID -> CUSTOMER_TYPE

        self.popular_items = []
        self.association_rules = pd.DataFrame()
        self.tfidf_vectorizer = TfidfVectorizer()
        self.tfidf_matrix = None
        self.item_encoder = LabelEncoder()
        self.user_encoder = LabelEncoder()
        self.fm_model = None
        self.ncf_model = None
        self.item_map = []  # list of item names aligned with the rows of tfidf_matrix

    def _train_popularity_model(self, data):
        print("Training Popularity Model...")
        self.popular_items = data['specific_item_name'].value_counts().nlargest(20).index.tolist()

    def _train_market_basket_model(self, data):
        print("Training Market Basket Model...")
        transactions = data.groupby('ORDER_ID')['specific_item_name'].apply(list).tolist()
        te = TransactionEncoder()
        te_ary = te.fit(transactions).transform(transactions)
        df = pd.DataFrame(te_ary, columns=te.columns_)
        frequent_itemsets = apriori(df, min_support=0.001, use_colnames=True)
        if not frequent_itemsets.empty:
            rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
            self.association_rules = rules.sort_values(by='lift', ascending=False)

    def _train_content_model(self, data):
        print("Training Content-Based Model...")
        item_profiles = data[['specific_item_name', 'item_profile']].drop_duplicates()
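        # item_map is aligned row-by-row with tfidf_matrix; an item name can
        # repeat, once per distinct context profile.
        self.item_map = item_profiles['specific_item_name'].tolist()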
        self.tfidf_matrix = self.tfidf_vectorizer.fit_transform(item_profiles['item_profile'])

    def _train_fm_model(self, train_loader, val_loader, num_users, num_items, epochs=8):
        print("Training Factorization Machine Model...")
        self.fm_model = FactorizationMachine(num_users, num_items, k=20).to(self.device)
        optimizer = optim.Adam(self.fm_model.parameters(), lr=0.01)
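        # Note: every target is 1 (observed purchases only, no negative samples),
        # so this loss only measures how close scores for seen pairs are to 1.
        criterion = nn.MSELoss()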
        for epoch in range(epochs):
            self.fm_model.train()
            total_train_loss = 0
            for users, items, _ in train_loader:
                users, items = users.to(self.device), items.to(self.device)
                optimizer.zero_grad()
                outputs = self.fm_model(users, items)
                loss = criterion(outputs, torch.ones(outputs.size(0)).to(self.device))
                loss.backward()
                optimizer.step()
                total_train_loss += loss.item()
            self.fm_model.eval()
            total_val_loss = 0
            with torch.no_grad():
                for users, items, _ in val_loader:
                    users, items = users.to(self.device), items.to(self.device)
                    outputs = self.fm_model(users, items)
                    val_loss = criterion(outputs, torch.ones(outputs.size(0)).to(self.device))
                    total_val_loss += val_loss.item()
            avg_train_loss = total_train_loss / len(train_loader)
            avg_val_loss = total_val_loss / len(val_loader)
            print(f"FM Epoch {epoch+1}/{epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

    def _train_ncf_model(self, train_loader, val_loader, num_users, num_items, epochs=5):
        print("Training Neural Collaborative Filtering Model...")
        self.ncf_model = NeuralCollaborativeFiltering(num_users, num_items, mf_dim=20, layers=[64, 32, 16]).to(self.device)
        optimizer = optim.Adam(self.ncf_model.parameters(), lr=0.001)
        criterion = nn.BCEWithLogitsLoss()
        for epoch in range(epochs):
            self.ncf_model.train()
            total_train_loss = 0
            for users, items, labels in train_loader:
                users, items, labels = users.to(self.device), items.to(self.device), labels.to(self.device)
                optimizer.zero_grad()
                outputs = self.ncf_model(users, items)
                loss = criterion(outputs, labels.float())
                loss.backward()
                optimizer.step()
                total_train_loss += loss.item()
            self.ncf_model.eval()
            total_val_loss = 0
            with torch.no_grad():
                for users, items, labels in val_loader:
                    users, items, labels = users.to(self.device), items.to(self.device), labels.to(self.device)
                    outputs = self.ncf_model(users, items)
                    val_loss = criterion(outputs, labels.float())
                    total_val_loss += val_loss.item()
            avg_train_loss = total_train_loss / len(train_loader)
            avg_val_loss = total_val_loss / len(val_loader)
            print(f"NCF Epoch {epoch+1}/{epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

    def _get_market_basket_recs(self, cart_items, n=10):
        if self.association_rules.empty: return {}
        recs = {}
        for item in cart_items:
            rules = self.association_rules[self.association_rules['antecedents'].apply(lambda x: item in x)]
            for _, rule in rules.iterrows():
                for consequent in rule['consequents']:
                    if consequent not in cart_items:
                        recs[consequent] = max(recs.get(consequent, 0), rule['lift'])
        return dict(sorted(recs.items(), key=lambda x: x[1], reverse=True)[:n])

    def _get_content_recs(self, cart_items, n=10):
        if not cart_items or self.tfidf_matrix is None: return {}
        try:
            item_indices = [self.item_map.index(item) for item in cart_items if item in self.item_map]
            if not item_indices: return {}
            avg_vector = self.tfidf_matrix[item_indices].mean(axis=0)
            avg_vector_array = np.asarray(avg_vector)
            sim_scores = cosine_similarity(avg_vector_array, self.tfidf_matrix).flatten()
            top_indices = sim_scores.argsort()[-n-len(cart_items):][::-1]
            recs = {self.item_map[i]: sim_scores[i] for i in top_indices if self.item_map[i] not in cart_items}
            return recs
        except ValueError: return {}

    def _get_deep_learning_recs(self, model, user_id, n=10):
        if model is None or user_id not in self.user_encoder.classes_: return {}
        model.eval()
        with torch.no_grad():
            user_idx = self.user_encoder.transform([user_id])[0]
            all_item_indices = torch.arange(len(self.item_encoder.classes_)).to(self.device)
            user_indices = torch.full_like(all_item_indices, user_idx)
            scores = model(user_indices, all_item_indices)
            k = min(n + 10, scores.numel())  # torch.topk raises if k exceeds the number of items
            top_scores, top_indices = torch.topk(scores, k)
            recs = {}
            item_names = self.item_encoder.inverse_transform(top_indices.cpu().numpy())
            for item, score in zip(item_names, top_scores.cpu().numpy()):
                recs[item] = score
            return recs

    def fit(self, data, ncf_neg_samples=4):
        print("--- Starting Training of Hybrid Recommendation System ---")
        start_time = time.time()

        # Build the CUSTOMER_ID -> CUSTOMER_TYPE lookup used for dynamic weighting
        print("Creating customer type lookup...")
        self.customer_types = data[['CUSTOMER_ID', 'CUSTOMER_TYPE']].drop_duplicates().set_index('CUSTOMER_ID')['CUSTOMER_TYPE'].to_dict()

        self._train_popularity_model(data)
        self._train_market_basket_model(data)
        self._train_content_model(data)

        print("\nPreparing data for deep learning models...")
        data = data.copy()  # add index columns to a copy rather than mutating the caller's DataFrame
        self.user_encoder.fit(data['CUSTOMER_ID'])
        self.item_encoder.fit(data['specific_item_name'])
        data['user_idx'] = self.user_encoder.transform(data['CUSTOMER_ID'])
        data['item_idx'] = self.item_encoder.transform(data['specific_item_name'])

        train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
        num_users = len(self.user_encoder.classes_)
        num_items = len(self.item_encoder.classes_)

        # FM DataLoaders
        train_dataset_fm = FMDataset(train_data[['user_idx', 'item_idx']].values, np.ones(len(train_data)))
        val_dataset_fm = FMDataset(val_data[['user_idx', 'item_idx']].values, np.ones(len(val_data)))
        train_loader_fm = DataLoader(train_dataset_fm, batch_size=1024, shuffle=True)
        val_loader_fm = DataLoader(val_dataset_fm, batch_size=1024)
        self._train_fm_model(train_loader_fm, val_loader_fm, num_users, num_items, epochs=8)

        # NCF Data Preparation
        print("\nPerforming negative sampling for NCF model...")
        user_item_interactions = data.groupby('user_idx')['item_idx'].apply(set)
        users = train_data['user_idx'].tolist()
        items = train_data['item_idx'].tolist()
        labels = [1] * len(users)
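        # Draw ncf_neg_samples negatives per user (not per positive interaction).
        for user_id in train_data['user_idx'].unique():
            interacted_items = user_item_interactions.get(user_id, set())
            if len(interacted_items) >= num_items:
                continue  # user has bought every item, so no negative exists (avoids an endless loop)
            for _ in range(ncf_neg_samples):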
                while True:
                    negative_item = np.random.randint(0, num_items)
                    if negative_item not in interacted_items:
                        users.append(user_id)
                        items.append(negative_item)
                        labels.append(0)
                        break
        train_dataset_ncf = NCFDatasetWithNegativeSampling(users, items, labels)
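        # The validation set holds positives only, so its loss is not directly
        # comparable to the training loss, which also includes sampled negatives.
        val_dataset_ncf = NCFDatasetWithNegativeSampling(val_data['user_idx'].values, val_data['item_idx'].values, np.ones(len(val_data)))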
        train_loader_ncf = DataLoader(train_dataset_ncf, batch_size=1024, shuffle=True)
        val_loader_ncf = DataLoader(val_dataset_ncf, batch_size=1024)
        self._train_ncf_model(train_loader_ncf, val_loader_ncf, num_users, num_items, epochs=5)

        print(f"--- Training Complete in {time.time() - start_time:.2f} seconds ---")

    def recommend(self, customer_id, cart_items, n=3):
        # --- DYNAMIC WEIGHTING IMPLEMENTATION ---
        # 1. Determine customer type with fallback to 'Guest'
        customer_type = self.customer_types.get(customer_id, 'Guest')

        # 2. Set weights based on customer type
        if customer_type == 'Registered':
            weights = {
                'popularity': 0.1,
                'market_basket': 0.3,
                'content': 0.2,
                'fm': 0.4,       # Higher personalization for registered users
                'ncf': 0.0       # NCF was overfitting, so its weight is 0 for now
            }
        else:  # Guest or other types
            weights = {
                'popularity': 0.3,
                'market_basket': 0.4,
                'content': 0.3,
                'fm': 0.0,       # No personalization for guests
                'ncf': 0.0
            }
        # --- END DYNAMIC WEIGHTING ---

        clean_cart = [clean_item_name(item) for item in cart_items if pd.notna(item)]

        pop_recs = {item: 1.0 for item in self.popular_items}
        mb_recs = self._get_market_basket_recs(clean_cart)
        content_recs = self._get_content_recs(clean_cart)
        fm_recs = self._get_deep_learning_recs(self.fm_model, customer_id)
        ncf_recs = self._get_deep_learning_recs(self.ncf_model, customer_id)

        combined_scores = Counter()
        all_recs = {'popularity': pop_recs, 'market_basket': mb_recs, 'content': content_recs, 'fm': fm_recs, 'ncf': ncf_recs}

        for model_name, recs in all_recs.items():
            weight = weights[model_name]
            if weight == 0: continue

            max_score = max(recs.values()) if recs else 1
            if max_score <= 0:
                continue  # nothing useful to normalize against; also avoids division by zero
            for item, score in recs.items():
                if item not in clean_cart:
                    normalized_score = score / max_score
                    combined_scores[item] += weight * normalized_score

        # Rank by blended score (empty when no model produced a candidate)
        final_recs = [item for item, _ in combined_scores.most_common()]

        # Pad with popular items, skipping cart items and duplicates
        for item in self.popular_items:
            if item not in clean_cart and item not in final_recs:
                final_recs.append(item)

        return final_recs[:n]

## Part 4: Training the System

With the class defined, we can now instantiate it and call the `.fit()` method. This single command will orchestrate the training of all five sub-models, preparing the system to generate recommendations.
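
One design point is worth weighing before training. The next cell samples individual item rows, which can split a basket and hide co-purchases from the Market Basket model. A possible alternative is to sample whole orders so that every kept basket stays intact. The sketch below shows the idea in plain Python on made-up rows; `sample_orders` is a hypothetical helper, and in pandas the same filter could be written as `full_data[full_data['ORDER_ID'].isin(sampled_ids)]`.

```python
import random

# Hypothetical exploded rows: one (ORDER_ID, item) pair per line item
rows = [
    (1, "10 pc spicy wings"), (1, "ranch dip regular"),
    (2, "fries"),
    (3, "8 pc grilled wings combo"), (3, "fries"), (3, "ranch dip regular"),
    (4, "50 pc grilled wings"),
]

def sample_orders(rows, n_orders, seed=42):
    """Keep every row of `n_orders` randomly chosen orders, so baskets stay intact."""
    order_ids = sorted({order_id for order_id, _ in rows})
    rng = random.Random(seed)  # seeded for reproducibility
    keep = set(rng.sample(order_ids, min(n_orders, len(order_ids))))
    return [row for row in rows if row[0] in keep]

sampled = sample_orders(rows, n_orders=2)
print(sampled)
```

The trade-off is that the number of item rows is no longer fixed in advance, so the memory budget would need to be expressed as a number of orders instead.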

In [8]:
# --- Reduce data size to prevent memory overload ---
# Create a smaller, random sample of item rows for training.

if len(full_data) > 800000:
    print(f"Original data size: {len(full_data)}. Sampling down to 800,000 rows to conserve memory.")
    training_data = full_data.sample(n=800000, random_state=42)
else:
    training_data = full_data

print(f"Training will proceed with {len(training_data)} data points.")

# Instantiate the recommendation system
recsys = WingsRUsRecommendationSystem()

# Train the system on our sampled, preprocessed data
recsys.fit(training_data)

Original data size: 2084311. Sampling down to 800,000 rows to conserve memory.
Training will proceed with 800000 data points.
Using device: cuda
--- Starting Training of Hybrid Recommendation System ---
Creating customer type lookup...
Training Popularity Model...
Training Market Basket Model...
Training Content-Based Model...

Preparing data for deep learning models...
Training Factorization Machine Model...
FM Epoch 1/8, Train Loss: 3.7809, Val Loss: 0.8263
FM Epoch 2/8, Train Loss: 0.5515, Val Loss: 0.5735
FM Epoch 3/8, Train Loss: 0.2695, Val Loss: 0.4504
FM Epoch 4/8, Train Loss: 0.1243, Val Loss: 0.3894
FM Epoch 5/8, Train Loss: 0.0634, Val Loss: 0.3587
FM Epoch 6/8, Train Loss: 0.0390, Val Loss: 0.3406
FM Epoch 7/8, Train Loss: 0.0283, Val Loss: 0.3267
FM Epoch 8/8, Train Loss: 0.0233, Val Loss: 0.3159

Performing negative sampling for NCF model...
Training Neural Collaborative Filtering Model...
NCF Epoch 1/5, Train Loss: 0.4343, Val Loss: 0.5939
NCF Epoch 2/5, Train Loss: 0.41

## Part 5: Generating and Saving the Final Submission File

In this step we iterate through the test set loaded from `test_data_question.csv`. For each row, we treat the provided items (`item1` to `item3`) as the customer's current cart and ask the trained system for its top 3 recommendations. The results are collected in a new DataFrame and saved as `Crtl+Z_Mandal_Recommendation Output Sheet.csv`.

In [9]:

print("Generating recommendations for the test data...")
results = []
start_time = time.time()

for _, row in test_data.iterrows():
    customer_id = row['CUSTOMER_ID']
    # Combine all item columns to form the cart
    cart = [item for item in [row.get('item1'), row.get('item2'), row.get('item3')] if pd.notna(item)]

    # Get top 3 recommendations
    recommendations = recsys.recommend(customer_id, cart, n=3)

    # Append to results list
    result_row = row.to_dict()
    result_row['RECOMMENDATION 1'] = recommendations[0] if len(recommendations) > 0 else 'N/A'
    result_row['RECOMMENDATION 2'] = recommendations[1] if len(recommendations) > 1 else 'N/A'
    result_row['RECOMMENDATION 3'] = recommendations[2] if len(recommendations) > 2 else 'N/A'
    results.append(result_row)

print(f"Generation complete in {time.time() - start_time:.2f} seconds.")

# Create a DataFrame from the results
full_results_df = pd.DataFrame(results)

# --- FIX: Define the exact columns needed for the final submission format ---
submission_columns = [
    'CUSTOMER_ID', 'ORDER_ID',
    'item1', 'item2', 'item3',
    'RECOMMENDATION 1', 'RECOMMENDATION 2', 'RECOMMENDATION 3'
]

# --- Select only those columns to create the final DataFrame ---
# This ensures the output matches the required format precisely.

final_submission_df = full_results_df[submission_columns]


# Save the final, formatted DataFrame to CSV
output_filename = 'Crtl+Z_Mandal_Recommendation Output Sheet.csv'
final_submission_df.to_csv(output_filename, index=False)

print(f"\n✅ Submission file '{output_filename}' created successfully in the correct format!")
display(final_submission_df.head())

Generating recommendations for the test data...
Generation complete in 13.88 seconds.

✅ Submission file 'Crtl+Z_Mandal_Recommendation Output Sheet.csv' created successfully in the correct format!


|   | CUSTOMER_ID | ORDER_ID | item1 | item2 | item3 | RECOMMENDATION 1 | RECOMMENDATION 2 | RECOMMENDATION 3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 997177535 | 9351345556 | Chicken Sub Combo | Ranch Dip - Regular | 10 pc Spicy Wings Combo | 10 pc spicy wings | 10 pc grilled wings | 6 pc spicy wings combo |
| 1 | 345593831 | 3595377080 | Regular Buffalo Fries | 10 pc Spicy Wings | 3 pc Crispy Strips Combo | 10 pc spicy wings combo | 10 pc grilled wings | ranch dip regular |
| 2 | 160955031 | 4071757785 | Large Buffalo Fries | 10 pc Spicy Wings | Ranch Dip - Regular | regular buffalo fries | ranch dip large | 10 pc grilled wings |
| 3 | 890671991 | 3931766769 | 6 pc Grilled Wings Combo | 20 pc Grilled Wings | Fried Corn - Large | regular buffalo fries | large buffalo fries | ranch dip large |
| 4 | 73989021 | 3739700809 | Regular Buffalo Fries | 20 pc Grilled Wings | Ranch Dip - Large | large buffalo fries | ranch dip regular | 10 pc spicy wings |


In [10]:

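import random

random.seed(42)  # make the basket sampling below reproducible, matching random_state=42 used elsewhere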

print("--- Starting Self-Evaluation with Hold-out Set ---")

if 'full_data' in locals() and not full_data.empty:
    baskets = full_data.groupby('ORDER_ID')['specific_item_name'].apply(list).tolist()
    baskets = [b for b in baskets if len(b) > 1]
    print(f"Found {len(baskets)} orders with more than one item.")

    train_baskets, validation_baskets = train_test_split(baskets, test_size=0.2, random_state=42)
    print(f"Split into {len(train_baskets)} training baskets and {len(validation_baskets)} validation baskets.")

    # --- FIX 1: Sample the TRAINING baskets to prevent memory overload ---
    if len(train_baskets) > 100000:
        print("\nTraining set is large. Sampling 100,000 baskets for a faster, memory-safe model build.")
        train_baskets_sample = random.sample(train_baskets, 100000)
    else:
        train_baskets_sample = train_baskets

    # --- FIX 2: Sample the VALIDATION baskets for a quick evaluation ---
    if len(validation_baskets) > 20000:
        print("Validation set is large. Sampling 20,000 baskets for evaluation.")
        validation_baskets_sample = random.sample(validation_baskets, 20000)
    else:
        validation_baskets_sample = validation_baskets

    print("\nTraining a new model instance on the training sample...")
    validation_recsys = WingsRUsRecommendationSystem()

    # Use the sampled training baskets to create the DataFrame
    train_df_for_validation = pd.DataFrame({
        'ORDER_ID': [i for i, basket in enumerate(train_baskets_sample) for _ in basket],
        'specific_item_name': [item for basket in train_baskets_sample for item in basket]
    })
    train_df_for_validation['CUSTOMER_ID'] = 0
    train_df_for_validation['item_profile'] = train_df_for_validation['specific_item_name']

    validation_recsys._train_popularity_model(train_df_for_validation)
    validation_recsys._train_market_basket_model(train_df_for_validation)
    validation_recsys._train_content_model(train_df_for_validation)

    print("\n--- Evaluating model on the validation sample ---")
    hits = 0
    # Use the sampled validation baskets for the loop
    for basket in validation_baskets_sample:
        cart = basket[:-1]
        ground_truth = basket[-1]
        recommendations = validation_recsys.recommend(customer_id=0, cart_items=cart, n=3)
        if ground_truth in recommendations:
            hits += 1

    recall_at_3 = hits / len(validation_baskets_sample) if len(validation_baskets_sample) > 0 else 0

    print("\n--- Evaluation Complete ---")
    print(f"Total Validation Orders Evaluated: {len(validation_baskets_sample)}")
    print(f"Correct Recommendations (Hits): {hits}")
    print(f"Estimated Recall@3 Score: {recall_at_3:.2%}")

else:
    print("❌ Error: `full_data` DataFrame not found or is empty. Please run the preprocessing steps first.")

--- Starting Self-Evaluation with Hold-out Set ---
Found 600341 orders with more than one item.
Split into 480272 training baskets and 120069 validation baskets.

Training set is large. Sampling 100,000 baskets for a faster, memory-safe model build.
Validation set is large. Sampling 20,000 baskets for evaluation.

Training a new model instance on the training sample...
Using device: cuda
Training Popularity Model...
Training Market Basket Model...
Training Content-Based Model...

--- Evaluating model on the validation sample ---

--- Evaluation Complete ---
Total Validation Orders Evaluated: 20000
Correct Recommendations (Hits): 4726
Estimated Recall@3 Score: 23.63%
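
Since each validation basket holds out exactly one item (the last one), Recall@3 here is the same as a hit rate: 4,726 hits over 20,000 baskets gives 23.63%. Note that this figure covers only the popularity, market basket, and content models, because the validation instance does not train FM or NCF. The leave-last-item-out protocol can be sketched as a small helper; `hit_rate_at_k`, `toy_recommend`, and the toy baskets below are hypothetical names and data used only for illustration.

```python
def hit_rate_at_k(baskets, recommend, k=3):
    """Leave-last-item-out hit rate.

    For each basket, the final item is held out, the remaining items form
    the cart, and a hit is counted when the held-out item appears in the
    top-k recommendations for that cart.
    """
    evaluated = 0
    hits = 0
    for basket in baskets:
        if len(basket) < 2:
            continue  # need at least one cart item plus one held-out item
        cart, held_out = basket[:-1], basket[-1]
        evaluated += 1
        if held_out in recommend(cart, k):
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Hypothetical stand-in recommender: a fixed popularity list minus the cart
def toy_recommend(cart, k):
    popular = ['ranch dip', 'fries', 'soda']
    return [item for item in popular if item not in cart][:k]


toy_baskets = [
    ['wings', 'ranch dip'],        # held-out 'ranch dip' is recommended: hit
    ['wings', 'fries', 'cookie'],  # held-out 'cookie' is never recommended: miss
    ['fries', 'soda'],             # held-out 'soda' is recommended: hit
    ['wings'],                     # single-item basket: skipped
]

score = hit_rate_at_k(toy_baskets, toy_recommend, k=3)
print(f"Hit rate@3: {score:.2%}")
```

Passing the recommender in as a function keeps the metric independent of any particular model, so the same helper could in principle score the popularity fallback alone as a baseline for comparison.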
