# MLUL2 Group Assignment — “Did You Forget?” Recommendation System
   
**Team Members**:  
- Akanksha Parab (12420067)  
- Deepankar Garg (12420012)
- Krishnan Chathadi S (12420068)
- Shama Shilpi (12420033)

**Objective**:  
Design a recommendation system that predicts items customers may have forgotten to order, using past purchase behavior and partial last-order information.


# Libraries and functions

## Libraries

If there are any library-related issues when running this notebook in Colab, follow these steps:
* Uncomment and run `!pip install "numpy<2.0"`.
* Uncomment and run `!pip install scikit-surprise --no-binary :all:`.
* Comment both lines out again and restart the session.
* Run the rest of the notebook.

In [1]:
#!pip install "numpy<2.0"

In [2]:
#!pip install scikit-surprise --no-binary :all:

In [3]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Libraries for plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Libraries for handling pairs of data, dictionaries, datetime
import itertools
from itertools import combinations
from collections import defaultdict, Counter
from datetime import datetime

from sklearn.preprocessing import MinMaxScaler  # Scaling the data
from scipy.stats import norm # Normal distribution

# Matrix factorization
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Functions

In [4]:
def assign_member_cluster(row):
    """Assigns a member cluster based on diversity score and number of orders."""

    if row['Diversity'] <= 0.5:
        return 'Repetitive'
    elif row['Diversity'] >= 0.75 and row['Num_orders'] >= 3:
        return 'Explorer'
    elif row['Num_orders'] <= 2:
        return 'NewUser'
    elif row['Num_orders'] >= 6:
        return 'FrequentBuyer'
    else:
        return 'Normal'

In [5]:
def assign_season(month):
    """Assigns a season based on the month."""
    if month in [3, 4, 5]:
        return "Summer"
    elif month in [6, 7, 8, 9, 10]:
        return "Monsoon"
    else:
        return "Winter"

In [6]:
# Function to split SKUs for a single order
def split_order_skus(sku_list, frac=0.3):
    """Splits a list of SKUs into observed and withheld sets."""

    skus = list(sku_list)  # Ensure it's a list
    np.random.shuffle(skus)  # Shuffle the SKUs
    split_point = max(1, int(len(skus) * frac))  # Ensure at least 1 SKU in observed

    observed = skus[:split_point]
    withheld = skus[split_point:]
    if len(withheld) > 5:
        withheld = withheld[:5]
    return observed, withheld
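
Because `split_order_skus` shuffles with NumPy's global random state, the observed/withheld split changes between runs unless a seed is set. The following is a minimal sketch of a reproducible variant, assuming a local random generator is acceptable. The name `split_with_seed` and the parameters `max_withheld` and `seed` are hypothetical and not part of the notebook.

```python
import numpy as np

def split_with_seed(sku_list, frac=0.3, max_withheld=5, seed=42):
    """Same split logic as split_order_skus, but with a local, seeded generator."""
    rng = np.random.default_rng(seed)   # local generator, leaves global state untouched
    skus = list(sku_list)
    rng.shuffle(skus)                   # in-place shuffle of the list
    split_point = max(1, int(len(skus) * frac))  # at least 1 observed SKU
    observed = skus[:split_point]
    withheld = skus[split_point:][:max_withheld]  # cap the withheld set
    return observed, withheld

# Toy SKU IDs (illustrative, not from the dataset)
observed, withheld = split_with_seed(range(100, 110))
print(len(observed), len(withheld))
```

With 10 SKUs and `frac=0.3`, three SKUs are observed and the withheld set is capped at five, so two SKUs end up in neither set.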

In [7]:
def compute_recall_at_5(df_predictions: pd.DataFrame, df_withheld: pd.DataFrame) -> float:
    """Computes Recall@5 for predictions against withheld items."""

    # Convert to sets grouped by Order
    pred_dict = df_predictions.groupby("Order")["SKU"].apply(set).to_dict()
    true_dict = df_withheld.groupby("Order")["SKU"].apply(set).to_dict()

    recall_scores = []

    # Compute the recall values
    for order_id, true_skus in true_dict.items():
        pred_skus = pred_dict.get(order_id, set())
        if not true_skus:
            continue  # Skip if no ground truth
        recall = len(true_skus & pred_skus) / len(true_skus)
        recall_scores.append(recall)

    # Average across all orders
    return sum(recall_scores) / len(recall_scores) if recall_scores else 0.0
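
To make the metric concrete, here is a small worked example on made-up order and SKU IDs. It re-implements the same per-order recall logic in a compact helper so that it runs standalone. The name `recall_per_order` is hypothetical.

```python
import pandas as pd

def recall_per_order(df_predictions, df_withheld):
    """Per-order recall: |withheld & predicted| / |withheld|."""
    pred = df_predictions.groupby("Order")["SKU"].apply(set).to_dict()
    true = df_withheld.groupby("Order")["SKU"].apply(set).to_dict()
    return {o: len(t & pred.get(o, set())) / len(t) for o, t in true.items()}

# Toy data (illustrative IDs, not from the dataset)
toy_preds = pd.DataFrame({"Order": [1, 1, 1, 2, 2], "SKU": [10, 11, 12, 20, 21]})
toy_truth = pd.DataFrame({"Order": [1, 1, 2, 3], "SKU": [10, 99, 22, 30]})

per_order = recall_per_order(toy_preds, toy_truth)
mean_recall = sum(per_order.values()) / len(per_order)
print(per_order)               # order 1 recovers 1 of 2 withheld SKUs
print(round(mean_recall, 4))
```

Order 3 has withheld items but no predictions, so it contributes a recall of 0 to the average. Note also that the function does not truncate predictions to five per order; it relies on the caller passing at most five SKUs per order.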

In [8]:
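def run_recall_evaluation(combined_preds, df_withheld, strategy_mode=False):
    """
    Computes Recall@5 per cluster and overall.
    combined_preds: dict mapping cluster name to its predictions DataFrame.
    With strategy_mode=True, also scores each entry of the global lists
    pred_labels and df_preds (labels formatted as '<cluster>_<strategy>').
    """
    results = []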

    # Cluster-Level Evaluation
    for cluster_name, df_pred in combined_preds.items():
        df_truth = df_withheld[df_withheld['Member'].isin(df_pred['Member'].unique())]
        recall = compute_recall_at_5(df_pred, df_truth)

        results.append({
            'Level': 'Cluster',
            'Segment': cluster_name,
            'Recall@5': round(recall, 4)
        })

    # Overall Evaluation
    full_df = pd.concat(combined_preds.values(), ignore_index=True)
    full_truth = df_withheld[df_withheld['Member'].isin(full_df['Member'].unique())]

    overall_recall = compute_recall_at_5(full_df, full_truth)
    results.append({
        'Level': 'Overall',
        'Segment': 'All',
        'Recall@5': round(overall_recall, 4)
    })

    # Strategy-Level Evaluation
    if strategy_mode:
        for label, df_pred in zip(pred_labels, df_preds):  # relies on the globals pred_labels and df_preds being defined
            cluster = label.split("_")[0]
            df_truth = df_withheld[df_withheld['Member'].isin(df_pred['Member'].unique())]
            recall = compute_recall_at_5(df_pred, df_truth)

            results.append({
                'Level': 'Strategy',
                'Segment': cluster,
                'Strategy': label.split("_")[1],
                'Recall@5': round(recall, 4)
            })

    return pd.DataFrame(results)


In [9]:
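def position_of_truth(df_pred, df_truth):
    """Returns the predicted rank of every withheld SKU that appears in the predictions."""
    # Rank predictions within each (Member, Order) by descending Combined_Score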
    merged = df_pred.copy()
    merged['Predicted_Rank'] = (
        merged
        .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
        .groupby(['Member', 'Order'])
        .cumcount() + 1  # 1-based indexing
    )

    # Label if SKU is a forgotten item (i.e., in withheld truth)
    merged['Is_Hit'] = merged.set_index(['Member', 'Order', 'SKU']).index.isin(
        df_truth.set_index(['Member', 'Order', 'SKU']).index
    )

    # Filter only hits and capture their rank
    hits_df = merged[merged['Is_Hit']]
    return hits_df[['Member', 'Order', 'SKU', 'Predicted_Rank']]

In [10]:
def build_jaccard_matrix(orders_df):
    """
    Takes a DataFrame with columns ['Order', 'SKU'] and returns a nested dict:
    jaccard_matrix[sku1][sku2] = Jaccard similarity between sku1 and sku2
    """
    # Step 1: Build item → set of orders
    sku_to_orders = defaultdict(set)

    for row in orders_df.itertuples(index=False):
        sku_to_orders[row.SKU].add(row.Order)

    all_skus = list(sku_to_orders.keys())
    jaccard_matrix = defaultdict(dict)

    # Step 2: Loop through combinations and compute Jaccard
    for sku1, sku2 in itertools.combinations(all_skus, 2):
        orders1 = sku_to_orders[sku1]
        orders2 = sku_to_orders[sku2]
        intersection = len(orders1 & orders2)
        union = len(orders1 | orders2)

        if union > 0:
            score = intersection / union
            if score > 0:
                jaccard_matrix[sku1][sku2] = score
                jaccard_matrix[sku2][sku1] = score  # symmetry

    return jaccard_matrix

# 1. Data loading and exploration

## Load the datasets

In [11]:
# Load the datasets
all_except_last_orders_df = pd.read_csv('all_except_last_orders.csv')
last_orders_subset_df = pd.read_csv('last_orders_subset.csv')

In [12]:
# Check the data
all_except_last_orders_df

Unnamed: 0,Order,SKU,Member,Delivery Date,Name
0,8358896,15668375,SSCEHNS,02/11/13,Root Vegetables
1,8358896,15668467,SSCEHNS,02/11/13,Beans
2,8358896,15669863,SSCEHNS,02/11/13,Moong Dal
3,8358896,15669778,SSCEHNS,02/11/13,Other Dals
4,8358896,15669767,SSCEHNS,02/11/13,Urad Dal
...,...,...,...,...,...
28979,7466404,15669886,SWRNHCS,01/09/13,Sooji & Rava
28980,7466404,15669874,SWRNHCS,01/09/13,Avalakki / Poha
28981,7466404,15670260,SWRNHCS,01/09/13,Organic F&V
28982,7466404,15670196,SWRNHCS,01/09/13,Organic F&V


In [13]:
# Check the data
last_orders_subset_df

Unnamed: 0,Order,SKU,Member,Delivery Date,Name
0,7409204,15669778,SWLCNOE,05/09/13,Other Dals
1,8076206,15669977,SWOEZES,01/04/14,Almonds
2,7560723,7593949,SSWWRHW,30/06/13,Cream Biscuits
3,8362837,15669764,SWLSCOZ,06/11/13,Besan
4,8202458,15670196,SSRCRSO,03/02/14,Organic F&V
...,...,...,...,...,...
5482,8269882,15668469,SWNHZNW,05/01/14,Beans
5483,8384422,15669875,SSWNRHC,18/11/13,Toor Dal
5484,7493590,15668465,SWRELHW,07/08/13,Root Vegetables
5485,8080319,15670267,SSNSECH,03/04/14,Toor Dal


## Check info, nulls and duplicates

In [14]:
all_except_last_orders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28984 entries, 0 to 28983
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Order          28984 non-null  int64 
 1   SKU            28984 non-null  int64 
 2   Member         28984 non-null  object
 3   Delivery Date  28984 non-null  object
 4   Name           28984 non-null  object
dtypes: int64(2), object(3)
memory usage: 1.1+ MB


In [15]:
all_except_last_orders_df.duplicated().sum()

0

In [16]:
last_orders_subset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5487 entries, 0 to 5486
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Order          5487 non-null   int64 
 1   SKU            5487 non-null   int64 
 2   Member         5487 non-null   object
 3   Delivery Date  5487 non-null   object
 4   Name           5487 non-null   object
dtypes: int64(2), object(3)
memory usage: 214.5+ KB


In [17]:
last_orders_subset_df.duplicated().sum()

0

### Summary
Rows and columns in each file:
- **all_except_last_orders**: 28984 rows, 5 columns, no nulls or duplicates.
- **last_orders_subset**: 5487 rows, 5 columns, no nulls or duplicates.


## Get counts of unique members, SKUs and orders.

In [18]:
print('In all_except_last_orders: \n')
print('Number of unique members:', len(all_except_last_orders_df['Member'].unique()))
print('Number of unique SKUs:', len(all_except_last_orders_df['SKU'].unique()))
print('Number of unique orders:', len(all_except_last_orders_df['Order'].unique()))

In all_except_last_orders: 

Number of unique members: 638
Number of unique SKUs: 632
Number of unique orders: 2595


In [19]:
print('In last_orders_subset: \n')
print('Number of unique members:', len(last_orders_subset_df['Member'].unique()))
print('Number of unique SKUs:', len(last_orders_subset_df['SKU'].unique()))
print('Number of unique orders:', len(last_orders_subset_df['Order'].unique()))

In last_orders_subset: 

Number of unique members: 638
Number of unique SKUs: 565
Number of unique orders: 638


## Exploratory Data Analysis
Questions to Explore:
- How do the SKUs and names map to each other?
- SKU popularity.
- Which SKUs are most frequently bought together?
- Customer behaviour insights.


### How do the SKUs and names map to each other?

In [20]:
# Check the mapping between SKU and name to get an idea on the product
sku_name_map = all_except_last_orders_df.set_index("SKU")["Name"].to_dict()

In [21]:
# Are there SKUs with more than one name?
sku_name_counts = all_except_last_orders_df.groupby('SKU')['Name'].nunique()
multi_name_skus = sku_name_counts[sku_name_counts > 1]
len(multi_name_skus)

0

In [22]:
# Are there names which map to multiple SKUs?
name_sku_counts = all_except_last_orders_df.groupby('Name')['SKU'].nunique()
ambiguous_names = name_sku_counts[name_sku_counts > 1]
len(ambiguous_names)

91

#### SKU-Name mapping summary
- A particular product name can map to multiple SKUs. This happens because different sizes or flavours of the same product are tagged as different SKUs.
- None of the SKUs map to multiple products.
- The recommendations would be based on SKUs.
- The name column would only be useful for labelling and plotting.

### SKU Popularity

In [23]:
# SKU frequency - to understand SKU popularity
sku_counts = all_except_last_orders_df['SKU'].value_counts()
sku_counts.head()

SKU
15668381    530
15668688    482
15668460    479
15668379    479
15669878    445
Name: count, dtype: int64

### Which SKUs are frequently bought together?

In [24]:
# Initialize the global jaccard matrix. Not printing the output, since it could crash the runtime.
global_jaccard_matrix = build_jaccard_matrix(all_except_last_orders_df)
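
The Jaccard similarity between two SKUs is the number of orders containing both, divided by the number of orders containing either. The following is a minimal sketch on toy orders with made-up SKU labels, mirroring the set logic of `build_jaccard_matrix` without depending on it.

```python
from collections import defaultdict
from itertools import combinations

# Toy orders (illustrative): order ID -> SKUs in that order
toy_orders = {1: {"A", "B"}, 2: {"A", "B", "C"}, 3: {"A", "C"}, 4: {"B"}}

# Invert to SKU -> set of orders containing it
sku_to_orders = defaultdict(set)
for order_id, skus in toy_orders.items():
    for sku in skus:
        sku_to_orders[sku].add(order_id)

# Jaccard = |orders with both| / |orders with either|
toy_jaccard = {}
for s1, s2 in combinations(sorted(sku_to_orders), 2):
    inter = len(sku_to_orders[s1] & sku_to_orders[s2])
    union = len(sku_to_orders[s1] | sku_to_orders[s2])
    toy_jaccard[(s1, s2)] = inter / union

# SKUs most often bought together with "A", strongest first
neighbours_of_a = sorted(
    ((pair[1], score) for pair, score in toy_jaccard.items() if pair[0] == "A"),
    key=lambda x: x[1],
    reverse=True,
)
print(neighbours_of_a)
```

Here "A" and "C" share two of the three orders that contain either of them, so "C" ranks above "B" as a companion of "A".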

### Customer behaviour insights

In [25]:
# Distribution of number of orders per member: Useful to filter active users
all_except_last_orders_df.groupby('Member')['Order'].nunique().describe()

count    638.000000
mean       4.067398
std        2.958133
min        1.000000
25%        2.000000
50%        3.000000
75%        6.000000
max       16.000000
Name: Order, dtype: float64

In [26]:
# Cart size (SKUs per order)
all_except_last_orders_df.groupby('Order')['SKU'].count().describe()

count    2595.000000
mean       11.169171
std         3.375311
min         8.000000
25%         9.000000
50%        10.000000
75%        13.000000
max        31.000000
Name: SKU, dtype: float64

# 2. Feature Engineering

In [27]:
# Create copies of dataset before modifications
df_all = all_except_last_orders_df.copy()
df_last = last_orders_subset_df.copy()

In [28]:
# Convert delivery date to datetime format
df_all['Delivery Date'] = pd.to_datetime(df_all['Delivery Date'], format='%d/%m/%y')
df_last['Delivery Date'] = pd.to_datetime(df_last['Delivery Date'], format='%d/%m/%y')

In [29]:
df_all["Month"] = df_all["Delivery Date"].dt.month

### Create Member Stats Dataframe to store key data

In [30]:
# Create Member_Stats with number of unique SKUs and number of unique orders from df_all
member_stats_df = df_all.groupby('Member').agg(
    Unique_SKUs=('SKU', 'nunique'),
    Total_SKUs=('SKU', 'count'),
    Num_orders=('Order', 'nunique')
).reset_index()

In [31]:
member_stats_df['Diversity'] = member_stats_df['Unique_SKUs'] / member_stats_df['Total_SKUs']

In [32]:
member_stats_df.describe()

Unnamed: 0,Unique_SKUs,Total_SKUs,Num_orders,Diversity
count,638.0,638.0,638.0,638.0
mean,28.827586,45.429467,4.067398,0.772037
std,17.04123,39.761462,2.958133,0.19205
min,8.0,8.0,1.0,0.230548
25%,16.0,18.0,2.0,0.636693
50%,25.5,32.0,3.0,0.802174
75%,38.0,62.0,6.0,0.944444
max,110.0,347.0,16.0,1.0


In [33]:
member_stats_df['Member_Cluster'] = member_stats_df.apply(assign_member_cluster, axis=1)
member_stats_df.groupby('Member_Cluster').size()

Member_Cluster
Explorer         134
FrequentBuyer     97
NewUser          249
Normal            91
Repetitive        67
dtype: int64

### Create Member-Order dataframe to analyze SKU frequency

In [34]:
# Group by member and order, aggregate SKUs into lists
member_order_df = (
    df_all
    .groupby(['Member', 'Order', 'Delivery Date'])['SKU']
    .apply(list)
    .reset_index()
    .sort_values(by=['Member', 'Delivery Date'])
)

In [35]:
# Explode SKU list to get one SKU per row
sku_order_df = member_order_df.explode('SKU')

# Sort for time-differencing
sku_order_df['Date'] = pd.to_datetime(sku_order_df['Delivery Date'], format='%d/%m/%y')
sku_order_df = sku_order_df.sort_values(by=['Member', 'SKU', 'Date'])

# Compute days between repeated purchases of same SKU by the same member
sku_order_df['Days_Since_Last_Purchase'] = (
    sku_order_df.groupby(['Member', 'SKU'])['Date']
    .diff().dt.days
)

In [36]:
# Create a table for SKU periodicity
sku_periodicity = (
    sku_order_df[sku_order_df['Days_Since_Last_Purchase'].notnull()]
    .groupby(['Member', 'SKU'])['Days_Since_Last_Purchase']
    .agg(['mean', 'std'])
    .reset_index()
    .rename(columns={'mean': 'Typical_Gap', 'std': 'Gap_STD'})
)

In [37]:
sku_periodicity.describe()

Unnamed: 0,SKU,Typical_Gap,Gap_STD
count,5468.0,5468.0,2410.0
mean,16234410.0,81.130739,44.372903
std,10344370.0,79.383242,50.420225
min,7541573.0,3.0,0.0
25%,15668460.0,32.0,13.435029
50%,15669770.0,55.0,27.577164
75%,15669860.0,99.541667,55.921586
max,93289490.0,692.0,396.686904


In [38]:
# Apply a Z-score based scoring
sku_scored_df = sku_order_df.merge(sku_periodicity, on=['Member', 'SKU'], how='left')

sku_scored_df['Overdue_Z'] = (
    (sku_scored_df['Days_Since_Last_Purchase'] - sku_scored_df['Typical_Gap']) /
    sku_scored_df['Gap_STD']
) # Leave NaNs as-is — avoid fillna(0) to prevent false positives

# Multiply the upper-tail probability by 2, so that the score lies between 0 and 1
sku_scored_df['Periodicity_Score'] = 2 * (1 - norm.cdf(np.abs(sku_scored_df['Overdue_Z'])))
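
As a worked example of the scoring formula above, the sketch below applies it to a few illustrative gaps for a single member and SKU. The values for `typical_gap`, `gap_std` and `observed_gaps` are made up.

```python
import numpy as np
from scipy.stats import norm

typical_gap, gap_std = 30.0, 10.0                   # illustrative values, in days
observed_gaps = np.array([30.0, 40.0, 50.0, 10.0])  # illustrative gaps, in days

# Z-score of each gap relative to the typical gap
z = (observed_gaps - typical_gap) / gap_std

# Two-sided tail probability: 1 at the typical gap, falling towards 0 with distance
score = 2 * (1 - norm.cdf(np.abs(z)))
print(np.round(score, 4))
```

The score is symmetric: a gap that is 20 days shorter than usual is penalized exactly as much as one that is 20 days longer. Member and SKU pairs with only one recorded gap have no standard deviation, so their score stays NaN.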

### Seasonal Trends

In [39]:
df_all["Season_Group"] = df_all["Delivery Date"].dt.month.map(assign_season)

In [40]:
# Count purchases per SKU by season group (instead of by month)
sku_month_counts = (
    df_all
    .groupby(["SKU", "Season_Group"])
    .size()
    .reset_index(name="Count")
)

sku_month_dist = sku_month_counts.pivot_table(
    index="SKU",
    columns="Season_Group",
    values="Count",
    fill_value=0
)

# Normalize row-wise to get distribution
sku_month_dist = sku_month_dist.div(sku_month_dist.sum(axis=1), axis=0)
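
The row-wise normalization turns raw counts into each SKU's share of purchases per season. The following is a minimal sketch on toy data, assuming `unstack` as a stand-in for the pivot step; the SKU IDs are made up.

```python
import pandas as pd

# Toy purchases (illustrative SKU IDs)
toy = pd.DataFrame({
    "SKU": [1, 1, 1, 2, 2],
    "Season_Group": ["Summer", "Summer", "Winter", "Monsoon", "Monsoon"],
})

# Count purchases per SKU and season, with seasons as columns
counts = toy.groupby(["SKU", "Season_Group"]).size().unstack(fill_value=0)

# Divide each row by its total so every SKU's shares sum to 1
dist = counts.div(counts.sum(axis=1), axis=0)
print(dist)
```

One caveat worth keeping in mind: an SKU with very few purchases gets an extreme share (SKU 2 scores 1.0 for Monsoon from only two purchases), so ranking on these shares alone can favour rarely bought items.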

### Create dictionaries for Jaccard scores in member carts

In [41]:
# Co-occurrence counts and marginal SKU counts
member_co_counts = defaultdict(lambda: defaultdict(int))
member_sku_counts = defaultdict(lambda: defaultdict(int))

# Loop over each order
for _, row in member_order_df.iterrows():
    member = row['Member']
    sku_list = set(row['SKU'])  # remove duplicates

    # Update marginal counts
    for sku in sku_list:
        member_sku_counts[member][sku] += 1

    # Update co-occurrence counts
    for sku_i, sku_j in combinations(sku_list, 2):
        member_co_counts[member][(sku_i, sku_j)] += 1
        member_co_counts[member][(sku_j, sku_i)] += 1  # symmetric

In [42]:
# Create the dictionary of SKU co-occurrence for different members
member_jaccard = defaultdict(dict)

# Store member jaccard scores
for member in member_co_counts:
    for (sku_i, sku_j), co_count in member_co_counts[member].items():
        count_i = member_sku_counts[member][sku_i]
        count_j = member_sku_counts[member][sku_j]
        union = count_i + count_j - co_count

        if union > 0:
            jaccard_score = co_count / union
            member_jaccard[member][(sku_i, sku_j)] = jaccard_score

# 3. Recommendation Strategy

### Approach Considered
- Candidate Generation: Based on historical SKUs
  * The goal is to identify items a user has purchased in the past but did not include in their latest order.
  * The candidate list for 'Did you forget?' is: (all SKUs in the member's past orders) minus (SKUs already in the member's current cart).
  * Create a fallback pool for orders with fewer than 5 candidates. This could be a list of the most popular SKUs (overall or recently popular).
- Ranking the candidates to generate the top 5.
  * Start with a simple frequency-based ranking.
  * Consider other factors such as co-occurrence, periodicity and seasonality.
  * Create a weighted ranking score based on all the parameters.
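
The candidate generation and fallback steps above can be sketched as follows. This is an illustration under simplifying assumptions (frequency-only ranking, a fixed fallback list), not the notebook's final pipeline. The function name `forgotten_candidates` and the item labels are hypothetical.

```python
from collections import Counter

def forgotten_candidates(history, cart, fallback, n=5):
    """Rank past SKUs not in the cart by purchase frequency; pad from a fallback pool."""
    freq = Counter(history)
    # Past purchases, most frequent first, excluding what is already in the cart
    ranked = [sku for sku, _ in freq.most_common() if sku not in cart]
    # Pad with fallback SKUs until n candidates are available
    for sku in fallback:
        if len(ranked) >= n:
            break
        if sku not in cart and sku not in ranked:
            ranked.append(sku)
    return ranked[:n]

# Toy data (illustrative labels, not real SKUs)
history = ["milk", "milk", "rice", "dal", "rice", "milk"]
cart = {"rice"}
fallback = ["onion", "tomato", "milk", "potato", "banana"]

recs = forgotten_candidates(history, cart, fallback)
print(recs)
```

The member's own history fills the first slots, and the fallback pool only supplies what is missing, skipping anything already in the cart or already recommended.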

### Candidate generation

In [43]:
# Create a dictionary with a list of SKUs for each member, sorted by frequency and made unique
def get_member_sku_dict(df):
    """Returns a dictionary of member to a list of unique SKUs sorted by frequency."""

    member_sku_dict = {}
    grouped = df.groupby('Member')['SKU'].apply(list)

    for member, skus in grouped.items():
        # Count SKU frequencies for the current member
        sku_counts = Counter(skus)

        # Sort SKUs by frequency in descending order, then get unique SKUs in that order
        sorted_unique_skus = []
        for sku in sorted(skus, key=lambda sku: sku_counts[sku], reverse=True):
            if sku not in sorted_unique_skus:
                sorted_unique_skus.append(sku)
        member_sku_dict[member] = sorted_unique_skus
    return member_sku_dict

### Fallback pool

In [44]:
# For fallback pool - If there are fewer than 5 recommendations for an order, include additional SKUs from these.
def get_top_skus(df, top_n=5):
    """Returns the top N most frequent SKUs in the DataFrame."""

    # Count SKU occurrences in the given DataFrame and keep the top_n most frequent
    top_skus = df['SKU'].value_counts().head(top_n).index.tolist()
    return top_skus

### Member Jaccard

In [199]:
def jaccard_cart_reranker(last_orders_df, test_mode=False, k=12):
    """
    Reranks recommendations based on member-specific Jaccard similarity.
    Input: Last orders in Dataframe format.
    Output: Recommendations in Dataframe format.
    """

    recommendations = []
    df = df_input if test_mode else df_all
    mem_sku_all = get_member_sku_dict(df)

    # Group input DataFrame by Order and Member
    grouped_orders = last_orders_df.groupby(['Order', 'Member'])['SKU'].apply(set).to_dict()

    for (order_id, member_id), current_skus_set in grouped_orders.items():
        score_dict = defaultdict(float)

        for (sku_a, sku_b), jscore in member_jaccard.get(member_id, {}).items():
            # If sku_a is in current cart and sku_b is not, we consider sku_b as a candidate
            if sku_a in current_skus_set and sku_b not in current_skus_set:
                if sku_b in mem_sku_all.get(member_id, []):
                    score_dict[sku_b] += jscore

        sorted_candidates = sorted(score_dict.items(), key=lambda x: -x[1])[:k]
        for sku, score in sorted_candidates:
            recommendations.append({'Order': order_id, 'Member': member_id, 'SKU': sku, 'Member_Jaccard_Score': score})

    return pd.DataFrame(recommendations)

### Global Jaccard

In [200]:
def jaccard_recommendations(last_orders_df, jaccard_matrix, test_mode=False, k=12):
    """
    Generates recommendations based on global Jaccard similarity.
    Input: Last orders in Dataframe format, Jaccard matrix.
    Output: Recommendations in Dataframe format.
    """
    all_recs = []

    # Ensure 'Cart_SKUs' is available. Create it if necessary
    if 'Cart_SKUs' not in last_orders_df.columns:
        last_orders_df = last_orders_df.groupby(['Order', 'Member'])['SKU'].apply(set).reset_index(name='Cart_SKUs')

    for _, row in last_orders_df.iterrows():
        member_id = row['Member']
        order_id = row['Order']
        cart = row['Cart_SKUs']

        candidate_scores = defaultdict(float)
        for sku in cart:
            neighbors = jaccard_matrix.get(sku, {})
            for neighbor, score in neighbors.items():
                if neighbor not in cart:
                    candidate_scores[neighbor] += score

        sorted_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)[:k]

        for sku, score in sorted_candidates:
            all_recs.append({'Member': member_id, 'Order': order_id, 'SKU': sku, 'Global_Jaccard_Score': score})

    return pd.DataFrame(all_recs)

### Periodicity ranker

In [47]:
def periodicity_reranker(last_orders_df, test_mode=False, k=8):
    """
    Reranks recommendations based on SKU purchase periodicity.
    Input: Last orders in Dataframe format.
    Output: Recommendations in Dataframe format.
    """

    recommendations = []
    df = df_input if test_mode else df_all
    mem_sku_all = get_member_sku_dict(df)

    # Group input DataFrame by Order and Member
    grouped_orders = last_orders_df.groupby(['Order', 'Member'])['SKU'].apply(set).to_dict()

    for (order_id, member_id), current_skus_set in grouped_orders.items():
        past_skus = mem_sku_all[member_id]
        candidate_skus = [sku for sku in past_skus if sku not in current_skus_set]

        # Keep the k candidate rows with the highest Periodicity_Score for this member
        candidate_scores = sku_scored_df[
            (sku_scored_df['Member'] == member_id) &
            (sku_scored_df['SKU'].isin(candidate_skus)) &
            (sku_scored_df['Periodicity_Score'].notnull())
        ].sort_values(by='Periodicity_Score', ascending=False).head(k)

        # Append all candidates with their scores
        if (len(candidate_scores) == 0):
            recommendations.append({'Order': order_id, 'Member': member_id, 'SKU': past_skus[0], 'Periodicity_Score': 0.0})
        else:
            for index, row in candidate_scores.iterrows():
                recommendations.append({'Order': order_id, 'Member': member_id, 'SKU': row['SKU'], 'Periodicity_Score': row['Periodicity_Score']})

    return pd.DataFrame(recommendations)

### History based recommendations

In [360]:
def history_based_recommendations(last_orders_df, test_mode=False, k=15):
    """
    Generates recommendations based on member's purchase history frequency.
    Input: Last orders in Dataframe format.
    Output: Recommendations in Dataframe format.
    """

    recommendations = []
    df = df_input if test_mode else df_all
    mem_sku_all = get_member_sku_dict(df)

    # Group input DataFrame by Order and Member
    grouped_orders = last_orders_df.groupby(['Order', 'Member'])['SKU'].apply(set).to_dict()

    for (order_id, member_id), current_skus_set in grouped_orders.items():
        # Retrieve the member's full purchase history (SKUs ordered by frequency) from the dictionary
        past_skus = mem_sku_all[member_id]

        # Find forgotten items (in history but not in current order), maintaining frequency order
        forgotten_skus = [sku for sku in past_skus if sku not in current_skus_set]

        # Assign a frequency-based score (higher for more frequent)
        sku_freq = Counter(df[df['Member'] == member_id]['SKU'])
        scored_forgotten_skus = [(sku, sku_freq[sku]) for sku in forgotten_skus]

        # Extract raw scores
        #raw_scores = [score for _, score in scored_forgotten_skus]
        #max_score = max(raw_scores) if raw_scores else 0
        #min_score = min(raw_scores) if raw_scores else 0

        # Avoid divide-by-zero
        #if max_score != min_score:
            #normalized_scores = [
                #(sku, (score - min_score) / (max_score - min_score))
                #for sku, score in scored_forgotten_skus
                #]
        #else:
            #normalized_scores = [(sku, 0.0) for sku, _ in scored_forgotten_skus]

        # Sort by frequency in descending order
        sorted_forgotten_skus = sorted(scored_forgotten_skus, key=lambda x: x[1], reverse=True)[:k]

        # Append all forgotten skus with their frequency score
        for sku, score in sorted_forgotten_skus:
            recommendations.append({'Order': order_id, 'Member': member_id, 'SKU': sku, 'Frequency_Score': score})

    out_df = pd.DataFrame(recommendations)
    scaler = MinMaxScaler()
    out_df["Frequency_Score_Scaled"] = scaler.fit_transform(out_df[["Frequency_Score"]])
    return out_df

### Seasonal recommendations

In [49]:
def get_rolling_popular_skus(last_orders_df, days=20, test_mode=False, k=5):
    """
    Generates recommendations based on most popular SKUs in the last x days.
    Input: Last orders in Dataframe format.
    Output: Recommendations in Dataframe format.
    """

    recommendations = []
    df = df_input if test_mode else df_all

    # Ensure both date columns are datetime so they can be compared
    last_orders_df['Delivery Date'] = pd.to_datetime(last_orders_df['Delivery Date'])
    df['Delivery Date'] = pd.to_datetime(df['Delivery Date'])

    for _, row in last_orders_df.iterrows():
        member = row["Member"]
        order = row["Order"]
        target_date = row["Delivery Date"]
        cutoff_date = target_date - pd.Timedelta(days=days)
        recent_orders = df[
            (df['Delivery Date'] > cutoff_date) & 
            (df['Delivery Date'] <= target_date)
        ]

        sku_counts = (
            recent_orders.groupby('SKU')
            .size()
            .reset_index(name='Seasonal_Score')
            .sort_values(by='Seasonal_Score', ascending=False).head(k)
        )

        # Scale so that the score lies between 0 and 1
        max_score = sku_counts['Seasonal_Score'].max()
        min_score = sku_counts['Seasonal_Score'].min()
        if max_score != min_score:
            sku_counts['Seasonal_Score'] = (sku_counts['Seasonal_Score'] - min_score) / (max_score - min_score)
        else:
            sku_counts['Seasonal_Score'] = 0.0

        for row in sku_counts.itertuples(index=False):
            recommendations.append({"Order": order, "Member": member, "SKU": row.SKU, "Seasonal_Score": row.Seasonal_Score})

    return pd.DataFrame(recommendations)

In [50]:
def seasonality_recommendations(last_orders_df, test_mode=False, k=5):
    """
    Generates recommendations based on SKU seasonality.
    Input: Last orders in Dataframe format.
    Output: Recommendations in Dataframe format.
    """

    # Map each order's delivery month to a season group
    last_orders_df["Season_Group"] = pd.to_datetime(last_orders_df["Delivery Date"]).dt.month.map(assign_season)

    # Get the last observed order per member
    df_for_iter = last_orders_df.groupby("Member").last().reset_index()

    recommendations = []

    for _, row in df_for_iter.iterrows():
        member = row["Member"]
        order = row["Order"]
        current_month = row["Season_Group"]

        # Get all SKUs ever seen (can be filtered further later)
        skus = sku_month_dist.index

        # Extract seasonal score for this month
        seasonal_scores = sku_month_dist.get(current_month)
        if seasonal_scores is None:
            seasonal_scores = pd.Series(0, index=sku_month_dist.index)

        # Filter out already-in-cart SKUs
        # Keep SKUs in their original dtype so they match the index of sku_month_dist
        cart_skus = last_orders_df[last_orders_df["Order"] == order]["SKU"].tolist()
        recs = (
            seasonal_scores.drop(labels=cart_skus, errors="ignore")
            .sort_values(ascending=False)
            .head(k)
        )

        for sku, score in recs.items():
            recommendations.append({"Order": order, "Member": member, "SKU": sku, "Month": current_month, "Seasonal_Score": score})

    return pd.DataFrame(recommendations)

### Matrix Factorization - Surprise

In [51]:
def get_MF_recommendations(last_orders_df, k=2):
    """Generate top-k MF-based recommendations per member using Surprise SVD."""

    # 1. Prepare Surprise Dataset
    df_interactions = last_orders_df[['Member', 'SKU']].copy()
    df_interactions['rating'] = 1

    reader = Reader(rating_scale=(0, 1))
    data = Dataset.load_from_df(df_interactions[['Member', 'SKU', 'rating']], reader)
    trainset = data.build_full_trainset()

    # 2. Train SVD Model
    algo = SVD(n_factors=50, biased=False)
    algo.fit(trainset)

    # 3. Get all members and items
    all_members = df_interactions['Member'].unique()
    all_items = df_interactions['SKU'].unique()

    # 4. Avoid already interacted items
    interacted = df_interactions.groupby('Member')['SKU'].apply(set).to_dict()

    # 5. Generate predictions
    recommendations = []

    for member in all_members:
        preds = []
        known_items = interacted.get(member, set())

        for item in all_items:
            if item not in known_items:
                est = algo.predict(member, item).est
                preds.append((item, est))

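        # Skip members with no unseen items left to score
        if not preds:
            continue

        # Scale the predicted scores to [0, 1] per member
        mf_scores = [score for _, score in preds]
        max_score = max(mf_scores)
        min_score = min(mf_scores)
        if max_score != min_score:
            normalized_preds = [(item, (score - min_score) / (max_score - min_score)) for item, score in preds]
        else:
            normalized_preds = [(item, 0.0) for item, _ in preds]

        # Rank on the scaled scores so that MF_score is on the same 0-1 range as the other rankers
        top_k = sorted(normalized_preds, key=lambda x: x[1], reverse=True)[:k]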
        
        for item_id, mf_score in top_k:
            recommendations.append({
                "Member": member,
                "SKU": item_id,
                "MF_score": mf_score
            })

    # Create a lookup for Member → Order
    member_order_map = last_orders_df.set_index('Member')['Order'].to_dict()

    # Add Order info to each recommendation entry
    for rec in recommendations:
        rec['Order'] = member_order_map.get(rec['Member'])

    return pd.DataFrame(recommendations)

# 4. Model Implementation

### Key Steps
- Create dataframes for model building
  * From the all_except_last_orders dataset, remove the last order for each member. This becomes the new input.
  * Create a new dataset, called last_orders, based on the last order for each member in all_except_last_orders dataset.
  * From last_orders, remove a few SKUs for each order. This becomes our new last_orders_subset.
  * The SKUs removed in the previous step serve as the ground truth for testing recall@5.
- Functions are created in the strategy section.
- Generate candidate SKUs, along with their scores, from multiple recommendation functions.
- Rerank the candidates based on a combined score.
- Ensure that the model output covers every order in "last_orders_subset.csv".
- There should be exactly **5 unique SKUs** per order.


### Create dataframes for model building
This section contains only the code that creates the new input, observed, and withheld data from all_except_last_orders.

Create the new input dataframe by removing the last order of each member from all_except_last_orders.

In [52]:
# Group by Member and get the last Order for each Member
last_orders_per_member = df_all.groupby('Member')['Order'].max().reset_index()
last_orders_per_member = last_orders_per_member.rename(columns={'Order': 'Last_Order_ID'})

# Merge with the original DataFrame to flag the last order rows
df_all_with_last_flag = pd.merge(df_all, last_orders_per_member, on='Member', how='left')

# Filter out rows where member has only one order.
order_counts = df_all_with_last_flag.groupby('Member')['Order'].nunique()
members_with_multiple_orders = order_counts[order_counts > 1].index
df_all_with_last_flag = df_all_with_last_flag[df_all_with_last_flag['Member'].isin(members_with_multiple_orders)].copy()

# Create df_input by filtering out the rows that are the last order for that member
df_input = df_all_with_last_flag[df_all_with_last_flag['Order'] != df_all_with_last_flag['Last_Order_ID']].copy()

# Drop the helper column
df_input = df_input.drop(columns=['Last_Order_ID'])

In [53]:
# Check the shape of the new DataFrame
print(f"Shape of df_input: {df_input.shape}")

# Verify that for each member in df_input, their maximum order ID is less than the max order ID in df_all
max_order_input = df_input.groupby('Member')['Order'].max().reset_index()
max_order_all = df_all.groupby('Member')['Order'].max().reset_index()

verification_df = pd.merge(max_order_input, max_order_all, on='Member', suffixes=('_input', '_all'))

# Check if the max order in df_input is less than the max order in df_all for all members
print("Verification of last order removal:")
print((verification_df['Order_input'] < verification_df['Order_all']).all())

# Display the first few rows of df_input
df_input.head()

Shape of df_input: (22232, 7)
Verification of last order removal:
True


     Order       SKU   Member Delivery Date             Name  Month Season_Group
0  8358896  15668375  SSCEHNS    2013-11-02  Root Vegetables     11       Winter
1  8358896  15668467  SSCEHNS    2013-11-02            Beans     11       Winter
2  8358896  15669863  SSCEHNS    2013-11-02        Moong Dal     11       Winter
3  8358896  15669778  SSCEHNS    2013-11-02       Other Dals     11       Winter
4  8358896  15669767  SSCEHNS    2013-11-02         Urad Dal     11       Winter


* Select only the last orders and split the SKUs into observed and withheld.
* The predictions will be made based on observed.
* The recall@5 will be evaluated using withheld.
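
The split itself is done by `split_order_skus`, which is defined elsewhere in the notebook. As a rough illustration only, the sketch below shows one way such a split could work. The helper name `split_skus_sketch`, the `seed` parameter, and the choice to treat `frac` as the observed share (with at least one observed SKU) are assumptions, not the notebook's actual implementation.

```python
import random


def split_skus_sketch(skus, frac=0.3, seed=42):
    """Hypothetical helper: split one order's SKUs into (observed, withheld).

    Keeps roughly `frac` of the unique SKUs as observed (at least one) and
    withholds the rest. This is an assumed behaviour, not split_order_skus.
    """
    unique_skus = sorted(set(skus))   # de-duplicate and fix a stable order
    rng = random.Random(seed)         # local RNG so the split is reproducible
    rng.shuffle(unique_skus)
    n_observed = max(1, round(len(unique_skus) * frac))
    return unique_skus[:n_observed], unique_skus[n_observed:]


# Toy order with ten made-up SKU IDs
observed, withheld = split_skus_sketch(range(101, 111))
print(len(observed), len(withheld))
```

Fixing the seed keeps the observed and withheld sets stable between runs, which makes recall numbers comparable across experiments.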

In [54]:
# Select only the last orders (where Order == Last_Order_ID)
df_last_orders_full = df_all_with_last_flag[df_all_with_last_flag['Order'] == df_all_with_last_flag['Last_Order_ID']].copy()
df_last_orders_full = df_last_orders_full.drop(columns=['Last_Order_ID'])  # Drop the helper column

# Check the shape
print(f"Shape of df_last_orders_full (full last orders): {df_last_orders_full.shape}")

# Apply the splitting function to each order
split_results = df_last_orders_full.groupby(['Order', 'Member', 'Delivery Date'])['SKU'].apply(lambda x: split_order_skus(x, frac=0.3)).reset_index()

Shape of df_last_orders_full (full last orders): (5610, 7)


In [55]:
# Separate the results into two lists of tuples (Order, Member, SKU_list)
observed_tuples = []
withheld_tuples = []

for _, row in split_results.iterrows():
    order_id = row['Order']
    member_id = row['Member']
    observed_skus, withheld_skus = row['SKU']
    delivery_date = row['Delivery Date']
    observed_tuples.extend([(order_id, member_id, sku, delivery_date) for sku in observed_skus])
    withheld_tuples.extend([(order_id, member_id, sku, delivery_date) for sku in withheld_skus])

# Create the two DataFrames
df_observed = pd.DataFrame(observed_tuples, columns=['Order', 'Member', 'SKU', 'Delivery Date'])
df_withheld = pd.DataFrame(withheld_tuples, columns=['Order', 'Member', 'SKU', 'Delivery Date'])

print(f"Shape of df_observed: {df_observed.shape}")
print(f"Shape of df_withheld: {df_withheld.shape}")

Shape of df_observed: (1461, 4)
Shape of df_withheld: (2605, 4)


In [56]:
# Verify the split
# Calculate total original SKUs
total_original_skus = df_last_orders_full.shape[0]

# Calculate total SKUs in split dataframes
total_split_skus = df_observed.shape[0] + df_withheld.shape[0]
print(f"Total original SKUs in last orders: {total_original_skus}")
print(f"Total SKUs in observed and withheld: {total_split_skus}")
print(f"Verification: Total split SKUs match original: {total_original_skus == total_split_skus}")

Total original SKUs in last orders: 5610
Total SKUs in observed and withheld: 4066
Verification: Total split SKUs match original: False
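
The totals differ by 1,544 rows, so the split does not preserve the row count. One possible cause, which is an assumption and not verified here, is that the last orders contain repeated (Order, SKU) rows that the splitting step collapses. A minimal sketch of how this could be checked, using a toy frame in place of `df_last_orders_full`:

```python
import pandas as pd

# Toy stand-in for df_last_orders_full; the real frame is not reproduced here.
toy_last_orders = pd.DataFrame({
    "Order": [1, 1, 1, 2, 2],
    "SKU": [10, 10, 11, 20, 21],
})

# Rows that repeat an (Order, SKU) pair already seen earlier in the frame
n_duplicate_rows = int(toy_last_orders.duplicated(subset=["Order", "SKU"]).sum())

# Row count after collapsing repeats; this is the number to compare with the split total
n_unique_pairs = len(toy_last_orders.drop_duplicates(subset=["Order", "SKU"]))

print(n_duplicate_rows, n_unique_pairs)
```

If the unique-pair count on the real data equals the combined observed and withheld total, duplicates explain the gap; otherwise the split function itself needs a closer look.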


In [57]:
# Check if all orders are present in both (unless an order had only 1 SKU, in which case it's only in observed)
orders_observed = set(df_observed['Order'].unique())
orders_withheld = set(df_withheld['Order'].unique())
orders_full = set(df_last_orders_full['Order'].unique())

print(f"Number of orders in full last orders: {len(orders_full)}")
print(f"Number of orders in df_observed: {len(orders_observed)}")
print(f"Number of orders in df_withheld: {len(orders_withheld)}")  # This will be less if orders had only 1 SKU

# Display head of the new dataframes
print("\nHead of df_observed:")
print(df_observed.head())
print("\nHead of df_withheld:")
print(df_withheld.head())

Number of orders in full last orders: 521
Number of orders in df_observed: 521
Number of orders in df_withheld: 521

Head of df_observed:
     Order   Member       SKU Delivery Date
0  7387496  SSZWOEE  15668381    2013-10-01
1  7387496  SSZWOEE  15668460    2013-10-01
2  7395007  SSNSCNE  15670251    2013-10-05
3  7395007  SSNSCNE  15668379    2013-10-05
4  7408892  SSLSWRE  15669821    2013-09-06

Head of df_withheld:
     Order   Member       SKU Delivery Date
0  7387496  SSZWOEE  15668477    2013-10-01
1  7387496  SSZWOEE  92388167    2013-10-01
2  7387496  SSZWOEE  15668462    2013-10-01
3  7387496  SSZWOEE  15669817    2013-10-01
4  7387496  SSZWOEE  15668378    2013-10-01


### Apply Member Clustering

In [361]:
df_last_observed = pd.merge(df_observed, member_stats_df[['Member', 'Member_Cluster']], on='Member', how='left')

clusters = ['Repetitive', 'Explorer', 'NewUser', 'FrequentBuyer', 'Normal']
df_observed_list = [df_last_observed[df_last_observed['Member_Cluster'] == cluster] for cluster in clusters]

### Get recommendations using df_observed and evaluate recall@5 using df_withheld

In [362]:
strategies = [
    lambda df: jaccard_recommendations(df, global_jaccard_matrix, test_mode=True),
    lambda df: jaccard_cart_reranker(df, test_mode=True),
    lambda df: history_based_recommendations(df, test_mode=True),
    lambda df: periodicity_reranker(df, test_mode=True),
    lambda df: get_rolling_popular_skus(df, test_mode=True),
    lambda df: get_MF_recommendations(df)
]

strategy_names = ['Jaccard', 'MemJaccard', 'History', 'Periodicity', 'Seasonality', 'Surprise']

In [363]:
df_preds = []  # Will contain all 30 recommendation DataFrames (5 clusters x 6 strategies)
pred_labels = []  # To track which cluster and strategy

for i, df_cluster in enumerate(df_observed_list):
    for j, strategy in enumerate(strategies):
        df_pred = strategy(df_cluster)
        df_preds.append(df_pred)
        pred_labels.append(f"{clusters[i]}_{strategy_names[j]}")

In [364]:
df_pred_dict = dict(zip(pred_labels, df_preds))

# Organize predictions per cluster
cluster_preds = defaultdict(list)

for label, df in zip(pred_labels, df_preds):
    cluster = label.split("_")[0]  # 'Explorer', 'Repetitive', etc.
    cluster_preds[cluster].append(df)

In [365]:
for cluster, dfs in cluster_preds.items():
    print(f"Cluster: {cluster}")
    for i, df in enumerate(dfs):
        print(f"  Strategy {i}: shape={df.shape}, columns={df.columns.tolist()}")


Cluster: Repetitive
  Strategy 0: shape=(804, 4), columns=['Member', 'Order', 'SKU', 'Global_Jaccard_Score']
  Strategy 1: shape=(803, 4), columns=['Order', 'Member', 'SKU', 'Member_Jaccard_Score']
  Strategy 2: shape=(1002, 5), columns=['Order', 'Member', 'SKU', 'Frequency_Score', 'Frequency_Score_Scaled']
  Strategy 3: shape=(536, 4), columns=['Order', 'Member', 'SKU', 'Periodicity_Score']
  Strategy 4: shape=(1105, 4), columns=['Order', 'Member', 'SKU', 'Seasonal_Score']
  Strategy 5: shape=(134, 4), columns=['Member', 'SKU', 'MF_score', 'Order']
Cluster: Explorer
  Strategy 0: shape=(1608, 4), columns=['Member', 'Order', 'SKU', 'Global_Jaccard_Score']
  Strategy 1: shape=(946, 4), columns=['Order', 'Member', 'SKU', 'Member_Jaccard_Score']
  Strategy 2: shape=(1987, 5), columns=['Order', 'Member', 'SKU', 'Frequency_Score', 'Frequency_Score_Scaled']
  Strategy 3: shape=(274, 4), columns=['Order', 'Member', 'SKU', 'Periodicity_Score']
  Strategy 4: shape=(1780, 4), columns=['Order', '

In [366]:
combined_preds = {}

for cluster, dfs in cluster_preds.items():
    merged_df = dfs[0]

    for df in dfs[1:]:
        merged_df = pd.merge(merged_df, df, on=['Member','Order','SKU'], how='outer')
        merged_df = merged_df.drop_duplicates(subset=['Member', 'SKU'], keep='first')
        
    combined_preds[cluster] = merged_df

combined_preds['Normal'] = combined_preds['Normal'].fillna(0)
combined_preds['Explorer'] = combined_preds['Explorer'].fillna(0)
combined_preds['Repetitive'] = combined_preds['Repetitive'].fillna(0)
combined_preds['FrequentBuyer'] = combined_preds['FrequentBuyer'].fillna(0)
combined_preds['NewUser'] = combined_preds['NewUser'].fillna(0)
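
The outer merge above keeps every candidate SKU proposed by any strategy and leaves a missing score wherever a strategy did not propose that SKU, which `fillna(0)` then turns into a zero score. The sketch below shows the same pattern on two toy score tables with made-up values; the use of `functools.reduce` is a stylistic alternative, not what the notebook does.

```python
from functools import reduce

import pandas as pd

keys = ["Member", "Order", "SKU"]

# Toy score tables; the values are made up for illustration.
jaccard = pd.DataFrame({"Member": ["A", "A"], "Order": [1, 1], "SKU": [10, 11],
                        "Global_Jaccard_Score": [0.8, 0.4]})
history = pd.DataFrame({"Member": ["A", "A"], "Order": [1, 1], "SKU": [11, 12],
                        "Frequency_Score": [0.9, 0.5]})

# Outer-merge all tables on the shared keys, then treat a missing score as 0
merged = reduce(lambda left, right: pd.merge(left, right, on=keys, how="outer"),
                [jaccard, history])
merged = merged.fillna(0)
print(merged)
```

SKU 11 appears in both tables and ends up with both scores on a single row, while SKUs 10 and 12 each get a zero for the strategy that did not suggest them.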

In [367]:
# Check how good the candidate SKUs are.
# Recall is higher at this stage because all candidates are kept; the top-5 cut has not been applied yet.
recall_df = run_recall_evaluation(combined_preds, df_withheld)
print(recall_df.sort_values(by='Recall@5', ascending=False))

     Level        Segment  Recall@5
0  Cluster     Repetitive    0.8358
4  Cluster         Normal    0.6901
3  Cluster  FrequentBuyer    0.6742
5  Overall            All    0.5923
2  Cluster        NewUser    0.4727
1  Cluster       Explorer    0.4627


In [368]:
weights_rep = {'Jaccard': 0.1, 'MemJaccard' : 0.3, 'History': 0.6, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['Repetitive']['Combined_Score'] = (
    weights_rep['Jaccard'] * combined_preds['Repetitive']['Global_Jaccard_Score'] +
    weights_rep['MemJaccard'] * combined_preds['Repetitive']['Member_Jaccard_Score'] +
    weights_rep['History'] * combined_preds['Repetitive']['Frequency_Score'] +
    weights_rep['Periodicity'] * combined_preds['Repetitive']['Periodicity_Score'] +
    weights_rep['Seasonality'] * combined_preds['Repetitive']['Seasonal_Score'] +
    weights_rep['Surprise'] * combined_preds['Repetitive']['MF_score']
)

combined_preds['Repetitive'] = (
    combined_preds['Repetitive']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [369]:
weights_norm = {'Jaccard': 0.2, 'MemJaccard' : 0.25, 'History': 0.45, 'Periodicity': 0.05, 'Seasonality': 0.05, 'Surprise': 0.0}

combined_preds['Normal']['Combined_Score'] = (
    weights_norm['Jaccard'] * combined_preds['Normal']['Global_Jaccard_Score'] +
    weights_norm['MemJaccard'] * combined_preds['Normal']['Member_Jaccard_Score'] +
    weights_norm['History'] * combined_preds['Normal']['Frequency_Score'] +
    weights_norm['Periodicity'] * combined_preds['Normal']['Periodicity_Score'] +
    weights_norm['Seasonality'] * combined_preds['Normal']['Seasonal_Score'] +
    weights_norm['Surprise'] * combined_preds['Normal']['MF_score']
)

combined_preds['Normal'] = (
    combined_preds['Normal']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [370]:
weights_freq = {'Jaccard': 0.0, 'MemJaccard' : 0.4, 'History': 0.5, 'Periodicity': 0.1, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['FrequentBuyer']['Combined_Score'] = (
    weights_freq['Jaccard'] * combined_preds['FrequentBuyer']['Global_Jaccard_Score'] +
    weights_freq['MemJaccard'] * combined_preds['FrequentBuyer']['Member_Jaccard_Score'] +
    weights_freq['History'] * combined_preds['FrequentBuyer']['Frequency_Score'] +
    weights_freq['Periodicity'] * combined_preds['FrequentBuyer']['Periodicity_Score'] +
    weights_freq['Seasonality'] * combined_preds['FrequentBuyer']['Seasonal_Score'] +
    weights_freq['Surprise'] * combined_preds['FrequentBuyer']['MF_score']
)

combined_preds['FrequentBuyer'] = (
    combined_preds['FrequentBuyer']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [371]:
weights_newUser = {'Jaccard': 0.35, 'MemJaccard' : 0.1, 'History': 0.55, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['NewUser']['Combined_Score'] = (
    weights_newUser['Jaccard'] * combined_preds['NewUser']['Global_Jaccard_Score'] +
    weights_newUser['MemJaccard'] * combined_preds['NewUser']['Member_Jaccard_Score'] +
    weights_newUser['History'] * combined_preds['NewUser']['Frequency_Score'] +
    weights_newUser['Periodicity'] * combined_preds['NewUser']['Periodicity_Score'] +
    weights_newUser['Seasonality'] * combined_preds['NewUser']['Seasonal_Score'] +
    weights_newUser['Surprise'] * combined_preds['NewUser']['MF_score']
)

combined_preds['NewUser'] = (
    combined_preds['NewUser']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [372]:
weights_exp = {'Jaccard': 0.3, 'MemJaccard' : 0.25, 'History': 0.45, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['Explorer']['Combined_Score'] = (
    weights_exp['Jaccard'] * combined_preds['Explorer']['Global_Jaccard_Score'] +
    weights_exp['MemJaccard'] * combined_preds['Explorer']['Member_Jaccard_Score'] +
    weights_exp['History'] * combined_preds['Explorer']['Frequency_Score'] +
    weights_exp['Periodicity'] * combined_preds['Explorer']['Periodicity_Score'] +
    weights_exp['Seasonality'] * combined_preds['Explorer']['Seasonal_Score'] +
    weights_exp['Surprise'] * combined_preds['Explorer']['MF_score']
)

combined_preds['Explorer'] = (
    combined_preds['Explorer']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)
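
The five cells above repeat the same weighting and top-5 selection with different weights. As a sketch only, the pattern could be captured in a single helper; `rank_with_weights` and `SCORE_COLUMNS` are hypothetical names introduced here, and the toy frame uses made-up scores.

```python
import pandas as pd

# Maps each strategy name to its score column, as used in the cells above.
SCORE_COLUMNS = {
    "Jaccard": "Global_Jaccard_Score",
    "MemJaccard": "Member_Jaccard_Score",
    "History": "Frequency_Score",
    "Periodicity": "Periodicity_Score",
    "Seasonality": "Seasonal_Score",
    "Surprise": "MF_score",
}


def rank_with_weights(df, weights, k=5):
    """Hypothetical helper: add Combined_Score and keep the top k rows per member."""
    scored = df.copy()
    scored["Combined_Score"] = sum(
        weight * scored[SCORE_COLUMNS[name]] for name, weight in weights.items()
    )
    return (scored.sort_values(["Member", "Combined_Score"], ascending=[True, False])
                  .groupby("Member").head(k))


# Toy candidates for one member; each SKU is favoured by exactly one strategy.
toy = pd.DataFrame({
    "Member": ["A", "A", "A"],
    "SKU": [10, 11, 12],
    "Global_Jaccard_Score": [1.0, 0.0, 0.0],
    "Member_Jaccard_Score": [0.0, 1.0, 0.0],
    "Frequency_Score": [0.0, 0.0, 1.0],
    "Periodicity_Score": [0.0, 0.0, 0.0],
    "Seasonal_Score": [0.0, 0.0, 0.0],
    "MF_score": [0.0, 0.0, 0.0],
})
weights_rep = {"Jaccard": 0.1, "MemJaccard": 0.3, "History": 0.6,
               "Periodicity": 0.0, "Seasonality": 0.0, "Surprise": 0.0}

top = rank_with_weights(toy, weights_rep, k=2)
print(top[["SKU", "Combined_Score"]])
```

Driving the sum from the weights dictionary means a strategy cannot be dropped from the formula by accident when a cell is copied for another cluster.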

In [373]:
# Check the recall after ranking
recall_df = run_recall_evaluation(combined_preds, df_withheld)
print(recall_df.sort_values(by='Recall@5', ascending=False))

     Level        Segment  Recall@5
4  Cluster         Normal    0.3495
0  Cluster     Repetitive    0.3045
3  Cluster  FrequentBuyer    0.2907
5  Overall            All    0.2583
2  Cluster        NewUser    0.2227
1  Cluster       Explorer    0.1851
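
`run_recall_evaluation` is defined elsewhere in the notebook, so its exact definition of recall is not shown in this section. As a rough illustration, one common form of recall@k is the share of an order's withheld SKUs that appear among its top-k predictions, averaged over orders. The function name `recall_at_k_sketch` and the generic `Score` column are assumptions made for this example.

```python
import pandas as pd


def recall_at_k_sketch(preds, truth, k=5):
    """Hypothetical metric: mean over orders of (hits in top k) / (withheld SKUs)."""
    top_k = (preds.sort_values(["Order", "Score"], ascending=[True, False])
                  .groupby("Order").head(k))
    predicted = top_k.groupby("Order")["SKU"].apply(set)
    actual = truth.groupby("Order")["SKU"].apply(set)
    # Orders without any prediction contribute a recall of 0
    recalls = [len(predicted.get(order, set()) & skus) / len(skus)
               for order, skus in actual.items()]
    return sum(recalls) / len(recalls)


# Toy predictions and withheld items with made-up IDs and scores
toy_preds = pd.DataFrame({"Order": [1, 1, 1, 2], "SKU": [10, 11, 12, 20],
                          "Score": [0.9, 0.8, 0.1, 0.5]})
toy_truth = pd.DataFrame({"Order": [1, 1, 2], "SKU": [10, 13, 21]})

recall = recall_at_k_sketch(toy_preds, toy_truth, k=5)
print(recall)
```

In this toy case order 1 recovers one of its two withheld SKUs and order 2 recovers none, so the mean is 0.25.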


In [374]:
hits_df = position_of_truth(combined_preds['Explorer'], df_withheld)
print(hits_df.sort_values(by='Predicted_Rank').head(10))

       Member    Order       SKU  Predicted_Rank
2089  SWCEWOZ  8383034  15668478               1
1446  SSWECLS  7513293  15669957               1
1463  SSWERSN  7540624  15668473               1
1523  SSWLNWS  8375902  15668460               1
1547  SSWNLNH  8286266  15668468               1
1593  SSWNOEH  7565384  15669878               1
1648  SSWZRHL  7539082  15668468               1
1815  SSZLHHE  7561456  15669878               1
1825  SSZRSZL  8376951  15668458               1
1996  SWCCZRH  8378882  15668381               1


### Apply prediction algorithms to original data

In [750]:
# Attach each member's cluster label to the last-order rows
df_last_orders = pd.merge(df_last, member_stats_df[['Member', 'Member_Cluster']], on='Member', how='left')

In [751]:
df_last_orders_clustered = [df_last_orders[df_last_orders['Member_Cluster'] == cluster] for cluster in clusters]

In [752]:
strategies = [
    lambda df: jaccard_recommendations(df, global_jaccard_matrix),
    lambda df: jaccard_cart_reranker(df),
    lambda df: history_based_recommendations(df),
    lambda df: periodicity_reranker(df),
    lambda df: get_rolling_popular_skus(df),
    lambda df: get_MF_recommendations(df)
]

strategy_names = ['Jaccard', 'MemJaccard', 'History', 'Periodicity', 'Seasonality', 'Surprise']

In [753]:
df_preds = []  # Will contain all 30 recommendation DataFrames (5 clusters x 6 strategies)
pred_labels = []  # To track which cluster and strategy

for i, df_cluster in enumerate(df_last_orders_clustered):
    for j, strategy in enumerate(strategies):
        df_pred = strategy(df_cluster)
        df_preds.append(df_pred)
        pred_labels.append(f"{clusters[i]}_{strategy_names[j]}")

In [754]:
df_pred_dict = dict(zip(pred_labels, df_preds))

# Organize predictions per cluster
cluster_preds = defaultdict(list)

for label, df in zip(pred_labels, df_preds):
    cluster = label.split("_")[0]  # 'Explorer', 'Repetitive', etc.
    cluster_preds[cluster].append(df)

In [755]:
combined_preds = {}

for cluster, dfs in cluster_preds.items():
    merged_df = dfs[0]

    for df in dfs[1:]:
        merged_df = pd.merge(merged_df, df, on=['Member','Order','SKU'], how='outer')
        merged_df = merged_df.drop_duplicates(subset=['Member', 'SKU'], keep='first')
        
    combined_preds[cluster] = merged_df

In [756]:
combined_preds['Normal'] = combined_preds['Normal'].fillna(0)
combined_preds['Explorer'] = combined_preds['Explorer'].fillna(0)
combined_preds['Repetitive'] = combined_preds['Repetitive'].fillna(0)
combined_preds['FrequentBuyer'] = combined_preds['FrequentBuyer'].fillna(0)
combined_preds['NewUser'] = combined_preds['NewUser'].fillna(0)

In [757]:
# Weights for the Repetitive cluster, chosen after testing on the held-out split above
weights_rep = {'Jaccard': 0.1, 'MemJaccard' : 0.3, 'History': 0.6, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['Repetitive']['Combined_Score'] = (
    weights_rep['Jaccard'] * combined_preds['Repetitive']['Global_Jaccard_Score'] +
    weights_rep['MemJaccard'] * combined_preds['Repetitive']['Member_Jaccard_Score'] +
    weights_rep['History'] * combined_preds['Repetitive']['Frequency_Score'] +
    weights_rep['Periodicity'] * combined_preds['Repetitive']['Periodicity_Score'] +
    weights_rep['Seasonality'] * combined_preds['Repetitive']['Seasonal_Score'] +
    weights_rep['Surprise'] * combined_preds['Repetitive']['MF_score']
)

combined_preds['Repetitive'] = (
    combined_preds['Repetitive']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [758]:
# Weights for the Normal cluster, chosen after testing on the held-out split above
weights_norm = {'Jaccard': 0.2, 'MemJaccard' : 0.3, 'History': 0.5, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['Normal']['Combined_Score'] = (
    weights_norm['Jaccard'] * combined_preds['Normal']['Global_Jaccard_Score'] +
    weights_norm['MemJaccard'] * combined_preds['Normal']['Member_Jaccard_Score'] +
    weights_norm['History'] * combined_preds['Normal']['Frequency_Score'] +
    weights_norm['Periodicity'] * combined_preds['Normal']['Periodicity_Score'] +
    weights_norm['Seasonality'] * combined_preds['Normal']['Seasonal_Score'] +
    weights_norm['Surprise'] * combined_preds['Normal']['MF_score']
)

combined_preds['Normal'] = (
    combined_preds['Normal']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [759]:
# Weights for the FrequentBuyer cluster, chosen after testing on the held-out split above
weights_freq = {'Jaccard': 0.0, 'MemJaccard' : 0.4, 'History': 0.6, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['FrequentBuyer']['Combined_Score'] = (
    weights_freq['Jaccard'] * combined_preds['FrequentBuyer']['Global_Jaccard_Score'] +
    weights_freq['MemJaccard'] * combined_preds['FrequentBuyer']['Member_Jaccard_Score'] +
    weights_freq['History'] * combined_preds['FrequentBuyer']['Frequency_Score'] +
    weights_freq['Periodicity'] * combined_preds['FrequentBuyer']['Periodicity_Score'] +
    weights_freq['Seasonality'] * combined_preds['FrequentBuyer']['Seasonal_Score'] +
    weights_freq['Surprise'] * combined_preds['FrequentBuyer']['MF_score']
)

combined_preds['FrequentBuyer'] = (
    combined_preds['FrequentBuyer']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [760]:
# Weights for the NewUser cluster, chosen after testing on the held-out split above
weights_newUser = {'Jaccard': 0.35, 'MemJaccard' : 0.0, 'History': 0.65, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['NewUser']['Combined_Score'] = (
    weights_newUser['Jaccard'] * combined_preds['NewUser']['Global_Jaccard_Score'] +
    weights_newUser['MemJaccard'] * combined_preds['NewUser']['Member_Jaccard_Score'] +
    weights_newUser['History'] * combined_preds['NewUser']['Frequency_Score'] +
    weights_newUser['Periodicity'] * combined_preds['NewUser']['Periodicity_Score'] +
    weights_newUser['Seasonality'] * combined_preds['NewUser']['Seasonal_Score'] +
    weights_newUser['Surprise'] * combined_preds['NewUser']['MF_score']
)

combined_preds['NewUser'] = (
    combined_preds['NewUser']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [761]:
# Weights for the Explorer cluster, chosen after testing on the held-out split above
weights_exp = {'Jaccard': 0.25, 'MemJaccard' : 0.25, 'History': 0.5, 'Periodicity': 0.0, 'Seasonality': 0.0, 'Surprise': 0.0}

combined_preds['Explorer']['Combined_Score'] = (
    weights_exp['Jaccard'] * combined_preds['Explorer']['Global_Jaccard_Score'] +
    weights_exp['MemJaccard'] * combined_preds['Explorer']['Member_Jaccard_Score'] +  # term was missing in this cell
    weights_exp['History'] * combined_preds['Explorer']['Frequency_Score'] +
    weights_exp['Periodicity'] * combined_preds['Explorer']['Periodicity_Score'] +
    weights_exp['Seasonality'] * combined_preds['Explorer']['Seasonal_Score'] +
    weights_exp['Surprise'] * combined_preds['Explorer']['MF_score']
)

combined_preds['Explorer'] = (
    combined_preds['Explorer']
    .sort_values(by=['Member', 'Combined_Score'], ascending=[True, False])
    .groupby('Member')
    .head(5)
)

In [762]:
df_combined = pd.concat([
    combined_preds['Explorer'],
    combined_preds['Repetitive'],
    combined_preds['Normal'],
    combined_preds['FrequentBuyer'],
    combined_preds['NewUser']
], ignore_index=True)


In [763]:
# Sort by priority, then slice top 5 per order
df_final = (
    df_combined
    .sort_values(by=['Order', 'Combined_Score'], ascending=[True, False])
    .groupby('Order')
    .head(5)
    .reset_index(drop=True)
)
len(df_final)

3190

# 5. Submission file creation
### Format
- CSV file with 4 columns: ID, Order, SKU, Member.
- Exactly 5 rows per order.
- There are 638 unique orders in "last_orders_subset.csv". The csv file should contain 638 * 5 = 3190 rows.
- No duplicate SKUs within an order.
- File should be named: **GR_12_rec_5_sets.csv**

In [764]:
# Write submission file
df_submission = df_final.copy()
assert len(df_submission) == 3190, "Number of rows doesn't match expected count."

# Add a 1-based ID column as the first column
df_submission.insert(0, 'ID', range(1, len(df_submission) + 1))

# Reorder the columns
desired_order = ['ID', 'Order', 'SKU', 'Member']
df_submission = df_submission[desired_order]

# File name as required in the Format section above
df_submission.to_csv("GR_12_rec_5_sets.csv", index=False)

# Check that there are exactly 5 rows for each order
df_submission.groupby('Order').size().describe()

count    638.0
mean       5.0
std        0.0
min        5.0
25%        5.0
50%        5.0
75%        5.0
max        5.0
dtype: float64
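
The summary above confirms the row count per order but does not check for repeated SKUs within an order. A minimal sketch of a combined check follows; `check_submission_sketch` is a hypothetical helper, and the toy frames use made-up IDs.

```python
import pandas as pd


def check_submission_sketch(df, rows_per_order=5):
    """Hypothetical check: right row count per order and no repeated SKU in an order."""
    sizes = df.groupby("Order").size()
    has_right_size = bool((sizes == rows_per_order).all())
    has_no_duplicate_skus = not df.duplicated(subset=["Order", "SKU"]).any()
    return has_right_size and has_no_duplicate_skus


# A valid toy order with five distinct SKUs
good = pd.DataFrame({"Order": [1] * 5, "SKU": [10, 11, 12, 13, 14]})

# Same row count, but SKU 10 is repeated
bad = pd.DataFrame({"Order": [1] * 5, "SKU": [10, 10, 12, 13, 14]})

print(check_submission_sketch(good), check_submission_sketch(bad))
```

Running a check like this on `df_submission` before writing the CSV would catch an order that has five rows but fewer than five distinct SKUs.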

# PEP8 Compliance Changes

The following changes were made to improve PEP8 compliance:

- Added blank lines around function definitions and class definitions.
- Added blank lines within functions to separate logical sections.
- Removed unnecessary blank lines.
- Corrected indentation where needed.
- Added or adjusted comments for clarity.
- Ensured consistent spacing around operators and in function calls.
- Ensured consistent naming conventions (variables, functions).
- Added docstrings to functions where appropriate.
- Wrapped long lines to improve readability (although some long lines with data or print statements were left as is for clarity in this context).