# Task 2: Shopping Basket Recommendation

**Reference / Dataset:** Groceries dataset from Kaggle: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

**Goal:** Given a shopping basket (e.g. "pasta" and "olive oil"), recommend additional items that customers often buy together (e.g. "canned tomato").

**Approach:**
- Load real shopping transaction data (who bought what, and when)
- Group purchases into baskets (same customer, same date = one basket)
- Count how often each pair of items appears in the same basket (co-occurrence)
- Reuse Task 1's decision tree model to categorize items and optionally boost recommendations by category
- Build a function that scores candidate items by how often they appear with the current basket, then return the top N

**Why this approach?** I wanted something simple and interpretable. The idea is that if many people buy X and Y together, then when someone has X in their basket, suggesting Y is reasonable. I'm not building a complex model from scratch here; I'm reusing Task 1's categorization as an extra signal (same-category items get a small score boost) and relying mainly on co-occurrence counts. The downside is that very popular items like "whole milk" show up a lot because they appear with almost everything, but the logic is easy to explain and the outcome of Task 1 is directly useful here.


## Step 1: Import Libraries

**Why these libraries?**

- **pandas**: To load the transaction CSV and work with tables (columns like Member_number, Date, itemDescription). I use it to read the file and inspect the data.

- **numpy**: For arrays when building feature vectors for Task 1's model (e.g. `np.array([create_simple_features(item)])`) and for simple numeric operations.

- **collections (defaultdict, Counter)**: To count how often items appear in baskets and how often pairs of items appear together. `defaultdict(int)` is used for the co-occurrence dictionary so I can do `co_occurrence[pair] += 1` without checking if the key exists; `Counter` is handy for counting single items and getting "most common" lists.

- **os**: Used to check whether the saved model files exist before trying to load them (e.g. `os.path.exists(model_path)`). The actual load of Task 1's classifier and encoder is done with **joblib** in Step 4.


In [None]:
import pandas as pd
import numpy as np
from collections import defaultdict, Counter
import os

print("Libraries imported successfully!")


Libraries imported successfully!


## Step 2: Load the Shopping Basket Data

**What kind of data do I need?** I need transaction-level grocery data: each row should represent one item purchased, with some way to know which items were bought together in the same trip. So I need at least: a customer or transaction ID, a date (or trip ID), and the item name.

**Where did I get the data?** I'm using the Groceries dataset from Kaggle (link in the intro). It has three columns: `Member_number`, `Date`, and `itemDescription`. Each row is one item one customer bought on one date. So if customer 1808 bought "tropical fruit" and "whole milk" on 21-07-2015, there are two rows with the same Member_number and Date; that tells me those two items were in the same basket.

**How will I process it?** I load the CSV with pandas, then in the next step I'll group rows by `Member_number` and `Date` to build a list of baskets. For now I just load and print basic stats (number of purchases, unique customers, unique items, and a sample of rows) so I can confirm the data looks right.

![Step 2 results](../resultsimages/task2results/task2step2results.png)


In [None]:
# Load the grocery transaction data
df = pd.read_csv('Groceries_dataset.csv')

print(f"Total purchases: {len(df):,}")
print(f"Unique customers: {df['Member_number'].nunique():,}")
print(f"Unique items: {df['itemDescription'].nunique():,}")
print(f"\nFirst 10 purchases:")
print(df.head(10))

print("\n\nWhat this means:")
print("  - Each row is one item someone bought")
print("  - Items with same Member_number + Date = same shopping trip")


Total purchases: 38,765
Unique customers: 3,898
Unique items: 167

First 10 purchases:
   Member_number        Date   itemDescription
0           1808  21-07-2015    tropical fruit
1           2552  05-01-2015        whole milk
2           2300  19-09-2015         pip fruit
3           1187  12-12-2015  other vegetables
4           3037  01-02-2015        whole milk
5           4941  14-02-2015        rolls/buns
6           4501  08-05-2015  other vegetables
7           3803  23-12-2015        pot plants
8           2762  20-03-2015        whole milk
9           4119  12-02-2015    tropical fruit


What this means:
  - Each row is one item someone bought
  - Items with same Member_number + Date = same shopping trip


## Step 3: Group Items into Shopping Baskets

**Why group into baskets?** The raw data is one row per item. To recommend "what goes with what" I need to know which items were bought in the same trip. So I define one basket = one customer on one date: every row with the same `Member_number` and `Date` belongs to the same basket.

**What we're doing:** We loop over the dataframe and, for each row, append the item (lowercased and stripped) to a list keyed by `Member_number_Date`. That gives a dictionary of basket_id -> list of items. We then turn that into a list of baskets, and for each basket we remove duplicates with `set` so the same item doesn't count twice in one basket. The result is a list where each element is a list of item names, e.g. ["tropical fruit", "whole milk", "rolls/buns"].

**What we get:** The number of baskets (one per customer-date), the average basket size, and a few example baskets. This structure is what we use in the next step to count how often pairs of items appear together.
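To make the basket definition concrete, here is a minimal sketch on a handful of made-up rows in the same three-column shape as the dataset. The names `rows`, `grouped`, and `toy_baskets` are illustrative only and are not used elsewhere in the notebook. It shows that duplicates collapse after normalisation and that a different date or a different customer starts a new basket.

```python
from collections import defaultdict

# Toy rows in the same shape as the dataset: (Member_number, Date, itemDescription)
rows = [
    (1808, "21-07-2015", "tropical fruit"),
    (1808, "21-07-2015", "Whole Milk "),
    (1808, "21-07-2015", "whole milk"),   # duplicate once lowercased and stripped
    (1808, "22-07-2015", "rolls/buns"),   # same customer, different date
    (2552, "21-07-2015", "whole milk"),   # same date, different customer
]

# Collect items per (customer, date); a set drops duplicates within a basket
grouped = defaultdict(set)
for member, date, item in rows:
    grouped[(member, date)].add(item.lower().strip())

# sorted() gives a stable item order; list(set(...)) order can vary between runs
toy_baskets = [sorted(items) for items in grouped.values()]
print(toy_baskets)
# [['tropical fruit', 'whole milk'], ['rolls/buns'], ['whole milk']]
```

Sorting each basket is a small design choice rather than a requirement: it only makes printed examples reproducible and does not change any counts.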

![Step 3 results](../resultsimages/task2results/task2step3results.png)


In [None]:
# Group items by transaction (Member_number + Date)
# This creates shopping baskets

baskets = []
basket_dict = defaultdict(list)

# Group by Member_number and Date
for _, row in df.iterrows():
    basket_id = f"{row['Member_number']}_{row['Date']}"
    item = row['itemDescription'].lower().strip()  # Normalize: lowercase, no extra spaces
    basket_dict[basket_id].append(item)

# Convert to a list of baskets; set() drops duplicate items within a basket,
# then list() turns each basket back into a list (item order is not guaranteed)
baskets = [list(set(basket)) for basket in basket_dict.values()]

print(f"Total shopping baskets: {len(baskets):,}")
print(f"Average items per basket: {np.mean([len(b) for b in baskets]):.1f}")
print(f"\nExample baskets:")
for i, basket in enumerate(baskets[:5]):
    print(f"  Basket {i+1}: {basket[:5]}{'...' if len(basket) > 5 else ''}")


Total shopping baskets: 14,963
Average items per basket: 2.5

Example baskets:
  Basket 1: ['candy', 'rolls/buns', 'tropical fruit']
  Basket 2: ['tropical fruit', 'chocolate', 'whole milk']
  Basket 3: ['other vegetables', 'pip fruit', 'flour']
  Basket 4: ['other vegetables', 'onions', 'shopping bags']
  Basket 5: ['other vegetables', 'white bread', 'whole milk']


## Step 4: Load Task 1's Categorization Model

**Why reuse Task 1's model?** The instructions said the outcome of the previous task should be useful for directly feeding into this task. Task 1 produced a trained decision tree and a label encoder, saved as `decision_tree_model.joblib` and `label_encoder.joblib` in the task1 folder. Here I load those files with joblib so I don't retrain and I use the same model Task 1 saved. That way recommendations can optionally use category: items in the same category as the basket get a small score boost, so we're reusing Task 1's output as an extra signal.

**What we're doing:** We define `create_simple_features(item_name)` so that it returns exactly the same 7 features Task 1 used (length, first/last character, vowels, consonants, spaces, average character value). The decision tree was trained on those 7 features, so if we passed 9 or different features we'd get a shape error. We then check if the two .joblib files exist under `../task1/`; if they do, we load the model and encoder, print the list of categories, and run a quick test (e.g. "pasta" -> predicted category). If the files are missing or loading fails, we set `task1_available = False` and the rest of the notebook runs without categorization.
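As a worked example of the feature vector, the sketch below restates the seven name-based features described above and applies them to "pasta". The function name `seven_features` is hypothetical; it is meant to mirror the description, not to replace `create_simple_features`, and whether it matches Task 1 exactly depends on how Task 1 defined its features.

```python
def seven_features(item_name):
    """Hypothetical restatement of the 7 name-based features described above."""
    name = str(item_name)
    lower = name.lower()
    vowels = "aeiou"
    letters = [c for c in lower if c.isalpha()]
    return [
        len(name),                                                # 1. length
        ord(lower[0]) if lower else 0,                            # 2. first character code
        ord(lower[-1]) if lower else 0,                           # 3. last character code
        sum(c in vowels for c in lower),                          # 4. vowel count
        sum(c.isalpha() and c not in vowels for c in lower),      # 5. consonant count
        name.count(" "),                                          # 6. space count
        sum(map(ord, letters)) / len(letters) if letters else 0,  # 7. mean letter code
    ]

features = seven_features("pasta")
print(features)       # [5, 112, 97, 2, 3, 0, 107.4]
print(len(features))  # 7
```

The length check at the end is the part that matters for loading the model: a vector of any other length would not match what the tree was trained on.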

![Step 4 results](../resultsimages/task2results/task2step4.png)


In [None]:
# Load Task 1's model and encoder
# First, we need to recreate the feature function from Task 1

def create_simple_features(item_name):
    """Extract 7 features from item name (same as Task 1's decision tree: extract_features)."""
    item_lower = str(item_name).lower()
    features = []
    features.append(len(item_name))
    features.append(ord(item_lower[0]) if len(item_lower) > 0 else 0)
    features.append(ord(item_lower[-1]) if len(item_lower) > 0 else 0)
    vowels = 'aeiou'
    features.append(sum(1 for char in item_lower if char in vowels))
    features.append(sum(1 for char in item_lower if char.isalpha() and char not in vowels))
    features.append(item_name.count(' '))
    letters = [c for c in item_lower if c.isalpha()]
    features.append(sum(ord(c) for c in letters) / len(letters) if letters else 0)
    return features

# Try to load Task 1's decision tree model and encoder saved from task1dt1-saved.ipynb
try:
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import LabelEncoder
    import joblib

    model_path = '../task1/decision_tree_model.joblib'
    encoder_path = '../task1/label_encoder.joblib'

    if os.path.exists(model_path) and os.path.exists(encoder_path):
        task1_model = joblib.load(model_path)
        task1_label_encoder = joblib.load(encoder_path)

        print("Task 1's decision tree categorization model loaded successfully!")
        print("  Categories (from Task 1):")
        for i, cat in enumerate(task1_label_encoder.classes_, 1):
            print(f"    {i:2d}. {cat}")

        # Quick test
        test_item = "pasta"
        features = np.array([create_simple_features(test_item)])
        prediction = task1_model.predict(features)
        predicted_category = task1_label_encoder.inverse_transform(prediction)[0]
        print(f"\n  Test: '{test_item}' = {predicted_category}")

        task1_available = True
    else:
        print("Task 1 model files not found. We'll work without categorization.")
        task1_available = False
        task1_model = None
        task1_label_encoder = None

except Exception as e:
    print(f"Could not load Task 1 decision tree model: {e}")
    print("  We'll work without categorization.")
    task1_available = False
    task1_model = None
    task1_label_encoder = None


Task 1's decision tree categorization model loaded successfully!
  Categories (from Task 1):
     1. Bakery
     2. Beverages
     3. Canned Goods
     4. Condiments & Sauces
     5. Dairy & Eggs
     6. Deli
     7. Frozen Foods
     8. Household
     9. Meat & Seafood
    10. Pantry
    11. Pasta & Grains
    12. Personal Care
    13. Pet Supplies
    14. Produce
    15. Snacks

  Test: 'pasta' = Pasta & Grains


## Step 5: Find Items That Appear Together (Co-occurrence)

**Why count pairs?** To recommend "what to add to the basket" I need a score for each candidate item. The simplest score is: how often does this item appear in a basket together with one of the items already in the basket? So I need to count, for every pair of items (A, B), how many baskets contain both A and B. That number is the co-occurrence count. Later, when the user's basket has item A, we'll add the co-occurrence count of (A, B) to B's recommendation score.

**What we're doing:** We loop over every basket. For each basket we update `item_counts` (how many baskets each item appears in) and then, for every pair of distinct items in that basket, we form a sorted tuple (so (A,B) and (B,A) become the same key) and increment `co_occurrence[pair]`. At the end we have a dictionary mapping item pairs to counts, plus overall item frequencies. We print the total number of pairs, the most frequent items, and the most frequent pairs so we can see that the data looks sensible (e.g. "whole milk" and "other vegetables" often together).
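The pair-counting idea can be checked by hand on a few toy baskets. This sketch uses `itertools.combinations` instead of the nested index loops; it is an equivalent way to enumerate pairs, not the code the notebook runs. The names `toy_baskets` and `pair_counts` are illustrative.

```python
from collections import Counter
from itertools import combinations

toy_baskets = [
    ["whole milk", "rolls/buns", "yogurt"],
    ["whole milk", "yogurt"],
    ["soda"],  # a single-item basket produces no pairs
]

pair_counts = Counter()
for basket in toy_baskets:
    # sorted(set(...)) yields each pair once per basket, in alphabetical order,
    # so (A, B) and (B, A) always map to the same key
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

print(pair_counts[("whole milk", "yogurt")])      # 2
print(pair_counts[("rolls/buns", "whole milk")])  # 1
print(len(pair_counts))                           # 3
```

Sorting the basket once before pairing avoids sorting every pair individually, which is a minor saving on baskets this small.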

![Step 5 results](../resultsimages/task2results/task2step5results.png)


In [None]:
# Count how often items appear together in baskets
co_occurrence = defaultdict(int)
item_counts = Counter()  # How often each item appears overall

# For each basket, count pairs of items
for basket in baskets:
    # Count individual items
    for item in basket:
        item_counts[item] += 1
    
    # Count pairs (items that appear together)
    for i in range(len(basket)):
        for j in range(i + 1, len(basket)):
            item1, item2 = basket[i], basket[j]
            # Store in alphabetical order to avoid duplicates
            pair = tuple(sorted([item1, item2]))
            co_occurrence[pair] += 1

print(f"Total item pairs found: {len(co_occurrence):,}")
print(f"Most common items:")
for item, count in item_counts.most_common(10):
    print(f"  {item:30s}: appears in {count:,} baskets")

print(f"\nMost common pairs (items bought together):")
for pair, count in sorted(co_occurrence.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {pair[0]:20s} + {pair[1]:20s}: {count:,} times together")


Total item pairs found: 6,260
Most common items:
  whole milk                    : appears in 2,363 baskets
  other vegetables              : appears in 1,827 baskets
  rolls/buns                    : appears in 1,646 baskets
  soda                          : appears in 1,453 baskets
  yogurt                        : appears in 1,285 baskets
  root vegetables               : appears in 1,041 baskets
  tropical fruit                : appears in 1,014 baskets
  bottled water                 : appears in 908 baskets
  sausage                       : appears in 903 baskets
  citrus fruit                  : appears in 795 baskets

Most common pairs (items bought together):
  other vegetables     + whole milk          : 222 times together
  rolls/buns           + whole milk          : 209 times together
  soda                 + whole milk          : 174 times together
  whole milk           + yogurt              : 167 times together
  other vegetables     + rolls/buns          : 158 times to

## Step 6: Build the Recommendation Function

**Why a single function?** I want one place that takes a basket (list of item names), the co-occurrence and item-count structures we built, and optional Task 1 model/encoder, and returns the top N recommended items with scores. That way we can call it from the test cell and from the "more examples" cell with different baskets and parameters.

**How it works:** (1) Normalise the basket (lowercase, strip). (2) Start every candidate item's score at 0. For each item in the basket, scan all co-occurrence pairs; if a pair contains that basket item, the other item in the pair is a candidate, so add the pair's count to that candidate's score. Items already in the basket are never added as candidates. (3) If Task 1's model and encoder are loaded, we get the predicted category for each basket item and for each candidate; if a candidate's category is in the basket's categories, we multiply its score by 1.1 (small boost). (4) Sort candidates by score descending and return the top N. So the core logic is "score = sum of co-occurrence counts with basket items," with an optional category boost.
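Step (2) scans every pair for every basket item. A possible alternative, sketched here with made-up counts, is to build a neighbour index once so that scoring only touches pairs that involve a basket item. The names `neighbours` and `score_candidates` are hypothetical, and the category boost is left out to keep the example short.

```python
from collections import defaultdict

# Toy pair counts keyed by alphabetically sorted tuples, as in Step 5 (numbers are made up)
toy_co_occurrence = {
    ("pasta", "whole milk"): 4,
    ("pasta", "yogurt"): 2,
    ("soda", "whole milk"): 9,
    ("whole milk", "yogurt"): 7,
}

# Build the index once: neighbours[a][b] = number of baskets containing both a and b
neighbours = defaultdict(dict)
for (a, b), count in toy_co_occurrence.items():
    neighbours[a][b] = count
    neighbours[b][a] = count

def score_candidates(basket, neighbours):
    """Sum co-occurrence counts with basket items, skipping items already in the basket."""
    in_basket = {item.lower().strip() for item in basket}  # also removes repeated items
    scores = defaultdict(float)
    for item in in_basket:
        for other, count in neighbours.get(item, {}).items():
            if other not in in_basket:
                scores[other] += count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score_candidates(["Pasta", "yogurt"], neighbours))
# [('whole milk', 11.0)]
```

Here "whole milk" scores 4 (with pasta) plus 7 (with yogurt), and "soda" never appears because it shares no pair with either basket item.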

**Why this approach?** It's transparent and easy to debug: we're not fitting a new model here, we're reusing Task 1's output and the co-occurrence counts. The downside is that very popular items (e.g. whole milk) get high scores for almost any basket because they co-occur with everything; that's a known limitation we discuss in the Summary.


In [None]:
def recommend_items(basket, co_occurrence, item_counts, top_n=5, task1_model=None, task1_encoder=None):
    """
    Recommend items based on a shopping basket.
    
    Parameters:
    - basket: List of items currently in the basket (e.g., ["pasta", "olive oil"])
    - co_occurrence: Dictionary of item pairs and how often they appear together
    - item_counts: How often each item appears overall
    - top_n: How many recommendations to return
    - task1_model: Optional - Task 1's categorization model
    - task1_encoder: Optional - Task 1's label encoder
    
    Returns:
    - List of recommended items with scores
    """
    # Normalize basket items (lowercase, strip spaces)
    basket = [item.lower().strip() for item in basket]
    
    # Score each potential item
    recommendation_scores = defaultdict(float)
    
    # For each item in the basket, find items that appear with it
    for basket_item in basket:
        # Look through all co-occurrence pairs
        for (item1, item2), count in co_occurrence.items():
            # If this pair includes our basket item
            if basket_item == item1:
                other_item = item2
                # Don't recommend items already in basket
                if other_item not in basket:
                    # Score = how often they appear together
                    recommendation_scores[other_item] += count
            elif basket_item == item2:
                other_item = item1
                if other_item not in basket:
                    recommendation_scores[other_item] += count
    
    # Use Task 1's categorization to boost recommendations
    if task1_model is not None and task1_encoder is not None:
        # Get categories of items in basket
        basket_categories = set()
        for item in basket:
            try:
                features = np.array([create_simple_features(item)])
                # Decision tree returns class indices directly
                prediction = task1_model.predict(features)
                category = task1_encoder.inverse_transform(prediction)[0]
                basket_categories.add(category)
            except Exception:
                pass  # Skip if categorization fails

        # Boost items in same categories
        for item in recommendation_scores.keys():
            try:
                features = np.array([create_simple_features(item)])
                prediction = task1_model.predict(features)
                category = task1_encoder.inverse_transform(prediction)[0]
                if category in basket_categories:
                    # Small boost for same category
                    recommendation_scores[item] *= 1.1
            except Exception:
                pass
    
    # Sort by score and return top N
    sorted_recommendations = sorted(recommendation_scores.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_recommendations[:top_n]

print("Recommendation function created!")


Recommendation function created!


## Step 7: Test the Recommendation System

**What we're checking:** Whether the suggested items are plausible (e.g. dairy, grains, or vegetables often show up), whether the scores reflect how often items appear with the basket items, and whether we get any obvious mistakes.
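One extra check that may help when reading the results: a basket item only contributes to the scores if its name exactly matches an item in the data. The sketch below splits a basket into exact matches and unknown names, and lists dataset items that contain the unknown name as a substring. `check_basket` is a hypothetical helper and the counts in `toy_item_counts` are made up.

```python
from collections import Counter

# Toy stand-in for item_counts from Step 5 (counts are made up)
toy_item_counts = Counter({"whole milk": 50, "white bread": 12, "brown bread": 9, "pasta": 4})

def check_basket(basket, item_counts):
    """Split basket items into exact matches and unknown names with substring suggestions."""
    known, unknown = [], {}
    for raw in basket:
        item = raw.lower().strip()
        if item in item_counts:
            known.append(item)
        else:
            # Suggest dataset items whose name contains the unknown term
            unknown[item] = sorted(name for name in item_counts if item in name)
    return known, unknown

known, unknown = check_basket(["whole milk", "bread"], toy_item_counts)
print(known)    # ['whole milk']
print(unknown)  # {'bread': ['brown bread', 'white bread']}
```

Running something like this before `recommend_items` would show which basket items actually drive the scores and which are silently ignored.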

![Step 7 results](../resultsimages/task2results/task2step7results.png)


In [None]:
# Test with the example from the task description
test_basket = ["pasta", "olive oil"]

print(f"Shopping basket: {test_basket}")
print("\nRecommendations:")
print("-" * 70)

recommendations = recommend_items(
    test_basket, 
    co_occurrence, 
    item_counts, 
    top_n=10,
    task1_model=task1_model if task1_available else None,
    task1_encoder=task1_label_encoder if task1_available else None
)

for i, (item, score) in enumerate(recommendations, 1):
    print(f"{i:2d}. {item:30s} (appeared {score:.0f} times with basket items)")

# Check if "canned tomato" or similar is recommended
print("\n\nChecking for 'canned tomato' or similar items:")
for item, score in recommendations:
    if 'tomato' in item or 'canned' in item:
        print(f"  Found: {item} (score: {score:.0f})")


Shopping basket: ['pasta', 'olive oil']

Recommendations:
----------------------------------------------------------------------
 1. whole milk                     (appeared 18 times with basket items)
 2. yogurt                         (appeared 14 times with basket items)
 3. soda                           (appeared 12 times with basket items)
 4. sausage                        (appeared 11 times with basket items)
 5. rolls/buns                     (appeared 11 times with basket items)
 6. pip fruit                      (appeared 10 times with basket items)
 7. tropical fruit                 (appeared 9 times with basket items)
 8. beef                           (appeared 7 times with basket items)
 9. shopping bags                  (appeared 7 times with basket items)
10. hamburger meat                 (appeared 7 times with basket items)


Checking for 'canned tomato' or similar items:


## Step 8: Try More Examples

**What we're doing:** Loop over a fixed list of example baskets; for each, call `recommend_items` with `top_n=5` and print the basket and the top 5 items with their scores. There is no new logic here, just repeated use of the same function to show its behaviour.

![Step 8 results](../resultsimages/task2results/task2step8results.png)


In [None]:
# Test with different baskets
test_baskets = [
    ["whole milk", "bread"],
    ["chicken", "vegetables"],
    ["yogurt", "berries", "whole milk"],
    ["coffee", "sugar"]
]

for basket in test_baskets:
    print(f"\n{'='*70}")
    print(f"Basket: {basket}")
    print("-" * 70)
    
    recommendations = recommend_items(
        basket,
        co_occurrence,
        item_counts,
        top_n=5,
        task1_model=task1_model if task1_available else None,
        task1_encoder=task1_label_encoder if task1_available else None
    )
    
    for i, (item, score) in enumerate(recommendations, 1):
        print(f"  {i}. {item:30s} (score: {score:.0f})")



Basket: ['whole milk', 'bread']
----------------------------------------------------------------------
  1. other vegetables               (score: 222)
  2. rolls/buns                     (score: 209)
  3. soda                           (score: 174)
  4. yogurt                         (score: 167)
  5. sausage                        (score: 134)

Basket: ['chicken', 'vegetables']
----------------------------------------------------------------------
  1. whole milk                     (score: 51)
  2. rolls/buns                     (score: 47)
  3. soda                           (score: 35)
  4. other vegetables               (score: 33)
  5. yogurt                         (score: 27)

Basket: ['yogurt', 'berries', 'whole milk']
----------------------------------------------------------------------
  1. other vegetables               (score: 421)
  2. rolls/buns                     (score: 351)
  3. soda                           (score: 283)
  4. tropical fruit                 (score

---

## Why Did We Get These Results?


#### **Why Some Recommendations Make Sense**

**Basket: "whole milk" + "bread"**  
**Top recommendation: "other vegetables" (score: 222)**  
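- This might seem unrelated, but it reflects true shopping behaviour. People buying daily staples often also grab vegetables.
- "rolls/buns" and "yogurt" also appear because these items frequently co-occur in general grocery trips.
- The top four scores (222, 209, 174, 167) are exactly the "whole milk" pair counts printed in Step 5, which suggests that "bread" had no exact match among the dataset's item names and added nothing to the scores.
- The model is not really guessing recipes. It is detecting high-frequency co-purchases across many shoppers.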

  
**Basket: "chicken" + "vegetables"**  
**Top recommendation: "whole milk" (score: 51)**  
- This is not a cooking-related connection, but a basket correlation. Families who buy proteins and fresh produce also tend to buy milk in the same trip.  
- "rolls/buns", "soda", and "other vegetables" appear because they are commonly added side items in larger shopping baskets.  
- The model finds patterns, not meaning. It is about what people actually buy together.

  
**Basket: "yogurt" + "berries" + "whole milk"**  
**Top recommendation: "other vegetables" (score: 421)**  
- Yogurt, berries, and milk feel like a breakfast combo, but the top suggestion is "other vegetables" because it co-occurs so often with whole milk and yogurt, two of the most frequent items in the data. "whole milk" itself cannot be suggested because it is already in the basket.
- "rolls/buns" and "soda" follow for the same reason: they are staples that appear alongside the basket items in many trips.

#### **Why Some Recommendations Might Be Wrong**

**Rare combinations:**
- If "pasta", "olive oil", and "canned tomato" rarely appear in the same basket, "canned tomato" won't be recommended
- The system only recommends items it has actually seen co-occurring with the basket items, ranked by raw count

**Too common items:**
- Very popular items (like "whole milk") appear with everything
- They might get recommended even when not relevant
- Solution: Filter out items that appear in too many baskets
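The filtering idea could look roughly like the sketch below. `drop_very_common` is a hypothetical helper, the 10% threshold is an arbitrary assumption that would need tuning, and all counts are made up.

```python
def drop_very_common(recommendations, item_counts, n_baskets, max_share=0.10):
    """Remove candidates that appear in more than max_share of all baskets."""
    return [
        (item, score)
        for item, score in recommendations
        if item_counts.get(item, 0) / n_baskets <= max_share
    ]

# Made-up counts: number of baskets (out of 1,000) that contain each item
toy_item_counts = {"whole milk": 160, "yogurt": 90, "pasta sauce": 12}
toy_recommendations = [("whole milk", 18.0), ("yogurt", 14.0), ("pasta sauce", 6.0)]

filtered = drop_very_common(toy_recommendations, toy_item_counts, n_baskets=1000)
print(filtered)
# [('yogurt', 14.0), ('pasta sauce', 6.0)]
```

A hard cut-off is blunt: it removes a popular item even for baskets where it really is relevant, so it trades one kind of error for another.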

### Limitations of Our Approach

#### **1. Popularity Bias**
- **Common items dominate:** Items that appear in many baskets get recommended often
  - **Impact:** Less popular but relevant items might be missed
- **Example:** "whole milk" might be recommended for everything because it's very common
- **Solution:** Once in a while, suggest a less popular but related item. Spotify does something similar: it plays songs people love but sneaks in new artists so people don't get stuck hearing the same hit song forever
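A different, deterministic way to surface less popular but related items is to rank by lift, a standard measure from association rule mining, instead of raw counts. This is not what the notebook implements; the sketch below only illustrates the formula on made-up counts, and the names `lift`, `candidates`, and `ranked` are illustrative.

```python
def lift(pair_count, count_a, count_b, n_baskets):
    """Lift = P(A and B) / (P(A) * P(B)); values above 1 mean 'together more often than chance'."""
    return pair_count * n_baskets / (count_a * count_b)

# Made-up toy counts over 1,000 baskets
n_baskets = 1000
count_pasta = 20  # baskets containing "pasta"
candidates = {
    # candidate: (baskets containing the candidate, baskets containing candidate AND pasta)
    "whole milk": (200, 5),
    "pasta sauce": (10, 2),
}

ranked = sorted(
    ((name, lift(both, count_pasta, total, n_baskets)) for name, (total, both) in candidates.items()),
    key=lambda kv: kv[1],
    reverse=True,
)
print(ranked)
# [('pasta sauce', 10.0), ('whole milk', 1.25)]
```

By raw count "whole milk" wins (5 versus 2), but by lift "pasta sauce" ranks first. Lift is noisy for very rare pairs, so in practice it would probably need a minimum pair count alongside it.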

#### **2. Task 1 Integration**
- **Categorization accuracy:** Task 1's model has around 20% accuracy
  - **Impact:** Category-based recommendations might be wrong
- **Limited categories:** Only 15 categories might not capture all relationships
  - **Impact:** Some item relationships might be missed

### What This Tells Us

**The Good:**
- Simple approach that's easy to understand
- Based on real customer behavior
- Can find patterns in shopping data
- Works better with more data

**The Problems:**
- Simple counting doesn't capture complex relationships
- Popular items dominate recommendations
- Task 1's categorization helps but has limitations

**Conclusion:**
Our recommendation system works by finding items that commonly appear together in shopping baskets. It's simple and transparent, but has limitations due to the simplicity of the approach.
