# Content-Based Recommender Training (Sport-Only)

This notebook trains a **content-based recommendation model** that learns sport association patterns.

**Key Difference from Item-Based CF:**
- Item-based: "Product 5 is similar to Product 12"
- Content-based: "Soccer products are similar to Basketball products"

**Why This Works:**
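- Learns sport-level patterns from synthetic data rather than patterns tied to specific product IDs
- Can recommend any product that carries a sport category seen in training
- Works with real database products, provided their sport labels match the training labels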

In [8]:
import pandas as pd
import numpy as np
import pickle
import os
from collections import defaultdict

## Step 1: Load Synthetic Training Data

In [9]:
# Load synthetic data
raw_df = pd.read_excel("CC_Synthetic_Training_Data.xlsx", header=None)
raw_df.columns = ["customer_name", "customer_age", "location", "sport", "brand", "product_name", "quantity", "order_amount"]

df = raw_df.copy()
print(f"Loaded {len(df)} purchase records")
print(f"Unique customers: {df['customer_name'].nunique()}")
print(f"Unique sports: {df['sport'].nunique()}")
print(f"Unique products: {df['product_name'].nunique()}")

df.head()

Loaded 5000 purchase records
Unique customers: 1000
Unique sports: 11
Unique products: 215


| | customer_name | customer_age | location | sport | brand | product_name | quantity | order_amount |
|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Bechtelar | 85 | Sherylfort, ND | Soccer | Adidas | Adidas Soccer Jersey | 3 | 945.16 |
| 1 | Aaron Bechtelar | 85 | Sherylfort, ND | Football | Nike | Nike Football Jersey | 3 | 925.27 |
| 2 | Aaron Bechtelar | 84 | Sherylfort, ND | Football | Riddell | Riddell Football Helmet | 3 | 586.16 |
| 3 | Aaron Bechtelar | 83 | Sherylfort, ND | Swimming | TYR | TYR Swimming Goggles | 3 | 602.34 |
| 4 | Aaron Bechtelar | 82 | Sherylfort, ND | Baseball | Wilson | Wilson Baseball Bat | 3 | 722.47 |


## Step 2: Learn Sport-to-Sport Associations

For each sport, find which other sports are frequently purchased by the same users.

In [10]:
def compute_sport_associations(df):
    """
    Compute sport-to-sport association scores.
    
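    Logic: If customers who buy Sport A also frequently buy Sport B,
    then A and B have high association. The score for (A, B) is the
    fraction of Sport A's customers who also bought Sport B, so it is
    asymmetric: score(A, B) can differ from score(B, A).
    
    Returns:
        tuple: (sport_associations, sport_totals), where
            sport_associations is {sport: {other_sport: association_score, ...}, ...}
            sport_totals is {sport: number of customers who bought that sport}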
    """
    # Get all sports purchased by each customer
    customer_sports = df.groupby('customer_name')['sport'].apply(set).to_dict()
    
    # Count co-occurrences
    sport_cooccurrence = defaultdict(lambda: defaultdict(int))
    sport_totals = defaultdict(int)
    
    for customer, sports in customer_sports.items():
        sports_list = list(sports)
        for i, sport_a in enumerate(sports_list):
            sport_totals[sport_a] += 1
            for sport_b in sports_list[i+1:]:
                sport_cooccurrence[sport_a][sport_b] += 1
                sport_cooccurrence[sport_b][sport_a] += 1
    
    # Convert to association scores (normalized by the number of customers who bought sport_a)
    sport_associations = {}
    for sport_a in sport_cooccurrence:
        sport_associations[sport_a] = {}
        for sport_b, cooccur_count in sport_cooccurrence[sport_a].items():
            # Association score: fraction of sport_a's customers who also bought sport_b
            score = cooccur_count / sport_totals[sport_a]
            sport_associations[sport_a][sport_b] = score
    
    return sport_associations, sport_totals

sport_associations, sport_totals = compute_sport_associations(df)

print("\nSport Association Scores:")
print("=" * 60)
for sport in sorted(sport_associations.keys())[:5]:  # Show first 5
    print(f"\n{sport}:")
    sorted_assoc = sorted(sport_associations[sport].items(), key=lambda x: x[1], reverse=True)[:3]
    for other_sport, score in sorted_assoc:
        print(f"  → {other_sport}: {score:.3f}")


Sport Association Scores:

Baseball:
  → Golf: 0.346
  → Surfing: 0.343
  → Football: 0.340

Basketball:
  → Golf: 0.361
  → Fencing: 0.358
  → Surfing: 0.340

Fencing:
  → Swimming: 0.358
  → Football: 0.341
  → Basketball: 0.336

Football:
  → Running: 0.381
  → Fencing: 0.350
  → Surfing: 0.340

Golf:
  → Basketball: 0.361
  → Surfing: 0.345
  → Baseball: 0.345
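
To make the score concrete, here is a small worked example on a hypothetical three-customer dataset (the customer names and purchases are invented for illustration and are not from the training file). It re-implements the same ratio, the number of customers who bought both sports divided by the number who bought the first sport, and shows that the score is asymmetric.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: customer -> set of sports purchased
toy_customer_sports = {
    "alice": {"Soccer", "Tennis"},
    "bob": {"Soccer"},
    "carol": {"Soccer", "Tennis", "Golf"},
}

def toy_associations(customer_sports):
    """score[a][b] = (# customers who bought a and b) / (# customers who bought a)."""
    totals = defaultdict(int)
    pair_counts = defaultdict(lambda: defaultdict(int))
    for sports in customer_sports.values():
        for sport in sports:
            totals[sport] += 1
        # Count each unordered pair once, then record it in both directions
        for a, b in combinations(sorted(sports), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return {
        a: {b: count / totals[a] for b, count in others.items()}
        for a, others in pair_counts.items()
    }

toy_scores = toy_associations(toy_customer_sports)
print(round(toy_scores["Soccer"]["Tennis"], 3))  # 0.667: 2 of 3 Soccer buyers also bought Tennis
print(round(toy_scores["Tennis"]["Soccer"], 3))  # 1.0: both Tennis buyers also bought Soccer
```

The asymmetry is why `Baseball → Golf` (0.346) and `Golf → Baseball` (0.345) differ slightly in the output above: the shared customer count is the same, but the denominators are not.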


## Step 3: Compute Sport Popularity

For cold-start scenarios (a customer with no purchase history), we need to know which sports are most popular overall. Popularity here is measured as total units sold per sport.

In [11]:
# Compute sport popularity (total units purchased per sport, summed over `quantity`)
sport_popularity = df.groupby('sport')['quantity'].sum().sort_values(ascending=False)

print("Top 10 Most Popular Sports:")
print(sport_popularity.head(10))

Top 10 Most Popular Sports:
sport
Fencing       1373
Football      1367
Tennis        1347
Swimming      1336
Basketball    1312
Baseball      1297
Golf          1286
Surfing       1284
Running       1284
Soccer        1281
Name: quantity, dtype: int64
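
One way the popularity ranking could serve as a cold-start fallback is sketched below. The helper name `top_popular_sports` and its `exclude` parameter are assumptions made for this illustration; they are not part of the notebook or of `Recommender.py`.

```python
def top_popular_sports(popularity, top_n=3, exclude=()):
    """Return the top_n sports by units sold, skipping any listed in `exclude`."""
    ranked = sorted(popularity.items(), key=lambda item: item[1], reverse=True)
    return [sport for sport, _ in ranked if sport not in exclude][:top_n]

# A subset of the values printed above, enough to illustrate the ranking
example_popularity = {"Fencing": 1373, "Football": 1367, "Tennis": 1347, "Swimming": 1336}

print(top_popular_sports(example_popularity, top_n=2))
# ['Fencing', 'Football']
print(top_popular_sports(example_popularity, top_n=2, exclude={"Fencing"}))
# ['Football', 'Tennis']
```

In the notebook, `sport_popularity.to_dict()` could be passed in place of `example_popularity`.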


## Step 4: Extract Product Metadata by Sport

Group products by their sport category for quick lookup.

In [12]:
# Get unique products with their sport and brand
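# Keep one row per product name so product_name can serve as a unique index in Step 5
product_metadata = df[['product_name', 'sport', 'brand']].drop_duplicates(subset='product_name')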

# Create sport-to-products mapping
products_by_sport = product_metadata.groupby('sport')['product_name'].apply(list).to_dict()

print("\nProducts per sport:")
for sport, products in list(products_by_sport.items())[:5]:
    print(f"{sport}: {len(products)} products")


Products per sport:
Baseball: 25 products
Basketball: 24 products
Fencing: 15 products
Football: 25 products
Golf: 16 products


## Step 5: Save Model Artifacts

Package everything needed for recommendations into a pickle file.

In [13]:
# Create artifacts dictionary
artifacts = {
    "sport_associations": sport_associations,  # Sport-to-sport association scores
    "sport_popularity": sport_popularity.to_dict(),  # Popularity ranking for cold start
    "products_by_sport": products_by_sport,  # Quick lookup: sport -> list of products
    "product_metadata": product_metadata.set_index('product_name').to_dict('index'),  # Product details
    "all_sports": sorted(df['sport'].unique())  # All sports in the data, including any with no co-purchases
}

# Save to pickle
MODEL_DIR = "model_artifacts"
os.makedirs(MODEL_DIR, exist_ok=True)

pickle_path = os.path.join(MODEL_DIR, "content_based_artifacts.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(artifacts, f)

print(f"\n✅ Model artifacts saved to: {pickle_path}")
print("\nArtifacts contain:")
for key, value in artifacts.items():
    if isinstance(value, dict):
        print(f"  - {key}: {len(value)} entries")
    elif isinstance(value, list):
        print(f"  - {key}: {len(value)} items")
    else:
        print(f"  - {key}: {type(value)}")


✅ Model artifacts saved to: model_artifacts/content_based_artifacts.pkl

Artifacts contain:
  - sport_associations: 11 entries
  - sport_popularity: 11 entries
  - products_by_sport: 11 entries
  - product_metadata: 215 entries
  - all_sports: 11 items
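
Before wiring the file into a service, it can help to confirm that it loads back with the expected keys. The sketch below is a hypothetical round-trip check: `load_artifacts` and `REQUIRED_KEYS` are names invented here, and the stand-in dictionary holds only a few sample values so the example runs on its own in a temporary directory. Since unpickling can execute arbitrary code, only load artifact files you produced yourself.

```python
import os
import pickle
import tempfile

# Keys written by the artifacts dictionary in this notebook
REQUIRED_KEYS = {
    "sport_associations", "sport_popularity", "products_by_sport",
    "product_metadata", "all_sports",
}

def load_artifacts(path):
    """Load the artifacts pickle and check that the expected keys are present."""
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    missing = REQUIRED_KEYS - set(loaded)
    if missing:
        raise KeyError(f"Artifacts file is missing keys: {sorted(missing)}")
    return loaded

# Stand-in artifacts (a tiny sample, not the full model) so the example is self-contained
demo_artifacts = {
    "sport_associations": {"Soccer": {"Tennis": 0.354}},
    "sport_popularity": {"Soccer": 1281},
    "products_by_sport": {"Soccer": ["Adidas Soccer Jersey"]},
    "product_metadata": {"Adidas Soccer Jersey": {"sport": "Soccer", "brand": "Adidas"}},
    "all_sports": ["Soccer"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "content_based_artifacts.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(demo_artifacts, f)
    reloaded = load_artifacts(demo_path)

print(reloaded == demo_artifacts)  # True: the round trip preserves the dictionary
```

In the notebook itself, `load_artifacts(pickle_path)` would check the file that was just written.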


## Step 6: Test the Model

Spot-check the learned associations by printing the top related sports for a few sports.

In [14]:
def test_sport_recommendations(sport, top_n=5):
    """
    Test: Given a sport, what other sports should we recommend?
    """
    if sport not in sport_associations:
        print(f"Sport '{sport}' not found in training data")
        return
    
    print(f"\nIf user bought {sport} products, also recommend:")
    print("=" * 50)
    
    # Get associated sports sorted by score
    associated = sorted(
        sport_associations[sport].items(), 
        key=lambda x: x[1], 
        reverse=True
    )[:top_n]
    
    for other_sport, score in associated:
        print(f"  {other_sport}: {score:.1%} association")

# Test with a few sports
test_sports = list(sport_associations.keys())[:3]
for sport in test_sports:
    test_sport_recommendations(sport, top_n=3)


If user bought Swimming products, also recommend:
  Fencing: 36.9% association
  Tennis: 36.4% association
  Football: 33.8% association

If user bought Football products, also recommend:
  Running: 38.1% association
  Fencing: 35.0% association
  Surfing: 34.0% association

If user bought Soccer products, also recommend:
  Tennis: 35.4% association
  Baseball: 34.9% association
  Fencing: 34.6% association
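
The top scores printed above all sit in a narrow band (roughly 0.34 to 0.38), so the raw score alone cannot tell a genuine association apart from a sport that is simply bought by many customers. One common check is lift, the score divided by the overall share of customers who bought the second sport. The sketch below is an optional addition, not part of the original pipeline; `compute_lift` is a hypothetical helper and the numbers are invented.

```python
def compute_lift(associations, totals, n_customers):
    """lift[a][b] = P(b | a) / P(b); values near 1.0 suggest no real association."""
    lift = {}
    for sport_a, others in associations.items():
        lift[sport_a] = {
            sport_b: score / (totals[sport_b] / n_customers)
            for sport_b, score in others.items()
        }
    return lift

# Hypothetical numbers: 100 customers, 50 bought Soccer, 40 bought Tennis, 30 bought both
toy_assoc = {"Soccer": {"Tennis": 30 / 50}, "Tennis": {"Soccer": 30 / 40}}
toy_totals = {"Soccer": 50, "Tennis": 40}

toy_lift = compute_lift(toy_assoc, toy_totals, n_customers=100)
print(round(toy_lift["Soccer"]["Tennis"], 2))  # 1.5: Soccer buyers are 1.5x as likely to buy Tennis
print(round(toy_lift["Tennis"]["Soccer"], 2))  # 1.5: lift is symmetric, unlike the raw score
```

With the notebook's variables, the call would be `compute_lift(sport_associations, sport_totals, df['customer_name'].nunique())`.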


## Summary

**What This Model Learned:**
- Sport-to-sport purchase associations
- Overall sport popularity rankings
- Product groupings by sport

**How It Works With Real Products:**
1. A real customer buys a product with sport="Soccer"
2. The model looks up Soccer's strongest associations from training, e.g. Tennis (35.4%) and Baseball (34.9%)
3. The recommender returns more Soccer products plus some Tennis/Baseball products
4. This works for any product whose sport category appears in the training data
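
The steps above can be sketched as a single lookup function. This is a minimal illustration under stated assumptions, not the logic in `Recommender.py`: the function name, its parameters, and the products prefixed with "Example" are hypothetical, and the demo dictionary holds only a few values taken from the outputs above.

```python
def recommend_products(purchased_sport, artifacts, n_same=2, n_related_sports=2, n_per_related=1):
    """Return product names: same-sport items first, then items from associated sports.

    Falls back to the most popular sports when purchased_sport was not seen in training.
    """
    associations = artifacts["sport_associations"].get(purchased_sport)
    products_by_sport = artifacts["products_by_sport"]

    if associations is None:
        # Cold start: no association data, so rank sports by popularity instead
        ranked = sorted(artifacts["sport_popularity"].items(), key=lambda kv: kv[1], reverse=True)
        recommendations = []
    else:
        ranked = sorted(associations.items(), key=lambda kv: kv[1], reverse=True)
        recommendations = list(products_by_sport.get(purchased_sport, [])[:n_same])

    for sport, _ in ranked[:n_related_sports]:
        recommendations.extend(products_by_sport.get(sport, [])[:n_per_related])
    return recommendations

# Small demo artifacts; "Example ..." products are invented for illustration
demo = {
    "sport_associations": {"Soccer": {"Tennis": 0.354, "Baseball": 0.349, "Fencing": 0.346}},
    "sport_popularity": {"Fencing": 1373, "Tennis": 1347, "Baseball": 1297, "Soccer": 1281},
    "products_by_sport": {
        "Soccer": ["Adidas Soccer Jersey", "Example Soccer Ball"],
        "Tennis": ["Example Tennis Racket"],
        "Baseball": ["Wilson Baseball Bat"],
        "Fencing": ["Example Fencing Mask"],
    },
}

print(recommend_products("Soccer", demo))
# ['Adidas Soccer Jersey', 'Example Soccer Ball', 'Example Tennis Racket', 'Wilson Baseball Bat']
print(recommend_products("Curling", demo))
# ['Example Fencing Mask', 'Example Tennis Racket']
```

For real products, `products_by_sport` would presumably be built from the production database rather than the synthetic catalog, while the association scores stay as learned here.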

**Next Steps:**
1. Update `Recommender.py` to use these artifacts
2. Modify recommendation logic to match by sport
3. Update FastAPI endpoints to pass sport information