# Item-Based Collaborative Filtering - Model Training

Training pipeline for a product recommendation model using item-based collaborative filtering (CF).

Steps:
1. Load purchase and view data from database
2. Build user-item interaction matrix
3. Calculate item-item similarity matrix (Cosine Similarity)
4. Apply co-occurrence filtering
5. Test model recommendations and compute basic evaluation metrics
6. Save the trained model to a pickle file and verify that it loads

## 1. Setup and Imports

In [48]:
import pandas as pd
import numpy as np
import pickle
from datetime import datetime
from pathlib import Path
from sqlalchemy import create_engine
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Database Connection

In [49]:
# Database configuration
DB_CONFIG = {
    'host': 'localhost',
    'port': 3306,
    'database': 'ecommerce_db',
    'user': 'root',
    'password': 'root'
}

# Create the engine (SQLAlchemy connects lazily, so no connection is opened until the first query)
engine = create_engine(
    f"mysql+pymysql://{DB_CONFIG['user']}:{DB_CONFIG['password']}@"
    f"{DB_CONFIG['host']}:{DB_CONFIG['port']}/{DB_CONFIG['database']}"
)

print("Database engine created!")

Database engine created!
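
The configuration above hard-codes the database credentials in the notebook. As a minimal sketch of an alternative, the helper below builds the same kind of URL from environment-style settings. The function name `build_db_url` and the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, and `DB_NAME` are assumptions made for illustration; they do not come from the original project. The user and password are percent-encoded so that characters such as `@` or `/` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None):
    """Build a mysql+pymysql URL from environment-style settings.

    The variable names (DB_USER, DB_PASSWORD, ...) are hypothetical;
    adapt them to your own deployment.
    """
    env = os.environ if env is None else env
    user = quote_plus(env.get("DB_USER", "root"))
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "3306")
    database = env.get("DB_NAME", "ecommerce_db")
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment
example_url = build_db_url({"DB_USER": "app", "DB_PASSWORD": "p@ss/word"})
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the f-string built from `DB_CONFIG`.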


## 3. Load Purchase Data

In [50]:
# Load purchase history from orders and order_items
query_purchases = """
SELECT
    HEX(o.user_id) as user_id,
    HEX(pv.product_id) as product_id,
    oi.quantity,
    o.created_at as purchase_date
FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
INNER JOIN product_variants pv ON oi.variant_id = pv.id
WHERE o.order_status IN ('CONFIRMED', 'DELIVERED', 'SHIPPED')
  AND o.is_active = 1
  AND oi.is_active = 1
ORDER BY o.created_at DESC
"""

df_purchases = pd.read_sql(query_purchases, engine)

print(f"Loaded {len(df_purchases):,} purchase records")
print(f"Unique users: {df_purchases['user_id'].nunique():,}")
print(f"Unique products: {df_purchases['product_id'].nunique():,}")
print(f"\nSample data:")
df_purchases.head()

Loaded 8,014 purchase records
Unique users: 102
Unique products: 33

Sample data:


|  | user_id | product_id | quantity | purchase_date |
|---|---|---|---|---|
| 0 | E06384C1443347D3B03BE62D34E75B9C | 8DFA6150D7514B86A5DC4D26D4EAC44F | 1 | 2025-11-04 23:20:09.676367 |
| 1 | 4A0744BBEBD04D328C48CAC30491A726 | 738A974F3F374DFCB5D8FD17A3A8A5F2 | 1 | 2025-11-04 23:12:35.676367 |
| 2 | CD2C649F7531440588A57EF2262CDB46 | BA6F86D76CA44499AEE9108F60A9A476 | 1 | 2025-11-04 23:07:27.676367 |
| 3 | CD2C649F7531440588A57EF2262CDB46 | CF54E8C21B8A44548A184BEFC1DA5611 | 1 | 2025-11-04 23:07:27.676367 |
| 4 | 305AF675B5094C5AA566E95ACA2227CF | 25BB55DAFE254A358396CC98410FB5C0 | 1 | 2025-11-04 23:06:40.676367 |


## 4. Load View Data (Implicit Feedback)

In [51]:
# Load product view history
query_views = """
SELECT
    HEX(user_id) as user_id,
    HEX(product_id) as product_id,
    view_count,
    last_viewed_at
FROM user_product_views
"""

df_views = pd.read_sql(query_views, engine)

print(f"Loaded {len(df_views):,} view records")
print(f"Unique users: {df_views['user_id'].nunique():,}")
print(f"Unique products: {df_views['product_id'].nunique():,}")
print(f"\nSample data:")
df_views.head()

Loaded 3,173 view records
Unique users: 102
Unique products: 33

Sample data:


|  | user_id | product_id | view_count | last_viewed_at |
|---|---|---|---|---|
| 0 | 00865C51DFA045A7AD4F510E366CCFCB | 0840E707EDFC4C31AA5097D0F004A8CE | 3 | 2025-05-26 17:15:14.676367 |
| 1 | 00925D0E0A3B4F668AC6521C4525555E | 0840E707EDFC4C31AA5097D0F004A8CE | 6 | 2025-05-21 05:32:28.676367 |
| 2 | 054AF889BC5E4AAE947C0AC124803AE8 | 0840E707EDFC4C31AA5097D0F004A8CE | 1 | 2025-06-06 13:48:45.676367 |
| 3 | 094B9067489349A1BDCCE008413E7DDE | 0840E707EDFC4C31AA5097D0F004A8CE | 3 | 2025-05-26 00:57:26.676367 |
| 4 | 0E8949A2F260466F9ADC9D8C7FDC6F19 | 0840E707EDFC4C31AA5097D0F004A8CE | 2 | 2025-05-18 09:06:42.676367 |


## 5. Build User-Item Interaction Matrix

Combine purchase and view data into a single weighted score per user-product pair:
- Purchase: weight = 1.0 per unit purchased (explicit feedback, weighted more heavily)
- View: weight = 0.3 per recorded view (implicit feedback, weighted less heavily)

In [52]:
# Aggregate purchases: sum quantities per user-product pair
purchase_matrix = df_purchases.groupby(['user_id', 'product_id'])['quantity'].sum()
purchase_matrix = purchase_matrix * 1.0  # Purchase weight = 1.0

print("Purchase matrix created")
print(f"Shape: {purchase_matrix.shape}")

# Aggregate views: sum view counts per user-product pair
view_matrix = df_views.groupby(['user_id', 'product_id'])['view_count'].sum()
view_matrix = view_matrix * 0.3  # View weight = 0.3

print("\nView matrix created")
print(f"Shape: {view_matrix.shape}")

# Combine: purchase + views (weighted sum)
interaction_scores = purchase_matrix.add(view_matrix, fill_value=0)

print("\nCombined interaction scores")
print(f"Total interactions: {len(interaction_scores):,}")

Purchase matrix created
Shape: (2835,)

View matrix created
Shape: (3173,)

Combined interaction scores
Total interactions: 3,331
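
To make the weighted combination concrete, here is a toy sketch with one user and two products. The data and the names `toy_purchases`, `toy_views`, `PURCHASE_WEIGHT`, and `VIEW_WEIGHT` are invented for illustration; only the weights 1.0 and 0.3 and the `add(..., fill_value=0)` pattern come from the cell above.

```python
import pandas as pd

# Toy data: user "u1" bought product "A" twice, viewed "A" three times and "B" once
toy_purchases = pd.DataFrame(
    {"user_id": ["u1", "u1"], "product_id": ["A", "A"], "quantity": [1, 1]}
)
toy_views = pd.DataFrame(
    {"user_id": ["u1", "u1"], "product_id": ["A", "B"], "view_count": [3, 1]}
)

PURCHASE_WEIGHT = 1.0
VIEW_WEIGHT = 0.3

purchase_scores = (
    toy_purchases.groupby(["user_id", "product_id"])["quantity"].sum() * PURCHASE_WEIGHT
)
view_scores = (
    toy_views.groupby(["user_id", "product_id"])["view_count"].sum() * VIEW_WEIGHT
)

# fill_value=0 keeps pairs that appear in only one of the two sources
toy_scores = purchase_scores.add(view_scores, fill_value=0)
print(toy_scores.round(2).to_dict())
```

Product "A" ends up with roughly 2 × 1.0 + 3 × 0.3 = 2.9, and product "B", which was only viewed, with roughly 0.3. The same union behaviour explains the counts in the output above: 2,835 purchase pairs and 3,173 view pairs merge into 3,331 distinct user-product pairs.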


In [53]:
# Convert to wide format matrix (users x products)
user_item_matrix = interaction_scores.unstack(fill_value=0)

print("User-Item Interaction Matrix:")
print(f"Shape: {user_item_matrix.shape} (users x products)")
print(f"Total cells: {user_item_matrix.size:,}")
print(f"Non-zero cells: {(user_item_matrix > 0).sum().sum():,}")

# Calculate sparsity
sparsity = 1 - ((user_item_matrix > 0).sum().sum() / user_item_matrix.size)
print(f"\nSparsity: {sparsity:.2%}")
print(f"Density: {(1-sparsity):.2%}")

# Show sample
print("\nSample matrix (first 5 users x 5 products):")
user_item_matrix.iloc[:5, :5]

User-Item Interaction Matrix:
Shape: (102, 33) (users x products)
Total cells: 3,366
Non-zero cells: 3,331

Sparsity: 1.04%
Density: 98.96%

Sample matrix (first 5 users x 5 products):


| user_id \ product_id | 0840E707EDFC4C31AA5097D0F004A8CE | 1EAE17D73016409F8840E802DC599F5F | 25BB55DAFE254A358396CC98410FB5C0 | 27ABB23A419A4B05A14FEC388EC8B863 | 307BB36E6EE44BE5897DE0E1FDD7B270 |
|---|---|---|---|---|---|
| 00865C51DFA045A7AD4F510E366CCFCB | 2.9 | 1.2 | 4.2 | 1.2 | 13.2 |
| 00925D0E0A3B4F668AC6521C4525555E | 2.8 | 1.6 | 1.6 | 3.2 | 13.3 |
| 054AF889BC5E4AAE947C0AC124803AE8 | 6.3 | 3.6 | 2.2 | 0.6 | 11.9 |
| 094B9067489349A1BDCCE008413E7DDE | 2.9 | 2.2 | 6.9 | 3.8 | 5.9 |
| 0E8949A2F260466F9ADC9D8C7FDC6F19 | 4.6 | 4.5 | 1.6 | 4.9 | 13.4 |


## 6. Calculate Item-Item Similarity Matrix

Cosine similarity measures how similar two products are, based on the weighted interaction scores (purchases plus views) that users gave them.

In [54]:
print("Calculating item-item similarity matrix...")
print("This may take a few seconds...\n")

# Transpose to get product x user matrix
item_user_matrix = user_item_matrix.T

# Calculate cosine similarity between all product pairs
# Result: (n_products x n_products) matrix
similarity_array = cosine_similarity(item_user_matrix)

# Convert to DataFrame for easier manipulation
product_ids = item_user_matrix.index.tolist()
similarity_matrix = pd.DataFrame(
    similarity_array,
    index=product_ids,
    columns=product_ids
)

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"\nSample similarities (first 5x5):")
print(similarity_matrix.iloc[:5, :5].round(3))

Calculating item-item similarity matrix...
This may take a few seconds...

Similarity matrix shape: (33, 33)

Sample similarities (first 5x5):
                                  0840E707EDFC4C31AA5097D0F004A8CE  \
0840E707EDFC4C31AA5097D0F004A8CE                             1.000   
1EAE17D73016409F8840E802DC599F5F                             0.733   
25BB55DAFE254A358396CC98410FB5C0                             0.754   
27ABB23A419A4B05A14FEC388EC8B863                             0.703   
307BB36E6EE44BE5897DE0E1FDD7B270                             0.779   

                                  1EAE17D73016409F8840E802DC599F5F  \
0840E707EDFC4C31AA5097D0F004A8CE                             0.733   
1EAE17D73016409F8840E802DC599F5F                             1.000   
25BB55DAFE254A358396CC98410FB5C0                             0.685   
27ABB23A419A4B05A14FEC388EC8B863                             0.757   
307BB36E6EE44BE5897DE0E1FDD7B270                             0.802   
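
For readers who want to see what the similarity computation does, the sketch below spells it out with plain NumPy on a toy matrix. It is an illustration, not the scikit-learn implementation; the helper name `cosine_sim_matrix` and the toy values are assumptions. Each row is normalised to unit length, and the similarity of two products is the dot product of their normalised rows.

```python
import numpy as np


def cosine_sim_matrix(item_user):
    """Cosine similarity between the rows of a 2-D array."""
    item_user = np.asarray(item_user, dtype=float)
    norms = np.linalg.norm(item_user, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero; all-zero rows get similarity 0
    unit_rows = item_user / norms
    return unit_rows @ unit_rows.T


# Rows are products, columns are users (toy interaction scores)
toy_item_user = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],  # same direction as row 0, so similarity is 1
    [0.0, 3.0, 0.0],  # no shared users with row 0, so similarity is 0
])
toy_sim = cosine_sim_matrix(toy_item_user)
print(toy_sim.round(3))
```

Because only the direction of each row matters, a product with uniformly larger scores (row 1) is treated as identical to its scaled-down counterpart (row 0).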

               

## 7. Apply Co-occurrence Filtering

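Filter out similarities that are supported by too few users, using a minimum co-occurrence threshold (the number of users who interacted with both products).
This is intended to improve recommendation quality by removing spurious correlations.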

In [55]:
# Calculate co-occurrence matrix
# Binary matrix: 1 if the user interacted with the product (purchase or view), 0 otherwise
binary_matrix = (user_item_matrix > 0).astype(int)

# Co-occurrence = how many users interacted with both products
# Formula: binary_matrix.T @ binary_matrix
co_occurrence = binary_matrix.T @ binary_matrix

# Set diagonal to 0 (product with itself)
np.fill_diagonal(co_occurrence.values, 0)

print("Co-occurrence matrix calculated")
print(f"Shape: {co_occurrence.shape}")
print(f"\nSample co-occurrence (first 5x5):")
print(co_occurrence.iloc[:5, :5])

Co-occurrence matrix calculated
Shape: (33, 33)

Sample co-occurrence (first 5x5):
product_id                        0840E707EDFC4C31AA5097D0F004A8CE  \
product_id                                                           
0840E707EDFC4C31AA5097D0F004A8CE                                 0   
1EAE17D73016409F8840E802DC599F5F                                98   
25BB55DAFE254A358396CC98410FB5C0                                96   
27ABB23A419A4B05A14FEC388EC8B863                                95   
307BB36E6EE44BE5897DE0E1FDD7B270                                98   

product_id                        1EAE17D73016409F8840E802DC599F5F  \
product_id                                                           
0840E707EDFC4C31AA5097D0F004A8CE                                98   
1EAE17D73016409F8840E802DC599F5F                                 0   
25BB55DAFE254A358396CC98410FB5C0                               100   
27ABB23A419A4B05A14FEC388EC8B863                                99   
307BB3

In [56]:
# Apply minimum co-occurrence threshold (increased from 2 to 5)
MIN_CO_OCCURRENCE = 5  # At least 5 users must have interacted with both products

# Count each unordered pair once and skip the diagonal (self-similarity)
pairs_before = int(np.triu(similarity_matrix.values > 0, k=1).sum())

print(f"Applying minimum co-occurrence filter: {MIN_CO_OCCURRENCE}")
print(f"\nBefore filtering:")
print(f"  Non-zero similarities: {pairs_before:,} pairs")

# Create mask: keep only similarities where co-occurrence >= threshold
co_occurrence_mask = co_occurrence >= MIN_CO_OCCURRENCE

# Apply mask to similarity matrix
filtered_similarity = similarity_matrix.copy()
filtered_similarity = filtered_similarity.where(co_occurrence_mask, 0)

# Keep diagonal as 1 (product with itself)
np.fill_diagonal(filtered_similarity.values, 1.0)

pairs_after = int(np.triu(filtered_similarity.values > 0, k=1).sum())

print(f"\nAfter filtering:")
print(f"  Non-zero similarities: {pairs_after:,} pairs")
print(f"  Reduction: {pairs_before - pairs_after:,} pairs removed")
print(f"\nThis helps eliminate weak correlations and improve recommendation quality.")

Applying minimum co-occurrence filter: 5

Before filtering:
  Non-zero similarities: 528 pairs

After filtering:
  Non-zero similarities: 528 pairs
  Reduction: 0 pairs removed

This helps eliminate weak correlations and improve recommendation quality.
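
On this dataset the filter removes nothing ("Reduction: 0 pairs removed"), because the co-occurrence counts shown above are in the 90s while the threshold is 5. The toy sketch below shows the filter actually dropping a pair. All values and the name `TOY_MIN_CO_OCCURRENCE` are invented for illustration.

```python
import numpy as np

# Rows are users, columns are products; 1 means the user interacted with the product
toy_binary = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
])

# Entry (i, j) counts the users who interacted with both product i and product j
toy_co = toy_binary.T @ toy_binary
np.fill_diagonal(toy_co, 0)

toy_similarity = np.array([
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.0],
    [0.6, 0.0, 1.0],
])

TOY_MIN_CO_OCCURRENCE = 2  # illustrative threshold, not the notebook's value

# Keep a similarity only where enough users support it, then restore the diagonal
toy_filtered = np.where(toy_co >= TOY_MIN_CO_OCCURRENCE, toy_similarity, 0.0)
np.fill_diagonal(toy_filtered, 1.0)

print(toy_co)
print(toy_filtered)
```

Products 0 and 1 share two users, so their similarity of 0.8 survives. Products 0 and 2 share only one user, so their similarity of 0.6 is set to 0.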


## 8. Model Summary and Statistics

In [57]:
# Calculate model statistics
n_users = len(user_item_matrix)
n_products = len(product_ids)
n_interactions = (user_item_matrix > 0).sum().sum()

print("="*70)
print("MODEL TRAINING SUMMARY")
print("="*70)
print(f"Data:")
print(f"  Users: {n_users:,}")
print(f"  Products: {n_products:,}")
print(f"  Interactions: {n_interactions:,}")
print(f"  Sparsity: {sparsity:.2%}")

print(f"\nModel:")
print(f"  Algorithm: Item-Based Collaborative Filtering")
print(f"  Similarity metric: Cosine Similarity")
print(f"  Min co-occurrence: {MIN_CO_OCCURRENCE}")
print(f"  Hybrid approach: Purchase (1.0) + Views (0.3)")

print(f"\nSimilarity Matrix:")
print(f"  Shape: {filtered_similarity.shape}")
print(f"  Non-zero pairs: {int(np.triu(filtered_similarity.values > 0, k=1).sum()):,}")

# Average over all non-zero entries; boolean indexing already returns a 1-D array.
# Note: this includes the diagonal self-similarities of 1.0.
non_zero_similarities = filtered_similarity.values[filtered_similarity.values > 0]
avg_sim = float(non_zero_similarities.mean())

# Get max similarity excluding diagonal (product with itself)
off_diagonal_mask = ~np.eye(n_products, dtype=bool)
max_sim = float(filtered_similarity.values[off_diagonal_mask].max())

print(f"  Avg similarity (non-zero): {avg_sim:.4f}")
print(f"  Max similarity: {max_sim:.4f}")

print("\n" + "="*70)

MODEL TRAINING SUMMARY
Data:
  Users: 102
  Products: 33
  Interactions: 3,331
  Sparsity: 1.04%

Model:
  Algorithm: Item-Based Collaborative Filtering
  Similarity metric: Cosine Similarity
  Min co-occurrence: 5
  Hybrid approach: Purchase (1.0) + Views (0.3)

Similarity Matrix:
  Shape: (33, 33)
  Non-zero pairs: 528
  Avg similarity (non-zero): 0.7654
  Max similarity: 0.9474



## 9. Test Model - Sample Recommendations

In [58]:
# Function to get similar products
def get_similar_products(product_id, similarity_df, top_n=10):
    """
    Get products most similar to the given product.
    
    Parameters:
        product_id: Target product ID
        similarity_df: Similarity matrix DataFrame
        top_n: Number of recommendations to return
    
    Returns:
        List of (product_id, similarity_score) tuples
    """
    if product_id not in similarity_df.index:
        return []
    
    # Get similarity scores for this product
    similarities = similarity_df.loc[product_id]
    
    # Remove the product itself
    similarities = similarities.drop(product_id, errors='ignore')
    
    # Filter out zero similarities
    similarities = similarities[similarities > 0]
    
    # Sort and get top N
    top_items = similarities.nlargest(top_n)
    
    return [(item_id, score) for item_id, score in top_items.items()]

# Test with first product
sample_product = product_ids[0]
print(f"Testing recommendations for product: {sample_product}")
print("\nTop 10 similar products:")
print("-" * 60)

recommendations = get_similar_products(sample_product, filtered_similarity, top_n=10)
for i, (prod_id, score) in enumerate(recommendations, 1):
    print(f"{i:2}. {prod_id}  (similarity: {score:.4f})")

if not recommendations:
    print("No recommendations found (insufficient co-occurrence)")

Testing recommendations for product: 0840E707EDFC4C31AA5097D0F004A8CE

Top 10 similar products:
------------------------------------------------------------
 1. BA6F86D76CA44499AEE9108F60A9A476  (similarity: 0.8020)
 2. 8878D08F78E147E5BB787FFAAE109482  (similarity: 0.8002)
 3. A6BC68C1D5EC48B899BB17DE744A9D84  (similarity: 0.7891)
 4. E2C2445A70D84EC8B73536B3F0C97327  (similarity: 0.7807)
 5. 307BB36E6EE44BE5897DE0E1FDD7B270  (similarity: 0.7792)
 6. A8EE15B067364544B975986B953AA5DC  (similarity: 0.7645)
 7. 82771BA7894A45C58A689CD7988DB0B0  (similarity: 0.7633)
 8. 25BB55DAFE254A358396CC98410FB5C0  (similarity: 0.7543)
 9. 42F9CAA10821429AAAB0C1714ABB437F  (similarity: 0.7477)
10. E1A15CDBBEB24703B00ADBF762C440EC  (similarity: 0.7329)


In [59]:
# Function to get user recommendations (weighted scoring)
def recommend_for_user(user_id, user_item_df, similarity_df, top_n=10):
    """
    Generate personalized recommendations for a user using weighted scoring.
    
    Parameters:
        user_id: Target user ID
        user_item_df: User-item interaction matrix
        similarity_df: Item similarity matrix
        top_n: Number of recommendations
    
    Returns:
        List of (product_id, score) tuples
    """
    if user_id not in user_item_df.index:
        return []
    
    # Get user's interaction scores (purchases + views)
    user_interactions = user_item_df.loc[user_id]
    # Note: despite the name, this holds every product the user purchased or viewed
    purchased_items = user_interactions[user_interactions > 0]
    
    if len(purchased_items) == 0:
        return []
    
    # Weighted scoring: each candidate accumulates interaction_strength * similarity
    # over the items the user interacted with, rather than a simple average
    candidate_scores = {}
    
    for item_id, interaction_strength in purchased_items.items():
        # Get similarity scores for this item
        item_similarities = similarity_df.loc[item_id]
        
        # For each similar item
        for candidate_id, similarity in item_similarities.items():
            # Skip if the user already interacted with the candidate or similarity is 0
            if candidate_id in purchased_items.index or similarity == 0:
                continue
            
            # Accumulate weighted score: interaction_strength * similarity
            if candidate_id not in candidate_scores:
                candidate_scores[candidate_id] = 0
            candidate_scores[candidate_id] += interaction_strength * similarity
    
    # Sort by score and get top N
    if not candidate_scores:
        return []
    
    top_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    return top_candidates

# Test with first user
sample_user = user_item_matrix.index[0]
print(f"Testing IMPROVED recommendations for user: {sample_user}")

# Show user's purchase history
user_purchases = user_item_matrix.loc[sample_user]
purchased = user_purchases[user_purchases > 0]
print(f"\nUser has interacted with {len(purchased)} products")
print(f"Top 5 interactions:")
for i, (prod_id, score) in enumerate(purchased.nlargest(5).items(), 1):
    print(f"  {i}. {prod_id[:20]}... (score: {score:.2f})")

# Get recommendations
print("\nTop 10 recommended products (WEIGHTED SCORING):")
print("-" * 60)
recommendations = recommend_for_user(sample_user, user_item_matrix, filtered_similarity, top_n=10)
for i, (prod_id, score) in enumerate(recommendations, 1):
    print(f"{i:2}. {prod_id}  (score: {score:.4f})")

if not recommendations:
    print("No recommendations found - user may have purchased all available products")

Testing IMPROVED recommendations for user: 00865C51DFA045A7AD4F510E366CCFCB

User has interacted with 33 products
Top 5 interactions:
  1. 307BB36E6EE44BE5897D... (score: 13.20)
  2. A6BC68C1D5EC48B899BB... (score: 13.20)
  3. E2C2445A70D84EC8B735... (score: 11.80)
  4. BA6F86D76CA44499AEE9... (score: 11.50)
  5. 903219F6DB3F451FB9CE... (score: 10.60)

Top 10 recommended products (WEIGHTED SCORING):
------------------------------------------------------------
No recommendations found - user may have purchased all available products
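
The nested loops in `recommend_for_user` compute, for each candidate, the sum of `interaction_strength * similarity` over the items the user interacted with. That is a matrix-vector product, so it can be sketched in vectorised form. The function below is an illustrative alternative, not the notebook's implementation: the name `score_candidates` is hypothetical, it works on positional indices rather than product IDs, and it assumes non-negative similarities (which holds for cosine similarity over non-negative interaction scores).

```python
import numpy as np


def score_candidates(user_vector, similarity, top_n=10):
    """Return (item_index, score) pairs for items the user has not interacted with."""
    user_vector = np.asarray(user_vector, dtype=float)
    similarity = np.asarray(similarity, dtype=float)

    # Score of candidate c = sum over interacted items i of strength_i * similarity[i, c]
    scores = similarity.T @ user_vector

    # Never recommend items the user already interacted with
    seen = user_vector > 0
    scores[seen] = 0.0

    # Keep candidates with a positive score and rank them from highest to lowest
    candidates = np.flatnonzero(scores > 0)
    ranked = candidates[np.argsort(-scores[candidates], kind="stable")]
    return [(int(i), float(scores[i])) for i in ranked[:top_n]]


toy_similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# A user who interacted only with item 0 (strength 2.0)
print(score_candidates([2.0, 0.0, 0.0], toy_similarity))

# A user who interacted with every item has no candidates left
print(score_candidates([1.0, 3.0, 2.0], toy_similarity))
```

The second call returns an empty list for the same reason the sample user above gets no recommendations: that user has already interacted with all 33 products.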


## 9.2 Fallback Recommendation Strategy

For edge cases where personalized recommendations are not available (e.g., the user has already purchased all products, or the user is new), we fall back to popularity-based recommendations.

In [60]:
# Fallback: Get popular products
def get_popular_products(user_item_df, exclude_products=None, top_n=10):
    """
    Get most popular products based on total interactions.
    Used as fallback when personalized recommendations are not available.
    
    Parameters:
        user_item_df: User-item interaction matrix
        exclude_products: List of product IDs to exclude (e.g., already purchased)
        top_n: Number of recommendations to return
    
    Returns:
        List of (product_id, popularity_score) tuples
    """
    # Calculate popularity: sum of all user interactions per product
    popularity_scores = user_item_df.sum(axis=0).sort_values(ascending=False)
    
    # Filter out excluded products
    if exclude_products is not None:
        popularity_scores = popularity_scores.drop(exclude_products, errors='ignore')
    
    # Get top N
    top_products = popularity_scores.head(top_n)
    return [(prod_id, score) for prod_id, score in top_products.items()]

# Enhanced recommend function with fallback
def recommend_for_user_with_fallback(user_id, user_item_df, similarity_df, top_n=10):
    """
    Generate personalized recommendations with popularity-based fallback.
    
    Parameters:
        user_id: Target user ID
        user_item_df: User-item interaction matrix
        similarity_df: Item similarity matrix
        top_n: Number of recommendations
    
    Returns:
        Tuple: (recommendations_list, recommendation_type)
        - recommendation_type: 'personalized' or 'popular'
    """
    # Try personalized recommendations first
    personalized_recs = recommend_for_user(user_id, user_item_df, similarity_df, top_n)
    
    if personalized_recs:
        return personalized_recs, 'personalized'
    
    # Fallback to popular products
    # Exclude products the user already interacted with; if that is every product,
    # the fallback list is empty as well
    if user_id in user_item_df.index:
        user_purchases = user_item_df.loc[user_id]
        purchased_products = user_purchases[user_purchases > 0].index.tolist()
    else:
        purchased_products = None
    
    popular_recs = get_popular_products(user_item_df, exclude_products=purchased_products, top_n=top_n)
    return popular_recs, 'popular'

# Test with the most active user (likely to have interacted with every product)
print("Testing fallback strategy...")
print("=" * 60)

# Find the user with the highest total interaction score
user_with_most_purchases = user_item_matrix.sum(axis=1).idxmax()
num_purchased = (user_item_matrix.loc[user_with_most_purchases] > 0).sum()

print(f"\nUser: {user_with_most_purchases}")
print(f"Products purchased: {num_purchased}/{len(product_ids)}")

recommendations, rec_type = recommend_for_user_with_fallback(
    user_with_most_purchases, user_item_matrix, filtered_similarity, top_n=10
)

print(f"\nRecommendation type: {rec_type.upper()}")
print("Top 10 recommendations:")
print("-" * 60)
for i, (prod_id, score) in enumerate(recommendations, 1):
    print(f"{i:2}. {prod_id}  (score: {score:.4f})")

# Test popular products for new user
print("\n" + "=" * 60)
print("\nTesting for NEW USER (no purchase history):")
print("-" * 60)
popular_products = get_popular_products(user_item_matrix, top_n=10)
for i, (prod_id, score) in enumerate(popular_products, 1):
    print(f"{i:2}. {prod_id}  (popularity: {score:.2f})")

Testing fallback strategy...

User: E06384C1443347D3B03BE62D34E75B9C
Products purchased: 33/33

Recommendation type: POPULAR
Top 10 recommendations:
------------------------------------------------------------


Testing for NEW USER (no purchase history):
------------------------------------------------------------
 1. A6BC68C1D5EC48B899BB17DE744A9D84  (popularity: 1303.90)
 2. 307BB36E6EE44BE5897DE0E1FDD7B270  (popularity: 1258.20)
 3. BA6F86D76CA44499AEE9108F60A9A476  (popularity: 1224.40)
 4. 8878D08F78E147E5BB787FFAAE109482  (popularity: 1211.50)
 5. E2C2445A70D84EC8B73536B3F0C97327  (popularity: 1210.10)
 6. 5D40F0621A4A4882860B165E7EA725A7  (popularity: 332.30)
 7. F70B514C60A74B27B5FD93A8EF0D6F94  (popularity: 330.70)
 8. F5723D267940488BBAF0A7750DF52F50  (popularity: 326.20)
 9. F221EFECEB20438D989E78FEB513E1AF  (popularity: 321.60)
10. 5820158E3B1C46E28D6C6CCA87477D2A  (popularity: 321.00)


## 9.3 Model Evaluation Metrics

Evaluate recommendation quality using basic indicators: coverage, diversity, and personalization.

In [61]:
# NOTE: Basic evaluation metrics for collaborative filtering.
# There is no train/test split in the current setup, so these metrics are
# computed on the same data used for training (for demonstration only).
# In production, use a proper train/test split.

print("MODEL EVALUATION SUMMARY")
print("=" * 70)

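# 1. Coverage: What percentage of products can be recommended?
# Caveat: the diagonal of filtered_similarity is 1.0, so every row count below
# includes the product itself; coverage computed this way is always 100%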
recommendable_products = (filtered_similarity > 0).sum(axis=1)
coverage = (recommendable_products > 0).sum() / len(product_ids)
print(f"\n1. COVERAGE:")
print(f"   Products that can be recommended: {(recommendable_products > 0).sum()}/{len(product_ids)}")
print(f"   Coverage: {coverage:.2%}")
print(f"   Avg similar products per item: {recommendable_products[recommendable_products > 0].mean():.1f}")

# 2. Diversity: Average diversity of recommendations
# Calculate average pairwise distance in recommendations
def calculate_diversity(similarity_df, top_n=10):
    """Calculate average diversity of top-N recommendations"""
    diversities = []
    for product_id in similarity_df.index:
        similar_items = get_similar_products(product_id, similarity_df, top_n=top_n)
        if len(similar_items) >= 2:
            # Calculate avg pairwise dissimilarity
            similarities_in_recs = []
            for i, (prod1, _) in enumerate(similar_items):
                for prod2, _ in similar_items[i+1:]:
                    if prod1 in similarity_df.index and prod2 in similarity_df.columns:
                        sim = similarity_df.loc[prod1, prod2]
                        similarities_in_recs.append(sim)
            if similarities_in_recs:
                avg_sim = np.mean(similarities_in_recs)
                diversity = 1 - avg_sim  # Diversity = 1 - similarity
                diversities.append(diversity)
    return np.mean(diversities) if diversities else 0

diversity = calculate_diversity(filtered_similarity, top_n=10)
print(f"\n2. DIVERSITY:")
print(f"   Average diversity score: {diversity:.4f}")
print(f"   (0 = all recommendations identical, 1 = completely different)")

# 3. Personalization: How different are recommendations for different users?
def calculate_personalization(user_item_df, similarity_df, sample_users=20, top_n=10):
    """Calculate how personalized recommendations are across users"""
    user_rec_sets = []
    sampled_users = user_item_df.index[:sample_users].tolist()
    
    for user_id in sampled_users:
        recs = recommend_for_user(user_id, user_item_df, similarity_df, top_n=top_n)
        rec_set = set([prod_id for prod_id, _ in recs])
        user_rec_sets.append(rec_set)
    
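    # Calculate Jaccard distance between all pairs
    # (pairs where both sets are empty are skipped; a pair with exactly one
    # empty set has distance 1)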
    if len(user_rec_sets) < 2:
        return 0
    
    distances = []
    for i in range(len(user_rec_sets)):
        for j in range(i+1, len(user_rec_sets)):
            intersection = len(user_rec_sets[i] & user_rec_sets[j])
            union = len(user_rec_sets[i] | user_rec_sets[j])
            if union > 0:
                jaccard_sim = intersection / union
                distances.append(1 - jaccard_sim)
    
    return np.mean(distances) if distances else 0

personalization = calculate_personalization(user_item_matrix, filtered_similarity, sample_users=20)
print(f"\n3. PERSONALIZATION:")
print(f"   Average personalization score: {personalization:.4f}")
print(f"   (0 = all users get same recs, 1 = completely different)")

# 4. Model Statistics
# Count each unordered pair once and exclude the diagonal (self-similarity)
total_pairs = len(product_ids) * (len(product_ids) - 1) // 2
nonzero_pairs = int(np.triu(filtered_similarity.values > 0, k=1).sum())
print(f"\n4. MODEL STATISTICS:")
print(f"   Total product pairs: {total_pairs:,}")
print(f"   Non-zero similarities: {nonzero_pairs:,}")
print(f"   Similarity density: {nonzero_pairs / total_pairs:.2%}")

print("\n" + "=" * 70)
print("\nNOTE: These are basic quality indicators.")
print("For production, implement proper train/test split and calculate:")
print("  - Precision@K, Recall@K")
print("  - NDCG (Normalized Discounted Cumulative Gain)")
print("  - Hit Rate")
print("  - A/B testing in production")
print("=" * 70)

MODEL EVALUATION SUMMARY

1. COVERAGE:
   Products that can be recommended: 33/33
   Coverage: 100.00%
   Avg similar products per item: 33.0

2. DIVERSITY:
   Average diversity score: 0.1631
   (0 = all recommendations identical, 1 = completely different)

3. PERSONALIZATION:
   Average personalization score: 1.0000
   (0 = all users get same recs, 1 = completely different)

4. MODEL STATISTICS:
   Total product pairs: 528
   Non-zero similarities: 528
   Similarity density: 100.00%


NOTE: These are basic quality indicators.
For production, implement proper train/test split and calculate:
  - Precision@K, Recall@K
  - NDCG (Normalized Discounted Cumulative Gain)
  - Hit Rate
  - A/B testing in production
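
The note above lists Precision@K, Recall@K, and hit rate as metrics to compute once a train/test split exists. As a rough sketch of what those computations look like, the helpers below score a ranked recommendation list against a set of held-out items. The function names and the toy data are assumptions for illustration; the notebook does not define them, and how the held-out set is built (for example, each user's most recent purchases) is left open.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user.

    recommended: ranked list of item IDs; relevant: set of held-out item IDs.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def hit_rate_at_k(recs_by_user, held_out_by_user, k):
    """Fraction of users with at least one held-out item in their top K."""
    users = [u for u in held_out_by_user if held_out_by_user[u]]
    if not users:
        return 0.0
    hits = sum(
        1 for u in users
        if set(recs_by_user.get(u, [])[:k]) & set(held_out_by_user[u])
    )
    return hits / len(users)


# Toy example: u1 has one hit ("B") in the top 3, u2 has none
toy_recs = {"u1": ["A", "B", "C"], "u2": ["D", "E", "F"]}
toy_held_out = {"u1": {"B", "Z"}, "u2": {"Q"}}

p, r = precision_recall_at_k(toy_recs["u1"], toy_held_out["u1"], k=3)
toy_hit_rate = hit_rate_at_k(toy_recs, toy_held_out, k=3)
print(p, r, toy_hit_rate)
```

For `u1`, one of the three recommended items is relevant (precision 1/3) and one of the two held-out items is found (recall 1/2). One of the two users gets a hit, so the hit rate is 1/2.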


## 10. Save Model to Pickle File

In [62]:
# Prepare model data for serialization
model_data = {
    'similarity_matrix': filtered_similarity,
    'product_ids': product_ids,
    'user_item_matrix': user_item_matrix,
    'trained_at': datetime.now().isoformat(),
    'n_users': n_users,
    'n_products': n_products,
    'n_interactions': int(n_interactions),
    'sparsity': float(sparsity),
    'min_co_occurrence': MIN_CO_OCCURRENCE,
    'use_hybrid': True,
    'purchase_weight': 1.0,
    'view_weight': 0.3
}

# Create models directory if it does not exist (including missing parents)
models_dir = Path('../models')
models_dir.mkdir(parents=True, exist_ok=True)

# Save to pickle file
model_path = models_dir / 'item_based_cf.pkl'
print(f"Saving model to: {model_path.absolute()}")

with open(model_path, 'wb') as f:
    pickle.dump(model_data, f)

# Verify file size
file_size = model_path.stat().st_size / (1024 * 1024)  # Convert to MB
print(f"\nModel saved successfully!")
print(f"File size: {file_size:.2f} MB")
print(f"\nModel can now be loaded and used for serving recommendations.")

Saving model to: d:\work-space\HTTM\ml-recommendation\notebooks\..\models\item_based_cf.pkl

Model saved successfully!
File size: 0.04 MB

Model can now be loaded and used for serving recommendations.


## 11. Verify Saved Model

In [63]:
# Test loading the saved model
print("Testing model loading...\n")

with open(model_path, 'rb') as f:
    loaded_model = pickle.load(f)

print("Model loaded successfully!")
print(f"\nModel metadata:")
print(f"  Trained at: {loaded_model['trained_at']}")
print(f"  Users: {loaded_model['n_users']:,}")
print(f"  Products: {loaded_model['n_products']:,}")
print(f"  Interactions: {loaded_model['n_interactions']:,}")
print(f"  Sparsity: {loaded_model['sparsity']:.2%}")
print(f"  Min co-occurrence: {loaded_model['min_co_occurrence']}")
print(f"  Hybrid mode: {loaded_model['use_hybrid']}")

print("\nModel is ready for deployment!")

Testing model loading...

Model loaded successfully!

Model metadata:
  Trained at: 2025-11-05T02:38:55.962590
  Users: 102
  Products: 33
  Interactions: 3,331
  Sparsity: 1.04%
  Min co-occurrence: 5
  Hybrid mode: True

Model is ready for deployment!
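
When the model is loaded outside this notebook, it can help to check that the file contains what the serving code expects. The sketch below is one possible way to do that. The helper name `load_model` and the choice of `REQUIRED_KEYS` are assumptions; the keys themselves are taken from the `model_data` dictionary saved above. Note that `pickle.load` can execute arbitrary code, so it should only be used on files you created yourself.

```python
import pickle
import tempfile
from pathlib import Path

# A subset of the keys the notebook stores in model_data; extend as needed
REQUIRED_KEYS = {"similarity_matrix", "product_ids", "user_item_matrix", "trained_at"}


def load_model(path):
    """Load a pickled model dict and check that the expected keys are present.

    Only unpickle trusted files: pickle can run arbitrary code on load.
    """
    with open(path, "rb") as f:
        model = pickle.load(f)
    if not isinstance(model, dict):
        raise ValueError("model file does not contain a dict")
    missing = REQUIRED_KEYS - model.keys()
    if missing:
        raise ValueError(f"model file is missing keys: {sorted(missing)}")
    return model


# Round-trip demonstration with a stand-in model written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_model.pkl"
    demo_model = {key: None for key in REQUIRED_KEYS}
    with open(demo_path, "wb") as f:
        pickle.dump(demo_model, f)
    loaded_demo = load_model(demo_path)

print(sorted(loaded_demo))
```

In the notebook itself, the equivalent call would be `load_model(model_path)`.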


## Summary

Training completed! The model has been saved to `models/item_based_cf.pkl`.

**Next steps:**
1. Create `recommender.py` class to load and serve the model
2. Create `api.py` FastAPI service to expose REST endpoints
3. Integrate with Spring Boot backend
4. Test end-to-end recommendations

**Model characteristics:**
- Algorithm: Item-Based Collaborative Filtering
- Similarity: Cosine similarity with co-occurrence filtering
- Data: Purchase history + view history (hybrid approach)
- Ready for real-time inference