# Group 17
### Eyad Medhat 221100279 / Hady Aly 221101190 / Mohamed Mahfouz 221101743 / Omar Mady 221100745

# Part 2: Content-Based Recommendation System

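This notebook implements a content-based recommendation system as per the project requirements: TF-IDF item features combined with price and an `is_green` flag, rating-weighted user profiles, cosine-similarity ranking, and an item-based k-NN rating predictor.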

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.





In [2]:
import pandas as pd
import numpy as np
import re
import nltk
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download necessary NLTK data
nltk.download('stopwords')

# --- LOAD YOUR DATA HERE ---
data_path = r'../data/'
df = pd.read_csv(os.path.join(data_path, 'Amazon_health&household_label_encoded.csv'))

print("Data Loaded. Shape:", df.shape)
df.head()

Data Loaded. Shape: (14554, 7)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hadye\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,user_id,item,item_id_encoded,rating,price,text,is_green
0,AEV4DP5E3FJH6FHDLXUYQTDEQYCQ,Sonic Handheld Percussion Massage Gun - Deep T...,827,2,79.99,"This product worked great when it worked, but ...",1
1,AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q,DUDE Wipes On-The-Go Flushable Wet Wipes - 1 P...,255,5,6.48,These are amazing for travel or for keeping in...,1
2,AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q,"Sleep Mask for Side Sleeper, 100% Blackout 3D ...",814,5,12.69,These are great! My other sleep mask pressed o...,0
3,AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q,"Cottonelle Freshfeel Flushable Wet Wipes, Adul...",226,5,15.79,These are really good quality. Do not tear lik...,1
4,AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q,"Silk Sleep Eye Mask for Men Women, Comfortable...",808,5,8.88,Great for travel. Super soft and silky. Has ad...,0


In [3]:
# 1. Separate Unique Items from Ratings
# We need one row per item to build the feature matrix
df_items = df[['item_id_encoded', 'item', 'price', 'text', 'is_green']].drop_duplicates(subset='item_id_encoded').sort_values('item_id_encoded').set_index('item_id_encoded')


# --- FIX 1: Handle NaNs in Text ---
# Fill missing text with empty string
df_items['text'] = df_items['text'].fillna('')

# --- FIX 2: Handle NaNs in Price ---
# Fill missing prices with the Median price (better than 0)
median_price = df_items['price'].median()
df_items['price'] = df_items['price'].fillna(median_price)

# --- FIX 3: Handle NaNs in Is_Green ---
# Fill missing flags with 0 (not green); the column is stored as 0/1 integers
df_items['is_green'] = df_items['is_green'].fillna(0)

print("Data Cleaned. Generating Matrix...")


# 2. Re-create the Item-Feature Matrix (from Phase 3)
# A. Text (TF-IDF)
tfidf = TfidfVectorizer(stop_words='english', max_features=100) # Limited to 100 for speed on large data
text_matrix = tfidf.fit_transform(df_items['text']).toarray()  # NaNs were already filled above

# B. Price (Normalized)
scaler = MinMaxScaler()
price_vec = scaler.fit_transform(df_items[['price']])

# C. Is_Green (Binary)
green_vec = df_items[['is_green']].astype(int).values

# D. Combine into "item_features"
item_features = np.hstack([text_matrix, price_vec, green_vec])

print(f"Item-Feature Matrix Ready: {item_features.shape}")

Data Cleaned. Generating Matrix...
Item-Feature Matrix Ready: (1000, 102)


In [4]:
import numpy as np

# ==========================================
# 4. USER PROFILE CONSTRUCTION
# ==========================================

def build_user_profiles(df, item_features_matrix):
    print("--- 4. User Profile Construction ---")
    
    # 1. Get unique users
    unique_users = df['user_id'].unique()
    user_profiles = {}
    
    # ---------------------------------------------------------
    # 4.2 STRATEGY: Handle Cold-Start Users (Popular/Average)
    # ---------------------------------------------------------
    # Since we lack demographics, we calculate the "Global Average Item"
    # This represents the "average taste" of the entire catalog.
    # Ideally, you could weigh this by popularity (rating count), 
    # but a simple mean is sufficient for this requirement.
    cold_start_vector = np.mean(item_features_matrix, axis=0)
    
    print(f"   -> Cold-Start Strategy: Using Global Average Item Vector.")
    
    # ---------------------------------------------------------
    # 4.1 STRATEGY: Build User Profiles (Weighted Average)
    # ---------------------------------------------------------
    for uid in unique_users:
        # Get user history
        user_history = df[df['user_id'] == uid]
        
        # If user has no ratings (or valid item_ids), treat as Cold Start
        if user_history.empty:
            user_profiles[uid] = cold_start_vector
            continue
            
        # Get indices and ratings
        # Ensure indices are integers for matrix lookup
        item_indices = user_history['item_id_encoded'].values.astype(int)
        ratings = user_history['rating'].values.reshape(-1, 1)
        
        # ERROR HANDLING: Keep only item IDs that are valid row indices
        # (out-of-range IDs appear if df contains items not in our feature matrix;
        # negative IDs would otherwise silently index from the end of the matrix)
        valid_mask = (item_indices >= 0) & (item_indices < item_features_matrix.shape[0])
        item_indices = item_indices[valid_mask]
        ratings = ratings[valid_mask]
        
        if len(item_indices) == 0:
            user_profiles[uid] = cold_start_vector
            continue

        # Fetch vectors for items rated by this user
        rated_item_vectors = item_features_matrix[item_indices]
        
        # CALCULATE WEIGHTED AVERAGE
        # Formula: Sum(Item_Vector * Rating) / Sum(Ratings)
        weighted_sum = np.sum(rated_item_vectors * ratings, axis=0)
        total_rating_val = np.sum(ratings)
        
        if total_rating_val == 0:
            user_profiles[uid] = cold_start_vector
        else:
            user_profiles[uid] = weighted_sum / total_rating_val
            
    print(f"   -> Built profiles for {len(user_profiles)} users.")
    return user_profiles

# --- EXECUTE PHASE 4 ---
# Input: Your main dataframe (df) and the matrix from Phase 3 (item_features)
user_profiles = build_user_profiles(df, item_features)

# --- VERIFICATION ---
# Let's look at the first user's profile
sample_uid = list(user_profiles.keys())[0]
print(f"\nSample Profile (User {sample_uid}):")
print(f"Vector Shape: {user_profiles[sample_uid].shape}")
print(f"First 5 Features: {user_profiles[sample_uid][:5]}")

--- 4. User Profile Construction ---
   -> Cold-Start Strategy: Using Global Average Item Vector.
   -> Built profiles for 10000 users.

Sample Profile (User AEV4DP5E3FJH6FHDLXUYQTDEQYCQ):
Vector Shape: (102,)
First 5 Features: [0. 0. 0. 0. 0.]


In [5]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# ==========================================
# 5. SIMILARITY & RECOMMENDATION
# ==========================================

def get_recommendations(user_id, user_profiles, item_features_matrix, df_full, top_n=10):
    """
    Generates content-based recommendations for a specific user.
    """
    # 1. Get the User's Profile Vector
    if user_id not in user_profiles:
        print(f"User {user_id} not found in profiles.")
        return []
    
    # Reshape is needed because cosine_similarity expects a 2D array (1 sample, n features)
    user_vector = user_profiles[user_id].reshape(1, -1)
    
    # ---------------------------------------------------------
    # 5.1 COMPUTE SIMILARITY
    # ---------------------------------------------------------
    # Calculate similarity between this User Vector and ALL Item Vectors
    # Result shape: (1, num_items) -> flatten to 1D array
    similarity_scores = cosine_similarity(user_vector, item_features_matrix).flatten()
    
    # ---------------------------------------------------------
    # 5.2 GENERATE TOP-N RECOMMENDATIONS
    # ---------------------------------------------------------
    
    # Get the set of items the user has ALREADY rated
    # We don't want to recommend things they already know about
    # (a set keeps the membership test below O(1) per item)
    rated_items = set(df_full.loc[df_full['user_id'] == user_id, 'item_id_encoded'])
    
    # Create a list of tuples: (item_id_encoded, similarity_score)
    # We enumerate to get the index (which corresponds to item_id_encoded)
    all_scores = list(enumerate(similarity_scores))
    
    # Filter out already rated items
    candidates = [
        (item_id, score) 
        for item_id, score in all_scores 
        if item_id not in rated_items
    ]
    
    # Sort by Score (Descending) -> Highest similarity first
    candidates.sort(key=lambda x: x[1], reverse=True)
    
    # Slice the top N
    top_recommendations = candidates[:top_n]
    
    return top_recommendations

# ==========================================
# EXECUTION EXAMPLE
# ==========================================

# Select a sample user to test
sample_user = df['user_id'].iloc[0]

# --- Get Top 10 Recommendations ---
recs_10 = get_recommendations(sample_user, user_profiles, item_features, df, top_n=10)

print(f"\n--- Top 10 Recommendations for User {sample_user} ---")
print(f"{'Item ID':<10} | {'Score':<8} | {'Item Name (Lookup)'}")
print("-" * 50)

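# We need a helper to look up item names from IDs
# Deduplicate on the ID alone so each ID maps to exactly one name
# (otherwise .loc could return a Series instead of a string)
item_lookup = df[['item_id_encoded', 'item']].drop_duplicates(subset='item_id_encoded').set_index('item_id_encoded')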

for item_id, score in recs_10:
    # Handle case where item_id might be out of lookup range (sanity check)
    try:
        item_name = item_lookup.loc[item_id, 'item']
        # Truncate long names for cleaner print
        item_name = (item_name[:30] + '..') if len(item_name) > 30 else item_name
    except KeyError:
        item_name = "Unknown Item"
        
    print(f"{item_id:<10} | {score:.4f}   | {item_name}")


# --- Get Top 20 Recommendations ---
recs_20 = get_recommendations(sample_user, user_profiles, item_features, df, top_n=20)
print(f"\n(Generated Top-20 list. Count: {len(recs_20)})")


--- Top 10 Recommendations for User AEV4DP5E3FJH6FHDLXUYQTDEQYCQ ---
Item ID    | Score    | Item Name (Lookup)
--------------------------------------------------
428        | 0.8180   | Green Gobbler Septic Saver Bac..
40         | 0.8137   | Affresh Washing Machine Cleane..
425        | 0.7759   | Green Gobbler Liquid Hair Drai..
420        | 0.7122   | Grandma's Secret Wrinkle Remov..
692        | 0.7101   | Philips Sonicare Genuine E-Ser..
645        | 0.7074   | Oral-B Dual Clean Replacement ..
803        | 0.7061   | Seventh Generation Dish Soap L..
388        | 0.7055   | Fractionated Coconut Oil Premi..
855        | 0.7053   | Swedish Wholesale Swedish Dish..
185        | 0.7051   | Cascade Complete Dishwasher Po..

(Generated Top-20 list. Count: 20)


In [13]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

# ==========================================
# 6. k-NEAREST NEIGHBORS (Item-Based)
# ==========================================
print("\n--- 6. k-Nearest Neighbors (Item-Based) ---")

# A. FIT THE MODEL
# We fit the model on the ITEM FEATURES matrix (from Phase 3)
# metric='cosine' is crucial because we care about the angle (similarity), not distance
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(item_features)

def predict_rating_knn(user_id, target_item_id, df_full, knn_model, item_features, k=10):
    """
    Predicts the rating a user would give to 'target_item_id' 
    based on their ratings of the k-nearest similar items.
    """
    # 1. Find k nearest neighbors for the target item
    # Input must be 2D array (1, n_features)
    target_vec = item_features[target_item_id].reshape(1, -1)
    
    # We ask for k+1 neighbors because the closest neighbor is ALWAYS the item itself (distance=0)
    # We will skip the first one later.
    distances, indices = knn_model.kneighbors(target_vec, n_neighbors=k+1)
    
    neighbor_ids = indices.flatten()
    neighbor_dists = distances.flatten()
    
    # 2. Get the user's rating history
    user_history = df_full[df_full['user_id'] == user_id]
    
    weighted_sum = 0
    similarity_sum = 0
    count_matches = 0
    
    # 3. Loop through neighbors to calculate Weighted Average
    # Start at index 1 to skip the target item itself
    for i in range(1, len(neighbor_ids)):
        n_id = neighbor_ids[i]
        n_dist = neighbor_dists[i]
        
        # Convert distance to similarity (Cosine Similarity = 1 - Cosine Distance)
        similarity = 1 - n_dist
        
        # Check if the user has actually rated this neighbor
        if n_id in user_history['item_id_encoded'].values:
            # Get the actual rating (1-5)
            actual_rating = user_history[user_history['item_id_encoded'] == n_id]['rating'].values[0]
            
            # Accumulate Weighted Sum
            weighted_sum += (similarity * actual_rating)
            similarity_sum += similarity
            count_matches += 1
            
    # 4. Final Prediction Logic
    if count_matches == 0:
        # Fallback: If user hasn't rated ANY similar items, return user's average rating
        if not user_history.empty:
            return user_history['rating'].mean()
        return df_full['rating'].mean() # Global average fallback
    
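    if similarity_sum == 0:
        # All matched neighbors have zero similarity: fall back to the user's
        # mean rating instead of returning 0, which is outside the 1-5 scale
        return user_history['rating'].mean()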

    predicted_rating = weighted_sum / similarity_sum
    return predicted_rating

# --- EXECUTION: Test with k=10 and k=20 ---

# Setup: Pick a user and an item they have NOT rated yet
sample_user = df['user_id'].iloc[0]
all_items = set(df['item_id_encoded'].unique())
user_rated_items = set(df[df['user_id'] == sample_user]['item_id_encoded'])

# Find a candidate item (pick the smallest unrated ID so the choice is reproducible;
# indexing into a list built from a set depends on arbitrary set ordering)
candidate_item = min(all_items - user_rated_items)

print(f"Test User: {sample_user}")
print(f"Target Item ID: {candidate_item}")

# Test k=10
pred_10 = predict_rating_knn(sample_user, candidate_item, df, knn_model, item_features, k=10)
print(f"Predicted Rating (k=10): {pred_10:.4f}")

# Test k=20
pred_20 = predict_rating_knn(sample_user, candidate_item, df, knn_model, item_features, k=20)
print(f"Predicted Rating (k=20): {pred_20:.4f}")


--- 6. k-Nearest Neighbors (Item-Based) ---
Test User: AEV4DP5E3FJH6FHDLXUYQTDEQYCQ
Target Item ID: 0
Predicted Rating (k=10): 2.0000
Predicted Rating (k=20): 2.0000


In [14]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# ==========================================
# 7.1 STEP-BY-STEP NUMERICAL EXAMPLE
# ==========================================
print("="*60)
print("       PART 7: COMPLETE NUMERICAL EXAMPLE")
print("="*60)

# --- STEP 1: SAMPLE DATA ---
print("\n[STEP 1] Sample Data Representation")
print("We select 5 items. The User has rated Item A and Item B.")

data = {
    'item_id': ['A', 'B', 'C', 'D', 'E'],
    'text': [
        'green eco cotton',   # Item A (Rated 5.0)
        'green eco bamboo',   # Item B (Rated 4.0)
        'red plastic cheap',  # Item C (Unrated - Dissimilar)
        'blue denim jeans',   # Item D (Unrated - Dissimilar)
        'green cotton shirt'  # Item E (Unrated - Target Recommendation)
    ],
    'price': [20, 25, 10, 50, 22],
    'is_green': [1, 1, 0, 0, 1]
}

# User Ratings: Likes "Green/Eco" items
ratings = {'A': 5.0, 'B': 4.0} 

df_sample = pd.DataFrame(data)
print(df_sample)
print(f"\nUser Ratings: {ratings}")


# --- STEP 2: TF-IDF CALCULATION ---
print("\n" + "-"*30)
print("[STEP 2] TF-IDF Calculation")
print("-" * 30)

# We use a simple tokenizer to keep vocabulary small
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_matrix = tfidf.fit_transform(df_sample['text']).toarray()
vocab = tfidf.get_feature_names_out()

print(f"Vocabulary ({len(vocab)} terms): {list(vocab)}")

# Show vectors for the first 3 items
print("\nTF-IDF Vectors (Sample):")
for i in range(3):
    # Formatting to 2 decimal places
    vec_str = ", ".join([f"{x:.2f}" for x in tfidf_matrix[i]])
    print(f"Item {df_sample.loc[i, 'item_id']}: [{vec_str}]")


# --- STEP 3: FULL FEATURE MATRIX ---
print("\n" + "-"*30)
print("[STEP 3] Full Item-Feature Matrix")
print("-" * 30)
print("Combining: [TF-IDF Vectors] + [Norm_Price] + [Is_Green]")

# Normalize Price (0-1)
scaler = MinMaxScaler()
price_norm = scaler.fit_transform(df_sample[['price']])

# Is_Green (already 0/1)
green_vec = df_sample[['is_green']].values

# Combine
# Use a separate name so the toy matrix does not overwrite the real
# `item_features` that the recommendation and k-NN cells above depend on
sample_features = np.hstack([tfidf_matrix, price_norm, green_vec])

# Print Item A's full vector as an example
vec_A = sample_features[0]
print(f"Item A Full Vector (Size {len(vec_A)}):")
print(np.round(vec_A, 2))


# --- STEP 4: USER PROFILE CONSTRUCTION ---
print("\n" + "-"*30)
print("[STEP 4] User Profile Construction (Weighted Average)")
print("-" * 30)

# Get vectors for rated items A (index 0) and B (index 1)
vec_A = sample_features[0]
vec_B = sample_features[1]
rating_A = 5.0
rating_B = 4.0

# Formula: (VecA * 5 + VecB * 4) / (5 + 4)
numerator = (vec_A * rating_A) + (vec_B * rating_B)
denominator = rating_A + rating_B
user_profile = numerator / denominator

print(f"Math: (Vec_A * {rating_A} + Vec_B * {rating_B}) / {denominator}")
print(f"User Profile Vector:\n{np.round(user_profile, 2)}")


# --- STEP 5: SIMILARITY & RECOMMENDATION ---
print("\n" + "-"*30)
print("[STEP 5] Similarity & Top Recommendations")
print("-" * 30)

rec_scores = []

# Calculate Cosine Similarity for unrated items (C, D, E)
# Indices: C=2, D=3, E=4
for i in [2, 3, 4]:
    item_id = df_sample.loc[i, 'item_id']
    vec_item = sample_features[i]
    
    # Cosine Similarity Formula: dot(A, B) / (norm(A) * norm(B))
    dot_product = np.dot(user_profile, vec_item)
    norm_u = np.linalg.norm(user_profile)
    norm_i = np.linalg.norm(vec_item)
    
    score = dot_product / (norm_u * norm_i)
    rec_scores.append((item_id, score))
    
    # Print calculation for Item E (The expected winner)
    if item_id == 'E':
        print(f"\nCalculation for Item E ('green cotton shirt'):")
        print(f"   Dot Product: {dot_product:.4f}")
        print(f"   Norm(User): {norm_u:.4f}, Norm(Item): {norm_i:.4f}")
        print(f"   Score: {score:.4f}")

# Sort and Recommend
rec_scores.sort(key=lambda x: x[1], reverse=True)

print("\n--- Final Recommendations ---")
print("Rank | Item | Score  | Description")
for rank, (iid, score) in enumerate(rec_scores, 1):
    desc = df_sample[df_sample['item_id'] == iid]['text'].values[0]
    print(f"  {rank}  |  {iid}   | {score:.4f} | {desc}")

print("\n(Note: Item E is recommended #1 because it shares 'green' and 'cotton' with the User Profile.)")

       PART 7: COMPLETE NUMERICAL EXAMPLE

[STEP 1] Sample Data Representation
We select 5 items. The User has rated Item A and Item B.
  item_id                text  price  is_green
0       A    green eco cotton     20         1
1       B    green eco bamboo     25         1
2       C   red plastic cheap     10         0
3       D    blue denim jeans     50         0
4       E  green cotton shirt     22         1

User Ratings: {'A': 5.0, 'B': 4.0}

------------------------------
[STEP 2] TF-IDF Calculation
------------------------------
Vocabulary (11 terms): ['bamboo', 'blue', 'cheap', 'cotton', 'denim', 'eco', 'green', 'jeans', 'plastic', 'red', 'shirt']

TF-IDF Vectors (Sample):
Item A: [0.00, 0.00, 0.00, 0.61, 0.00, 0.61, 0.51, 0.00, 0.00, 0.00, 0.00]
Item B: [0.69, 0.00, 0.00, 0.00, 0.00, 0.56, 0.46, 0.00, 0.00, 0.00, 0.00]
Item C: [0.00, 0.00, 0.58, 0.00, 0.00, 0.00, 0.00, 0.00, 0.58, 0.58, 0.00]

------------------------------
[STEP 3] Full Item-Feature Matrix
----------------

## 3. Feature Extraction and Vector Space Model
### 3.1. Text Feature Extraction (TF-IDF)
We use **TF-IDF vectors** for the `text` column with basic preprocessing (tokenization and English stop-word removal).

In [6]:
# print("Performing TF-IDF Vectorization...")
# tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
# tfidf_matrix = tfidf.fit_transform(df['text'].fillna(''))
# print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

### 3.3. Create Item-Feature Matrix
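The active pipeline (cell `In [3]`) builds the item-feature matrix by stacking three blocks column-wise: the 100 TF-IDF text features, the min-max normalized price, and the binary `is_green` flag, giving a 1000 x 102 matrix. The commented-out cell below is an earlier text-only variant.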

In [7]:
# item_feature_matrix = tfidf_matrix
# print(f"Final Item-Feature Matrix Shape: {item_feature_matrix.shape}")
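As a rough illustration of how a combined item-feature matrix can be assembled, the sketch below stacks a stand-in text block with a min-max scaled price and a binary flag. All values and `demo_*` names are hypothetical, and the scaling is done by hand in NumPy rather than with `MinMaxScaler`.

```python
import numpy as np

# Hypothetical values for three items
demo_text_part = np.array([[0.6, 0.8],
                           [1.0, 0.0],
                           [0.0, 1.0]])   # stand-in for TF-IDF rows
demo_prices = np.array([10.0, 30.0, 50.0])
demo_is_green = np.array([1, 0, 1])

# Min-max scale price to [0, 1] so it is on a scale comparable to TF-IDF weights
demo_price_scaled = (demo_prices - demo_prices.min()) / (demo_prices.max() - demo_prices.min())

# Column-wise stack: [text features | scaled price | is_green]
demo_features = np.hstack([
    demo_text_part,
    demo_price_scaled.reshape(-1, 1),
    demo_is_green.reshape(-1, 1),
])
print(demo_features.shape)
print(demo_features)
```

One consequence of this layout is that a single 0/1 flag column can carry as much weight in cosine similarity as an entire unit-length TF-IDF row, so the flag may dominate the ranking unless it is down-weighted.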

## 4. User Profile Construction
### 4.1. Build User Profiles
We use a **Weighted average of rated item features**, where the weights are the rating values.

In [8]:
# def build_user_profiles(df, feature_matrix):
#     user_profiles = {}
#     user_groups = df.groupby('user_id')
    
#     for user_id, group in user_groups:
#         indices = group.index
#         ratings = group['rating'].values.reshape(-1, 1)
        
#         # Weighted sum of features
#         user_features = feature_matrix[indices]
#         weighted_features = user_features.multiply(ratings)
#         user_profile = weighted_features.sum(axis=0) / ratings.sum()
        
#         user_profiles[user_id] = np.asarray(user_profile).flatten()
        
#     return user_profiles

# print("Building user profiles...")
# user_profiles = build_user_profiles(df, item_feature_matrix)
# print(f"Generated profiles for {len(user_profiles)} users.")
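The rating-weighted average can be checked on a tiny case. The sketch below uses two made-up item vectors and ratings (all names are hypothetical) and relies on `np.average`, which computes `sum(w_i * v_i) / sum(w_i)`.

```python
import numpy as np

# Two hypothetical item vectors the user has rated
rated_vectors = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
rated_scores = np.array([5.0, 1.0])  # ratings used directly as weights

# Weighted average: (5 * v0 + 1 * v1) / (5 + 1)
demo_profile = np.average(rated_vectors, axis=0, weights=rated_scores)
print(demo_profile)
```

Note that with raw ratings as weights, even a 1-star item pulls the profile toward its features, only less strongly than a 5-star item.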

### 4.2. Handle Cold-Start Users
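**Strategy**: Since demographic data is unavailable, a cold-start user receives a generic profile. The active implementation (`build_user_profiles`) uses the unweighted mean of all item feature vectors. The commented-out cell below is a **popular item features** variant that averages only the items most frequently rated in the dataset.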

In [9]:
# def get_popular_item_profile(df, feature_matrix, top_n=100):
#     # Identify popular items by count of appearances
#     popular_titles = df['item'].value_counts().head(top_n).index
#     popular_indices = df[df['item'].isin(popular_titles)].index
    
#     popular_profile = feature_matrix[popular_indices].mean(axis=0)
#     return np.asarray(popular_profile).flatten()

# cold_start_profile = get_popular_item_profile(df, item_feature_matrix)
# print("Cold-start profile (popular) generated.")
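To make the difference between the two cold-start options concrete, the sketch below computes both a popularity-based profile and a global-mean profile on a toy matrix. The interaction log, the `top_n` cutoff, and all names are hypothetical.

```python
import numpy as np

demo_features = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])

# Hypothetical interaction log: the item index of each rating row
rated_item_ids = np.array([0, 0, 0, 1, 2, 2])

# Count ratings per item, then keep the top_n most frequently rated items
rating_counts = np.bincount(rated_item_ids, minlength=len(demo_features))
top_n = 2
popular_ids = np.argsort(-rating_counts, kind="stable")[:top_n]

popular_profile = demo_features[popular_ids].mean(axis=0)  # mean over popular items only
global_profile = demo_features.mean(axis=0)                # mean over the whole catalog

print("Popular items:", popular_ids)
print("Popular profile:", popular_profile)
print("Global profile:", global_profile)
```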

## 5. Similarity Computation and Recommendation
### 5.1 & 5.2. Compute Similarity and Generate Top-N Recommendations
We use **Cosine similarity** between user profiles and all items. We then rank items by score and remove items already rated by the user.

In [10]:
# def get_recommendations(user_id, user_profiles, feature_matrix, df, top_n=10):
#     if user_id in user_profiles:
#         profile = user_profiles[user_id].reshape(1, -1)
#     else:
#         profile = cold_start_profile.reshape(1, -1)
        
#     # 5.1 Cosine Similarity Scores
#     scores = cosine_similarity(profile, feature_matrix).flatten()
    
#     # 5.2 Ranking and Removing already-rated items
#     rec_df = pd.DataFrame({'item_idx': range(len(df)), 'score': scores})
    
#     if user_id in user_profiles:
#         rated_indices = df[df['user_id'] == user_id].index
#         rec_df = rec_df[~rec_df['item_idx'].isin(rated_indices)]
        
#     top_indices = rec_df.sort_values(by='score', ascending=False).head(top_n)['item_idx'].values
#     return df.iloc[top_indices][['item', 'rating']]

# example_user = df['user_id'].iloc[0]
# print(f"--- Top-10 Recommendations for User {example_user} ---")
# print(get_recommendations(example_user, user_profiles, item_feature_matrix, df, top_n=10))

# print(f"\n--- Top-20 Recommendations for User {example_user} ---")
# print(get_recommendations(example_user, user_profiles, item_feature_matrix, df, top_n=20))
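The ranking step can also be written in vectorized form. The following is a sketch under stated assumptions, not the notebook's implementation: `top_n_unrated` is a hypothetical helper, item IDs are assumed to equal row indices, and cosine similarity is computed directly in NumPy.

```python
import numpy as np

def top_n_unrated(profile, features, rated_ids, top_n=2):
    """Rank items by cosine similarity to `profile`, skipping already-rated rows."""
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    # Zero-norm rows get a score of 0 instead of triggering a division by zero
    scores = np.divide(features @ profile, norms,
                       out=np.zeros(len(features)), where=norms > 0)
    # Mask rated items so they can never be selected
    scores[np.array(list(rated_ids), dtype=int)] = -np.inf
    order = np.argsort(-scores, kind="stable")[:top_n]
    return [(int(i), float(scores[i])) for i in order if np.isfinite(scores[i])]

demo_profile = np.array([1.0, 0.0])
demo_features = np.array([[1.0, 0.0],   # item 0: already rated
                          [0.0, 1.0],   # item 1: orthogonal to the profile
                          [1.0, 1.0],   # item 2: 45 degrees from the profile
                          [2.0, 0.0]])  # item 3: same direction as the profile
demo_recs = top_n_unrated(demo_profile, demo_features, rated_ids={0}, top_n=2)
print(demo_recs)
```

Item 3 ranks first even though its vector is twice as long as the profile, since cosine similarity depends only on direction.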

## 6. k-Nearest Neighbors (k-NN)
### 6.1. Implement Item-Based k-NN
Predict ratings using a weighted average of the $k$ most similar items.

In [11]:
# def predict_rating_knn(user_id, item_title, df, feature_matrix, k=10):
#     target_row = df[df['item'] == item_title].head(1)
#     if target_row.empty: return df['rating'].mean()
#     target_idx = target_row.index[0]
    
#     knn = NearestNeighbors(n_neighbors=k+1, metric='cosine')
#     knn.fit(feature_matrix)
#     distances, indices = knn.kneighbors(feature_matrix[target_idx])
    
#     user_ratings = df[df['user_id'] == user_id]
#     pred_numerator = 0
#     pred_denominator = 0
    
#     for dist, idx in zip(distances.flatten()[1:], indices.flatten()[1:]):
#         item_title_sim = df.iloc[idx]['item']
#         user_record = user_ratings[user_ratings['item'] == item_title_sim]
        
#         if not user_record.empty:
#             similarity = 1 - dist
#             pred_numerator += similarity * user_record['rating'].values[0]
#             pred_denominator += similarity
            
#     if pred_denominator == 0: return df['rating'].mean()
#     return pred_numerator / pred_denominator

# test_item = df.iloc[10]['item']
# prediction = predict_rating_knn(example_user, test_item, df, item_feature_matrix, k=10)
# print(f"Predicted rating for '{test_item}': {prediction:.2f}")
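The prediction rule is a similarity-weighted average over the neighbors the user has actually rated: the sum of `sim(i, j) * r_uj` divided by the sum of `sim(i, j)`. The sketch below illustrates it in plain NumPy instead of `NearestNeighbors`; the function name, the dictionary of ratings, and the fixed `fallback` value are assumptions, and every feature row is assumed to be non-zero.

```python
import numpy as np

def predict_rating_sketch(target_id, features, user_ratings, k=2, fallback=3.0):
    """user_ratings: dict mapping item row index -> rating (hypothetical structure)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit[target_id]      # cosine similarity to the target item
    sims[target_id] = -np.inf          # the target is never its own neighbor
    neighbors = np.argsort(-sims, kind="stable")[:k].tolist()

    # Only neighbors the user has rated contribute to the prediction
    rated = [j for j in neighbors if j in user_ratings]
    den = sum(sims[j] for j in rated)
    if den <= 0:
        return fallback
    return sum(sims[j] * user_ratings[j] for j in rated) / den

demo_features = np.array([[1.0, 0.0],
                          [1.0, 0.1],
                          [0.5, 1.0],
                          [0.9, 0.2]])
demo_ratings = {1: 5.0, 2: 1.0}

pred_k2 = predict_rating_sketch(0, demo_features, demo_ratings, k=2)
pred_k3 = predict_rating_sketch(0, demo_features, demo_ratings, k=3)
print(f"k=2: {pred_k2:.4f}, k=3: {pred_k3:.4f}")
```

With `k=2` only item 1 is both a neighbor and rated, so the prediction equals that single rating; widening to `k=3` brings in the less similar, low-rated item 2 and pulls the prediction down.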

### 6.2. Compare content-based and k-NN approaches

| Feature | Content-Based (User Profiles) | k-NN (Item-Based) |
| :--- | :--- | :--- |
| **Core Concept** | Matches a user's aggregated feature profile to every item. | Finds the $k$ items closest to a target item and averages the user's ratings of them. |
| **Pros** | Great for new items (no ratings needed), explains *why* based on content. | Produces an explicit rating prediction; local neighborhoods can capture niche similarities that an averaged profile blurs. |
| **Cons** | Can be 'boring' (always recommends similar keywords). | Needs the user to have rated at least one neighbor; otherwise it falls back to an average rating. |
| **Use Case** | Discovery based on specific interests/topics. | 'Because you rated a similar item...' logic. |

**Summary**: In this implementation, the **Content-Based** approach is more robust because it can score every item from its features alone, whereas the **k-NN rating prediction** only uses neighbors the user has actually rated. With sparse rating histories, the $k$ nearest items are often all unrated by the user, in which case `predict_rating_knn` falls back to the user's mean rating.

## 7. Complete Numerical Example
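Step-by-step example using a tiny set of five toy items (A to E), two of which the user has rated. The commented-out cell below is an earlier variant that used three items from the dataset.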

In [12]:
# print("--- Step 7.1: Sample Item Descriptions ---")
# mini_items = df['item'].unique()[:3]
# print(mini_items)

# print("\n--- Step 7.2: TF-IDF Calculation (Sample 5 terms) ---")
# mini_tfidf = tfidf.transform(mini_items).toarray()
# print(pd.DataFrame(mini_tfidf[:, :5], columns=tfidf.get_feature_names_out()[:5], index=mini_items))

# print("\n--- Step 7.3: User Profile (from 3 items with ratings 5, 4, 3) ---")
# user_ratings_val = np.array([5, 4, 3])
# user_mini_profile = np.average(mini_tfidf, axis=0, weights=user_ratings_val)
# print(f"User Profile Vector (first 5 terms): {user_mini_profile[:5]}")

# print("\n--- Step 7.4: Similarity Scores ---")
# mini_scores = cosine_similarity(user_mini_profile.reshape(1, -1), mini_tfidf).flatten()
# print(pd.Series(mini_scores, index=mini_items))

# print("\n--- Step 7.5: Top-5 Recommendations ---")
# print(get_recommendations(example_user, user_profiles, item_feature_matrix, df, top_n=5))
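The TF-IDF values printed for Item A in the five-item example (`0.61`, `0.61`, `0.51` for `cotton`, `eco`, `green`) can be reproduced by hand. The sketch below assumes scikit-learn's default settings, namely the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization; the variable names are hypothetical.

```python
import math

n_docs = 5
# Document frequencies in the five sample texts:
# 'green' appears in A, B, E; 'eco' in A, B; 'cotton' in A, E
doc_freq = {"green": 3, "eco": 2, "cotton": 2}

# Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Item A ('green eco cotton') contains each term once, so TF = 1 and the
# raw weight equals the IDF; dividing by the L2 norm gives the final vector
l2_norm = math.sqrt(sum(weight ** 2 for weight in idf.values()))
item_a = {term: weight / l2_norm for term, weight in idf.items()}

for term, weight in item_a.items():
    print(f"{term}: {weight:.2f}")
```

The rarer terms `eco` and `cotton` receive a higher weight than `green`, which appears in three of the five descriptions.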