# Group 17: 
# Eyad Medhat 221100279 / Hady Aly 221101190 / Mohamed Mahfouz 221101743 / Omar Mady 221100745

# Part 2: Content-Based Recommendation System

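This notebook implements a content-based recommendation system on a sample of Amazon Health & Household reviews, as per the project requirements. It covers TF-IDF item features, rating-weighted user profiles, cosine-similarity ranking, and an item-based k-NN rating predictor.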

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import scipy.sparse as sp

In [3]:
pd.set_option('display.max_columns', None)

data_path = r'../data/'
df = pd.read_csv(os.path.join(data_path, 'Amazon_health&household_sampled.csv'))
print(f"Loaded {len(df)} rows.")

Loaded 14554 rows.


## 3. Feature Extraction and Vector Space Model
### 3.1. Text Feature Extraction (TF-IDF)
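We build **TF-IDF vectors** from the `text` column with basic preprocessing: scikit-learn's default tokenization, English stop-word removal, and a vocabulary capped at the 1,000 most frequent terms. Missing values are treated as empty strings.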

In [8]:
print("Performing TF-IDF Vectorization...")
tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf.fit_transform(df['text'].fillna(''))
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

Performing TF-IDF Vectorization...
TF-IDF Matrix Shape: (14554, 1000)


### 3.3. Create Item-Feature Matrix
As specified, the item-feature matrix is constructed from the text features alone, so it is the TF-IDF matrix itself: one row per row of `df` and one column per TF-IDF term.

In [9]:
item_feature_matrix = tfidf_matrix
print(f"Final Item-Feature Matrix Shape: {item_feature_matrix.shape}")

Final Item-Feature Matrix Shape: (14554, 1000)


## 4. User Profile Construction
### 4.1. Build User Profiles
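We use a **weighted average of rated item features**, where the weights are the rating values: for a user $u$ who rated items $i$ with ratings $r_{ui}$ and feature vectors $\mathbf{x}_i$, the profile is $\mathbf{p}_u = \sum_i r_{ui}\,\mathbf{x}_i / \sum_i r_{ui}$.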

In [10]:
def build_user_profiles(df, feature_matrix):
    user_profiles = {}
    user_groups = df.groupby('user_id')
    
    for user_id, group in user_groups:
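        # Row labels double as row positions in feature_matrix; this assumes
        # df still has the default RangeIndex produced by read_csv.
        indices = group.index
        ratings = group['rating'].values.reshape(-1, 1)
        
        # Rating-weighted average of the rated items' feature vectors
        user_features = feature_matrix[indices]
        weighted_features = user_features.multiply(ratings)
        user_profile = weighted_features.sum(axis=0) / ratings.sum()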
        
        user_profiles[user_id] = np.asarray(user_profile).flatten()
        
    return user_profiles

print("Building user profiles...")
user_profiles = build_user_profiles(df, item_feature_matrix)
print(f"Generated profiles for {len(user_profiles)} users.")

Building user profiles...
Generated profiles for 10000 users.
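
The weighted average can be checked by hand on a tiny case. The sketch below uses two made-up item vectors and two made-up ratings (all `toy_` names are hypothetical) and applies the same formula as `build_user_profiles`, with dense NumPy arrays instead of a sparse matrix.

```python
import numpy as np

# Hypothetical feature vectors for two items over three terms.
toy_features = np.array([
    [1.0, 0.0, 0.0],   # item A
    [0.0, 1.0, 0.0],   # item B
])
toy_ratings = np.array([5, 1])  # the user rated A with 5 and B with 1

# Rating-weighted average: sum_i r_i * x_i / sum_i r_i
toy_profile = (toy_ratings[:, None] * toy_features).sum(axis=0) / toy_ratings.sum()
print(toy_profile)  # approximately [0.833, 0.167, 0.0]
```

Because ratings are positive, a low-rated item still pulls the profile toward its terms, only less strongly than a high-rated one.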


### 4.2. Handle Cold-Start Users
**Strategy**: Use **popular item features**. Since demographic data is unavailable, we use the average features of the items most frequently rated in the dataset.

In [12]:
def get_popular_item_profile(df, feature_matrix, top_n=100):
    # Identify popular items by count of appearances
    popular_titles = df['item'].value_counts().head(top_n).index
    popular_indices = df[df['item'].isin(popular_titles)].index
    
    popular_profile = feature_matrix[popular_indices].mean(axis=0)
    return np.asarray(popular_profile).flatten()

cold_start_profile = get_popular_item_profile(df, item_feature_matrix)
print("Cold-start profile (popular) generated.")

Cold-start profile (popular) generated.
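
The following sketch repeats the steps of `get_popular_item_profile` on a made-up review log (the `toy_` names and values are hypothetical). It illustrates one property of the approach: the mean is taken over review rows, so an item with more reviews contributes more to the cold-start profile than an item with fewer.

```python
import numpy as np
import pandas as pd

# Hypothetical review log: 'A' has three reviews, 'B' two, 'C' one.
toy_df = pd.DataFrame({'item': ['A', 'A', 'B', 'A', 'B', 'C']})

# One feature row per review row, aligned with toy_df (values are made up).
toy_features = np.array([
    [1.0, 0.0],   # A
    [1.0, 0.0],   # A
    [0.0, 1.0],   # B
    [1.0, 0.0],   # A
    [0.0, 1.0],   # B
    [0.5, 0.5],   # C
])

# Same steps as get_popular_item_profile, with top_n=2.
popular = toy_df['item'].value_counts().head(2).index
popular_rows = toy_df[toy_df['item'].isin(popular)].index.to_numpy()
toy_cold_profile = toy_features[popular_rows].mean(axis=0)

# A contributes three rows and B two, so the result is [0.6, 0.4]
# rather than the [0.5, 0.5] a per-item average would give.
print(list(popular), toy_cold_profile)
```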


## 5. Similarity Computation and Recommendation
### 5.1 & 5.2. Compute Similarity and Generate Top-N Recommendations
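We compute the **cosine similarity** between the user's profile and every row of the item-feature matrix, rank the rows by score, and drop the rows the user has already rated. Because the matrix has one row per review, an item can still surface through another user's review row, and the same item can appear more than once in the list.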

In [16]:
def get_recommendations(user_id, user_profiles, feature_matrix, df, top_n=10):
    if user_id in user_profiles:
        profile = user_profiles[user_id].reshape(1, -1)
    else:
        # Unknown user: fall back to the global cold_start_profile from section 4.2
        profile = cold_start_profile.reshape(1, -1)
        
    # 5.1 Cosine Similarity Scores
    scores = cosine_similarity(profile, feature_matrix).flatten()
    
    # 5.2 Ranking and Removing already-rated items
    rec_df = pd.DataFrame({'item_idx': range(len(df)), 'score': scores})
    
    if user_id in user_profiles:
        rated_indices = df[df['user_id'] == user_id].index
        rec_df = rec_df[~rec_df['item_idx'].isin(rated_indices)]
        
    top_indices = rec_df.sort_values(by='score', ascending=False).head(top_n)['item_idx'].values
    # 'rating' is the rating stored on the matched review row, not a prediction for this user
    return df.iloc[top_indices][['item', 'rating']]

example_user = df['user_id'].iloc[0]
print(f"--- Top-10 Recommendations for User {example_user} ---")
print(get_recommendations(example_user, user_profiles, item_feature_matrix, df, top_n=10))

print(f"\n--- Top-20 Recommendations for User {example_user} ---")
print(get_recommendations(example_user, user_profiles, item_feature_matrix, df, top_n=20))

--- Top-10 Recommendations for User AEV4DP5E3FJH6FHDLXUYQTDEQYCQ ---
                                                    item  rating
4375   Energizer Ultimate Lithium AA Batteries, World...       4
11307  Green Gobbler Drain Clog Dissolver, Drain Open...       5
872    Energizer CR2032 Batteries, 3V Lithium Coin Ce...       5
5048   Affresh Washing Machine Cleaner, Cleans Front ...       5
13529  ComfiLife Gel Enhanced Seat Cushion – Non-Slip...       5
10311  Premium Elastic Bandage Wrap (3" Wide, 2 Pack)...       5
9414   BUBBAS Super Strength Enzyme Cleaner - Pet Odo...       5
4709          LiCB A23 23A 12V Alkaline Battery (5-Pack)       5
9930   Green Gobbler Drain Clog Dissolver, Drain Open...       5
5537   Bio Clean Hard Water Stain Remover 40 oz. (20....       5

--- Top-20 Recommendations for User AEV4DP5E3FJH6FHDLXUYQTDEQYCQ ---
                                                    item  rating
4375   Energizer Ultimate Lithium AA Batteries, World...       4
11307  Green Gob
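
Since `df` holds one row per review, filtering by row index removes only the user's own review rows. One possible refinement, sketched below on made-up data, filters by item title instead and keeps the best-scoring row per item. The helper name `recommend_unique_items` and the toy inputs are hypothetical; this is not the function that produced the output above.

```python
import numpy as np
import pandas as pd

def recommend_unique_items(user_id, scores, df, top_n=10):
    """Rank items by score, skipping items the user rated, one row per item."""
    ranked = df.assign(score=scores)
    # Exclude by item title, so other users' reviews of the same item go too.
    rated_items = set(df.loc[df['user_id'] == user_id, 'item'])
    ranked = ranked[~ranked['item'].isin(rated_items)]
    # Sort first, then keep the highest-scoring row of each item.
    ranked = ranked.sort_values('score', ascending=False)
    ranked = ranked.drop_duplicates(subset='item', keep='first')
    return ranked[['item', 'score']].head(top_n)

# Hypothetical review log and similarity scores (one score per review row).
toy_df = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u2', 'u3', 'u3'],
    'item':    ['A',  'A',  'B',  'B',  'C'],
    'rating':  [5,    4,    3,    5,    4],
})
toy_scores = np.array([0.9, 0.9, 0.4, 0.4, 0.7])

toy_recs = recommend_unique_items('u1', toy_scores, toy_df)
print(toy_recs)  # item A is excluded for u1; B and C each appear once
```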

## 6. k-Nearest Neighbors (k-NN)
### 6.1. Implement Item-Based k-NN
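We predict a user's rating for a target item as the similarity-weighted average of that user's ratings on the $k$ items nearest to the target in TF-IDF space. If the target item is unknown, or the user has rated none of its neighbors, the prediction falls back to the global mean rating.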

In [17]:
def predict_rating_knn(user_id, item_title, df, feature_matrix, k=10):
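    # Use the first row that carries this title as the target item
    target_row = df[df['item'] == item_title].head(1)
    if target_row.empty:
        return df['rating'].mean()  # unknown item: fall back to the global mean
    target_idx = target_row.index[0]
    
    # k+1 neighbors because the nearest one is expected to be the target itself.
    # The index is refit on every call; fitting it once outside would be faster.
    knn = NearestNeighbors(n_neighbors=k+1, metric='cosine')
    knn.fit(feature_matrix)
    distances, indices = knn.kneighbors(feature_matrix[target_idx])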
    
    user_ratings = df[df['user_id'] == user_id]
    pred_numerator = 0
    pred_denominator = 0
    
    for dist, idx in zip(distances.flatten()[1:], indices.flatten()[1:]):
        item_title_sim = df.iloc[idx]['item']
        user_record = user_ratings[user_ratings['item'] == item_title_sim]
        
        if not user_record.empty:
            similarity = 1 - dist
            pred_numerator += similarity * user_record['rating'].values[0]
            pred_denominator += similarity
            
    # The user rated none of the neighbors: fall back to the global mean
    if pred_denominator == 0:
        return df['rating'].mean()
    return pred_numerator / pred_denominator

test_item = df.iloc[10]['item']
prediction = predict_rating_knn(example_user, test_item, df, item_feature_matrix, k=10)
print(f"Predicted rating for '{test_item}': {prediction:.2f}")

Predicted rating for 'Affresh Dishwasher Cleaner, Helps Remove Limescale and Odor-Causing Residue, 6 Tablets & Coffee Maker Cleaner, Works with Multi-cup and Single-serve Brewers, 3 Tablets': 4.29
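
The prediction rule can be verified by hand on a small case. The sketch below uses four made-up unit vectors and two made-up ratings (all names and values are hypothetical). It also fits the `NearestNeighbors` index once and reuses it, which is one way to avoid refitting on every prediction.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical unit-length item vectors in a 2-term feature space.
toy_items = np.array([
    [1.0, 0.0],   # 0: target item
    [0.8, 0.6],   # 1: cosine similarity 0.8 with the target
    [0.6, 0.8],   # 2: cosine similarity 0.6 with the target
    [0.0, 1.0],   # 3: cosine similarity 0.0 with the target
])

# Fit once; k=2 neighbors plus the target itself.
toy_knn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(toy_items)
distances, indices = toy_knn.kneighbors(toy_items[[0]])

toy_user_ratings = {1: 5, 2: 3}  # hypothetical ratings keyed by item index
num = den = 0.0
for dist, idx in zip(distances[0][1:], indices[0][1:]):  # skip the target
    if int(idx) in toy_user_ratings:
        sim = 1 - dist
        num += sim * toy_user_ratings[int(idx)]
        den += sim

toy_prediction = num / den
print(round(toy_prediction, 2))  # (0.8*5 + 0.6*3) / (0.8 + 0.6) = 4.14
```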


### 6.2. Compare content-based and k-NN approaches

| Feature | Content-Based (User Profiles) | k-NN (Item-Based) |
| :--- | :--- | :--- |
| **Core Concept** | Matches user's overall keyword profile to items. | Matches a specific item to its closest 'neighbors'. |
| **Pros** | Works for new items (no ratings needed) and can explain *why* from the content. | Predicts an explicit rating from the user's own ratings of the most similar items. |
| **Cons** | Can be 'boring' (always recommends similar keywords). | Subject to cold-start (needs ratings to predict well). |
| **Use Case** | Discovery based on specific interests/topics. | 'Items similar to this one' logic and rating estimates for a single item. |

**Summary**: In this implementation, the **Content-Based** approach is more robust because it can score every item from text features alone. The **k-NN rating prediction** is personalized only when the user has rated at least one of the target item's $k$ nearest neighbors; otherwise it returns the global mean rating, which is likely for users with only a few ratings in a sparse, high-dimensional feature space.

## 7. Complete Numerical Example
Step-by-step example using a tiny subset: the first three unique item titles, with illustrative ratings of 5, 4, and 3.

In [18]:
print("--- Step 7.1: Sample Item Descriptions ---")
mini_items = df['item'].unique()[:3]
print(mini_items)

print("\n--- Step 7.2: TF-IDF Calculation (Sample 5 terms) ---")
mini_tfidf = tfidf.transform(mini_items).toarray()
print(pd.DataFrame(mini_tfidf[:, :5], columns=tfidf.get_feature_names_out()[:5], index=mini_items))

print("\n--- Step 7.3: User Profile (from 3 items with ratings 5, 4, 3) ---")
user_ratings_val = np.array([5, 4, 3])
user_mini_profile = np.average(mini_tfidf, axis=0, weights=user_ratings_val)
print(f"User Profile Vector (first 5 terms): {user_mini_profile[:5]}")

print("\n--- Step 7.4: Similarity Scores ---")
mini_scores = cosine_similarity(user_mini_profile.reshape(1, -1), mini_tfidf).flatten()
print(pd.Series(mini_scores, index=mini_items))

print("\n--- Step 7.5: Top-5 Recommendations ---")
print(get_recommendations(example_user, user_profiles, item_feature_matrix, df, top_n=5))

--- Step 7.1: Sample Item Descriptions ---
['Sonic Handheld Percussion Massage Gun - Deep Tissue Massager for Sore Muscle and Stiffness - Quiet, 5 Speed High-Intensity Vibration - Quick Rechargeable Device - Includes 8 Massage Heads (Blue)'
 'DUDE Wipes On-The-Go Flushable Wet Wipes - 1 Pack, 30 Wipes - Mint Chill Extra-Large Individually Wrapped Wipes with Eucalyptus & Tea Tree Oil - Septic and Sewer Safe'
 'Sleep Mask for Side Sleeper, 100% Blackout 3D Eye Mask for Sleeping, Night Blindfold for Men Women']

--- Step 7.2: TF-IDF Calculation (Sample 5 terms) ---
                                                     10       100   12   15  \
Sonic Handheld Percussion Massage Gun - Deep Ti...  0.0  0.000000  0.0  0.0   
DUDE Wipes On-The-Go Flushable Wet Wipes - 1 Pa...  0.0  0.000000  0.0  0.0   
Sleep Mask for Side Sleeper, 100% Blackout 3D E...  0.0  0.347612  0.0  0.0   

                                                     20  
Sonic Handheld Percussion Massage Gun - Deep Ti...  0.0 