# Part 4: Advanced Hybrid Recommendation System Design
    
## 9. Integrated Hybrid Strategy Implementation

This notebook represents the culmination of our recommendation system pipeline. Having individually developed **Content-Based (CB)** and **Collaborative Filtering (CF)** models, we now seek to overcome their respective limitations by synthesizing them into a robust **Hybrid Architecture**.

### Theoretical Framework & Motivation
Single-method recommenders often suffer from critical weaknesses:
1.  **Content-Based Systems**: Excel at recommending niche or new items based on features (Description, Metadata) but fail to capture "serendipity" or community trends. They are limited by the quality of the feature engineering.
2.  **Collaborative Filtering**: Excels at capturing complex latent patterns and community wisdom but suffers catastrophically from the **Cold-Start Problem** (new users/items) and **Sparsity** (inadequate overlap between users).

### Our Hybrid Approach
We propose a unified framework that implements and evaluates distinct hybridization strategies to determine the optimal configuration for the Amazon Health & Household domain:

1.  **Option A: Weighted Hybridization**
    *   **Concept**: A linear combination of the normalized scores from both predictors.
    *   **Formula**: $Score_{Hybrid} = \alpha \cdot Score_{CB} + (1 - \alpha) \cdot Score_{CF}$
    *   **Goal**: To find the sweet spot $\alpha$ where the "content signal" stabilizes the "collaborative noise."

2.  **Option B: Switching Hybridization**
    *   **Concept**: A dynamic selection mechanism based on user confidence.
    *   **Logic**: If a user has sufficient history ($N \ge Threshold$), we trust the community signal (CF). Otherwise, we fall back to the safer content matching (CB).
    *   **Goal**: To solve the Cold-Start problem explicitly by treating new users differently.

3.  **Option C: Cascade Hybridization**
    *   **Concept**: A multi-stage funneling process.
    *   **Logic**: Stage 1 uses Content-Based logic to filter the search space to the top-50 relevant candidates. Stage 2 uses Collaborative Filtering to re-rank this short list.
    *   **Goal**: To improve computational efficiency and ensure that final recommendations are at least "content-relevant."
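
The three strategies above can be sketched as small pure functions. This is a minimal illustration, not the `utils` implementation: the function names, the default `threshold`, and the `top_k` parameter are assumptions made for the example, and both scores are assumed to be on the same scale.

```python
import numpy as np

def weighted_score(cb_score, cf_score, alpha):
    """Option A: linear blend; alpha is the weight on the content-based score."""
    return alpha * cb_score + (1.0 - alpha) * cf_score

def switching_score(n_ratings, cb_score, cf_score, threshold=10):
    """Option B: trust CF only when the user has at least `threshold` ratings."""
    return cf_score if n_ratings >= threshold else cb_score

def cascade_rank(cb_scores, cf_scores, top_k=50):
    """Option C: shortlist the top_k items by CB score, then order them by CF score.

    Returns item positions, best first.
    """
    cb_scores = np.asarray(cb_scores, dtype=float)
    cf_scores = np.asarray(cf_scores, dtype=float)
    # Stage 1: content-based shortlist (stable sort keeps ties in input order)
    shortlist = np.argsort(-cb_scores, kind="stable")[:top_k]
    # Stage 2: re-rank only the shortlist by collaborative score
    order = np.argsort(-cf_scores[shortlist], kind="stable")
    return shortlist[order].tolist()

blended = weighted_score(4.0, 2.0, alpha=0.7)
switched = switching_score(3, cb_score=4.0, cf_score=2.0)
ranking = cascade_rank([0.9, 0.1, 0.8, 0.7], [1.0, 5.0, 4.0, 2.0], top_k=3)
print(blended, switched, ranking)
```

In the toy call to `cascade_rank`, the item with the highest CF score never appears in the ranking because its content score keeps it out of the shortlist, which is exactly the trade-off Option C makes.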

## 10. Robustness Analysis (Cold-Start)
Finally, we will rigorously stress-test our selected model against users with minimal data points (3, 5, and 10 ratings) to validate its real-world viability.


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import time
import os

# Import our custom consolidated utilities
# This library now encapsulates the core logic for:
# - Data Loading
# - Feature Construction (TF-IDF, Vectors)
# - User Profile Building
# - Collaborative matrix generation
# - Atomic Prediction Functions
import utils
import importlib
importlib.reload(utils) # Standard practice to ensure updates are reflected

# Settings
np.random.seed(42)
pd.set_option('display.max_columns', None)

RESULTS_DIR = "../results"
os.makedirs(RESULTS_DIR, exist_ok=True)  # Create the results folder if it is missing


## 1. Data Pipeline Initialization

Here we load the curated dataset. To ensure consistency with previous experiments, we apply the same preprocessing steps (filling missing prices with the median, replacing missing text with empty strings) that were established in Parts 2 and 3.


In [None]:
# Load Data
data_path = '../data/Amazon_health&household_label_encoded.csv'
df = pd.read_csv(data_path)
print(f"Data Loaded: {df.shape}")

# Preprocessing (Standardized)
df_items = df[['item_id_encoded', 'item', 'price', 'text', 'is_green']].drop_duplicates(subset='item_id_encoded').sort_values('item_id_encoded').set_index('item_id_encoded')
df_items['text'] = df_items['text'].fillna('')
df_items['price'] = df_items['price'].fillna(df_items['price'].median())
df_items['is_green'] = df_items['is_green'].fillna(False)


## 2. Content-Based Model Instantiation

We invoke our `utils` library to construct the feature space.
*   **Item Features**: A dense matrix combining Top-100 TF-IDF terms + Price + Green Labels.
*   **User Profiles**: Weighted average vectors representing user preferences.

*Note: We handle users with absolutely no history by assigning them a global average 'Cold-Start Vector'.*
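
As a rough illustration of the profile construction described above, the sketch below builds a rating-weighted average of item vectors and falls back to a cold-start vector when there is no history. The helper name `build_profile` and the choice of the mean item vector as the cold-start vector are assumptions for this example; `utils.build_user_profiles` may differ in detail.

```python
import numpy as np

def build_profile(item_vectors, ratings, cold_start_vec):
    """Rating-weighted average of the item vectors a user has rated.

    Falls back to `cold_start_vec` when the user has no usable history.
    """
    item_vectors = np.asarray(item_vectors, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    if item_vectors.size == 0 or ratings.sum() == 0:
        return np.asarray(cold_start_vec, dtype=float)
    # Broadcast each rating across its item's feature vector, then normalize
    return (item_vectors * ratings[:, None]).sum(axis=0) / ratings.sum()

# Toy feature matrix: 3 items, 2 features
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cold_vec = features.mean(axis=0)  # assumed definition of the global average vector

profile = build_profile(features[[0, 1]], [5, 1], cold_vec)
empty_profile = build_profile(np.empty((0, 2)), [], cold_vec)
print(profile, empty_profile)
```

A user who rated item 0 highly and item 1 poorly ends up with a profile that leans toward item 0's features, while a user without history receives the shared fallback vector.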


In [None]:
# 1. Build Item Features
# Uses TF-IDF on 'text' + Normalized 'price' + 'is_green'
item_features = utils.build_content_features(df_items)
print(f"Item Feature Matrix Constructed: {item_features.shape}")

# 2. Build User Profiles
# Computes weighted average of item vectors for every user
user_profiles_global, cold_start_vec_global = utils.build_user_profiles(df, item_features)
print(f"Global User Profiles Built: {len(user_profiles_global)} users processed.")


## 3. Collaborative Filtering Model Instantiation

We construct the sparse interaction matrices required for memory-based CF.
*   **Interaction Matrix $R$**: A User x Item sparse matrix.
*   **Similarity Matrix**: Item-Item Pearson Correlation matrix, computed via Cosine Similarity on Mean-Centered ratings.

*Optimization: We use sparse matrix operations throughout to prevent memory overflow.*
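
The centering-then-cosine idea can be sketched with SciPy sparse matrices as follows. This is an illustrative version under stated assumptions, not the body of `utils.get_centered_sim_matrix`: it assumes that a stored entry means "rated", that ratings are centered by each user's mean over rated items, and the function name is hypothetical.

```python
import numpy as np
from scipy import sparse

def centered_item_similarity(R):
    """Item-item cosine similarity on user-mean-centered ratings.

    R is a user x item sparse matrix where unobserved ratings are not stored.
    Returns (similarity matrix in CSR format, per-user mean ratings).
    """
    R = sparse.csr_matrix(R, dtype=np.float64)
    counts = np.diff(R.indptr)                     # number of ratings per user
    sums = np.asarray(R.sum(axis=1)).ravel()
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    centered = R.copy()
    centered.data -= np.repeat(means, counts)      # touch only the stored ratings

    # Scale each item column to unit length, guarding against empty columns
    norms = np.sqrt(np.asarray(centered.multiply(centered).sum(axis=0)).ravel())
    norms[norms == 0] = 1.0
    normalized = centered @ sparse.diags(1.0 / norms)

    sim = (normalized.T @ normalized).tocsr()      # item x item
    return sim, means

R_toy = sparse.csr_matrix(np.array([[5, 3, 0],
                                    [4, 0, 1],
                                    [0, 2, 2]], dtype=float))
sim_toy, means_toy = centered_item_similarity(R_toy)
print(means_toy)
print(sim_toy.toarray().round(3))
```

Because only stored entries are centered, the matrix never becomes dense, which is the point of the optimization note above.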


In [None]:
# 1. Matrix Construction mapping
users = sorted(df['user_id'].unique())
items = sorted(df['item_id_encoded'].unique())
user_map = {u: i for i, u in enumerate(users)}
item_map = {it: i for i, it in enumerate(items)}

# 2. Build Global Matrices
R_global = utils.build_cf_matrix(df, users, items, user_map, item_map)

# 3. Compute Similarity
# This utilizes the mathematical equivalence: Pearson(X, Y) = Cosine(Centered(X), Centered(Y))
sim_matrix_global, user_means_global = utils.get_centered_sim_matrix(R_global)
print("Collaborative Filtering Matrices (Similarity & Means) Ready.")


## 9. Hybrid Strategy Definitions

Here we define the core logic for our three experimental strategies. These functions abstract the decision-making process for combining or selecting scores.


In [None]:
# Strategy A: Weighted Hybrid
# Linearly blends the scores. Alpha controls the weight of the Content-Based signal.
def hybrid_weighted(cb_score, cf_score, alpha):
    return utils.hybrid_weighted(cb_score, cf_score, alpha)

# Strategy B: Switching Hybrid
# Chooses the model based on the user's data density (count).
def hybrid_switching(user_rating_count, cb_score, cf_score, threshold=10):
    return utils.hybrid_switching(user_rating_count, cb_score, cf_score, threshold)

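# Strategy C: Cascade Hybrid
# Gates candidates by Content-Based score: below the threshold the score is 0, otherwise the CF score is used.
# Note: this per-item threshold stands in for the top-50 shortlist described in the introduction.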
def hybrid_cascade(cb_score, cf_score, threshold=0.5):
    return utils.hybrid_cascade(cb_score, cf_score, threshold)


## Evaluation Framework

To evaluate performance without leaking test ratings into the models, we rebuild our models on a **Training Set (80%)** and test them on a held-out **Test Set (20%)**.

**Methodology**:
1.  Split Data.
2.  Re-calculate User Profiles and CF Matrices on **Train Data only**.
3.  Iterate through Test Data pairs (User, Item).
4.  Generate predictions using all strategies.
5.  Compute **RMSE (Root Mean Squared Error)** to quantify accuracy.
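
For reference, step 5 amounts to the following computation. The helper below is a minimal sketch with a hypothetical name (`rmse`); it returns NaN for an empty input so that "no predictions" cannot be mistaken for a perfect score.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error; NaN when there is nothing to evaluate."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.size == 0:
        return float("nan")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

true_ratings = [5, 4, 1]
predictions = [4.0, 4.0, 3.0]
score = rmse(true_ratings, predictions)
print(f"RMSE: {score:.4f}")
```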



In [None]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# --- REBUILDING MODELS ON TRAIN SET ---
print("Rebuilding models on Training Data...")
# 1. CF Matrices (Train)
R_train = utils.build_cf_matrix(train_df, users, items, user_map, item_map)
sim_train, means_train = utils.get_centered_sim_matrix(R_train)

# 2. User Profiles (Train)
profiles_train, cold_vec_train = utils.build_user_profiles(train_df, item_features)

# 3. User Counts (for Switching Logic)
user_counts_train = train_df['user_id'].value_counts().to_dict()

def evaluate_models(test_set, subset_name="Full Test"):
    print(f"\n--- Evaluating on {subset_name} ({len(test_set)} samples) ---")
    mse_weighted = {0.3: [], 0.5: [], 0.7: []}
    mse_switching = []
    mse_cascade = []
    
    start = time.time()
    for idx, row in test_set.iterrows():
        uid = row['user_id']
        iid = row['item_id_encoded']
        true_r = row['rating']
        
        # Safe Index Lookup
        u_idx = user_map.get(uid)
        i_idx = item_map.get(iid)
        if u_idx is None or i_idx is None: continue 
        
        # --- ATOMIC PREDICTIONS ---
        # 1. Content-Based Score
        cb_s = utils.predict_cb(uid, i_idx, profiles_train, item_features, cold_vec_train)
        
        # 2. Collaborative Filtering Score
        cf_s = utils.predict_cf(u_idx, i_idx, R_train, sim_train, means_train)
        
        # --- HYBRID PREDICTIONS ---
        # Option A: Weighted
        for alpha in [0.3, 0.5, 0.7]:
            h_s = hybrid_weighted(cb_s, cf_s, alpha)
            mse_weighted[alpha].append((true_r - h_s)**2)
            
        # Option B: Switching
        cnt = user_counts_train.get(uid, 0)
        sw_s = hybrid_switching(cnt, cb_s, cf_s, threshold=10)
        mse_switching.append((true_r - sw_s)**2)
        
        # Option C: Cascade (threshold = 1.5 on the CB score, which lives on the 1-5 scale)
        # Note: Our CB returns 1+4*sim. Sim 0 -> 1. Sim 1 -> 5.
        # Threshold: require at least "some" similarity. Score >= 1.5 <=> Sim >= 0.125
        cas_s = hybrid_cascade(cb_s, cf_s, threshold=1.5)
        
        # If filtered (score 0), the error is (True - 0)^2 = True^2. 
        # This penalizes missing a relevant item.
        mse_cascade.append((true_r - cas_s)**2)
        
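        # Progress log: count processed rows (the DataFrame index is shuffled by the split)
        if len(mse_switching) % 500 == 0: print(f"Processed {len(mse_switching)} predictions...")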

    print(f"Evaluation complete in {time.time()-start:.2f}s")
    
    results = {}
    print("\n--- RESULTS (RMSE) ---")
    for alpha, errs in mse_weighted.items():
        rmse = np.sqrt(np.mean(errs)) if errs else 0
        print(f"[Option A] Weighted (alpha={alpha}): {rmse:.4f}")
        results[f'Weighted_{alpha}'] = rmse
        
    rmse_sw = np.sqrt(np.mean(mse_switching)) if mse_switching else 0
    print(f"[Option B] Switching (Threshold=10):  {rmse_sw:.4f}")
    results['Switching'] = rmse_sw
    
    rmse_cas = np.sqrt(np.mean(mse_cascade)) if mse_cascade else 0
    print(f"[Option C] Cascade (Threshold=1.5):   {rmse_cas:.4f}")
    results['Cascade'] = rmse_cas
    
    # Save Results
    df_res = pd.DataFrame.from_dict(results, orient='index', columns=['RMSE'])
    df_res.index.name = 'Strategy'
    filename = f"hybrid_evaluation_{subset_name.replace(' ', '_').lower()}.csv"
    utils.save_csv(df_res.reset_index(), filename)
    
    return results

# Execute Evaluation on a subset for demonstration speed (first 2000 rows)
results_full = evaluate_models(test_df.head(2000), "Test Subset (2000)")


## 9.2 Recommendation Strategy Selection & Justification

On the 2,000-row test subset, **Option A: Weighted Hybrid (alpha=0.7)** achieved the lowest RMSE (~2.75), ahead of Switching (~2.85) and Cascade (~4.14). Only $\alpha \in \{0.3, 0.5, 0.7\}$ were tested, so 0.7 is the best of the evaluated values rather than a tuned optimum.

### **Selected Strategy: Option A (Weighted Hybrid, $\alpha=0.7$)**

**Domain-Specific Model Justification (Health & Household):**

1.  **Nature of the Domain (Functional vs. Taste-Based)**:
    *   The "Health & Household" category is primarily **functional**. Users buy items like "Septic Tank Treatment" or "Vitamins" to solve specific problems, not purely for entertainment (like Movies/Music).
    *   **Result**: The **Content-Based** component (TF-IDF on descriptions) captures this "functional utility" better than Collaborative Filtering. If a user needs "green cleaning," they need it regardless of what other users bought. This is consistent with the high alpha (0.7), which favors content, performing best among the values we tested.

2.  **Data Sparsity & Cold Start**:
    *   Household items have **lower interaction frequency** than media consumption. Most users buy few items, leading to an extremely sparse matrix ($>99.9\%$ empty).
    *   **Result**: Pure Collaborative Filtering struggles to find neighbors. The Weighted Hybrid compensates for this "sparse signal" by relying on the robust "content signal," which is always available (as every item has a description).

3.  **Why Cascade Failed**:
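    *   Part of the gap is an artifact of the metric: in our evaluation, any item whose content score falls below the threshold receives a score of 0, which lies outside the 1-5 rating scale and incurs a squared error of $r^2$ for a true rating $r$.
    *   Beyond that, by filtering items based *only* on content first, Cascade likely removed "complementary goods" (e.g., buying a toothbrush after toothpaste) that don't share text keywords but are effectively linked by user behavior. A weighted blend preserves these subtle links while maintaining content relevance.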
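
To see how strongly a zero score for filtered items affects RMSE, consider the toy calculation below. All numbers are made up for illustration, and `cascade_score` is a hypothetical stand-in that follows the rule stated in the strategy comments (score 0 below the threshold, CF score otherwise).

```python
import numpy as np

def cascade_score(cb_score, cf_score, threshold):
    """Assumed gate: CF score if the CB score passes the threshold, else 0."""
    return cf_score if cb_score >= threshold else 0.0

# Made-up ratings and scores for four (user, item) pairs
true_r = np.array([5.0, 4.0, 5.0, 3.0])
cb = np.array([1.2, 3.0, 1.4, 2.5])   # two items fall below the 1.5 gate
cf = np.array([4.5, 3.5, 4.0, 3.5])

cas = np.array([cascade_score(b, f, threshold=1.5) for b, f in zip(cb, cf)])
rmse_cf = float(np.sqrt(np.mean((true_r - cf) ** 2)))
rmse_cas = float(np.sqrt(np.mean((true_r - cas) ** 2)))
print(f"CF only: {rmse_cf:.2f} | Cascade: {rmse_cas:.2f}")
```

Even though the CF predictions are identical for the items that pass the gate, the two filtered items dominate the cascade error. Rating-prediction RMSE therefore understates how useful a cascade can be as a ranking filter.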



## 10. Cold-Start Robustness Analysis

We now perform the final robustness check. We simulate new users by sampling users with at least 20 ratings and revealing only 3, 5, or 10 of their ratings to the model, then test whether our selected model (Weighted Hybrid) can rank one of their hidden liked items near the top. A second cell repeats the test on users who naturally have exactly 3, 5, or 10 ratings.
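
The truncation step of such a simulation might look like the sketch below. It is an assumption-laden illustration rather than the code of `utils.simulate_cold_start`: the function name is hypothetical, it only produces the nested `user -> N -> visible ratings` structure, and it omits the ground-truth and user-sampling parts.

```python
import pandas as pd

def simulate_visible_histories(ratings, min_ratings=20, n_list=(3, 5, 10), seed=42):
    """For each sufficiently active user, expose only the first N shuffled ratings."""
    counts = ratings["user_id"].value_counts()
    eligible = counts[counts >= min_ratings].index
    scenarios = {}
    for uid in eligible:
        # Shuffle once so that the smaller cohorts are nested inside the larger ones
        history = ratings[ratings["user_id"] == uid].sample(frac=1.0, random_state=seed)
        scenarios[uid] = {n: history.head(n) for n in n_list}
    return scenarios

# Toy data: user "a" has 25 ratings, user "b" has only 4 and is excluded
toy = pd.DataFrame({
    "user_id": ["a"] * 25 + ["b"] * 4,
    "item_id_encoded": list(range(25)) + [0, 1, 2, 3],
    "rating": [5, 4, 3] * 8 + [5] + [4, 4, 5, 2],
})
toy_scenarios = simulate_visible_histories(toy)
print({n: len(df_n) for n, df_n in toy_scenarios["a"].items()})
```

Nesting the cohorts means any improvement from 3 to 10 visible ratings comes from additional information about the same user, not from a different random draw.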


In [None]:
# Simulate Cold Start Scenarios
import importlib
importlib.reload(utils) # Ensure we have the dynamic prediction functions

# We regenerate scenarios here to ensure availability
scenarios, ground_truth, sampled_users = utils.simulate_cold_start(df, min_ratings=20, n_users=20, n_ratings_list=[3, 5, 10])

print("\n=== STRICT COLD-START SIMULATION (Hit Rate @ 10) ===")
print("Methodology: Train on N visible ratings -> Predict 1 Hidden Positive Item vs up to 600 Distractors")
print("Comparison: Hybrid (Dynamic Profile) vs Global Popularity")

results_cold = []
# Pre-calculate global popularity for distractors
pop_items_global = utils.recommend_popularity(train_df, top_k=500)
all_items_ids = list(item_map.keys())

for n in [3, 5, 10]:
    print(f"\n--- Testing Cohort: {n} Ratings ---")
    hits_hybrid = 0
    hits_pop = 0
    total = 0
    
    for uid in sampled_users:
        # 1. Get Visible Data
        visible_df = scenarios[uid][n]
        visible_items = set(visible_df['item_id_encoded'].values)
        
        # 2. Identify Hidden Item (Test Target)
        # Ground truth has all positives.
        actual_liked = ground_truth[uid]
        possible_targets = list(actual_liked - visible_items)
        
        if not possible_targets: continue
        
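        # Draw the target with NumPy's RNG (seeded in the setup cell) instead of the unseeded `random` module
        targets = sorted(possible_targets)
        hidden_item = targets[np.random.randint(len(targets))]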
        
        # 3. Build Candidates
        # Uniqueness guaranteed by set operations
        # 500 Popular + 100 Random + Hidden
        distractors = set(pop_items_global) | set(utils.recommend_random(all_items_ids, 100))
        distractors.discard(hidden_item)
        distractors = list(distractors)[:600] # Cap size
        candidates = [hidden_item] + distractors
        
        # 4. Prepare Dynamic User State for Hybrid
        # Content-Based Profile
        vis_indices = [item_map.get(i) for i in visible_df['item_id_encoded']]
        vis_ratings = visible_df['rating'].values
        
        # Handle cases where item might not be in feature map (should be rare)
        valid_indices = []
        valid_ratings = []
        for i, idx in enumerate(vis_indices):
            if idx is not None:
                valid_indices.append(idx)
                valid_ratings.append(vis_ratings[i])
        
        if not valid_indices:
            dyn_profile = cold_vec_train
        else:
            vecs = item_features[valid_indices]
            rats = np.array(valid_ratings).reshape(-1, 1)
            dyn_profile = np.sum(vecs * rats, axis=0) / (np.sum(rats) + 1e-9)
             
        # Collaborative Profile (Ratings Dict)
        dyn_ratings_dict = {item_map.get(row['item_id_encoded']): row['rating'] 
                           for _, row in visible_df.iterrows() 
                           if item_map.get(row['item_id_encoded']) is not None}
        dyn_mean = visible_df['rating'].mean()
        
        # 5. Score Candidates
        hyb_scores = []
        pop_counts = train_df['item_id_encoded'].value_counts()
        pop_scores = [] # We can batch get
        
        for cand in candidates:
            # Pop
            pop_scores.append(pop_counts.get(cand, 0))
            
            # Hybrid
            c_idx = item_map.get(cand)
            if c_idx is None:
                hyb_scores.append(-1)
                continue
            
            cb = utils.predict_cb(None, c_idx, None, item_features, cold_vec_train, dynamic_profile=dyn_profile)
            cf = utils.predict_cf(None, c_idx, R_train, sim_train, means_train, dynamic_ratings=dyn_ratings_dict, dynamic_mean=dyn_mean)
            hyb_scores.append(utils.hybrid_weighted(cb, cf, alpha=0.7))
            
        # 6. Check Ranks
        # Hybrid
        # Sort (score, index_in_candidates)
        hyb_sort = sorted([(s, i) for i, s in enumerate(hyb_scores)], reverse=True)
        # Rank of hidden item (which is at index 0 of candidates)
        rank_h = [x[1] for x in hyb_sort].index(0) + 1
        
        # Pop
        pop_sort = sorted([(s, i) for i, s in enumerate(pop_scores)], reverse=True)
        rank_p = [x[1] for x in pop_sort].index(0) + 1
        
        if rank_h <= 10: hits_hybrid += 1
        if rank_p <= 10: hits_pop += 1
        total += 1
        
    # Metrics
    if total > 0:
        hr_h = hits_hybrid / total
        hr_p = hits_pop / total
        print(f"Cohort {n}: Hybrid HR@10 = {hr_h:.2f} | Popularity HR@10 = {hr_p:.2f}")
        results_cold.append({'N_Ratings': n, 'Hybrid_HR@10': hr_h, 'Popularity_HR@10': hr_p})

# Save
df_cold_res = pd.DataFrame(results_cold)
utils.save_csv(df_cold_res, "cold_start_comparison.csv")
print("\nFinal Cold-Start Analysis:")
print(df_cold_res)


In [None]:
# === ADDITIONAL ANALYSIS: NATURAL COLD START ===
# Comparison on users who ACTUALLY have only 3, 5, or 10 ratings.
# Methodology: Leave-One-Out (Hold out 1 item, use remaining N-1 to predict).

print("\n=== NATURAL COLD-START COHORTS (Real Low-Activity Users) ===")
results_natural = []

# 1. Identify Natural Cohorts
user_counts_all = df['user_id'].value_counts()

for n_total in [3, 5, 10]:
    # Get users with EXACTLY n_total ratings
    cohort_users = user_counts_all[user_counts_all == n_total].index.tolist()
    
    # Sample for speed (20 users)
    if len(cohort_users) > 20:
        test_cohort = np.random.choice(cohort_users, 20, replace=False)
    else:
        test_cohort = cohort_users
        
    if len(test_cohort) == 0:
        print(f"No users found with exactly {n_total} ratings.")
        continue
        
    print(f"\n--- Natural Cohort: {n_total} Ratings (Sampled {len(test_cohort)} users) ---")
    
    hits = 0
    total = 0
    
    for uid in test_cohort:
        # Get User's Full Data
        u_data = df[df['user_id'] == uid]
        
        # Pick 1 Hidden Item (Positive, assuming all in cleaned data are 'interactions')
        # Since we use implicit feedback logic mostly, we just pick one.
        hidden_row = u_data.sample(1, random_state=42)
        hidden_item = hidden_row['item_id_encoded'].values[0]
        
        # Visible Data (The remaining N-1 items)
        visible_data = u_data.drop(hidden_row.index)
        
        # Build Candidates (Hidden + 100 Random)
        # Detailed comparison (vs Pop) is less critical here, we just want to see if Hybrid works.
        distractors = utils.recommend_random(list(item_map.keys()), 100)
        candidates = [hidden_item] + [d for d in distractors if d != hidden_item]
        
        # Prepare Dynamic Profile from Visible Data
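        # Keep indices and ratings aligned: an unmapped item is dropped together with its rating
        vis_pairs = [(item_map[i], r)
                     for i, r in zip(visible_data['item_id_encoded'], visible_data['rating'])
                     if i in item_map]
        vis_indices = [p[0] for p in vis_pairs]
        vis_ratings = np.array([p[1] for p in vis_pairs])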
        
        # Content Profile
        if not vis_indices:
            dyn_profile = cold_vec_train
        else:
            vecs = item_features[vis_indices]
            rats = np.array(vis_ratings[:len(vis_indices)]).reshape(-1, 1)
            dyn_profile = np.sum(vecs * rats, axis=0) / (np.sum(rats) + 1e-9)
            
        # Collaborative Profile
        dyn_ratings = {item_map.get(row['item_id_encoded']): row['rating'] 
                      for _, row in visible_data.iterrows() 
                      if item_map.get(row['item_id_encoded']) is not None}
        dyn_mean = visible_data['rating'].mean()
        
        # Score
        scores = []
        for cand in candidates:
            c_idx = item_map.get(cand)
            if c_idx is None:
                scores.append(-1)
                continue
            
            cb = utils.predict_cb(None, c_idx, None, item_features, cold_vec_train, dynamic_profile=dyn_profile)
            cf = utils.predict_cf(None, c_idx, R_train, sim_train, means_train, dynamic_ratings=dyn_ratings, dynamic_mean=dyn_mean)
            scores.append(utils.hybrid_weighted(cb, cf, alpha=0.7))
            
        # Rank
        # Sort desc, find index of 0 (hidden item)
        ranked_indices = [i for _, i in sorted(zip(scores, range(len(scores))), key=lambda pair: pair[0], reverse=True)]
        rank = ranked_indices.index(0) + 1
        
        if rank <= 10:
            hits += 1
        total += 1
        
    hr = hits / total if total > 0 else 0
    print(f"Cohort {n_total}: Hybrid HR@10 = {hr:.2f} ({hits}/{total})")
    results_natural.append({'Cohort': f'{n_total} Ratings', 'HR@10': hr})

df_nat = pd.DataFrame(results_natural)
utils.save_csv(df_nat, "natural_cold_start_summary.csv")
print(df_nat)


## 11. Baseline Comparison (Hit Rate @ 10)

In this final evaluation, we perform a strict **Leave-One-Out (LOO)** test to compare our Hybrid strategy against standard recommendations.

### **Methodology**
1.  **Target Users**: We sample 100 users who have at least 20 ratings (to ensure stable profiles).
2.  **Leave-One-Out**: For each user, we "hide" **one** of their positive interactions (Rating $\ge$ 3.0).
3.  **Candidate Selection**: We create a pool of items consisting of:
    *   The **Hidden Item** (Ground Truth).
    *   **500 Popular Items** (Distractors).
    *   **100 Random Items** (Distractors).
4.  **Ranking**: We ask each model to rank this candidate set (Size ~601).
5.  **Metric**: **Hit Rate @ 10 (HR@10)**.
    *   If the *Hidden Item* appears in the **Top 10** recommendations $\rightarrow$ **Hit (1)**.
    *   Otherwise $\rightarrow$ **Miss (0)**.
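
The metric itself is simple to compute once each test case yields a score list with the hidden item at a known position. The sketch below uses hypothetical helper names and resolves ties against the hidden item, which matches how a descending sort on `(score, index)` pairs places index 0 last among equal scores.

```python
def rank_of_target(scores, target_index=0):
    """1-based rank of the target under a descending sort; ties count against it."""
    target = scores[target_index]
    better_or_tied = sum(
        1 for i, s in enumerate(scores) if i != target_index and s >= target
    )
    return better_or_tied + 1

def hit_rate_at_k(ranks, k=10):
    """Share of test cases whose target landed in the top k."""
    if not ranks:
        return float("nan")
    return sum(r <= k for r in ranks) / len(ranks)

# Three toy test cases; the hidden item is always at position 0
toy_ranks = [
    rank_of_target([0.9, 0.5, 0.95]),       # one candidate scores higher
    rank_of_target([0.1] + [0.5] * 20),     # twenty candidates score higher
    rank_of_target([0.7, 0.7]),             # tie, resolved against the target
]
toy_hr = hit_rate_at_k(toy_ranks, k=10)
print(toy_ranks, round(toy_hr, 3))
```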

### **Models Compared**
1.  **Random**: Shuffles the candidates. Baseline chance $\approx 10/601 \approx 1.7\%$.
2.  **Popularity**: Ranks candidates purely by global popularity.
3.  **Pure Content-Based**: Ranks by similarity to user's profile (TF-IDF vector).
4.  **Weighted Hybrid (Alpha=0.7)**: Our proposed champion model.

### **Interpreting The Output**
*   **0.0 Hit Rate**: In highly sparse domains like *Health & Household*, finding the exact hidden item in the top 10 is extremely difficult. A hit rate of 0.0 means the hidden item never reached the top 10 for any sampled user; it does not by itself show that the ranking is uninformative, which is why we also inspect ranks.
*   **Rank Analysis**: Look at the distinct **Rank** in the verbose output.
    *   Random will rank the item around 300 on average (the midpoint of the ~601 candidates).
    *   If our Hybrid ranks it at **90** or **50**, it places the item roughly **3x-6x higher** than the random expectation, even if it misses the Top-10 cut-off.
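
The reference numbers for the random baseline follow from basic counting, as the short calculation below shows. It assumes the candidate pool reaches its full size of 601 (the hidden item plus 600 distractors); `rank_lift` is a hypothetical helper used only to express "how many times higher than random".

```python
n_candidates = 601   # 1 hidden item + 600 distractors (assumes the cap is reached)
k = 10

# Under a uniform shuffle, every position is equally likely for the hidden item
p_hit_random = k / n_candidates
expected_rank_random = (n_candidates + 1) / 2

def rank_lift(model_rank, baseline_rank=expected_rank_random):
    """How many times closer to the top the model ranks the item than random does."""
    return baseline_rank / model_rank

print(f"Random HR@{k}: {p_hit_random:.4f}")
print(f"Random expected rank: {expected_rank_random:.0f}")
print(f"Lift at rank 90: {rank_lift(90):.1f}x | at rank 50: {rank_lift(50):.1f}x")
```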



In [None]:
# Run Baseline Comparison
# We increase the sample size to 100 users for a more robust estimate.
df_baseline = utils.evaluate_baselines_comparison(
    df, 
    profiles_train, item_features, cold_vec_train,
    R_train, sim_train, means_train,
    user_map, item_map,
    n_test_users=100
)

print("\n=== BASELINE COMPARISON RESULTS (Sample: 100 Users) ===")
print(df_baseline)
