# Tuning and Comparing Recommendation Models

**CSC17104 - Programming for Data Science**  
**Student:** Angela - Student ID: 23122030  
**Notebook:** 04_model_tuning_sklearn.ipynb

---

## This version uses sklearn

### Fixing SVD for sparse data:
- **Problem:** Full SVD fails on the sparse matrix (99% zeros)
- **Solution:** Use `sklearn.decomposition.TruncatedSVD`
  - Optimized for sparse matrices
  - Computes only the k largest components
  - Production-ready, well-tested

### Structure:
1. **Baseline** → **Simple CF** → **Advanced SVD**
2. Train all models with different hyperparameters
3. Centralized evaluation
4. Combined visualization
5. Detailed remarks

---

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import sys
from datetime import datetime
import time

# Setup
warnings.filterwarnings('ignore')
np.random.seed(42)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['figure.dpi'] = 100

# Add src to path
sys.path.insert(0, os.path.abspath('../src'))

# Import OOP classes from src
from data_processor import DataProcessor
from models import (
    PopularityRecommender, ItemBasedCF, UserBasedCF, 
    SVDRecommender, TruncatedSVD,
    precision_at_k, recall_at_k, f1_at_k,
    hit_rate_at_k, mean_reciprocal_rank,
    coverage, diversity
)
from visualizer import Visualizer
from data_processing import load_processed_data, train_test_split
import pandas as pd

print(f"NumPy: {np.__version__}")
print(f"Started: {datetime.now().strftime('%H:%M:%S')}")

## 2. SVD Wrapper Using sklearn

### sklearn.decomposition.TruncatedSVD:

**Advantages:**
- Optimized for sparse matrices (uses ARPACK or a randomized algorithm)
- Simple API: `fit()`, `transform()`, `fit_transform()`
- Exposes `explained_variance_ratio_` for analysis
- Production-ready, well-tested

**Key parameters:**
- `n_components`: Number of latent factors (20-100 for sparse data)
- `algorithm`: 'randomized' (default, fast) or 'arpack' (more accurate)
- `random_state`: For reproducibility

**Notes:**
- sklearn TruncatedSVD does not center the data (this is what distinguishes it from PCA and lets it work on sparse matrices without densifying them); subtract the mean manually if centering is needed
- Returns components in descending order (largest first)
- The wrapper class in the next cell uses the `TruncatedSVD` imported from the project's `models` module (a from-scratch NumPy implementation based on power iteration), not the sklearn class
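
As a minimal sketch of the points above, the snippet below fits `sklearn.decomposition.TruncatedSVD` on a small synthetic `scipy.sparse` matrix. The toy sizes, the random ratings, and the variable names are assumptions made for illustration; they are not taken from this notebook's dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic data (assumption): 200 users, 100 items, about 3% of cells rated
rng = np.random.default_rng(42)
n_users, n_items, n_ratings = 200, 100, 600
users = rng.integers(0, n_users, size=n_ratings)
items = rng.integers(0, n_items, size=n_ratings)
values = rng.integers(1, 6, size=n_ratings).astype(float)

# Build the user-item matrix in CSR form; it is never densified.
# Note: csr_matrix sums duplicate (user, item) pairs.
R = csr_matrix((values, (users, items)), shape=(n_users, n_items))

svd = TruncatedSVD(n_components=20, algorithm='randomized', random_state=42)
user_factors = svd.fit_transform(R)   # shape (n_users, 20), i.e. U * Sigma
item_factors = svd.components_        # shape (20, n_items), i.e. V^T

print(user_factors.shape, item_factors.shape)
print(f"Variance explained by 20 factors: {svd.explained_variance_ratio_.sum():.2%}")
# Check that singular values are ordered largest first
print(np.all(np.diff(svd.singular_values_) <= 1e-12))
```

Predicted scores for a user can then be read from `user_factors[u] @ item_factors`, without ever building the dense matrix.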

In [None]:
class SVDRecommenderNumPy:
    """
    SVD recommender built on a from-scratch TruncatedSVD implementation (NumPy only).
    Does not use sklearn; computes the SVD with the power iteration method.
    """
    
    def __init__(self, n_components=50, n_iterations=20):
        self.n_components = n_components
        self.n_iterations = n_iterations
        self.svd_model = TruncatedSVD(n_components=n_components, n_iterations=n_iterations)
        self.user_item_matrix = None
        self.global_mean = None
        self.n_users = None
        self.n_items = None
    
    def fit(self, user_indices, item_indices, ratings, n_users, n_items):
        """
        Train the SVD model with TruncatedSVD (from scratch)
        
        Parameters:
        -----------
        user_indices, item_indices : arrays
            User and item indices
        ratings : array
            Ratings
        n_users, n_items : int
            Number of users and items
        """
        self.n_users = n_users
        self.n_items = n_items
        self.global_mean = np.mean(ratings)
        
        # Create dense user-item matrix (vectorized)
        print(f"Creating user-item matrix: {n_users}×{n_items}")
        self.user_item_matrix = np.zeros((n_users, n_items))
        self.user_item_matrix[user_indices, item_indices] = ratings
        
        sparsity = 1 - (len(ratings) / (n_users * n_items))
        print(f"Non-zero: {len(ratings):,}")
        print(f"Sparsity: {sparsity*100:.4f}%")
        
        print(f"Fitting TruncatedSVD (k={self.n_components}, using power iteration)...")
        # Fit the from-scratch TruncatedSVD
        self.svd_model.fit(self.user_item_matrix)
        print("SVD complete (power iteration finished)")
        print(f"User factors (U): ({self.svd_model.U.shape[0]}, {self.svd_model.U.shape[1]})")
        print(f"Item factors (V^T): ({self.svd_model.Vt.shape[0]}, {self.svd_model.Vt.shape[1]})")
    
    def predict(self, user_id, item_id):
        """Predict rating cho user-item pair"""
        if user_id >= self.n_users or item_id >= self.n_items:
            return self.global_mean
        
        # Reconstruct the matrix and read the value
        reconstructed = self.svd_model.reconstruct()
        return reconstructed[user_id, item_id]
    
    def recommend(self, user_id, top_n=10, exclude_rated=None):
        """Recommend top N items cho user"""
        if user_id >= self.n_users:
            return np.array([], dtype=int)
        
        reconstructed = self.svd_model.reconstruct()
        # Copy the row so masking rated items never mutates the reconstructed matrix
        predicted_ratings = reconstructed[user_id].copy()
        
        # Exclude already rated items
        if exclude_rated is not None and len(exclude_rated) > 0:
            predicted_ratings[list(exclude_rated)] = -np.inf
        
        top_items = np.argsort(predicted_ratings)[::-1][:top_n]
        valid_items = top_items[predicted_ratings[top_items] > -np.inf]
        
        return valid_items

print("SVDRecommenderNumPy defined (uses the from-scratch TruncatedSVD)")
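
`predict` and `recommend` above rebuild the full reconstructed matrix on every call. One possible alternative, sketched here with plain NumPy on a toy matrix, scores a single user directly from the factors. The names `U`, `s`, and `Vt` and the use of `np.linalg.svd` are assumptions for this sketch; the project's `TruncatedSVD` class may store its factors differently.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))                 # toy dense user-item matrix (assumption)

# Rank-k factors taken from a full SVD, truncated to the k largest singular values
k = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :k], s[:k], Vt[:k, :]

def score_user(user_id):
    """Predicted ratings for one user: a (k,) vector times a (k, n_items) matrix."""
    return (U[user_id] * s) @ Vt

full = (U * s) @ Vt                    # full rank-k reconstruction, for comparison only
row = score_user(2)
print(np.allclose(row, full[2]))       # checks that both routes give the same row
```

Scoring one user this way costs on the order of k × n_items operations instead of n_users × k × n_items for the whole matrix.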

## 3. Load Data

In [None]:
print("Loading preprocessed data...")

data_dict = load_processed_data('../data/processed/')
data = data_dict['data']
mappings = data_dict['mappings']
metadata = data_dict['metadata']

user_indices = data['user_indices']
product_indices = data['product_indices']
ratings = data['ratings']
timestamps = data['timestamps']

n_users = len(mappings['unique_users'])
n_products = len(mappings['unique_products'])

print(f"Dataset: {len(ratings):,} ratings")
print(f"Users: {n_users:,}, Products: {n_products:,}")
print(f"Sparsity: {metadata['sparsity']*100:.4f}%")

In [None]:
print("\n[STEP 2] Train-test split (80-20)...\n")

split_data = train_test_split(
    user_indices, product_indices, ratings, timestamps,
    test_size=0.2, random_seed=42
)

train_data = split_data['train']
test_data = split_data['test']

print(f"Train: {len(train_data['ratings']):,}")
print(f"Test: {len(test_data['ratings']):,}")

## 4. Results Dictionary

In [None]:
results = {'models': {}, 'train_times': {}, 'metrics': {}}

def save_model_result(name, model, train_time):
    results['models'][name] = model
    results['train_times'][name] = train_time
    print(f"{name}: {train_time:.2f}s")

print("Results initialized")


---
# PART A: TRAINING
---

## 5. Level 0: Popularity Baseline

In [None]:
print("LEVEL 0: POPULARITY")
pop = PopularityRecommender()
start = time.time()
pop.fit(train_data['product_indices'], train_data['ratings'])
save_model_result('Popularity', pop, time.time() - start)
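
`PopularityRecommender` lives in the project's `models` module and its internals are not shown in this notebook. As a rough idea of what such a baseline can look like, the sketch below ranks items by how many ratings they received; the function name `top_popular` and the count-based rule are assumptions, not the project's implementation.

```python
import numpy as np

def top_popular(item_indices, n_items, top_n=10):
    """Rank items by how many ratings they received (count-based popularity)."""
    counts = np.bincount(item_indices, minlength=n_items)
    # Stable sort on negated counts: ties keep ascending item order
    order = np.argsort(-counts, kind='stable')
    return order[:top_n]

# Toy interactions (assumption): item 2 is rated most often, then item 0
item_indices = np.array([2, 2, 0, 2, 1, 0, 3])
print(top_popular(item_indices, n_items=5, top_n=3))
```

Every user receives the same list, which is why this baseline typically scores low on coverage and diversity.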


## 6. Level 1: Item-CF

In [None]:
print("LEVEL 1A: ITEM-CF")
for thresh in [0.0, 0.1, 0.2]:
    print(f"Threshold {thresh}")
    icf = ItemBasedCF(min_similarity=thresh)
    start = time.time()
    icf.fit(train_data['user_indices'], train_data['product_indices'], 
            train_data['ratings'], n_products)
    save_model_result(f'ItemCF_t{thresh}', icf, time.time() - start)
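
The effect of the `min_similarity` threshold swept above can be illustrated with a small cosine-similarity sketch. `ItemBasedCF` itself is defined in `models` and may compute similarity differently; the helper `item_cosine_similarity` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def item_cosine_similarity(R, min_similarity=0.0):
    """Cosine similarity between the item columns of a dense user-item matrix R."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0              # avoid division by zero for unrated items
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)           # an item should not recommend itself
    sim[sim < min_similarity] = 0.0      # drop weak neighbours
    return sim

# Toy matrix (assumption): rows are users, columns are items, 0 means unrated
R = np.array([[5, 4, 0],
              [4, 5, 1],
              [0, 0, 3]], dtype=float)

for thresh in [0.0, 0.1, 0.2]:
    sim = item_cosine_similarity(R, min_similarity=thresh)
    print(f"threshold={thresh}: {np.count_nonzero(sim)} non-zero similarities")
```

In this toy case the pair (item 0, item 2) has similarity of about 0.198, so it survives thresholds 0.0 and 0.1 and is dropped at 0.2. A higher threshold gives a sparser neighbourhood, at the risk of leaving some items with no neighbours at all.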


## 7. Level 1: User-CF

In [None]:
print("LEVEL 1B: USER-CF")
for k in [10, 20, 50]:
    print(f"k={k} neighbors")
    ucf = UserBasedCF(k_neighbors=k, min_similarity=0.1)
    start = time.time()
    ucf.fit(train_data['user_indices'], train_data['product_indices'],
            train_data['ratings'], n_users, n_products)
    save_model_result(f'UserCF_k{k}', ucf, time.time() - start)
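
To make the role of `k_neighbors` concrete, here is a hedged sketch of a user-based prediction as a similarity-weighted average over the k nearest users. The function `predict_user_cf`, the cosine similarity, and the global-mean fallback are assumptions; `UserBasedCF` in `models` may differ in all three.

```python
import numpy as np

def predict_user_cf(R, user_id, item_id, k_neighbors=2, min_similarity=0.1):
    """Similarity-weighted average over the k most similar users who rated item_id."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0
    sims = (R @ R[user_id]) / (norms * norms[user_id])
    sims[user_id] = 0.0                               # exclude the user themself
    candidates = np.where((R[:, item_id] > 0) & (sims >= min_similarity))[0]
    if len(candidates) == 0:
        return float(R[R > 0].mean())                 # fall back to the global mean
    top = candidates[np.argsort(sims[candidates])[::-1][:k_neighbors]]
    return float(sims[top] @ R[top, item_id] / sims[top].sum())

# Toy matrix (assumption): user 0 has not rated item 2
R = np.array([[5, 4, 0],
              [4, 5, 2],
              [5, 5, 4],
              [0, 0, 5]], dtype=float)

for k in [1, 2]:
    print(f"k={k}: predicted rating {predict_user_cf(R, 0, 2, k_neighbors=k):.3f}")
```

With k=1 only the single closest user contributes; a larger k averages over more neighbours, which smooths the prediction but lets less similar users influence it.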


## 8. Level 2: SVD with sklearn

### Why sklearn TruncatedSVD suits sparse data:

**Algorithm 'randomized' (default):**
- Uses a randomized algorithm (Halko et al., 2009)
- Very fast on large sparse matrices
- Approximate, but highly accurate
- Roughly O(nnz·k + (m+n)·k²) for an m×n matrix with nnz non-zeros, instead of O(m·n·min(m,n)) for a full SVD

**Algorithm 'arpack':**
- Uses ARPACK (iterative eigenvalue solver)
- More accurate than randomized
- Slightly slower
- A good fit when near-exact results are needed

**For this data (99% sparse):** the randomized algorithm is the best fit
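
The randomized approach can be sketched in a few lines of NumPy: sample the range of the matrix with random vectors, refine the basis with a few power iterations, then take an exact SVD of the small projected matrix. This is an illustrative sketch in the spirit of Halko et al., not sklearn's actual code; the function name and default parameters are assumptions.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=4, seed=42):
    """Approximate the top-k SVD of A with a random range finder plus power iterations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # 1. Sample the range of A with a random Gaussian test matrix
    Q = A @ rng.standard_normal((n, k + n_oversamples))
    # 2. Power iterations sharpen the basis; QR keeps it numerically stable
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Q)
        Q, _ = np.linalg.qr(A.T @ Q)
        Q = A @ Q
    Q, _ = np.linalg.qr(Q)
    # 3. Exact SVD of the small projected matrix B = Q^T A
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# Toy check (assumption): build an exactly rank-5 matrix and compare with a full SVD
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 5)) @ rng.standard_normal((5, 60))
U, s, Vt = randomized_svd_sketch(A, k=5)
exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.allclose(s, exact))
print(np.allclose((U * s) @ Vt, A))
```

The expensive steps are products with `A` and `A.T`, which is why the method scales with the number of non-zeros when `A` is stored in a sparse format.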

In [None]:
print("LEVEL 2: SVD (from-scratch TruncatedSVD - power iteration)")
for k in [20, 50, 100]:
    print(f"SVD k={k}")
    svd = SVDRecommenderNumPy(n_components=k, n_iterations=20)
    start = time.time()
    svd.fit(train_data['user_indices'], train_data['product_indices'],
            train_data['ratings'], n_users, n_products)
    train_time = time.time() - start
    save_model_result(f'SVD_k{k}', svd, train_time)
    print(f"Train time: {train_time:.1f}s")

### Remarks on SVD training

*Fill in after running*

#### Variance explained:
- k=20: [fill in]% variance
- k=50: [fill in]% variance
- k=100: [fill in]% variance

#### Training time:
- k=20: [fill in]s
- k=50: [fill in]s
- k=100: [fill in]s

#### Observations:
- Top factor accounts for ~[fill in]% of variance
- Top 10 factors account for ~[fill in]% of variance
- If k=20 already explains >80% of variance → k=100 may be overkill
- Is sklearn faster than the hand-written code? [yes/no] - because [reason]
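
One way to fill in the variance figures is the cumulative share of squared singular values. The sketch below assumes the singular values are available as a 1-D array; the helper `energy_captured` is hypothetical. For an uncentered matrix this share is the fraction of the squared Frobenius norm captured by the top k factors, which may differ from sklearn's `explained_variance_ratio_`.

```python
import numpy as np

def energy_captured(singular_values, total_energy, ks=(1, 10, 20)):
    """Share of the squared Frobenius norm captured by the top-k singular values."""
    cumulative = np.cumsum(np.asarray(singular_values) ** 2)
    return {k: cumulative[min(k, len(cumulative)) - 1] / total_energy for k in ks}

# Toy matrix (assumption); the total energy is the sum of squared entries
rng = np.random.default_rng(1)
A = rng.random((50, 30))
s = np.linalg.svd(A, compute_uv=False)
shares = energy_captured(s, total_energy=np.sum(A ** 2), ks=(1, 10, 30))
for k, share in shares.items():
    print(f"k={k}: {share:.1%}")
```

Passing the total energy explicitly matters for a truncated decomposition: only the top k singular values are known, so their sum alone would always report 100%.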

---
# PART B: EVALUATION
---

## 9. Evaluation Function

In [None]:
def evaluate_model(model, model_name, test_data, train_data, n_samples=100, top_n=10):
    print(f"Evaluating: {model_name}...")

    precisions, recalls, hit_rates = [], [], []
    all_recs = []
    n_success = n_empty = n_error = 0

    for i in range(min(n_samples, len(test_data['user_indices']))):
        user_id = test_data['user_indices'][i]
        true_item = test_data['product_indices'][i]

        try:
            if isinstance(model, PopularityRecommender):
                recs = model.recommend(top_n=top_n)
            elif isinstance(model, ItemBasedCF):
                mask = train_data['user_indices'] == user_id
                if not np.any(mask):
                    n_error += 1
                    continue
                recs = model.recommend_for_user(
                    train_data['product_indices'][mask],
                    train_data['ratings'][mask], top_n)
            elif isinstance(model, UserBasedCF):
                recs = model.recommend(user_id=user_id, top_n=top_n)
            elif isinstance(model, SVDRecommenderNumPy):
                # Exclude items the user already rated in the training set
                mask = train_data['user_indices'] == user_id
                exclude = train_data['product_indices'][mask] if np.any(mask) else None
                recs = model.recommend(user_id=user_id, top_n=top_n, exclude_rated=exclude)
            else:
                n_error += 1
                continue

            if len(recs) == 0:
                n_empty += 1
                continue

            precisions.append(precision_at_k(recs, [true_item], top_n))
            recalls.append(recall_at_k(recs, [true_item], top_n))
            hit_rates.append(hit_rate_at_k(recs, [true_item], top_n))
            all_recs.append(recs)
            n_success += 1

        except Exception as e:
            n_error += 1
            if n_error <= 3:
                print(f"Error: {str(e)}")

    print(f"Result: Success={n_success}, Empty={n_empty}, Error={n_error}")

    if len(precisions) == 0:
        return None

    return {
        'precision': np.mean(precisions),
        'recall': np.mean(recalls),
        'hit_rate': np.mean(hit_rates),
        'f1': 2*np.mean(precisions)*np.mean(recalls)/(np.mean(precisions)+np.mean(recalls)) if (np.mean(precisions)+np.mean(recalls))>0 else 0,
        'coverage': coverage(all_recs, np.arange(n_products)),
        'diversity': diversity(all_recs),
        'n_evaluated': len(precisions)
    }

print("Evaluation function ready")
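
The metric functions used by `evaluate_model` are imported from `models`, so their exact definitions are not visible here. The sketch below shows common textbook definitions under hypothetical names (the `_sketch` suffix marks them as assumptions) and applies them to the same call shape as above, with a single held-out item per test row.

```python
import numpy as np

def precision_at_k_sketch(recs, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = len(set(list(recs)[:k]) & set(relevant))
    return hits / k

def recall_at_k_sketch(recs, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    hits = len(set(list(recs)[:k]) & set(relevant))
    return hits / len(relevant) if len(relevant) > 0 else 0.0

def hit_rate_at_k_sketch(recs, relevant, k):
    """1.0 if at least one relevant item is in the top-k, else 0.0."""
    return float(len(set(list(recs)[:k]) & set(relevant)) > 0)

recs = np.arange(10)     # hypothetical top-10 list: items 0..9
true_item = 3            # the single held-out item for this test row
print(precision_at_k_sketch(recs, [true_item], 10))   # one hit over ten slots
print(recall_at_k_sketch(recs, [true_item], 10))      # the only relevant item was found
print(hit_rate_at_k_sketch(recs, [true_item], 10))
```

Under these definitions, with one held-out item per row, precision@10 cannot exceed 0.1 and recall@10 coincides with hit rate, which is worth keeping in mind when reading the comparison table.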


## 10. Run Evaluation

In [None]:
print("Evaluating models...")
for name, model in results['models'].items():
    metrics = evaluate_model(model, name, test_data, train_data, n_samples=100)
    if metrics:
        results['metrics'][name] = metrics

print("Evaluation complete")


## 11. Comparison Table

In [None]:
comparison = []
for name in results['metrics'].keys():
    m = results['metrics'][name]
    t = results['train_times'][name]
    comparison.append({
        'Model': name,
        'Precision@10': f"{m['precision']:.4f}",
        'Recall@10': f"{m['recall']:.4f}",
        'Hit Rate@10': f"{m['hit_rate']:.4f}",
        'F1': f"{m['f1']:.4f}",
        'Coverage': f"{m['coverage']:.4f}",
        'Diversity': f"{m['diversity']:.4f}",
        'Train (s)': f"{t:.2f}"
    })

df = pd.DataFrame(comparison)
print("Results - SKLEARN VERSION")
print(df.to_string(index=False))


### Results analysis

*Fill in after running*

#### Does SVD work?
- SVD_k20: Precision = [fill in]
- SVD_k50: Precision = [fill in]
- SVD_k100: Precision = [fill in]
- **Conclusion:** sklearn SVD [does/does not] perform better than the hand-written code

#### Best model:
- Highest precision: [which model]
- Highest coverage: [which model]
- Fastest training: [which model]
- **Recommended:** [which model] because [reason]

## 12. Visualization

In [None]:
names = list(results['metrics'].keys())
prec = [results['metrics'][m]['precision'] for m in names]
rec = [results['metrics'][m]['recall'] for m in names]
hit = [results['metrics'][m]['hit_rate'] for m in names]
cov = [results['metrics'][m]['coverage'] for m in names]
div = [results['metrics'][m]['diversity'] for m in names]
times = [results['train_times'][m] for m in names]

fig = plt.figure(figsize=(18, 10))

# 1. Accuracy metrics
ax1 = plt.subplot(2, 3, 1)
x = np.arange(len(names))
w = 0.25
ax1.bar(x-w, prec, w, label='Precision', alpha=0.8)
ax1.bar(x, rec, w, label='Recall', alpha=0.8)
ax1.bar(x+w, hit, w, label='Hit Rate', alpha=0.8)
ax1.set_xticks(x)
ax1.set_xticklabels(names, rotation=45, ha='right', fontsize=8)
ax1.set_title('Accuracy Metrics', fontweight='bold')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# 2. Coverage & Diversity
ax2 = plt.subplot(2, 3, 2)
ax2.bar(x-w/2, cov, w, label='Coverage', alpha=0.8)
ax2.bar(x+w/2, div, w, label='Diversity', alpha=0.8)
ax2.set_xticks(x)
ax2.set_xticklabels(names, rotation=45, ha='right', fontsize=8)
ax2.set_title('Coverage & Diversity', fontweight='bold')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# 3. Training time
ax3 = plt.subplot(2, 3, 3)
ax3.barh(names, times, alpha=0.7, color='steelblue')
ax3.set_xlabel('Seconds')
ax3.set_title('Training Time', fontweight='bold')
ax3.set_xscale('log')
ax3.grid(axis='x', alpha=0.3)

# 4. Precision vs Time
ax4 = plt.subplot(2, 3, 4)
ax4.scatter(times, prec, s=100, alpha=0.6, c=range(len(names)), cmap='viridis')
for i, n in enumerate(names):
    ax4.annotate(n, (times[i], prec[i]), fontsize=7, alpha=0.7)
ax4.set_xlabel('Training Time (s)')
ax4.set_ylabel('Precision@10')
ax4.set_title('Precision vs Time', fontweight='bold')
ax4.set_xscale('log')
ax4.grid(alpha=0.3)

# 5. SVD analysis
ax5 = plt.subplot(2, 3, 5)
svd_names = [n for n in names if 'SVD' in n]
if svd_names:
    svd_k = [int(n.split('_k')[-1]) for n in svd_names]
    svd_prec = [results['metrics'][n]['precision'] for n in svd_names]
    svd_time = [results['train_times'][n] for n in svd_names]
    
    ax5_twin = ax5.twinx()
    l1 = ax5.plot(svd_k, svd_prec, 'o-b', linewidth=2, markersize=8, label='Precision')
    l2 = ax5_twin.plot(svd_k, svd_time, 's--r', linewidth=2, markersize=8, label='Time')
    
    ax5.set_xlabel('k (factors)')
    ax5.set_ylabel('Precision', color='b')
    ax5_twin.set_ylabel('Time (s)', color='r')
    ax5.set_title('SVD: k vs Performance', fontweight='bold')
    ax5.tick_params(axis='y', labelcolor='b')
    ax5_twin.tick_params(axis='y', labelcolor='r')
    ax5.grid(alpha=0.3)
    
    lines = l1 + l2
    labels = [l.get_label() for l in lines]
    ax5.legend(lines, labels)

# 6. Heatmap
ax6 = plt.subplot(2, 3, 6)
metrics_matrix = np.array([prec, rec, hit, cov, div])
im = ax6.imshow(metrics_matrix, cmap='YlOrRd', aspect='auto')
ax6.set_xticks(range(len(names)))
ax6.set_yticks(range(5))
ax6.set_xticklabels(names, rotation=45, ha='right', fontsize=8)
ax6.set_yticklabels(['Precision', 'Recall', 'Hit Rate', 'Coverage', 'Diversity'])
ax6.set_title('Metrics Heatmap', fontweight='bold')
plt.colorbar(im, ax=ax6)

for i in range(5):
    for j in range(len(names)):
        ax6.text(j, i, f'{metrics_matrix[i,j]:.3f}', ha='center', va='center', fontsize=7)

plt.tight_layout()
plt.savefig('../results/comparison_sklearn.png', dpi=150, bbox_inches='tight')
plt.show()

print('Saved: ../results/comparison_sklearn.png')


## 13. Summary

In [None]:
best_prec = max(results['metrics'].items(), key=lambda x: x[1]['precision'])
best_cov = max(results['metrics'].items(), key=lambda x: x[1]['coverage'])
fastest = min(results['train_times'].items(), key=lambda x: x[1])

print("\n" + "="*80)
print("SUMMARY (SKLEARN VERSION)")
print("="*80)
print(f"\n🏆 Best Precision: {best_prec[0]} ({best_prec[1]['precision']:.4f})")
print(f"📊 Best Coverage: {best_cov[0]} ({best_cov[1]['coverage']:.4f})")
print(f"⚡ Fastest: {fastest[0]} ({fastest[1]:.2f}s)")

svd_models = {k:v for k,v in results['metrics'].items() if 'SVD' in k}
if svd_models:
    best_svd = max(svd_models.items(), key=lambda x: x[1]['precision'])
    print(f"\n📈 Best SVD: {best_svd[0]}")
    print(f"   Precision: {best_svd[1]['precision']:.4f}")
    print(f"   Time: {results['train_times'][best_svd[0]]:.1f}s")
    print(f"   ✓ sklearn SVD WORKS!")

print("\n" + "="*80)

### Final thoughts

*Fill in the final analysis*

#### sklearn vs hand-written code:
- **Performance:** [compare]
- **Speed:** [compare]
- **Code simplicity:** sklearn is clearly much more concise!
- **Production ready:** sklearn is thoroughly tested and optimized

#### Recommended approach:
1. **Prototype:** Use sklearn for quick tests
2. **Production:** Use sklearn for stability
3. **Learning:** Write the code by hand to understand the algorithm
4. **Optimization:** Optimize only if sklearn is not fast enough

#### Key learnings:
1. Sparse data needs sparse algorithms
2. sklearn TruncatedSVD is well suited to recommendation
3. No need to reinvent the wheel
4. Focus on problem solving, not implementation details

## 14. Save Results

In [None]:
import os
import pickle

# Make sure the output directory exists before writing
os.makedirs('../outputs', exist_ok=True)

with open('../outputs/results_sklearn.pkl', 'wb') as f:
    pickle.dump(results, f)

df.to_csv('../outputs/comparison_sklearn.csv', index=False)

print("✓ Saved:")
print("  - results_sklearn.pkl")
print("  - comparison_sklearn.csv")
print("  - comparison_sklearn.png")