# PoC: History-based Recommendation

## Goal
Prototype a recommendation system that takes a **list of recently watched anime** (the user history) and suggests similar content.
This moves beyond the simple "Item-to-Item" recommendation to a "User-to-Item" approach.

## Methodologies

We will compare two distinct strategies to handle the user's history:

### Strategy 1: The Centroid Method (Mean Vector)
**The Average Taste**

- **Concept**: We assume a user can be represented by the *average* of what they watch. If you watch a *Horror* anime and a *Comedy* anime, your "user vector" sits somewhere in the middle (e.g., *Dark Comedy*).
- **Math**:
  1. Retrieve the TF-IDF feature vector $\vec{v}_i$ for each anime $i$ in the history.
  2. Compute the User Vector: $\vec{u} = \frac{1}{N} \sum_{i=1}^{N} \vec{v}_i$
  3. Find the $k$ nearest neighbors to $\vec{u}$ in the entire dataset.
- **Pros**: Fast (1 query), holistic.
- **Cons**: Can dilute specific tastes. The average of "Naruto" and "Your Lie in April" might be a generic show that fits neither perfectly.
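
The dilution effect can be seen on a toy example. The sketch below uses a hypothetical three-term vocabulary and two made-up, non-overlapping feature vectors (they are not taken from the dataset); it only illustrates that the mean of two unrelated vectors ends up equally, and only moderately, similar to both.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vocabulary: [ninja, piano, school]
v_action = np.array([1.0, 0.0, 0.0])  # toy "action" show
v_music = np.array([0.0, 1.0, 0.0])   # toy "music" show

# User vector = mean of the history vectors
history = np.vstack([v_action, v_music])
user_vector = history.mean(axis=0)

sim_action = cosine_similarity(user_vector, v_action)
sim_music = cosine_similarity(user_vector, v_music)
print(round(sim_action, 3), round(sim_music, 3))  # 0.707 0.707
```

Neither history item is matched perfectly (similarity 1.0), which is the trade-off noted under "Cons".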

### Strategy 2: Multi-Query Aggregation
**The Eclectic Taste**

- **Definition**: "Eclectic" refers to a taste derived from a wide and diverse range of sources. This strategy assumes the user has multiple distinct interests rather than one single "average" preference.

- **Concept**: We respect that a user might have distinct, unrelated interests. We find recommendations for *each* show in history separately, then combine them.
- **Math**:
  1. For each anime $i$ in history, find its top $k$ neighbors $R_i = \{r_{i1}, r_{i2}, \dots, r_{ik}\}$.
  2. Pool all results: $R_{\text{total}} = R_1 \cup R_2 \cup \dots \cup R_N$.
  3. Rank candidates by **Frequency** (how many history items returned the candidate) and **Score** (how close the candidate is to those items).
- **Pros**: Preserves distinct genres. Excellent for users with diverse taste.
- **Cons**: Slower ($N$ queries).
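
One possible reading of the ranking step, sketched on made-up data: count how often each candidate appears across the per-item neighbor lists, sum `1 - distance` as a similarity score, then sort by frequency with the score as tie-breaker. The neighbor lists, names, and distances below are hypothetical, and sorting by frequency first is an assumption of this sketch, not the only way to combine the two signals.

```python
from collections import defaultdict

# Hypothetical neighbor lists: history item -> [(candidate, cosine distance), ...]
neighbors = {
    "history_a": [("show_x", 0.30), ("show_y", 0.50)],
    "history_b": [("show_y", 0.40), ("show_z", 0.20)],
}

frequency = defaultdict(int)  # how many history items returned the candidate
score = defaultdict(float)    # summed similarity (1 - distance)

for results in neighbors.values():
    for candidate, distance in results:
        frequency[candidate] += 1
        score[candidate] += 1.0 - distance

# Rank by frequency first, then by summed similarity
ranked = sorted(frequency, key=lambda c: (frequency[c], score[c]), reverse=True)
print(ranked)  # ['show_y', 'show_z', 'show_x']
```

Here `show_y` wins because two history items returned it, even though `show_z` is the single closest match.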


In [None]:
import sys
import os
from pathlib import Path

import pandas as pd
import numpy as np
from scipy.sparse import vstack

# Add the project root to sys.path so the src modules can be imported
current_dir = Path(os.getcwd()).resolve()
project_root = current_dir.parent
sys.path.append(str(project_root))

from src.pipeline.inference import load_models, load_processed_data

## Setup & Data Loading

In [None]:
knn, vectorizer = load_models()
df = load_processed_data()

2026-02-02 10:34:37,218 - inference - INFO - Loading models...
2026-02-02 10:34:37,312 - inference - INFO - Loading data from C:\Users\sayye\OneDrive\Documents\GitHub\AniMate\artifacts\vector_embeddings.pkl...


In [None]:
df.head(1)

Unnamed: 0,mal_id,title,english title,japanese title,episodes,release year,status,air date,genres,themes,...,favorites,content rating,source,duration,url,image url,score,stemmed_synopsis,producer,combined_features
0,16498,Shingeki no Kyojin,Attack on Titan,進撃の巨人,25.0,2013.0,Finished Airing,"Apr 7, 2013 to Sep 29, 2013","Action, Award Winning, Drama, Suspense","Gore, Military, Survival",...,186403,R - 17+ (violence & profanity),Manga,24 min per ep,https://myanimelist.net/anime/16498/Shingeki_n...,https://cdn.myanimelist.net/images/anime/10/47...,8.57,centuri ago mankind slaughter near extinct mon...,,Shingeki no Kyojin Attack on Titan 進撃の巨人 Actio...


## Helper Functions

We need a robust way to find the vector for a given anime title. 
Since our model works on `combined_features`, we must:
1. Find the dataframe row for the title.
2. Extract the `combined_features` text.
3. Transform it using the loaded `vectorizer`.

In [None]:
def get_anime_vector(title_query, dataframe, vectorizer):
    """
    Finds the first anime whose title contains `title_query` (case-insensitive)
    and returns (tfidf_vector, matched_title).
    Returns (None, None) if not found.
    """
    # Case-insensitive substring search; regex=False treats the query as literal text
    match = dataframe[dataframe['title'].str.contains(title_query, case=False, na=False, regex=False)]
    
    if match.empty:
        # Fall back to the English title
        match = dataframe[dataframe['english title'].str.contains(title_query, case=False, na=False, regex=False)]
        
    if match.empty:
        print(f"Warning: '{title_query}' not found in database.")
        return None, None
    
    # Take the first match
    row = match.iloc[0]
    text_feature = row['combined_features']
    
    # Vectorize
    vector = vectorizer.transform([text_feature])
    return vector, row['title']

In [None]:
def get_user_history_vectors(history_titles, df, vectorizer):
    """
    Converts a list of titles into a matrix of vectors.
    """
    vectors = []
    found_titles = []
    
    for title in history_titles:
        vec, true_title = get_anime_vector(title, df, vectorizer)
        if vec is not None:
            vectors.append(vec)
            found_titles.append(true_title)
            
    if not vectors:
        return None, []
        
    return vstack(vectors), found_titles

## Implementation: Strategy 1 (Centroid)
 
We calculate $\vec{u} = \text{mean}(\text{vectors})$. Then query the KNN model once.
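
One detail worth checking in isolation is the type returned by the mean. The sketch below uses two made-up sparse rows (not real TF-IDF vectors) and assumes the history is stored as scipy `csr_matrix` rows stacked with `vstack`; under that assumption `.mean(axis=0)` yields a dense `np.matrix`, which is why a conversion with `np.asarray` is applied before querying the KNN model.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

# Two hypothetical history vectors over a 3-term vocabulary
row_a = csr_matrix([[1.0, 0.0, 2.0]])
row_b = csr_matrix([[3.0, 0.0, 0.0]])
history = vstack([row_a, row_b])

# Column-wise mean of the stacked rows (dense result, shape (1, 3))
mean_matrix = history.mean(axis=0)

# Convert to a plain ndarray before passing it to an estimator
user_vector = np.asarray(mean_matrix)

print(type(mean_matrix).__name__, type(user_vector).__name__)
print(user_vector.shape)  # (1, 3)
```

The result keeps its 2-D shape `(1, n_features)`, which is the shape `kneighbors` expects for a single query.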

In [None]:
def recommend_centroid(history_titles, df, vectorizer, knn, top_k=5):
    vectors, found_titles = get_user_history_vectors(history_titles, df, vectorizer)
    
    if vectors is None:
        return pd.DataFrame() # No history found
    
    # Calculate Centroid (Mean Vector)
    # Note: .mean() on a scipy sparse matrix returns a dense np.matrix
    user_vector = vectors.mean(axis=0)

    # Fix for sklearn: convert np.matrix to np.ndarray (numpy is imported at the top)
    user_vector = np.asarray(user_vector)
    
    # Search KNN
    # We ask for more than k to allow filtering history items
    distances, indices = knn.kneighbors(user_vector, n_neighbors=top_k + len(found_titles))
    
    # Process Results
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        row = df.iloc[idx]
        title = row['title']
        
        # Filter out history
        if title in found_titles:
            continue
            
        results.append({
            'title': title,
            'genres': row.get('genres'),
            'themes': row.get('themes'),
            'distance': dist,
            'strategy': 'Centroid'
        })
        
        if len(results) >= top_k:
            break
            
    return pd.DataFrame(results)

## Implementation: Strategy 2 (Multi-Query)

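We query for *each* history item, then aggregate.
Scoring rule:
- Each time a candidate appears as a neighbor, it gains (1 - distance), so closer matches contribute more.
- The number of history items that returned it is tracked as `frequency`; the final ranking sorts by the summed score.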

In [None]:
def recommend_multiquery(history_titles, df, vectorizer, knn, top_k=5):
    vectors, found_titles = get_user_history_vectors(history_titles, df, vectorizer)
    
    if vectors is None:
        return pd.DataFrame()
    
    candidates = {}
    
    # Iterate through each history item
    # vectors is a sparse matrix, iterate rows
    for i in range(vectors.shape[0]):
        vec = vectors.getrow(i)
        
        # Get neighbors for this specific item
        dists, idxs = knn.kneighbors(vec, n_neighbors=top_k + 1)
        
        for d, idx in zip(dists[0], idxs[0]):
            row = df.iloc[idx]
            title = row['title']
            
            if title in found_titles:
                continue
                
            # Scoring Logic
            # Score = Sum of (1 - distance) scores from all triggers
            # A perfect match (dist=0) gives 1.0
            score_boost = 1.0 - d
            
            if title not in candidates:
                candidates[title] = {
                    'row': row, 
                    'total_score': 0, 
                    'frequency': 0,
                    'min_distance': float('inf')
                }
            
            candidates[title]['total_score'] += score_boost
            candidates[title]['frequency'] += 1
            candidates[title]['min_distance'] = min(candidates[title]['min_distance'], d)
            
    # Convert to DataFrame
    results = []
    for title, data in candidates.items():
        results.append({
            'title': title,
            'genres': data['row'].get('genres'),
            'themes': data['row'].get('themes'),
            'score': data['total_score'],
            'frequency': data['frequency'],
            'best_dist': data['min_distance'],
            'strategy': 'Multi-Query'
        })
        
    # Sort by Score desc
    results_df = pd.DataFrame(results)
    if not results_df.empty:
        results_df = results_df.sort_values('score', ascending=False).head(top_k)
        
    return results_df

## Comparison Run

In [None]:
# TEST CASE 1: The "Action/Shounen" Fan
# Logic: Should recommend other high-stakes action shows
history_1 = ["Naruto", "One Piece", "Bleach"]

print(f"--- Recommendations for History: {history_1} ---")
print("\n[Strategy 1: Centroid]")
res_c = recommend_centroid(history_1, df, vectorizer, knn)
display(res_c[['title', 'distance', 'genres', 'themes']])

print("\n[Strategy 2: Multi-Query]")
res_mq = recommend_multiquery(history_1, df, vectorizer, knn)
display(res_mq[['title', 'score', 'frequency', 'genres', 'themes']])

--- Recommendations for History: ['Naruto', 'One Piece', 'Bleach'] ---

[Strategy 1: Centroid]


Unnamed: 0,title,distance,genres,themes
0,Bleach: Sennen Kessen-hen,0.616237,"Action, Adventure, Supernatural",
1,Bleach Movie 1: Memories of Nobody,0.63274,"Action, Adventure, Supernatural",
2,One Piece: Episode of East Blue - Luffy to 4-n...,0.641049,"Action, Adventure, Fantasy",
3,Naruto Soyokazeden Movie: Naruto to Mashin to ...,0.665299,"Action, Comedy, Fantasy",Martial Arts
4,Bleach Movie 4: Jigoku-hen,0.667985,"Action, Adventure, Supernatural",



[Strategy 2: Multi-Query]


Unnamed: 0,title,score,frequency,genres,themes
10,Bleach Movie 1: Memories of Nobody,0.617031,1,"Action, Adventure, Supernatural",
11,Bleach: Sennen Kessen-hen,0.608361,1,"Action, Adventure, Supernatural",
5,One Piece: Episode of East Blue - Luffy to 4-n...,0.572019,1,"Action, Adventure, Fantasy",
0,Naruto Soyokazeden Movie: Naruto to Mashin to ...,0.549607,1,"Action, Comedy, Fantasy",Martial Arts
12,Bleach Movie 4: Jigoku-hen,0.539993,1,"Action, Adventure, Supernatural",


In [None]:
# TEST CASE 2: The "Mixed Taste" User
# Logic: Likes Psychological Thrillers AND Comedy. 
# Centroid might struggle. Multi-Query should find gems for both.
history_2 = ["Death Note", "One Punch Man"]

print(f"--- Recommendations for History: {history_2} ---")
print("\n[Strategy 1: Centroid]")
res_c = recommend_centroid(history_2, df, vectorizer, knn)
display(res_c[['title', 'distance', 'genres', 'themes']])

print("\n[Strategy 2: Multi-Query]")
res_mq = recommend_multiquery(history_2, df, vectorizer, knn)
display(res_mq[['title', 'score', 'frequency', 'genres', 'themes']])

--- Recommendations for History: ['Death Note', 'One Punch Man'] ---

[Strategy 1: Centroid]


Unnamed: 0,title,distance,genres,themes
0,One Punch Man: Road to Hero,0.612957,"Action, Comedy","Parody, Super Power"
1,Death Note: Rewrite,0.651258,"Supernatural, Suspense",Psychological
2,One Punch Man 2nd Season,0.673006,"Action, Comedy","Adult Cast, Parody, Super Power"
3,One Punch Man 3 Part 2,0.785443,"Action, Comedy","Adult Cast, Parody, Super Power"
4,One Punch Man 3,0.785671,"Action, Comedy","Adult Cast, Parody, Super Power"



[Strategy 2: Multi-Query]


Unnamed: 0,title,score,frequency,genres,themes
5,One Punch Man: Road to Hero,0.521727,1,"Action, Comedy","Parody, Super Power"
0,Death Note: Rewrite,0.453576,1,"Supernatural, Suspense",Psychological
6,One Punch Man 2nd Season,0.421587,1,"Action, Comedy","Adult Cast, Parody, Super Power"
7,One Punch Man 3 Part 2,0.285283,1,"Action, Comedy","Adult Cast, Parody, Super Power"
8,One Punch Man 3,0.28498,1,"Action, Comedy","Adult Cast, Parody, Super Power"
