# PoC: History-based Recommendation

## Goal
Prototype a recommendation system that takes a **list of recently watched anime** (the user history) and suggests similar content.
This moves beyond simple "Item-to-Item" recommendation to a "User-to-Item" approach.

## Methodologies

We will compare two distinct strategies to handle the user's history:

### Strategy 1: The Centroid Method (Mean Vector)
**The Average Taste**

- **Concept**: We assume a user can be represented by the *average* of what they watch. If you watch a *Horror* anime and a *Comedy* anime, your "user vector" sits somewhere in the middle (e.g., *Dark Comedy*).
- **Math**:
  1. Retrieve the TF-IDF feature vector $\vec{v}_i$ for each anime $i$ in the history.
  2. Compute the User Vector: $\vec{u} = \frac{1}{N} \sum_{i=1}^{N} \vec{v}_i$
  3. Find the $k$ nearest neighbors to $\vec{u}$ in the entire dataset.
- **Pros**: Fast (1 query), holistic.
- **Cons**: Can dilute specific tastes. The average of "Naruto" and "Your Lie in April" might be a generic show that fits neither perfectly.
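
The three steps above can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: the 3-dimensional vectors and the catalog names are made up, and cosine similarity is assumed as the closeness measure.

```python
import numpy as np

# Toy feature vectors (hypothetical values, not taken from the dataset).
catalog = {
    "horror_show": np.array([1.0, 0.0, 0.0]),
    "comedy_show": np.array([0.0, 1.0, 0.0]),
    "dark_comedy": np.array([0.7, 0.7, 0.0]),
    "sports_show": np.array([0.0, 0.0, 1.0]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

history = ["horror_show", "comedy_show"]

# Step 2: the user vector is the element-wise mean of the history vectors.
user_vector = np.mean([catalog[title] for title in history], axis=0)

# Step 3: rank every unseen title by similarity to the user vector.
ranked = sorted(
    (title for title in catalog if title not in history),
    key=lambda title: cosine_similarity(user_vector, catalog[title]),
    reverse=True,
)
print(ranked)
```

In this toy setup the mean of the horror and comedy vectors points straight at the "dark comedy" entry, which is the behavior the concept bullet describes.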

### Strategy 2: Multi-Query Aggregation
**The Eclectic Taste**

- **Definition**: "Eclectic" refers to a taste derived from a wide and diverse range of sources. This strategy assumes the user has multiple distinct interests rather than one single "average" preference.

- **Concept**: We respect that a user might have distinct, unrelated interests. We find recommendations for *each* show in history separately, then combine them.
- **Math**:
  1. For each anime $i$ in history, find its top $k$ neighbors $R_i = \{r_{i1}, r_{i2}, ...\}$.
  2. Pool all results $R_{total} = R_1 \cup R_2 \cup ...$
  3. Rank candidates by **Frequency** (how many history items voted for it?) and **Score**.
- **Pros**: Preserves distinct genres. Excellent for users with diverse taste.
- **Cons**: Slower ($N$ queries).
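
A small sketch of the pooling and voting steps, under stated assumptions: the neighbor lists are hard-coded stand-ins for real KNN results, and candidates are ranked by frequency first and mean similarity second. That is a simplification; the notebook itself blends these signals into one weighted score.

```python
from collections import defaultdict

# Hypothetical neighbor lists per history item: (candidate, distance) pairs.
neighbors_per_history_item = {
    "show_a": [("cand_x", 0.2), ("cand_y", 0.4)],
    "show_b": [("cand_y", 0.3), ("cand_z", 0.1)],
}

def aggregate(neighbor_lists):
    """Pool all neighbor lists and count how often each candidate appears."""
    stats = defaultdict(lambda: {"frequency": 0, "similarity_sum": 0.0})
    for results in neighbor_lists.values():
        for candidate, distance in results:
            stats[candidate]["frequency"] += 1
            stats[candidate]["similarity_sum"] += 1.0 - distance
    # Sort by vote count, breaking ties by mean similarity.
    return sorted(
        stats.items(),
        key=lambda kv: (kv[1]["frequency"], kv[1]["similarity_sum"] / kv[1]["frequency"]),
        reverse=True,
    )

ranking = [name for name, _ in aggregate(neighbors_per_history_item)]
print(ranking)
```

Here `cand_y` wins because two history items voted for it, even though `cand_z` is the single closest match.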

## Assumption: Popularity & Quality Bias
> **Assumption**: Users prefer "good" and "popular" shows over obscure ones, all else being equal. 
> **Refinement**: After finding similar anime (by distance), we re-rank the candidates using a **Hybrid Score**:
> - **Similarity** (Distance)
> - **Popularity** (Favorites count, log-scaled)
> - **Quality** (MyAnimeList Score)
> This ensures we recommend high-confidence hits.
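
Written out, the hybrid score is a weighted sum of terms that each lie in $[0, 1]$. The weights below are the ones the scoring helper in this notebook applies on the Centroid path; treat them as tunable choices rather than established values. Here $d$ is the neighbor distance, $f$ the favorites count, $m$ the MyAnimeList score, and $j$ ranges over the candidate pool.

$$
s_{\text{sim}} = 1 - \min(\max(d, 0), 1), \qquad
s_{\text{pop}} = \frac{\ln(1 + f)}{\max_j \ln(1 + f_j)}, \qquad
s_{\text{qual}} = \frac{m}{10}
$$

$$
\text{score} = 0.5\, s_{\text{sim}} + 0.3\, s_{\text{pop}} + 0.2\, s_{\text{qual}}
$$

On the Multi-Query path the same helper adds a frequency term $s_{\text{freq}}$ (vote count divided by the maximum vote count) and uses weights $0.4$, $0.2$, $0.2$, $0.2$ for similarity, frequency, popularity, and quality.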

## Assumption: Franchise Awareness
> **Assumption**: Users are generally aware of the franchises they are already watching. 
> If a user has *Naruto* in their history, recommending *Naruto Shippuden* is trivial and less valuable than recommending a *new* series like *Hunter x Hunter*. 
> **Refinement**: We explicitly filter out recommendations that contain the title of any anime in the history (e.g., exclude "One Piece: Movie 1" if "One Piece" is in history).
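
The filter can be sketched as a case-insensitive substring test. The helper name `is_franchise_duplicate` and the candidate list are illustrative assumptions.

```python
def is_franchise_duplicate(candidate_title, history_titles):
    """True if any history title appears inside the candidate title (case-insensitive)."""
    candidate = candidate_title.lower()
    return any(history_title.lower() in candidate for history_title in history_titles)

history = ["One Piece", "Naruto"]
candidates = ["One Piece: Movie 1", "Naruto Shippuden", "Hunter x Hunter"]

# Keep only candidates that do not contain a history title.
kept = [title for title in candidates if not is_franchise_duplicate(title, history)]
print(kept)
```

A substring test is cheap but coarse: a very short history title can over-filter unrelated shows, and a sequel whose name does not contain the original title slips through.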


In [70]:
import sys
import os
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.sparse import vstack

# Add the project root to sys.path so the src modules can be imported
current_dir = Path(os.getcwd()).resolve()
project_root = current_dir.parent
sys.path.append(str(project_root))

from src.pipeline.inference import load_models, load_processed_data

# Data Understanding

## Setup & Data Loading

In [71]:
knn, vectorizer = load_models()
df = load_processed_data()

2026-02-04 12:11:23,273 - inference - INFO - Loading models...


2026-02-04 12:11:23,439 - inference - INFO - Loading data from C:\Users\sayye\OneDrive\Documents\GitHub\AniMate\artifacts\vector_embeddings.pkl...


In [72]:
df.head(1)

Unnamed: 0,mal_id,title,english title,japanese title,episodes,release year,status,air date,genres,themes,...,favorites,content rating,source,duration,url,image url,score,stemmed_synopsis,producer,combined_features
0,16498,Shingeki no Kyojin,Attack on Titan,進撃の巨人,25.0,2013.0,Finished Airing,"Apr 7, 2013 to Sep 29, 2013","Action, Award Winning, Drama, Suspense","Gore, Military, Survival",...,186403,R - 17+ (violence & profanity),Manga,24 min per ep,https://myanimelist.net/anime/16498/Shingeki_n...,https://cdn.myanimelist.net/images/anime/10/47...,8.57,centuri ago mankind slaughter near extinct mon...,,Shingeki no Kyojin Attack on Titan 進撃の巨人 Actio...


# Data Preparation

## Helper Functions

We need a robust way to find the vector for a given anime title. 
Since our model works on `combined_features`, we must:
1. Find the dataframe row for the title.
2. Extract the `combined_features` text.
3. Transform it using the loaded `vectorizer`.
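
One caveat with step 1: a substring search returns the first row that contains the query, which may be a sequel or spin-off rather than the intended series. A hedged sketch of a lookup that prefers an exact match and falls back to substring matching is shown here; `find_title_row` and the toy DataFrame are hypothetical and not part of the project code.

```python
import pandas as pd

def find_title_row(title_query, dataframe, column="title"):
    """Return the first exact (case-insensitive) match, else the first
    substring match, else None."""
    titles = dataframe[column].fillna("")
    exact = dataframe[titles.str.lower() == title_query.lower()]
    if not exact.empty:
        return exact.iloc[0]
    partial = dataframe[titles.str.contains(title_query, case=False, regex=False)]
    return None if partial.empty else partial.iloc[0]

# Toy data: the spin-off is listed before the original series.
toy = pd.DataFrame({"title": ["Boruto: Naruto Next Generations", "Naruto", "Bleach"]})
row = find_title_row("naruto", toy)
print(row["title"])
```

With plain substring matching, the query "naruto" would resolve to the first row in this toy table; the exact-match pass picks the original series instead.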

In [73]:
def get_anime_vector(title_query, dataframe, vectorizer):
    """
    Finds an anime by title (substring match, original title first, then English title).
    Returns (tfidf_vector, matched_title), or (None, None) if not found.
    """
    # Case-insensitive search
    match = dataframe[dataframe['title'].str.contains(title_query, case=False, na=False, regex=False)]
    
    if match.empty:
        # Try English title
        match = dataframe[dataframe['english title'].str.contains(title_query, case=False, na=False, regex=False)]
        
    if match.empty:
        print(f"Warning: '{title_query}' not found in database.")
        return None, None
    
    # Take the first match
    row = match.iloc[0]
    text_feature = row['combined_features']
    
    # Vectorize
    vector = vectorizer.transform([text_feature])
    return vector, row['title']

In [74]:
def get_user_history_vectors(history_titles, df, vectorizer):
    """
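    Converts a list of titles into a sparse matrix of TF-IDF vectors (one row per matched title).
    Returns (matrix, matched_titles), or (None, []) if no title matches.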
    """
    vectors = []
    found_titles = []
    
    for title in history_titles:
        vec, true_title = get_anime_vector(title, df, vectorizer)
        if vec is not None:
            vectors.append(vec)
            found_titles.append(true_title)
            
    if not vectors:
        return None, []
        
    return vstack(vectors), found_titles

In [75]:
def calculate_hybrid_recommendation_score(candidates_dataframe):
    """
    Re-ranks recommendation candidates by a weighted sum of Similarity, Popularity, and Quality.
    Expects 'favorites_count' and 'myanimelist_score' columns, plus either 'distance'
    (Centroid) or 'similarity_sum' and 'frequency_count' (Multi-Query).
    """
    
    df_scored = candidates_dataframe.copy()
    
    # 1. Similarity Score (0 to 1)
    # 'distance' is lower-is-better. We convert to higher-is-better.
    # For Multi-Query, 'average_similarity_score' is 'similarity_sum' / 'frequency_count'.
    if 'average_similarity_score' not in df_scored.columns:
        if 'similarity_sum' in df_scored.columns and 'frequency_count' in df_scored.columns:
            df_scored['average_similarity_score'] = df_scored['similarity_sum'] / df_scored['frequency_count']
        elif 'distance' in df_scored.columns:
            df_scored['average_similarity_score'] = 1.0 - df_scored['distance'].clip(0, 1)
        else:
            df_scored['average_similarity_score'] = 0.0
    
    # 2. Popularity Score (Log-normalized Favorites)
    # We use log1p to handle power-law distribution (100 vs 100k favorites)
    raw_favorites_count = df_scored['favorites_count'].fillna(0)
    log_transformed_favorites = np.log1p(raw_favorites_count)
    max_log_favorites = log_transformed_favorites.max()
    
    # Avoid division by zero
    if max_log_favorites == 0:
        max_log_favorites = 1.0
        
    df_scored['normalized_popularity_score'] = log_transformed_favorites / max_log_favorites
    
    # 3. Quality Score (Normalized MAL Score)
    # Raw score is 0-10. We normalize to 0-1.
    raw_mal_score = df_scored['myanimelist_score'].fillna(0)
    df_scored['normalized_quality_score'] = raw_mal_score / 10.0
    
    # 4. Frequency Bonus (Only for Multi-Query)
    if 'frequency_count' in df_scored.columns:
        max_frequency = df_scored['frequency_count'].max()
        if max_frequency == 0:
            max_frequency = 1
        df_scored['normalized_frequency_bonus'] = df_scored['frequency_count'] / max_frequency
        
        # Weighted Sum for Multi-Query
        # 40% Sim + 20% Freq + 20% Pop + 20% Quality
        df_scored['final_hybrid_score'] = (
            0.4 * df_scored['average_similarity_score'] +
            0.2 * df_scored['normalized_frequency_bonus'] +
            0.2 * df_scored['normalized_popularity_score'] +
            0.2 * df_scored['normalized_quality_score']
        )
    else:
        # Weighted Sum for Centroid
        # 50% Sim + 30% Pop + 20% Quality
        df_scored['final_hybrid_score'] = (
            0.5 * df_scored['average_similarity_score'] +
            0.3 * df_scored['normalized_popularity_score'] +
            0.2 * df_scored['normalized_quality_score']
        )
        
    return df_scored.sort_values('final_hybrid_score', ascending=False)


# Modeling

## Implementation: Strategy 1 (Centroid)
 
We calculate $\vec{u} = \text{mean}(\text{vectors})$, then query the KNN model once.
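
One practical detail of the mean step: averaging a SciPy sparse matrix along an axis yields a dense `np.matrix` for the `*_matrix` classes, and scikit-learn estimators generally expect plain arrays. The sketch below uses toy values to show the conversion; the row contents are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

# Two toy "history" rows standing in for TF-IDF vectors.
rows = [csr_matrix([[1.0, 0.0, 2.0]]), csr_matrix([[3.0, 0.0, 0.0]])]
stacked = vstack(rows)

# Column-wise mean, converted to a 2-D ndarray of shape (1, n_features).
centroid = np.asarray(stacked.mean(axis=0))
print(centroid.shape)
```

Keeping the result two-dimensional matters because `kneighbors` expects one row per query.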

In [76]:
def recommend_centroid(history_titles, df, vectorizer, knn, top_k=5):
    vectors, found_titles = get_user_history_vectors(history_titles, df, vectorizer)
    
    if vectors is None:
        return pd.DataFrame()
    
    # Calculate Centroid
    # Averaging a sparse matrix may return np.matrix, so convert to a plain ndarray
    user_vector = np.asarray(vectors.mean(axis=0))
    
    # Fetch a larger pool than top_k so candidates survive filtering and re-ranking
    distances, indices = knn.kneighbors(user_vector, n_neighbors=50)
    
    candidate_anime_list = []
    for dist, idx in zip(distances[0], indices[0]):
        row = df.iloc[idx]
        anime_title = row['title']
        
        # Franchise Filtering
        is_franchise_duplicate = any(history_item.lower() in anime_title.lower() for history_item in found_titles)
        if anime_title in found_titles or is_franchise_duplicate:
            continue
            
        candidate_anime_list.append({
            'title': anime_title,
            'genres': row.get('genres'),
            'themes': row.get('themes'),
            'distance': dist,
            'favorites_count': row.get('favorites', 0),
            'myanimelist_score': row.get('score', 0),
            'strategy': 'Centroid'
        })
    
    if not candidate_anime_list:
        return pd.DataFrame()
        
    # Hybrid Scoring
    df_candidates = pd.DataFrame(candidate_anime_list)
    df_ranked = calculate_hybrid_recommendation_score(df_candidates)
    
    return df_ranked.head(top_k)


## Implementation: Strategy 2 (Multi-Query)

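We query for *each* history item, then aggregate.
Scoring rule, applied each time a candidate appears in a history item's neighbor list:
- Add 1 to its `frequency_count`.
- Add `1 - distance` to its `similarity_sum`, so closer matches count for more.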

In [77]:
def recommend_multiquery(history_titles, df, vectorizer, knn, top_k=5):
    vectors, found_titles = get_user_history_vectors(history_titles, df, vectorizer)
    
    if vectors is None:
        return pd.DataFrame()
    
    candidates_map = {}
    
    for i in range(vectors.shape[0]):
        vec = vectors.getrow(i)
        dists, idxs = knn.kneighbors(vec, n_neighbors=20)
        
        for dist, idx in zip(dists[0], idxs[0]):
            row = df.iloc[idx]
            anime_title = row['title']
            
            # Franchise Filtering
            is_franchise_duplicate = any(history_item.lower() in anime_title.lower() for history_item in found_titles)
            if anime_title in found_titles or is_franchise_duplicate:
                continue
                
            # Clip to [0, 1] so this matches the centroid path in the scoring helper
            similarity_score = 1.0 - min(max(dist, 0.0), 1.0)
            
            if anime_title not in candidates_map:
                candidates_map[anime_title] = {
                    'row': row, 
                    'similarity_sum': 0, 
                    'frequency_count': 0,
                    'min_distance': 1.0
                }
            
            candidates_map[anime_title]['similarity_sum'] += similarity_score
            candidates_map[anime_title]['frequency_count'] += 1
            candidates_map[anime_title]['min_distance'] = min(candidates_map[anime_title]['min_distance'], dist)
            
    if not candidates_map:
        return pd.DataFrame()
        
    candidate_anime_list = []
    for title, data in candidates_map.items():
        candidate_anime_list.append({
            'title': title,
            'genres': data['row'].get('genres'),
            'themes': data['row'].get('themes'),
            'similarity_sum': data['similarity_sum'],
            'frequency_count': data['frequency_count'],
            'best_distance': data['min_distance'],
            'favorites_count': data['row'].get('favorites', 0),
            'myanimelist_score': data['row'].get('score', 0),
            'strategy': 'Multi-Query'
        })
        
    df_candidates = pd.DataFrame(candidate_anime_list)
    df_ranked = calculate_hybrid_recommendation_score(df_candidates)
    
    return df_ranked.head(top_k)


# Evaluation

## Comparison Run

In [78]:
# TEST CASE 1: The "Action/Shounen" Fan
# Logic: Should recommend other high-stakes action shows
history_1 = ["Naruto", "One Piece", "Bleach"]

print(f"--- Recommendations for History: {history_1} ---")
print("\n[Strategy 1: Centroid]")
res_c = recommend_centroid(history_1, df, vectorizer, knn)
if not res_c.empty:
    display(res_c[['title', 'final_hybrid_score', 'normalized_popularity_score', 'normalized_quality_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")

print("\n[Strategy 2: Multi-Query]")
res_mq = recommend_multiquery(history_1, df, vectorizer, knn)
if not res_mq.empty:
    display(res_mq[['title', 'final_hybrid_score', 'frequency_count', 'normalized_popularity_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")


--- Recommendations for History: ['Naruto', 'One Piece', 'Bleach'] ---

[Strategy 1: Centroid]


Unnamed: 0,title,final_hybrid_score,normalized_popularity_score,normalized_quality_score,genres
1,Yume-iro Pâtissière SP Professional,0.571105,1.0,0.757,"Gourmet, Slice of Life"
0,Ichigo Mashimaro Encore,0.449081,0.580675,0.767,Slice of Life



[Strategy 2: Multi-Query]


Unnamed: 0,title,final_hybrid_score,frequency_count,normalized_popularity_score,genres
8,Soul Eater,0.628269,1,1.0,"Action, Comedy, Fantasy"
2,Tokyo Mew Mew,0.609132,1,0.72632,"Romance, Sci-Fi"
1,Yume-iro Pâtissière SP Professional,0.598952,1,0.500697,"Gourmet, Slice of Life"
4,Yume-iro Pâtissière,0.597738,1,0.692688,Gourmet
0,Ichigo Mashimaro Encore,0.573826,1,0.290742,Slice of Life


In [79]:
# TEST CASE 2: The "Mixed Taste" User
# Logic: Likes Psychological Thrillers AND Comedy. 
# Centroid might struggle. Multi-Query should find gems for both.
history_2 = ["Death Note", "One Punch Man"]

print(f"--- Recommendations for History: {history_2} ---")
print("\n[Strategy 1: Centroid]")
res_c = recommend_centroid(history_2, df, vectorizer, knn)
if not res_c.empty:
    display(res_c[['title', 'final_hybrid_score', 'normalized_popularity_score', 'normalized_quality_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")

print("\n[Strategy 2: Multi-Query]")
res_mq = recommend_multiquery(history_2, df, vectorizer, knn)
if not res_mq.empty:
    display(res_mq[['title', 'final_hybrid_score', 'frequency_count', 'normalized_popularity_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")


--- Recommendations for History: ['Death Note', 'One Punch Man'] ---

[Strategy 1: Centroid]


Unnamed: 0,title,final_hybrid_score,normalized_popularity_score,normalized_quality_score,genres
10,Sousou no Frieren,0.552879,1.0,0.928,"Adventure, Drama, Fantasy"
1,Death Parade,0.515686,0.908059,0.813,"Drama, Fantasy, Suspense"
27,Boku no Hero Academia,0.503526,0.962552,0.783,Action
21,Soul Eater,0.491163,0.908911,0.785,"Action, Comedy, Fantasy"
3,One Outs,0.466713,0.736585,0.832,"Sports, Suspense"



[Strategy 2: Multi-Query]


Unnamed: 0,title,final_hybrid_score,frequency_count,normalized_popularity_score,genres
1,Death Parade,0.644385,1,0.999062,"Drama, Fantasy, Suspense"
4,Soul Eater,0.623099,1,1.0,"Action, Comedy, Fantasy"
20,Boku no Hero Academia 6th Season,0.604372,1,0.864042,Action
19,One Outs,0.598275,1,0.810404,"Sports, Suspense"
26,Boku no Hero Academia 3rd Season,0.592318,1,0.915229,Action


In [80]:
# TEST CASE 3: Single Anime Input
# Logic: With one history item the centroid is that item's own vector, so both
# strategies query from the same point. Differences in the output come from the
# pool size (50 vs. 20 neighbors) and the scoring weights.
history_3 = ["Mashle"]

print(f"--- Recommendations for History: {history_3} ---")
print("\n[Strategy 1: Centroid]")
res_c = recommend_centroid(history_3, df, vectorizer, knn)
if not res_c.empty:
    display(res_c[['title', 'final_hybrid_score', 'normalized_popularity_score', 'normalized_quality_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")

print("\n[Strategy 2: Multi-Query]")
res_mq = recommend_multiquery(history_3, df, vectorizer, knn)
if not res_mq.empty:
    display(res_mq[['title', 'final_hybrid_score', 'frequency_count', 'normalized_popularity_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")


--- Recommendations for History: ['Mashle'] ---

[Strategy 1: Centroid]


Unnamed: 0,title,final_hybrid_score,normalized_popularity_score,normalized_quality_score,genres
37,Mahou Shoujo Madoka★Magica,0.553938,1.0,0.838,"Award Winning, Drama, Suspense"
9,Magi: The Kingdom of Magic,0.523081,0.815034,0.82,"Action, Adventure, Fantasy"
17,Dorohedoro,0.514839,0.838267,0.804,"Action, Comedy, Fantasy, Horror"
25,Mahouka Koukou no Rettousei,0.495004,0.847105,0.736,"Action, Fantasy, Romance, Sci-Fi"
31,Little Witch Academia (TV),0.493668,0.81842,0.78,"Adventure, Comedy, Fantasy"



[Strategy 2: Multi-Query]


Unnamed: 0,title,final_hybrid_score,frequency_count,normalized_popularity_score,genres
9,Magi: The Kingdom of Magic,0.655656,1,1.0,"Action, Adventure, Fantasy"
6,Rokudenashi Majutsu Koushi to Akashic Records,0.619064,1,0.910108,"Action, Fantasy"
10,Knight's & Magic,0.587303,1,0.782214,Fantasy
8,Mahou Shoujo Site,0.586199,1,0.808633,"Action, Drama, Horror, Suspense"
11,Quanzhi Fashi,0.578276,1,0.731767,"Action, Fantasy"


In [81]:
# TEST CASE 4: Multiple differing choices
# Logic: A long history of well-known titles, to check how both strategies handle many inputs
history_4 = ["Mashle", "Solo Leveling", "Demon Slayer", "Jujutsu Kaisen", "One Punch Man", "One Piece", "Naruto", "Bleach"]

print(f"--- Recommendations for History: {history_4} ---")
print("\n[Strategy 1: Centroid]")
res_c = recommend_centroid(history_4, df, vectorizer, knn)
if not res_c.empty:
    display(res_c[['title', 'final_hybrid_score', 'normalized_popularity_score', 'normalized_quality_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")

print("\n[Strategy 2: Multi-Query]")
res_mq = recommend_multiquery(history_4, df, vectorizer, knn)
if not res_mq.empty:
    display(res_mq[['title', 'final_hybrid_score', 'frequency_count', 'normalized_popularity_score', 'genres']])
else:
    print("No recommendations found (all filtered by franchise logic).")


--- Recommendations for History: ['Mashle', 'Solo Leveling', 'Demon Slayer', 'Jujutsu Kaisen', 'One Punch Man', 'One Piece', 'Naruto', 'Bleach'] ---

[Strategy 1: Centroid]


Unnamed: 0,title,final_hybrid_score,normalized_popularity_score,normalized_quality_score,genres
4,Gensoumaden Saiyuuki,0.54168,1.0,0.753,"Action, Adventure, Drama, Fantasy"
1,Kaiko sareta Ankoku Heishi (30-dai) no Slow na...,0.52771,0.953253,0.7,Fantasy
3,Yuusha ni Narenakatta Ore wa Shibushibu Shuush...,0.515823,0.951197,0.679,"Comedy, Fantasy, Romance, Ecchi"
0,Mahoujin Guruguru (2017),0.508645,0.822208,0.779,"Adventure, Comedy, Fantasy"
2,Hunter x Hunter Movie 2: The Last Mission,0.503278,0.870062,0.729,"Action, Adventure, Fantasy"



[Strategy 2: Multi-Query]


Unnamed: 0,title,final_hybrid_score,frequency_count,normalized_popularity_score,genres
19,Hunter x Hunter (2011),0.720578,1,1.0,"Action, Adventure, Fantasy"
18,Hunter x Hunter,0.661294,1,0.761957,"Action, Adventure, Fantasy"
23,Dungeon Meshi,0.611351,1,0.760866,"Adventure, Comedy, Fantasy, Gourmet"
9,Magi: The Kingdom of Magic,0.600998,1,0.72671,"Action, Adventure, Fantasy"
76,Soul Eater,0.595368,1,0.835496,"Action, Comedy, Fantasy"
