# Recommendation System

## Data Description:

- `anime_id`: Unique ID of each anime.
- `name`: Anime title.
- `type`: Anime broadcast type, such as TV, OVA, etc.
- `genre`: Anime genres, stored as a comma-separated string.
- `episodes`: The number of episodes of each anime.
- `rating`: The average rating of each anime, taken over the users who rated it.
- `members`: Number of community members for each anime.
                           
## Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
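
As a minimal illustration of the measure named in the objective, cosine similarity between two feature vectors is their dot product divided by the product of their norms. The sketch below uses made-up one-hot genre vectors (`anime_a`, `anime_b`, `anime_c` are hypothetical, not rows from the dataset), and the zero-vector convention is an assumption of this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Convention chosen here: an all-zero vector has similarity 0 to everything
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical one-hot genre vectors over [action, comedy, drama, romance]
anime_a = [1, 0, 1, 0]
anime_b = [1, 0, 1, 1]
anime_c = [0, 1, 0, 0]

sim_ab = cosine_similarity(anime_a, anime_b)  # two shared genres
sim_ac = cosine_similarity(anime_a, anime_c)  # no shared genres
print(round(sim_ab, 4), sim_ac)
```

Items that share most of their genres score close to 1, while items with no genre in common score 0.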

## Dataset:
Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

In [17]:
# Imports and configuration
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from scipy import sparse
from collections import Counter
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

## Step 1: Data Preprocessing:

In [18]:
# load dataset and basic inspection
csv_path = Path('anime.csv')
assert csv_path.exists(), f"CSV not found at {csv_path}"
df = pd.read_csv(csv_path)
print("Shape:", df.shape)
display(df.head())
print("\nColumns:", df.columns.tolist())
print("\nMissing values per column:\n", df.isnull().sum())

Shape: (12294, 7)


| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |



Columns: ['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

Missing values per column:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [19]:
# standardize column names and fix dtypes
df = df.copy()
# Normalize column names by stripping
df.columns = [c.strip() for c in df.columns]

# Map common column name variants to expected names
col_map = {}
for c in df.columns:
    c_low = c.lower()
    if c_low in ['name', 'title', 'anime_title']:
        col_map[c] = 'name'
    if c_low == 'type':
        col_map[c] = 'type'
    if 'genre' in c_low:
        col_map[c] = 'genre'
    if 'episode' in c_low:
        col_map[c] = 'episodes'
    if c_low in ['rating','score','avg_rating']:
        col_map[c] = 'rating'
    if 'member' in c_low:
        col_map[c] = 'members'
df = df.rename(columns=col_map)

# Ensure expected columns exist (if missing, create placeholders)
for c in ['name','type','genre','episodes','rating','members']:
    if c not in df.columns:
        df[c] = np.nan

# Convert numeric columns
for col in ['episodes','rating','members']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop rows without name
df = df.dropna(subset=['name']).reset_index(drop=True)

# Fill genre missing with empty string
df['genre'] = df['genre'].fillna('')

# Fill numeric missing with median
for col in ['episodes','rating','members']:
    med = df[col].median()
    df[col] = df[col].fillna(med)

print("After cleaning - shape:", df.shape)
df[['name','type','genre','episodes','rating','members']].head()

After cleaning - shape: (12294, 7)


| | name | type | genre | episodes | rating | members |
|---|---|---|---|---|---|---|
| 0 | Kimi no Na wa. | Movie | Drama, Romance, School, Supernatural | 1.0 | 9.37 | 200630 |
| 1 | Fullmetal Alchemist: Brotherhood | TV | Action, Adventure, Drama, Fantasy, Magic, Mili... | 64.0 | 9.26 | 793665 |
| 2 | Gintama° | TV | Action, Comedy, Historical, Parody, Samurai, S... | 51.0 | 9.25 | 114262 |
| 3 | Steins;Gate | TV | Sci-Fi, Thriller | 24.0 | 9.17 | 673572 |
| 4 | Gintama&#039; | TV | Action, Comedy, Historical, Parody, Samurai, S... | 51.0 | 9.16 | 151266 |


## Step 2: Feature Extraction

In [20]:
# parse genres into lists
def split_genres(g):
    if not isinstance(g, str) or g.strip() == '':
        return []
    # common separators
    for sep in [',','/','|',';']:
        if sep in g:
            return [x.strip().lower() for x in g.split(sep) if x.strip()!='']
    # fallback
    return [g.strip().lower()]

df['genre_list'] = df['genre'].apply(split_genres)

# Quick check
print("Example genres:", df['genre_list'].iloc[:6].tolist())

Example genres: [['drama', 'romance', 'school', 'supernatural'], ['action', 'adventure', 'drama', 'fantasy', 'magic', 'military', 'shounen'], ['action', 'comedy', 'historical', 'parody', 'samurai', 'sci-fi', 'shounen'], ['sci-fi', 'thriller'], ['action', 'comedy', 'historical', 'parody', 'samurai', 'sci-fi', 'shounen'], ['comedy', 'drama', 'school', 'shounen', 'sports']]


In [21]:
# build unique genres and sparse one-hot matrix
genre_counts = Counter([g for sub in df['genre_list'] for g in sub])
unique_genres = sorted(genre_counts.keys())
print("Unique genres:", len(unique_genres))

# Build sparse matrix rows: indices and data
rows = []
cols = []
data = []
genre_to_idx = {g:i for i,g in enumerate(unique_genres)}

for r, gl in enumerate(df['genre_list']):
    for g in gl:
        if g in genre_to_idx:
            rows.append(r)
            cols.append(genre_to_idx[g])
            data.append(1)

num_items = len(df)
num_genres = len(unique_genres)
genre_sparse = sparse.csr_matrix((data, (rows, cols)), shape=(num_items, num_genres), dtype=np.float32)

print("Genre sparse shape:", genre_sparse.shape)

Unique genres: 43
Genre sparse shape: (12294, 43)
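
The manual loop above builds the one-hot genre matrix by hand. As an alternative sketch (not the approach used in this notebook), scikit-learn's `MultiLabelBinarizer` with `sparse_output=True` produces an equivalent sparse matrix directly from lists of labels. The `toy_genre_lists` data below is hypothetical and only mimics the shape of `df['genre_list']`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical parsed genre lists, shaped like df['genre_list']
toy_genre_lists = [
    ['drama', 'romance'],
    ['action', 'drama'],
    [],                      # an item with no genre yields an all-zero row
]

mlb = MultiLabelBinarizer(sparse_output=True)
toy_genre_sparse = mlb.fit_transform(toy_genre_lists).tocsr().astype('float32')

print(mlb.classes_.tolist())        # sorted genre vocabulary
print(toy_genre_sparse.toarray())   # one row per item, one column per genre
```

Items with an empty genre list end up as all-zero rows, which matters later because cosine similarity is undefined for a zero vector.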


In [22]:
# scale numeric features and create numeric sparse matrix
num_cols = ['rating', 'members', 'episodes']
scaler = MinMaxScaler()
numeric_vals = scaler.fit_transform(df[num_cols].values.astype(float))
numeric_sparse = sparse.csr_matrix(numeric_vals)  # shape (n_items, 3)

print("Numeric features scaled. Example (first 5 rows):")
pd.DataFrame(numeric_vals[:5], columns=num_cols)

Numeric features scaled. Example (first 5 rows):


| | rating | members | episodes |
|---|---|---|---|
| 0 | 0.92437 | 0.197872 | 0.0 |
| 1 | 0.911164 | 0.78277 | 0.034673 |
| 2 | 0.909964 | 0.112689 | 0.027518 |
| 3 | 0.90036 | 0.664325 | 0.012658 |
| 4 | 0.89916 | 0.149186 | 0.027518 |
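
Min-max scaling maps each column to the range [0, 1] with `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a small made-up array (`toy` is hypothetical and not taken from `anime.csv`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical [rating, members, episodes] rows, not taken from anime.csv
toy = np.array([
    [9.0, 1000.0, 1.0],
    [7.0,  500.0, 13.0],
    [5.0,    0.0, 25.0],
])

# Manual min-max scaling: (x - column_min) / (column_max - column_min)
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(toy)
print(np.allclose(manual, scaled))
```

Because the scale is set by the column extremes, a few very large values compress everything else toward 0. In the output above, a 64-episode series maps to only about 0.035 on the `episodes` column, which suggests that a small number of very long series set the maximum.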


In [23]:
# concatenate genre sparse and numeric sparse horizontally
X_sparse = sparse.hstack([genre_sparse, numeric_sparse], format='csr')
print("Combined feature matrix shape:", X_sparse.shape)

Combined feature matrix shape: (12294, 46)


In [24]:
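# reduce dimensionality if #features is large
# SVD only runs when the feature count exceeds n_components. With the 46 features
# built above and n_components = 100, the condition below is False and SVD is skipped.
# Set reduce_dim=False to skip SVD explicitly.
reduce_dim = True
n_components = 100  # tune: 50-200 when there are more features than this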
if reduce_dim and X_sparse.shape[1] > n_components:
    print("Applying TruncatedSVD to reduce dims to", n_components)
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    X_reduced = svd.fit_transform(X_sparse)
    # X_reduced is dense but small dims
    features_matrix = X_reduced
    print("Reduced shape:", features_matrix.shape)
else:
    features_matrix = X_sparse
    print("Using sparse features (no reduction).")

Using sparse features (no reduction).
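
Since the SVD branch is not exercised on this dataset, here is a small sketch of what it would do on a wider matrix. The data is random and hypothetical (`toy_sparse`, `n_comp` are illustrative names), so the printed variance figure says nothing about the anime features.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical sparse feature matrix: 300 items x 40 binary features
toy_sparse = sparse.csr_matrix((rng.random((300, 40)) < 0.1).astype(np.float32))

n_comp = 10  # kept below the number of features
toy_svd = TruncatedSVD(n_components=n_comp, random_state=42)
toy_reduced = toy_svd.fit_transform(toy_sparse)  # dense array, shape (300, 10)

# Fraction of variance kept by the retained components
kept = toy_svd.explained_variance_ratio_.sum()
print(toy_reduced.shape, round(float(kept), 3))
```

Checking `explained_variance_ratio_` is one way to choose `n_components`: if the retained fraction is low, neighbors found in the reduced space may differ noticeably from those in the original space.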


## Step 3: Recommendation System

In [25]:
# fit NearestNeighbors on features_matrix
# Choose K large enough to cover top_k queries during evaluation (e.g., 50)
K = 50
# NearestNeighbors accepts both sparse and dense input, so one code path suffices
nbrs = NearestNeighbors(n_neighbors=K, metric='cosine', algorithm='brute', n_jobs=-1)
nbrs.fit(features_matrix)

print("NearestNeighbors fitted. K =", K)

NearestNeighbors fitted. K = 50


In [26]:
# helper to get top-k neighbors and convert distances to similarity
def get_topk_neighbors_by_idx(idx, top_k=10, include_self=False):
    # Query point
    if sparse.issparse(features_matrix):
        query = features_matrix[idx]
    else:
        query = features_matrix[idx].reshape(1, -1)
    # request top_k +1 to account for self usually included
    neigh_count = top_k + (0 if include_self else 1)
    dists, idxs = nbrs.kneighbors(query, n_neighbors=neigh_count)
    dists = np.asarray(dists).flatten()
    idxs = np.asarray(idxs).flatten()
    if not include_self:
        mask = idxs != idx
        dists = dists[mask][:top_k]
        idxs = idxs[mask][:top_k]
    else:
        dists = dists[:top_k]
        idxs = idxs[:top_k]
    sims = (1.0 - dists).tolist()  # cosine similarity = 1 - cosine distance
    return list(zip(idxs.tolist(), sims))
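
The helper above relies on the identity cosine distance = 1 - cosine similarity. The sketch below verifies it on a tiny made-up matrix (`toy_features` is hypothetical) by comparing the values returned by `NearestNeighbors` with a direct computation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature rows (e.g. one-hot genres), not from the dataset
toy_features = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

toy_nbrs = NearestNeighbors(n_neighbors=3, metric='cosine', algorithm='brute')
toy_nbrs.fit(toy_features)
dists, idxs = toy_nbrs.kneighbors(toy_features[[0]])

# Cosine distance is defined as 1 - cosine similarity, so invert it back
sims = 1.0 - dists.ravel()

# Direct computation of the similarity between row 0 and every row
q = toy_features[0]
direct = toy_features @ q / (np.linalg.norm(toy_features, axis=1) * np.linalg.norm(q))
print(idxs.ravel().tolist(), np.round(sims, 4).tolist())
```

The query row is returned as its own nearest neighbor with similarity 1, which is why the helper requests one extra neighbor and filters out `idx` when `include_self` is False.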

In [27]:
# recommend function using neighbors
# Build name->index map (first occurrence)
name_to_idx = {}
for i, name in enumerate(df['name']):
    if name not in name_to_idx:
        name_to_idx[name] = i

def recommend(anime_name, top_n=10, threshold=None, include_self=False):
    if anime_name not in name_to_idx:
        raise ValueError(f"Anime '{anime_name}' not found")
    idx = name_to_idx[anime_name]
    neigh = get_topk_neighbors_by_idx(idx, top_k=max(top_n, 10), include_self=include_self)
    recs = []
    for i, score in neigh:
        if threshold is not None and score < threshold:
            continue
        recs.append({
            'name': df.loc[i, 'name'],
            'score': score,
            'genre': df.loc[i, 'genre'],
            'type': df.loc[i, 'type'],
            'index': int(i)
        })
        if len(recs) >= top_n:
            break
    return recs

# Quick test
sample_name = df['name'].iloc[0]
print("Sample:", sample_name)
for r in recommend(sample_name, top_n=5):
    print(r)

Sample: Kimi no Na wa.
{'name': 'Wind: A Breath of Heart OVA', 'score': 0.9835005587751349, 'genre': 'Drama, Romance, School, Supernatural', 'type': 'OVA', 'index': 5805}
{'name': 'Wind: A Breath of Heart (TV)', 'score': 0.9818171806813046, 'genre': 'Drama, Romance, School, Supernatural', 'type': 'TV', 'index': 6394}
{'name': 'Aura: Maryuuin Kouga Saigo no Tatakai', 'score': 0.8986286494013752, 'genre': 'Comedy, Drama, Romance, School, Supernatural', 'type': 'Movie', 'index': 1111}
{'name': 'Clannad: After Story - Mou Hitotsu no Sekai, Kyou-hen', 'score': 0.8891018431630309, 'genre': 'Drama, Romance, School', 'type': 'Special', 'index': 504}
{'name': 'Kokoro ga Sakebitagatterunda.', 'score': 0.8883236993579294, 'genre': 'Drama, Romance, School', 'type': 'Movie', 'index': 208}


## Step 4: Evaluation

In [28]:
# train-test split (indices only)
indices = np.arange(len(df))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)
print("Train:", len(train_idx), "Test:", len(test_idx))

Train: 9835 Test: 2459


In [29]:
# Batch query neighbors for all test items at once (fast)
top_k = 50   # must be >= largest k used in evaluation
test_matrix = features_matrix[test_idx]  # row indexing works for sparse and dense input
dists_batch, idxs_batch = nbrs.kneighbors(test_matrix, n_neighbors=top_k, return_distance=True)
sims_batch = 1.0 - dists_batch
print("Batch shapes: sims:", sims_batch.shape, "idxs:", idxs_batch.shape)

Batch shapes: sims: (2459, 50) idxs: (2459, 50)
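
The batch query above runs against an index fitted on all items, so test items (including the query itself) can appear among the neighbors and have to be filtered out afterwards. An alternative protocol, sketched here on random hypothetical data rather than the anime features, fits the index on training rows only so that every returned neighbor is a training item and exactly `k` candidates are available per query.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
toy_X = rng.random((200, 8))               # hypothetical feature matrix
toy_indices = np.arange(len(toy_X))
toy_train, toy_test = train_test_split(toy_indices, test_size=0.2, random_state=42)

k = 10
# Index only the training rows, so test items cannot be returned as neighbors
train_nbrs = NearestNeighbors(n_neighbors=k, metric='cosine', algorithm='brute')
train_nbrs.fit(toy_X[toy_train])

dists, local_idxs = train_nbrs.kneighbors(toy_X[toy_test])
# kneighbors returns positions within the training subset; map back to row ids
neighbor_ids = toy_train[local_idxs]
sims = 1.0 - dists
print(neighbor_ids.shape)
```

The trade-off is a second fitted index, in exchange for not needing a post-hoc train-set mask during evaluation.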


In [30]:
# Compute precision/recall/F1@k in a batched manner using sparse ops
train_set = set(train_idx.tolist())
n_test = len(test_idx)
top_k_requested = idxs_batch.shape[1]

# Pre-slice train genre matrix and map train index -> position
train_genre_mat = genre_sparse[train_idx]   # shape (n_train, n_genres)
train_pos = {idx: pos for pos, idx in enumerate(train_idx)}

# Precompute ground_truth_total (#train items sharing at least one genre) for each test item
ground_truth_total = np.zeros(n_test, dtype=int)
for ti_pos, ti in enumerate(test_idx):
    tvec = genre_sparse[ti]               # (1, n_genres)
    prod = tvec.dot(train_genre_mat.T)   # (1, n_train)
    ground_truth_total[ti_pos] = prod.getnnz()

eval_ks = [5, 10]  # Ks to evaluate
results_accum = {f'precision_at_{k}': [] for k in eval_ks}
results_accum.update({f'recall_at_{k}': [] for k in eval_ks})
results_accum.update({f'f1_at_{k}': [] for k in eval_ks})

for ti_pos, ti in enumerate(tqdm(test_idx, desc="Computing metrics per test item")):
    neighbor_idxs = idxs_batch[ti_pos]   # array length top_k_requested
    neighbor_sims = sims_batch[ti_pos]

    # Keep only neighbors that are in train set
    keep_mask = np.array([ (int(n) in train_set) for n in neighbor_idxs ], dtype=bool)
    if not keep_mask.any():
        continue
    candidate_idxs = neighbor_idxs[keep_mask]
    # compute overlap between test item and each candidate quickly
    cand_genre_mat = genre_sparse[candidate_idxs]   # (cand_count, n_genres)
    tvec = genre_sparse[ti]                         # (1, n_genres)
    overlap = tvec.dot(cand_genre_mat.T).toarray().ravel()  # small dense array
    relevant_mask = overlap > 0

    for k in eval_ks:
        topk_mask = np.arange(len(candidate_idxs)) < k
        if not topk_mask.any():
            continue
        chosen_relevant = relevant_mask[topk_mask].sum()
        denom = max(1, topk_mask.sum())
        precision = chosen_relevant / denom
        gt_total = ground_truth_total[ti_pos]
        if gt_total == 0:
            continue
        recall = chosen_relevant / gt_total
        f1 = 0.0 if (precision + recall) == 0 else 2 * precision * recall / (precision + recall)

        results_accum[f'precision_at_{k}'].append(precision)
        results_accum[f'recall_at_{k}'].append(recall)
        results_accum[f'f1_at_{k}'].append(f1)

# Aggregate means
agg_metrics = {}
for k in eval_ks:
    agg_metrics[f'precision_at_{k}'] = float(np.mean(results_accum[f'precision_at_{k}'])) if results_accum[f'precision_at_{k}'] else 0.0
    agg_metrics[f'recall_at_{k}'] = float(np.mean(results_accum[f'recall_at_{k}'])) if results_accum[f'recall_at_{k}'] else 0.0
    agg_metrics[f'f1_at_{k}'] = float(np.mean(results_accum[f'f1_at_{k}'])) if results_accum[f'f1_at_{k}'] else 0.0
    agg_metrics[f'tested_cases_at_{k}'] = len(results_accum[f'precision_at_{k}'])

agg_metrics

Computing metrics per test item: 100%|██████████| 2459/2459 [00:01<00:00, 2162.65it/s]


{'precision_at_5': 1.0,
 'recall_at_5': 0.002579103299120337,
 'f1_at_5': 0.005116536101771541,
 'tested_cases_at_5': 2447,
 'precision_at_10': 1.0,
 'recall_at_10': 0.005158206598240674,
 'f1_at_10': 0.010153473860954861,
 'tested_cases_at_10': 2447}

In [31]:
# display metrics
for k in [5, 10]:
    print(f"=== Metrics @ {k} ===")
    print("Precision:", agg_metrics.get(f'precision_at_{k}', 0.0))
    print("Recall   :", agg_metrics.get(f'recall_at_{k}', 0.0))
    print("F1       :", agg_metrics.get(f'f1_at_{k}', 0.0))
    print("Tested cases:", agg_metrics.get(f'tested_cases_at_{k}', 0))
    print()

=== Metrics @ 5 ===
Precision: 1.0
Recall   : 0.002579103299120337
F1       : 0.005116536101771541
Tested cases: 2447

=== Metrics @ 10 ===
Precision: 1.0
Recall   : 0.005158206598240674
F1       : 0.010153473860954861
Tested cases: 2447



In [32]:
# convenience helper to get recommendations by row index instead of by name
def recommend_by_index(idx, top_n=10, threshold=None):
    neigh = get_topk_neighbors_by_idx(idx, top_k=max(top_n, 10), include_self=False)
    recs = []
    for i, score in neigh:
        if threshold is not None and score < threshold:
            continue
        recs.append((int(i), df.loc[i,'name'], score, df.loc[i,'genre']))
        if len(recs) >= top_n:
            break
    return recs

# Example: recommend for first 3 anime
for i in range(3):
    print("-- For:", df.loc[i,'name'])
    print(recommend_by_index(i, top_n=5))
    print()

-- For: Kimi no Na wa.
[(5805, 'Wind: A Breath of Heart OVA', 0.9835005587751349, 'Drama, Romance, School, Supernatural'), (6394, 'Wind: A Breath of Heart (TV)', 0.9818171806813046, 'Drama, Romance, School, Supernatural'), (1111, 'Aura: Maryuuin Kouga Saigo no Tatakai', 0.8986286494013752, 'Comedy, Drama, Romance, School, Supernatural'), (504, 'Clannad: After Story - Mou Hitotsu no Sekai, Kyou-hen', 0.8891018431630309, 'Drama, Romance, School'), (208, 'Kokoro ga Sakebitagatterunda.', 0.8883236993579294, 'Drama, Romance, School')]

-- For: Fullmetal Alchemist: Brotherhood
[(200, 'Fullmetal Alchemist', 0.9403031915572548, 'Action, Adventure, Comedy, Drama, Fantasy, Magic, Military, Shounen'), (1558, 'Fullmetal Alchemist: The Sacred Star of Milos', 0.9096753371121155, 'Action, Adventure, Comedy, Drama, Fantasy, Magic, Military, Shounen'), (402, 'Fullmetal Alchemist: Brotherhood Specials', 0.904958561123503, 'Adventure, Drama, Fantasy, Magic, Military, Shounen'), (268, 'Magi: The Labyrinth

## Conclusion
## Performance Analysis of the Recommendation System

The recommendation system implemented in this project is a content-based filtering model that uses anime genres and numeric features such as rating, member count, and number of episodes. Cosine similarity with the Nearest Neighbors method was used to efficiently identify similar anime. The performance was evaluated using Precision@K, Recall@K, and F1-Score@K on a train-test split.

- Precision@K: Measures how many of the top *K* recommended anime are actually relevant.
- Recall@K: Measures how many relevant anime were successfully recommended.
- F1-Score@K: Provides a balanced measure of both Precision and Recall.
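
The three definitions above can be sketched for a single query as follows. The function name and the item ids are hypothetical; this is a simplified per-query version, not the batched code used in the evaluation cell.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Compute Precision@K, Recall@K and F1@K for one query.

    recommended: ranked list of item ids (best first)
    relevant:    collection of item ids considered relevant
    """
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical ids: 3 of the top 5 are relevant, out of 6 relevant items overall
p, r, f1 = precision_recall_f1_at_k([10, 11, 12, 13, 14], {10, 12, 14, 20, 21, 22}, k=5)
print(p, r, round(f1, 4))
```

Recall is bounded above by K divided by the number of relevant items, so a very large relevant set keeps recall small even when every recommendation is a hit.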

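These metrics are suitable for recommendation systems where *relevance* matters more than exact rating prediction.

On the 2,447 test items that could be scored, Precision@5 and Precision@10 are both 1.0, while Recall@5 is about 0.0026 and Recall@10 about 0.0052. Both results follow from the relevance definition used in the evaluation: a training item counts as relevant if it shares at least one genre with the test item. Because the feature vectors are mostly genre indicators, the nearest neighbors almost always share a genre, so precision is trivially perfect. The same loose definition makes the set of relevant training items very large, and since at most K of them can appear in a top-K list, recall stays small (Recall@10 is exactly twice Recall@5). A stricter relevance criterion would likely make these metrics more informative.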