# Anime Recommendation System (Cosine Similarity)

This notebook implements a content-based recommendation system using cosine similarity on the provided `anime.csv` dataset.

Workflow:
1. Data loading & EDA
2. Preprocessing & feature engineering (genres, numeric features)
3. Build cosine-similarity based recommender
4. Evaluate recommendations (a rough precision@k check based on genre overlap)


In [3]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 50)
print('Imports ready')

Imports ready


In [5]:
# Load the dataset
file_path = r'D:\DATA-SCIENCE\ASSIGNMENTS\11 recommendation system\anime.csv'
try:
    df = pd.read_csv(file_path)
    print('Loaded:', file_path)
except Exception as e:
    raise SystemExit(f"Could not load the dataset at {file_path}: {e}")

print('Shape:', df.shape)
df.head()

Loaded: D:\DATA-SCIENCE\ASSIGNMENTS\11 recommendation system\anime.csv
Shape: (12294, 7)


| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
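
The last sample row contains an HTML entity in the title (`Gintama&#039;`), which suggests some names were stored without decoding. A minimal sketch of decoding such entities with the standard library follows. `clean_title` is a hypothetical helper, not part of the notebook, and whether to apply it depends on whether later title lookups should use the raw or the decoded name.

```python
import html

def clean_title(raw_title: str) -> str:
    """Decode HTML entities (e.g. &#039; becomes ') and trim surrounding whitespace."""
    return html.unescape(raw_title).strip()

# Titles taken from the sample rows above
sample_titles = ["Gintama&#039;", "Steins;Gate", "Kimi no Na wa."]
cleaned = [clean_title(t) for t in sample_titles]
print(cleaned)
```

In the notebook this could be applied as `df['name'] = df['name'].map(clean_title)` before the title index is built, assuming every value in `name` is a string.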


## 1 — EDA

Inspect basic structure, missing values, and distributions. Identify columns we'll use for similarity (e.g., `genre`, `type`, `episodes`, `rating`, `members`).

In [6]:
# Basic info
print(df.info())
print('\nMissing values per column:\n', df.isnull().sum())

df.describe(include='all').T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None

Missing values per column:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anime_id | 12294.0 | | | | 14058.221653 | 11455.294701 | 1.0 | 3484.25 | 10260.5 | 24794.5 | 34527.0 |
| name | 12294.0 | 12292.0 | Shi Wan Ge Leng Xiaohua | 2.0 | | | | | | | |
| genre | 12232.0 | 3264.0 | Hentai | 823.0 | | | | | | | |
| type | 12269.0 | 6.0 | TV | 3787.0 | | | | | | | |
| episodes | 12294.0 | 187.0 | 1 | 5677.0 | | | | | | | |
| rating | 12064.0 | | | | 6.473902 | 1.026746 | 1.67 | 5.88 | 6.57 | 7.18 | 10.0 |
| members | 12294.0 | | | | 18071.338864 | 54820.676925 | 5.0 | 225.0 | 1550.0 | 9437.0 | 1013917.0 |


## 2 — Preprocessing & Feature Engineering

We'll:
- Fill or drop missing values sensibly.
- Use `genre` (text) as main content feature via TF-IDF on genre strings.
- Optionally combine numeric features (rating, members, episodes) scaled and appended to TF-IDF vectors.

In [7]:
# Simple preprocessing
# Fill missing genres with an empty string (lowercasing happens later, when `content` is built)
if 'genre' in df.columns:
    df['genre'] = df['genre'].fillna('')
else:
    df['genre'] = ''

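# Ensure numeric columns exist and really are numeric.
# `episodes` is read as object dtype, so non-numeric entries are coerced to NaN first.
for col in ['rating', 'members', 'episodes']:
    if col not in df.columns:
        df[col] = np.nan
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Fill numeric missing values with the median (simple approach).
# Assign the result back instead of using inplace=True on a column selection.
for col in ['rating', 'members', 'episodes']:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())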

# Create a combined text field for content-based features
# We'll combine genre and type (if exists)
if 'type' in df.columns:
    df['type'] = df['type'].fillna('').astype(str)
else:
    df['type'] = ''

# regex=False treats '|' as a literal character rather than regex alternation
df['content'] = (df['genre'].astype(str) + ' ' + df['type'].astype(str)).str.lower().str.replace('|', ' ', regex=False)

print('Prepared content field. Sample:')
df[['name','genre','type','content']].head() if 'name' in df.columns else df[['genre','type','content']].head()

Prepared content field. Sample:


| | name | genre | type | content |
|---|---|---|---|---|
| 0 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | drama, romance, school, supernatural movie |
| 1 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | action, adventure, drama, fantasy, magic, mili... |
| 2 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | action, comedy, historical, parody, samurai, s... |
| 3 | Steins;Gate | Sci-Fi, Thriller | TV | sci-fi, thriller tv |
| 4 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | action, comedy, historical, parody, samurai, s... |


### TF-IDF Vectorization (genres + type)

In [8]:
# TF-IDF on content
vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['content'])
print('TF-IDF matrix shape:', tfidf_matrix.shape)

TF-IDF matrix shape: (12294, 872)
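
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on genre strings. It is not the notebook's vectorizer. It assumes comma-separated genre tokens (the notebook uses word-level tokens plus bigrams) and applies the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation, which mirrors scikit-learn's documented defaults. The helper names are hypothetical.

```python
import math
from collections import Counter

def genre_tokens(genre_string):
    """Split a comma-separated genre string into trimmed, lowercase tokens."""
    return [g.strip().lower() for g in genre_string.split(",") if g.strip()]

def tfidf_rows(documents):
    """Return one {token: weight} dict per document, L2-normalised."""
    tokenised = [genre_tokens(d) for d in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

def cosine(row_a, row_b):
    """Cosine similarity of two L2-normalised sparse rows (a plain dot product)."""
    return sum(w * row_b.get(t, 0.0) for t, w in row_a.items())

docs = [
    "Drama, Romance, School, Supernatural",
    "Drama, Romance, School",
    "Sci-Fi, Thriller",
]
rows = tfidf_rows(docs)
print(round(cosine(rows[0], rows[1]), 3))  # three shared genres
print(cosine(rows[0], rows[2]))            # no shared genre
```

Splitting on commas keeps "sci-fi" as one token, whereas the default word tokenizer splits it at the hyphen. With scikit-learn, a comparable effect could be obtained by passing a custom callable through the `tokenizer` parameter of `TfidfVectorizer`.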


### Optionally incorporate numeric features
We scale numeric features and append them to the TF-IDF vectors to capture popularity/rating signals.

In [10]:
# Combine the TF-IDF matrix with scaled numeric features, with defensive checks
import numpy as np
import pandas as pd

# required imports
from sklearn.preprocessing import StandardScaler, normalize
try:
    from scipy.sparse import hstack, csr_matrix
except Exception as e:
    raise SystemExit("scipy.sparse is required. Install it with `pip install scipy` or `conda install scipy`") from e

# --- Quick sanity checks & helpful errors ---
# 1) check df
if 'df' not in globals():
    raise SystemExit("DataFrame `df` not found in the notebook. Make sure you loaded the CSV into `df` earlier.")

# 2) required TF-IDF matrix
if 'tfidf_matrix' not in globals():
    raise SystemExit("TF-IDF matrix `tfidf_matrix` not found. Run the TF-IDF vectorization cell first.")

# 3) numeric columns
num_feats = ['rating', 'members', 'episodes']
missing_cols = [c for c in num_feats if c not in df.columns]
if missing_cols:
    raise SystemExit(f"Missing numeric columns in df: {missing_cols}. Either create them or adjust `num_feats`.")

# 4) make numeric array and handle NaNs
X_num = df[num_feats].copy()
# convert non-numeric gracefully
for c in num_feats:
    if not pd.api.types.is_numeric_dtype(X_num[c]):
        # try to coerce to numeric
        X_num[c] = pd.to_numeric(X_num[c], errors='coerce')
X_num = X_num.fillna(X_num.median()).values  # impute with median

# 5) scale
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)  # shape: (n_samples, n_numeric_feats)

# 6) normalize rows to unit length (so numeric block is comparable)
X_num_norm = normalize(X_num_scaled, norm='l2', axis=1)  # shape: (n_samples, n_numeric_feats)

# 7) check tfidf_matrix row count matches numeric rows
tf_rows = getattr(tfidf_matrix, "shape", None)
if tf_rows is None:
    raise SystemExit("tfidf_matrix has no shape — ensure it is a scipy sparse matrix or numpy array.")
n_tfidf_rows = tfidf_matrix.shape[0]
n_num_rows = X_num_norm.shape[0]
if n_tfidf_rows != n_num_rows:
    raise SystemExit(f"Row count mismatch: tfidf_matrix has {n_tfidf_rows} rows but numeric matrix has {n_num_rows} rows. They must match.")

# 8) convert numeric block to sparse and weight it
weight_num = 0.5  # tune this
X_num_sparse = csr_matrix(X_num_norm * float(weight_num))

# 9) horizontally stack (tfidf_matrix may already be sparse)
try:
    tfidf_combined = hstack([tfidf_matrix, X_num_sparse], format='csr')
except ValueError as e:
    raise SystemExit(f"Error stacking matrices: {e}")

print("SUCCESS: Combined feature matrix shape:", tfidf_combined.shape)


SUCCESS: Combined feature matrix shape: (12294, 875)
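
It helps to see how much `weight_num` actually shifts the similarity. `TfidfVectorizer` L2-normalises rows by default, and the numeric block is also row-normalised before being multiplied by the weight `w`. So, assuming both blocks are non-zero for both items, the combined cosine reduces to `(s_text + w**2 * s_num) / (1 + w**2)`. The sketch below checks this on made-up toy vectors; none of the values come from the dataset.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy unit-length text and numeric blocks for two items (illustrative values only)
text_a, text_b = unit([1.0, 1.0, 0.0]), unit([1.0, 0.0, 1.0])
num_a, num_b = unit([0.5, -1.0, 0.2]), unit([0.4, -0.8, 0.9])

s_text = cosine(text_a, text_b)
s_num = cosine(num_a, num_b)

for w in (0.1, 0.25, 0.5, 1.0):
    combined_a = np.concatenate([text_a, w * num_a])
    combined_b = np.concatenate([text_b, w * num_b])
    direct = cosine(combined_a, combined_b)
    closed_form = (s_text + w**2 * s_num) / (1 + w**2)
    numeric_share = w**2 / (1 + w**2)
    print(f"w={w}: cosine={direct:.4f}, closed form={closed_form:.4f}, "
          f"numeric share={numeric_share:.0%}")
```

Under these assumptions, `weight_num = 0.5` gives the numeric block 0.25 / 1.25 = 20% of the total weight, and `weight_num = 1.0` splits the weight evenly between text and numeric features.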


## 3 — Recommendation Function (Cosine Similarity)

We compute pairwise cosine similarity on the combined feature matrix and build a helper function to recommend top-N similar anime given a title.

In [11]:
# Compute cosine similarity (can be memory heavy for very large datasets)
cosine_sim = cosine_similarity(tfidf_combined, tfidf_combined)

# Build indices mapping
if 'name' in df.columns:
    titles = df['name'].astype(str).tolist()
else:
    # fallback to index-based names
    titles = df.index.astype(str).tolist()

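indices = pd.Series(range(len(titles)), index=titles)
# Titles are not unique in this dataset (12292 unique names for 12294 rows).
# Keep the first row per title so `indices[title]` always returns a single integer.
indices = indices[~indices.index.duplicated(keep='first')]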

# Recommendation helper
def recommend(title, n=10, similarity_matrix=cosine_sim):
    if title not in indices:
        raise ValueError(f"Title '{title}' not found in dataset")
    idx = indices[title]
    sim_scores = list(enumerate(similarity_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [s for s in sim_scores if s[0] != idx]
    top_indices = [i for i,score in sim_scores[:n]]
    return df.iloc[top_indices][['name','genre','type','rating','members']]

# Example usage (replace with an actual title from your dataset)
example_title = titles[0]
print('Example title:', example_title)
recommend(example_title, n=5).head()

Example title: Kimi no Na wa.


| | name | genre | type | rating | members |
|---|---|---|---|---|---|
| 1111 | Aura: Maryuuin Kouga Saigo no Tatakai | Comedy, Drama, Romance, School, Supernatural | Movie | 7.67 | 22599 |
| 1494 | Harmonie | Drama, School, Supernatural | Movie | 7.52 | 29029 |
| 1959 | Air Movie | Drama, Romance, Supernatural | Movie | 7.39 | 44179 |
| 2300 | Koi to Senkyo to Chocolate | Drama, Romance, School | TV | 7.3 | 91552 |
| 1435 | True Tears | Drama, Romance, School | TV | 7.55 | 118644 |
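
The full similarity matrix holds one value per pair of items, so its memory grows quadratically with the catalogue. One alternative is to compute a single row on demand. The sketch below does this with NumPy on a small dense toy array; `top_n_similar` is a hypothetical helper, not the notebook's `recommend()`.

```python
import numpy as np

def top_n_similar(features, query_idx, n=5):
    """Return (index, cosine) pairs for the n rows most similar to `query_idx`.

    `features` is a dense (n_items, n_features) array. Only one row of the
    similarity matrix is computed, so memory stays linear in n_items.
    """
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0                     # avoid division by zero for empty rows
    sims = (features @ features[query_idx]) / (norms * norms[query_idx])
    sims[query_idx] = -np.inf                   # never recommend the query itself
    n = min(n, len(sims) - 1)
    candidates = np.argpartition(-sims, n - 1)[:n]        # unordered top-n
    ordered = candidates[np.argsort(-sims[candidates])]   # sort only those n
    return [(int(i), float(sims[i])) for i in ordered]

toy = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
result = top_n_similar(toy, query_idx=0, n=2)
print(result)
```

For the sparse `tfidf_combined` matrix, the same idea could be expressed as `cosine_similarity(tfidf_combined[idx], tfidf_combined)`, which returns a single row of scores.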


## 4 — Simple Evaluation

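Without explicit user-item interactions, evaluation is approximate. As a rough proxy, we treat a recommendation as relevant if it shares at least one genre with the query title, and report precision@k over a sample of titles. Because the recommender is itself built from genre tokens, this check is largely circular: it works as a sanity check, not as a measure of recommendation quality.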

In [12]:
# Rough precision@k evaluation using genre overlap as a relevance proxy.
# No train/test split is involved: each sampled title is queried against the full catalogue,
# and a recommendation counts as 'relevant' if it shares at least one genre with the query.

def _genre_tokens(value):
    # Split on commas and strip whitespace so ' Romance' and 'Romance' match
    return {g.strip() for g in str(value).split(',') if g.strip()}

def precision_at_k(title, k=5):
    res = recommend(title, n=k)
    true_genres = _genre_tokens(df.loc[indices[title], 'genre'])
    # count how many recommended items share any genre token
    hits = res['genre'].apply(lambda g: len(true_genres & _genre_tokens(g)) > 0).sum()
    return hits / k

# Compute avg precision@5 for a sample of titles
sample_titles = titles[:100]
precisions = []
for t in sample_titles:
    try:
        precisions.append(precision_at_k(t, k=5))
    except Exception:
        pass

print('Average precision@5 (rough):', np.nanmean(precisions))

Average precision@5 (rough): 0.99
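
A score this high is expected when relevance means "shares any genre", since the recommendations are selected by genre overlap in the first place. A stricter proxy is sketched below: a recommendation counts as a hit only if the Jaccard overlap of genre sets reaches a threshold. The function names and the thresholds are assumptions for illustration. The genre strings are the ones shown earlier for "Kimi no Na wa." and its five recommendations.

```python
def genre_set(genre_string):
    """Comma-separated genres as a set of trimmed, lowercase tokens."""
    return {g.strip().lower() for g in str(genre_string).split(",") if g.strip()}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def precision_at_k_jaccard(query_genres, recommended_genres, k=5, threshold=0.5):
    """Fraction of the returned top-k whose Jaccard overlap meets `threshold`."""
    query = genre_set(query_genres)
    top_k = recommended_genres[:k]
    if not top_k:
        return 0.0
    hits = sum(jaccard(query, genre_set(g)) >= threshold for g in top_k)
    return hits / len(top_k)

query = "Drama, Romance, School, Supernatural"
recs = [
    "Comedy, Drama, Romance, School, Supernatural",
    "Drama, School, Supernatural",
    "Drama, Romance, Supernatural",
    "Drama, Romance, School",
    "Drama, Romance, School",
]
loose = precision_at_k_jaccard(query, recs, k=5, threshold=0.5)
strict = precision_at_k_jaccard(query, recs, k=5, threshold=0.8)
print(loose, strict)
```

Raising the threshold separates near-identical genre profiles from loose matches. The metric is still genre-based, though, so it remains a proxy.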


## 5 — Notes & Next Steps

- You can tune `weight_num` to increase/decrease influence of numeric features (rating, members).  
- For production, consider approximate nearest-neighbor search (e.g., FAISS or Annoy) on dense vectors for scalability.  
- For better evaluation, use explicit user-item interactions and proper train/test splitting on users.


### Tuning & Feature Engineering
- **Tune `weight_num`** to control numeric influence (e.g., try `0.1, 0.25, 0.5, 1.0`) and compare precision@k for each setting.  
- **Feature selection:** try alternative numeric combos (e.g., `rating + members`, or `rating * log(members)`) and test their impact.  
- **Text cleaning:** normalize genre tokens (trim whitespace, unify synonyms) and join hyphenated or multi-word genres into single tokens (e.g., "sci-fi" → "sci_fi") before TF-IDF, since the default tokenizer splits on spaces and hyphens.

### Scalability & Production
- **Use ANN for large datasets:** replace full cosine-similarity with FAISS or Annoy for approximate nearest neighbors (much faster and memory-efficient).  
  - Example: `pip install faiss-cpu` (or `conda install -c pytorch faiss-cpu`) then index your dense vectors.  
- **Persist artifacts:** save the fitted `TfidfVectorizer`, scaler, and (optionally) sparse index to disk for fast reloading:
  ```python
  import joblib
  joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
  joblib.dump(scaler, 'num_scaler.pkl')
  ```
- **API / service:** wrap `recommend()` in a lightweight Flask/FastAPI app and serve recommendations via an endpoint.

### Evaluation & Validation
- **Better evaluation:** collect or simulate user-item interactions (views, likes) and compute precision@k, recall@k, MAP, and NDCG on held-out users.
- **Cross-validation:** perform leave-one-out or time-based holdout on users when interactions exist.
- **A/B testing:** in production, A/B test recommendation variants (content-only vs. hybrid) to measure lift in engagement.
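
The ranking metrics named above can be sketched in plain Python for one user, assuming binary relevance (an item is either relevant or not). Average precision has several normalisation conventions. This sketch divides by `min(len(relevant), k)`, which is an assumption, not the only valid choice. MAP is the mean of average precision across users.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Share of the top-k ranked items that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k if k else 0.0

def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / len(relevant)

def average_precision(ranked, relevant, k):
    """Sum of precision@i at each relevant rank i <= k, over min(|relevant|, k)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranking for one held-out user; item ids are placeholders
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
for metric in (precision_at_k, recall_at_k, average_precision, ndcg_at_k):
    print(metric.__name__, round(metric(ranked, relevant, 5), 4))
```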

### Model Improvements (next experiments)
- **Hybrid approach:** combine content-based scores with collaborative signals (if user ratings exist).
- **Embedding models:** use pre-trained text embeddings (Sentence-BERT) for richer content similarity instead of TF-IDF.
- **Diversity & novelty:** post-process top-k to promote serendipity (e.g., re-rank to increase genre diversity).
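
One common way to re-rank for diversity is maximal marginal relevance (MMR). It greedily picks the item with the best trade-off between its relevance score and its similarity to items already selected. The sketch below is one possible approach, not something the notebook implements. It uses Jaccard overlap of genre sets as the redundancy measure. The item ids, scores and `lambda_` value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mmr_rerank(candidates, relevance, genres, k=3, lambda_=0.7):
    """Greedy maximal-marginal-relevance re-ranking.

    candidates: item ids; relevance: id -> score; genres: id -> set of genre tokens.
    lambda_ near 1 favours relevance, lambda_ near 0 favours diversity.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Redundancy is the highest overlap with anything already selected
            redundancy = max(
                (jaccard(genres[item], genres[s]) for s in selected), default=0.0
            )
            return lambda_ * relevance[item] - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"A": 0.95, "B": 0.93, "C": 0.80, "D": 0.60}
genres = {
    "A": {"drama", "romance"},
    "B": {"drama", "romance"},
    "C": {"action", "comedy"},
    "D": {"drama", "school"},
}
diverse = mmr_rerank(list(relevance), relevance, genres, k=3, lambda_=0.7)
by_score = mmr_rerank(list(relevance), relevance, genres, k=3, lambda_=1.0)
print(diverse, by_score)
```

With `lambda_=1.0` the ranking falls back to plain relevance order. Lower values move items that duplicate an already selected genre profile further down.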

### Practical tips
- **Reproducibility:** log experiments (weights, preprocessing steps, metrics) using MLflow or a simple CSV.
- **Monitoring:** track recommendation CTR and distributional drift for numeric features (e.g., members growth).

## Conclusion

This notebook implemented a content-based recommendation system using TF-IDF on genres and optional numeric features. The cosine similarity approach provides quick, interpretable recommendations and is a good baseline before moving to collaborative or hybrid models.

## Interview Questions & Answers

### 1. What is the difference between user-based and item-based collaborative filtering?

User-Based Collaborative Filtering focuses on finding users who have similar tastes or behaviors. It recommends items that those similar users have liked.
Example: “People who are similar to you liked this anime, so you might like it too.”

Item-Based Collaborative Filtering, on the other hand, focuses on finding items that are similar to each other. It recommends items that are similar to those the user has already liked.
Example: “You liked Attack on Titan, and it’s similar to Death Note — so you’ll probably like Death Note.”

In simple terms:

* User-based looks for similar users.
* Item-based looks for similar items.

Item-based methods are usually more efficient and stable when users greatly outnumber items: item-item similarities change slowly and can be precomputed, whereas user-user similarities shift whenever a user rates something new.
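
The two variants differ only in which axis of the rating matrix is compared. A minimal sketch on an invented toy matrix follows. Treating 0 as "not rated" is a simplification made here for brevity.

```python
import numpy as np

def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero rows at similarity 0
    unit_rows = rows / norms
    return unit_rows @ unit_rows.T

# Toy ratings: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

user_sim = cosine_matrix(ratings)      # user-based: compares rows (users)
item_sim = cosine_matrix(ratings.T)    # item-based: compares columns (items)
print(user_sim.shape, item_sim.shape)
```

The user-based matrix has one row per user and the item-based matrix one row per item. This is why the item-based form stays small when the catalogue is much smaller than the user base.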

---

### 2. What is collaborative filtering, and how does it work?

Collaborative Filtering is a technique used in recommendation systems to predict what a user might like based on the preferences of many other users. It assumes that users who agreed in the past will have similar preferences in the future.

It works in the following steps:

1. Create a matrix that shows which users have rated or interacted with which items.
2. Measure similarity — either between users or between items — using cosine similarity, Pearson correlation, or another metric.
3. Predict missing ratings or preferences based on what similar users or similar items have rated.
4. Recommend the top items with the highest predicted scores to the user.

There are three main types:

* User-based collaborative filtering: finds similar users.
* Item-based collaborative filtering: finds similar items.
* Model-based collaborative filtering: uses machine learning models such as matrix factorization (SVD, NMF) or neural networks to find hidden patterns.
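
As a rough illustration of the model-based idea, a truncated SVD compresses the rating matrix into a few latent factors and reconstructs a score for every user-item pair. The toy matrix is invented. Treating unrated cells as 0 is a crude assumption, since practical matrix-factorization models fit only the observed ratings.

```python
import numpy as np

# Toy ratings: rows are users, columns are items, 0 stands in for "not rated"
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def low_rank_approximation(matrix, rank):
    """Best rank-`rank` approximation (least-squares sense) via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]

approx = low_rank_approximation(ratings, rank=2)
print(np.round(approx, 2))
```

The reconstructed values in the cells that were 0 can be read as predicted preferences from the two latent factors.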

Advantages:

* Does not require detailed item information like genres or descriptions.
* Works well when there is a lot of user feedback data available.

Limitations:

* Suffers from the cold start problem when new users or new items have no data.
* Can struggle with sparse datasets where few ratings exist.
* Can become computationally expensive for large datasets.

Example:
If two users (say, Alice and Bob) both liked “Naruto” and “One Piece,” and Alice also liked “Attack on Titan,” the system can predict that Bob might like “Attack on Titan” too.
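
The Alice and Bob example can be traced with a minimal user-based sketch over binary "liked" sets. The third user, Carol, is a hypothetical addition for contrast. Scoring each candidate by the summed similarity of the users who liked it is one simple choice among several.

```python
import math

likes = {
    "Alice": {"Naruto", "One Piece", "Attack on Titan"},
    "Bob": {"Naruto", "One Piece"},
    "Carol": {"Death Note"},   # hypothetical user with no overlap with Bob
}

def user_similarity(a, b):
    """Cosine similarity between two users' binary like-vectors."""
    if not likes[a] or not likes[b]:
        return 0.0
    return len(likes[a] & likes[b]) / math.sqrt(len(likes[a]) * len(likes[b]))

def recommend_for(user):
    """Score unseen items by the summed similarity of the users who liked them."""
    scores = {}
    for other in likes:
        if other == user:
            continue
        sim = user_similarity(user, other)
        for item in likes[other] - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(recommend_for("Bob"))
```

Bob overlaps with Alice on two titles, so her remaining title ranks first. Carol shares nothing with Bob, so her title gets a score of zero.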


