# Anime Recommendation System using Cosine Similarity

This notebook demonstrates how to build a content-based recommendation system using **cosine similarity**
on an anime dataset. It covers data preprocessing, feature extraction, the recommendation
function, and a simplified evaluation.
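
As a quick reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The sketch below is a minimal illustration; the helper `cosine` and the toy vectors are assumptions made for this example and are not part of the notebook's pipeline.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors (illustrative helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy feature vectors, invented for this example
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([2.0, 2.0, 0.0])  # same direction as a, twice the length

print(cosine(a, c))  # close to 1.0: identical direction
print(cosine(a, b))  # lower: the vectors only partly align
```

Because only direction matters, scaling a vector does not change its similarity to other vectors.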

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
import scipy.sparse as sp

In [2]:
# Load dataset
df = pd.read_csv('anime.csv')
df.head()

|   | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


In [3]:
# Data Preprocessing
# Drop missing values in 'rating' and 'genre'
df = df.dropna(subset=['rating', 'genre'])
df.reset_index(drop=True, inplace=True)

print("Dataset shape after cleaning:", df.shape)
df.info()

Dataset shape after cleaning: (12017, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12017 entries, 0 to 12016
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12017 non-null  int64  
 1   name      12017 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12017 non-null  object 
 4   episodes  12017 non-null  object 
 5   rating    12017 non-null  float64
 6   members   12017 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 657.3+ KB


In [4]:
# Feature Extraction
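# Convert genres into TF-IDF vectors.
# Note: the default tokenizer splits on non-word characters, so multi-word genres
# such as "Slice of Life" or "Sci-Fi" are broken into separate tokens.
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['genre'].fillna(''))

# Scale rating and members to [0, 1]; this overwrites the original columns in df
scaler = MinMaxScaler()
df[['rating', 'members']] = scaler.fit_transform(df[['rating', 'members']])

# Combine features: genres (sparse TF-IDF vectors) + numeric features
features = sp.hstack([tfidf_matrix, df[['rating', 'members']].values])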

print("Feature matrix shape:", features.shape)

Feature matrix shape: (12017, 48)


In [5]:
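# Recommendation Function
# Pairwise cosine similarity between all anime (dense n x n matrix)
cosine_sim = cosine_similarity(features, features)

# Map each title to its row position; keep the first row when a title repeats.
# (Series.drop_duplicates() would deduplicate the values, not the titles.)
indices = pd.Series(df.index, index=df['name'])
indices = indices[~indices.index.duplicated(keep='first')]

def recommend_anime(title, n=5):
    """Return the names of the n anime most similar to `title`."""
    if title not in indices:
        return "Anime not found in dataset."
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the query itself by index; on ties it is not guaranteed to sort first
    sim_scores = [s for s in sim_scores if s[0] != idx][:n]
    anime_indices = [i[0] for i in sim_scores]
    return df['name'].iloc[anime_indices]

# Example
recommend_anime('Naruto', 5)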

615                                    Naruto: Shippuuden
206                                         Dragon Ball Z
346                                           Dragon Ball
1472          Naruto: Shippuuden Movie 4 - The Lost Tower
1573    Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsu...
Name: name, dtype: object
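
The full similarity matrix for 12017 titles has 12017 x 12017 entries, which is about 1.1 GB as float64. If memory is a concern, one option is to compute only the row for the queried title. The sketch below is a minimal illustration under assumptions: `top_n_similar` is a hypothetical helper, it expects a dense NumPy array (the sparse matrix built in the notebook would need converting first), and the toy matrix is invented.

```python
import numpy as np

def top_n_similar(features, idx, n=5):
    """Return (row indices, scores) of the n rows most cosine-similar to row idx."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0                      # avoid division by zero for all-zero rows
    scores = (features @ features[idx]) / (norms * norms[idx])
    scores[idx] = -np.inf                        # exclude the query row explicitly
    k = min(n, len(scores) - 1)                  # assumes n >= 1 and at least two rows
    top = np.argpartition(-scores, k - 1)[:k]    # k best rows, in arbitrary order
    top = top[np.argsort(-scores[top])]          # sort those k by descending similarity
    return top, scores[top]

# Toy feature matrix, invented for this example
demo = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])

top, scores = top_n_similar(demo, idx=0, n=2)
print(top, scores)
```

Using `argpartition` avoids sorting every score when only the top few are needed.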

In [6]:
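# Evaluation - Simplified
# The split only selects which titles are queried; the similarity matrix above
# was built from the full dataset, so no data is actually held out.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

y_true = []
y_pred = []

for title in test_df['name'][:50]:  # limit for efficiency
    recs = recommend_anime(title, n=5)
    if isinstance(recs, str):
        continue
    true_genres = set(df.loc[df['name']==title, 'genre'].values[0].split(', '))
    for r in recs:
        rec_genres = set(df.loc[df['name']==r, 'genre'].values[0].split(', '))
        # A recommendation counts as a hit if it shares at least one genre with the query
        overlap = len(true_genres.intersection(rec_genres)) > 0
        # Every label is 1, so precision is 1.0 whenever any hit exists and
        # recall reduces to the hit rate. Genres are also model features,
        # so this check is optimistic by construction.
        y_true.append(1)
        y_pred.append(1 if overlap else 0)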

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-score: {f1:.2f}")

Precision: 1.00, Recall: 1.00, F1-score: 1.00
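
Because every true label in the evaluation above is 1, the reported scores say little beyond "each recommendation shares at least one genre with the query". A graded measure such as the mean Jaccard overlap between genre sets may be more informative, although it is still circular when genres are also input features. The sketch below is one possible formulation; `jaccard`, `mean_genre_jaccard`, and the sample genre sets are assumptions made for illustration, and no result for the anime dataset is implied.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_genre_jaccard(query_genres, recommended_genres):
    """Average Jaccard overlap between the query's genres and each recommendation's genres."""
    scores = [jaccard(query_genres, g) for g in recommended_genres]
    return sum(scores) / len(scores) if scores else 0.0

# Sample genre sets, invented for this example
query = {"Action", "Comedy"}
recs = [{"Action", "Comedy"}, {"Action"}, {"Drama"}]

score = mean_genre_jaccard(query, recs)
print(f"Mean genre Jaccard: {score:.2f}")
```

Here the three recommendations score 1.0, 0.5, and 0.0 respectively, so partial matches are distinguished from full ones instead of all counting as hits.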


# Interview Questions & Answers

### 1. What is the difference between user-based and item-based collaborative filtering?

- User-based CF measures similarity between users: it finds users whose preferences resemble the target user's and recommends items those users liked.

- Item-based CF measures similarity between items: it finds items similar to the ones a user already likes and recommends those.

- Item-based CF is usually faster and more stable on large datasets, because when users far outnumber items the item-to-item similarities change slowly and can be precomputed.
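
The two variants differ only in which axis of the rating matrix is compared. The sketch below illustrates this on a tiny, invented user-item matrix; `cosine_matrix` is a hypothetical helper, and treating unrated entries as 0 is a simplifying assumption.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (invented data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    unit = m / norms
    return unit @ unit.T

user_sim = cosine_matrix(ratings)    # user-based: compare rows (users)
item_sim = cosine_matrix(ratings.T)  # item-based: compare columns (items)

print(user_sim.shape, item_sim.shape)
print(user_sim[0, 1], user_sim[0, 2])
```

Users 0 and 1 rate the same items highly, so their similarity comes out well above that of users 0 and 2.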

### 2. What is collaborative filtering, and how does it work?

- Collaborative filtering recommends items based on behavior patterns across many users rather than on the attributes of the items themselves.

- It works by finding similar users or similar items based on their behavior (ratings, likes, interactions).

- Then it suggests items that similar users liked or items similar to what the user already likes.
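
The steps above can be sketched as a user-based prediction: estimate a missing rating as the similarity-weighted average of the ratings other users gave that item. This is a minimal illustration, not a production approach; `predict_rating` and the rating matrix are hypothetical, and unrated entries are represented as 0.

```python
import numpy as np

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`, or None if unavailable."""
    ratings = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0                       # avoid division by zero
    sims = (ratings @ ratings[user]) / (norms * norms[user])
    rated = ratings[:, item] > 0                  # users who rated the item
    rated[user] = False                           # never use the target user's own entry
    if not rated.any() or sims[rated].sum() == 0:
        return None
    return float(sims[rated] @ ratings[rated, item] / sims[rated].sum())

# Rows are users, columns are items; 0 means "not rated" (invented data)
toy_ratings = np.array([
    [5, 4, 0],   # user 0 has not rated item 2
    [5, 5, 4],   # tastes close to user 0
    [1, 2, 5],   # tastes less similar to user 0
])

pred = predict_rating(toy_ratings, user=0, item=2)
print(pred)
```

The prediction lands between the two observed ratings (4 and 5) and closer to 4, because user 1 is more similar to user 0 and therefore carries more weight.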
