Recommendation System

Data Description:

anime_id: Unique ID of each anime.
name: Anime title.
type: Anime broadcast type, such as TV, OVA, etc.
genre: Anime genre(s), given as a comma-separated list.
episodes: The number of episodes of each anime.
rating: The average rating given to each anime by the users who rated it.
members: Number of community members for each anime.

Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.
Dataset:
Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

Tasks:

Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.

Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

Recommendation System:

Design a function to recommend anime based on cosine similarity.
Given a target anime, recommend a list of similar anime based on cosine similarity scores.
Experiment with different threshold values for similarity scores to adjust the recommendation list size.

Evaluation:

Split the dataset into training and testing sets.
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
Analyze the performance of the recommendation system and identify areas of improvement.

In [220]:
import pandas as pd


In [221]:
df = pd.read_csv("/content/anime.csv")

In [222]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [223]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
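
The `info()` output shows missing values in `genre`, `type`, and `rating`, and `episodes` is stored as `object` rather than as a number. A minimal sketch of one way to handle this is shown below. It runs on a small made-up frame: the rows and the `'Unknown'` placeholder are assumptions, not values confirmed by the output above, and median imputation is only one of several reasonable choices.

```python
import pandas as pd

# Toy frame standing in for anime.csv; all values are illustrative only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Drama, Romance", None, "Action", "Comedy"],
    "type": ["TV", "OVA", None, "Movie"],
    "episodes": ["12", "Unknown", "1", "24"],  # 'Unknown' is an assumed placeholder
    "rating": [8.0, None, 6.0, 7.0],
})

# Non-numeric episode counts become NaN instead of raising an error.
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")

# Fill missing ratings with the median of the observed ratings.
toy["rating"] = toy["rating"].fillna(toy["rating"].median())

# Keep rows with a missing genre or type, but mark them explicitly.
toy["genre"] = toy["genre"].fillna("")
toy["type"] = toy["type"].fillna("Unknown")

print(toy.isnull().sum())
```

Only `episodes` still contains a missing value afterwards; whether to impute or drop such rows depends on whether that column is used as a feature.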


In [224]:
df['genre'].isnull().value_counts()

genre,count
False,12232
True,62


In [225]:
genre_df = df['genre']
genre_df

Unnamed: 0,genre
0,"Drama, Romance, School, Supernatural"
1,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,"Action, Comedy, Historical, Parody, Samurai, S..."
3,"Sci-Fi, Thriller"
4,"Action, Comedy, Historical, Parody, Samurai, S..."
...,...
12289,Hentai
12290,Hentai
12291,Hentai
12292,Hentai


In [226]:
genre_df.isnull().sum()

np.int64(62)

# Splitting the genre column into lists

In [227]:
genre_list = genre_df.apply(lambda x: x.split(', ') if isinstance(x, str) else [])
genre_list

Unnamed: 0,genre
0,"[Drama, Romance, School, Supernatural]"
1,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
2,"[Action, Comedy, Historical, Parody, Samurai, ..."
3,"[Sci-Fi, Thriller]"
4,"[Action, Comedy, Historical, Parody, Samurai, ..."
...,...
12289,[Hentai]
12290,[Hentai]
12291,[Hentai]
12292,[Hentai]


# Encoding of Genre list

In [228]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb_out = mlb.fit_transform(genre_list)
mlb_out


array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
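
Each row of this array is a multi-hot vector: one column per genre, with a 1 wherever the anime has that genre. The plain-Python sketch below mimics that encoding on three toy genre lists to make the layout concrete. `multi_hot` is a hypothetical helper, not part of scikit-learn, and the toy lists are invented.

```python
def multi_hot(label_lists):
    """Return (classes, matrix) where matrix[i][j] is 1 if item i has classes[j]."""
    # One column per distinct label, in sorted order.
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix


# The empty list stands for an anime whose genre is missing.
toy_genres = [["Drama", "Romance"], ["Action", "Drama"], []]
classes, matrix = multi_hot(toy_genres)
print(classes)
for row in matrix:
    print(row)
```

An anime with no genre becomes an all-zero row, so it shares no genre with any other title.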

In [229]:
genre_encoded_df = pd.DataFrame(data=mlb_out, columns=mlb.classes_)
genre_encoded_df

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [230]:
original_data = pd.concat([df,genre_encoded_df],axis=1)
original_data.drop(['genre'],axis=1,inplace=True)
original_data.head()


Unnamed: 0,anime_id,name,type,episodes,rating,members,Action,Adventure,Cars,Comedy,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,Movie,1,9.37,200630,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,TV,64,9.26,793665,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,TV,51,9.25,114262,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,TV,24,9.17,673572,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,9969,Gintama&#039;,TV,51,9.16,151266,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
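
The task list mentions normalizing numerical features. That would matter if `rating` and `members` were added to the genre vectors, since their raw scales differ by several orders of magnitude. Below is a minimal min-max sketch using the `rating` and `members` values from the first three rows shown above. Combining these columns with the genres is a suggestion only, not something this notebook does, and `min_max` is a hypothetical helper.

```python
import pandas as pd

# rating and members copied from the first three rows of the table above
toy = pd.DataFrame({
    "rating": [9.37, 9.26, 9.25],
    "members": [200630, 793665, 114262],
})


def min_max(col):
    """Rescale a numeric column to the range [0, 1]."""
    span = col.max() - col.min()
    # A constant column has no spread, so map it to zeros.
    return (col - col.min()) / span if span else col * 0.0


scaled = toy.apply(min_max)  # applied column by column
print(scaled.round(3))
```

After scaling, both columns lie between 0 and 1, so neither dominates a cosine similarity purely because of its units.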


In [231]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(genre_encoded_df)

In [232]:
similarity[1]

array([0.18898224, 1.        , 0.28571429, ..., 0.        , 0.        ,
       0.        ])
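
For 0/1 genre vectors, cosine similarity reduces to the number of shared genres divided by the square root of the product of the two genre counts. The sketch below checks that identity on toy sets. `genre_cosine` is a hypothetical helper, and the seven-genre set uses placeholder names made up for illustration.

```python
import math


def genre_cosine(a, b):
    """Cosine similarity between two genre sets treated as 0/1 vectors."""
    if not a or not b:
        return 0.0  # convention chosen here for an empty genre set
    return len(a & b) / math.sqrt(len(a) * len(b))


four = {"Drama", "Romance", "School", "Supernatural"}
seven = {"Drama", "G1", "G2", "G3", "G4", "G5", "G6"}  # placeholder genres

print(round(genre_cosine(four, seven), 8))          # one shared genre: 1 / sqrt(4 * 7)
print(genre_cosine(four, {"Sci-Fi", "Thriller"}))   # no shared genre
print(genre_cosine(four, four))                     # identical sets
```

One shared genre between a four-genre and a seven-genre title gives 1/sqrt(28) ≈ 0.189, the same value as the first entry of `similarity[1]` above. That would be consistent with the second title having seven genres, though its genre list is truncated in the displayed table.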

# Recommendation System

In [233]:
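def recommendation(anime_name, n):
  # Return the names of the n anime whose genre vectors are most similar to anime_name
  if anime_name not in df['name'].values:
    print(f"'{anime_name}' not found in the dataset")
    return []

  idx = df[df['name'] == anime_name].index[0]
  # Rank every anime by similarity to the target, highest first
  ranked = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
  # Drop the target by index; with tied scores it is not guaranteed to be ranked first
  similar = [i for i, _ in ranked if i != idx][:n]
  return [df.loc[i, 'name'] for i in similar]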

In [234]:
print(recommendation('Chihayafuru 2',10))

['Chihayafuru', 'Otona Joshi no Anime Time', 'Human Crossing', 'Battery', '3-gatsu no Lion', 'Ristorante Paradiso', 'Usagi Drop', 'Ashita no Joe 2', 'Ashita no Joe', 'Usagi Drop Specials']
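
The function above returns a fixed number of titles. The task list also asks for experiments with similarity thresholds, which let the list size vary with how many close matches exist. A minimal sketch on a toy similarity matrix follows. `recommend_above`, the toy names, and the scores are hypothetical and not taken from the dataset.

```python
def recommend_above(names, sim_matrix, target, threshold):
    """Return names whose similarity to `target` is at least `threshold`, best first."""
    if target not in names:
        return []
    idx = names.index(target)
    scored = [
        (score, name)
        for i, (name, score) in enumerate(zip(names, sim_matrix[idx]))
        if i != idx and score >= threshold  # skip the target itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


toy_names = ["A", "B", "C", "D"]
toy_sim = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.4, 0.2],
    [0.5, 0.4, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

# A lower threshold admits more titles into the list.
for t in (0.8, 0.4, 0.0):
    print(t, recommend_above(toy_names, toy_sim, "A", t))
```

A high threshold can return an empty list for titles with unusual genre combinations, so a fallback to the top-n approach may be worth considering.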


# Evaluation

In [235]:
genre_list.head()

Unnamed: 0,genre
0,"[Drama, Romance, School, Supernatural]"
1,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
2,"[Action, Comedy, Historical, Parody, Samurai, ..."
3,"[Sci-Fi, Thriller]"
4,"[Action, Comedy, Historical, Parody, Samurai, ..."


In [236]:
df['genre_list'] = df['genre'].apply(
    lambda x: x.split(', ') if isinstance(x, str) else []
)

In [237]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,genre_list
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,"[Drama, Romance, School, Supernatural]"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262,"[Action, Comedy, Historical, Parody, Samurai, ..."
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572,"[Sci-Fi, Thriller]"
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266,"[Action, Comedy, Historical, Parody, Samurai, ..."


In [238]:
from sklearn.model_selection import train_test_split

## Train–Test Split (Item-based)

In [239]:
train_df,test_df = train_test_split(df,test_size=0.2,random_state=30)

In [240]:
test_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,genre_list
2977,18745,Chihayafuru 2: Waga Mi Yo ni Furu Nagame Seshi...,"Comedy, Josei, Slice of Life",OVA,1,7.11,18659,"[Comedy, Josei, Slice of Life]"
8239,23731,"Boku datte, Kirei ni Shitainda",Kids,OVA,1,4.0,54,[Kids]
9279,26091,Koe wo Kikasete,"Drama, Kids",OVA,1,8.33,47,"[Drama, Kids]"
10620,6273,Tsuyu no Hito Shizuku,Historical,OVA,1,6.5,181,[Historical]
10398,18447,Spheres,"Action, Fantasy, Super Power",TV,26,7.33,171,"[Action, Fantasy, Super Power]"


In [241]:
train_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,genre_list
9701,33826,Neko no Kuni no Kenpou,Historical,OVA,1,7.33,20,[Historical]
9220,17485,Kiki to Lala no Ohimesama ni Naritai,Kids,OVA,1,5.65,98,[Kids]
10879,31698,Zhan Long Si Qu,"Cars, Kids",TV,64,4.8,67,"[Cars, Kids]"
7665,9337,Mayo Elle Otoko no Ko,"Comedy, School",OVA,1,5.17,820,"[Comedy, School]"
6500,6641,Tonari no 801-chan R,"Comedy, Music",OVA,1,6.09,1947,"[Comedy, Music]"


## Build similarity model on TRAIN set

In [242]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

In [254]:
mlb_1 = MultiLabelBinarizer()
train_genre_matrix = mlb_1.fit_transform(train_df['genre_list'])
train_similarity = cosine_similarity(train_genre_matrix)

print(f"Number of genres learned by mlb_1: {len(mlb_1.classes_)}")

Number of genres learned by mlb_1: 43


## For predicting genres

In [247]:
import numpy as np


def pred_genres(test_row, top=5):
  # Encode one test anime with the binarizer fitted on the training set
  test_vec = mlb_1.transform([test_row['genre_list']])
  # Compare it with all training anime
  sim = cosine_similarity(test_vec, train_genre_matrix)[0]

  pred = set()
  # Positions of the `top` most similar training anime
  # (a separate name avoids overwriting the `top` parameter)
  top_idx = np.argsort(sim)[-top:]

  for i in top_idx:
    pred.update(train_df.iloc[i]['genre_list'])  # Combine their genres
  return pred  # Return the combined genre set

# For evaluating predicted genres

In [251]:
def evaluation_metric(actual_genres_list, predict_genres):

  actual_set = set(actual_genres_list)
  pred_set = set(predict_genres)

  tp = len(actual_set & pred_set)  # genres predicted and actually present
  fp = len(pred_set - actual_set)  # genres predicted but not present
  fn = len(actual_set - pred_set)  # genres present but not predicted

  precision = tp / (tp + fp) if (tp + fp) else 0
  recall = tp / (tp + fn) if (tp + fn) else 0
  f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0
  return precision, recall, f1
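
As a quick check of the set arithmetic, the sketch below applies the same formulas to one toy case. The function is restated under the hypothetical name `set_metrics` so the block runs on its own, and the predicted set is invented for illustration.

```python
def set_metrics(actual, predicted):
    """Precision, recall and F1 between two label sets."""
    actual, predicted = set(actual), set(predicted)
    tp = len(actual & predicted)   # predicted and actually present
    fp = len(predicted - actual)   # predicted but not present
    fn = len(actual - predicted)   # present but not predicted
    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tp / (tp + fn) if (tp + fn) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    return precision, recall, f1


actual = ["Comedy", "Josei", "Slice of Life"]
predicted = {"Comedy", "Josei", "Slice of Life", "Drama"}  # one extra genre

p, r, f = set_metrics(actual, predicted)
print(p, r, round(f, 3))
```

Here tp = 3, fp = 1, and fn = 0, so precision is 3/4 while recall is 1. Taking the union of several neighbors' genres tends to add extra genres in this way, which lowers precision without hurting recall.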

# Performance

In [259]:
precisions, recalls, f1s = [], [], []

for _, row in test_df.iterrows():
    predicted = pred_genres(row)
    p, r, f = evaluation_metric(row['genre_list'], predicted)

    precisions.append(p)
    recalls.append(r)
    f1s.append(f)

print("Average Precision:", np.mean(precisions))
print("Average Recall:", np.mean(recalls))
print("Average F1:", np.mean(f1s))

Average Precision: 0.8927870906825766
Average Recall: 0.9894616149292847
Average F1: 0.9280565646914007


# Performance Analysis

On this genre-overlap evaluation the recommendation system shows strong overall performance, with a mean precision of 0.89, mean recall of 0.99, and mean F1-score of 0.93 across the test set.

* The very high recall indicates that the predicted genre set contains almost all of each test anime's actual genres.

* The lower precision indicates that the combined genres of the nearest neighbors also include some genres the test anime does not have.

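Areas of Improvement:

* Tune the number of neighbors (`top`) to balance precision and recall: fewer neighbors contribute fewer extra genres, while more neighbors can recover genres that the closest matches lack.

* Avoid genre leakage: if the goal is true genre prediction, compute similarity from non-genre features (such as type, episodes, rating, or members), because the current evaluation uses each test anime's own genres to find its neighbors.

Overall, the system is a strong similarity-based recommender, but because of this leakage the evaluation likely overestimates real-world performance.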
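
To illustrate the first suggestion, the sketch below sweeps the neighbor count on a tiny made-up training set and reports mean precision and recall. It is plain Python with hypothetical helper names (`cosine`, `predict`, `sweep`), so it shows the trade-off in principle, not the numbers this notebook would produce.

```python
import math

# Invented genre sets; the test items query with their own genres, as in the notebook.
train = [
    {"Action", "Comedy"},
    {"Action", "Comedy", "Sci-Fi"},
    {"Action", "Drama"},
    {"Drama", "Romance"},
    {"Kids"},
]
test = [{"Action", "Comedy"}, {"Drama", "Romance"}]


def cosine(a, b):
    """Cosine similarity between two genre sets treated as 0/1 vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def predict(genres, k):
    """Union of the genres of the k most similar training items."""
    ranked = sorted(train, key=lambda t: cosine(genres, t), reverse=True)
    return set().union(*ranked[:k])


def sweep(ks):
    """Map each k to (mean precision, mean recall) over the test items."""
    results = {}
    for k in ks:
        precisions, recalls = [], []
        for actual in test:
            pred = predict(actual, k)
            tp = len(actual & pred)
            precisions.append(tp / len(pred) if pred else 0)
            recalls.append(tp / len(actual))
        results[k] = (sum(precisions) / len(test), sum(recalls) / len(test))
    return results


results = sweep([1, 2, 3])
for k, (p, r) in results.items():
    print(k, round(p, 2), round(r, 2))
```

In this toy data recall stays at 1 for every k, because each test item is matched using its own genres and the nearest neighbor already covers them. Precision falls as more neighbors contribute extra genres.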

Interview Questions

1. Can you explain the difference between user-based and item-based collaborative filtering?

User-based CF:
Recommends items by finding users similar to the target user and suggesting items those similar users liked.
Idea: “Users like you also liked this.”

Item-based CF:
Recommends items by finding items similar to what the user already liked and suggesting those similar items.
Idea: “If you liked this item, you may also like that one.”
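
The two variants differ only in what is compared: rows (users) or columns (items) of the same rating matrix. Below is a minimal sketch on a made-up 4×3 rating matrix, where 0 means "not rated" (an assumption of this toy setup). The `cosine` helper is written out by hand so the block has no dependencies.

```python
import math

# Rows are users, columns are items; all values are invented.
ratings = [
    [5, 4, 0],   # user 0
    [4, 5, 1],   # user 1
    [1, 0, 5],   # user 2
    [0, 1, 4],   # user 3
]


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# User-based: compare rows of the matrix.
user_sim = [[cosine(u, v) for v in ratings] for u in ratings]

# Item-based: compare columns, i.e. rows of the transposed matrix.
items = list(zip(*ratings))
item_sim = [[cosine(i, j) for j in items] for i in items]

print([round(s, 2) for s in user_sim[0]])
print([round(s, 2) for s in item_sim[0]])
```

User 0 comes out closest to user 1, and item 0 closest to item 1. A user-based recommender would rely on the first relationship, an item-based one on the second.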

2. What is collaborative filtering, and how does it work?

Collaborative Filtering is a recommendation technique that suggests items based on patterns of user interactions (such as ratings, likes, or purchases), rather than item content.

How it works:

* It collects user–item interaction data.

* It finds similarities between users or items based on this data.

* It recommends items that similar users liked or that are similar to items the user liked.
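
The three steps can be sketched as a small user-based recommender. The interaction data, the `recommend_for` helper, and the choice of Jaccard overlap as the similarity measure are all assumptions made for illustration.

```python
# Step 1: collect user-item interactions (here: invented sets of liked items).
likes = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "cat": {"E", "F"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_for(user, k=1):
    # Step 2: find the k users most similar to `user`.
    others = [u for u in likes if u != user]
    neighbours = sorted(
        others, key=lambda u: jaccard(likes[user], likes[u]), reverse=True
    )[:k]
    # Step 3: suggest items the neighbours liked that `user` has not interacted with.
    suggestions = set()
    for u in neighbours:
        suggestions |= likes[u] - likes[user]
    return sorted(suggestions)


print(recommend_for("ann"))
```

No item content (such as genre) is used anywhere in this sketch, which is what separates collaborative filtering from the genre-based similarity used in the notebook above.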