**Recommendation System**

**Data Description:**

anime_id: Unique ID of each anime.
name: Anime title.
type: Anime broadcast type, such as TV, OVA, etc.
genre: Anime genre(s), as a comma-separated list.
episodes: The number of episodes of each anime.
rating: The average rating (out of 10) given by the users who rated the anime.
members: Number of community members for each anime.

**Objective:**
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.

**Dataset:**
Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

**Tasks:**

**1.Data Preprocessing:**

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.

**2.Feature Extraction:**

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

**3.Recommendation System:**

Design a function to recommend anime based on cosine similarity.
Given a target anime, recommend a list of similar anime based on cosine similarity scores.
Experiment with different threshold values for similarity scores to adjust the recommendation list size.
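
For reference, the cosine similarity between two feature vectors is the cosine of the angle between them. The definition below is the standard one; the vectors in the worked example are made up for illustration (three hypothetical genre flags) and are not drawn from the dataset.

```latex
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}

% Worked example with a = (1, 1, 0) and b = (1, 0, 1):
\mathbf{a} \cdot \mathbf{b} = 1, \qquad
\lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = \sqrt{2}, \qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5
```

When all features are non-negative, as with genre counts and min-max scaled values, the score lies between 0 and 1, which is why a threshold in that range can be used to trim the recommendation list.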

**4.Evaluation:**

Split the dataset into training and testing sets.
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
Analyze the performance of the recommendation system and identify areas of improvement.

Interview Questions:
1. Can you explain the difference between user-based and item-based collaborative filtering?
2. What is collaborative filtering, and how does it work?

Data Preprocessing:
Load the dataset into a suitable data structure (e.g., pandas DataFrame).

Handle missing values, if any.

Explore the dataset to understand its structure and attributes.



In [11]:
import pandas as pd
import numpy as np
df=pd.read_csv("/content/anime.csv")
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [4]:
df.columns

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')

In [5]:
df.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


In [12]:
#handling missing values (assign the result back; chained inplace calls stop working in pandas 3.0):
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['genre'] = df['genre'].fillna("Unknown")
# 'episodes' is stored as text; errors='coerce' turns 'Unknown' into NaN
df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce')
df['episodes'] = df['episodes'].fillna(df['episodes'].median())


In [9]:
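# Missing values per column (cell In [9] ran before the imputation in In [12], so these are pre-imputation counts)
print(df.isnull().sum())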

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [13]:

# Unique anime count
print("Unique Anime:", df['name'].nunique())

# Broadcast type distribution
print(df['type'].value_counts())

# Top genres
from collections import Counter
genre_list = df['genre'].dropna().apply(lambda x: x.split(", "))
all_genres = [g for sublist in genre_list for g in sublist]
print(Counter(all_genres).most_common(10))

# Ratings overview
print(df['rating'].describe())

# Members distribution
print(df['members'].describe())



Unique Anime: 12292
type
TV         3787
OVA        3311
Movie      2348
Special    1676
ONA         659
Music       488
Name: count, dtype: int64
[('Comedy', 4645), ('Action', 2845), ('Adventure', 2348), ('Fantasy', 2309), ('Sci-Fi', 2070), ('Drama', 2016), ('Shounen', 1712), ('Kids', 1609), ('Romance', 1464), ('School', 1220)]
count    12294.000000
mean         6.473902
std          1.017096
min          1.670000
25%          5.900000
50%          6.550000
75%          7.170000
max         10.000000
Name: rating, dtype: float64
count    1.229400e+04
mean     1.807134e+04
std      5.482068e+04
min      5.000000e+00
25%      2.250000e+02
50%      1.550000e+03
75%      9.437000e+03
max      1.013917e+06
Name: members, dtype: float64


Feature Extraction:
Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize genres
count = CountVectorizer(tokenizer=lambda x: x.split(", "))
genre_matrix = count.fit_transform(df['genre'])

# Normalize numerical features
scaler = MinMaxScaler()
numerical_features = scaler.fit_transform(df[['rating', 'members']])

# Combine features (hstack)
from scipy.sparse import hstack
feature_matrix = hstack([genre_matrix, numerical_features])




Recommendation System:

In [15]:
cosine_sim = cosine_similarity(feature_matrix, feature_matrix)

def recommend_anime(title, n=5, threshold=0.3):
    if title not in df['name'].values:
        return "Anime not found."

    idx = df[df['name'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Apply threshold and skip self
    sim_scores = [(i, score) for i, score in sim_scores if score >= threshold and i != idx]

    top_anime = [df.iloc[i]['name'] for i, score in sim_scores[:n]]
    return top_anime

# Example
print(recommend_anime("Naruto", n=5))


['Naruto: Shippuuden', 'Naruto: Shippuuden Movie 4 - The Lost Tower', 'Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsugu Mono', 'Boruto: Naruto the Movie', 'Naruto x UT']
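
The task also asks for experimenting with threshold values. A minimal sketch of the idea, using made-up similarity scores rather than the notebook's `cosine_sim` matrix, shows how the candidate list shrinks as the threshold rises. The helper `count_above` is hypothetical and exists only for this illustration.

```python
import numpy as np

# Hypothetical similarity scores between one target title and six other titles
toy_scores = np.array([0.95, 0.80, 0.55, 0.40, 0.25, 0.10])

def count_above(scores, threshold):
    """Return how many candidates have similarity >= threshold."""
    return int((scores >= threshold).sum())

for t in (0.2, 0.5, 0.9):
    print(f"threshold={t}: {count_above(toy_scores, t)} candidates")
# threshold=0.2: 5 candidates
# threshold=0.5: 3 candidates
# threshold=0.9: 1 candidates
```

On the real data, the same experiment could be run by calling `recommend_anime("Naruto", n=50, threshold=t)` for several values of `t` and comparing the lengths of the returned lists.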


Evaluation:

In [16]:
from sklearn.model_selection import train_test_split

# Random split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Simple evaluation (Precision@K)
def precision_at_k(title, k=5):
    recs = recommend_anime(title, n=k)
    if isinstance(recs, str):  # anime not found
        return None
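    # If test set contains any of the recommended anime, count as relevant.
    # Caveat: "relevant" here only means the title landed in the random 20% test split,
    # so the score is expected to sit near 0.2 regardless of recommendation quality.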
    relevant = sum(1 for anime in recs if anime in test_df['name'].values)
    return relevant / k

precisions = [precision_at_k(title) for title in test_df['name'].sample(20)]
precisions = [p for p in precisions if p is not None]
print("Average Precision@5:", np.mean(precisions))


Average Precision@5: 0.22000000000000003
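
The Precision@K cell above counts a recommendation as relevant when the title falls in the test split. One alternative, sketched here under the assumption that "relevant" means "shares at least one genre with the target", computes precision, recall, and F1-score together. The catalogue, the recommended list, and the helper names are hypothetical; this is one possible relevance definition, not the assignment's prescribed one.

```python
def genre_set(genre_string):
    """Split a comma-separated genre string into a set of genres."""
    return set(genre_string.split(", "))

def precision_recall_f1(recommended, relevant):
    """Compute precision, recall, and F1 for one recommendation list."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy catalogue: title -> genre string
toy_catalogue = {
    "A": "Action, Comedy",
    "B": "Action, Shounen",
    "C": "Drama, Romance",
    "D": "Comedy, School",
    "E": "Sci-Fi, Thriller",
}

target = "A"
target_genres = genre_set(toy_catalogue[target])

# Assumed ground truth: every other title sharing at least one genre with the target
relevant = [t for t, g in toy_catalogue.items()
            if t != target and genre_set(g) & target_genres]

recommended = ["B", "C"]  # hypothetical output of a recommender
p, r, f1 = precision_recall_f1(recommended, relevant)
print(p, r, f1)  # 0.5 0.5 0.5
```

Here the relevant set is B and D, one of the two recommendations is a hit, and one of the two relevant titles is retrieved, so all three metrics come out at 0.5.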



Interview Questions:

1. Can you explain the difference between user-based and item-based collaborative filtering?

**User-Based Collaborative Filtering**

**Idea:** Finds users similar to the target user and recommends items those users liked.

**Example:** If User A and User B have a high similarity score (watch similar anime), and User B liked Naruto, then Naruto might be recommended to User A.

**Similarity Computation:** Based on user–item interaction patterns (e.g., ratings, likes).

**Pros:** Can surface diverse recommendations, because suggestions come from what like-minded users enjoyed rather than from what the target user has already watched.

**Cons:** Not scalable for very large datasets, since similarity must be calculated between all users.

**Item-Based Collaborative Filtering**

**Idea:** Finds items (anime) similar to the ones the target user already liked and recommends them.

**Example:** If Naruto and Bleach are often watched together, then if a user liked Naruto, Bleach will be recommended.

**Similarity Computation:** Based on item–item co-occurrence patterns across users.

**Pros:** More scalable and stable over time, since item similarities don’t change as often.

**Cons:** Might miss niche recommendations that depend on unique user patterns.
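
The two variants differ mainly in which axis of the rating matrix the similarity is computed over. A minimal sketch with a made-up 3×3 rating matrix (rows are users, columns are anime) illustrates this. Treating 0 as "not rated" is a simplifying assumption for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows = users A, B, C; columns = anime; 0 means "not rated"
titles = ["Naruto", "Bleach", "Clannad"]
ratings = np.array([
    [5, 4, 0],   # user A
    [5, 5, 1],   # user B
    [1, 0, 5],   # user C
])

# User-based: compare rows (each user's rating vector)
user_sim = cosine_similarity(ratings)

# Item-based: compare columns (each anime's rating vector across users)
item_sim = cosine_similarity(ratings.T)

print(np.round(user_sim, 3))
print(np.round(item_sim, 3))
```

In this toy matrix, user A comes out much closer to user B than to user C, and Naruto much closer to Bleach than to Clannad, which matches the two examples given above.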

2. What is collaborative filtering, and how does it work?

**Definition:**
Collaborative filtering is a recommendation technique that predicts a user’s interests from the preferences of many other users. The assumption is: “If two users agreed in the past, they are likely to agree in the future.”

**How it works:**

**Data Collection:**  Gather user–item interaction data (e.g., ratings, watch history).

**Similarity Calculation:** Measure similarity between users or items using metrics like cosine similarity, Pearson correlation, or Jaccard index.

**Recommendation Generation:**

**User-Based:** Recommend items liked by similar users.

**Item-Based:** Recommend items similar to those the user liked.

**Prediction:** Rank items by predicted rating or similarity score.
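
The steps above can be sketched for the user-based case as a similarity-weighted average of other users' ratings. The rating matrix and the function name `predict_user_based` are hypothetical, and production systems typically add refinements such as mean-centering and neighbourhood selection that are left out here.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows = users, columns = anime; 0 means "not rated"
ratings = np.array([
    [5.0, 4.0, 0.0],   # user 0 has not rated item 2
    [5.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
])

def predict_user_based(ratings, user, item):
    """Predict ratings[user, item] as a similarity-weighted average
    of the ratings given to `item` by the other users who rated it."""
    user_sim = cosine_similarity(ratings)
    raters = ratings[:, item] > 0      # users who rated the item
    raters[user] = False               # exclude the target user
    weights = user_sim[user, raters]
    if weights.sum() == 0:
        return 0.0                     # no usable neighbours
    return float(weights @ ratings[raters, item] / weights.sum())

pred = predict_user_based(ratings, user=0, item=2)
print(round(pred, 2))
```

Because user 0 is far more similar to user 1 (who rated the item 1) than to user 2 (who rated it 5), the prediction lands close to the low end of the scale.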

**Key Point:**
Collaborative filtering doesn’t require knowledge of the item’s content (like genres or descriptions), only user–item interactions.

