                                        Recommendation System                                            

Data Description:               
                           
* anime_id: Unique ID of each anime.
* name: Anime title.
* type: Anime broadcast type, such as TV, OVA, etc.
* genre: Anime genre(s), as a comma-separated list.
* episodes: The number of episodes of each anime.
* rating: The average rating given to each anime by the users who rated it.
* members: Number of community members for each anime.
                              
Objective:               

The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.          

Dataset:             
                                  
Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.
 
Tasks:


1. Data Preprocessing:             
                                              
* Load the dataset into a suitable data structure (e.g., pandas DataFrame).
* Handle missing values, if any.
* Explore the dataset to understand its structure and attributes.



In [23]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack


In [2]:
df = pd.read_csv("anime.csv")
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
print("dataset Info:\n")
df.info()

dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [5]:
print("Missing values in each column:\n")
print(df.isnull().sum())

Missing values in each column:

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [6]:
df = df.dropna(subset=['genre', 'rating'])

In [10]:
# Treat 'Unknown' episode counts as missing, then fill them with the median
df['episodes'] = df['episodes'].replace('Unknown', np.nan)
df['episodes'] = df['episodes'].astype(float)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which pandas warns about and which will stop working in pandas 3.0
df['episodes'] = df['episodes'].fillna(df['episodes'].median())


In [11]:
print("Duplicate Rows:", df.duplicated().sum())


Duplicate Rows: 0


In [12]:
df.drop_duplicates(inplace=True)

In [13]:
print("Statistical Summary:")
display(df.describe())

Statistical Summary:


Unnamed: 0,anime_id,episodes,rating,members
count,12017.0,12017.0,12017.0,12017.0
mean,13638.001165,12.323542,6.478264,18348.88
std,11231.076675,46.747242,1.023857,55372.5
min,1.0,1.0,1.67,12.0
25%,3391.0,1.0,5.89,225.0
50%,9959.0,2.0,6.57,1552.0
75%,23729.0,12.0,7.18,9588.0
max,34519.0,1818.0,10.0,1013917.0


In [14]:
print("Dataset Shape:", df.shape)

Dataset Shape: (12017, 7)


2. Feature Extraction:              
                           
* Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
* Convert categorical features into numerical representations if necessary.
* Normalize numerical features if required.



In [18]:
# Selecting features for similarity computation
# We’ll use 'genre' (text-based) and 'rating' (numeric)
features = df[['genre', 'rating']].copy()

In [None]:
features['genre'] = features['genre'].fillna('')

In [19]:
# Converting genre text to numerical features using CountVectorizer
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(', '))
genre_matrix = vectorizer.fit_transform(features['genre'])



In [20]:
# Normalizing the 'rating' feature in order to bring it to same scale
scaler = MinMaxScaler()
rating_scaled = scaler.fit_transform(features[['rating']])


In [22]:
# Combining genre and rating features into a single feature matrix
# scipy.sparse.hstack joins the columns and keeps the result sparse
feature_matrix = hstack([genre_matrix, rating_scaled])
print("Shape of feature matrix:", feature_matrix.shape)

Shape of feature matrix: (12017, 44)


3. Recommendation System:           
                        
* Design a function to recommend anime based on cosine similarity.
* Given a target anime, recommend a list of similar anime based on cosine similarity scores.
* Experiment with different threshold values for similarity scores to adjust the recommendation list size.
* Analyze the performance of the recommendation system and identify areas of improvement.


In [24]:
# Computing the cosine similarity matrix between all anime
cosine_sim = cosine_similarity(feature_matrix, feature_matrix)
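
To make the similarity matrix above easier to read, here is a minimal sketch of what cosine similarity computes: the dot product of two feature rows divided by the product of their lengths. The three rows below are made-up values (three genre flags plus a scaled rating), not rows from `anime.csv`, and `X`, `unit_rows` and `toy_sim` are illustrative names only.

```python
import numpy as np

# Hypothetical feature rows: three genre flags followed by a scaled rating
X = np.array([
    [1.0, 1.0, 0.0, 0.9],   # toy anime 0
    [1.0, 1.0, 0.0, 0.5],   # toy anime 1: same genres, lower rating
    [0.0, 0.0, 1.0, 0.9],   # toy anime 2: different genre, same rating
])

# Normalize each row to unit length; the dot products of unit vectors
# are the pairwise cosine similarities
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
toy_sim = unit_rows @ unit_rows.T

print(np.round(toy_sim, 3))
```

In this toy case the pair with shared genres scores far higher than the pair that only shares a rating, which suggests that in a genre-plus-rating matrix the genre columns carry most of the weight.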


In [26]:
# Recommendation Function
def recommend_anime(title, threshold=0.3, top_n=10):
    """
    Recommends similar anime using cosine similarity.
    """
    if title not in df['name'].values:
        print("Anime not found in dataset.")
        return pd.DataFrame()

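    # Use the row position, not the index label: rows were dropped earlier
    # without resetting the index, while cosine_sim and df.iloc are positional
    idx = int(np.flatnonzero(df['name'].to_numpy() == title)[0])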
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [(i, score) for i, score in sim_scores if score >= threshold and i != idx]
    top_similar = sim_scores[:top_n]

    indices = [i for i, _ in top_similar]
    results = df.iloc[indices][['name', 'genre', 'rating', 'type']].copy()
    results['similarity_score'] = [score for _, score in top_similar]
    return results.reset_index(drop=True)

In [27]:
# Example
anime_title = "Death Note"  
recommendations = recommend_anime(anime_title, threshold=0.3, top_n=10)

In [28]:
print(f" Top Recommendations for '{anime_title}':")
display(recommendations)

 Top Recommendations for 'Death Note':


Unnamed: 0,name,genre,rating,type,similarity_score
0,Death Note Rewrite,"Mystery, Police, Psychological, Supernatural, ...",7.84,Special,0.99914
1,Mousou Dairinin,"Drama, Mystery, Police, Psychological, Superna...",7.74,TV,0.919275
2,Higurashi no Naku Koro ni Kai,"Mystery, Psychological, Supernatural, Thriller",8.41,TV,0.908187
3,Higurashi no Naku Koro ni,"Horror, Mystery, Psychological, Supernatural, ...",8.17,TV,0.823035
4,Higurashi no Naku Koro ni Rei,"Comedy, Mystery, Psychological, Supernatural, ...",7.56,OVA,0.820105
5,Jigoku Shoujo Mitsuganae,"Mystery, Psychological, Supernatural",7.81,TV,0.805151
6,Yakushiji Ryouko no Kaiki Jikenbo,"Mystery, Police, Supernatural",7.19,TV,0.803066
7,Saint Luminous Jogakuin,"Mystery, Psychological, Supernatural",6.17,TV,0.796975
8,"Yakushiji Ryouko no Kaiki Jikenbo: Hamachou, V...","Mystery, Police, Supernatural",5.97,Special,0.795367
9,Mirai Nikki (TV),"Action, Mystery, Psychological, Shounen, Super...",8.07,TV,0.757631
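
The task list also asks for experiments with different similarity thresholds. One possible way to approach this, sketched on a made-up row of similarity scores rather than on the real `cosine_sim` matrix, is to count how many candidates survive each threshold. The scores and the helper `count_above` are hypothetical.

```python
import numpy as np

# Hypothetical similarity scores of one target anime against six others
toy_scores = np.array([0.99, 0.92, 0.81, 0.55, 0.32, 0.10])

def count_above(scores, thresholds):
    """Return {threshold: number of scores >= threshold}."""
    return {t: int((scores >= t).sum()) for t in thresholds}

list_sizes = count_above(toy_scores, [0.3, 0.5, 0.8, 0.95])
print(list_sizes)
```

With the notebook's own function, the equivalent experiment would be to call `recommend_anime` with several `threshold` values and a large `top_n`, then compare the lengths of the returned DataFrames.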


# Interview Questions:
          
## 1. Can you explain the difference between user-based and item-based collaborative filtering?           

In collaborative filtering, recommendations are based on similarities, either between users or between items.

* User-based collaborative filtering looks for users who have similar tastes.
  * For example, if User A and User B have rated many anime similarly, the system recommends to A the anime that B liked.
  * User-based → finds similar users

* Item-based collaborative filtering focuses on similarities between items.
  * For example, if two anime are rated similarly by many users, a user who watches one of them is recommended the other.
  * Item-based → finds similar items
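
The two variants differ only in which axis of the user-item rating matrix is compared. A minimal sketch, assuming a tiny made-up rating matrix with no missing values (real rating data is sparse and usually needs mean-centering or masking), and a hypothetical helper `cosine_rows`:

```python
import numpy as np

# Hypothetical ratings: rows are users A-C, columns are anime 0-3
ratings = np.array([
    [9.0, 8.0, 2.0, 1.0],   # user A
    [8.0, 9.0, 1.0, 2.0],   # user B: tastes close to A
    [1.0, 2.0, 9.0, 8.0],   # user C: opposite tastes
])

def cosine_rows(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_rows(ratings)      # user-based: compare rows (users)
item_sim = cosine_rows(ratings.T)    # item-based: compare columns (anime)

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

Here users A and B come out as near neighbors, and anime 0 and 1 come out as similar items, because the same users rated them alike.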

## 2. What is collaborative filtering, and how does it work?

Collaborative filtering is a recommendation technique that suggests items to users based on the preferences of other users.          
                   
* It works by collecting user behavior data like ratings, watch history, or likes, and then finding patterns in that data.
                      
* For example, if many users who liked “Death Note” also liked “Attack on Titan”, then the system will recommend “Attack on Titan” to someone who liked “Death Note.”
                                               
* It is called “collaborative” because it relies on the pooled preferences of many users rather than on the content of the items themselves.
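
The "users who liked X also liked Y" pattern described above can be sketched with simple co-occurrence counts. The users, their liked titles and the helper `also_liked` are all made up for illustration, and this is only one simple form of collaborative filtering.

```python
from collections import Counter

# Hypothetical "liked" sets per user (made-up data, not from anime.csv)
likes = {
    "user1": {"Death Note", "Attack on Titan", "Steins;Gate"},
    "user2": {"Death Note", "Attack on Titan"},
    "user3": {"Death Note", "Steins;Gate"},
    "user4": {"Attack on Titan", "Gintama"},
    "user5": {"Death Note", "Attack on Titan"},
}

def also_liked(target, likes):
    """Count the other titles liked by users who liked `target`."""
    counts = Counter()
    for liked in likes.values():
        if target in liked:
            counts.update(liked - {target})
    return counts.most_common()

co_liked = also_liked("Death Note", likes)
print(co_liked)
```

Note that no genre or rating information is used here: the ranking comes entirely from which users liked which titles.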