Recommendation System
Data Description:
anime_id: Unique ID of each anime.
name: Anime title.
type: Anime broadcast type, such as TV, OVA, etc.
genre: Anime genre.
episodes: The number of episodes of each anime.
rating: The average rating for each anime, based on the users who rated it.
members: Number of community members for each anime.
Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
Dataset:
Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.
Tasks:
Data Preprocessing:
Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.
Feature Extraction:
Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.
Recommendation System:
Design a function to recommend anime based on cosine similarity.
Given a target anime, recommend a list of similar anime based on cosine similarity scores.
Experiment with different threshold values for similarity scores to adjust the recommendation list size.
Evaluation:
Split the dataset into training and testing sets.
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
Analyze the performance of the recommendation system and identify areas of improvement.
Interview Questions:
1. Can you explain the difference between user-based and item-based collaborative filtering?
2. What is collaborative filtering, and how does it work?

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, MultiLabelBinarizer, StandardScaler
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

### Load and Preprocess Data

In [2]:
## Loading dataset:
df = pd.read_csv('anime.csv')
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [3]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


## 1. Data Preprocessing

In [4]:
## Handling missing values
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [5]:
df.shape

(12294, 7)

In [6]:
len(df.anime_id.unique()) 

12294

In [7]:
len(df.name.unique())

12292

In [8]:
len(df.rating.unique())

599

In [9]:
len(df.genre.unique())

3265

In [10]:
len(df.type.unique())

7

In [11]:
df.type.unique()

array(['Movie', 'TV', 'OVA', 'Special', 'Music', 'ONA', nan], dtype=object)

In [20]:
# Drop rows with missing 'genre' or 'rating'
df.dropna(subset=['genre', 'rating'],inplace=True)
df["genre"] = df["genre"].apply(lambda x: x.split(", "))

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12017 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12017 non-null  int64  
 1   name      12017 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12017 non-null  object 
 4   episodes  12017 non-null  object 
 5   rating    12017 non-null  float64
 6   members   12017 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 751.1+ KB
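
`df.info()` lists `episodes` as `object` rather than a numeric dtype, which usually means some entries are not numbers. A minimal sketch of converting such a column, assuming the non-numeric entries are placeholder strings such as "Unknown" (an assumption; the actual values are not shown above) and using a toy frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for the real data; "Unknown" is an assumed placeholder value.
toy = pd.DataFrame({"name": ["A", "B", "C"], "episodes": ["1", "64", "Unknown"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")

# Fill the gaps with the median so the column can be scaled later.
toy["episodes"] = toy["episodes"].fillna(toy["episodes"].median())
print(toy["episodes"].tolist())
```

Filling with the median is only one option; dropping the rows or keeping a separate "unknown" indicator may suit the task better.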


In [22]:
df.describe()

Unnamed: 0,anime_id,rating,members
count,12017.0,12017.0,12017.0
mean,13638.001165,6.478264,18348.88
std,11231.076675,1.023857,55372.5
min,1.0,1.67,12.0
25%,3391.0,5.89,225.0
50%,9959.0,6.57,1552.0
75%,23729.0,7.18,9588.0
max,34519.0,10.0,1013917.0


## 2. Feature Extraction

In [26]:
# Genre Binarization
mlb = MultiLabelBinarizer()
genre_encoded = mlb.fit_transform(df["genre"])
genre_encoded

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [27]:
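# One-hot encode 'anime_id'. This yields one indicator column per anime rather
# than per broadcast type; encoding the broadcast type would use df['type'].
broadcast_encoded = pd.get_dummies(df['anime_id'], prefix='broadcast')
broadcast_encoded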

Unnamed: 0,broadcast_1,broadcast_5,broadcast_6,broadcast_7,broadcast_8,broadcast_15,broadcast_16,broadcast_17,broadcast_18,broadcast_19,...,broadcast_34412,broadcast_34447,broadcast_34453,broadcast_34464,broadcast_34475,broadcast_34476,broadcast_34490,broadcast_34503,broadcast_34514,broadcast_34519
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12290,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12291,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12292,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
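
To encode the broadcast type itself, `pd.get_dummies` can be applied to the `type` column, which produces one indicator column per category. A minimal sketch on a toy frame (the rows and the `type_` prefix are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy frame with a categorical broadcast type column.
toy = pd.DataFrame({"name": ["A", "B", "C", "D"],
                    "type": ["Movie", "TV", "TV", "OVA"]})

# One indicator column per distinct type; dtype=int gives 0/1 instead of booleans.
type_encoded = pd.get_dummies(toy["type"], prefix="type", dtype=int)
print(type_encoded)
```

Each row then has exactly one 1, so two anime of the same type share a nonzero feature and the block contributes to their cosine similarity.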


In [28]:
# Scale Numerical Features
scaler = MinMaxScaler()
scaler

In [29]:
df['rating_scaled'] = scaler.fit_transform(df[['rating']])
df['rating_scaled'] 

0        0.924370
1        0.911164
2        0.909964
3        0.900360
4        0.899160
           ...   
12289    0.297719
12290    0.313325
12291    0.385354
12292    0.397359
12293    0.454982
Name: rating_scaled, Length: 12017, dtype: float64

In [30]:
df['community_members'] = scaler.fit_transform(df[['members']])
df['community_members'] 

0        0.197867
1        0.782769
2        0.112683
3        0.664323
4        0.149180
           ...   
12289    0.000196
12290    0.000169
12291    0.000204
12292    0.000161
12293    0.000128
Name: community_members, Length: 12017, dtype: float64

In [32]:
# Combine Features
features = np.hstack((
    genre_encoded,
    broadcast_encoded.values,
    df[['rating_scaled', 'community_members']].values
))
features

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 9.24369748e-01, 1.97866664e-01],
       [1.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 9.11164466e-01, 7.82768603e-01],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 9.09963986e-01, 1.12683141e-01],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 3.85354142e-01, 2.04161139e-04],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 3.97358944e-01, 1.60764569e-04],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 4.54981993e-01, 1.28217141e-04]])

## Cosine Similarity

In [33]:
# Compute Cosine Similarity
cosine_sim = cosine_similarity(features)
cosine_sim

array([[1.        , 0.26770919, 0.11961814, ..., 0.10011446, 0.10300553,
        0.1166201 ],
       [0.26770919, 1.        , 0.31929191, ..., 0.07798897, 0.08023463,
        0.0908323 ],
       [0.11961814, 0.31929191, 1.        , ..., 0.08046391, 0.08278845,
        0.0937319 ],
       ...,
       [0.10011446, 0.07798897, 0.08046391, ..., 1.        , 0.53554191,
        0.53974677],
       [0.10300553, 0.08023463, 0.08278845, ..., 0.53554191, 1.        ,
        0.54107319],
       [0.1166201 , 0.0908323 , 0.0937319 , ..., 0.53974677, 0.54107319,
        1.        ]])
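
Each entry of this matrix is the cosine of the angle between two feature vectors, that is, their dot product divided by the product of their norms. A small worked sketch on two made-up vectors, comparing the manual formula with `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up feature vectors (for example, three genre indicators).
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])

# Manual formula: dot(a, b) / (||a|| * ||b||)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Library call expects 2-D inputs, one row per sample.
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
print(round(manual, 4), round(library, 4))
```

Both values agree at about 0.7071, i.e. 1 / sqrt(2).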

In [34]:
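# Recommendation Function
def recommend_anime(anime_type, top_n=5):
    # Positional row of the first anime with the requested broadcast type
    # (cosine_sim is indexed by position, not by DataFrame label)
    idx = np.flatnonzero((df['type'] == anime_type).to_numpy())[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the anime itself
    top_indices = [i[0] for i in sim_scores[1:top_n+1]]
    return df.iloc[top_indices][['type', 'rating', 'genre', 'anime_id']]
recommend_anime

<function __main__.recommend_anime(anime_type, top_n=5)>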

In [35]:
# Handle missing values (dropna with inplace=True returns None, so nothing is assigned)
df.dropna(subset=["name", "genre", "rating"], inplace=True)

In [36]:
# Extract necessary features
df['genre'] = df['genre'].fillna('Unknown')  # Fill missing genres
df['genre']

0                   [Drama, Romance, School, Supernatural]
1        [Action, Adventure, Drama, Fantasy, Magic, Mil...
2        [Action, Comedy, Historical, Parody, Samurai, ...
3                                       [Sci-Fi, Thriller]
4        [Action, Comedy, Historical, Parody, Samurai, ...
                               ...                        
12289                                             [Hentai]
12290                                             [Hentai]
12291                                             [Hentai]
12292                                             [Hentai]
12293                                             [Hentai]
Name: genre, Length: 12017, dtype: object

In [37]:
df['rating'] = df['rating'].fillna(df['rating'].mean())  # Fill missing ratings with mean
df['rating']

0        9.37
1        9.26
2        9.25
3        9.17
4        9.16
         ... 
12289    4.15
12290    4.28
12291    4.88
12292    4.98
12293    5.46
Name: rating, Length: 12017, dtype: float64

In [38]:
# Create a DataFrame
df=pd.DataFrame(df)
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,rating_scaled,community_members
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,0.924370,0.197867
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",TV,64,9.26,793665,0.911164,0.782769
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.25,114262,0.909964,0.112683
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,24,9.17,673572,0.900360,0.664323
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.16,151266,0.899160,0.149180
...,...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,[Hentai],OVA,1,4.15,211,0.297719,0.000196
12290,5543,Under World,[Hentai],OVA,1,4.28,183,0.313325,0.000169
12291,5621,Violence Gekiga David no Hoshi,[Hentai],OVA,4,4.88,219,0.385354,0.000204
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,[Hentai],OVA,1,4.98,175,0.397359,0.000161


In [39]:
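# One-hot encoding for the 'genre' column.
# 'genre' already holds Python lists at this point, so splitting on ',' acts on
# their string form and leaves brackets and quotes in the column names.
df_genres = df['genre'].str.get_dummies(sep=',')
df = pd.concat([df, df_genres], axis=1)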

In [40]:
df_genres

Unnamed: 0,'Adventure','Adventure'],'Cars','Cars'],'Comedy','Comedy'],'Dementia','Dementia'],'Demons','Demons'],...,['Shounen'],['Slice of Life',['Slice of Life'],['Space'],['Sports'],['Super Power',['Supernatural'],['Thriller'],['Vampire'],['Yaoi']
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
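
The column names above carry brackets and quotes because `genre` holds lists rather than plain strings. One way to get clean indicator columns, sketched here on a toy frame (hypothetical rows), is to join each list with a delimiter before calling `str.get_dummies`:

```python
import pandas as pd

# Toy frame where each genre entry is already a list of strings.
toy = pd.DataFrame({"genre": [["Drama", "Romance"],
                              ["Action", "Drama"],
                              ["Sci-Fi"]]})

# Join every list into a single "A|B" string, then split on the same delimiter.
genre_dummies = toy["genre"].str.join("|").str.get_dummies(sep="|")
print(genre_dummies.columns.tolist())
print(genre_dummies)
```

`MultiLabelBinarizer`, used earlier in this notebook, gives the same kind of matrix directly from the lists.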


In [41]:
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,rating_scaled,community_members,'Adventure',...,['Shounen'],['Slice of Life',['Slice of Life'],['Space'],['Sports'],['Super Power',['Supernatural'],['Thriller'],['Vampire'],['Yaoi']
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,0.924370,0.197867,0,...,0,0,0,0,0,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",TV,64,9.26,793665,0.911164,0.782769,1,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.25,114262,0.909964,0.112683,0,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,24,9.17,673572,0.900360,0.664323,0,...,0,0,0,0,0,0,0,0,0,0
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.16,151266,0.899160,0.149180,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,[Hentai],OVA,1,4.15,211,0.297719,0.000196,0,...,0,0,0,0,0,0,0,0,0,0
12290,5543,Under World,[Hentai],OVA,1,4.28,183,0.313325,0.000169,0,...,0,0,0,0,0,0,0,0,0,0
12291,5621,Violence Gekiga David no Hoshi,[Hentai],OVA,4,4.88,219,0.385354,0.000204,0,...,0,0,0,0,0,0,0,0,0,0
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,[Hentai],OVA,1,4.98,175,0.397359,0.000161,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
scaler = MinMaxScaler()
scaler

In [43]:
# Normalize rating and member count (the result is displayed, not stored)
scaler.fit_transform(df[['rating','members']])

array([[9.24369748e-01, 1.97866664e-01],
       [9.11164466e-01, 7.82768603e-01],
       [9.09963986e-01, 1.12683141e-01],
       ...,
       [3.85354142e-01, 2.04161139e-04],
       [3.97358944e-01, 1.60764569e-04],
       [4.54981993e-01, 1.28217141e-04]])

In [44]:
# TF-IDF vectorizer (fitted on the anime titles in the next cell, not on genres)
tfidf = TfidfVectorizer(stop_words='english')
tfidf

In [45]:
tfidf_matrix = tfidf.fit_transform(df['name'])
tfidf_matrix

<12017x11822 sparse matrix of type '<class 'numpy.float64'>'
	with 42075 stored elements in Compressed Sparse Row format>

In [46]:
# Combine features (title TF-IDF + rating + episodes)
feature_matrix = pd.concat([pd.DataFrame(tfidf_matrix.toarray()), df[['rating', 'episodes']].reset_index(drop=True)], axis=1)
feature_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11814,11815,11816,11817,11818,11819,11820,11821,rating,episodes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.37,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.26,64
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.25,51
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.17,24
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.16,51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12012,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.15,1
12013,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.28,1
12014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.88,4
12015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.98,1


## 3. Recommendation System

### Design a function to recommend anime based on cosine similarity.

In [47]:
scaler = StandardScaler()
scaler

In [48]:
# Prepare the features to be used for similarity (genres, rating, and episodes)
features = df[['rating', 'episodes']].join(df[df_genres.columns])
features

Unnamed: 0,rating,episodes,'Adventure','Adventure'],'Cars','Cars'],'Comedy','Comedy'],'Dementia','Dementia'],...,['Shounen'],['Slice of Life',['Slice of Life'],['Space'],['Sports'],['Super Power',['Supernatural'],['Thriller'],['Vampire'],['Yaoi']
0,9.37,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9.26,64,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9.25,51,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9.17,24,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9.16,51,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,4.15,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,4.28,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,4.88,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,4.98,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [49]:
cosine_sim

array([[1.        , 0.26770919, 0.11961814, ..., 0.10011446, 0.10300553,
        0.1166201 ],
       [0.26770919, 1.        , 0.31929191, ..., 0.07798897, 0.08023463,
        0.0908323 ],
       [0.11961814, 0.31929191, 1.        , ..., 0.08046391, 0.08278845,
        0.0937319 ],
       ...,
       [0.10011446, 0.07798897, 0.08046391, ..., 1.        , 0.53554191,
        0.53974677],
       [0.10300553, 0.08023463, 0.08278845, ..., 0.53554191, 1.        ,
        0.54107319],
       [0.1166201 , 0.0908323 , 0.0937319 , ..., 0.53974677, 0.54107319,
        1.        ]])

### Given a target anime, recommend a list of similar anime based on cosine similarity scores.

In [50]:
def recommend_anime(target_anime_id, cosine_sim, top_n=5):
    # Get the positional index of the target anime (cosine_sim rows are positional)
    target_idx = np.flatnonzero((df['anime_id'] == target_anime_id).to_numpy())[0]
    # Get the cosine similarity scores for the target anime
    sim_scores = list(enumerate(cosine_sim[target_idx]))
    # Sort the anime by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the top N, excluding the target anime itself
    top_similar_anime = [s for s in sim_scores if s[0] != target_idx][:top_n]
    # Get the anime titles (stored in the 'name' column) for the top N similar anime
    recommended_anime = [df.iloc[i[0]]['name'] for i in top_similar_anime]
    return recommended_anime

In [51]:
recommend_anime

<function __main__.recommend_anime(target_anime_id, cosine_sim, top_n=5)>
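
Rows of `cosine_sim` are positional, while `df.index` keeps its original labels after `dropna`, so a label taken from `.index[0]` can point at the wrong row of the matrix. A small sketch on a toy frame (hypothetical data) showing how the two diverge:

```python
import pandas as pd

# Toy frame; the second row is dropped because its rating is missing.
toy = pd.DataFrame({"anime_id": [10, 20, 30, 40],
                    "rating": [8.0, None, 7.0, 6.0]})
toy = toy.dropna(subset=["rating"])   # index labels are now 0, 2, 3

# Label of the row with anime_id 30 versus its position in the frame.
label = toy[toy["anime_id"] == 30].index[0]
position = toy.index.get_loc(label)
print(label, position)
```

Here the label is 2 but the position is 1, and only the position matches the row order of a similarity matrix computed from `toy`. Calling `reset_index(drop=True)` right after dropping rows is another way to keep the two aligned.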

In [52]:
top_n=5
top_n

5

In [68]:
## Recommendation Function
title = "Steins;Gate"
# Find the positional row of this title in the dataset
idx = np.flatnonzero((df['name'] == title).to_numpy())[0]

In [69]:
# Compute similarity scores for that anime
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores

[(3, 1.0000000000000002),
 (59, 0.7222054026127408),
 (126, 0.7166688896288222),
 (196, 0.6977330904405861),
 (5124, 0.5926459146664204),
 (5523, 0.5881276664198613),
 (6885, 0.577076720346108),
 (2518, 0.545864495467882),
 (5452, 0.5284519233182772),
 (238, 0.5235282904691584),
 (9059, 0.5087547185033486),
 (10370, 0.508748979014074),
 (1578, 0.5043884372187112),
 (1594, 0.5030815483778088),
 (3491, 0.49534619478160313),
 (5281, 0.4938146394062144),
 (365, 0.49285019199134816),
 (3579, 0.49247722003928396),
 (10218, 0.4924231679361308),
 (4149, 0.48932749375728535),
 (8597, 0.48876639505464176),
 (8881, 0.48799723258997824),
 (5475, 0.4873226246782624),
 (4892, 0.48596423721791604),
 (250, 0.4857064610461391),
 (5308, 0.4855431014551707),
 (5247, 0.4843044966932695),
 (10810, 0.4817436349605688),
 (5722, 0.4808682728167615),
 (5805, 0.47996279519117735),
 (5818, 0.479764316364665),
 (9456, 0.4793179584653504),
 (6050, 0.4781275834479266),
 (6271, 0.4779477952458916),
 (10071, 0.477861

In [70]:
# Print top 5 most similar anime
for i in sim_scores[1:6]:
    print(df.iloc[i[0]]['name'], "with similarity score:", i[1])

Steins;Gate Movie: Fuka Ryouiki no Déjà vu with similarity score: 0.7222054026127408
Steins;Gate: Oukoubakko no Poriomania with similarity score: 0.7166688896288222
Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero with similarity score: 0.6977330904405861
Under the Dog with similarity score: 0.5926459146664204
Loups=Garous with similarity score: 0.5881276664198613
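
The same top-N lookup can be written with `np.argsort` instead of sorting a Python list of tuples. A minimal sketch on a made-up similarity row (the values and the target position are illustrative only):

```python
import numpy as np

# Made-up similarity scores of one target (position 1) against five anime.
sim_row = np.array([0.2, 1.0, 0.7, 0.9, 0.1])
target = 1

# Sort positions by descending similarity.
order = np.argsort(-sim_row)

# Drop the target by position rather than assuming it is ranked first.
top_n = [int(i) for i in order if i != target][:2]
print(top_n)
```

Filtering by position is slightly safer than slicing off the first entry, because another anime with identical features would also score 1.0 and could be ranked ahead of the target.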


## 4. Evaluation

### Split the dataset into training and testing sets.

In [53]:
# Splitting data into train and test (simplified, actual split depends on availability of user ratings)
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test

(       anime_id                                               name  \
 909        9201  Air Gear: Kuro no Hane to Nemuri no Mori - Bre...   
 7480      32811                                        Black Ocean   
 496         416                                    Kurenai no Buta   
 9204      28965                     Kibun wa Uaa Jitsuzai OL Kouza   
 6846      31972                                  Tang Lang Bu Chan   
 ...         ...                                                ...   
 12231     13051    Bishoujo Animerama: Miyuki-chan SOS-H Shichauzo   
 5193       5917                   Tsuru ni Notte: Tomoko no Bouken   
 5392       3880                          Makyou Densetsu Acrobunch   
 860       22819                                     Aikatsu! Movie   
 7276       1252            Fushigi no Umi no Nadia: Original Movie   
 
                                                    genre   type episodes  \
 909             [Action, Comedy, Ecchi, Shounen, Sports]    OVA     

### Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.

In [55]:
from sklearn.metrics import precision_score, recall_score, f1_score
# Generate recommendations for every anime in the dataset
# (Note: This is a simplified example; real-world evaluation needs ground-truth relevance labels)
y_true = df['anime_id']
y_pred = [recommend_anime(anime_id, cosine_sim) for anime_id in y_true]  # Lists of recommended titles

In [56]:
y_true

0        32281
1         5114
2        28977
3         9253
4         9969
         ...  
12289     9316
12290     5543
12291     5621
12292     6133
12293    26081
Name: anime_id, Length: 12017, dtype: int64

In [57]:
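# Placeholder only: precision, recall and f1 (this cell and the next two) are
# assigned y_true unchanged, so no metric is actually computed here.
precision = y_true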
precision

0        32281
1         5114
2        28977
3         9253
4         9969
         ...  
12289     9316
12290     5543
12291     5621
12292     6133
12293    26081
Name: anime_id, Length: 12017, dtype: int64

In [59]:
recall = (y_true)
recall

0        32281
1         5114
2        28977
3         9253
4         9969
         ...  
12289     9316
12290     5543
12291     5621
12292     6133
12293    26081
Name: anime_id, Length: 12017, dtype: int64

In [60]:
f1 = (y_true)
f1

0        32281
1         5114
2        28977
3         9253
4         9969
         ...  
12289     9316
12290     5543
12291     5621
12292     6133
12293    26081
Name: anime_id, Length: 12017, dtype: int64
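
Precision, recall and F1 need a definition of which recommendations count as relevant, and this dataset has no per-user ratings to provide one. A hedged sketch of precision@k, recall@k and F1 on toy data, assuming (purely for illustration) that an anime is relevant when it shares at least one genre with the target; `precision_recall_f1_at_k` and the toy arrays are hypothetical names, not part of the notebook:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical): genre sets and matching binary feature rows.
genres = [{"Sci-Fi", "Thriller"}, {"Sci-Fi"}, {"Romance"}, {"Thriller", "Drama"}]
toy_features = np.array([
    [1, 1, 0, 0],   # Sci-Fi, Thriller
    [1, 0, 0, 0],   # Sci-Fi
    [0, 0, 1, 0],   # Romance
    [0, 1, 0, 1],   # Thriller, Drama
], dtype=float)
toy_sim = cosine_similarity(toy_features)

def precision_recall_f1_at_k(target, k):
    # Top-k positions by similarity, excluding the target itself.
    ranked = [int(i) for i in np.argsort(-toy_sim[target]) if i != target][:k]
    # Assumed relevance rule: shares at least one genre with the target.
    relevant = {i for i, g in enumerate(genres)
                if i != target and g & genres[target]}
    hits = sum(1 for i in ranked if i in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1_at_k(target=0, k=3)
print(p, r, f)
```

Because the features here are themselves genre indicators, this relevance rule is partly circular and will flatter the recommender; held-out user ratings would give a more meaningful evaluation.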

### Adjusting Similarity Thresholds

In [62]:
def recommend_with_threshold(target_anime_id, cosine_sim, threshold=0.8):
    # Get the positional index of the target anime (cosine_sim rows are positional)
    target_idx = np.flatnonzero((df['anime_id'] == target_anime_id).to_numpy())[0]
    # Get the cosine similarity scores for the target anime
    sim_scores = list(enumerate(cosine_sim[target_idx]))
    # Keep anime at or above the threshold, excluding the target itself
    filtered_sim_scores = [x for x in sim_scores if x[1] >= threshold and x[0] != target_idx]
    # Get the titles (stored in the 'name' column) of the filtered anime
    recommended_anime = [df.iloc[i[0]]['name'] for i in filtered_sim_scores]
    return recommended_anime

In [63]:
recommend_with_threshold

<function __main__.recommend_with_threshold(target_anime_id, cosine_sim, threshold=0.8)>
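
Raising the threshold shrinks the recommendation list, and lowering it grows the list. A small sketch of that effect on a made-up similarity row (the scores are illustrative, not taken from `cosine_sim`):

```python
import numpy as np

# Made-up similarity scores for one target anime at position 0.
sim_row = np.array([1.0, 0.72, 0.70, 0.59, 0.52, 0.31])
target = 0

# Remove the target's own score before counting.
others = np.delete(sim_row, target)

# Number of recommendations that survive each threshold.
counts = {t: int((others >= t).sum()) for t in (0.3, 0.5, 0.6)}
print(counts)
```

With these scores the list has 5, 4 and 2 entries at thresholds 0.3, 0.5 and 0.6. On the real matrix, a threshold of 0.8 may return nothing for some titles, since the recorded top score for Steins;Gate is about 0.72.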

In [64]:
cosine_sim

array([[1.        , 0.26770919, 0.11961814, ..., 0.10011446, 0.10300553,
        0.1166201 ],
       [0.26770919, 1.        , 0.31929191, ..., 0.07798897, 0.08023463,
        0.0908323 ],
       [0.11961814, 0.31929191, 1.        , ..., 0.08046391, 0.08278845,
        0.0937319 ],
       ...,
       [0.10011446, 0.07798897, 0.08046391, ..., 1.        , 0.53554191,
        0.53974677],
       [0.10300553, 0.08023463, 0.08278845, ..., 0.53554191, 1.        ,
        0.54107319],
       [0.1166201 , 0.0908323 , 0.0937319 , ..., 0.53974677, 0.54107319,
        1.        ]])

## Interview Questions:

### 1. Can you explain the difference between user-based and item-based collaborative filtering?

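- **User-based:** Finds users whose rating patterns are similar to the target user's and recommends items those users liked.
- **Item-based:** Finds items that receive similar ratings from the same users and recommends items similar to those the target user already liked.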
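
The two variants differ only in which axis of the user-item rating matrix is compared. A minimal sketch on a made-up matrix (3 users by 4 items, with 0 standing for "not rated", which is a simplifying assumption):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: rows are users, columns are items, 0 means unrated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

# User-based: compare rows (users) with each other.
user_sim = cosine_similarity(ratings)

# Item-based: compare columns (items) with each other.
item_sim = cosine_similarity(ratings.T)

print(user_sim.shape, item_sim.shape)
print(np.round(user_sim[0], 3))
```

Here user 0 is far closer to user 1 than to user 2, so a user-based recommender would borrow user 1's preferences; an item-based one would instead look up rows of `item_sim` for the items user 0 already rated highly.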

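### 2. What is collaborative filtering, and how does it work?

Collaborative filtering recommends items using the ratings or interactions of many users rather than the content of the items. It measures similarity between users or between items in the user-item rating matrix and predicts a user's interest in an unseen item from the ratings of the most similar users or items.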