## Recommendation System


### Data Description:
1. `anime_id`: unique ID of each anime.
2. `name`: anime title.
3. `type`: broadcast type, such as TV, OVA, etc.
4. `genre`: comma-separated list of genres.
5. `episodes`: number of episodes of each anime.
6. `rating`: average rating given to each anime by the users who rated it.
7. `members`: number of community members for each anime.


### Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.
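
As a minimal sketch of the measure named in the objective, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper `cosine_sim_pair` and the three genre vectors below are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
import numpy as np

def cosine_sim_pair(a, b):
    """Cosine similarity between two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical genre indicator vectors: [Action, Comedy, Drama, Sci-Fi]
anime_a = [1, 1, 0, 0]
anime_b = [1, 0, 0, 0]
anime_c = [0, 0, 1, 1]

print(cosine_sim_pair(anime_a, anime_b))  # one shared genre: about 0.707
print(cosine_sim_pair(anime_a, anime_c))  # no shared genre: 0.0
```

Because the score depends only on the angle between the vectors, an anime with many genres is not automatically more similar to everything else.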

### Task-1 Data Preprocessing:
1. Load the dataset into a suitable data structure (e.g., pandas DataFrame).
2. Handle missing values, if any.
3. Explore the dataset to understand its structure and attributes.

#### 1. Load the dataset into a suitable data structure (e.g., pandas DataFrame).

In [225]:
# Import the Required Libraries.
import pandas as pd
import numpy as np

In [226]:
# Load and Read the Dataset
df=pd.read_csv('anime.csv')

In [227]:
df.head(10)

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| 5 | 32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga... | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93351 |
| 6 | 11061 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | TV | 148 | 9.13 | 425855 |
| 7 | 820 | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 9.11 | 80679 |
| 8 | 15335 | Gintama Movie: Kanketsu-hen - Yorozuya yo Eien... | Action, Comedy, Historical, Parody, Samurai, S... | Movie | 1 | 9.1 | 72534 |
| 9 | 15417 | Gintama&#039;: Enchousen | Action, Comedy, Historical, Parody, Samurai, S... | TV | 13 | 9.11 | 81109 |


In [228]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


#### 2. Handle missing values
#### 3. Explore the dataset to understand its structure and attributes.

In [230]:
# Checking for Null Values.
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [231]:
# Drop the rows that have no rating.
df.dropna(subset=['rating'],inplace=True)

In [232]:
df.isnull().sum()

anime_id     0
name         0
genre       47
type         0
episodes     0
rating       0
members      0
dtype: int64

In [233]:
# Rating Missing Values Handled.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12064 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12064 non-null  int64  
 1   name      12064 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12064 non-null  object 
 4   episodes  12064 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12064 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 754.0+ KB


In [234]:
# Handling Genre Missing Values.
df['genre']=df['genre'].fillna('')

In [235]:
# All Missing Values Handled.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12064 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12064 non-null  int64  
 1   name      12064 non-null  object 
 2   genre     12064 non-null  object 
 3   type      12064 non-null  object 
 4   episodes  12064 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12064 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 754.0+ KB
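
The two choices made above (dropping rows without a rating, filling a missing genre with an empty string) can be tried on a small frame. The sketch below uses a hypothetical three-row DataFrame, `toy`, rather than the real data, and assigns results back instead of using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the anime data.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", np.nan, "Drama"],
    "rating": [8.1, 7.4, np.nan],
})

# Drop rows that have no rating; .copy() gives an independent frame to modify.
toy_clean = toy.dropna(subset=["rating"]).copy()

# An empty string marks "no genre listed" and later encodes to an all-zero genre vector.
toy_clean["genre"] = toy_clean["genre"].fillna("")

print(toy_clean)
print(toy_clean.isnull().sum().sum())  # total number of missing values left
```

Dropping on `rating` removes only the rows where that column is empty, while the empty-string fill keeps anime that simply have no genre listed.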


### Task-2 Feature Extraction:
1. Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
2. Convert categorical features into numerical representations if necessary.
3. Normalize numerical features if required.



#### 1. Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

1. Genres
2. User Ratings
3. Number of Episodes
4. Broadcast Type

#### 2. Convert categorical features into numerical representations if necessary.

In [239]:
# Multi-hot encode the genres. Note: sep=',' keeps the space after each comma,
# so e.g. ' Adventure' and 'Adventure' end up as separate columns.
genres=df['genre'].str.get_dummies(sep=',')
# Convert episodes to numbers (non-numeric entries become NaN), then fill the gaps with the median.
df['episodes']=pd.to_numeric(df['episodes'],errors='coerce')
df['episodes']=df['episodes'].fillna(df['episodes'].median())


In [240]:
# Transformed Genre columns
genres.head()

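| | Adventure | Cars | Comedy | Dementia | Demons | Drama | Ecchi | Fantasy | Game | Harem | ... | Shoujo | Shounen | Slice of Life | Space | Sports | Super Power | Supernatural | Thriller | Vampire | Yaoi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |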
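
The genre strings use a comma followed by a space, so the choice of separator affects the column names. The sketch below uses a hypothetical three-element Series (not the dataset) to compare `sep=','` with `sep=', '`. It is offered as an illustration of the difference, not as a change to the results shown above.

```python
import pandas as pd

# Hypothetical genre strings in the same "A, B" style as the dataset.
genre = pd.Series(["Drama, Romance", "Action, Drama", ""])

# Splitting on "," alone keeps the space that follows each comma,
# so "Drama" and " Drama" become two different columns.
by_comma = genre.str.get_dummies(sep=",")
print(sorted(by_comma.columns))

# Splitting on ", " yields one column per genre.
by_comma_space = genre.str.get_dummies(sep=", ")
print(sorted(by_comma_space.columns))
print(by_comma_space)
```

With the second form, an anime that lists a genre first and one that lists it later share the same column, which is usually what a genre-overlap similarity needs.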


In [241]:
# One-Hot Encoding for Broadcast Type.
broadcast=pd.get_dummies(df['type'],prefix='type')

In [242]:
broadcast.head()

| | type_Movie | type_Music | type_ONA | type_OVA | type_Special | type_TV |
|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | False |
| 1 | False | False | False | False | False | True |
| 2 | False | False | False | False | False | True |
| 3 | False | False | False | False | False | True |
| 4 | False | False | False | False | False | True |


#### 3. Normalize numerical features if required.

In [244]:
from sklearn.preprocessing import MinMaxScaler

In [245]:
scaler=MinMaxScaler()
df[['rating','episodes']]=scaler.fit_transform(df[['rating','episodes']])

In [246]:
df[['rating', 'episodes']].head()

| | rating | episodes |
|---|---|---|
| 0 | 0.92437 | 0.0 |
| 1 | 0.911164 | 0.034673 |
| 2 | 0.909964 | 0.027518 |
| 3 | 0.90036 | 0.012658 |
| 4 | 0.89916 | 0.027518 |
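
With its default range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below applies that formula by hand to hypothetical episode counts. The function `min_max_scale` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1]; a constant column maps to all zeros."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical episode counts: one long-running series dominates the range.
episodes = [1, 12, 24, 1000]
scaled = min_max_scale(episodes)
print(scaled)
```

One very large value compresses the rest of the column toward zero, which may explain why most of the scaled `episodes` values above are small.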


In [247]:
# Combine One-Hot Encoded features with other numerical features
features=pd.concat([genres,broadcast,df[['rating','episodes']]],axis=1)

In [248]:
features.head(10)

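| | Adventure | Cars | Comedy | Dementia | Demons | Drama | Ecchi | Fantasy | Game | Harem | ... | Vampire | Yaoi | type_Movie | type_Music | type_ONA | type_OVA | type_Special | type_TV | rating | episodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | True | False | False | False | False | False | 0.92437 | 0.0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.911164 | 0.034673 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.909964 | 0.027518 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.90036 | 0.012658 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.89916 | 0.027518 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.897959 | 0.004953 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.895558 | 0.080903 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | True | False | False | 0.893157 | 0.059989 |
| 8 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | True | False | False | False | False | False | 0.891957 | 0.0 |
| 9 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | False | False | False | False | False | True | 0.893157 | 0.006604 |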


### Task-3 Recommendation System:
1. Design a function to recommend anime based on cosine similarity.
2. Given a target anime, recommend a list of similar anime based on cosine similarity scores.
3. Experiment with different threshold values for similarity scores to adjust the recommendation list size.
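
For step 3, one way to experiment with thresholds is to keep every anime whose similarity to the target is at least a chosen value, instead of a fixed top-N. The sketch below is a minimal illustration: the function `recommend_above_threshold`, the four names and the 4x4 similarity matrix are all hypothetical.

```python
import numpy as np

def recommend_above_threshold(sim_matrix, names, target_pos, threshold):
    """Return (name, score) pairs whose similarity to the target is >= threshold."""
    scores = np.asarray(sim_matrix[target_pos], dtype=float)
    order = np.argsort(-scores)  # row positions, best match first
    return [(names[i], float(scores[i]))
            for i in order
            if i != target_pos and scores[i] >= threshold]

# Hypothetical symmetric similarity matrix for four titles.
names = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.9, 0.6, 0.2],
    [0.9, 1.0, 0.5, 0.1],
    [0.6, 0.5, 1.0, 0.4],
    [0.2, 0.1, 0.4, 1.0],
])

# A higher threshold gives a shorter, stricter recommendation list.
for threshold in (0.1, 0.5, 0.8):
    print(threshold, recommend_above_threshold(sim, names, 0, threshold))
```

A threshold makes the list length depend on how many close matches exist, whereas `top_n` always returns the same number of items regardless of their scores.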

In [250]:
from sklearn.metrics.pairwise import cosine_similarity

In [251]:
cosine_sim=cosine_similarity(features)

In [252]:
# Function to get recommendations
def get_recommendations(title,data=df,sim_matrix=cosine_sim,top_n=10):
    # `data` must be the frame whose rows, in order, produced `sim_matrix`.
    title=title.strip().lower()
    # Note: this lower-cases the names in `data` in place, so results show lower-case names.
    data['name']=data['name'].str.strip().str.lower()
    matches=np.flatnonzero(data['name'].values==title)
    if len(matches)==0:
        print(f"Anime '{title}' not found in the dataset.")
        return pd.DataFrame()
    # Use the row position (not the index label) so it lines up with sim_matrix and .iloc.
    idx=matches[0]
    # Rank all rows by similarity to the target, most similar first.
    sim_scores=sorted(enumerate(sim_matrix[idx]),key=lambda x:x[1],reverse=True)
    # Leave out the target itself and keep the top_n most similar rows.
    sim_indices=[i for i,_ in sim_scores if i!=idx][:top_n]
    return data[['name','rating','genre']].iloc[sim_indices]

In [253]:
get_recommendations('Naruto')

| | name | rating | genre |
|---|---|---|---|
| 615 | naruto: shippuuden | 0.752701 | Action, Comedy, Martial Arts, Shounen, Super P... |
| 175 | katekyo hitman reborn! | 0.804322 | Action, Comedy, Shounen, Super Power |
| 206 | dragon ball z | 0.798319 | Action, Adventure, Comedy, Fantasy, Martial Ar... |
| 588 | dragon ball kai | 0.753902 | Action, Adventure, Comedy, Fantasy, Martial Ar... |
| 515 | dragon ball kai (2014) | 0.761104 | Action, Adventure, Comedy, Fantasy, Martial Ar... |
| 1209 | medaka box abnormal | 0.715486 | Action, Comedy, Ecchi, Martial Arts, School, S... |
| 1930 | dragon ball super | 0.687875 | Action, Adventure, Comedy, Fantasy, Martial Ar... |
| 2615 | medaka box | 0.665066 | Action, Comedy, Ecchi, Martial Arts, School, S... |
| 3038 | tenjou tenge | 0.651861 | Action, Comedy, Ecchi, Martial Arts, School, S... |
| 582 | bleach | 0.753902 | Action, Comedy, Shounen, Super Power, Supernat... |


### Task-4 Evaluation:
1. Split the dataset into training and testing sets.
2. Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
3. Analyze the performance of the recommendation system and identify areas of improvement.

In [255]:
from sklearn.model_selection import train_test_split

In [256]:
# Split the dataset.
train_data,test_data=train_test_split(df,test_size=0.2,random_state=42)

In [257]:
# Import the precision, recall and F1-score metrics.
from sklearn.metrics import precision_score,recall_score,f1_score

In [258]:
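# For each test anime, treat training anime with the identical genre string as relevant.
ytrue=[]
ypred=[]
for i,row in test_data.iterrows():
    true_anime=train_data[train_data['genre']==row['genre']]['name'].tolist()
    ytrue.append(true_anime)
    # Look the test title up in the training set and collect its top-10 recommendations.
    recommended=get_recommendations(row['name'],data=train_data,top_n=10)
    if recommended.empty:
        ypred.append([])
    else:
        ypred.append(recommended['name'].tolist())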

Anime 'blue dragon' not found in the dataset.
Anime 'sennin buraku' not found in the dataset.
Anime 'pokemon: pikachu no kirakira daisousaku!' not found in the dataset.
Anime 'monotonous purgatory' not found in the dataset.
Anime 'super bikkuriman' not found in the dataset.
Anime 'kiriya hakushaku ke no roku shimai' not found in the dataset.
Anime 'lo re: pako sukusuku mizuki-chan the animation' not found in the dataset.
Anime 'glass no hana to kowasu sekai' not found in the dataset.
Anime 'md geist ii: death force' not found in the dataset.
Anime 'cosplay complex' not found in the dataset.
Anime 'puchi puri yuushi' not found in the dataset.
Anime 'ring of gundam' not found in the dataset.
Anime 'double hard' not found in the dataset.
Anime 'zenryoku yobikou 5.5 seminar prologue' not found in the dataset.
Anime 'toilet no hanako-san' not found in the dataset.
Anime 'the everlasting guilty crown' not found in the dataset.
Anime 'soul worker: your destiny awaits' not found in the dataset

IndexError: positional indexers are out-of-bounds
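
The `IndexError` above is consistent with row labels and row positions being mixed. After `dropna` or `train_test_split`, a DataFrame keeps its original index labels, so a label can be larger than the number of rows. Positions taken from the full similarity matrix can also exceed the length of the smaller training frame. The "not found" messages are also expected to some extent, since a test title is normally absent from the training split. The sketch below shows the label/position difference on a hypothetical three-row frame, `toy`, with an identity matrix standing in for the similarity matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose labels have gaps, as after dropna() or a train/test split.
toy = pd.DataFrame({"name": ["a", "b", "c"]}, index=[0, 5, 9])
sim = np.eye(3)  # one row per item, in the frame's row order

label = toy.index[toy["name"] == "c"][0]  # 9: an index label, not a row number
pos = toy.index.get_loc(label)            # 2: the row position

try:
    sim[label]  # using the label as a position fails
except IndexError as exc:
    print("label used as position:", exc)

print(sim[pos])         # similarity row for "c"
print(toy.iloc[[pos]])  # .iloc expects positions as well
```

Keeping the similarity matrix and the lookup frame built from the same rows, and converting labels to positions before indexing, avoids this class of error.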

In [263]:
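# Turn each pair of lists into 0/1 labels, one label per recommended anime.
binary_ytrue=[]
binary_ypred=[]
for true_list,pred_list in zip(ytrue,ypred):
    true_set=set(true_list)
    pred_set=set(pred_list)
    # 1 if the recommended anime is in the relevant set, else 0.
    binary_ytrue.append([1 if anime in true_set else 0 for anime in pred_list])
    # Always 1 here, because every anime in pred_list is also in pred_set.
    binary_ypred.append([1 if anime in pred_set else 0 for anime in pred_list])
# Flatten the per-anime label lists into two flat lists.
binary_ytrue_flat=[item for sublist in binary_ytrue for item in sublist]
binary_ypred_flat=[item for sublist in binary_ypred for item in sublist]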

In [265]:
# Calculating precision,recall and F1-score.
precision=precision_score(binary_ytrue_flat,binary_ypred_flat,average='weighted')
recall=recall_score(binary_ytrue_flat,binary_ypred_flat,average='weighted')
f1=f1_score(binary_ytrue_flat,binary_ypred_flat,average='weighted')

In [267]:
print(f'Precision:{precision}')
print(f'Recall:{recall}')
print(f'F1 Score:{f1}')

Precision:nan
Recall:nan
F1 Score:nan
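
NaN scores can appear when there is nothing to score, for example if no recommendations were collected before the loop stopped. A common alternative to flattening all labels is to compute precision, recall and F1 per query over the top-k recommendations and then average them. The sketch below assumes a hypothetical helper, `precision_recall_f1_at_k`, and made-up item names. It is one possible formulation, not the metric used above.

```python
def precision_recall_f1_at_k(recommended, relevant, k=10):
    """Precision, recall and F1 for one ranked list against a set of relevant items."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical query: four recommendations, three relevant items, two of them found.
recommended = ["b", "c", "d", "e"]
relevant = {"b", "d", "x"}
print(precision_recall_f1_at_k(recommended, relevant, k=4))

# An empty recommendation list scores 0.0 instead of NaN.
print(precision_recall_f1_at_k([], relevant))
```

Returning 0.0 for empty inputs is a design choice that keeps the average defined. Skipping such queries instead would give a different, usually higher, average.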


### Interview Questions:
#### 1. Can you explain the difference between user-based and item-based collaborative filtering?

User-based collaborative filtering recommends items by identifying users with similar preferences or behaviors. It assumes that if two users liked similar items in the past, they will likely enjoy the same items in the future. This method focuses on finding a group of users who share tastes with the target user and suggests items they liked.

Item-based collaborative filtering focuses on recommending items that are similar to those a user has already interacted with. It works by identifying items that are commonly liked or rated together and suggesting those to the user.
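
The two approaches differ mainly in which axis of the rating matrix is compared. As a minimal sketch with a hypothetical 3x3 rating matrix (rows are users, columns are items, 0 means "not rated"), user-based filtering compares rows and item-based filtering compares columns. The helper `cosine_sim_rows` is illustrative only.

```python
import numpy as np

def cosine_sim_rows(m):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [0, 1, 5],
])

user_sim = cosine_sim_rows(ratings)    # user-based: compare rows
item_sim = cosine_sim_rows(ratings.T)  # item-based: compare columns

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

Here users 0 and 1 come out as near neighbors, and items 0 and 1 are rated together far more often than either is with item 2.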

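#### 2. What is collaborative filtering, and how does it work?

Collaborative filtering is a recommendation technique that predicts a user's preferences from the preferences and behaviors of other users. It operates on the idea that users who have agreed in the past will likely agree in the future. In practice, it builds a user-item interaction matrix (for example, ratings), measures similarity between users or between items, and uses the most similar neighbors to estimate how a user would rate items they have not yet seen.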
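
As a rough sketch of how such a prediction can be made, the function below estimates a missing rating as a similarity-weighted average of other users' ratings for the same item. The function name `predict_rating`, the tiny rating matrix, and the convention that 0 means "not rated" are all assumptions made for illustration.

```python
import numpy as np

def predict_rating(ratings, user, item):
    """Predict ratings[user, item] as a similarity-weighted mean of other users' ratings."""
    ratings = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(ratings, axis=1)
    weighted_sum = 0.0
    weight_total = 0.0
    for other in range(ratings.shape[0]):
        # Skip the target user and users who have not rated the item.
        if other == user or ratings[other, item] == 0:
            continue
        if norms[user] == 0 or norms[other] == 0:
            continue
        # Cosine similarity between the two users' rating vectors.
        sim = ratings[user] @ ratings[other] / (norms[user] * norms[other])
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else 0.0

# Hypothetical ratings: user 0 has not rated item 2.
ratings = [
    [5, 4, 0],
    [4, 5, 2],
    [1, 0, 5],
]
predicted = predict_rating(ratings, user=0, item=2)
print(round(predicted, 2))
```

The prediction lands closer to user 1's rating of 2 than to user 2's rating of 5, because user 1's past ratings are much more similar to those of user 0.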