# Recommendation System



Data Description:

Unique ID of each anime.
Anime title.
Anime broadcast type, such as TV, OVA, etc.
anime genre.
The number of episodes of each anime.
The average rating for each anime compared to the number of users who gave ratings.


Number of community members for each anime.
Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
Dataset:
Use the Anime Dataset which contains information about various anime, including their titles, genres,No.of episodes and user ratings etc.

Tasks:

Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.

Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

Recommendation System:

Design a function to recommend anime based on cosine similarity.
Given a target anime, recommend a list of similar anime based on cosine similarity scores.
Experiment with different threshold values for similarity scores to adjust the recommendation list size.

Evaluation:

Split the dataset into training and testing sets.
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
Analyze the performance of the recommendation system and identify areas of improvement.

Interview Questions:
1. Can you explain the difference between user-based and item-based collaborative filtering?
2. What is collaborative filtering, and how does it work?

## Data Preprocessing

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv('C:/Users/DELL/Desktop/DATAsets/anime.csv')
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [6]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


### Handle Missing Values

In [8]:
# Checking for missing values
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [9]:
df.isnull().any()
# there is no missing value

anime_id    False
name        False
genre        True
type         True
episodes    False
rating       True
members     False
dtype: bool

##### Explore Dataset

In [11]:
print(df.describe()) ## Descriptive statistics

           anime_id        rating       members
count  12294.000000  12064.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.026746  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.880000  2.250000e+02
50%    10260.500000      6.570000  1.550000e+03
75%    24794.500000      7.180000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06


In [12]:
# Checking unique values for categorical features
print(df['rating'].unique())

[ 9.37  9.26  9.25  9.17  9.16  9.15  9.13  9.11  9.1   9.06  9.05  9.04
  8.98  8.93  8.92  8.88  8.84  8.83  8.82  8.81  8.8   8.78  8.77  8.76
  8.75  8.74  8.73  8.72  8.71  8.69  8.68  8.67  8.66  8.65  8.64  8.62
  8.61  8.6   8.59  8.58  8.57  8.56  8.55  8.54  8.53  8.52  8.51  8.5
  8.49  8.48  8.47  8.46  8.45  8.44  8.43  8.42  8.41  8.4   8.39  8.38
  8.37  8.36  8.35  8.34  8.33  8.32  8.31  8.3   8.29  8.28  8.27  8.26
  8.25  8.24  8.23  8.22  8.21  8.2   8.19  8.18  8.17  8.16  8.15  8.14
  8.13  8.12  8.11  8.1   8.09  8.08  8.07  8.06  8.05  8.04  8.03  8.02
  8.01  8.    7.99  7.98  7.97  7.96  7.95  7.94  7.93  7.92  7.91  7.9
  7.89  7.88  7.87  7.86  7.85  7.84  7.83  7.82  7.81  7.8   7.79  7.78
  7.77  7.76  7.75  7.74  7.73  7.72  7.71  7.7   7.69  7.68  7.67  7.66
  7.65  7.64  7.63  7.62  7.61  7.6   7.59  7.58  7.57  7.56  7.55  7.54
  7.53  7.52  7.51  7.5   7.49  7.48  7.47  7.46  7.45  7.44  7.43  7.42
  7.41  7.4   7.39  7.38  7.37  7.36  7.35  7.34  7.3

In [13]:
print(df['genre'].unique())

['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen' ...
 'Hentai, Sports' 'Drama, Romance, School, Yuri' 'Hentai, Slice of Life']


### 2. Feature Extraction

In [15]:
####Feature Selection
#Decide on features for similarity:
#Genres (categorical)
#Average Rating (numerical)
#Number of Episodes (numerical)

#### Encoding Categorical Features

In [17]:
# One-hot encode genres
genre_encoded = df['genre'].str.get_dummies(sep=',')
anime_features = pd.concat([df[['rating', 'episodes']], genre_encoded], axis=1)
genre_encoded

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
anime_features

Unnamed: 0,rating,episodes,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,9.37,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9.26,64,1,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,9.25,51,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9.17,24,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9.16,51,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,4.15,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,4.28,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,4.88,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,4.98,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Normalize Numerical Features

In [20]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
anime_features[['rating', 'episodes']] = scaler.fit_transform(anime_features[['rating', 'episodes']])

ValueError: could not convert string to float: 'Unknown'

In [25]:
#### To overcome the above error
print(anime_features.dtypes)

rating          float64
episodes         object
 Adventure        int64
 Cars             int64
 Comedy           int64
                 ...   
Super Power       int64
Supernatural      int64
Thriller          int64
Vampire           int64
Yaoi              int64
Length: 84, dtype: object


In [27]:
#Before scaling or processing, encode all categorical columns
# Example for genres
genre_encoded = df['genre'].str.get_dummies(sep=',')
print(genre_encoded.head())

    Adventure   Cars   Comedy   Dementia   Demons   Drama   Ecchi   Fantasy  \
0           0      0        0          0        0       0       0         0   
1           1      0        0          0        0       1       0         1   
2           0      0        1          0        0       0       0         0   
3           0      0        0          0        0       0       0         0   
4           0      0        1          0        0       0       0         0   

    Game   Harem  ...  Shoujo  Shounen  Slice of Life  Space  Sports  \
0      0       0  ...       0        0              0      0       0   
1      0       0  ...       0        0              0      0       0   
2      0       0  ...       0        0              0      0       0   
3      0       0  ...       0        0              0      0       0   
4      0       0  ...       0        0              0      0       0   

   Super Power  Supernatural  Thriller  Vampire  Yaoi  
0            0             0        

In [29]:
df['genre'] = df['genre'].replace('Unknown', '')   #Handle Missing or Non-Numeric Values
genre_encoded = df['genre'].str.get_dummies(sep=',')
genre_encoded

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
df['genre']

0                     Drama, Romance, School, Supernatural
1        Action, Adventure, Drama, Fantasy, Magic, Mili...
2        Action, Comedy, Historical, Parody, Samurai, S...
3                                         Sci-Fi, Thriller
4        Action, Comedy, Historical, Parody, Samurai, S...
                               ...                        
12289                                               Hentai
12290                                               Hentai
12291                                               Hentai
12292                                               Hentai
12293                                               Hentai
Name: genre, Length: 12294, dtype: object

In [83]:
# Rebuild the Feature Matrix by Concatenating numerical and encodeding categorical columns:
anime_features = pd.concat([df[['rating', 'episodes']], genre_encoded], axis=1)
anime_features

Unnamed: 0,rating,episodes,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,...,Shoujo,Shounen,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi
0,9.37,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9.26,64,1,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,9.25,51,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9.17,24,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9.16,51,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,4.15,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,4.28,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,4.88,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,4.98,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3. Build the Recommendation System


In [59]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

mlb.fit(df)

In [65]:
def recommend_anime(anime_title, anime_data, similarity_matrix, threshold=0.5, top_n=10):
    try:
        anime_index = df[anime_data['title'] == anime_title].index[0]
    except IndexError:
        return "Anime not found in the dataset."

In [95]:
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler

# Process genres
df['episodes'] = df['episodes'].apply(lambda x: x.split(', '))
mlb = MultiLabelBinarizer()
genre_features = mlb.fit_transform(df['episodes'])

In [137]:
# Scale numerical features
scaler = MinMaxScaler()
numerical_features = mlb.fit_transform(df[['episodes', 'members']])


In [139]:
scaler

In [141]:
numerical_features

array([[0, 1, 1, 1, 0, 1, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 1, 1]])

In [179]:
import numpy as np

In [149]:
arr1 = np.random.rand(12294, 5)  # 12294 rows, 5 columns
arr2 = np.random.rand(2, 5)      # 2 rows, 5 columns

combined = np.concatenate([arr1, arr2], axis=0)  # Error

In [151]:
arr2_resized = np.random.rand(12294, 5)  # Resize to match rows
combined = np.concatenate([arr1, arr2_resized], axis=0)  # Works


In [161]:
# Example: Genre encoding with mismatched rows
genres = [['Action', 'Adventure'], ['Comedy']]  # 2 entries
mlb = MultiLabelBinarizer()
genre_features = mlb.fit_transform(genres)


In [165]:
# Original dataset has 12294 rows
original_data = np.random.rand(12294, 3)  # 3 features
original_data


array([[0.77728826, 0.8550824 , 0.60093601],
       [0.91974331, 0.1903117 , 0.72468192],
       [0.70316993, 0.76468928, 0.94017658],
       ...,
       [0.38292491, 0.71948825, 0.27241822],
       [0.71995174, 0.79480324, 0.17077062],
       [0.84566583, 0.7783272 , 0.26354256]])

In [169]:
# Resize genre_features to match rows
genre_features_resized = np.tile(genre_features, (original_data.shape[0] // genre_features.shape[0], 1))

combined_features = np.hstack([original_data, genre_features_resized])  # Works

In [171]:
combined_features

array([[0.77728826, 0.8550824 , 0.60093601, 1.        , 1.        ,
        0.        ],
       [0.91974331, 0.1903117 , 0.72468192, 0.        , 0.        ,
        1.        ],
       [0.70316993, 0.76468928, 0.94017658, 1.        , 1.        ,
        0.        ],
       ...,
       [0.38292491, 0.71948825, 0.27241822, 0.        , 0.        ,
        1.        ],
       [0.71995174, 0.79480324, 0.17077062, 1.        , 1.        ,
        0.        ],
       [0.84566583, 0.7783272 , 0.26354256, 0.        , 0.        ,
        1.        ]])

In [177]:
# Combine all features
import numpy as np
feature_matrix = np.hstack((genre_features, numerical_features))
feature_matrix

array([[1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1],
       [0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]])

#### 3: Build the Recommendation System

##### Calculate Cosine Similarity:

In [183]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(feature_matrix)

In [185]:
similarity_matrix

array([[1.        , 0.28867513],
       [0.28867513, 1.        ]])

##### Create a Recommendation Function:

In [250]:
def recommend_anime(anime_title, similarity_matrix, df, top_n=5):
    if anime_title not in df['anime_title'].values:
        return f"Anime '{anime_title}' not found in the dataset."
    

In [340]:
def recommend_anime(anime_title, anime_data, similarity_matrix, threshold=0.5, top_n=10):
    try:
        anime_index = df[anime_data['title'] == anime_genre].index[0]
    except IndexError:
        return "Anime not found in the dataset."

In [342]:
anime_index = 0  # Example initialization
print(anime_index)

0


In [344]:
 # Get similarity scores and sort them
similarity_scores = list(enumerate(similarity_matrix[anime_index]))
sorted_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

In [346]:
similarity_scores

[(0, 0.9999999999999999), (1, 0.2886751345948129)]

In [366]:
top_n = 5  # Defining before usage

In [384]:
df['anime_index'] = range(len(df))  # Creating an 'anime_index' column if it doesn't exist

In [403]:
  # Get top N recommendations (excluding the target anime itself)
def recommend_anime(sorted_scores, df, top_n):
    recommendations = []
    for i, score in sorted_scores[1:top_n + 1]:
        recommendations.append((df.iloc[i]['anime_index'], score))
    return recommendations

##### Test the Function:

In [469]:
def recommend_anime(target_anime, similarity_matrix, df, top_n=5):
    target_anime = 'Naruto'
    recommendations = recommend_anime(target_anime, similarity_matrix, df)
    print(f"Recommendations for '{target_anime}':")
    for anime, score in recommendations:
     print(f"{anime} (Similarity: {score:.2f})")


##### Evaluate the System

In [473]:
# Train-Test Split:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)

In [475]:
train

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,anime_index
3013,5342,Asura Cryin&#039;,"Action, Mecha, Supernatural",TV,[13],7.10,68608,3013
4253,9581,MM! Specials,"Comedy, Ecchi, School",Special,[9],6.77,21462,4253
9791,9810,Nyani ga Nyandaa Nyandaa Kamen,Comedy,TV,[83],6.75,169,9791
2629,1539,Touch: Cross Road - Kaze no Yukue,"Romance, Shounen, Sports",Special,[1],7.21,1513,2629
4608,4439,Kurenai Sanshirou,"Action, Martial Arts, Sports",TV,[26],6.68,603,4608
...,...,...,...,...,...,...,...,...
11964,4638,Milkyway,"Hentai, Romance",OVA,[2],5.82,695,11964
5191,5272,Tondemo Nezumi Daikatsuyaku,Adventure,Movie,[1],6.53,252,5191
5390,1262,Macross II: Lovers Again,"Adventure, Mecha, Military, Sci-Fi, Shounen, S...",OVA,[6],6.47,6760,5390
860,22819,Aikatsu! Movie,"Music, School, Shoujo, Slice of Life",Movie,[1],7.79,2813,860


In [477]:
test

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,anime_index
6329,17209,Suzy&#039;s Zoo: Daisuki! Witzy - Happy Birthday,Kids,Special,[1],6.17,158,6329
2167,173,Tactics,"Comedy, Drama, Fantasy, Mystery, Shounen, Supe...",TV,[25],7.34,27358,2167
2882,3616,Kamen no Maid Guy,"Action, Comedy, Ecchi, Super Power",TV,[12],7.14,27761,2882
4700,18799,Take Your Way,"Action, Music, Seinen, Supernatural",Music,[1],6.66,1387,4700
7258,18831,Rinkaku,"Dementia, Horror, Music",Music,[1],5.60,606,7258
...,...,...,...,...,...,...,...,...
9652,26305,Naita Aka Oni (OVA),"Demons, Drama, Kids",OVA,[1],5.75,48,9652
549,30311,Kuroko no Basket 3rd Season NG-shuu,"Comedy, Sports",Special,[9],7.98,10778,549
7550,30327,New York Trip,Dementia,Movie,[1],5.31,131,7550
5210,13137,Itsuka Tenma no Kuro Usagi Special,"Comedy, Ecchi",Special,[1],6.52,3251,5210


### Interview Questions

##### 1. What is Collaborative Filtering?

In [483]:
#Collaborative filtering is a technique used in recommendation systems to suggest items to users based on the preferences of similar users (user-based) or the similarity of items themselves (item-based).

##### 2. Difference Between User-Based and Item-Based Collaborative Filtering:

In [486]:
#User-Based: Recommends items by finding users with similar preferences.
#Item-Based: Recommends items based on the similarity between items.
