## RECOMMENDATION SYSTEM

**Objective:**
  
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 

**Dataset:**

Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

### Tasks:

#### Data Preprocessing:

1. Load the dataset into a suitable data structure (e.g., pandas DataFrame).
2. Handle missing values, if any.
3. Explore the dataset to understand its structure and attributes.


#### Feature Extraction:

1. Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
2. Convert categorical features into numerical representations if necessary.
3. Normalize numerical features if required.

#### Recommendation System:

1. Design a function to recommend anime based on cosine similarity.
2. Given a target anime, recommend a list of similar anime based on cosine similarity scores.
3. Experiment with different threshold values for similarity scores to adjust the recommendation list size.

#### Evaluation:

1. Split the dataset into training and testing sets.
2. Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
3. Analyze the performance of the recommendation system and identify areas of improvement.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np

In [2]:
# Load Dataset
data = pd.read_csv("anime.csv")
data.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


In [3]:
data.shape

(12294, 7)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


The 'episodes' column is stored as object (strings), so it should be converted to a numerical type.

In [5]:
# Check for duplicates
data.duplicated().sum()

0

In [6]:
# Check for missing values
data.isna().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [7]:
# Imputing missing values
data['genre'] = data['genre'].fillna( "Unknown")
data['type'] = data['type'].fillna("Unknown")
data['rating'] = data['rating'].fillna(0)

# Check for missing values
data.isna().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [8]:
# Looking into the values in 'episodes'
data['episodes'].unique()

array(['1', '64', '51', '24', '10', '148', '110', '13', '201', '25', '22',
       '75', '4', '26', '12', '27', '43', '74', '37', '2', '11', '99',
       'Unknown', '39', '101', '47', '50', '62', '33', '112', '23', '3',
       '94', '6', '8', '14', '7', '40', '15', '203', '77', '291', '120',
       '102', '96', '38', '79', '175', '103', '70', '153', '45', '5',
       '21', '63', '52', '28', '145', '36', '69', '60', '178', '114',
       '35', '61', '34', '109', '20', '9', '49', '366', '97', '48', '78',
       '358', '155', '104', '113', '54', '167', '161', '42', '142', '31',
       '373', '220', '46', '195', '17', '1787', '73', '147', '127', '16',
       '19', '98', '150', '76', '53', '124', '29', '115', '224', '44',
       '58', '93', '154', '92', '67', '172', '86', '30', '276', '59',
       '72', '330', '41', '105', '128', '137', '56', '55', '65', '243',
       '193', '18', '191', '180', '91', '192', '66', '182', '32', '164',
       '100', '296', '694', '95', '68', '117', '151', '130',

The 'episodes' column contains the placeholder value 'Unknown'. It should be treated as missing and imputed.

In [9]:
# Replace the 'Unknown' placeholder with NaN so it can be imputed later
data['episodes'] = data['episodes'].replace('Unknown', np.nan)

# Changing the column to numerical
data['episodes'] = data['episodes'].astype(float)
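
The replace-then-cast step above can also be written with `pd.to_numeric`, which coerces any token it cannot parse to NaN. The following is only a minimal sketch on a toy Series; the values and the names `episodes` and `episodes_num` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for data['episodes']; values are illustrative only
episodes = pd.Series(['1', '64', 'Unknown', '24'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
episodes_num = pd.to_numeric(episodes, errors='coerce')

# Count of values that could not be parsed
print(episodes_num.isna().sum())
```

This form does not need the placeholder string to be known in advance, which may help if a column contains more than one kind of non-numeric token.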

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  11954 non-null  float64
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 672.5+ KB


In [11]:
data.isnull().sum()

anime_id      0
name          0
genre         0
type          0
episodes    340
rating        0
members       0
dtype: int64

In [12]:
# Imputing Missing values in episodes
# Group by 'type' and compute median episode count for each category
type_median_episodes = data.groupby('type')['episodes'].median()
# Fill missing episode counts with the median based on type
data['episodes'] = data.apply(lambda row: type_median_episodes[row['type']] if pd.isna(row['episodes']) else row['episodes'], axis=1)
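
The row-wise `apply` above can be expressed without a Python-level loop by broadcasting each group's median back to the rows with `groupby(...).transform`. This is a sketch on a made-up frame (the frame `df` and its values are assumptions for illustration), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame; 'type' and 'episodes' mirror the column names used above
df = pd.DataFrame({
    'type': ['TV', 'TV', 'TV', 'Movie', 'Movie', 'Unknown'],
    'episodes': [12.0, 24.0, np.nan, 1.0, np.nan, np.nan],
})

# Median of each group, broadcast back to the original row order
group_median = df.groupby('type')['episodes'].transform('median')

# Fill only the missing entries; a group with no known value stays NaN
df['episodes'] = df['episodes'].fillna(group_median)
print(df)
```

In the toy frame the 'Unknown' group has no observed episode count, so its median is undefined and the row stays empty. That is one plausible reason why some rows can remain missing after a per-type imputation.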

In [13]:
data.isnull().sum()

anime_id     0
name         0
genre        0
type         0
episodes    25
rating       0
members      0
dtype: int64

In [14]:
#Imputing the remaining null values
data['episodes'] = data['episodes'].fillna(data['episodes'].median())
data.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [15]:
anime_df = data.copy()

In [16]:
# Convert genres into multiple binary columns
from sklearn.preprocessing import MultiLabelBinarizer

# Split genre strings into lists
data['genre'] = data['genre'].apply(lambda x: x.split(', ') if isinstance(x, str) else [])

# Use MultiLabelBinarizer to create genre binary columns
mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(mlb.fit_transform(data['genre']), columns=mlb.classes_)

# Concatenate with original dataframe
data = pd.concat([data, genre_encoded], axis=1)
data = data.drop(columns=['genre'])  # Drop original genre column
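
For comparison, pandas can build the same kind of indicator columns directly from comma-separated strings with `Series.str.get_dummies`. The sketch below uses made-up genre strings; the names `genres` and `genre_flags` are illustrative.

```python
import pandas as pd

# Made-up genre strings in the same "A, B, C" layout as the dataset
genres = pd.Series([
    'Drama, Romance',
    'Action, Drama',
    'Unknown',
])

# One indicator column per distinct label, splitting on ', '
genre_flags = genres.str.get_dummies(sep=', ')
print(genre_flags)
```

The result keeps the index of the input Series, so it lines up with the original rows when concatenated column-wise.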

In [17]:
data.head(10)

Unnamed: 0,anime_id,name,type,episodes,rating,members,Action,Adventure,Cars,Comedy,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,Movie,1.0,9.37,200630,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,TV,64.0,9.26,793665,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,TV,51.0,9.25,114262,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,TV,24.0,9.17,673572,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,9969,Gintama&#039;,TV,51.0,9.16,151266,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,TV,10.0,9.15,93351,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
6,11061,Hunter x Hunter (2011),TV,148.0,9.13,425855,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0
7,820,Ginga Eiyuu Densetsu,OVA,110.0,9.11,80679,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,Movie,1.0,9.1,72534,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,15417,Gintama&#039;: Enchousen,TV,13.0,9.11,81109,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Encode 'type' into numeric values
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['type_encoded'] = le.fit_transform(data['type'])
data = data.drop(columns=['type'])  # Drop original type column
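
Label encoding assigns integer codes in sorted order, so the codes imply an ordering and a magnitude that the categories themselves do not have. One common alternative is one-hot encoding. The sketch below contrasts the two on toy values; it is an illustration under that assumption, not the approach used in this notebook.

```python
import pandas as pd

# Toy 'type' values; illustrative only
types = pd.Series(['Movie', 'TV', 'OVA', 'TV'], name='type')

# Integer codes in alphabetical category order (what a label encoder yields)
codes = types.astype('category').cat.codes

# One indicator column per category: no category is "larger" than another
type_flags = pd.get_dummies(types, prefix='type', dtype=int)

print(codes.tolist())
print(type_flags)
```

With indicator columns, two titles of different types simply share no type flag, whereas integer codes make a large code contribute more to a dot product than a small one.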

In [19]:
data.head(10)

Unnamed: 0,anime_id,name,episodes,rating,members,Action,Adventure,Cars,Comedy,Dementia,...,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri,type_encoded
0,32281,Kimi no Na wa.,1.0,9.37,200630,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,64.0,9.26,793665,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,5
2,28977,Gintama°,51.0,9.25,114262,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,5
3,9253,Steins;Gate,24.0,9.17,673572,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,5
4,9969,Gintama&#039;,51.0,9.16,151266,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,5
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,10.0,9.15,93351,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,5
6,11061,Hunter x Hunter (2011),148.0,9.13,425855,1,1,0,0,0,...,0,0,1,0,0,0,0,0,0,5
7,820,Ginga Eiyuu Densetsu,110.0,9.11,80679,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,3
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,1.0,9.1,72534,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9,15417,Gintama&#039;: Enchousen,13.0,9.11,81109,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,5


In [20]:
# Normalizing 'Ratings', 'Episodes' & 'Members'
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Normalize numerical columns
data[['rating', 'members', 'episodes']] = scaler.fit_transform(data[['rating', 'members', 'episodes']])
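
Min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value in a column becomes 0 and the largest becomes 1. A small worked example with illustrative numbers (not dataset values):

```python
import numpy as np

# Illustrative ratings; the 0.0 stands in for a rating that was filled with 0
ratings = np.array([0.0, 6.5, 9.37, 10.0])

# Min-max scaling: shift by the minimum, divide by the range
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())
print(scaled)
```

Note that a missing rating filled with 0 becomes the column minimum, so after scaling it is indistinguishable from a genuinely lowest-rated title.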

### Recommendation System

In [21]:
# Step 1: Feature Selection
features = ['rating', 'type_encoded']  # Numerical features
features.extend(mlb.classes_)  # Add all genre binary columns
features

['rating',
 'type_encoded',
 'Action',
 'Adventure',
 'Cars',
 'Comedy',
 'Dementia',
 'Demons',
 'Drama',
 'Ecchi',
 'Fantasy',
 'Game',
 'Harem',
 'Hentai',
 'Historical',
 'Horror',
 'Josei',
 'Kids',
 'Magic',
 'Martial Arts',
 'Mecha',
 'Military',
 'Music',
 'Mystery',
 'Parody',
 'Police',
 'Psychological',
 'Romance',
 'Samurai',
 'School',
 'Sci-Fi',
 'Seinen',
 'Shoujo',
 'Shoujo Ai',
 'Shounen',
 'Shounen Ai',
 'Slice of Life',
 'Space',
 'Sports',
 'Super Power',
 'Supernatural',
 'Thriller',
 'Unknown',
 'Vampire',
 'Yaoi',
 'Yuri']

In [22]:
# Step 2: Compute Cosine Similarity Matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(data[features])  # Compute similarity matrix

# Convert similarity matrix into a DataFrame
similarity_df = pd.DataFrame(cosine_sim, index = data['name'], columns = data['name'])
similarity_df

name,Kimi no Na wa.,Fullmetal Alchemist: Brotherhood,Gintama°,Steins;Gate,Gintama&#039;,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou,Hunter x Hunter (2011),Ginga Eiyuu Densetsu,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare,Gintama&#039;: Enchousen,...,Super Erotic Anime,Taimanin Asagi 3,Teleclub no Himitsu,Tenshi no Habataki Jun,The Satisfaction,Toushindai My Lover: Minami tai Mecha-Minami,Under World,Violence Gekiga David no Hoshi,Violence Gekiga Shin David no Hoshi: Inma Densetsu,Yasuji no Pornorama: Yacchimae!!
Kimi no Na wa.,1.000000,0.147524,0.068463,0.073731,0.067814,0.232974,0.070915,0.225678,0.137985,0.067453,...,0.059118,0.130704,0.061980,0.057554,0.058076,0.055203,0.056901,0.064704,0.065998,0.203309
Fullmetal Alchemist: Brotherhood,0.147524,1.000000,0.847823,0.854648,0.847783,0.874839,0.921313,0.837057,0.177247,0.847760,...,0.841949,0.755411,0.842234,0.841777,0.841836,0.841497,0.841702,0.842469,0.842568,0.077416
Gintama°,0.068463,0.847823,1.000000,0.887706,0.999999,0.874835,0.889370,0.790126,0.488967,0.999997,...,0.841949,0.755433,0.842232,0.841777,0.841836,0.841498,0.841702,0.842466,0.842565,0.077334
Steins;Gate,0.073731,0.854648,0.887706,1.000000,0.887657,0.881856,0.896503,0.857969,0.124263,0.887629,...,0.914426,0.820652,0.914724,0.914245,0.914307,0.913950,0.914166,0.914968,0.915071,0.083284
Gintama&#039;,0.067814,0.847783,0.999999,0.887657,1.000000,0.874797,0.889332,0.789941,0.488580,1.000000,...,0.841942,0.755623,0.842215,0.841776,0.841833,0.841506,0.841704,0.842439,0.842533,0.076601
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Toushindai My Lover: Minami tai Mecha-Minami,0.055203,0.841497,0.841498,0.913950,0.841506,0.868366,0.882805,0.790670,0.042321,0.841509,...,0.999957,0.905110,0.999870,0.999984,0.999977,1.000000,0.999992,0.999744,0.999670,0.337547
Under World,0.056901,0.841702,0.841702,0.914166,0.841704,0.868570,0.883010,0.791243,0.043623,0.841704,...,0.999986,0.904623,0.999927,0.999999,0.999996,0.999992,1.000000,0.999827,0.999765,0.339318
Violence Gekiga David no Hoshi,0.064704,0.842469,0.842466,0.914968,0.842439,0.869325,0.883771,0.793710,0.049605,0.842423,...,0.999911,0.902192,0.999979,0.999855,0.999875,0.999744,0.999827,1.000000,0.999995,0.347392
Violence Gekiga Shin David no Hoshi: Inma Densetsu,0.065998,0.842568,0.842565,0.915071,0.842533,0.869422,0.883868,0.794093,0.050597,0.842514,...,0.999866,0.901757,0.999954,0.999798,0.999822,0.999670,0.999765,0.999995,1.000000,0.348722
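
Cosine similarity is the dot product of two vectors divided by the product of their norms. Because it works on raw magnitudes, a single feature on a larger scale can dominate the score. The sketch below uses hypothetical four-element vectors laid out as `[rating, type_code, genre_a, genre_b]` to show the effect; the numbers are invented for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical feature vectors: [rating, type_code, genre_a, genre_b]
a = np.array([0.9, 5.0, 1.0, 0.0])
b = np.array([0.5, 5.0, 0.0, 1.0])   # same type code, no shared genre

# Similarity with the unscaled type code included
with_type = cosine(a, b)

# Similarity with the type code dropped (rating and genres only)
without_type = cosine(a[[0, 2, 3]], b[[0, 2, 3]])

print(round(with_type, 3), round(without_type, 3))
```

In this toy case the two titles share no genre, yet the score is high when the unscaled type code is included. This may explain why, in the matrix above, titles with quite different genres still show similarities above 0.8.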


**Implementing the Recommendation Function**

We create a function that:

1. Takes an anime title as input.
2. Finds similar anime using cosine similarity.
3. Returns top N recommendations.
   
For now, we set the similarity threshold to 0.5 and return the top 10 most similar titles.

In [23]:
def recommend_anime(anime_name, top_n=10, similarity_threshold=0.5):
    # Ensure similarity_df has unique index and columns
    global similarity_df, anime_df
    similarity_df = similarity_df[~similarity_df.index.duplicated(keep='first')]
    similarity_df = similarity_df.loc[:, ~similarity_df.columns.duplicated(keep='first')]

    # Ensure anime_df['name'] is unique
    anime_df = anime_df.drop_duplicates(subset='name', keep='first')

    if anime_name not in similarity_df.index:
        return f"Anime '{anime_name}' not found in dataset."

    # Get similarity scores
    similar_scores = similarity_df[anime_name].sort_values(ascending=False)

    # Filter by threshold and get top N (excluding the anime itself)
    recommended_anime = similar_scores[similar_scores > similarity_threshold].iloc[1:top_n+1]
    sim_anime = recommended_anime.index.tolist()

    # Filter anime_df and attach similarity score
    filtered_df = anime_df[anime_df['name'].isin(sim_anime)].copy()
    filtered_df['similarity_score'] = filtered_df['name'].map(similar_scores)

    # Sort by score
    filtered_df = filtered_df.sort_values(by='similarity_score', ascending=False)

    return filtered_df

In [24]:
# Example 
anime = recommend_anime('Ginga Eiyuu Densetsu', top_n = 10, similarity_threshold = 0.5)
anime

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,similarity_score
413,3665,Ginga Eiyuu Densetsu Gaiden: Rasen Meikyuu,"Drama, Military, Sci-Fi, Space",OVA,28.0,8.1,7712,0.999649
3037,342,Starship Operators,"Drama, Military, Sci-Fi, Space",TV,13.0,7.1,9704,0.972615
4677,11307,Ginga Patrol PJ,"Drama, Military, Sci-Fi, Space",TV,26.0,6.66,316,0.971628
92,12029,Uchuu Senkan Yamato 2199,"Action, Drama, Military, Sci-Fi, Space",OVA,26.0,8.53,44223,0.965342
4141,23931,Uchuu Senkan Yamato 2199: Tsuioku no Koukai,"Action, Drama, Military, Sci-Fi, Space",Special,1.0,6.8,950,0.964647
1426,1241,Mobile Suit Gundam Seed Destiny Final Plus: Th...,"Drama, Mecha, Military, Sci-Fi, Space",OVA,1.0,7.55,16102,0.964259
5136,1677,Cosmo Warrior Zero Gaiden,"Adventure, Drama, Military, Sci-Fi, Space",Special,2.0,6.54,1215,0.964052
3370,3854,Ginga Tetsudou Monogatari: Wasurerareta Toki n...,"Drama, Sci-Fi, Space",OVA,4.0,7.01,1236,0.961577
5062,1495,Maetel Legend,"Drama, Sci-Fi, Space",OVA,2.0,6.56,1974,0.960809
448,2158,Terra e... (TV),"Action, Drama, Military, Sci-Fi, Space",TV,24.0,8.07,36941,0.958533


In [25]:
anime = recommend_anime("Kimi no Na wa.", top_n=10, similarity_threshold=0.5)
anime

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,similarity_score
1111,14669,Aura: Maryuuin Kouga Saigo no Tatakai,"Comedy, Drama, Romance, School, Supernatural",Movie,1.0,7.67,22599,0.903777
208,28725,Kokoro ga Sakebitagatterunda.,"Drama, Romance, School",Movie,1.0,8.32,59652,0.890595
1494,20903,Harmonie,"Drama, School, Supernatural",Movie,1.0,7.52,29029,0.888309
1959,713,Air Movie,"Drama, Romance, Supernatural",Movie,1.0,7.39,44179,0.887805
60,10408,Hotarubi no Mori e,"Drama, Romance, Shoujo, Supernatural",Movie,1.0,8.61,197439,0.791564
1199,6408,&quot;Bungaku Shoujo&quot; Movie,"Drama, Mystery, Romance, School",Movie,1.0,7.63,40984,0.78577
2103,1723,Clannad Movie,"Drama, Fantasy, Romance, School",Movie,1.0,7.35,99506,0.783817
5796,30585,Taifuu no Noruda,"Drama, School, Sci-Fi, Supernatural",Movie,1.0,6.35,14281,0.775699
894,10389,Momo e no Tegami,"Drama, Supernatural",Movie,1.0,7.78,30519,0.765516
1697,31245,Zutto Mae kara Suki deshita.: Kokuhaku Jikkou ...,"Romance, School",Movie,1.0,7.47,35058,0.764334


Because the catalogue is extensive, the anime most similar to the input title tend to have high similarity scores.

In both examples, every recommendation has a cosine similarity above 0.75, so 0.75 would be a reasonable threshold to use.
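
The threshold and `top_n` interact: the threshold only changes the list when fewer than `top_n` candidates exceed it. The sketch below illustrates this with hypothetical scores and a small helper named `select` (both are assumptions introduced for illustration).

```python
import numpy as np

# Hypothetical similarity scores of candidates against one target title
scores = np.array([0.90, 0.89, 0.79, 0.76, 0.40, 0.15])

def select(scores, threshold, top_n):
    """Keep scores above the threshold, highest first, at most top_n of them."""
    kept = np.sort(scores[scores > threshold])[::-1]
    return kept[:top_n]

# List size for several thresholds with a fixed top_n
sizes = {t: len(select(scores, t, top_n=3)) for t in (0.2, 0.5, 0.85, 0.95)}
print(sizes)
```

Low thresholds return the same three items here because the cap, not the threshold, is the binding constraint. The list only shrinks once the threshold rises above the third-best score.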

In [26]:
# Checking for different thresholds
def recommend_by_threshold(anime_name, threshold, similarity_df, anime_df, top_n=10):
    # Remove duplicates from index/columns
    similarity_df = similarity_df.loc[~similarity_df.index.duplicated(), ~similarity_df.columns.duplicated()]
    anime_df = anime_df.drop_duplicates(subset='name')

    # Check if anime exists
    if anime_name not in similarity_df.index:
        return f"Anime '{anime_name}' not found in dataset."

    # Get similarity scores for the anime
    similar_scores = similarity_df[anime_name].sort_values(ascending=False)

    # Remove the anime itself
    filtered_scores = similar_scores[similar_scores.index != anime_name]

    # Filter by threshold
    filtered_scores = filtered_scores[filtered_scores > threshold]

    # Limit to at most top_n items
    filtered_scores = filtered_scores[:top_n]

    # Get matching rows from anime_df
    sim_anime = filtered_scores.index.tolist()
    filtered_df = anime_df[anime_df['name'].isin(sim_anime)].copy()
    filtered_df['similarity_score'] = filtered_df['name'].map(filtered_scores)
    filtered_df = filtered_df.sort_values(by='similarity_score', ascending=False)

    return filtered_df

In [27]:
# Evaluation
thresholds = [0.2, 0.5, 0.9]
results = {}

for t in thresholds:
    print(f"\n=== Recommendations for Threshold {t} ===")
    recommendations = recommend_by_threshold("Kimi no Na wa.", t, similarity_df, anime_df, top_n=10)

    if isinstance(recommendations, str):
        print(recommendations)
    elif recommendations.empty:
        print("No recommendations found above this threshold.")
    else:
        print(recommendations[['name', 'similarity_score']])
        print(f"Number of Recommendations: {len(recommendations)}")

    results[t] = recommendations


=== Recommendations for Threshold 0.2 ===
                                                   name  similarity_score
1111              Aura: Maryuuin Kouga Saigo no Tatakai          0.903777
208                       Kokoro ga Sakebitagatterunda.          0.890595
1494                                           Harmonie          0.888309
1959                                          Air Movie          0.887805
60                                   Hotarubi no Mori e          0.791564
1199                   &quot;Bungaku Shoujo&quot; Movie          0.785770
2103                                      Clannad Movie          0.783817
5796                                   Taifuu no Noruda          0.775699
894                                    Momo e no Tegami          0.765516
1697  Zutto Mae kara Suki deshita.: Kokuhaku Jikkou ...          0.764334
Number of Recommendations: 10

=== Recommendations for Threshold 0.5 ===
                                                   name  similarity_sc

#### EVALUATION

In [28]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(anime_df, test_size = 0.2, random_state = 42)

# Prepare features again for cosine similarity
train_features = data.loc[train_data.index, features]
test_features = data.loc[test_data.index, features]

In [29]:
# Initialize lists for true and predicted labels
y_true = []
y_pred = []

# Evaluate each test item
for idx, test_item in test_data.iterrows():
    test_vector = test_features.loc[idx].values.reshape(1,-1)
    
    # Compute similarity with training data
    similarity_scores = cosine_similarity(test_vector, train_features)[0]
    top_index = similarity_scores.argsort()[::-1][:5]
    
    # Get top 5 recommended names
    recommended_names = train_data.iloc[top_index]['name'].values.tolist()
    
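    # Determine true relevant names based on shared genres
    # NOTE: anime_df was copied before 'genre' was split into lists, so
    # 'genre' is still a comma-separated string here and set() compares
    # characters rather than genre labels; split on ', ' to compare labels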
    test_genres = set(test_item['genre'])
    relevant = train_data[train_data['genre'].apply(lambda g: bool(test_genres & set(g)))]
    relevant_names = set(relevant['name'])
    
    # Evaluate recommendations
    y_true.extend([1 if name in relevant_names else 0 for name in recommended_names])
    y_pred.extend([1] * len(recommended_names))  # All are predicted as relevant

In [30]:
# Compute evaluation metrics
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

Precision: 0.9998
Recall:    1.0000
F1 Score:  0.9999
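
These scores should be read with care. Every recommendation is predicted as relevant (`y_pred` is all ones), so there are no false negatives and recall is 1.0 by construction. In addition, `anime_df` was copied before the genre column was split, so `set(test_item['genre'])` appears to operate on a raw string and compares characters rather than genre labels. The sketch below shows the difference and a label-level precision@k on made-up data; `genre_set` and `precision_at_k` are hypothetical helpers, not part of the notebook.

```python
def genre_set(genre_string):
    """Split a comma-separated genre string into a set of labels."""
    return {g.strip() for g in genre_string.split(',') if g.strip()}

def precision_at_k(target_genres, recommended_genres, k):
    """Fraction of the top-k recommendations sharing at least one genre label."""
    top = recommended_genres[:k]
    target = genre_set(target_genres)
    hits = sum(1 for g in top if target & genre_set(g))
    return hits / len(top) if top else 0.0

# Made-up target and recommendation genres
target = 'Drama, Romance'
recommended = ['Drama, School', 'Action, Mecha', 'Romance']

# Character-level comparison: almost any two genre strings share a letter
char_overlap = bool(set(target) & set('Action, Mecha'))

# Label-level comparison: these two titles have no genre in common
label_overlap = bool(genre_set(target) & genre_set('Action, Mecha'))

p_at_3 = precision_at_k(target, recommended, k=3)
print(char_overlap, label_overlap, round(p_at_3, 3))
```

A stricter relevance definition, such as requiring a minimum number of shared genres, would likely give a more informative precision figure. Recall would additionally need the count of relevant items that were not recommended.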


### Interview Questions:

**1. Can you explain the difference between user-based and item-based collaborative filtering?**

🔹 User-Based Collaborative Filtering focuses on finding users who are similar to the target user. The idea is: *'If users A and B have shown similar preferences in the past, then the items liked by user B can be recommended to user A.'*

For example, if another user and I have both rated similar anime highly, and that user also liked a show that I haven't seen, the system might recommend that show to me.

🔹 Item-Based Collaborative Filtering, on the other hand, looks at the similarity between items — regardless of the user. It asks: *'What are the items similar to the ones this user already liked?'*
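
The two approaches differ only in which axis of the user-item matrix is compared. The sketch below uses a hypothetical rating matrix (rows are users, columns are items, 0 meaning "not rated", which is a simplification) and a helper named `cosine_matrix` introduced for illustration.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
], dtype=float)

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

# User-based: compare rows (users) with each other
user_sim = cosine_matrix(ratings)

# Item-based: compare columns (items) with each other
item_sim = cosine_matrix(ratings.T)

print(user_sim.shape, item_sim.shape)
```

In this toy matrix, users 0 and 1 come out as close neighbours, and items 0 and 1 are more similar to each other than either is to item 2.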


---------------------------------------------------------------------------------

**2. What is collaborative filtering, and how does it work?**

Collaborative filtering is a technique used in recommendation systems that predicts a user's interests from past interactions, using either similar users or similar items.

It works by finding similarities between users or items based on their behavior, such as ratings or purchases, and then recommending items that similar users have liked or items similar to those the user has interacted with.

It does not require explicit knowledge about item content or user attributes, and it is widely used in applications such as e-commerce and streaming services to improve user experience and engagement.
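
As a rough illustration of how such a prediction can be made, the sketch below estimates one missing rating with item-based collaborative filtering: a similarity-weighted average of the ratings the user has already given. The matrix, the choice of cosine similarity, and the convention that 0 means "not rated" are all assumptions made for this example.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 1],
    [1, 0, 5, 4],
    [1, 1, 4, 5],
], dtype=float)

user, target_item = 0, 2   # predict user 0's rating for item 2

# Cosine similarity between item columns
items = ratings.T
unit = items / np.linalg.norm(items, axis=1, keepdims=True)
item_sim = unit @ unit.T

# Items this user has already rated
rated = np.flatnonzero(ratings[user] > 0)

# Weighted average of the user's ratings, weighted by similarity to the target
weights = item_sim[target_item, rated]
predicted = float(np.dot(weights, ratings[user, rated]) / weights.sum())

print(round(predicted, 2))
```

The estimate is pulled towards the rating the user gave to the item most similar to the target. Production systems usually add refinements such as mean-centering and neighbourhood size limits, which are omitted here.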