Recommendation System

Data Description:

Unique ID of each anime.
Anime title.
Anime broadcast type, such as TV, OVA, etc.
anime genre.
The number of episodes of each anime.
The average rating for each anime compared to the number of users who gave ratings.

In [1]:
import pandas as pd
import numpy as np


Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.

In [2]:
df = pd.read_csv('anime.csv')
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


In [4]:
df.describe

<bound method NDFrame.describe of        anime_id                                               name  \
0         32281                                     Kimi no Na wa.   
1          5114                   Fullmetal Alchemist: Brotherhood   
2         28977                                           Gintama°   
3          9253                                        Steins;Gate   
4          9969                                      Gintama&#039;   
...         ...                                                ...   
12289      9316       Toushindai My Lover: Minami tai Mecha-Minami   
12290      5543                                        Under World   
12291      5621                     Violence Gekiga David no Hoshi   
12292      6133  Violence Gekiga Shin David no Hoshi: Inma Dens...   
12293     26081                   Yasuji no Pornorama: Yacchimae!!   

                                                   genre   type episodes  \
0                   Drama, Romance, School, Super

In [5]:
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [6]:
# Drop rows missing essential info
df.dropna(subset=['anime_id', 'name'], inplace=True)

# Fill missing ratings with mean
df['rating'].fillna(df['rating'].mean(), inplace=True)

# Fill missing genres with empty string
df['genre'] = df['genre'].fillna('')

#FIll missing type with empty string
df['type'] = df['type'].fillna('')

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['rating'].fillna(df['rating'].mean(), inplace=True)


In [7]:
df.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Step 2: Feature Extraction

In [8]:
from sklearn.preprocessing import MultiLabelBinarizer

# Convert genre string into list
df['genre'] = df['genre'].fillna('Unknown').apply(lambda x: x.split(','))

# One-hot encode genres
mlb = MultiLabelBinarizer()
genre_encoded = mlb.fit_transform(df['genre'])

# Combine genre features with rating
from sklearn.preprocessing import StandardScaler
numerical_features = df[['rating', 'members']].fillna(0).values
numerical_features_scaled = StandardScaler().fit_transform(numerical_features)

# Final feature matrix
X = np.hstack([genre_encoded, numerical_features_scaled])


Step 3: Compute Cosine Similarity

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between all anime
cos_sim = cosine_similarity(X, X)

# Create a mapping from anime name to index
anime_indices = pd.Series(df.index, index=df['name']).drop_duplicates()


Step 4: Recommendation Function

In [10]:
def recommend_anime(title, top_n=10, similarity_threshold=0.2):
    idx = anime_indices[title]
    sim_scores = list(enumerate(cos_sim[idx]))
    
    # Filter by threshold
    sim_scores = [s for s in sim_scores if s[1] >= similarity_threshold and s[0] != idx]
    
    # Sort by similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    top_indices = [i[0] for i in sim_scores[:top_n]]
    
    return df['name'].iloc[top_indices].tolist()


In [11]:
recommend_anime("Naruto", top_n=5, similarity_threshold=0.3)

['Naruto: Shippuuden',
 'Bleach',
 'Shingeki no Kyojin',
 'Kill la Kill',
 'Angel Beats!']

Step 5: Experiment with Thresholds

In [12]:
thresholds = [0.1, 0.2, 0.3, 0.5]
for t in thresholds:
    recommendations = recommend_anime("Naruto", top_n=10, similarity_threshold=t)
    print(f"Threshold {t}: {recommendations}")

Threshold 0.1: ['Naruto: Shippuuden', 'Bleach', 'Shingeki no Kyojin', 'Kill la Kill', 'Angel Beats!', 'Soul Eater', 'Sword Art Online', 'Fairy Tail', 'Ao no Exorcist', 'Death Note']
Threshold 0.2: ['Naruto: Shippuuden', 'Bleach', 'Shingeki no Kyojin', 'Kill la Kill', 'Angel Beats!', 'Soul Eater', 'Sword Art Online', 'Fairy Tail', 'Ao no Exorcist', 'Death Note']
Threshold 0.3: ['Naruto: Shippuuden', 'Bleach', 'Shingeki no Kyojin', 'Kill la Kill', 'Angel Beats!', 'Soul Eater', 'Sword Art Online', 'Fairy Tail', 'Ao no Exorcist', 'Death Note']
Threshold 0.5: ['Naruto: Shippuuden', 'Bleach', 'Shingeki no Kyojin', 'Kill la Kill', 'Angel Beats!', 'Soul Eater', 'Sword Art Online', 'Fairy Tail', 'Ao no Exorcist', 'Death Note']


Observation:

Lower threshold → more recommendations but less similar.

Higher threshold → fewer, more focused recommendations.

Step 6: Evaluation

In [13]:
# Example: list of anime the user actually liked
liked_anime = ["Naruto", "Bleach", "One Piece"]

# Create binary arrays
y_true = np.array([1 if anime in liked_anime else 0 for anime in df['name']])

# Example: anime recommended by your system
recommended_anime = recommend_anime("Naruto", top_n=10, similarity_threshold=0.3)
y_pred = np.array([1 if anime in recommended_anime else 0 for anime in df['name']])

# Import metrics
from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate evaluation metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}")


Precision: 0.1000, Recall: 0.3333, F1-score: 0.1538


Step 7: Analysis and Improvements

Threshold Impact: Higher threshold → fewer but more relevant recommendations.

Features: Currently using genres + rating; adding episodes, type, tags could improve similarity.

Hybrid approach: Combine content-based with collaborative filtering for better personalization.

Performance: Cosine similarity is fast; for large datasets, consider sparse matrices or approximate nearest neighbors.

**1. Difference between User-Based and Item-Based Collaborative Filtering**

User-Based Collaborative Filtering (UBCF):

Looks at similarities between users.

Assumes that if two users have rated items similarly in the past, they will continue to like similar items in the future.

Example: If User A and User B both like Naruto and One Piece, and User A also likes Bleach, then Bleach can be recommended to User B.

Item-Based Collaborative Filtering (IBCF):

Looks at similarities between items.

Assumes that if a user liked an item, they will also like other items that are similar to it.

Example: If many users who liked Attack on Titan also liked Death Note, then Death Note will be recommended to a user who liked Attack on Titan.



**2. What is Collaborative Filtering, and How Does It Work?**

Definition:
Collaborative Filtering (CF) is a recommendation technique that makes predictions about a user’s interests by collecting preferences from many users (a “collaboration”). It’s widely used in recommendation systems like Netflix, Amazon, and Spotify.

How It Works:

Collect user-item interaction data (ratings, likes, purchases, views, etc.).

Find patterns:

Either users with similar preferences (user-based)

Or items consumed together (item-based).

Recommend items based on these patterns.

Example:

If you rated Inception and Interstellar highly, and many others who liked these movies also liked Tenet, the system will recommend Tenet to you.