# Anime Recommendation System: Introduction

## Objective
The goal of this project is to build a **recommendation system** that can suggest anime to users based on their past ratings and anime characteristics.

## Dataset Overview
We are using the [Anime Recommendations Database](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database?select=rating.csv) from Kaggle, which contains:
- **Anime information**: `anime_id`, `name`, `genre`, `type`, `episodes`, `rating`, `members`.
- **User ratings**: `user_id`, `anime_id`, `rating`.

## Project Goals
1. Explore and preprocess the dataset to handle sparsity, missing values and noisy ratings.
2. Implement multiple recommendation techniques:
    - **Baseline recommendation**: Simple popularity-based recommendations.
    - **Collaborative Filtering (CF)**: Suggest anime based on user-user or item-item similarity.
    - **Content-Based Filtering (CBF)**: Recommend anime using genre, type, and other features.
    - **Hybrid approaches**: Combine CF and CBF for better predictions.
3. Evaluate model performance using metrics like **RMSE** and **MAE**.
4. Visualise recommendations and analyze patterns in user preferences and anime clusters.
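
To make the evaluation metrics concrete, here is a minimal sketch of RMSE and MAE computed with NumPy. The helper names `rmse` and `mae` and the toy ratings are illustrative assumptions, not part of the project code; the notebook imports `mean_squared_error` and `mean_absolute_error` from scikit-learn for the same purpose.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: penalises large errors more heavily
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error: average size of the error in rating points
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical true and predicted ratings on the 1-10 scale
actual = [8, 6, 9, 10]
predicted = [7, 6, 10, 8]

print(f"RMSE: {rmse(actual, predicted):.3f}")  # RMSE: 1.225
print(f"MAE:  {mae(actual, predicted):.3f}")   # MAE:  1.000
```

RMSE is never smaller than MAE on the same predictions, and the gap between them grows when a few predictions are far off.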

## Why This Matters
Recommendation systems help users **discover content they are likely to enjoy** and are a critical part of many modern applications. By combining user behavior and content features, this project aims to provide **personalized anime recommendations** and to gain insight into anime clusters and user preferences.

In [27]:
# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries loaded successfully!")

Libraries loaded successfully!


# Data Preprocessing

Before building any recommendation system, we need to **clean, preprocess and structure the data**. This involves:

1. **Handling missing values**.
2. **Filtering invalid ratings**.
3. **Creating a user-item rating matrix**.
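
As a rough illustration of the last step, the sketch below builds a sparse user-item matrix from a toy ratings table with `scipy.sparse.csr_matrix` and measures its sparsity. The toy data and the variable names (`user_codes`, `item_codes`, `matrix`) are assumptions made for this example, not the notebook's own implementation.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings table with the same columns as rating.csv (values are made up)
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "anime_id": [20, 154, 20, 199, 154],
    "rating": [8, 6, 9, 10, 7],
})

# Map raw ids to contiguous row/column positions
user_codes, user_index = pd.factorize(ratings["user_id"])
item_codes, item_index = pd.factorize(ratings["anime_id"])

# Rows are users, columns are anime, stored values are ratings
matrix = csr_matrix(
    (ratings["rating"].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(len(user_index), len(item_index)),
)

# Fraction of user-item pairs that have no rating
sparsity = 1 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(matrix.shape, f"sparsity = {sparsity:.2%}")
```

A sparse format matters here because, with tens of thousands of users and thousands of anime, almost every cell of the dense matrix would be empty.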

In [28]:
# Loading datasets

anime_df = pd.read_csv('data/anime.csv')
ratings_df = pd.read_csv('data/rating.csv')

print(f"Anime dataset: {anime_df.shape}")
print(f"Ratings dataset: {ratings_df.shape}")

# Displaying basic information

print(f"Unique anime: {anime_df['anime_id'].nunique()}")
print(f"Unique users: {ratings_df['user_id'].nunique()}")
print("Rating distribution:")
print(ratings_df['rating'].value_counts())

# Display data

print(anime_df.head())
print(ratings_df.head())

Anime dataset: (12294, 7)
Ratings dataset: (7813737, 3)
Unique anime: 12294
Unique users: 73515
Rating distribution:
rating
 8     1646019
-1     1476496
 7     1375287
 9     1254096
 10     955715
 6      637775
 5      282806
 4      104291
 3       41453
 2       23150
 1       16649
Name: count, dtype: int64
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17

In [29]:
# Cleaning anime data

anime_df['rating'] = pd.to_numeric(anime_df['rating'], errors='coerce')
anime_df['episodes'] = pd.to_numeric(anime_df['episodes'], errors='coerce')
anime_df['members'] = pd.to_numeric(anime_df['members'], errors='coerce')

# Removing anime without a rating (NaN after conversion) and adult content

anime_clean = anime_df.dropna(subset=['rating']).copy()
anime_clean = anime_clean[~anime_clean['genre'].str.contains('Hentai', case=False, na=False)].copy()
print(f"Anime entries after cleaning: {len(anime_clean)}")

# Cleaning rating data

ratings_clean = ratings_df[(ratings_df['rating'] != -1) & (ratings_df['rating'].notna())].copy()
print(f"Ratings after cleaning: {len(ratings_clean)}")

# Removing users and items with few interactions (reduces sparsity)

user_counts = ratings_clean['user_id'].value_counts()
item_counts = ratings_clean['anime_id'].value_counts()

# Keeping users and anime with at least 5 ratings each

min_user_ratings = 5
min_item_ratings = 5

valid_users = user_counts[user_counts >= min_user_ratings].index
valid_items = item_counts[item_counts >= min_item_ratings].index

ratings_filtered = ratings_clean[
    (ratings_clean['user_id'].isin(valid_users)) & 
    (ratings_clean['anime_id'].isin(valid_items))
].copy()

print(ratings_filtered.head())

# Calculating sparsity

n_users = ratings_filtered['user_id'].nunique()
n_items = ratings_filtered['anime_id'].nunique()
n_ratings = len(ratings_filtered)
sparsity = (1 - n_ratings / (n_users * n_items)) * 100

print(f"Users: {n_users:,}")
print(f"Items: {n_items:,}")
print(f"Ratings: {n_ratings:,}")
print(f"Sparsity: {sparsity:.2f}%")
print(f"Avg ratings per user: {n_ratings/n_users:.1f}")
print(f"Avg ratings per item: {n_ratings/n_items:.1f}")



Anime entries after cleaning: 10931
Ratings after cleaning: 6337241
     user_id  anime_id  rating
156        3        20       8
157        3       154       6
158        3       170       9
159        3       199      10
160        3       225       9
Users: 60,970
Items: 8,030
Ratings: 6,314,650
Sparsity: 98.71%
Avg ratings per user: 103.6
Avg ratings per item: 786.4


# Popularity Baseline Recommender

As a first step in building our recommendation system, we implement a **popularity-based baseline**. This baseline recommends the same most popular items to every user, ignoring personal preferences.

### How it Works:
1. **Fit Phase**:
    - Compute the average rating and rating count for each item.
    - Filter out items with too few ratings (to reduce noise).
    - Rank the remaining items by average rating to determine popularity.
2. **Prediction**:
    - If the item exists in the popular list, return its average rating.
    - Otherwise, return the global average rating.
3. **Recommendation**:
    - Return the top-N popular items.
    - Optionally, exclude items the user has already seen.

In [35]:
class PopularityBaseline:

    def __init__(self):

        self.popular_items = None
        self.global_mean = None

    def fit(self, ratings_df):
        # Learns the most popular anime from the data

        # Calculate the popularity score

        item_stats = ratings_df.groupby('anime_id').agg({'rating' : ['mean','count']}).round(3)
        item_stats.columns = ['avg_rating','rating_count']

        # Only keep anime with at least 10 ratings

        item_stats = item_stats[item_stats['rating_count'] >= 10]

        # Sort the anime by their avg_rating
        self.popular_items = item_stats.sort_values('avg_rating', ascending = False)
        self.global_mean = ratings_df['rating'].mean()

        print(f"Baseline trained on {len(item_stats)} popular items")
        return self

    def predict(self, user_id, item_id):
        # Predicts the rating for a user-item pair
        if self.popular_items is None:
            return self.global_mean
        
        if item_id in self.popular_items.index:
            return self.popular_items.loc[item_id, 'avg_rating']
        else:
            return self.global_mean
        
    def recommend(self, user_id, n_recommendations, exclude_seen = None):
        # Recommend the top-N popular items, skipping anything in exclude_seen.
        # user_id is unused: the baseline gives every user the same ranking.
        seen = set(exclude_seen) if exclude_seen is not None else set()

        recommendations = []
        # popular_items is already sorted by avg_rating (descending)
        for item in self.popular_items.index:
            if item not in seen:
                recommendations.append(item)
                if len(recommendations) >= n_recommendations:
                    break

        # Always return a list, even when exclude_seen is not given
        return recommendations

In [36]:
# Train baseline model
print("Training Popularity Baseline")
baseline_model = PopularityBaseline()
baseline_model.fit(ratings_filtered)

# Show the top 10 recommendations

top_anime_ids = baseline_model.popular_items.head(10).index
for i, anime_id in enumerate(top_anime_ids, 1):
    anime_name = anime_clean[anime_clean['anime_id'] == anime_id]['name'].iloc[0]
    avg_rating = baseline_model.popular_items.loc[anime_id, 'avg_rating']
    rating_count = baseline_model.popular_items.loc[anime_id, 'rating_count']
    print(f'{i:2d}. {anime_name[:40]:40} | Rating: {avg_rating:.2f} | Count: {rating_count:,}')

Training Popularity Baseline
Baseline trained on 7364 popular items
 1. Gintama°                                 | Rating: 9.45 | Count: 1,182
 2. Kimi no Na wa.                           | Rating: 9.42 | Count: 1,948
 3. Ginga Eiyuu Densetsu                     | Rating: 9.39 | Count: 799
 4. Fullmetal Alchemist: Brotherhood         | Rating: 9.32 | Count: 21,220
 5. Gintama&#039;                            | Rating: 9.27 | Count: 3,098
 6. Steins;Gate                              | Rating: 9.26 | Count: 17,019
 7. Hunter x Hunter (2011)                   | Rating: 9.23 | Count: 7,418
 8. Gintama                                  | Rating: 9.23 | Count: 4,222
 9. Gintama&#039;: Enchousen                 | Rating: 9.20 | Count: 2,121
10. Gintama Movie: Kanketsu-hen - Yorozuya y | Rating: 9.19 | Count: 2,139


# Content-Based Filtering for Anime Recommendations

This section implements a **content-based recommendation system** that leverages **anime features** to generate personalised recommendations.

### Key Steps:

1. **Feature Preparation**
   - Genre features transformed using **TF-IDF**.
   - Anime type features encoded as **one-hot vectors**.
   - Numeric features like `rating`, `episodes`, and `members` **normalised**.
   - Combined all features into a **single feature matrix**.

2. **Similarity Computation**
   - Compute **cosine similarity** between all anime.
   - Store similarity matrix.

3. **Personalized Recommendation**
   - For each anime a user has rated, retrieve similar anime.
   - Weight similarity scores by the user’s rating.
   - Aggregate scores and sort to recommend the top-N anime.
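
The similarity and scoring steps above can be sketched on a toy catalogue. This is a simplified illustration that uses genre TF-IDF only; the four-item catalogue, the `recommend` helper and the example ratings are hypothetical and are not the `ContentBasedFilter` class used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue (made-up ids and genres, for illustration only)
anime = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "genre": [
        "Action, Adventure",
        "Action, Adventure, Fantasy",
        "Romance, School",
        "Romance, Drama",
    ],
})

# Step 1: genre text -> TF-IDF vectors; Step 2: pairwise cosine similarity
genre_features = TfidfVectorizer().fit_transform(anime["genre"])
similarity = cosine_similarity(genre_features)

anime_to_idx = {anime_id: idx for idx, anime_id in enumerate(anime["anime_id"])}

def recommend(user_ratings, n=2):
    # Step 3: sum each rated item's similarity row, weighted by the rating
    scores = np.zeros(len(anime))
    for anime_id, rating in user_ratings.items():
        scores += similarity[anime_to_idx[anime_id]] * rating
    # Never recommend something the user has already rated
    for anime_id in user_ratings:
        scores[anime_to_idx[anime_id]] = -np.inf
    top = np.argsort(scores)[::-1][:n]
    return anime["anime_id"].iloc[top].tolist()

# A user who loved anime 1 and was lukewarm about anime 3
recs = recommend({1: 10, 3: 2})
print(recs)  # [2, 4]
```

Anime 2 ranks first because it shares two genres with the highly rated anime 1, while anime 4 only shares one genre with the low-rated anime 3.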

In [None]:
class ContentBasedFilter:

    def __init__(self):
        self.item_features = None
        self.item_similarity_matrix = None
        self.tfidf_vectorizer = None
        self.anime_to_idx = None
        self.idx_to_anime = None

    def prepare_features(self, anime_df):
        # preparing the feature matrix for anime


        anime_ids = anime_df['anime_id'].values
        self.anime_to_idx = {anime_id: idx for idx, anime_id in enumerate(anime_ids)}
        self.idx_to_anime = {idx : anime_id for anime_id, idx in self.anime_to_idx.items()}

        features_list = []

        # Genres features using TF-IDF

        genres = anime_df['genre'].fillna('Unknown')
        self.tfidf_vectorizer = TfidfVectorizer(max_features=50, stop_words=None)
        genre_features = self.tfidf_vectorizer.fit_transform(genres)
        features_list.append(genre_features)

        # Type features

        type_dummies = pd.get_dummies(anime_df['type']).values
        features_list.append(type_dummies)

        # Numerical features (normalised)

        numerical_features = anime_df[['rating', 'episodes', 'members']].copy()
        numerical_features['episodes'] = numerical_features['episodes'].fillna(1)

        # normalise numerical features 
        for col in numerical_features.columns:
            numerical_features[col] = (numerical_features[col] - numerical_features[col].mean()) / numerical_features[col].std()

        features_list.append(numerical_features.values)

        # Combine all features

        
        
