### RECOMMENDATION SYSTEM ON ANIME DATASET


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

from scipy.sparse import csr_matrix, issparse


In [2]:
# Load Dataset
df = pd.read_csv("anime.csv")

In [3]:
df.shape

(12294, 7)

In [4]:
df.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


In [5]:
# Check missing values
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [6]:
# Handle missing values with simple, column-appropriate defaults
df['genre'] = df['genre'].fillna('')              # empty text for TF-IDF
df['type'] = df['type'].fillna('Unknown')         # simple category
df['rating'] = df['rating'].fillna(df['rating'].median())   # numeric median

In [7]:
df.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

#### FEATURE EXTRACTION

In [9]:
# TF-IDF Vectorisation
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['genre'])
tfidf_matrix.shape

(12294, 46)
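
To see what kind of tokens end up in those 46 columns, it can help to mimic the vectorizer's tokenization by hand. The sketch below assumes scikit-learn's documented default `token_pattern` (`(?u)\b\w\w+\b`) and default lowercasing. The helper names `default_tokens` and `comma_tokens` and the sample string are hypothetical and only for illustration.

```python
import re

# Default token pattern documented for scikit-learn's TfidfVectorizer:
# runs of two or more word characters, so punctuation acts as a separator.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(genre_text):
    """Approximate the default tokenization (before stop-word removal)."""
    return DEFAULT_TOKEN_PATTERN.findall(genre_text.lower())

def comma_tokens(genre_text):
    """Alternative: treat each comma-separated genre as a single token."""
    return [g.strip().lower() for g in genre_text.split(",") if g.strip()]

# Illustrative genre string (not a row copied from the dataset)
sample = "Sci-Fi, Slice of Life, Thriller"
word_level = default_tokens(sample)
genre_level = comma_tokens(sample)
print(word_level)
print(genre_level)
```

With the default pattern, hyphenated and multi-word genres are split into separate words, and `stop_words='english'` then drops words such as "of". Passing a comma-splitting function as the `tokenizer` argument would keep each genre intact, although the vocabulary size would then likely differ from the 46 shown above.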

In [10]:
# COSINE SIMILARITY
similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(similarities.shape)

(12294, 12294)
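
Each entry of this matrix is the cosine of the angle between two TF-IDF rows: the dot product divided by the product of the vector lengths. As a minimal sketch of that calculation, the pure-Python function below works on small made-up vectors. The weights and the three genre terms are hypothetical, and returning 0 for an all-zero vector is a choice made here to avoid division by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF-like weights over three genre terms: [drama, romance, action]
a = [0.6, 0.8, 0.0]
b = [0.6, 0.8, 0.0]      # same genres as a
c = [0.0, 0.0, 1.0]      # no genre in common with a
empty = [0.0, 0.0, 0.0]  # a title whose genre was missing

print(round(cosine(a, b), 4))  # identical genre profile
print(round(cosine(a, c), 4))  # no shared genre
print(cosine(a, empty))        # empty genre
```

Note that a dense 12294 x 12294 matrix holds about 151 million values, roughly 1.2 GB if stored as 64-bit floats, so computing similarities for one row at a time may be preferable on machines with limited memory.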


In [11]:
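# INDEX MAPPING (title -> row position; keep the first row when a title repeats)
indices = pd.Series(df.index, index=df['name'])
indices = indices[~indices.index.duplicated(keep='first')]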

In [12]:
# RECOMMENDATION FUNCTION (with threshold)
def recommend_anime(anime_name, threshold=0.0):
    if anime_name in indices.index:
        idx = indices[anime_name]

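        # pair every other anime with its similarity score, skipping the query itself
        sim_scores = [(i, s) for i, s in enumerate(similarities[idx]) if i != idx]

        # apply threshold
        sim_scores = [x for x in sim_scores if x[1] >= threshold]

        # sort by similarity (highest first) and keep the top 5
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:5]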

        print(f"\nRecommended Anime for: {anime_name} (Threshold={threshold})")
        print("=" * 50)

        if len(sim_scores) == 0:
            print("No recommendations above this threshold.")
            return

        for i, score in sim_scores:
            print(df['name'].iloc[i], "---> Similarity:", round(score, 4))
    else:
        print("Anime not found in the dataset!")

In [13]:
# Test Recommendations
print("\nTesting Recommendation on First Anime:")
sample_anime = df['name'].iloc[0]
recommend_anime(sample_anime)

print("\nTesting with Threshold = 0.1:")
recommend_anime(sample_anime, threshold=0.1)

print("\nTesting with Threshold = 0.2:")
recommend_anime(sample_anime, threshold=0.2)


Testing Recommendation on First Anime:

Recommended Anime for: Kimi no Na wa. (Threshold=0.0)
Wind: A Breath of Heart OVA ---> Similarity: 1.0
Wind: A Breath of Heart (TV) ---> Similarity: 1.0
Aura: Maryuuin Kouga Saigo no Tatakai ---> Similarity: 0.9553
Angel Beats!: Another Epilogue ---> Similarity: 0.8715
Harmonie ---> Similarity: 0.8715

Testing with Threshold = 0.1:

Recommended Anime for: Kimi no Na wa. (Threshold=0.1)
Wind: A Breath of Heart OVA ---> Similarity: 1.0
Wind: A Breath of Heart (TV) ---> Similarity: 1.0
Aura: Maryuuin Kouga Saigo no Tatakai ---> Similarity: 0.9553
Angel Beats!: Another Epilogue ---> Similarity: 0.8715
Harmonie ---> Similarity: 0.8715

Testing with Threshold = 0.2:

Recommended Anime for: Kimi no Na wa. (Threshold=0.2)
Wind: A Breath of Heart OVA ---> Similarity: 1.0
Wind: A Breath of Heart (TV) ---> Similarity: 1.0
Aura: Maryuuin Kouga Saigo no Tatakai ---> Similarity: 0.9553
Angel Beats!: Another Epilogue ---> Similarity: 0.8715
Harmonie ---> Similarity: 0.8715

### **Performance Analysis**

1. **Genre-Only Limitation**  
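   Using only the *genre* column limits accuracy because many anime share the same broad tags. In the test above, two titles tie at a similarity of 1.0 simply because their genre lists match the query's exactly.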

2. **TF-IDF Limitation**  
   TF-IDF matches tokens exactly and has no notion of meaning, so related terms (e.g., “action” vs “combat”) are treated as unrelated.

3. **Ignoring Numerical Features**  
   Features like *rating*, *episodes*, and *members* are not included in the similarity, so a low-rated or obscure title ranks the same as a popular, well-rated one with the same genres.

4. **Cold-Start Problem**  
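   Anime with missing genres (filled with an empty string above) produce an all-zero TF-IDF vector, so their cosine similarity to every other title is 0 and they cannot be recommended effectively.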

5. **No User Behavior Learning**  
   The system is content-based only and does not learn from user preferences.

6. **Possible Improvements**  
   - Combine text + numeric features (hybrid model)  
   - Use embeddings like BERT/SBERT  
   - Implement collaborative filtering  
   - Weight ratings or popularity in ranking  
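
As a minimal sketch of the last improvement, candidates could be re-ranked by a weighted blend of genre similarity and rating. The function name `rerank`, the weight `alpha=0.8`, the assumed 0-10 rating scale, and the candidate titles below are all hypothetical; a real weight would need tuning against user feedback.

```python
def rerank(candidates, alpha=0.8):
    """Blend similarity with a rating scaled to [0, 1].

    candidates: list of (name, similarity, rating), rating assumed on a 0-10 scale.
    alpha: weight on similarity (hypothetical default, would need tuning).
    """
    scored = [
        (name, alpha * sim + (1 - alpha) * (rating / 10.0))
        for name, sim, rating in candidates
    ]
    # Highest blended score first
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Made-up candidates: two share the same genre similarity but differ in rating
candidates = [
    ("Title A", 1.0, 6.0),
    ("Title B", 1.0, 8.5),
    ("Title C", 0.9, 9.0),
]
ranked = rerank(candidates)
for name, score in ranked:
    print(name, round(score, 3))
```

Here the rating only breaks the tie between the two titles with identical similarity. A smaller `alpha` would let a highly rated title overtake a closer genre match.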

### **Interview Questions**

#### **1. What is the difference between user-based and item-based collaborative filtering?**

- **User-Based CF:**  
  Finds users with similar behavior and recommends items those users liked.

- **Item-Based CF:**  
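  Finds items similar to those the user already likes and recommends them.  
  It is generally more stable and scalable, because item-to-item similarities tend to change more slowly than user-to-user similarities.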

---

#### **2. What is collaborative filtering, and how does it work?**

Collaborative filtering uses user–item interactions (ratings, likes, watch history) to make recommendations.  
It assumes that users with similar past behavior will have similar preferences.

Types of CF:
- User–User  
- Item–Item  
- Model-Based (Matrix Factorization, SVD)  

CF does not require item features like genre; it relies on user behavior patterns.  
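
The user-based and item-based variants can be illustrated on a tiny ratings table. This is a sketch under stated assumptions, not a production approach: the users, items, and ratings are made up, and treating an unrated item as 0 in the cosine calculation is a simplification.

```python
import math

# Hypothetical ratings: one dict per user; 0 means "not rated".
ratings = {
    "alice": {"A": 5, "B": 4, "C": 0},
    "bob":   {"A": 5, "B": 5, "C": 4},
    "carol": {"A": 1, "B": 0, "C": 5},
}
items = ["A", "B", "C"]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_vector(user):
    """One user's ratings across all items (a row of the table)."""
    return [ratings[user][i] for i in items]

def item_vector(item):
    """One item's ratings across all users (a column of the table)."""
    return [ratings[u][item] for u in ratings]

# User-based: find the user most similar to alice, then suggest
# items that user rated and alice has not.
user_sims = {u: cosine(user_vector("alice"), user_vector(u))
             for u in ratings if u != "alice"}
nearest_user = max(user_sims, key=user_sims.get)
suggestions = [i for i in items
               if ratings["alice"][i] == 0 and ratings[nearest_user][i] > 0]

# Item-based: find the item whose rating pattern is closest to item A.
item_sims = {i: cosine(item_vector("A"), item_vector(i))
             for i in items if i != "A"}
nearest_item = max(item_sims, key=item_sims.get)

print(nearest_user, suggestions, nearest_item)
```

Both variants use the same similarity function. The only difference is whether it compares rows (users) or columns (items) of the ratings table, and neither looks at genre or any other item feature.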
