
# Anime Recommendation System using Cosine Similarity

In [1]:
!pip install pandas scikit-learn scipy



In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from scipy.sparse import hstack

In [3]:
# Load dataset
anime_df = pd.read_csv("anime.csv")
print("Dataset loaded successfully.\n")
print(anime_df.head())

Dataset loaded successfully.

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  


In [4]:
# Data cleaning
anime_df.dropna(subset=['genre', 'rating'], inplace=True)
anime_df.fillna({'episodes': 0}, inplace=True)
anime_df.drop_duplicates(subset='name', inplace=True)
anime_df.reset_index(drop=True, inplace=True)
print("Data cleaned and ready for feature extraction.")

Data cleaned and ready for feature extraction.


In [5]:
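# Feature extraction
# TF-IDF over the genre strings: each word in a genre name becomes a weighted column.
tfidf = TfidfVectorizer(stop_words='english')
genre_matrix = tfidf.fit_transform(anime_df['genre'])
# Scale ratings to [0, 1] so they are comparable in magnitude to the TF-IDF weights.
scaler = MinMaxScaler()
normalized_ratings = scaler.fit_transform(anime_df[['rating']])
# Append the scaled rating as one extra column next to the sparse genre matrix.
features = hstack([genre_matrix, normalized_ratings])
print("Features combined using genres and ratings.")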

Features combined using genres and ratings.


In [6]:
# Cosine similarity matrix
cosine_sim = cosine_similarity(features, features)
print("Cosine similarity matrix computed.")

Cosine similarity matrix computed.


In [7]:
# Recommendation function
def recommend_anime(title, top_n=10, threshold=0.2):
    # Returns a list of (name, similarity) tuples, or an error string if the title is unknown.
    if title not in anime_df['name'].values:
        return f"Anime '{title}' not found in the dataset."
    # Row position of the query title (the index was reset after cleaning)
    idx = anime_df[anime_df['name'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the query title itself and anything below the similarity threshold
    sim_scores = [x for x in sim_scores if x[0] != idx and x[1] >= threshold]
    top_anime = sim_scores[:top_n]
    return [(anime_df.iloc[i]['name'], score) for i, score in top_anime]

In [8]:
# Try a recommendation
print(recommend_anime("Naruto", top_n=5))

[('Naruto: Shippuuden', np.float64(0.9999496292892799)), ('Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi', np.float64(0.9999481049853969)), ('Boruto: Naruto the Movie', np.float64(0.999857229347821)), ('Naruto x UT', np.float64(0.9998356939954685)), ('Naruto: Shippuuden Movie 4 - The Lost Tower', np.float64(0.9997551002626752))]


In [9]:
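# Evaluate with a hit-rate proxy (printed as Precision@10)
# This notebook loads no per-user ratings, so a sampled title counts as a "hit"
# when at least one of its top-10 recommendations is also in the test split.
# `train` is not used below; the similarity matrix was built on the full dataset.
train, test = train_test_split(anime_df, test_size=0.2, random_state=42)
def evaluate_precision(test_set):
    hits = 0
    total = 0
    test_titles = set(test_set['name'])  # built once instead of on every iteration
    for title in test_set['name'].sample(100, random_state=42):
        recs = recommend_anime(title, top_n=10)
        if isinstance(recs, str):
            continue  # title not found
        rec_titles = [r[0] for r in recs]
        hits += len(set(rec_titles) & test_titles) > 0
        total += 1
    precision = hits / total if total > 0 else 0
    print(f"Precision@10: {precision:.2f}")

evaluate_precision(test)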

Precision@10: 0.87
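
The figure above is the share of sampled titles with at least one recommendation in the test split. For comparison, conventional precision@k is the fraction of the top-k recommendations that are actually relevant. The sketch below assumes a hypothetical set of relevant titles per query, which this notebook does not have; `precision_at_k` and the sample titles are illustrative names only.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended titles that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    # Count how many of the top-k titles are in the relevant set
    hits = sum(1 for title in top_k if title in relevant)
    # Divide by k (not by len(top_k)), so short lists are penalised
    return hits / k


# Hypothetical example: 2 of the top 4 recommendations are relevant
recommended = ["Title A", "Title B", "Title C", "Title D"]
relevant = {"Title A", "Title C", "Title Z"}
print(precision_at_k(recommended, relevant, k=4))
```

Dividing by `k` rather than by the number of returned items is one common convention; other write-ups divide by the list length when fewer than `k` items come back.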


1. Can you explain the difference between user-based and item-based collaborative filtering?

---

### 1. **User-Based Collaborative Filtering (UBCF)**

* **Idea**: Recommend items to a user based on what **similar users** liked.

* **How it works**:

  * Find users with similar preferences (based on rating history).
  * Recommend items that those similar users liked, which the current user hasn't seen yet.

* **Example**:
  If Alice and Bob both liked the same 3 anime, and Bob also liked a 4th anime, recommend that 4th anime to Alice.

* **Similarity Computed Between**: **Users**

* **Common Similarity Metrics**: Cosine similarity, Pearson correlation
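
The Alice and Bob example can be sketched in a few lines. This is a minimal illustration, not the notebook's method: the ratings, the third user Carol, the `recommend_for_user` helper, and the "liked" cutoff of 4 are all assumptions made for the example, and 0 is used to mean "not rated".

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings (rows = users, columns = anime); 0 means "not rated"
ratings = pd.DataFrame(
    [[5, 4, 5, 0],   # Alice has not seen Anime D
     [5, 5, 4, 5],   # Bob liked the same three titles, plus Anime D
     [1, 5, 0, 2]],  # Carol has different tastes
    index=["Alice", "Bob", "Carol"],
    columns=["Anime A", "Anime B", "Anime C", "Anime D"],
)


def recommend_for_user(user, ratings, k=1, like_threshold=4):
    """Recommend unseen items that the k most similar users rated highly."""
    # User-to-user cosine similarity over the raw rating vectors
    sim = pd.DataFrame(
        cosine_similarity(ratings), index=ratings.index, columns=ratings.index
    )
    # The k nearest neighbours, excluding the user themselves
    neighbours = sim[user].drop(user).nlargest(k).index
    # Items the target user has not rated yet
    unseen = ratings.columns[ratings.loc[user] == 0]
    # Average the neighbours' ratings on those unseen items
    scores = ratings.loc[neighbours, unseen].mean()
    return scores[scores >= like_threshold].sort_values(ascending=False)


print(recommend_for_user("Alice", ratings))
```

With these made-up numbers Bob is Alice's nearest neighbour, so Anime D is the only candidate. Treating unrated items as 0 is a shortcut; it pulls similarities down for users with few ratings.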

#### Pros:

* Captures social-like behavior (“people like you liked this”)

#### Cons:

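* Similarity computation gets expensive as the number of users grows, and two users often share too few rated items to compare reliably (sparsity)
* Cannot find neighbours for new users who have no rating history (cold start)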

---

### 2. **Item-Based Collaborative Filtering (IBCF)**

* **Idea**: Recommend items **similar to the ones a user already liked**.

* **How it works**:

  * For each item the user likes, find similar items (based on how other users rated them).
  * Recommend those similar items.

* **Similarity Computed Between**: **Items**

* **Common Similarity Metrics**: Cosine similarity, adjusted cosine similarity
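
As a rough sketch of item-to-item similarity, the block below computes adjusted cosine similarity on a small, made-up rating matrix. The data and the `adjusted_cosine` helper are hypothetical. One simplification is assumed: after subtracting each user's mean, missing ratings are set to 0, whereas a stricter version would restrict each item pair to the users who rated both.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings (rows = users, columns = items); NaN means "not rated"
ratings = pd.DataFrame(
    {"Anime A": [5, 4, 1, np.nan],
     "Anime B": [4, 5, 2, 1],
     "Anime C": [1, 2, 5, 5]},
    index=["u1", "u2", "u3", "u4"],
)


def adjusted_cosine(ratings):
    """Item-item cosine similarity after removing each user's mean rating."""
    # Subtract every user's own average so generous and harsh raters are comparable
    centred = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0.0)
    m = centred.to_numpy()
    # Column norms; guard against division by zero for all-zero columns
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0
    sim = (m.T @ m) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=ratings.columns, columns=ratings.columns)


item_sim = adjusted_cosine(ratings)
# Items most similar to Anime A, excluding itself
print(item_sim["Anime A"].drop("Anime A").sort_values(ascending=False))
```

Because item similarities depend on many users' ratings, they tend to change slowly and can be computed ahead of time, which is where the scalability advantage comes from.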

#### Pros:

* Scales better with large user bases
* More stable over time (items change less frequently than users)

#### Cons:

* May miss out on social signals (e.g., “users like you liked...”)

---

### Summary Table:

| Feature                 | User-Based CF                   | Item-Based CF                                   |
| ----------------------- | ------------------------------- | ----------------------------------------------- |
| Similarity Between      | Users                           | Items                                           |
| Recommendation Based On | Similar users’ preferences      | Similar items to liked ones                     |
| Better For              | Small user bases                | Large, stable item sets                         |
| Cold Start Issue        | New users                       | New items                                       |
| Scalability             | Slower as the user count grows  | Often faster; item similarities can be cached   |




2. What is collaborative filtering, and how does it work?

**Collaborative Filtering** is a popular technique used in recommendation systems to suggest items (like movies, books, or anime) based on **user preferences and behavior** rather than item content.

---

### What Is Collaborative Filtering?

It is a method that makes automatic predictions about a user's interests by collecting preferences or taste information from **many users** (collaboration).

---

### Core Idea

> "If User A and User B liked similar items in the past, and User A likes a new item, then User B might like it too."

---

### How It Works

Collaborative filtering operates using a **user-item interaction matrix**, where rows represent users, columns represent items, and cells contain ratings (explicit) or interactions (implicit like clicks, views).
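
Rating data usually arrives as a long log of (user, item, rating) rows rather than as a matrix. The sketch below shows one way to pivot such a log into a user-item matrix with pandas; the column names `user_id`, `anime_id`, and `rating` and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format rating log: one row per (user, item) interaction
log = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [10, 20, 10, 30, 20],
    "rating":   [5, 4, 5, 2, 1],
})

# Rows become users, columns become items; unrated cells are NaN
matrix = log.pivot_table(index="user_id", columns="anime_id", values="rating")
print(matrix)

# Share of empty cells, a quick measure of sparsity
sparsity = matrix.isna().to_numpy().mean()
print(f"Sparsity: {sparsity:.2f}")
```

Real interaction matrices are far sparser than this toy one, so larger systems typically store them in a sparse format instead of a dense DataFrame.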

#### Two Main Types:

---

#### 1. **User-Based Collaborative Filtering**

* Compares users to find similar ones.
* Recommends items liked by similar users.

#### 2. **Item-Based Collaborative Filtering**

* Compares items to find similar ones.
* Recommends items similar to what the user has already liked.

---

### Example Matrix:

|        | Anime A | Anime B | Anime C | Anime D |
| ------ | ------- | ------- | ------- | ------- |
| User 1 | 5       | 4       | ?       | 3       |
| User 2 | 5       | 4       | 2       | 3       |
| User 3 | 1       | 2       | 5       | ?       |

* Collaborative filtering predicts User 1's missing rating for **Anime C** from similar users (User 2 gave identical ratings to the three titles both have rated) or from similar items.
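
One way to fill in the `?` for User 1 and Anime C is a mean-centred, similarity-weighted average over the other users. The sketch below uses Pearson correlation on co-rated items; `predict_rating` is a hypothetical helper, and with only two or three shared ratings the correlations are very coarse, so treat the result as illustrative.

```python
import numpy as np
import pandas as pd

# The example matrix; NaN marks the unknown ratings
ratings = pd.DataFrame(
    [[5, 4, np.nan, 3],
     [5, 4, 2, 3],
     [1, 2, 5, np.nan]],
    index=["User 1", "User 2", "User 3"],
    columns=["Anime A", "Anime B", "Anime C", "Anime D"],
)


def predict_rating(ratings, user, item):
    """Mean-centred, similarity-weighted prediction (user-based CF)."""
    target = ratings.loc[user]
    numerator, denominator = 0.0, 0.0
    for other in ratings.index.drop(user):
        candidate = ratings.loc[other]
        if pd.isna(candidate[item]):
            continue  # this neighbour has not rated the item
        shared = target.notna() & candidate.notna()
        if shared.sum() < 2:
            continue  # Pearson needs at least two co-rated items
        sim = np.corrcoef(target[shared], candidate[shared])[0, 1]
        if np.isnan(sim):
            continue  # undefined when either rating vector is constant
        # Weight the neighbour's deviation from their own mean by the similarity
        numerator += sim * (candidate[item] - candidate.mean())
        denominator += abs(sim)
    if denominator == 0:
        return target.mean()  # fall back to the user's own average
    return target.mean() + numerator / denominator


print(round(predict_rating(ratings, "User 1", "Anime C"), 2))
```

Here User 2 correlates positively with User 1 and rated Anime C below their own average, while User 3 correlates negatively and rated it above theirs. Both signals push the prediction below User 1's mean of 4, to roughly 2.1.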

---

### Key Techniques:

* **Similarity Calculation**: Cosine similarity, Pearson correlation
* **Matrix Factorization**: Advanced methods like SVD, ALS
* **Deep Learning**: Neural Collaborative Filtering (NCF)
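
To give a feel for the matrix factorization idea, the sketch below applies a plain truncated SVD from NumPy to a small mean-centred matrix. This is a simplification made for illustration: missing ratings are imputed with each user's mean before factorizing, whereas the SVD-style and ALS methods used in practice fit only the observed ratings and add regularization. The matrix values and the `low_rank_predict` helper are hypothetical.

```python
import numpy as np

# Hypothetical user-item ratings; NaN marks missing entries
R = np.array([
    [5, 4, np.nan, 3],
    [5, 4, 2, 3],
    [1, 2, 5, np.nan],
])


def low_rank_predict(R, rank=2):
    """Fill missing ratings from a rank-limited SVD of the mean-centred matrix."""
    # Centre each user's ratings; missing cells become 0 (i.e. the user's mean)
    user_means = np.nanmean(R, axis=1, keepdims=True)
    centred = np.where(np.isnan(R), 0.0, R - user_means)
    # Factorize and keep only the strongest `rank` latent factors
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    # Add the user means back to return to the rating scale
    return approx + user_means


predicted = low_rank_predict(R, rank=2)
print(np.round(predicted, 2))
```

Keeping fewer factors than the matrix has rows forces the model to explain ratings through shared latent patterns, which is what lets it produce values for the cells that were missing.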

---

### Pros:

* No need for item metadata or content
* Can discover complex patterns and relationships

### Cons:

* **Cold Start**: Struggles with new users/items (no data)
* **Sparsity**: Most users rate only a few items
* **Scalability**: Can be slow for very large datasets
