# RECOMMENDATION SYSTEM USING COSINE SIMILARITY

Dataset: Anime Dataset

## 1. Objective

The objective of this assignment is to implement a content-based recommendation system using cosine similarity. The system recommends similar anime based on attributes such as genre, number of episodes, ratings, and popularity. This approach helps users discover anime with similar characteristics without relying on user interaction history.

## 2. Dataset Description

The Anime dataset contains the following attributes.
Anime ID represents the unique identifier for each anime.
Name represents the title of the anime.
Genre indicates the type of anime such as action, romance, fantasy, etc.
Type represents the broadcast format such as TV, Movie, or OVA.
Episodes indicates the total number of episodes.
Rating represents the average user rating.
Members represents the number of users who have added the anime to their list.

## 3. Data Preprocessing
### 3.1 Loading the dataset

In [7]:
import pandas as pd

df = pd.read_csv("anime.csv")
df.head()


,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


### 3.2 Exploring the dataset

In [10]:
df.info()
df.describe()  # not rendered below: a notebook cell displays only the value of its last expression
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

### 3.3 Handling missing values

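Missing genre values are replaced with the placeholder Unknown.
The episodes column is stored as text and contains entries such as Unknown; pd.to_numeric with errors='coerce' converts these entries to NaN.
Missing values in episodes and rating are then filled with the column median, which is less sensitive to outliers than the mean.
The 25 missing type values are left as they are because type is not used as a feature.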

In [13]:
df['genre'] = df['genre'].fillna('Unknown')

df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce')
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

df['rating'] = df['rating'].fillna(df['rating'].median())
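
The effect of the coercion and median fill is easier to see on a small example. The sketch below uses a made-up three-row frame (the `toy` name and its values are assumptions for illustration, not rows from anime.csv) and applies the same three steps.

```python
import pandas as pd

# Toy frame mimicking the columns above; the values are made up for illustration.
toy = pd.DataFrame({
    "genre": ["Action", None, "Comedy"],
    "episodes": ["12", "Unknown", "24"],
    "rating": [8.0, None, 6.0],
})

# Missing genre becomes the placeholder string "Unknown"
toy["genre"] = toy["genre"].fillna("Unknown")

# errors="coerce" turns non-numeric strings such as "Unknown" into NaN
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")
n_coerced = int(toy["episodes"].isna().sum())

# The median is computed from the remaining valid values, then used as the fill
toy["episodes"] = toy["episodes"].fillna(toy["episodes"].median())
toy["rating"] = toy["rating"].fillna(toy["rating"].median())

print(n_coerced)
print(toy)
```

Counting the NaN values right after the coercion step, as `n_coerced` does here, is a cheap way to check how many rows end up carrying an imputed value.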


## 4. Feature Extraction
### 4.1 Feature selection

The following features are used to compute similarity.
Genre as a textual feature.
Rating, episodes, and members as numerical features.

### 4.2 Converting genre into numerical form using TF-IDF

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
genre_matrix = tfidf.fit_transform(df['genre'])
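
One detail worth checking is how the vectorizer tokenizes the genre strings. With default settings, TfidfVectorizer splits on any non-word character, so a label such as Sci-Fi is broken into two tokens and multi-word labels lose their identity. The sketch below compares the default behaviour with a comma-based tokenizer on three made-up genre strings; `split_genres` is a hypothetical helper and this variant is not what the notebook runs, so the scores reported later would change if it were adopted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up genre strings in the same comma-separated format as the dataset
genres = ["Sci-Fi, Thriller", "Slice of Life, Comedy", "Action, Martial Arts"]

# Default settings: tokens are runs of two or more word characters
default_vec = TfidfVectorizer(stop_words="english")
default_vec.fit(genres)
default_vocab = sorted(default_vec.vocabulary_)


def split_genres(text):
    """Hypothetical tokenizer: one token per comma-separated genre label."""
    return [g.strip() for g in text.split(",") if g.strip()]


# token_pattern=None makes it explicit that the custom tokenizer is used instead
label_vec = TfidfVectorizer(tokenizer=split_genres, token_pattern=None)
label_vec.fit(genres)
label_vocab = sorted(label_vec.vocabulary_)

print(default_vocab)
print(label_vocab)
```

Whether whole-label tokens give better recommendations is an empirical question; the point here is only that the two vocabularies differ.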


### 4.3 Normalizing numerical features

In [19]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
numeric_features = scaler.fit_transform(
    df[['rating', 'episodes', 'members']]
)
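
Min-max scaling maps each column to the range [0, 1] using x' = (x - min) / (max - min). The sketch below checks that formula against MinMaxScaler on made-up member counts and also shows, as an optional variant that the notebook does not use, how a log transform spreads out a heavily skewed column before scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up member counts with one very popular title
members = np.array([[100.0], [1_000.0], [10_000.0], [1_000_000.0]])

# Library result and the same formula written out by hand
scaled = MinMaxScaler().fit_transform(members)
col_min, col_max = members.min(axis=0), members.max(axis=0)
manual = (members - col_min) / (col_max - col_min)

# Optional variant (an assumption, not part of the notebook): log1p before scaling
log_scaled = MinMaxScaler().fit_transform(np.log1p(members))

print(scaled.ravel())
print(log_scaled.ravel())
```

With plain min-max scaling, the title with 10,000 members lands below 0.01 because a single large value sets the upper end of the range; after the log transform it sits near the middle.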


### 4.4 Combining all features

In [22]:
from scipy.sparse import hstack

feature_matrix = hstack([genre_matrix, numeric_features])
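
Stacking the blocks side by side gives every column equal standing in the later similarity computation. If the balance between genre and numerical features needs tuning, one option is to scale a block before stacking. The sketch below illustrates this on tiny made-up matrices; `numeric_weight` is a hypothetical parameter, and the notebook itself stacks the blocks unweighted.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Tiny stand-ins for genre_matrix (sparse) and numeric_features (dense)
genre_block = csr_matrix(np.array([[0.6, 0.8, 0.0],
                                   [0.0, 0.0, 1.0]]))
numeric_block = np.array([[0.9, 0.1],
                          [0.5, 0.2]])

# Hypothetical weight; a value of 1.0 reproduces plain stacking
numeric_weight = 0.5

# hstack concatenates column-wise; tocsr() gives a format that supports row slicing
combined = hstack([genre_block, csr_matrix(numeric_block * numeric_weight)]).tocsr()

print(combined.shape)
print(combined.toarray())
```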


## 5. Recommendation System Using Cosine Similarity
### 5.1 Computing cosine similarity

In [27]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(feature_matrix)
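
Cosine similarity between two vectors a and b is their dot product divided by the product of their norms. The sketch below verifies that definition against scikit-learn on a made-up 3x3 matrix and shows that a single row of the similarity matrix can be computed on demand. This matters because the full matrix for 12,294 titles has about 151 million entries, roughly 1.2 GB if stored as 64-bit floats. The `cosine` helper and the matrix `X` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature vectors, one row per item
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])


def cosine(a, b):
    """Cosine similarity from the definition: a.b / (|a| * |b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Full pairwise matrix, as in the notebook
full = cosine_similarity(X)

# Only the similarities of item 0 against all items, without the full matrix
one_row = cosine_similarity(X[[0]], X).ravel()

print(cosine(X[0], X[1]))
print(one_row)
```

Computing one row per query trades a little latency for a large reduction in memory, which may be preferable when only a few titles are queried.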


### 5.2 Building the recommendation function

In [30]:
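def recommend_anime(anime_name, df, cosine_sim, top_n=10, threshold=0.3):
    """Return up to top_n (name, score) pairs with cosine similarity >= threshold."""
    if anime_name not in df['name'].values:
        return "Anime not found in dataset"

    # Use the first row whose name matches the query
    idx = df[df['name'] == anime_name].index[0]
    similarity_scores = list(enumerate(cosine_sim[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    recommendations = []
    for i, score in similarity_scores:
        if len(recommendations) >= top_n or score < threshold:
            break  # list is full, or (scores being sorted) nothing later can qualify
        if i != idx:  # skip the query title itself, wherever it lands in the ranking
            recommendations.append((df.iloc[i]['name'], score))

    return recommendations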


### 5.3 Testing the recommendation system

In [33]:
recommend_anime("Naruto", df, cosine_sim, top_n=5, threshold=0.35)


[('Naruto: Shippuuden', 0.9914950851356025),
 ('Dragon Ball Z', 0.9427895332767986),
 ('Dragon Ball', 0.9158938129399173),
 ('Naruto: Shippuuden Movie 4 - The Lost Tower', 0.905891082411689),
 ('Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsugu Mono', 0.9055519665419227)]

## 6. Threshold Experimentation

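A lower similarity threshold admits more candidates, including weaker matches.
A higher similarity threshold admits fewer candidates, all of them closer to the query title.
In the two Naruto runs, every returned score is above 0.88, so neither threshold (0.35 or 0.5) filters anything out; the second list is longer only because top_n was raised from 5 to 10.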

In [36]:
recommend_anime("Naruto", df, cosine_sim, top_n=10, threshold=0.5)


[('Naruto: Shippuuden', 0.9914950851356025),
 ('Dragon Ball Z', 0.9427895332767986),
 ('Dragon Ball', 0.9158938129399173),
 ('Naruto: Shippuuden Movie 4 - The Lost Tower', 0.905891082411689),
 ('Naruto: Shippuuden Movie 3 - Hi no Ishi wo Tsugu Mono', 0.9055519665419227),
 ('Boruto: Naruto the Movie', 0.9019616028763021),
 ('Naruto x UT', 0.8844796415003492),
 ('Naruto Soyokazeden Movie: Naruto to Mashin to Mitsu no Onegai Dattebayo!!',
  0.884093236280787),
 ('Dragon Ball Kai', 0.8829969889355951),
 ('Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi',
  0.8820976379819145)]
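
To see how much a given threshold actually filters, it helps to count the candidates that pass at several threshold values. The sketch below does this for a made-up score vector; `count_above` is a hypothetical helper, and in the notebook the same idea could be applied to one row of cosine_sim.

```python
import numpy as np

# Made-up similarity scores of one query title against six other titles
scores = np.array([0.99, 0.94, 0.88, 0.52, 0.41, 0.12])


def count_above(scores, thresholds):
    """Map each threshold to the number of scores that are >= that threshold."""
    return {t: int((scores >= t).sum()) for t in thresholds}


counts = count_above(scores, [0.3, 0.5, 0.9])
print(counts)
```

A threshold only changes the result when it falls inside the range of scores that would otherwise be returned, so sweeping values close to the top scores is more informative than comparing two low thresholds.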

## 7. Analysis and Performance Evaluation

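For the Naruto query, the system returns sequels, franchise movies, and Dragon Ball titles, which suggests that it identifies anime with similar genres and popularity.
Genre similarity appears to contribute the most to the ranking, although the contribution of each feature was not measured separately.
Numerical features such as rating and members help refine the ordering among titles with similar genres.
No quantitative metric was computed; judged by inspection of these results, the system is useful for content discovery when user interaction data is unavailable.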
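
One simple way to put a number on the genre similarity of a recommendation list is the Jaccard overlap between the genre set of the query and that of each recommended title. The sketch below uses made-up genre strings; `genre_set` and `jaccard` are hypothetical helpers, and this is only one of several possible offline checks.

```python
def genre_set(genre_string):
    """Split a comma-separated genre string into a set of labels."""
    return {g.strip() for g in genre_string.split(",") if g.strip()}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up genre strings for a query title and three recommended titles
query = genre_set("Action, Comedy, Shounen")
recommended = ["Action, Shounen", "Action, Comedy, Shounen, Super Power", "Romance"]

overlaps = [jaccard(query, genre_set(g)) for g in recommended]
mean_overlap = sum(overlaps) / len(overlaps)

print(overlaps)
print(mean_overlap)
```

Averaging this value over many query titles would give a rough, genre-only quality score; it says nothing about whether users would actually like the recommendations.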

## 8. Limitations and Improvements

The system does not personalize recommendations based on user behavior.
It may struggle to recommend diverse content outside preferred genres.
Performance can be improved by incorporating collaborative filtering or building a hybrid recommendation system.
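
A hybrid system can be as simple as a weighted blend of a content-based score and a collaborative score for each candidate. The sketch below is a minimal illustration with made-up scores; `blend_scores` and `alpha` are hypothetical names, and a real system would need both score types on a comparable scale.

```python
import numpy as np


def blend_scores(content_scores, collab_scores, alpha=0.5):
    """Weighted blend: alpha * content + (1 - alpha) * collaborative."""
    content = np.asarray(content_scores, dtype=float)
    collab = np.asarray(collab_scores, dtype=float)
    return alpha * content + (1.0 - alpha) * collab


# Made-up scores for three candidate titles
content = [0.9, 0.8, 0.2]
collab = [0.1, 0.7, 0.9]

blended = blend_scores(content, collab, alpha=0.5)
ranking = np.argsort(-blended)  # candidate indices, best first

print(blended)
print(ranking)
```

Setting alpha to 1.0 recovers the purely content-based ranking, so the blend can be introduced gradually as interaction data becomes available.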

## 9. Interview Questions
### 9.1 Difference between user-based and item-based collaborative filtering

User-based collaborative filtering recommends items based on the preferences of similar users.
Item-based collaborative filtering recommends items similar to those the user has already interacted with.
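Item-based filtering usually scales better when users far outnumber items, because item-to-item similarities change slowly and can be precomputed; it is commonly used in real-world systems.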
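
The difference between the two variants comes down to which axis of the rating matrix is compared. The sketch below computes both similarity matrices for a made-up matrix with three users and four items; treating 0 as "not rated" is a simplifying assumption.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are items; 0 means "not rated" (made-up values)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# User-based: compare rows, giving one similarity per pair of users
user_sim = cosine_similarity(ratings)

# Item-based: compare columns, giving one similarity per pair of items
item_sim = cosine_similarity(ratings.T)

print(user_sim.shape, item_sim.shape)
print(user_sim[0], item_sim[0])
```

In this example, users 0 and 1 have similar tastes, and items 0 and 1 are rated similarly by the same users, so both pairs receive high similarity.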

### 9.2 What is collaborative filtering and how does it work

Collaborative filtering is a recommendation technique that predicts user preferences by analyzing user interactions such as ratings or clicks.
It works by identifying similarities between users or items and generating recommendations based on those similarities rather than item content.
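
As a concrete illustration, a user-based prediction can be formed as a similarity-weighted average of the ratings that other users gave to the target item. The sketch below is a minimal version on a made-up rating matrix; `predict_user_based` is a hypothetical helper, 0 is assumed to mean "not rated", and practical systems typically add mean-centering and neighbour selection.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (made-up values)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])


def predict_user_based(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])  # cosine to every user

    # Only users other than `user` who actually rated the item can contribute
    rated = ratings[:, item] > 0
    rated[user] = False
    weights = sims[rated]
    if not rated.any() or weights.sum() == 0:
        return 0.0  # no usable neighbours
    return float(weights @ ratings[rated, item] / weights.sum())


# User 0 has not rated item 2; users 1 and 2 rated it 1 and 5
prediction = predict_user_based(ratings, user=0, item=2)
print(prediction)
```

The prediction falls much closer to user 1's rating than to user 2's, because user 1's rating history resembles that of user 0.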

## 10. Conclusion

A content-based recommendation system using cosine similarity was successfully implemented on the Anime dataset. By leveraging genre information and numerical features, the system provides relevant anime recommendations. While effective, integrating user interaction data would further enhance recommendation quality.