# Recommendation System


## Data Description:
    anime_id: Unique ID of each anime.
    name: Anime title.
    type: Broadcast type, such as TV, OVA, or Movie.
    genre: Comma-separated list of genres for each anime.
    episodes: Number of episodes of each anime.
    rating: Average rating of each anime across the users who rated it.
    members: Number of community members for each anime.
                               
## Objective:
    The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
        
## Dataset:
    Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

In [23]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Tasks:

## Data Preprocessing:
    Load the dataset into a suitable data structure (e.g., pandas DataFrame).
    Handle missing values, if any.
    Explore the dataset to understand its structure and attributes.


In [4]:
# Load the dataset
df = pd.read_csv(r"F:\Drive\ExcelR\Assignments\Recommendation System\Recommendation System\anime.csv")
df.head()


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [6]:
# Check for missing values
df.isnull().sum()


anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [7]:
# Drop rows with a missing genre, type, or rating
df = df.dropna()

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12017 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12017 non-null  int64  
 1   name      12017 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12017 non-null  object 
 4   episodes  12017 non-null  object 
 5   rating    12017 non-null  float64
 6   members   12017 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 751.1+ KB
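
The `df.info()` output lists `episodes` with dtype `object`, which usually means the column holds at least one non-numeric entry. The notebook does not use `episodes` as a feature, but if it did, the column would need converting first. A minimal sketch on toy data, assuming the non-numeric entries are placeholder strings such as "Unknown" (the toy frame and the `episodes_num` column are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the anime data; the "Unknown" placeholder is an assumption
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "episodes": ["12", "Unknown", "24"],
})

# Coerce entries that cannot be parsed as numbers to NaN instead of raising an error
toy["episodes_num"] = pd.to_numeric(toy["episodes"], errors="coerce")

print(toy.dtypes)
# Number of entries that could not be parsed
print(toy["episodes_num"].isna().sum())
```

The resulting NaN values could then be dropped or imputed, depending on how much data one is willing to lose.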


## Feature Extraction:
    Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
    Convert categorical features into numerical representations if necessary.
    Normalize numerical features if required.


In [14]:
# Convert 'genre' feature into numerical representation using one-hot encoding
genres_df = df['genre'].str.get_dummies(sep=', ')
genres_df

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
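
To make the encoding step easier to follow, here is a small sketch of what `str.get_dummies` does to a multi-label column. The three genre strings are made up for illustration; only the call itself mirrors the cell above.

```python
import pandas as pd

# Three made-up genre strings in the same "A, B, C" format as the dataset
toy_genres = pd.Series(["Drama, Romance", "Action, Drama", "Sci-Fi"])

# Split each string on ", " and create one 0/1 column per distinct genre
toy_dummies = toy_genres.str.get_dummies(sep=", ")
print(toy_dummies)
```

The separator has to match the data exactly. With `sep=","` instead of `sep=", "`, a leading space would stay attached to some labels, so "Romance" and " Romance" would end up as two different columns.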


In [15]:
# Normalize the 'rating' feature
df['rating_normalized'] = (df['rating'] - df['rating'].min()) / (df['rating'].max() - df['rating'].min())
df['rating_normalized']

0        0.924370
1        0.911164
2        0.909964
3        0.900360
4        0.899160
           ...   
12289    0.297719
12290    0.313325
12291    0.385354
12292    0.397359
12293    0.454982
Name: rating_normalized, Length: 12017, dtype: float64

In [12]:
# Combine genres and normalized rating into a single DataFrame for cosine similarity
features_df = pd.concat([genres_df, df['rating_normalized']], axis=1)

In [13]:
features_df.head()

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri,rating_normalized
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0.92437
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0.911164
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.909964
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0.90036
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.89916


In [40]:
# Reset both indices so that row positions in df and features_df line up
features_df = features_df.reset_index(drop=True)
df = df.reset_index(drop=True)

## Recommendation System:
    Design a function to recommend anime based on cosine similarity.
    Given a target anime, recommend a list of similar anime based on cosine similarity scores.
    Experiment with different threshold values for similarity scores to adjust the recommendation list size.
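
Cosine similarity is the dot product of two vectors divided by the product of their lengths, which is the same quantity scikit-learn's `cosine_similarity` computes for each pair of rows. A minimal NumPy sketch, using three made-up feature vectors (genre flags plus a normalized rating) rather than rows from the dataset:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative columns: Action, Comedy, Drama, rating_normalized
x = [1, 1, 0, 0.8]
y = [1, 1, 0, 0.6]  # same genres as x, different rating
z = [0, 0, 1, 0.8]  # no shared genre with x, same rating

print(cosine_sim(x, y))
print(cosine_sim(x, z))
```

Note that `x` and `z` share no genre, yet their similarity is not zero, because the rating column contributes to the dot product as well.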

In [41]:
# Function to recommend anime based on cosine similarity
def recommend_anime(target_anime, num_recommendations=5, threshold=0.5):
    # Check if the target anime exists in the dataset
    if target_anime not in df['name'].values:
        print(f"Anime '{target_anime}' not found in the dataset.")
        return pd.DataFrame()  # Return an empty DataFrame if not found
    # Get the index of the first row whose title matches; after reset_index, the label equals the row position
    target_index = df[df['name'] == target_anime].index[0]

    # Calculate cosine similarity between the target anime and all others
    similarity_scores = cosine_similarity([features_df.iloc[target_index]], features_df)[0]
    # Sort row positions from most to least similar
    similar_indices = np.argsort(similarity_scores)[::-1]
    # Filter out the target anime and keep only scores above the threshold
    filtered_indices = [i for i in similar_indices if similarity_scores[i] > threshold and i != target_index]
    # Return the top matches with the columns of interest
    return df.iloc[filtered_indices][:num_recommendations][['name', 'genre', 'rating']]

In [42]:
recommend_anime('Naruto', num_recommendations=5, threshold=0.5)

Unnamed: 0,name,genre,rating
615,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",7.94
1103,Boruto: Naruto the Movie - Naruto ga Hokage ni...,"Action, Comedy, Martial Arts, Shounen, Super P...",7.68
486,Boruto: Naruto the Movie,"Action, Comedy, Martial Arts, Shounen, Super P...",8.03
1343,Naruto x UT,"Action, Comedy, Martial Arts, Shounen, Super P...",7.58
1472,Naruto: Shippuuden Movie 4 - The Lost Tower,"Action, Comedy, Martial Arts, Shounen, Super P...",7.53
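
The task list asks for experiments with different threshold values. One way to run such an experiment is to count how many candidates clear each threshold. The sketch below does this on a toy feature matrix with hypothetical titles "A" to "E"; `count_above_threshold` is an illustrative helper, not a function from the notebook.

```python
import numpy as np
import pandas as pd

# Toy feature matrix (hypothetical titles): genre flags plus a normalized rating
toy_features = pd.DataFrame(
    [[1, 1, 0, 0.9],
     [1, 1, 0, 0.7],
     [1, 0, 0, 0.8],
     [0, 0, 1, 0.6],
     [0, 1, 1, 0.5]],
    index=["A", "B", "C", "D", "E"],
    columns=["Action", "Comedy", "Drama", "rating_normalized"],
)

def count_above_threshold(features, target, threshold):
    """Count items whose cosine similarity to `target` exceeds `threshold`."""
    X = features.to_numpy(dtype=float)
    # Scale every row to unit length so a dot product equals cosine similarity
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    scores = unit @ unit[features.index.get_loc(target)]
    # Exclude the target itself from the count
    mask = (scores > threshold) & (features.index != target)
    return int(mask.sum())

for t in (0.3, 0.5, 0.7, 0.9):
    print(t, count_above_threshold(toy_features, "A", t))
```

Raising the threshold can only shrink the candidate list, so the counts printed by the loop never increase from one threshold to the next.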


## Evaluation:
    Split the dataset into training and testing sets.
    Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
    Analyze the performance of the recommendation system and identify areas of improvement.


In [43]:
# Split the dataset into training and testing sets 
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [44]:
# Collect the target genres and the recommended genres for each anime in the test set
def evaluate_recommendations(test_data, num_recommendations=5, threshold=0.5):

    y_true = []
    y_pred = []

    for _, row in test_data.iterrows():
        target_anime = row['name']
        target_genre = row['genre']
        # Get recommendations for the target anime
        recommendations = recommend_anime(target_anime, num_recommendations, threshold)
        # If no recommendations are found, skip this anime
        if recommendations.empty:
            continue
        # Record the genre string of the target anime
        y_true.append(target_genre)
        # Join the genre strings of all recommended anime into one string
        recommended_genres = ', '.join(recommendations['genre'].values)
        y_pred.append(recommended_genres)

    return y_true, y_pred


In [45]:
# Get true and predicted genres for the test set
y_true, y_pred = evaluate_recommendations(test_df)

In [46]:
# Binary relevance: 1 if the target and the recommendations share at least one genre, else 0
def binary_relevance(true_genres, predicted_genres):
    return int(bool(set(true_genres.split(', ')) & set(predicted_genres.split(', '))))

In [47]:
# Build binary relevance lists for evaluation
y_true_binary = [1] * len(y_true)  # Every target is treated as relevant (label 1)
y_pred_binary = [binary_relevance(t, p) for t, p in zip(y_true, y_pred)]

In [48]:
# Compute evaluation metrics
precision = precision_score(y_true_binary, y_pred_binary, average='binary')
recall = recall_score(y_true_binary, y_pred_binary, average='binary')
f1 = f1_score(y_true_binary, y_pred_binary, average='binary')


In [49]:
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Precision: 1.0
Recall: 1.0
F1 Score: 1.0
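
These scores should be read with care. Because `y_true_binary` contains only 1s, no prediction can count as a false positive, so precision is 1.0 whenever at least one prediction is 1. In addition, `binary_relevance` returns 1 as soon as a single genre is shared, and the recommendations are themselves chosen by genre similarity. As one possible area of improvement, a stricter measure could score how much of the genre set overlaps. The sketch below uses the mean Jaccard overlap between the target's genres and each recommendation's genres; the helper names and the example strings are hypothetical.

```python
def genre_set(genre_string):
    """Turn 'Action, Comedy' into {'Action', 'Comedy'}."""
    return {g.strip() for g in genre_string.split(",") if g.strip()}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_genre_overlap(target_genre, recommended_genres):
    """Average Jaccard overlap between the target and each recommendation."""
    target = genre_set(target_genre)
    scores = [jaccard(target, genre_set(g)) for g in recommended_genres]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up example: one exact match, one partial match, one miss
target = "Action, Comedy, Shounen"
recs = ["Action, Comedy, Shounen", "Action, Drama", "Romance"]
overlap = mean_genre_overlap(target, recs)
print(overlap)
```

Unlike the binary check, this score drops when recommendations share only one genre out of many, so it can distinguish close matches from loose ones.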


# Interview Questions:

## 1. Can you explain the difference between user-based and item-based collaborative filtering?


- **User-based Collaborative Filtering:** Recommends items to a user based on the preferences of similar users. The system identifies users with similar behavior (rating patterns) and suggests items that those users have liked.

- **Item-based Collaborative Filtering:** Recommends items based on the similarity between items. It focuses on finding similarities between items (based on their ratings by users) and suggests items that are similar to those the user has interacted with.
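
The two variants differ only in which axis of the user-item rating matrix is compared. A small sketch with a hypothetical 3-user, 4-item rating matrix, where 0 stands for "not rated" (treating 0 as a rating value is a simplification made here for brevity):

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_matrix(ratings)    # user-based: compare rows (users)
item_sim = cosine_matrix(ratings.T)  # item-based: compare columns (items)

print(user_sim.round(2))
print(item_sim.round(2))
```

In this toy matrix, users 0 and 1 rate the first two items highly, so they come out as more similar to each other than to user 2, and the first two items come out as more similar to each other than to the third.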

## 2. What is collaborative filtering, and how does it work?

**Collaborative filtering** is a recommendation technique that makes predictions about a user’s preferences by collecting preferences from multiple users. The underlying assumption is that if a person has agreed with another person in the past, they are likely to agree in the future. Collaborative filtering can be user-based or item-based, depending on whether the similarity is computed between users or items.
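
As a rough illustration of how this works in the user-based case, an unknown rating can be estimated as a similarity-weighted average of the ratings given by other users. The rating matrix and the `predict_rating` helper below are hypothetical, and practical systems typically add refinements such as mean-centering and neighbor selection.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],   # user 0 has not rated item 2
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    # Cosine similarity between `user` and every user
    unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
    sims = unit @ unit[user]
    # Use only other users who actually rated the item
    raters = [u for u in range(len(ratings)) if u != user and ratings[u, item] > 0]
    if not raters:
        return None  # nobody else rated this item
    weights = sims[raters]
    return float(weights @ ratings[raters, item] / weights.sum())

prediction = predict_rating(ratings, user=0, item=2)
print(prediction)
```

User 0 agrees far more with user 1 than with user 2, so the estimate lands closer to user 1's low rating of item 2 than to user 2's high rating.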