# Recommendation System

### Data Preprocessing:
- Load the dataset into a suitable data structure (e.g., pandas DataFrame).

In [100]:
import pandas as pd

df = pd.read_csv('anime.csv')
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
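
Some titles in the `name` column carry HTML entities, such as `Gintama&#039;` in the output above, so a lookup by the plain title would miss them. A minimal sketch of decoding them with the standard library's `html.unescape`; the `toy` frame is a hypothetical stand-in, and in the notebook the same `map` call would presumably be applied to `df['name']`.

```python
import html

import pandas as pd

# Hypothetical toy frame standing in for df; only the 'name' column matters here
toy = pd.DataFrame({"name": ["Gintama&#039;", "Wolf&#039;s Rain", "Steins;Gate"]})

# html.unescape turns entities such as &#039; back into the characters they encode
# and leaves strings without entities unchanged
toy["name"] = toy["name"].map(html.unescape)

print(toy["name"].tolist())
```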


In [101]:
df.shape

(12294, 7)

- Handle missing values, if any.

In [102]:
df.isna().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [103]:
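# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas and may not modify df
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['genre'] = df['genre'].fillna('Unknown')
df['type'] = df['type'].fillna('Unknown')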

In [104]:
df.isna().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

- Explore the dataset to understand its structure and attributes.

In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [106]:
df.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12294.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.017096,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.9,225.0
50%,10260.5,6.55,1550.0
75%,24794.5,7.17,9437.0
max,34527.0,10.0,1013917.0


### Feature Extraction:
- Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

In [107]:
# Keep 'name' alongside the features so recommendations can be mapped back to titles
features_df = df[['name', 'genre', 'type', 'episodes', 'rating']].copy()

- Convert categorical features into numerical representations if necessary.

In [108]:
# Convert text to numeric codes
features_df['genre'] = features_df['genre'].astype('category').cat.codes
features_df['type'] = features_df['type'].astype('category').cat.codes

In [109]:
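import numpy as np

# 'episodes' is stored as text; treat the 'Unknown' placeholder as missing
features_df['episodes'] = features_df['episodes'].replace('Unknown', np.nan)
features_df['episodes'] = pd.to_numeric(features_df['episodes'], errors='coerce')
# Impute missing episode counts with the median
features_df['episodes'] = features_df['episodes'].fillna(features_df['episodes'].median())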

In [110]:
features_df.head()

Unnamed: 0,name,genre,type,episodes,rating
0,Kimi no Na wa.,2686,0,1.0,9.37
1,Fullmetal Alchemist: Brotherhood,161,5,64.0,9.26
2,Gintama°,534,5,51.0,9.25
3,Steins;Gate,3240,5,24.0,9.17
4,Gintama&#039;,534,5,51.0,9.16


- Normalize numerical features if required.

In [111]:
# anime_reference = df[['name', 'genre']].reset_index(drop=True)

In [112]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_df[['episodes', 'rating']] = scaler.fit_transform(features_df[['episodes', 'rating']])

In [113]:
features_df.head()

Unnamed: 0,name,genre,type,episodes,rating
0,Kimi no Na wa.,2686,0,-0.239941,2.847535
1,Fullmetal Alchemist: Brotherhood,161,5,1.122451,2.73938
2,Gintama°,534,5,0.841323,2.729547
3,Steins;Gate,3240,5,0.25744,2.650889
4,Gintama&#039;,534,5,0.841323,2.641057


### Recommendation System:

- Design a function to recommend anime based on cosine similarity.
- Given a target anime, recommend a list of similar anime based on cosine similarity scores.
- Experiment with different threshold values for similarity scores to adjust the recommendation list size.

In [114]:
from sklearn.metrics.pairwise import cosine_similarity
# Compute cosine similarity between all anime
similarity_matrix = cosine_similarity(features_df[['genre', 'type', 'episodes', 'rating']])


In [115]:
similarity_matrix

array([[1.        , 0.99936619, 0.99994658, ..., 0.99999819, 0.99999824,
        0.99999902],
       [0.99936619, 1.        , 0.99967971, ..., 0.99937114, 0.99937158,
        0.99934287],
       [0.99994658, 0.99967971, 1.        , ..., 0.999948  , 0.99994816,
        0.99993992],
       ...,
       [0.99999819, 0.99937114, 0.999948  , ..., 1.        , 1.        ,
        0.99999945],
       [0.99999824, 0.99937158, 0.99994816, ..., 1.        , 1.        ,
        0.99999945],
       [0.99999902, 0.99934287, 0.99993992, ..., 0.99999945, 0.99999945,
        1.        ]])
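
Nearly every entry in the matrix above is close to 1. One plausible reason is that the integer genre code is far larger than the other three features, so it dominates both the dot product and the vector norms. The sketch below checks this on the first two rows of the `features_df.head()` output; the variable names are illustrative, and the values are the rounded ones printed earlier.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Feature rows (genre code, type code, scaled episodes, scaled rating)
# copied from the features_df.head() output for the first two titles
kimi_no_na_wa = np.array([[2686, 0, -0.239941, 2.847535]])
fma_brotherhood = np.array([[161, 5, 1.122451, 2.739380]])

# With the raw genre code included, the score is almost exactly 1
with_genre = cosine_similarity(kimi_no_na_wa, fma_brotherhood)[0, 0]

# Dropping the genre code (column 0) leaves type, episodes and rating
without_genre = cosine_similarity(kimi_no_na_wa[:, 1:], fma_brotherhood[:, 1:])[0, 0]

print(f"with genre code:    {with_genre:.6f}")
print(f"without genre code: {without_genre:.6f}")
```

The first value should agree with the `0.99936619` entry of the matrix up to rounding, while the second is much lower, which suggests the scores mostly reflect the size of the genre codes rather than shared content.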

In [116]:

def recommend_anime(target_name, df=df, similarity_matrix=similarity_matrix, top_n=5):
    """
    Given a target anime name, recommend the top N similar anime based on cosine similarity.
    """

    # Check if the anime exists in the dataset
    if target_name not in df['name'].values:
        return "Anime not found in dataset."

    # Get the row index of the target anime; df has a default RangeIndex,
    # so the label is also the row position in similarity_matrix
    target_index = df[df['name'] == target_name].index[0]

    # Pair each anime's position with its similarity to the target and drop
    # the target itself. Excluding it by position is safer than skipping the
    # first sorted entry, because another title can tie with it at 1.0.
    similarity_scores = [
        (i, score)
        for i, score in enumerate(similarity_matrix[target_index])
        if i != target_index
    ]

    # Sort the similarity scores in descending order (highest similarity first)
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Select the positions of the top N most similar anime
    top_indices = [i for i, _ in similarity_scores[:top_n]]

    # Return the names of the recommended anime
    return df['name'].iloc[top_indices].tolist()

# Example usage: get top 5 anime similar to 'Fullmetal Alchemist: Brotherhood'
print(recommend_anime("Fullmetal Alchemist: Brotherhood", top_n=5))


['Berserk', 'Claymore', 'Arslan Senki (TV)', 'Wolf&#039;s Rain', 'Lupin III (2015)']
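
The task list for this section also asks for experiments with similarity thresholds, which the `top_n` cut-off does not cover. A minimal sketch of a threshold-based variant follows; `recommend_above_threshold` is a hypothetical helper, and the toy names and feature values are made up so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def recommend_above_threshold(target_name, names, similarity, threshold=0.9):
    """Return (name, score) pairs whose similarity to the target is >= threshold.

    Hypothetical helper, not part of the original notebook.
    """
    names = pd.Series(names).reset_index(drop=True)
    matches = np.flatnonzero(names.to_numpy() == target_name)
    if matches.size == 0:
        return []
    target_pos = matches[0]

    scores = pd.Series(similarity[target_pos], index=names.index)
    scores = scores.drop(target_pos)  # never recommend the target itself
    scores = scores[scores >= threshold].sort_values(ascending=False)
    return [(names[i], float(s)) for i, s in scores.items()]


# Toy feature matrix (made-up values) standing in for the real features
toy_names = ["A", "B", "C", "D"]
toy_features = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.5, 0.5],
    [0.0, 1.0],
])
toy_similarity = cosine_similarity(toy_features)

# A stricter threshold yields a shorter list
for threshold in (0.99, 0.5):
    recs = recommend_above_threshold("A", toy_names, toy_similarity, threshold)
    print(threshold, [name for name, _ in recs])
```

With the notebook's own matrix, almost every score is above 0.999, so a useful threshold would have to sit very close to 1 unless the features are re-encoded first.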


### Evaluation:

- Split the dataset into training and testing sets.
- Analyze the performance of the recommendation system and identify areas of improvement.

In [117]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(features_df, test_size=0.2, random_state=42)

In [118]:
train_df.columns

Index(['name', 'genre', 'type', 'episodes', 'rating'], dtype='object')

- Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.


In [119]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity on training data only
similarity_matrix_train = cosine_similarity(train_df[['genre', 'type', 'episodes', 'rating']])

In [120]:
from sklearn.metrics import precision_score,recall_score,f1_score

In [121]:

def evaluate_system(reference_df, similarity_matrix, top_n=5):
    """
    Evaluate recommendation system using a simple genre-match approach.
    Calculates Precision, Recall, and F1-score.
    """
    
    y_true, y_pred = [], []  # Lists to store actual and predicted relevance labels

    # Loop through each anime in the reference DataFrame
    for anime in reference_df['name']:

        # Get the genre of the target anime
        target_genre = reference_df[reference_df['name'] == anime]['genre'].values[0]

        # Get the row index of the target anime in the DataFrame
        # Use get_loc to convert original index to local 0-based index in this DataFrame
        target_index = reference_df.index.get_loc(reference_df[reference_df['name'] == anime].index[0])

        # Get similarity scores of the target anime with all other anime
        similarity_scores = list(enumerate(similarity_matrix[target_index]))

        # Sort by similarity (highest first)
        similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
        
        # Get indices of top N similar anime (skip itself)
        top_indices = [i[0] for i in similarity_scores[1:top_n+1]]
        
        # Compare genres of recommended anime with target anime
        for idx in top_indices:
            rec_genre = reference_df.iloc[idx]['genre']
            
            y_true.append(1)                 # We assume each recommendation is supposed to match
            y_pred.append(1 if rec_genre == target_genre else 0)  # 1 if genre matches, else 0

    # Calculate evaluation metrics
    print("Precision:", precision_score(y_true, y_pred, zero_division=0))
    print("Recall:", recall_score(y_true, y_pred, zero_division=0))
    print("F1-Score:", f1_score(y_true, y_pred, zero_division=0))


In [122]:
# Run evaluation
evaluate_system(train_df, similarity_matrix_train, top_n=5)  

Precision: 1.0
Recall: 0.17826131164209455
F1-Score: 0.30258366269135845


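The current system is limited in two ways. First, genres and types are encoded as arbitrary integer codes, so cosine similarity treats the code values as magnitudes; the large genre codes likely dominate the scores, which would explain why almost every entry in the similarity matrix is close to 1. Second, the evaluation labels every recommendation as relevant (`y_true` is all 1s), so precision is trivially 1.0 and recall reduces to the share of recommendations whose genre string exactly matches the target's, about 17.8% here. The metrics are also computed on `train_df` only, so the held-out `test_df` is never used.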
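
The evaluation above appends 1 to `y_true` for every recommendation. A small sketch, on made-up labels, of what that labelling does to the metrics: with no negative ground-truth labels there can be no false positives, so precision is 1.0 whenever at least one genre matches, and recall equals the plain match rate.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up outcome for 10 recommendations: 1 = genre matched the target, 0 = it did not
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
# The labelling used in evaluate_system: every recommendation counts as relevant
y_true = [1] * len(y_pred)

precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
match_rate = sum(y_pred) / len(y_pred)

print("Precision:", precision)    # no false positives are possible with all-ones y_true
print("Recall:", recall)          # same value as the match rate
print("Match rate:", match_rate)
```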

Improvement: Use one-hot or multi-label encoding for genres/types and consider weighting features or overlapping genres to boost recommendation quality.
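
One way to try the multi-label encoding suggested above is `Series.str.get_dummies`, which splits each genre string and produces one 0/1 column per genre. This is a sketch under assumptions: the toy rows reuse the comma-plus-space genre format shown in the `df.head()` output, and "Made-up Title" is invented for illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows in the same "comma + space" genre format as anime.csv
toy = pd.DataFrame({
    "name": ["Kimi no Na wa.", "Steins;Gate", "Made-up Title"],
    "genre": [
        "Drama, Romance, School, Supernatural",
        "Sci-Fi, Thriller",
        "Drama, Romance",
    ],
})

# One 0/1 column per individual genre
genre_matrix = toy["genre"].str.get_dummies(sep=", ")

# Cosine similarity now reflects how many genres two titles share
genre_similarity = pd.DataFrame(
    cosine_similarity(genre_matrix), index=toy["name"], columns=toy["name"]
)
print(genre_similarity.round(3))
```

Titles that share some genres now get a score between 0 and 1, and titles with no genre in common get 0, instead of every pair landing near 1.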

# Interview Questions:

### 1. Can you explain the difference between user-based and item-based collaborative filtering?

- User-based CF: Recommends items to a user based on what similar users liked. Focuses on user-to-user similarity.

- Item-based CF: Recommends items similar to what the user already liked. Focuses on item-to-item similarity.
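
The two approaches differ only in which axis of the user-item rating matrix is compared. A minimal sketch on made-up ratings, treating 0 as "not rated" (a simplification; real systems usually handle missing ratings more carefully):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: rows are 4 users, columns are 3 anime, 0 means "not rated"
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [0, 1, 4],
])

# User-based CF compares rows: which users rate alike?
user_similarity = cosine_similarity(ratings)      # shape (4, 4)

# Item-based CF compares columns: which anime are rated alike?
item_similarity = cosine_similarity(ratings.T)    # shape (3, 3)

# The last position of argsort is the row itself, so take the one before it
nearest_user = int(np.argsort(user_similarity[0])[-2])
nearest_item = int(np.argsort(item_similarity[0])[-2])

print("user-user shape:", user_similarity.shape)
print("item-item shape:", item_similarity.shape)
print("most similar user to user 0:", nearest_user)
print("most similar anime to anime 0:", nearest_item)
```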

### 2. What is collaborative filtering, and how does it work?

- It predicts a user’s preferences by analyzing past behavior or ratings from multiple users.
- Works by finding patterns or similarities between users or items to generate recommendations.
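
As an illustration of the second point, the sketch below estimates a missing rating with a basic item-based scheme: the user's own ratings are averaged, weighted by how similar each rated anime is to the target. The ratings and the `predict_rating` helper are made up for this example, and 0 again stands for "not rated".

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: rows are users, columns are anime, 0 means "not rated"
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])


def predict_rating(ratings, user, item):
    """Item-based estimate: similarity-weighted mean of the user's own ratings."""
    item_similarity = cosine_similarity(ratings.T)
    rated = np.flatnonzero(ratings[user] > 0)    # items this user has rated
    weights = item_similarity[item, rated]
    if weights.sum() == 0:
        return float("nan")                      # nothing to base an estimate on
    return float(weights @ ratings[user, rated] / weights.sum())


# User 2 has not rated anime 1; estimate it from their ratings of anime 0 and 2
estimate = predict_rating(ratings, user=2, item=1)
print(round(estimate, 2))
```

The estimate lands near the low end of the scale because anime 1 is rated much like anime 0, which user 2 rated 1.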