#### Recommendation System

##### Data Description:

###### anime_id: Unique ID of each anime.
###### name: Anime title.
###### type: Anime broadcast type, such as TV, OVA, etc.
###### genre: Anime genre(s), as a comma-separated list.
###### episodes: The number of episodes of each anime.
###### rating: The average rating for each anime, computed over the users who rated it.
###### members: Number of community members for each anime.

##### Objective:
###### The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
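As a quick illustration of the metric itself, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a minimal, hand-rolled version on made-up genre vectors; the function name `cosine` and the three example vectors are assumptions for illustration, not part of the dataset.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical multi-hot genre vectors over [Action, Comedy, Drama, Sci-Fi]
anime_a = [1, 1, 0, 0]
anime_b = [1, 1, 1, 0]
anime_c = [0, 0, 0, 1]

sim_ab = cosine(anime_a, anime_b)  # two shared genres: 2 / (sqrt(2) * sqrt(3))
sim_ac = cosine(anime_a, anime_c)  # no shared genres: 0.0
print(round(sim_ab, 4), sim_ac)
```

Vectors that share most of their non-zero entries score close to 1, and vectors with nothing in common score 0.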
##### Dataset:
###### Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

##### Tasks:

##### Data Preprocessing:
###### Load the dataset into a suitable data structure (e.g., pandas DataFrame).
###### Handle missing values, if any.
###### Explore the dataset to understand its structure and attributes.

##### Feature Extraction:
###### Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
###### Convert categorical features into numerical representations if necessary.
###### Normalize numerical features if required.

##### Recommendation System:
###### Design a function to recommend anime based on cosine similarity.
###### Given a target anime, recommend a list of similar anime based on cosine similarity scores.
###### Experiment with different threshold values for similarity scores to adjust the recommendation list size.
###### Analyze the performance of the recommendation system and identify areas of improvement.

##### Interview Questions:
###### 1. Can you explain the difference between user-based and item-based collaborative filtering?
###### 2. What is collaborative filtering, and how does it work?


In [21]:
### Data Preprocessing:
## Load the dataset into a suitable data structure (e.g., pandas DataFrame).

import pandas as pd
import numpy as np

# Load the dataset

df = pd.read_csv("C:\\Users\\moulika\\Downloads\\anime.csv")

# 1. Basic Information

print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nDataset Info:")
print(df.info())

Dataset Shape: (12294, 7)

Column Names: ['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


In [22]:
## Handle missing values, if any.

# Count missing values per column
print("\nMissing Values Before Cleaning:")
print(df.isnull().sum())

# Rating: missing values → replace with mean rating
df['rating'] = df['rating'].fillna(df['rating'].mean())

# Genre: missing values → replace with 'Unknown'
df['genre'] = df['genre'].fillna('Unknown')

# Episodes: replace 'Unknown' with NaN → convert to float
df['episodes'] = df['episodes'].replace('Unknown', np.nan)
df['episodes'] = df['episodes'].astype(float)

# Fill missing episode counts with median
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

# Members: if missing, replace with 0
df['members'] = df['members'].fillna(0)

print("\nMissing Values After Cleaning:")
print(df.isnull().sum())



Missing Values Before Cleaning:
anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Missing Values After Cleaning:
anime_id     0
name         0
genre        0
type        25
episodes     0
rating       0
members      0
dtype: int64
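The output above shows that `type` still has 25 missing values after this cell. One way to close that gap, and to make the `episodes` conversion tolerant of any non-numeric string, is sketched below on a hypothetical three-row frame (`toy` is an illustrative name, not the real data).

```python
import pandas as pd

# Hypothetical sample with the same kinds of gaps as anime.csv
toy = pd.DataFrame({
    "type": ["TV", None, "Movie"],
    "episodes": ["12", "Unknown", "1"],
})

# Label missing broadcast types explicitly instead of leaving NaN
toy["type"] = toy["type"].fillna("Unknown")

# errors="coerce" turns any non-numeric string (such as "Unknown") into NaN
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")
toy["episodes"] = toy["episodes"].fillna(toy["episodes"].median())

print(toy)
print(toy.isnull().sum())
```

Whether "Unknown" is the right fill value for `type` is a judgment call; it keeps the rows while making the gap visible to later encoding steps.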


In [23]:
## Explore the dataset to understand its structure and attributes.

# 3. Preprocessing 'genre' Column

# Convert genre string → list of genres
df['genre_list'] = df['genre'].apply(lambda x: x.split(", ") if isinstance(x, str) else [])

# 4. Preview Cleaned Data

print("\nCleaned Dataset Sample:")
print(df.head(10))



Cleaned Dataset Sample:
   anime_id                                               name  \
0     32281                                     Kimi no Na wa.   
1      5114                   Fullmetal Alchemist: Brotherhood   
2     28977                                           Gintama°   
3      9253                                        Steins;Gate   
4      9969                                      Gintama&#039;   
5     32935  Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...   
6     11061                             Hunter x Hunter (2011)   
7       820                               Ginga Eiyuu Densetsu   
8     15335  Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...   
9     15417                           Gintama&#039;: Enchousen   

                                               genre   type  episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie       1.0    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV      64.0    9.26   
2  Ac

In [24]:
df.columns

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members',
       'genre_list'],
      dtype='object')

In [25]:
df.dtypes

anime_id        int64
name           object
genre          object
type           object
episodes      float64
rating        float64
members         int64
genre_list     object
dtype: object

In [26]:
df.head()

|   | anime_id | name | genre | type | episodes | rating | members | genre_list |
|---|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1.0 | 9.37 | 200630 | [Drama, Romance, School, Supernatural] |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64.0 | 9.26 | 793665 | [Action, Adventure, Drama, Fantasy, Magic, Mil... |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51.0 | 9.25 | 114262 | [Action, Comedy, Historical, Parody, Samurai, ... |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24.0 | 9.17 | 673572 | [Sci-Fi, Thriller] |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51.0 | 9.16 | 151266 | [Action, Comedy, Historical, Parody, Samurai, ... |


In [27]:
### Feature Extraction:
## Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler

df = pd.read_csv("C:\\Users\\moulika\\Downloads\\anime.csv")

# Clean missing values

df['rating'] = df['rating'].fillna(df['rating'].mean())
df['genre'] = df['genre'].fillna("Unknown")
df['episodes'] = df['episodes'].replace('Unknown', np.nan).astype(float)
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

# 1. GENRE FEATURE (Multi-Hot Encoding)

# Convert "Action, Comedy" → ["Action","Comedy"]
df['genre_list'] = df['genre'].apply(lambda x: x.split(", ") if isinstance(x, str) else [])

mlb = MultiLabelBinarizer()
genre_features = mlb.fit_transform(df['genre_list'])

genre_df = pd.DataFrame(genre_features, columns=mlb.classes_)

# 2. NUMERIC FEATURES (Rating + Episodes)

numeric_features = df[['rating', 'episodes']]

scaler = MinMaxScaler()
numeric_scaled = scaler.fit_transform(numeric_features)

numeric_df = pd.DataFrame(numeric_scaled, columns=['rating_scaled', 'episodes_scaled'])

# 3. FINAL FEATURE MATRIX

# Combine genre + rating + episodes features
feature_matrix = pd.concat([genre_df, numeric_df], axis=1)

print("Final feature matrix shape:", feature_matrix.shape)
print(feature_matrix.head())

Final feature matrix shape: (12294, 46)
   Action  Adventure  Cars  Comedy  Dementia  Demons  Drama  Ecchi  Fantasy  \
0       0          0     0       0         0       0      1      0        0   
1       1          1     0       0         0       0      1      0        1   
2       1          0     0       1         0       0      0      0        0   
3       0          0     0       0         0       0      0      0        0   
4       1          0     0       1         0       0      0      0        0   

   Game  ...  Sports  Super Power  Supernatural  Thriller  Unknown  Vampire  \
0     0  ...       0            0             1         0        0        0   
1     0  ...       0            0             0         0        0        0   
2     0  ...       0            0             0         0        0        0   
3     0  ...       0            0             0         1        0        0   
4     0  ...       0            0             0         0        0        0   

   Yaoi  Y
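To make the multi-hot step concrete, here is a minimal sketch of `MultiLabelBinarizer` on three made-up genre lists (the lists and the name `mlb_demo` are illustrative assumptions). Each distinct genre becomes one column, and a row gets a 1 in every column whose genre it lists.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists, in the same shape as df['genre_list']
genre_lists = [["Drama", "Romance"], ["Action", "Drama"], ["Sci-Fi"]]

mlb_demo = MultiLabelBinarizer()
encoded = mlb_demo.fit_transform(genre_lists)

print(list(mlb_demo.classes_))  # column order: genres sorted alphabetically
print(encoded)
```

Because the real feature matrix has many binary genre columns and only two scaled numeric columns, genre overlap is likely to dominate the cosine score; weighting the numeric columns differently would be one thing to experiment with.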

In [28]:
## Convert categorical features into numerical representations if necessary.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

# Handle missing categorical values

df['genre'] = df['genre'].fillna("Unknown")
df['type'] = df['type'].fillna("Unknown")

# 1. GENRE → MULTI-HOT ENCODING

# Convert "Action, Comedy" → ["Action", "Comedy"]
df['genre_list'] = df['genre'].apply(lambda x: x.split(", "))

mlb = MultiLabelBinarizer()
genre_encoded = mlb.fit_transform(df['genre_list'])

genre_df = pd.DataFrame(genre_encoded, columns=mlb.classes_)

# 2. TYPE → ONE-HOT ENCODING

ohe = OneHotEncoder(sparse_output=False)
type_encoded = ohe.fit_transform(df[['type']])

type_df = pd.DataFrame(type_encoded, columns=ohe.get_feature_names_out(['type']))

# 3. Combine Encoded Categorical Features

categorical_numerical = pd.concat([genre_df, type_df], axis=1)

print("Categorical features converted to numeric.")
print("Shape:", categorical_numerical.shape)
print(categorical_numerical.head())

Categorical features converted to numeric.
Shape: (12294, 51)
   Action  Adventure  Cars  Comedy  Dementia  Demons  Drama  Ecchi  Fantasy  \
0       0          0     0       0         0       0      1      0        0   
1       1          1     0       0         0       0      1      0        1   
2       1          0     0       1         0       0      0      0        0   
3       0          0     0       0         0       0      0      0        0   
4       1          0     0       1         0       0      0      0        0   

   Game  ...  Vampire  Yaoi  Yuri  type_Movie  type_Music  type_ONA  type_OVA  \
0     0  ...        0     0     0         1.0         0.0       0.0       0.0   
1     0  ...        0     0     0         0.0         0.0       0.0       0.0   
2     0  ...        0     0     0         0.0         0.0       0.0       0.0   
3     0  ...        0     0     0         0.0         0.0       0.0       0.0   
4     0  ...        0     0     0         0.0         0.0 

In [29]:
## Normalize numerical features if required.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Clean numerical fields
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['episodes'] = df['episodes'].replace('Unknown', np.nan).astype(float)
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

# Select numerical columns for normalization

numeric_cols = ['rating', 'episodes']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform
normalized_values = scaler.fit_transform(df[numeric_cols])

# Create a DataFrame from normalized features
normalized_df = pd.DataFrame(
    normalized_values,
    columns=[col + "_scaled" for col in numeric_cols]
)

print("Normalized numerical features:")
print(normalized_df.head())

Normalized numerical features:
   rating_scaled  episodes_scaled
0       0.924370         0.000000
1       0.911164         0.034673
2       0.909964         0.027518
3       0.900360         0.012658
4       0.899160         0.027518
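Min-max scaling maps each value x to (x - min) / (max - min). The scaled episode values printed above are consistent with a column minimum of 1 and a maximum of 1818; those two bounds are inferred from the output, not read from the data, so treat them as an assumption. A small sketch of the arithmetic, with `min_max` as a hypothetical helper:

```python
import numpy as np

def min_max(values, lo, hi):
    """Rescale values linearly so that lo maps to 0 and hi maps to 1."""
    values = np.asarray(values, dtype=float)
    return (values - lo) / (hi - lo)

# Episode counts of the first rows shown earlier; lo and hi are inferred bounds
scaled = min_max([1, 64, 51, 24], lo=1, hi=1818)
print(np.round(scaled, 6))
```

Note how a single very long series compresses almost every other anime into the bottom few percent of the range, which limits how much `episodes_scaled` can influence the similarity score.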


In [30]:
### Recommendation System:
## Design a function to recommend anime based on cosine similarity.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Clean missing values

df['genre'] = df['genre'].fillna("Unknown")
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['episodes'] = df['episodes'].replace("Unknown", np.nan).astype(float)
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

# GENRE → Multi-Hot Encoding

df['genre_list'] = df['genre'].apply(lambda x: x.split(", "))

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(df['genre_list'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_)

# NUMERICAL FEATURES → Normalization

scaler = MinMaxScaler()
numeric_scaled = scaler.fit_transform(df[['rating', 'episodes']])
numeric_df = pd.DataFrame(numeric_scaled, columns=['rating_scaled', 'episodes_scaled'])

# FINAL FEATURE MATRIX

feature_matrix = pd.concat([genre_df, numeric_df], axis=1)

# Compute cosine similarity matrix (Anime × Anime)
cosine_sim = cosine_similarity(feature_matrix)

# RECOMMENDATION FUNCTION

def recommend_anime(title, n=10):
    """
    Recommend top-N similar anime based on cosine similarity.
    title: Anime name (string)
    n: number of recommendations
    """
    # Ensure title exists
    if title not in df['name'].values:
        return f"Anime '{title}' not found in dataset."

    # Get index of the target anime
    idx = df[df['name'] == title].index[0]

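    # Get similarity scores for this anime, skipping the target itself
    # (relying on the target sorting first can fail when other rows tie with it)
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]

    # Sort by similarity, highest first, and keep the top n
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:n]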

    # Fetch recommended anime names
    recommended_indices = [i[0] for i in sim_scores]
    recommended_titles = df['name'].iloc[recommended_indices].tolist()

    return recommended_titles

# EXAMPLE USAGE

example = recommend_anime("Steins;Gate", n=10)
print("Recommended Anime:", example)

Recommended Anime: ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Steins;Gate 0', 'Under the Dog', 'Loups=Garous', 'Loups=Garous Pilot', 'Subarashii Sekai Ryokou: New York Tabi &quot;Computopia Seireki Nisennen no Monogatari&quot;', 'Kaitei Toshi no Dekiru made', 'Sakasama no Patema: Beginning of the Day']


In [31]:
## Given a target anime, recommend a list of similar anime based on cosine similarity scores.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Cleaning
df['genre'] = df['genre'].fillna("Unknown")
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce')  # 'Unknown' → NaN
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

# Feature Extraction
df['genre_list'] = df['genre'].apply(lambda x: x.split(", "))
mlb = MultiLabelBinarizer()
genre_vec = mlb.fit_transform(df['genre_list'])

scaler = MinMaxScaler()
num_vec = scaler.fit_transform(df[['rating', 'episodes']])

# Final matrix
features = np.hstack([genre_vec, num_vec])
cos_sim = cosine_similarity(features)

# Recommend function
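def recommend(title, n=5):
    """Return up to n (name, score) pairs; an empty list if the title is unknown."""
    if title not in df['name'].values:
        return []
    idx = df[df['name'] == title].index[0]
    # Drop the target itself before ranking so ties cannot push it into the list
    scores = [(i, s) for i, s in enumerate(cos_sim[idx]) if i != idx]
    scores = sorted(scores, key=lambda x: x[1], reverse=True)[:n]
    return [(df.iloc[i]['name'], round(float(s), 4)) for i, s in scores]

# Input
target = input("Enter anime name: ")

# Output
recs = recommend(target, 5)
if not recs:
    # Handle unknown titles here; unpacking an error string as (name, score) would fail
    print(f"'{target}' not found.")
else:
    print("\nRecommendations:")
    for name, score in recs:
        print(f"{name} → similarity: {score}")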

Enter anime name:  Naruto



Recommendations:
Naruto: Shippuuden → similarity: 0.9987
Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi → similarity: 0.9987
Boruto: Naruto the Movie → similarity: 0.9986
Naruto x UT → similarity: 0.9986
Naruto: Shippuuden Movie 4 - The Lost Tower → similarity: 0.9986


In [32]:
## Experiment with different threshold values for similarity scores to adjust the recommendation list size.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

df['genre'] = df['genre'].fillna("Unknown")
df['rating'] = df['rating'].fillna(df['rating'].mean())
df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce')  # 'Unknown' → NaN
df['episodes'] = df['episodes'].fillna(df['episodes'].median())

# Feature extraction
df['genre_list'] = df['genre'].apply(lambda x: x.split(", "))
mlb = MultiLabelBinarizer()
genre_vec = mlb.fit_transform(df['genre_list'])

scaler = MinMaxScaler()
num_vec = scaler.fit_transform(df[['rating', 'episodes']])

features = np.hstack([genre_vec, num_vec])
cos_sim = cosine_similarity(features)

# Threshold recommendation function
def recommend_threshold(title, threshold=0.7):
    if title not in df['name'].values:
        return [f"'{title}' not found."]
    idx = df[df['name'] == title].index[0]
    scores = list(enumerate(cos_sim[idx]))

    # Keep only anime above threshold
    filtered = [(df.iloc[i]['name'], round(s, 4))
                for i, s in scores if s >= threshold and i != idx]

    return filtered if filtered else ["No anime found above threshold."]

# Input
title = input("Enter anime name: ")
th = float(input("Enter similarity threshold (0–1): "))

# Output
print("\nRecommended Anime:")
for x in recommend_threshold(title, th):
    print(x)

Enter anime name:  Naruto
Enter similarity threshold (0–1):  0.75



Recommended Anime:
('Katekyo Hitman Reborn!', np.float64(0.9052))
('Boku no Hero Academia', np.float64(0.82))
('Saint Seiya: The Lost Canvas - Meiou Shinwa 2', np.float64(0.7557))
('Dragon Ball Z', np.float64(0.8593))
('Shijou Saikyou no Deshi Kenichi', np.float64(0.8201))
('Saint Seiya: The Lost Canvas - Meiou Shinwa', np.float64(0.7553))
('Dragon Ball', np.float64(0.7561))
('One Piece Film: Strong World Episode 0', np.float64(0.7549))
('Boruto: Naruto the Movie', np.float64(0.9986))
('Shijou Saikyou no Deshi Kenichi OVA', np.float64(0.8192))
('Dragon Ball Kai (2014)', np.float64(0.8574))
('Bleach', np.float64(0.8203))
('Dragon Ball Kai', np.float64(0.8574))
('Naruto: Shippuuden', np.float64(0.9987))
('The Last: Naruto the Movie', np.float64(0.8187))
('One Piece: Episode of Luffy - Hand Island no Bouken', np.float64(0.7536))
('Naruto: Shippuuden Movie 6 - Road to Ninja', np.float64(0.8186))
('One Piece Movie 4: Dead End no Bouken', np.float64(0.7532))
('Bleach Movie 4: Jigoku-hen', n
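The list above comes back in dataset order and prints scores as `np.float64(...)`, which makes it hard to see how the threshold changes the list. A possible variant, sketched here on a made-up similarity row, casts scores to plain floats and sorts them best first. The names `above_threshold`, `names`, and `sim_row` are hypothetical.

```python
import numpy as np

def above_threshold(names, sim_row, idx, threshold=0.7):
    """(name, score) pairs with score >= threshold, best first, target excluded."""
    hits = [(names[i], round(float(s), 4))
            for i, s in enumerate(sim_row)
            if i != idx and s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity row for a target at position 0
names = ["Target", "Sequel", "Spin-off", "Unrelated"]
sim_row = np.array([1.0, 0.9987, 0.82, 0.31])

# Raising the threshold shrinks the recommendation list
for th in (0.5, 0.9):
    print(th, above_threshold(names, sim_row, idx=0, threshold=th))
```

Running the same loop over several thresholds on the real similarity matrix would show directly how list size responds to the cutoff.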

In [35]:
## Analyze the performance of the recommendation system and identify areas of improvement.

def recommend(title, n=5):
    if title not in df['name'].values:
        return []
    idx = df[df['name'] == title].index[0]
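    # Exclude the target itself before ranking (ties could otherwise let it through)
    scores = [(i, s) for i, s in enumerate(cos_sim[idx]) if i != idx]
    scores = sorted(scores, key=lambda x: x[1], reverse=True)[:n]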
    return [(df.iloc[i]['name'], round(s, 4)) for i, s in scores]

def analyze_performance(title):
    recs = recommend(title, 5)

    if not recs:
        print("Anime not found.")
        return

    # average similarity
    avg_sim = sum([s for _, s in recs]) / len(recs)

    # average genre overlap
    target_genres = set(df[df['name'] == title].iloc[0]['genre_list'])
    overlaps = []
    for name, score in recs:
        g = set(df[df['name'] == name].iloc[0]['genre_list'])
        overlap = len(target_genres & g) / len(target_genres) if target_genres else 0
        overlaps.append(overlap)
    avg_overlap = sum(overlaps) / len(overlaps)

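    # Rough diversity proxy: 1 minus the mean similarity to the target
    # (it does not measure how different the recommendations are from each other)
    diversity = 1 - avg_sim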

    # -------- PRINT OUTPUT --------
    print("\n--- Recommendation System Performance ---")
    print("Target Anime:", title)
    print("Average Similarity Score:", round(avg_sim, 4))
    print("Average Genre Overlap:", round(avg_overlap, 4))
    print("Diversity:", round(diversity, 4))

    print("\nRecommended Anime:")
    for name, score in recs:
        print(f"{name} → similarity: {score}")

    print("\n--- Areas of Improvement ---")
    if avg_overlap < 0.5:
        print(" Improve genre matching.")
    if avg_sim > 0.85:
        print(" Recommendations may be too similar (low diversity).")
    if diversity < 0.2:
        print(" Add more features to improve diversity.")
    else:
        print(" System has healthy diversity.")

# USER INPUT + OUTPUT

target = input("Enter anime name for evaluation: ")
analyze_performance(target)

Enter anime name for evaluation:  Naruto



--- Recommendation System Performance ---
Target Anime: Naruto
Average Similarity Score: 0.9986
Average Genre Overlap: 1.0
Diversity: 0.0014

Recommended Anime:
Naruto: Shippuuden → similarity: 0.9987
Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi → similarity: 0.9987
Boruto: Naruto the Movie → similarity: 0.9986
Naruto x UT → similarity: 0.9986
Naruto: Shippuuden Movie 4 - The Lost Tower → similarity: 0.9986

--- Areas of Improvement ---
 Recommendations may be too similar (low diversity).
 Add more features to improve diversity.
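The diversity figure above is derived from similarity to the target only. A common complementary measure is intra-list diversity: the mean of (1 - cosine similarity) over every pair of recommended items. The sketch below is one minimal way to compute it, assuming at least two non-zero feature vectors; `intra_list_diversity` and the two example lists are hypothetical.

```python
import numpy as np

def intra_list_diversity(vectors):
    """Mean (1 - cosine similarity) over all distinct pairs of recommended items."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)  # normalize each row
    sim = unit @ unit.T                                  # pairwise cosine similarities
    upper = np.triu_indices(len(v), k=1)                 # each pair counted once
    return float(np.mean(1.0 - sim[upper]))

# Hypothetical recommendation lists as genre vectors
same = [[1, 1, 0], [1, 1, 0], [1, 1, 0]]    # identical items
mixed = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # no genre shared between any pair

print(intra_list_diversity(same), intra_list_diversity(mixed))
```

A list made up of sequels and movies from one franchise would score near 0 on this measure, which matches the "too similar" warning printed above.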


## Interview Questions
### 1. Can you explain the difference between user-based and item-based collaborative filtering?
User-based collaborative filtering recommends items to a user based on the preferences of similar users: it identifies users with similar tastes and suggests items those users liked. Item-based collaborative filtering instead focuses on the relationships between items: it recommends items similar to those the user has already liked, where two items count as similar when the users who interacted with them tend to rate them alike.
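The difference comes down to which axis of the user-item rating matrix is compared. The sketch below uses a made-up 4 × 3 matrix (rows are users, columns are items, 0 means "not rated") and treats unrated entries as zeros, which is a simplification; `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

# Hypothetical ratings: rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [0, 1, 5],
    [1, 0, 4],
], dtype=float)

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_matrix(ratings)    # user-based: compare rows (4 x 4)
item_sim = cosine_matrix(ratings.T)  # item-based: compare columns (3 x 3)

print(np.round(user_sim, 3))
print(np.round(item_sim, 3))
```

Here users 0 and 1 come out as close neighbours, and items 0 and 1 are far more similar to each other than either is to item 2.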

### 2. What is collaborative filtering, and how does it work?
Collaborative filtering is a recommendation technique that makes predictions about a user's interests by collecting preferences from many users. It works by analyzing patterns in user-item interactions, such as ratings or purchase history, to identify similarities between users or items. Based on these similarities, it recommends items that similar users have liked or items similar to those the user has interacted with.
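As a rough sketch of the prediction step, the user-based variant can estimate a missing rating as a similarity-weighted average of the ratings other users gave that item. The matrix, the function name `predict_user_based`, and the choice to treat unrated entries as zeros when computing similarity are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical ratings: rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5, 4, 0],  # user 0 has not rated item 2
    [4, 5, 1],
    [0, 1, 5],
    [1, 0, 4],
], dtype=float)

def predict_user_based(ratings, user, item):
    """Similarity-weighted mean of other users' ratings for the given item."""
    unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
    sims = unit @ unit[user]           # cosine similarity of every user to `user`
    raters = ratings[:, item] > 0      # users who actually rated the item
    raters[user] = False               # never use the target user's own entry
    weights = sims[raters]
    if weights.sum() == 0:
        return None                    # no usable neighbours
    return float(weights @ ratings[raters, item] / weights.sum())

pred = predict_user_based(ratings, user=0, item=2)
print(round(pred, 3))
```

User 0's closest neighbour (user 1) rated item 2 poorly, so the prediction lands toward the low end even though two less similar users rated it highly.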