
In [3]:
import pandas as pd

#Load the anime dataset
anime_df = pd.read_csv("anime.csv")  # replace with your file path




In [4]:
# Display the first few rows
anime_df.head()

,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [5]:
# Basic information about the dataset
anime_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [6]:
# Check for missing values
anime_df.isnull().sum()


anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [7]:
# Summary statistics for numerical columns
anime_df.describe()

,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


In [8]:
# View unique values in categorical columns (optional)
print(anime_df['type'].unique())
print(anime_df['genre'].unique())

['Movie' 'TV' 'OVA' 'Special' 'Music' 'ONA' nan]
['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen' ...
 'Hentai, Sports' 'Drama, Romance, School, Yuri' 'Hentai, Slice of Life']


In [10]:
# Handling missing values

# Fill missing genres with 'Unknown' (assign back instead of chained inplace,
# which acts on a temporary copy and raises a FutureWarning in recent pandas)
anime_df['genre'] = anime_df['genre'].fillna('Unknown')

# Fill missing type with 'Unknown'
anime_df['type'] = anime_df['type'].fillna('Unknown')


In [11]:
# Drop rows with missing ratings or members if they are essential

anime_df.dropna(subset=['rating', 'members'], inplace=True)

In [12]:
# Verify no missing values remain
print(anime_df.isnull().sum())

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


#**FEATURE EXTRACTION**

In [14]:
#Decide on Features for Recommendation System

# Select the useful features
features = anime_df[['genre', 'type', 'rating', 'members']]

# Display first few rows of selected features
features.head()

,genre,type,rating,members
0,"Drama, Romance, School, Supernatural",Movie,9.37,200630
1,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,9.26,793665
2,"Action, Comedy, Historical, Parody, Samurai, S...",TV,9.25,114262
3,"Sci-Fi, Thriller",TV,9.17,673572
4,"Action, Comedy, Historical, Parody, Samurai, S...",TV,9.16,151266


In [16]:
# Converting categorical variables

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder


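# Convert genres into tokenized features (one token per comma-separated genre).
# token_pattern=None avoids the warning that the default pattern is unused
# when a custom tokenizer is supplied.
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(', '), token_pattern=None)
genre_matrix = vectorizer.fit_transform(anime_df['genre'])

# One-hot encode the type column (Movie, TV, OVA, ...)
encoder = OneHotEncoder(handle_unknown='ignore')
type_matrix = encoder.fit_transform(anime_df[['type']])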

In [17]:
#Normalise Numerical Columns

from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler()
numeric_features = scaler.fit_transform(anime_df[['rating', 'members']])

In [18]:
#Combine all features

from scipy.sparse import hstack

# Combine sparse (genre + type) with dense (rating + members)
feature_matrix = hstack([genre_matrix, type_matrix, numeric_features])

print("Final feature matrix shape:", feature_matrix.shape)

Final feature matrix shape: (12064, 52)


#**RECOMMENDATION SYSTEM**

In [19]:
#Compute cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity matrix from the feature matrix
cosine_sim = cosine_similarity(feature_matrix, feature_matrix)
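
As a minimal sketch of what this step computes, the toy example below stacks made-up genre indicators with made-up scaled numeric columns and compares the rows. The `toy_*` names and all values are illustrative assumptions, not data from `anime.csv`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.metrics.pairwise import cosine_similarity

# Made-up genre indicator columns (Action, Comedy, Drama) for three titles
toy_genres = csr_matrix([[1, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])

# Made-up numeric columns (rating, members), assumed already scaled to [0, 1]
toy_numeric = np.array([[0.9, 0.5],
                        [0.8, 0.1],
                        [0.9, 0.5]])

# Same combination pattern as the notebook: sparse block + dense block
toy_features = hstack([toy_genres, toy_numeric]).tocsr()

# Entry [i, j] is the cosine of the angle between rows i and j
toy_sim = cosine_similarity(toy_features)
print(np.round(toy_sim, 3))
```

Titles 0 and 1 share both genres and score much higher than titles 0 and 2, even though 0 and 2 have identical numeric values. With 0/1 genre counts next to numerics in [0, 1], genre overlap tends to dominate the similarity.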

In [20]:
def recommend_anime(title, top_n=5, threshold=0.3):

    # Check if the anime exists
    if title not in anime_df['name'].values:
        return f"Anime '{title}' not found in the dataset."

    # Get the row position (not the index label) of the given anime.
    # cosine_sim is positional, and dropna() left gaps in the index labels.
    idx = (anime_df['name'].values == title).argmax()

    # Get pairwise similarity scores, excluding the anime itself
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort by similarity score (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Apply threshold filter (exclude very low similarities)
    sim_scores = [(i, score) for i, score in sim_scores if score >= threshold]

    # Keep the top_n most similar anime
    top_scores = sim_scores[:top_n]
    sim_indices = [i for i, score in top_scores]

    # Return recommended anime names with similarity scores
    # (.copy() avoids assigning into a view of anime_df)
    recommendations = anime_df.iloc[sim_indices][['name', 'genre', 'type', 'rating']].copy()
    recommendations['similarity'] = [score for i, score in top_scores]

    return recommendations

In [21]:
# Example: Recommend similar anime to "Naruto"
print(recommend_anime("Naruto", top_n=5, threshold=0.4))

                       name  \
615      Naruto: Shippuuden   
175  Katekyo Hitman Reborn!   
206           Dragon Ball Z   
582                  Bleach   
588         Dragon Ball Kai   

                                                 genre type  rating  \
615  Action, Comedy, Martial Arts, Shounen, Super P...   TV    7.94   
175               Action, Comedy, Shounen, Super Power   TV    8.37   
206  Action, Adventure, Comedy, Fantasy, Martial Ar...   TV    8.32   
582  Action, Comedy, Shounen, Super Power, Supernat...   TV    7.95   
588  Action, Adventure, Comedy, Fantasy, Martial Ar...   TV    7.95   

     similarity  
615    0.998469  
175    0.911802  
206    0.872676  
582    0.856316  
588    0.856006  


In [22]:
recommend_anime("Naruto: Shippuuden", top_n=10, threshold=0.5)

,name,genre,type,rating,similarity
841,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,7.81,0.998469
175,Katekyo Hitman Reborn!,"Action, Comedy, Shounen, Super Power",TV,8.37,0.917997
206,Dragon Ball Z,"Action, Adventure, Comedy, Fantasy, Martial Ar...",TV,8.32,0.876986
588,Dragon Ball Kai,"Action, Adventure, Comedy, Fantasy, Martial Ar...",TV,7.95,0.864897
1930,Dragon Ball Super,"Action, Adventure, Comedy, Fantasy, Martial Ar...",TV,7.4,0.862907
2615,Medaka Box,"Action, Comedy, Ecchi, Martial Arts, School, S...",TV,7.21,0.862143
3038,Tenjou Tenge,"Action, Comedy, Ecchi, Martial Arts, School, S...",TV,7.1,0.861343
1209,Medaka Box Abnormal,"Action, Comedy, Ecchi, Martial Arts, School, S...",TV,7.63,0.861032
515,Dragon Ball Kai (2014),"Action, Adventure, Comedy, Fantasy, Martial Ar...",TV,8.01,0.860611
582,Bleach,"Action, Comedy, Shounen, Super Power, Supernat...",TV,7.95,0.854421


#**EVALUATION**

In [70]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(anime_df, test_size=0.2, random_state=42)

# Reset index for clean alignment
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

In [71]:
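# Create mapping from anime name → index in train_df
name_to_index = {name: idx for idx, name in enumerate(train_df['name'])}

# Add mapped train indices to test_df. Because the split is disjoint, only test
# titles whose name also occurs in train_df get an index; all others map to NaN.
test_df['train_index'] = test_df['name'].map(name_to_index)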
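
`cosine_sim_train` is passed to the evaluation but is not defined in the cells shown. A minimal sketch of one way it could be produced, assuming the same genre, type, rating and members features are rebuilt on the training rows only. `build_similarity` and the toy data are hypothetical names introduced here, not the notebook's own code.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def build_similarity(df):
    """Hypothetical helper: rebuild the features for df and return the
    pairwise cosine similarity matrix (rows follow the order of df)."""
    genre = CountVectorizer(
        tokenizer=lambda x: x.split(', '), token_pattern=None
    ).fit_transform(df['genre'])
    kind = OneHotEncoder(handle_unknown='ignore').fit_transform(df[['type']])
    numeric = MinMaxScaler().fit_transform(df[['rating', 'members']])
    return cosine_similarity(hstack([genre, kind, numeric]).tocsr())


# Made-up stand-in for train_df (assumption: same column names as anime_df)
toy_train_df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'genre': ['Action, Comedy', 'Action, Shounen', 'Drama, Romance'],
    'type': ['TV', 'TV', 'Movie'],
    'rating': [7.8, 8.0, 9.0],
    'members': [100, 200, 50],
})

toy_sim_train = build_similarity(toy_train_df)
print(toy_sim_train.shape)
```

Under this assumption, `cosine_sim_train = build_similarity(train_df)` would give a matrix whose row order matches the reset index of `train_df`, which is what `train_index` refers to.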

In [75]:
import numpy as np
import pandas as pd

def evaluate_recommendations(test_df, train_df, cosine_sim_train, top_k=5, threshold=0.4):
    precision_list, recall_list, f1_list = [], [], []

    for _, row in test_df.iterrows():
        train_idx = row['train_index']
        if pd.isna(train_idx):  # skip if anime not in training set
            continue

        genre = row['genre']
        if pd.isna(genre):
            continue

        # Similarity scores for this anime
        sim_scores = list(enumerate(cosine_sim_train[int(train_idx)]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # Filter by threshold
        sim_scores = [(i, score) for i, score in sim_scores if score >= threshold]

        # Pick top-k recommendations
        top_recs = [i for i, _ in sim_scores[:top_k]]
        rec_genres = train_df.iloc[top_recs]['genre'].dropna().tolist()

        # Relevant recommendations = at least one genre match
        relevant = [g for g in rec_genres if any(gen in g for gen in genre.split(", "))]

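        # Compute metrics. "Possible relevant" is every training title sharing at least
        # one genre, so recall is bounded above by top_k / len(possible_relevant).
        possible_relevant = train_df[train_df['genre'].str.contains('|'.join(genre.split(", ")), na=False)]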
        precision = len(relevant) / len(top_recs) if top_recs else 0
        recall = len(relevant) / len(possible_relevant) if len(possible_relevant) > 0 else 0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0

        precision_list.append(precision)
        recall_list.append(recall)
        f1_list.append(f1)

    metrics = {
        "Precision": np.mean(precision_list) if precision_list else 0,
        "Recall": np.mean(recall_list) if recall_list else 0,
        "F1": np.mean(f1_list) if f1_list else 0,
    }
    return metrics


In [76]:
metrics = evaluate_recommendations(test_df, train_df, cosine_sim_train, top_k=5, threshold=0.4)
print("Evaluation Results:", metrics)

Evaluation Results: {'Precision': np.float64(0.5), 'Recall': np.float64(0.000675310642895732), 'F1': np.float64(0.001348799568384138)}
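
The gap between precision and recall follows from how `evaluate_recommendations` defines them: the recall numerator can never exceed `top_k`, while its denominator counts every training title that shares at least one genre. A small sketch with made-up counts (the helper name and the numbers are illustrative, not taken from the dataset):

```python
def precision_recall_f1(num_relevant_in_top_k, k, num_possible_relevant):
    """Same formulas as evaluate_recommendations, applied to plain counts."""
    precision = num_relevant_in_top_k / k if k else 0.0
    recall = (num_relevant_in_top_k / num_possible_relevant
              if num_possible_relevant else 0.0)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1


# Assumed scenario: all 5 recommendations share a genre with the query,
# and 4000 training titles share at least one genre with it
p, r, f1 = precision_recall_f1(num_relevant_in_top_k=5, k=5, num_possible_relevant=4000)
print(p, r, f1)
```

Even with perfect precision, recall here is only 5 / 4000, and F1 is pulled down with it. A recall measured against a smaller, more specific relevance set would be easier to interpret.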


#**INTERVIEW QUESTIONS**

**1. Can you explain the difference between user-based and item-based collaborative filtering?**

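Both are forms of collaborative filtering, which relies on user-item interactions (e.g., ratings, clicks) rather than item attributes. They differ in what is compared. User-based collaborative filtering finds users whose past behavior resembles the target user's and recommends items those similar users liked. Item-based collaborative filtering finds items similar to the ones the target user already liked, where two items count as similar if the same users tend to rate them alike, and recommends those.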
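
To make the user-based versus item-based distinction concrete, here is a minimal sketch on a made-up rating matrix (all values are illustrative assumptions). The only difference is whether rows (users) or columns (items) are compared.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# User-based: compare rows, giving one similarity per pair of users
user_sim = cosine_similarity(ratings)

# Item-based: compare columns, giving one similarity per pair of items
item_sim = cosine_similarity(ratings.T)

print(user_sim.shape, item_sim.shape)
```

In this toy data, users 0 and 1 rate the first two items highly and come out far more similar to each other than to user 2.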


**2. What is collaborative filtering, and how does it work?**


Collaborative filtering is a technique used in recommendation systems that makes predictions about a user's interests by collecting preferences from many users. The core idea is simple: people who agreed in the past tend to agree again.

Here's how it works:

- Instead of relying on item attributes (like genre or price), collaborative filtering looks at user behavior, such as ratings, clicks, purchases, or watch history.

- It builds a user-item interaction matrix, where rows represent users and columns represent items. The values in the matrix reflect how much a user liked or interacted with an item.

- Then, it finds patterns in this matrix to recommend items. For example:

  A. If User A and User B both liked the same shows, and User A liked a new show, the system might recommend that show to User B.

  B. Or, if a user liked Item X, and Item X is often liked alongside Item Y by other users, the system might suggest Item Y.


There are two main types:

A. User-based collaborative filtering: Finds users similar to the target user and recommends items they liked.

B. Item-based collaborative filtering: Finds items similar to what the user already liked and recommends those.
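
The steps above can be sketched for the item-based case as follows. The rating matrix, the item names, and the similarity-weighted average used for scoring are illustrative assumptions, one common choice among several rather than a definitive implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

items = ["Item W", "Item X", "Item Y", "Item Z"]

# Made-up user-item matrix: rows are users, columns are items, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [5, 0, 0, 1],   # target user: has rated Item W and Item Z only
])
target = 3

# Item-item similarity from the columns of the matrix
item_sim = cosine_similarity(ratings.T)

rated = np.flatnonzero(ratings[target] > 0)
unrated = np.flatnonzero(ratings[target] == 0)

# Score each unrated item as a similarity-weighted average of the
# target user's existing ratings
scores = {}
for j in unrated:
    weights = item_sim[j, rated]
    scores[items[j]] = float(weights @ ratings[target, rated] / weights.sum())

best_item = max(scores, key=scores.get)
print(scores)
print("Recommend:", best_item)
```

The target user liked Item W, and other users who liked Item W also liked Item X, so Item X scores highest. A user-based variant would instead weight other users' ratings by `cosine_similarity(ratings)`.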


        