## Recommendation System 

In [1]:
#Load libraries and dataset
import pandas as pd
import numpy as np

df=pd.read_csv("anime.csv")
df.shape

(12294, 7)

In [2]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
# Basic info about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [4]:
df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [5]:
data=df.copy()

In [6]:
data.genre.unique()

array(['Drama, Romance, School, Supernatural',
       'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen',
       'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen',
       ..., 'Hentai, Sports', 'Drama, Romance, School, Yuri',
       'Hentai, Slice of Life'], shape=(3265,), dtype=object)

In [7]:
data.type.unique()

array(['Movie', 'TV', 'OVA', 'Special', 'Music', 'ONA', nan], dtype=object)

In [8]:
data.episodes.unique()

array(['1', '64', '51', '24', '10', '148', '110', '13', '201', '25', '22',
       '75', '4', '26', '12', '27', '43', '74', '37', '2', '11', '99',
       'Unknown', '39', '101', '47', '50', '62', '33', '112', '23', '3',
       '94', '6', '8', '14', '7', '40', '15', '203', '77', '291', '120',
       '102', '96', '38', '79', '175', '103', '70', '153', '45', '5',
       '21', '63', '52', '28', '145', '36', '69', '60', '178', '114',
       '35', '61', '34', '109', '20', '9', '49', '366', '97', '48', '78',
       '358', '155', '104', '113', '54', '167', '161', '42', '142', '31',
       '373', '220', '46', '195', '17', '1787', '73', '147', '127', '16',
       '19', '98', '150', '76', '53', '124', '29', '115', '224', '44',
       '58', '93', '154', '92', '67', '172', '86', '30', '276', '59',
       '72', '330', '41', '105', '128', '137', '56', '55', '65', '243',
       '193', '18', '191', '180', '91', '192', '66', '182', '32', '164',
       '100', '296', '694', '95', '68', '117', '151', '130',

In [10]:
#Convert 'episodes' - replace 'Unknown' with NaN then convert to numeric
data['episodes']=data['episodes'].replace('Unknown',np.nan)
data['episodes']=pd.to_numeric(data['episodes'],errors='coerce')

In [11]:
data['episodes']=data['episodes'].fillna(data['episodes'].median())
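
To see what the conversion and median fill do, here is a minimal sketch on a made-up Series. The values are illustrative and not taken from `anime.csv`; the names `episodes`, `episodes_num`, and `episodes_filled` are hypothetical.

```python
import pandas as pd

# Toy column; values are made up for illustration.
episodes = pd.Series(["1", "64", "Unknown", "12"])

# errors="coerce" turns anything non-numeric (here "Unknown") into NaN.
episodes_num = pd.to_numeric(episodes, errors="coerce")

# median() skips NaN by default, so the median is taken over 1, 64 and 12.
episodes_filled = episodes_num.fillna(episodes_num.median())

print(episodes_filled.tolist())  # [1.0, 64.0, 12.0, 12.0]
```

Because `errors="coerce"` already maps `'Unknown'` to NaN, an explicit `replace('Unknown', np.nan)` beforehand is optional.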

In [12]:
#fill missing 'type' with 'Unknown'
data['type']=data['type'].fillna('Unknown')

In [13]:
#For 'genre' - fill missing with 'Unknown' so we keep row but mark as unknown
data['genre']=data['genre'].fillna('Unknown')

In [14]:
#For 'rating' - fill missing with the median rating
median_rating=data['rating'].median()
data['rating']=data['rating'].fillna(median_rating)

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  float64
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 672.5+ KB


In [16]:
data.type.unique()

array(['Movie', 'TV', 'OVA', 'Special', 'Music', 'ONA', 'Unknown'],
      dtype=object)

In [17]:
data.isna().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [None]:
#3. Feature extraction
#We'll use: genres (multi-hot), type (one-hot), and min-max scaled episodes, rating, and members
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler

feat_df=data[['anime_id','name','genre','type','episodes','rating','members']].copy()

In [19]:
#Process genres: split by comma
def split_genres(x):
    if x=='Unknown':
        return []
    return [g.strip() for g in x.split(',') if g.strip()!='']

feat_df['genre_list']=feat_df['genre'].apply(split_genres)
#genre is a string like "Action, Comedy, Drama". We split it on commas into a list of genres: ["Action", "Comedy", "Drama"].
#If genre is "Unknown", we return an empty list.
#The new column genre_list now holds one list of genres per anime.

In [20]:
#MultiLabelBinarizer turns a list of labels into a multi-hot encoded vector.
mlb=MultiLabelBinarizer()
genre_matrix=mlb.fit_transform(feat_df['genre_list'])
genre_cols=['genre_' + g.replace(' ', '_').replace('-', '_') for g in mlb.classes_]
genre_df=pd.DataFrame(genre_matrix, columns=genre_cols, index=feat_df.index)
#genre_df is a wide dataframe with 1 column per genre (genre_Action, genre_Comedy, etc.), values are 0/1.
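
As a small illustration of multi-hot encoding, the sketch below runs `MultiLabelBinarizer` on three made-up genre lists. The names `genre_lists`, `toy_mlb`, and `toy_matrix` are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up items; the last one has no known genre (empty list).
genre_lists = [["Action", "Comedy"], ["Drama"], []]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(genre_lists)

# classes_ holds the column order (sorted labels).
print(list(toy_mlb.classes_))  # ['Action', 'Comedy', 'Drama']
print(toy_matrix.tolist())     # [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
```

An item with an empty list becomes an all-zero row, which is how the anime with an "Unknown" genre end up encoded.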

In [21]:
# One-hot for type (simple)
type_ohe=pd.get_dummies(feat_df['type'],prefix='type')

In [22]:
# Prepare numeric columns
num_df=feat_df[['episodes','rating','members']].copy()

In [23]:
#Normalize numeric columns (episodes, rating, members)
scaler=MinMaxScaler()
num_scaled=scaler.fit_transform(num_df[['episodes','rating','members']])
num_scaled_df=pd.DataFrame(
    num_scaled, 
    columns=['episodes_scaled','rating_scaled','members_scaled'], 
    index=feat_df.index
)
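
Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). For a heavy-tailed count such as `members`, one optional variation is to apply `np.log1p` before scaling. The sketch below compares the two on made-up values; it is an alternative to consider, not what the cell above does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up member counts spanning several orders of magnitude.
members = np.array([[10.0], [1_000.0], [1_000_000.0]])

# Plain min-max: the middle value is squeezed close to 0.
plain = MinMaxScaler().fit_transform(members).ravel()

# log1p first, then min-max: the middle value keeps more resolution.
logged = MinMaxScaler().fit_transform(np.log1p(members)).ravel()

print(plain.round(4))
print(logged.round(4))
```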

In [24]:
#Combine all features into single dataframe
features=pd.concat([genre_df,type_ohe,num_scaled_df],axis=1)

print('Feature matrix shape:',features.shape)
features.head()

Feature matrix shape: (12294, 53)


Unnamed: 0,genre_Action,genre_Adventure,genre_Cars,genre_Comedy,genre_Dementia,genre_Demons,genre_Drama,genre_Ecchi,genre_Fantasy,genre_Game,...,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown,episodes_scaled,rating_scaled,members_scaled
0,0,0,0,0,0,0,1,0,0,0,...,True,False,False,False,False,False,False,0.0,0.92437,0.197872
1,1,1,0,0,0,0,1,0,1,0,...,False,False,False,False,False,True,False,0.034673,0.911164,0.78277
2,1,0,0,1,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.027518,0.909964,0.112689
3,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.012658,0.90036,0.664325
4,1,0,0,1,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.027518,0.89916,0.149186


In [25]:
#4.Build cosine similarity recommender
from sklearn.metrics.pairwise import cosine_similarity

#Compute cosine similarity matrix on feature vectors
feature_matrix=features.to_numpy(dtype=float)  #explicit float array (the frame mixes int, bool and float columns)
cos_sim=cosine_similarity(feature_matrix)
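
Cosine similarity between two vectors u and v is u·v / (‖u‖ ‖v‖). The sketch below checks a hand-written version against scikit-learn on a made-up 3×3 matrix; `toy`, `cosine`, and `sim` are hypothetical names.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up feature vectors.
toy = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise similarities between all rows; entry [i, j] compares row i with row j.
sim = cosine_similarity(toy)

print(round(cosine(toy[0], toy[1]), 4))  # 0.7071
print(round(float(sim[0, 2]), 4))        # 0.0
```

The full pairwise matrix grows quadratically: for the 12,294 rows here it has about 151 million entries, roughly 1.2 GB as float64.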

In [26]:
#simple recommender
def recommend_by_name(anime_name,k=5):
    #find index of anime
    idx_list=feat_df.index[feat_df['name']==anime_name].tolist()
    if len(idx_list)==0:
        print("Anime not found. Type the exact name.")
        return []
    idx=idx_list[0]
    
    #get similarity scores with all others
    scores=[]
    for i in range(len(cos_sim[idx])):
        if i!=idx:   # skip itself
            scores.append((i,cos_sim[idx][i]))
    
    #sort by similarity
    scores=sorted(scores,key=lambda x: x[1],reverse=True)
    
    #take top k
    topk=scores[:k]
    
    #prepare results
    recs=[]
    for i,s in topk:
        recs.append((feat_df.loc[i,'name'],round(s,3)))
    
    return recs

#example
print("Recommendations for Fullmetal Alchemist: Brotherhood")
print(recommend_by_name("Fullmetal Alchemist: Brotherhood",k=5))

Recommendations for Fullmetal Alchemist: Brotherhood
[('Fullmetal Alchemist', np.float64(0.946)), ('Magi: The Labyrinth of Magic', np.float64(0.874)), ('Magi: The Kingdom of Magic', np.float64(0.87)), ('Densetsu no Yuusha no Densetsu', np.float64(0.861)), ('Magi: Sinbad no Bouken (TV)', np.float64(0.856))]
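
The Python loop and full sort in `recommend_by_name` can be replaced by a vectorized top-k lookup. The following is a minimal sketch, assuming a 1-D row of similarities; `top_k_indices` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_k_indices(sim_row, query_idx, k=5):
    """Return indices of the k largest similarities, excluding the query itself."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[query_idx] = -np.inf          # never recommend the query item
    k = min(k, len(scores) - 1)
    # argpartition selects the k largest without sorting the whole row ...
    candidates = np.argpartition(-scores, k - 1)[:k]
    # ... then only those k candidates are sorted by descending score.
    return candidates[np.argsort(-scores[candidates])]

row = np.array([1.0, 0.2, 0.9, 0.5, 0.7])
print(top_k_indices(row, query_idx=0, k=3))  # [2 4 3]
```

With the notebook's variables, a call would look like `top_k_indices(cos_sim[idx], idx, k)`. Tied scores may come back in a different order than with `sorted`.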


In [27]:
from sklearn.model_selection import train_test_split
genre_bool=genre_df.astype(bool).values  #astype(bool) turns 1 to True and 0 to False.

In [28]:
idx_all=np.arange(len(feat_df))
train_idx, test_idx=train_test_split(idx_all,test_size=0.2,random_state=42)
#test_idx is an array of row indices used as query items to evaluate recommendations; train_idx is not used below because nothing is trained.

In [None]:
def evaluate_at_k(k=10,sample_size=300):
    test_sample=np.random.choice(test_idx,size=min(sample_size,len(test_idx)),replace=False)
    precs=[]
    recs=[]
    f1s=[]
    
    for q in test_sample:
        # ground truth: items sharing at least 1 genre with q
        q_genres=genre_bool[q]
        gt_mask=np.any(genre_bool & q_genres,axis=1)
        gt_mask[q]=False
        gt_idx=set(np.where(gt_mask)[0])
        
        # get top-k recommendations
        scores=[]
        for i in range(len(cos_sim[q])):
            if i!= q:
                scores.append((i,cos_sim[q][i]))
        scores=sorted(scores,key=lambda x: x[1],reverse=True)
        rec_idx=[i for i,s in scores[:k]]
        rec_set=set(rec_idx)
        
        # evaluation
        if len(rec_set)==0:
            precs.append(0.0)
            recs.append(0.0)
            f1s.append(0.0)
            continue
        tp=len(rec_set & gt_idx)
        prec=tp/len(rec_set)
        rec=tp/(len(gt_idx) if len(gt_idx)>0 else 1)
        if prec + rec == 0:
            f1 = 0.0
        else:
            f1 = 2 * prec * rec / (prec + rec)
        
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    
    return {'k': k, 'precision': np.mean(precs), 'recall': np.mean(recs), 'f1': np.mean(f1s)}

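#We randomly pick up to sample_size anime from test_idx to evaluate.
#The sample is drawn without a fixed seed, so each call (and each k) uses different queries and the scores vary slightly between runs.
#For each chosen query q we:
#1) Build the ground-truth set (all other items that share at least one genre with q).
#2) Get the top-k recommendations for q from the similarity matrix.
#3) Compute precision, recall and F1 for that query.
#4) Store the numbers and average them at the end.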

In [30]:
#run evaluation
results = []
for k in [5, 10, 20]:
    results.append(evaluate_at_k(k=k, sample_size=300))

for r in results:
    print(r)

{'k': 5, 'precision': np.float64(0.996), 'recall': np.float64(0.002287005482799253), 'f1': np.float64(0.004546483039003933)}
{'k': 10, 'precision': np.float64(0.9853333333333334), 'recall': np.float64(0.004184731844553136), 'f1': np.float64(0.008266271232755164)}
{'k': 20, 'precision': np.float64(0.9980000000000001), 'recall': np.float64(0.007557629062220689), 'f1': np.float64(0.014873109644894068)}
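
The gap between precision and recall follows from the metric definitions. The sketch below computes them for one hypothetical query whose ground-truth set has 2,000 items (a made-up size) and whose top-5 recommendations are all relevant; `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall and F1 for one query."""
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical query: 2,000 relevant items, 5 recommendations, all of them relevant.
relevant = range(2000)
recommended = [0, 1, 2, 3, 4]

p, r, f1 = precision_recall_f1(recommended, relevant)
print(p, r)  # 1.0 0.0025
```

Recall can never exceed k divided by the size of the ground-truth set, so with a broad definition of relevance it stays tiny even when every recommendation is a hit.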


#### Analysis
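* The recommender gives sensible results with the chosen features (genres, type, rating, episodes, and members): for Fullmetal Alchemist: Brotherhood, the closest match is Fullmetal Alchemist (cosine similarity 0.946).
* Precision@k is about 0.99 for k = 5, 10, and 20, while recall stays below 0.01. Low recall is expected here: every anime sharing at least one genre with the query counts as relevant, so the relevant set is far larger than k.
* It shows that even with basic feature engineering and cosine similarity, we can capture some meaningful relationships between anime.
* However, performance is limited because the system relies on only a few features and a simple similarity measure, and the genre-overlap ground truth partly mirrors the genre features themselves.
* For stronger results, we would need richer features such as detailed descriptions, tags, or actual user rating data, and possibly more advanced models such as collaborative filtering or hybrid recommenders.
* These would help capture deeper patterns in user preferences and improve recommendation quality.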

### Interview Questions:

#### 1. Can you explain the difference between user-based and item-based collaborative filtering?

- **User-based collaborative filtering**:  
  - Idea: "Users who are similar will like similar items."  
  - It finds users with tastes similar to the target user and recommends items that those similar users liked.  
  - Example: If Users A and B both like Pizza and Burger, and User A also likes Pasta, it recommends Pasta to User B.  

- **Item-based collaborative filtering**:  
  - Idea: "Items that are similar will be liked by the same user."  
  - It looks at item-to-item similarity and recommends items similar to what the user has already liked.  
  - Example: If many users who bought a Phone also bought Earphones, it recommends Earphones to a user who bought a Phone.  

- **Difference**:  
  - User-based = finds similar **users** first.  
  - Item-based = finds similar **items** first.  
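
The two approaches can be contrasted on a tiny made-up like matrix that mirrors the Pizza/Burger/Pasta example. This is a minimal sketch using cosine similarity as the similarity measure (one common choice, assumed here); all names and data are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

items = ["Pizza", "Burger", "Pasta"]
# Rows are users A, B, C; 1 means the user liked the item (made-up data).
likes = np.array([
    [1, 1, 1],  # User A
    [1, 1, 0],  # User B
    [0, 0, 1],  # User C
])
b = 1  # target user B

# User-based: compare rows, find B's most similar user, suggest what they liked.
user_sim = cosine_similarity(likes)
order = np.argsort(-user_sim[b])
nearest = int(next(u for u in order if u != b))
user_based = [items[j] for j in range(len(items))
              if likes[nearest, j] and not likes[b, j]]

# Item-based: compare columns, score unseen items by similarity to B's liked items.
item_sim = cosine_similarity(likes.T)
liked = np.flatnonzero(likes[b])
item_scores = item_sim[liked].sum(axis=0)
item_scores[liked] = -np.inf  # do not re-recommend items B already likes
item_based = items[int(np.argmax(item_scores))]

print(user_based, item_based)  # ['Pasta'] Pasta
```

Both routes reach Pasta here, but they get there differently: the first via User A as B's nearest neighbour, the second via Pasta's similarity to Pizza and Burger.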

---

#### 2. What is collaborative filtering, and how does it work?

- Collaborative filtering is a recommendation method that predicts what a user will like based on preferences of similar users or similar items.  
- **How it works**:  
  1. Collect user-item interaction data (ratings, purchases, clicks, etc.).  
  2. Find patterns: either users who behave alike, or items that often go together.  
  3. Recommend new items to the user based on these patterns.  

- **Example**: On Netflix, if many people who watch Movie A also watch Movie B, then Movie B is recommended to someone who watched Movie A.  
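
The three steps above can be sketched with simple co-occurrence counts. The watch histories, user IDs, and the `recommend` helper are all hypothetical, and real systems use more refined scoring than raw counts.

```python
from collections import Counter
from itertools import combinations

# Step 1: collect interactions (made-up watch histories).
histories = {
    "u1": {"Movie A", "Movie B"},
    "u2": {"Movie A", "Movie B"},
    "u3": {"Movie A", "Movie C"},
    "u4": {"Movie A"},
}

# Step 2: find patterns by counting how often two items are watched together.
co_counts = Counter()
for watched in histories.values():
    for a, b in combinations(sorted(watched), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

# Step 3: recommend unseen items that co-occur most with what the user watched.
def recommend(user, k=1):
    seen = histories[user]
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    return [item for item, _ in scores.most_common(k)]

print(recommend("u4"))  # ['Movie B']
```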
