# Data Preprocessing:

- Load the dataset into a suitable data structure (e.g., pandas DataFrame).
- Handle missing values, if any.
- Explore the dataset to understand its structure and attributes.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('anime.csv')
df.shape

(12294, 7)

In [2]:
df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [4]:
df.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


In [5]:
# Checking Missing Values

df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [6]:
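# Filling missing values (no rows are dropped)

# missing genres become an empty string; each genre string is then split into a list of genres
df['genre'] = df['genre'].fillna('').apply(lambda x : x.split(', ') if isinstance(x, str) else x if isinstance(x, list) else [])
# missing types get the placeholder category 'Unknown'
df['type'] = df['type'].fillna('Unknown')
# missing ratings are replaced with the median rating
df['rating'] = df['rating'].fillna(df['rating'].median())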

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


# Feature Extraction:

- Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
- Convert categorical features into numerical representations if necessary.
- Normalize numerical features if required.

In [10]:
# Encoding Categorical Features

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

In [11]:
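# NOTE: 'genre' holds Python lists after In [6], so .str.get_dummies() splits the
# lists' string form and the column names keep stray quotes and brackets (see output).
genre_split = df['genre'].str.get_dummies(sep = ', ')
genre_split.head()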

Unnamed: 0,'Adventure','Adventure'],'Cars','Cars'],'Comedy','Comedy'],'Dementia','Dementia'],'Demons','Demons'],...,['Shounen'],['Slice of Life',['Slice of Life'],['Space'],['Sports'],['Super Power',['Supernatural'],['Thriller'],['Vampire'],['Yaoi']
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
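
The bracketed column names above come from splitting the string form of a list. A minimal sketch of one way to get clean genre columns is shown here. It uses a small hypothetical frame called `toy` rather than the real `df`, and the names `genre_dummies`, `as_lists` and `from_lists` are illustrative only. It is not the encoding used for the outputs recorded in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; 'genre' is still the raw
# comma-separated string here, with one missing value.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Adventure", "Comedy", None],
})

# Fill missing genres with an empty string, then split on ', ' into
# one indicator column per genre.
genre_dummies = toy["genre"].fillna("").str.get_dummies(sep=", ")
print(genre_dummies.columns.tolist())

# If the column already holds lists, join each list back into a string first.
as_lists = toy["genre"].fillna("").str.split(", ")
from_lists = as_lists.str.join(", ").str.get_dummies(sep=", ")
print(from_lists.equals(genre_dummies))
```

With this approach each genre appears once as a column, and a row with no genre is all zeros.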


In [12]:
# one-hot encoding the 'type' column
type_encoded = pd.get_dummies(df['type'], prefix = 'type')
type_encoded

Unnamed: 0,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown
0,True,False,False,False,False,False,False
1,False,False,False,False,False,True,False
2,False,False,False,False,False,True,False
3,False,False,False,False,False,True,False
4,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...
12289,False,False,False,True,False,False,False
12290,False,False,False,True,False,False,False
12291,False,False,False,True,False,False,False
12292,False,False,False,True,False,False,False


In [13]:
from sklearn.preprocessing import MinMaxScaler

# Replacing 'Unknown' strings with NaN
df[['rating', 'members', 'episodes']] = df[['rating', 'members', 'episodes']].replace('Unknown', np.nan)

In [14]:
# Converting columns to numeric
df[['rating', 'members', 'episodes']] = df[['rating', 'members', 'episodes']].apply(pd.to_numeric, errors='coerce')

In [15]:
# Filling NaN values with the mean of each column
df[['rating', 'members', 'episodes']] = df[['rating', 'members', 'episodes']].fillna(df[['rating', 'members', 'episodes']].mean())

In [16]:
# normalizing numerical features
ms = MinMaxScaler()
df[['rating', 'members', 'episodes']] = ms.fit_transform(df[['rating', 'members', 'episodes']])

In [17]:
num_cols = df[['rating', 'members', 'episodes']]
num_cols.head()

Unnamed: 0,rating,members,episodes
0,0.92437,0.197872,0.0
1,0.911164,0.78277,0.034673
2,0.909964,0.112689,0.027518
3,0.90036,0.664325,0.012658
4,0.89916,0.149186,0.027518
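
As a sanity check on the scaled values, min-max scaling maps each value x to (x - min) / (max - min). The sketch below applies that formula by hand to the rating extremes reported by `df.describe()` (1.67 and 10.0) and to the first row's rating (9.37). The helper `min_max_scale` is a hypothetical name, and it assumes the column is not constant.

```python
import numpy as np

def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)  # assumes hi > lo (non-constant column)

# Minimum rating, first row's rating, maximum rating
ratings = np.array([1.67, 9.37, 10.0])
scaled = min_max_scale(ratings)
print(scaled.round(5))  # the middle value is about 0.92437
```

The middle value agrees with the scaled rating shown for row 0.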


In [18]:
df_encoded = pd.concat([genre_split, type_encoded, num_cols], axis = 1)
df_encoded

Unnamed: 0,'Adventure','Adventure'],'Cars','Cars'],'Comedy','Comedy'],'Dementia','Dementia'],'Demons','Demons'],...,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown,rating,members,episodes
0,0,0,0,0,0,0,0,0,0,0,...,True,False,False,False,False,False,False,0.924370,0.197872,0.000000
1,1,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.911164,0.782770,0.034673
2,0,0,0,0,1,0,0,0,0,0,...,False,False,False,False,False,True,False,0.909964,0.112689,0.027518
3,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.900360,0.664325,0.012658
4,0,0,0,0,1,0,0,0,0,0,...,False,False,False,False,False,True,False,0.899160,0.149186,0.027518
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,0,0,0,0,0,0,0,0,0,0,...,False,False,False,True,False,False,False,0.297719,0.000203,0.000000
12290,0,0,0,0,0,0,0,0,0,0,...,False,False,False,True,False,False,False,0.313325,0.000176,0.000000
12291,0,0,0,0,0,0,0,0,0,0,...,False,False,False,True,False,False,False,0.385354,0.000211,0.001651
12292,0,0,0,0,0,0,0,0,0,0,...,False,False,False,True,False,False,False,0.397359,0.000168,0.000000


In [19]:
df_new = pd.concat([df['anime_id'], df['name'], genre_split, type_encoded, num_cols], axis = 1)
df_new.head()

Unnamed: 0,anime_id,name,'Adventure','Adventure'],'Cars','Cars'],'Comedy','Comedy'],'Dementia','Dementia'],...,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown,rating,members,episodes
0,32281,Kimi no Na wa.,0,0,0,0,0,0,0,0,...,True,False,False,False,False,False,False,0.92437,0.197872,0.0
1,5114,Fullmetal Alchemist: Brotherhood,1,0,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.911164,0.78277,0.034673
2,28977,Gintama°,0,0,0,0,1,0,0,0,...,False,False,False,False,False,True,False,0.909964,0.112689,0.027518
3,9253,Steins;Gate,0,0,0,0,0,0,0,0,...,False,False,False,False,False,True,False,0.90036,0.664325,0.012658
4,9969,Gintama&#039;,0,0,0,0,1,0,0,0,...,False,False,False,False,False,True,False,0.89916,0.149186,0.027518


# Recommendation System:

- Design a function to recommend anime based on cosine similarity.
- Given a target anime, recommend a list of similar anime based on cosine similarity scores.
- Experiment with different threshold values for similarity scores to adjust the recommendation list size.
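
For reference, cosine similarity between two feature vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on three hypothetical vectors. The function name `cosine_sim` is illustrative and is not part of scikit-learn.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (both assumed non-zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = [1, 1, 0]  # hypothetical feature vectors
y = [1, 0, 0]
z = [0, 0, 1]

print(cosine_sim(x, y))  # 1 / sqrt(2), roughly 0.707
print(cosine_sim(x, z))  # 0.0, no features in common
```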

In [20]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Columns: 165 entries, anime_id to episodes
dtypes: bool(7), float64(3), int64(154), object(1)
memory usage: 14.9+ MB


In [21]:
df_new.isna().sum()

anime_id        0
name            0
'Adventure'     0
'Adventure']    0
'Cars'          0
               ..
type_TV         0
type_Unknown    0
rating          0
members         0
episodes        0
Length: 165, dtype: int64

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
def recommend_anime(df_new, target_anime_id, threshold = 0.5, top_n = 10):
    """Return up to top_n anime whose cosine similarity to the target is at least threshold."""

    feature_cols = df_new.columns.difference(['anime_id','name'])  # identifying feature columns except anime_id and name

    anime_features = df_new[feature_cols].astype(float)  # extracting feature values as a single numeric dtype

    matches = np.flatnonzero(df_new['anime_id'].to_numpy() == target_anime_id)  # row positions of the target anime
    if len(matches) == 0:
        raise ValueError(f'anime_id {target_anime_id} not found in df_new')
    target_index = matches[0]

    # computing cosine similarity between the target row and every row
    # (avoids building the full n x n similarity matrix on every call)
    similarity_scores = cosine_similarity(anime_features.iloc[[target_index]], anime_features)[0]

    similar_anime_indices = [
        idx for idx, score in enumerate(similarity_scores)   # filtering recommendations based on the threshold and excluding the target
        if score >= threshold and idx != target_index
    ]

    similar_anime_indices = sorted(
        similar_anime_indices, key = lambda idx: similarity_scores[idx], reverse = True
    )[:top_n]                # sorting by similarity scores and selecting top N recommendations

    recommendations = df_new.iloc[similar_anime_indices][['anime_id','name']]  # retrieving recommended anime information

    return recommendations

In [25]:
target_anime_id = 2415

recommendations = recommend_anime(df_new, target_anime_id = target_anime_id, threshold = 0.5, top_n = 10)
recommendations

Unnamed: 0,anime_id,name
3628,10370,Metal Fight Beyblade 4D
2401,2416,Grander Musashi RV
2768,3137,Tsurikichi Sanpei
4952,13231,Metal Fight Beyblade Zero G
3742,8410,Metal Fight Beyblade: Baku
4014,234,Dan Doh!!
4461,5962,Metal Fight Beyblade
1446,1391,Future GPX Cyber Formula
2113,3272,Kinnikuman
3859,2705,Bakusou Kyoudai Let&#039;s &amp; Go
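
To see how the threshold affects the size of the recommendation list, one option is to count how many items pass each threshold. The sketch below does this on a small hypothetical feature matrix, not the real `df_new`. The names `features` and `count_matches` are illustrative.

```python
import numpy as np

# Hypothetical 5 x 4 feature matrix (three indicator columns plus one scaled value)
features = np.array([
    [1, 0, 1, 0.9],
    [1, 0, 1, 0.8],
    [1, 1, 0, 0.5],
    [0, 1, 0, 0.4],
    [0, 0, 1, 0.1],
], dtype=float)

def count_matches(features, target_row, threshold):
    """Count rows (excluding the target) with cosine similarity >= threshold."""
    norms = np.linalg.norm(features, axis=1)
    scores = features @ features[target_row] / (norms * norms[target_row])
    scores[target_row] = -np.inf  # never count the target itself
    return int((scores >= threshold).sum())

counts = {t: count_matches(features, 0, t) for t in (0.1, 0.5, 0.6, 0.9)}
print(counts)
```

The count can only stay the same or shrink as the threshold rises, so a higher threshold gives a shorter but more similar list. With `top_n` fixed, the threshold only matters once fewer than `top_n` items pass it.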


# 1. Can you explain the difference between user-based and item-based collaborative filtering?


a. User-Based Collaborative Filtering: This method recommends items by finding similar users. It assumes that if User A and User B have agreed on items in the past, they will agree on items in the future as well.
- How it works: For a given user, the system identifies other users with similar preferences (based on past interactions, such as ratings). It then recommends items that these similar users liked and that the target user has not yet interacted with.
- Steps:

1. Find the most similar users to the target user (using similarity metrics like cosine similarity, Pearson correlation, etc.).
2. Recommend items that the similar users have rated highly but the target user has not yet rated.

- Advantages: 
1. Good for scenarios where user preferences are consistent over time.
2. Relatively simple to implement.

- Disadvantages:
1. Cold start problem: Struggles to recommend items to new users who haven't interacted with many items.
2. Computationally expensive with large datasets, as it requires computing similarities between all users.
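
The two steps above can be sketched on a tiny rating matrix. This is a minimal illustration under stated assumptions, not a production method: the matrix `ratings` is made up, 0 stands for "not rated", and `recommend_user_based` is a hypothetical helper that scores unseen items by a similarity-weighted average of the `k` nearest users' ratings.

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items); 0 = not rated
ratings = np.array([
    [5, 4, 0, 0],  # target user
    [5, 5, 4, 1],  # tastes close to the target user
    [1, 0, 2, 5],  # tastes far from the target user
], dtype=float)

def recommend_user_based(ratings, user, k=1):
    """Rank the user's unrated items using the k most similar users."""
    # Step 1: cosine similarity between the target user and every user
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])
    sims[user] = -np.inf  # a user is not their own neighbour
    neighbours = np.argsort(sims)[::-1][:k]

    # Step 2: similarity-weighted average of the neighbours' ratings
    weights = sims[neighbours]
    predicted = weights @ ratings[neighbours] / weights.sum()

    # Keep only items the target user has not rated, best first
    unseen = np.flatnonzero(ratings[user] == 0)
    return unseen[np.argsort(predicted[unseen])[::-1]].tolist()

print(recommend_user_based(ratings, user=0))
```

Treating unrated items as 0 is a simplification. A fuller version would average only over neighbours who actually rated each item.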

b. Item-Based Collaborative Filtering: This method recommends items based on the similarity between items themselves. It assumes that if a user likes a certain item, they are more likely to like items that are similar to it.

- How it works: For a given item, the system identifies other items that are similar (based on user interactions, such as ratings). It then recommends those similar items to users who liked the original item.
- Steps:

1. Find the most similar items to the target item (using similarity metrics).
2. Recommend those similar items to the user who interacted with the target item.

- Advantages:
1. Can be more scalable than user-based collaborative filtering, especially for large datasets.
2. Less affected by the cold start problem for new users: a few interactions are enough to look up similar items, whereas user-based filtering needs enough history to find reliable neighbours.

- Disadvantages:

1. It might not work well in cases where the items don't have clear similarities.
2. For highly dynamic or evolving preferences, it may not capture the changing preferences of users as effectively as user-based methods.
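
The item-based steps can be sketched in the same way by comparing columns of the rating matrix instead of rows. The matrix `ratings` and the helper `similar_items` below are hypothetical, and 0 again stands for "not rated".

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items); 0 = not rated
ratings = np.array([
    [5, 4, 0, 0],
    [5, 5, 4, 1],
    [1, 0, 2, 5],
], dtype=float)

def similar_items(ratings, item, top_n=2):
    """Return the top_n items whose rating columns are most similar to `item`."""
    item_vectors = ratings.T  # one row per item: how each user rated it
    norms = np.linalg.norm(item_vectors, axis=1)
    sims = item_vectors @ item_vectors[item] / (norms * norms[item])
    sims[item] = -np.inf  # exclude the item itself
    return np.argsort(sims)[::-1][:top_n].tolist()

print(similar_items(ratings, item=0))
```

Here similarity is driven entirely by who rated what, which is what separates item-based collaborative filtering from comparing item attributes such as genre or type.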

# 2. What is collaborative filtering, and how does it work?

Collaborative filtering (CF) is a technique used in recommendation systems to predict the preferences or behaviors of users based on the preferences or behaviors of other similar users. It is one of the most widely used methods for building recommendation engines, such as those found in e-commerce platforms (like Amazon), movie streaming services (like Netflix), and music platforms (like Spotify).

- How Collaborative Filtering Works: Collaborative filtering operates under the assumption that people who have agreed in the past will agree in the future. In other words, if two users have shared similar tastes in the past, it is likely that they will enjoy similar items in the future. This method uses historical data (e.g., ratings, purchases, views) to make predictions about what items a user may like based on the preferences of other users who have similar tastes.

- There are two main types of collaborative filtering:
1. User-Based Collaborative Filtering (UBCF): Recommends items to a user based on the preferences of similar users. If User A likes the same items as User B, the system recommends to User A the items that User B has liked but User A has not yet interacted with.
2. Item-Based Collaborative Filtering (IBCF): Recommends items based on the similarity between items. If a user likes a particular item, the system recommends other items that are similar to it.