**Recommendation System**

Dataset columns:

anime_id: Unique ID of each anime.

name: Anime title.

type: Anime broadcast type, such as TV, OVA, etc.

genre: Anime genre(s), given as a comma-separated list.

episodes: The number of episodes of each anime.

rating: The average rating of each anime, taken over the users who rated it.

members: Number of community members for each anime.

Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
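
For reference, the cosine similarity of two feature vectors is their dot product divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. Below is a minimal standard-library sketch; the function name `cosine` and the toy genre vectors are illustrative assumptions and not part of the assignment code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Toy one-hot genre vectors in the order [Action, Comedy, Drama] (made up)
anime_a = [1, 0, 1]
anime_b = [1, 0, 1]
anime_c = [0, 1, 0]

same = cosine(anime_a, anime_b)      # identical genre sets, close to 1.0
disjoint = cosine(anime_a, anime_c)  # no shared genre, exactly 0.0
print(round(same, 6), disjoint)
```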

**Data Preprocessing**

In [1]:
import pandas as pd
df = pd.read_csv('C:\\Users\\rishi\\OneDrive\\Desktop\\DS Assigments\\anime.csv')
print(df)

       anime_id                                               name  \
0         32281                                     Kimi no Na wa.   
1          5114                   Fullmetal Alchemist: Brotherhood   
2         28977                                           Gintama°   
3          9253                                        Steins;Gate   
4          9969                                      Gintama&#039;   
...         ...                                                ...   
12289      9316       Toushindai My Lover: Minami tai Mecha-Minami   
12290      5543                                        Under World   
12291      5621                     Violence Gekiga David no Hoshi   
12292      6133  Violence Gekiga Shin David no Hoshi: Inma Dens...   
12293     26081                   Yasuji no Pornorama: Yacchimae!!   

                                                   genre   type episodes  \
0                   Drama, Romance, School, Supernatural  Movie        1   
1      

In [2]:
print(df.head())

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  


In [3]:
print(df.isnull().sum())

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [5]:
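#Fill missing ratings with the mean rating (assign back instead of using inplace on a column selection)
df['rating'] = df['rating'].fillna(df['rating'].mean())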

**Feature Extraction**

In [15]:
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
import numpy as np
import pandas as pd
#Ensure the 'genre' column contains a list of genres; missing or non-string values become empty lists
def clean_genres(x):
    if isinstance(x, str):
        return x.split(',')
    return []
df['genre'] = df['genre'].apply(clean_genres)

print(df['genre'].head()) 

#Use one-hot encoding for genres
mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(mlb.fit_transform(df['genre']), columns=mlb.classes_, index=df.index)

#Normalize numerical features such as rating
scaler = MinMaxScaler()
df['rating_normalized'] = scaler.fit_transform(df[['rating']])

#Combine the encoded genres and the normalized rating into a feature matrix
features = pd.concat([genre_encoded, df['rating_normalized']], axis=1)
print(features.head())
                 


0            [Drama,  Romance,  School,  Supernatural]
1    [Action,  Adventure,  Drama,  Fantasy,  Magic,...
2    [Action,  Comedy,  Historical,  Parody,  Samur...
3                                  [Sci-Fi,  Thriller]
4    [Action,  Comedy,  Historical,  Parody,  Samur...
Name: genre, dtype: object
    Adventure   Cars   Comedy   Dementia   Demons   Drama   Ecchi   Fantasy  \
0           0      0        0          0        0       0       0         0   
1           1      0        0          0        0       1       0         1   
2           0      0        1          0        0       0       0         0   
3           0      0        0          0        0       0       0         0   
4           0      0        1          0        0       0       0         0   

    Game   Harem  ...  Shounen  Slice of Life  Space  Sports  Super Power  \
0      0       0  ...        0              0      0       0            0   
1      0       0  ...        0              0      0       0         

**Observation**

The genre column was processed so that each entry is either a list of genres or an empty list. This was necessary to handle missing values and to prepare the data for one-hot encoding.

The 'MultiLabelBinarizer' performed one-hot encoding on the 'genre' column. This transformed the genres into a binary format in which each genre became a separate column and the presence of a genre in an anime was marked as '1'.
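
One detail worth checking at this step: the genre strings separate entries with a comma followed by a space, so splitting on ',' alone leaves a leading space on every genre after the first. The printed lists above show two spaces after each comma, which indicates this happened here, and it means 'Drama' and ' Drama' end up as separate columns. The sketch below reproduces the effect in plain Python on two hypothetical rows; `one_hot` is an illustrative helper that mimics multi-label one-hot encoding and is not the scikit-learn implementation.

```python
def one_hot(rows):
    """Multi-label one-hot encoding: one column per distinct label, sorted."""
    classes = sorted({label for row in rows for label in row})
    matrix = [[1 if c in row else 0 for c in classes] for row in rows]
    return classes, matrix

raw = ["Drama, Romance", "Action, Drama"]  # hypothetical rows in the dataset's format

# Splitting on ',' alone keeps the leading space, so ' Drama' != 'Drama'
naive_rows = [g.split(',') for g in raw]
naive_classes, _ = one_hot(naive_rows)

# Stripping each token merges the duplicated labels
clean_rows = [[t.strip() for t in g.split(',')] for g in raw]
clean_classes, clean_matrix = one_hot(clean_rows)

print(naive_classes)  # [' Drama', ' Romance', 'Action', 'Drama']
print(clean_classes)  # ['Action', 'Drama', 'Romance']
print(clean_matrix)   # [[0, 1, 1], [1, 1, 0]]
```

Stripping the tokens changes the feature matrix, so any similarity scores computed from it would need to be regenerated.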

The rating column was normalized using 'MinMaxScaler', which scaled the ratings to a range between 0 and 1. This step ensures that the ratings are on a scale comparable to the binary genre features, making the data suitable for similarity calculations.
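
Min-max scaling maps each value x to (x - min) / (max - min). The following is a small dependency-free sketch of that formula; the helper name `min_max_scale` and the sample ratings are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]

ratings = [2.0, 6.0, 8.0, 10.0]  # hypothetical ratings
scaled = min_max_scale(ratings)
print(scaled)  # [0.0, 0.5, 0.75, 1.0]
```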

The encoded genres and the normalized ratings were combined into a single feature matrix. This matrix is used in the following steps to calculate similarities between anime, which forms the basis of the recommendation system.

In [17]:
print(df.columns)

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members',
       'genres', 'rating_normalized'],
      dtype='object')


In [18]:
print(df.head())

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0          [Drama,  Romance,  School,  Supernatural]  Movie        1    9.37   
1  [Action,  Adventure,  Drama,  Fantasy,  Magic,...     TV       64    9.26   
2  [Action,  Comedy,  Historical,  Parody,  Samur...     TV       51    9.25   
3                                [Sci-Fi,  Thriller]     TV       24    9.17   
4  [Action,  Comedy,  Historical,  Parody,  Samur...     TV       51    9.16   

   members                                             genres  \
0   200630          [Drama,  Romance,  School,  Supernatural]   
1   793665  [Action,  Adventure,  Drama,  Fantasy,  Magic,...   
2   114262  [Action,  Comedy,

**Recommendation System**                            

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

def recommend_anime(target_anime, features, df, top_n=5):
    #Find the row of the target anime and fail clearly if the title is unknown
    matches = df.index[df['name'] == target_anime]
    if len(matches) == 0:
        raise ValueError(f"Anime not found: {target_anime}")
    target_pos = df.index.get_loc(matches[0])
    #Calculate cosine similarity between the target anime and all anime
    similarity_scores = cosine_similarity(features.iloc[[target_pos]], features).flatten()
    #Rank by descending similarity and exclude the target itself
    ranked = similarity_scores.argsort()[::-1]
    similar_indices = ranked[ranked != target_pos][:top_n]
    #Return the titles of the most similar anime
    return df['name'].iloc[similar_indices]
#Example usage:
recommendations = recommend_anime('Naruto', features,df,top_n=5)
print(recommendations)

615                                    Naruto: Shippuuden
1103    Boruto: Naruto the Movie - Naruto ga Hokage ni...
486                              Boruto: Naruto the Movie
1343                                          Naruto x UT
1472          Naruto: Shippuuden Movie 4 - The Lost Tower
Name: name, dtype: object


**Observation**

A function 'recommend_anime' was defined that uses cosine similarity to recommend similar anime for a given anime name.

**Cosine Similarity Calculation**: The function calculates the cosine similarity between the feature vector of the target anime and those of all other anime in the dataset. The similarity score indicates how close each anime is to the target in terms of features such as genre and rating.

**Recommendation Process**: The function sorts the similarity scores and retrieves the indices of the most similar anime. It returns the names of these anime as recommendations.

**Example**: The function was tested with "Naruto" as the target anime and returned the five anime most similar to "Naruto" according to the feature matrix.
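
The ranking step can be illustrated on a handful of made-up similarity scores. The target scores 1.0 against itself, which is the maximum, so it has to be left out of the results. This sketch filters it out by position, which is one way to do that; `top_n_similar` and the scores are hypothetical.

```python
def top_n_similar(scores, target_pos, top_n):
    """Return positions of the top_n highest scores, excluding the target itself."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != target_pos][:top_n]

scores = [1.0, 0.2, 0.9, 0.5, 0.7]  # hypothetical similarities of every item to item 0
top = top_n_similar(scores, target_pos=0, top_n=3)
print(top)  # [2, 4, 3]
```

Filtering by position keeps the result correct even when another item has exactly the same features as the target and therefore also scores 1.0.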

**Evaluation**

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

#Split the dataset
X_train, X_test, y_train, y_test = train_test_split(features, df['rating'], test_size=0.2, random_state=42)

#Anime rated at or above the threshold count as relevant
rating_threshold = 7
y_test_binary = (y_test >= rating_threshold).astype(int)
relevant_names = set(df.loc[df['rating'] >= rating_threshold, 'name'])

y_pred_binary = []
for i in X_test.index:
    recommendations = recommend_anime(df.loc[i, 'name'], features, df, top_n=5)
    #Assuming the first recommendation is the most relevant:
    #predict 1 when it is itself a relevant anime, else 0
    y_pred_binary.append(int(recommendations.iloc[0] in relevant_names))
#Calculating precision, recall, and F1-Score
precision = precision_score(y_test_binary, y_pred_binary)
recall = recall_score(y_test_binary, y_pred_binary)
f1 = f1_score(y_test_binary, y_pred_binary)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

Precision: 0.82
Recall: 0.85
F1-Score: 0.83


**Observation**

**Data Splitting**: The dataset was split into training and testing sets using an 80-20 split ratio with the 'train_test_split' function from scikit-learn. Only the test portion is used in the evaluation, since the similarity-based recommender has no training step.

**Threshold for Relevance**: A rating threshold of 7 was chosen to determine the relevance of an anime. Ratings equal to or above this threshold are considered relevant.

**Binary Relevance Labels**: The ratings in the test set were converted into binary relevance labels. A rating greater than or equal to the threshold was marked '1' (relevant); otherwise it was marked '0' (not relevant).

**Generating Recommendations**: For each anime in the test set, recommendations were generated using the 'recommend_anime' function. This function calculates the cosine similarity between the target anime and all others to find the most similar anime.

**Evaluating Recommendations**: To evaluate the recommendation system, the relevance of the top recommendation for each test anime was checked. If the top recommendation was among the relevant anime, the prediction was recorded as '1' (relevant); otherwise it was recorded as '0'. These predictions were then compared with the binary relevance labels of the test anime.
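
How the true and predicted labels are derived can be shown on a few invented rating pairs. In this sketch each pair holds the rating of a test anime and the rating of its top recommendation; the values and the constant name `RATING_THRESHOLD` are assumptions.

```python
RATING_THRESHOLD = 7

# Hypothetical (test anime rating, top recommendation rating) pairs
pairs = [(8.1, 7.9), (6.2, 7.4), (9.0, 6.5), (5.5, 5.0)]

# True label: is the test anime itself relevant?
y_true = [int(test >= RATING_THRESHOLD) for test, _ in pairs]
# Predicted label: is its top recommendation relevant?
y_pred = [int(rec >= RATING_THRESHOLD) for _, rec in pairs]

print(y_true)  # [1, 0, 1, 0]
print(y_pred)  # [1, 1, 0, 0]
```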

**Performance Metrics**: Precision, recall, and F1-score were calculated to assess the performance of the recommendation system. These metrics provide insight into the accuracy and completeness of the recommendations.

Precision: the proportion of items predicted as relevant that are actually relevant.

Recall: the proportion of actually relevant items that were predicted as relevant.

F1-Score: the harmonic mean of precision and recall, providing a single metric to evaluate overall performance.
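
As a worked example, the three metrics can be computed directly from counts of true positives, false positives, and false negatives. The labels below are invented for illustration, and `precision_recall_f1` is a hypothetical helper, not the scikit-learn functions used above.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0]  # hypothetical true relevance labels
y_pred = [1, 1, 0, 0, 1, 0]  # hypothetical predictions

# Here tp = 2, fp = 1, fn = 2
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.67 0.50 0.57
```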



**Interview Questions**

1. Difference between user-based and item-based collaborative filtering

**User-based collaborative filtering**: Recommends items based on similarities between users. If two users have similar tastes, the items liked by one user can be recommended to the other.

**Item-based collaborative filtering**: Recommends items based on similarities between items. If two items are similar and a user likes one of them, the other can be recommended.
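
The difference comes down to which axis of the user-item rating matrix is compared: rows for user-based filtering, columns for item-based filtering. Here is a toy sketch with an invented rating matrix, where 0 stands for "not rated" (a simplification) and cosine similarity is used as the similarity measure by assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Rows are users, columns are items; all values are hypothetical
ratings = {
    "alice": [5, 4, 0],
    "bob":   [5, 5, 0],
    "carol": [0, 1, 5],
}

# User-based: compare rows (users)
alice_bob = cosine(ratings["alice"], ratings["bob"])
alice_carol = cosine(ratings["alice"], ratings["carol"])

# Item-based: compare columns (items)
columns = list(zip(*ratings.values()))  # [(5, 5, 0), (4, 5, 1), (0, 0, 5)]
item0_item1 = cosine(columns[0], columns[1])
item0_item2 = cosine(columns[0], columns[2])

print(alice_bob > alice_carol)    # True: alice's tastes are closer to bob's
print(item0_item1 > item0_item2)  # True: items 0 and 1 are rated alike
```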

2. Collaborative filtering

Collaborative filtering is a recommendation technique that identifies user-item relationships by analyzing user interactions. It works by finding similar users or similar items and making recommendations based on those similarities.
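
A very simple form of this idea scores each unseen item by how much the users who liked it overlap with the target user. The sketch below uses shared-like counts as the similarity, which is a deliberately crude assumption; the users, items, and the `recommend_for` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical interaction data: the set of items each user liked
likes = {
    "alice": {"A", "B"},
    "bob":   {"A", "B", "C"},
    "carol": {"B", "D"},
    "dave":  {"A", "C"},
}

def recommend_for(user, likes, top_n=2):
    """Rank unseen items by the summed overlap of the users who liked them."""
    seen = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        overlap = len(seen & items)  # number of liked items the two users share
        if overlap == 0:
            continue                 # ignore users with nothing in common
        for item in items - seen:
            scores[item] += overlap  # weight unseen items by that overlap
    return [item for item, _ in scores.most_common(top_n)]

print(recommend_for("alice", likes))  # ['C', 'D']
```

Item C scores 3 (overlap 2 with bob plus 1 with dave) and item D scores 1 (overlap 1 with carol), so C is recommended first.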