# Data Preprocessing:
Load the dataset into a suitable data structure (e.g., pandas DataFrame).         
Handle missing values, if any.                                                   
Explore the dataset to understand its structure and attributes.

In [1]:
import pandas as pd

df = pd.read_csv("anime.csv") #load the dataset.

df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [3]:
df.shape   #No. of Row & No. of Columns.

(12294, 7)

In [4]:
df.isnull().sum() #Check/Handle Missing Values.

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [5]:
df['genre'].fillna('Unknown', inplace=True)   # Fill missing values in 'genre' and 'type' with 'Unknown'
df['type'].fillna('Unknown', inplace=True)

df['rating'].fillna('mean', inplace=True)   # Change missing values in 'rating' with the mean values.

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['genre'].fillna('Unknown', inplace=True)   # Fill missing values in 'genre' and 'type' with 'Unknown'
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['type'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because 

In [6]:
df.isnull().sum()   # Verify that there are no missing values left.

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [7]:
df.describe() # Explore describe for numeric attributes

Unnamed: 0,anime_id,members
count,12294.0,12294.0
mean,14058.221653,18071.34
std,11455.294701,54820.68
min,1.0,5.0
25%,3484.25,225.0
50%,10260.5,1550.0
75%,24794.5,9437.0
max,34527.0,1013917.0


In [8]:
unique_names = df['name'].nunique()  # Number of unique values in 'name' and 'genre', Return number of unique elements in the object.Excludes NA values by default.
unique_genres = df['genre'].nunique()

In [9]:
unique_names

12292

In [10]:
unique_genres

3265

In [11]:
# Explore the distribution of categorical attributes

genre_counts = df['genre'].value_counts()
type_counts = df['type'].value_counts()

In [12]:
genre_counts

genre
Hentai                                                  823
Comedy                                                  523
Music                                                   301
Kids                                                    199
Comedy, Slice of Life                                   179
                                                       ... 
Adventure, Drama, Fantasy, Game, Sci-Fi                   1
Adventure, Demons, Fantasy, Historical                    1
Action, Comedy, Drama, Mecha, Music, Sci-Fi, Shounen      1
Action, Comedy, Fantasy, Mecha, Sci-Fi, Shounen           1
Hentai, Slice of Life                                     1
Name: count, Length: 3265, dtype: int64

In [13]:
type_counts

type
TV         3787
OVA        3311
Movie      2348
Special    1676
ONA         659
Music       488
Unknown      25
Name: count, dtype: int64

In [14]:
# Calculate the correlation matrix using only numeric columns
correlation_matrix = df[['anime_id', 'rating', 'members']].corr()

correlation_matrix

ValueError: could not convert string to float: 'mean'

# Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).                                                           
Convert categorical features into numerical representations if necessary.        
Normalize numerical features if required.


In [16]:
# Using one-hot encoding method to Convert 'genre' and 'type' to numerical values.

df_genres = df['genre'].str.get_dummies(sep=', ') #Convert this to a numerical format using techniques like one-hot encoding, where each genre becomes a binary vector.
df_type = pd.get_dummies(df['type'], prefix='type')

df_encoded = pd.concat([df, df_genres, df_type], axis=1)  # Combine the one-hot encoded columns with the original dataframe

In [17]:
df_encoded

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,Action,Adventure,Cars,...,Vampire,Yaoi,Yuri,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,0,0,0,...,0,0,0,True,False,False,False,False,False,False
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665,1,1,0,...,0,0,0,False,False,False,False,False,True,False
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262,1,0,0,...,0,0,0,False,False,False,False,False,True,False
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572,0,0,0,...,0,0,0,False,False,False,False,False,True,False
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266,1,0,0,...,0,0,0,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211,0,0,0,...,0,0,0,False,False,False,True,False,False,False
12290,5543,Under World,Hentai,OVA,1,4.28,183,0,0,0,...,0,0,0,False,False,False,True,False,False,False
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219,0,0,0,...,0,0,0,False,False,False,True,False,False,False
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175,0,0,0,...,0,0,0,False,False,False,True,False,False,False


In [18]:
# Normalised Numeric Features.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_encoded['rating'] = scaler.fit_transform(df_encoded[['rating']]) # Normalise 'rating' column

ValueError: could not convert string to float: 'mean'

In [19]:
# Combine features into a final dataset for computating similarity.

features = pd.concat([df_encoded['rating'], df_genres, df_type], axis=1)

In [20]:
features

Unnamed: 0,rating,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,...,Vampire,Yaoi,Yuri,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,type_Unknown
0,9.37,0,0,0,0,0,0,1,0,0,...,0,0,0,True,False,False,False,False,False,False
1,9.26,1,1,0,0,0,0,1,0,1,...,0,0,0,False,False,False,False,False,True,False
2,9.25,1,0,0,1,0,0,0,0,0,...,0,0,0,False,False,False,False,False,True,False
3,9.17,0,0,0,0,0,0,0,0,0,...,0,0,0,False,False,False,False,False,True,False
4,9.16,1,0,0,1,0,0,0,0,0,...,0,0,0,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,4.15,0,0,0,0,0,0,0,0,0,...,0,0,0,False,False,False,True,False,False,False
12290,4.28,0,0,0,0,0,0,0,0,0,...,0,0,0,False,False,False,True,False,False,False
12291,4.88,0,0,0,0,0,0,0,0,0,...,0,0,0,False,False,False,True,False,False,False
12292,4.98,0,0,0,0,0,0,0,0,0,...,0,0,0,False,False,False,True,False,False,False


#Recommendation System:
Design a function to recommend anime based on cosine
similarity.                     
Given a target anime, recommend a list of similar anime based on cosine similarity scores.                                                               
Experiment with different threshold values for similarity scores to adjust the recommendation list size.


In [21]:
from sklearn.metrics.pairwise import cosine_similarity  #Cosine similarity is a measure that calculates the cosine of the angle between two non-zero vectors, represent the features of the anime. The cosine similarity is 0 to 1,

# Cosine Similarity Matrix
cosine_sim = cosine_similarity(features)  # 'features' will be the DataFrame containing only the relevant features for similarity.

ValueError: could not convert string to float: 'mean'

In [22]:
# Recommended Similar Anime.

def recommend_anime(anime_name, df, cosine_sim, top_n = 10, threshold=0.0):
  # recommend_anime - takes the target anime, the data frame, and the cosine similarity matrix.
  # top_n - Number of recommendations to return.
  # threshold - Minimum similarity score for recommendations to be considered.

    idx = df.index[df['name'] == anime_name].tolist()[0]      # Get the index of the target anime.

    sim_scores = list(enumerate(cosine_sim[idx]))     # Get the similarity scores for the target anime.

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse= True)     # Sort the anime based on similarity scores in descending/reverse order.

    sim_scores = [x for x in sim_scores if x[1] >= threshold]     # Filter out anime with similarity scores below the threshold.

    sim_indices = [i[0] for i in sim_scores[1:top_n+1]]       # Get the indices of the most similar anime, Skip the first one as it's the same anime

    return df['name'].iloc[sim_indices]


In [23]:
target_anime = 'Gintama'    #Example target anime.
recommended_anime = recommend_anime(target_anime, df_encoded, cosine_sim, top_n=10, threshold=0.5)
print(recommended_anime)

NameError: name 'cosine_sim' is not defined

In [24]:
target_anime = 'Kimi no Na wa.'   #Example target anime.
recommended_anime = recommend_anime(target_anime, df_encoded, cosine_sim, top_n=20, threshold=0.5)
print(recommended_anime)

NameError: name 'cosine_sim' is not defined

In [25]:
# The error IndexError: index 0 is out of bounds for axis 0 with size 0 indicates that the list or array being accessed is empty,
# which means the code is attempting to access an index that doesn't exist.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_anime(anime_name, df, cosine_sim, top_n=5, threshold=0.0):
    # Check if the anime name exists in the DataFrame
    if anime_name not in df['name'].values:
        return f"Anime '{anime_name}' not found in the dataset."

    # Get the index of the target anime
    idx_list = df.index[df['name'] == anime_name].tolist()

    if len(idx_list) == 0:
        return f"Anime '{anime_name}' not found in the dataset. Index list is empty."

    idx = idx_list[0]

    # Debug: Print the index and ensure it's valid
    print(f"Index of '{anime_name}': {idx}")

    # Get the similarity scores for the target anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Debug: Print the first few similarity scores to ensure they are calculated correctly
    print(f"Similarity scores (first 5): {sim_scores[:5]}")

    # Sort the anime based on similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Filter out anime with similarity scores below the threshold
    sim_scores = [x for x in sim_scores if x[1] >= threshold]

    # Debug: Print the number of recommendations found after applying the threshold
    print(f"Number of recommendations after applying threshold: {len(sim_scores)}")

    # Get the indices of the most similar anime
    sim_indices = [i[0] for i in sim_scores[1:top_n+1]]  # Skip the first one as it's the same anime

    if len(sim_indices) == 0:
        return f"No similar anime found for '{anime_name}' with the given threshold."

    # Return the top similar anime
    return df['name'].iloc[sim_indices]

# Example Usage
target_anime = 'Kimi no Na wa.'
recommended_anime = recommend_anime(target_anime, df_encoded, cosine_sim, top_n=10, threshold=0.5)
print(recommended_anime)


NameError: name 'cosine_sim' is not defined

#Evaluation:

Split the dataset into training and testing sets.                                
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.                                                            
Analyze the performance of the recommendation system and identify areas of improvement.


In [26]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_encoded, test_size = 0.2, random_state = 42)  # Split the dataset based on a threshold.

In [27]:
from sklearn.metrics import precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

def evaluate_recommendations(df_train, df_test, cosine_sim, top_n=10, threshold=0.0):
    y_true = []
    y_pred = []

    for anime in df_test['name']:

        actual_genres = set(df_test[df_test['name'] == anime]['genre'].values[0].split(', ')) # Get the actual genres (ground truth)

        recommended_anime = recommend_anime(anime, df_train, cosine_sim, top_n, threshold)    # Get the recommended anime

        if isinstance(recommended_anime, str):    # If recommend_anime returns a string (error message), skip the iteration
            continue

        predicted_genres = set()  # Get the predicted genres from the recommended anime
        for rec_anime in recommended_anime:
            predicted_genres.update(set(df_train[df_train['name'] == rec_anime]['genre'].values[0].split(', ')))

        y_true.append(list(actual_genres))  # Append the ground truth and predictions to y_true and y_pred
        y_pred.append(list(predicted_genres))

    # Calculate Precision, Recall, and F1-score
    precision = precision_score(y_true, y_pred, average='micro')
    recall = recall_score(y_true, y_pred, average='micro')
    f1 = f1_score(y_true, y_pred, average='micro')
# average = micro says the function to compute f1 by considering total true positives, false negatives, and false positives (no matter the prediction for each label in the dataset)
    return precision, recall, f1

# Example given
precision, recall, f1 = evaluate_recommendations(df_train, df_test, cosine_sim, top_n=10, threshold=0.5)
print(f'Precision: {precision}, Recall: {recall}, F1-Score: {f1}')


NameError: name 'cosine_sim' is not defined

**Analyze the Performance :**

Precision: High precision means most of the recommended anime are relevant. If precision is low, the system might be recommending many irrelevant anime.

Recall: High recall indicates that the system is successfully capturing most of the relevant anime. If recall is low, many relevant anime are being missed.

F1-Score: A balanced F1-score indicates that the system performs well in both precision and recall. A low F1-score suggests room for improvement in either precision, recall, or both.

**Areas of Improvement:**

Feature Engineering: Adding more features (e.g., user demographics if available) or refining existing ones can improve the system's ability to distinguish between anime.

Hyperparameter Tuning: Experimenting with different thresholds, top_n values, and similarity measures can help fine-tune the system.

Handling Cold Start: If the system struggles with new or less popular anime (cold start problem), techniques like collaborative filtering or hybrid approaches could be explored.


#**Interview Questions :**
**1.Can you explain the difference between user-based and item-based collaborative filtering?**

Ans - User-based and item-based collaborative filtering are two common approaches used in recommendation systems. Both methods rely on the principle of finding similarities, but they differ in what they compare :

User-Based Collaborative Filtering -

The idea behind user-based collaborative filtering is that if two users have similar preferences (e.g., they rated the same items similarly), then what one user likes, the other is likely to like as well. The system recommends items to a user based on the preferences of other similar users.

Steps:                                                                           
Find Similar Users : For a given user, find other users who have similar tastes. This can be done using similarity measures like Pearson correlation, cosine similarity, or Euclidean distance.                           
Generate Recommendations : Recommend items that similar users liked but that the target user has not yet interacted with.

Example:

Imagine two users, Alice and Bob, who both like sci-fi movies. If Alice has watched a sci-fi movie that Bob hasn’t seen yet, the system might recommend that movie to Bob based on the similarity between Alice’s and Bob’s past ratings.

Item-Based Collaborative Filtering -

Item-based collaborative filtering focuses on finding similarities between items rather than users. The premise is that if a user likes one item, they will likely like similar items. Recommendations are made by looking at the items a user has already liked and finding similar items to suggest.

Steps:                                                                          
Find Similar Items: For each item that a user has interacted with, find other items that are similar. Similarity is often calculated using methods like cosine similarity or adjusted cosine similarity.                                 
Generate Recommendations: Recommend items similar to those the user has already interacted with.

Example:
If Bob likes the movie "Inception," and "Interstellar" is similar to "Inception" based on how other users rated these movies, the system might recommend "Interstellar" to Bob.

**2.What is collaborative filtering, and how does it work?**

Collaborative filtering is a popular technique used in recommendation systems to suggest items (like movies, books, products, etc.) to users based on the preferences and behaviors of other users. The fundamental idea behind collaborative filtering is that users who have agreed in the past will agree in the future. Essentially, it leverages the collective knowledge of many users to make personalized recommendations for an individual user.

Collaborative filtering works by finding patterns in user-item interactions (like ratings, purchases, clicks, etc.) and using these patterns to predict a user's preferences for items they haven't yet interacted with. There are two main types of collaborative filtering: user-based and item-based.

Types of Data Used -

Collaborative filtering relies on a user-item interaction matrix, where rows represent users and columns represent items. The entries in the matrix indicate the interaction between users and items, such as:

Explicit Feedback: User ratings (e.g., a user rating a movie 5 stars).           
Implicit Feedback: Actions that imply preference (e.g., clicks, views, purchases).

Advantages -

No Need for Domain Knowledge: Collaborative filtering does not require knowledge of item content (e.g., genres of movies) but instead relies solely on user interactions.                                                               
Personalization: It can provide highly personalized recommendations by leveraging the preferences of similar users or similar items.  
                  
Collaborative filtering is a powerful technique used in recommendation systems to make personalized suggestions by leveraging the preferences and behaviors of a large user base. It works by finding patterns in user interactions with items and using these patterns to predict what other items a user might like. The two main approaches—user-based and item-based collaborative filtering—each have their strengths and weaknesses, and are often combined to create more effective recommendation systems.
