1. Data Preprocessing

In [2]:
import pandas as pd
import numpy as np


df = pd.read_csv('anime.csv') 

# Display the first few rows

print(df.head())

# Get a summary of the dataset
df.info()

# Handle missing values
print(df.isnull().sum())


# Fill 'genre' and 'type' with 'Unknown'
df['genre'].fillna('Unknown', inplace=True)
df['type'].fillna('Unknown', inplace=True)


df['episodes'] = df['episodes'].replace('Unknown', np.nan)
df['episodes'] = pd.to_numeric(df['episodes'])
df['episodes'].fillna(df['episodes'].median(), inplace=True) # Fill numerical NaNs with median

# Fill 'rating' and 'members' with their respective means
df['rating'].fillna(df['rating'].mean(), inplace=True)
df['members'].fillna(df['members'].mean(), inplace=True)


print(df.isnull().sum())


print(df.head())

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['genre'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['type'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always

2. Feature Extraction:

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Decide on features for similarity



# TF-IDF Vectorizer is excellent for this.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the genre data
tfidf_matrix = tfidf_vectorizer.fit_transform(df['genre'])

print("\nShape of TF-IDF matrix (anime x genre terms):", tfidf_matrix.shape)

\

scaler = MinMaxScaler()
df[['rating_scaled', 'members_scaled']] = scaler.fit_transform(df[['rating', 'members']])


numerical_features_scaled = df[['rating_scaled', 'members_scaled']].values




Shape of TF-IDF matrix (anime x genre terms): (12294, 47)


3. Recommendation System (Cosine Similarity):

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("\nShape of Cosine Similarity Matrix (anime x anime):", cosine_sim.shape)

indices = pd.Series(df.index, index=df['name']).drop_duplicates()

def get_recommendations(name, cosine_sim_matrix=cosine_sim, df=df, indices=indices, top_n=10, similarity_threshold=0.5):
    try:
        idx = indices[name]
    except KeyError:
        print(f"Anime '{name}' not found in the dataset.")
        return pd.DataFrame()

    sim_scores = list(enumerate(cosine_sim_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [score for score in sim_scores if score[0] != idx and score[1] > similarity_threshold][:top_n]

    anime_indices = [i[0] for i in sim_scores]
    similarity_values = [i[1] for i in sim_scores]

    recommendations = df['name'].iloc[anime_indices].reset_index(drop=True)
    recommendations_df = pd.DataFrame({
        'Recommended Anime': recommendations,
        'Similarity Score': similarity_values
    })
    return recommendations_df

print("\nRecommendations for 'Naruto':")
print(get_recommendations('Naruto', top_n=5, similarity_threshold=0.1))

print("\nRecommendations for 'Death Note' (higher threshold):")
print(get_recommendations('Death Note', top_n=10, similarity_threshold=0.3))

print("\nRecommendations for 'One Punch Man':")
print(get_recommendations('One Punch Man', top_n=7, similarity_threshold=0.2))

print("\nRecommendations for 'naruto' (will not be found due to case sensitivity):")
print(get_recommendations('naruto'))



Shape of Cosine Similarity Matrix (anime x anime): (12294, 12294)

Recommendations for 'Naruto':
                                   Recommended Anime  Similarity Score
0                           Boruto: Naruto the Movie               1.0
1                                 Naruto: Shippuuden               1.0
2  Boruto: Naruto the Movie - Naruto ga Hokage ni...               1.0
3                                        Naruto x UT               1.0
4        Naruto: Shippuuden Movie 4 - The Lost Tower               1.0

Recommendations for 'Death Note' (higher threshold):
                          Recommended Anime  Similarity Score
0                        Death Note Rewrite          1.000000
1                           Mousou Dairinin          0.967703
2             Higurashi no Naku Koro ni Kai          0.879514
3             Higurashi no Naku Koro ni Rei          0.861056
4                          Mirai Nikki (TV)          0.815429
5         Mirai Nikki (TV): Ura Mirai Nikki       

4. Evaluation

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

df_eval = df.reset_index(drop=True)

train_df, test_df = train_test_split(df_eval, test_size=0.2, random_state=42)

print(f"\nTraining set size: {len(train_df)}")
print(f"Testing set size: {len(test_df)}")

# Re-calculate TF-IDF and cosine similarity ONLY on the training data
tfidf_vectorizer_train = TfidfVectorizer(stop_words='english')
tfidf_matrix_train = tfidf_vectorizer_train.fit_transform(train_df['genre'])
cosine_sim_train = cosine_similarity(tfidf_matrix_train, tfidf_matrix_train)

# Create indices for the training DataFrame
indices_train = pd.Series(train_df.index, index=train_df['name']).drop_duplicates()

def evaluate_recommendations(train_df, test_df, tfidf_vectorizer_train, cosine_sim_train, indices_train, top_n=10, similarity_threshold=0.3):
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    num_test_anime = len(test_df)

    for _, test_anime in test_df.iterrows():
        test_name = test_anime['name']

        test_genre_vector = tfidf_vectorizer_train.transform([test_anime['genre']])
        sim_scores_with_train = cosine_similarity(test_genre_vector, tfidf_matrix_train)[0]

        recommended_indices_train = np.where(sim_scores_with_train > similarity_threshold)[0]
        recommended_anime_train_df = train_df.iloc[recommended_indices_train]

        test_genres = set(test_anime['genre'].split(', '))

        found_relevant_recommendation = False
        for _, rec_anime in recommended_anime_train_df.iterrows():
            rec_genres = set(rec_anime['genre'].split(', '))
            if len(test_genres.intersection(rec_genres)) > 0:
                true_positives += 1
                found_relevant_recommendation = True
            else:
                false_positives += 1

        if not found_relevant_recommendation and len(recommended_anime_train_df) > 0:
            pass
        if len(recommended_anime_train_df) == 0:
            pass

    if (true_positives + false_positives) > 0:
        precision = true_positives / (true_positives + false_positives)
    else:
        precision = 0.0

    print(f"\n--- Evaluation Results (Simplified) ---")
    print(f"Total test anime: {num_test_anime}")
    print(f"True Positives (genre-relevant recommendations found): {true_positives}")
    print(f"False Positives (non-genre-relevant recommendations found): {false_positives}")
    print(f"Precision (Simplified): {precision:.4f}")
    print(f"Note: Recall and F1-score are challenging to calculate accurately in this item-based setup without explicit ground truth similarity sets for each item.")

# Run the evaluation
evaluate_recommendations(train_df, test_df, tfidf_vectorizer_train, cosine_sim_train, indices_train, top_n=5, similarity_threshold=0.2)

print("\n--- Evaluation with lower similarity threshold (0.1) ---")
evaluate_recommendations(train_df, test_df, tfidf_vectorizer_train, cosine_sim_train, indices_train, top_n=5, similarity_threshold=0.1)

print("\n--- Evaluation with higher similarity threshold (0.4) ---")
evaluate_recommendations(train_df, test_df, tfidf_vectorizer_train, cosine_sim_train, indices_train, top_n=5, similarity_threshold=0.4)




Training set size: 9835
Testing set size: 2459

--- Evaluation Results (Simplified) ---
Total test anime: 2459
True Positives (genre-relevant recommendations found): 5382556
False Positives (non-genre-relevant recommendations found): 8319
Precision (Simplified): 0.9985
Note: Recall and F1-score are challenging to calculate accurately in this item-based setup without explicit ground truth similarity sets for each item.

--- Evaluation with lower similarity threshold (0.1) ---

--- Evaluation Results (Simplified) ---
Total test anime: 2459
True Positives (genre-relevant recommendations found): 7676191
False Positives (non-genre-relevant recommendations found): 23380
Precision (Simplified): 0.9970
Note: Recall and F1-score are challenging to calculate accurately in this item-based setup without explicit ground truth similarity sets for each item.

--- Evaluation with higher similarity threshold (0.4) ---

--- Evaluation Results (Simplified) ---
Total test anime: 2459
True Positives (genr

Interview Questions:


**1. Difference between User-Based and Item-Based Collaborative Filtering**

User-Based Collaborative Filtering (UBCF):

Looks at similarities between users.

Assumes that if two users have rated items similarly in the past, they will continue to like similar items in the future.

Example: If User A and User B both like Naruto and One Piece, and User A also likes Bleach, then Bleach can be recommended to User B.

Item-Based Collaborative Filtering (IBCF):

Looks at similarities between items.

Assumes that if a user liked an item, they will also like other items that are similar to it.

Example: If many users who liked Attack on Titan also liked Death Note, then Death Note will be recommended to a user who liked Attack on Titan.



**2. What is Collaborative Filtering, and How Does It Work?**

Definition:
Collaborative Filtering (CF) is a recommendation technique that makes predictions about a user’s interests by collecting preferences from many users (a “collaboration”). It’s widely used in recommendation systems like Netflix, Amazon, and Spotify.

How It Works:

Collect user-item interaction data (ratings, likes, purchases, views, etc.).

Find patterns:

Either users with similar preferences (user-based)

Or items consumed together (item-based).

Recommend items based on these patterns.

Example:

If you rated Inception and Interstellar highly, and many others who liked these movies also liked Tenet, the system will recommend Tenet to you.