In [None]:
"""Recommendation System"""

In [None]:
### Data Preprocessing

In [18]:
import pandas as pd

# Load the dataset
file_path = 'anime11.csv'  
anime_df = pd.read_csv(file_path)

# Display first few rows
print("Sample Data:")
print(anime_df.head())

# Dataset info
print("Dataset Information:")
anime_df.info()

# Check for missing values
print("Missing Values Per Column:")
missing_values = anime_df.isnull().sum()
print(missing_values)


Sample Data:
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 # 

In [19]:
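# Drop rows where 'name' is missing, then reset the index so that
# row labels stay equal to row positions for the later positional lookups
anime_df = anime_df.dropna(subset=['name']).reset_index(drop=True)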

# Fill missing 'genre' and 'type' with 'Unknown'
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['type'] = anime_df['type'].fillna('Unknown')

# Replace missing or non-numeric episodes with 0
anime_df['episodes'] = pd.to_numeric(anime_df['episodes'], errors='coerce')
anime_df['episodes'] = anime_df['episodes'].fillna(0).astype(int)

# Fill missing 'rating' with mean rating
mean_rating = anime_df['rating'].mean()
anime_df['rating'] = anime_df['rating'].fillna(mean_rating)

# Fill missing 'members' with median
median_members = anime_df['members'].median()
anime_df['members'] = anime_df['members'].fillna(median_members)

# Verify missing values again
print(anime_df.isnull().sum())


anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


In [20]:
# Shape of the dataset
print(f"Dataset Shape: {anime_df.shape}")

# Column names  
print("Column Names:")
print(anime_df.columns.tolist())

# Data types
print("Data Types:")
print(anime_df.dtypes)

# Statistical summary for numerical features
print("Statistical Summary (Numerical Features):")
print(anime_df.describe())

# Statistical summary for object (categorical) features
print("Unique Values per Column:")
for col in anime_df.select_dtypes(include='object').columns:
    print(f"{col}: {anime_df[col].nunique()} unique values")

# Top 5 genres
print("Top 5 Most Common Genres:")
print(anime_df['genre'].value_counts().head())

# Types of Anime
print("Anime Types Distribution:")
print(anime_df['type'].value_counts())

# Sample data
print("Sample Data:")
print(anime_df.head())


Dataset Shape: (12294, 7)

Column Names:
['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

Data Types:
anime_id      int64
name         object
genre        object
type         object
episodes      int32
rating      float64
members       int64
dtype: object

Statistical Summary (Numerical Features):
           anime_id      episodes        rating       members
count  12294.000000  12294.000000  12294.000000  1.229400e+04
mean   14058.221653     12.040101      6.473902  1.807134e+04
std    11455.294701     46.257299      1.017096  5.482068e+04
min        1.000000      0.000000      1.670000  5.000000e+00
25%     3484.250000      1.000000      5.900000  2.250000e+02
50%    10260.500000      2.000000      6.550000  1.550000e+03
75%    24794.500000     12.000000      7.170000  9.437000e+03
max    34527.000000   1818.000000     10.000000  1.013917e+06

Unique Values per Column:
name: 12292 unique values
genre: 3265 unique values
type: 7 unique values

Top 5 Most Common 

In [1]:
"""Dataset Shape:

Total Rows: 12,294

Total Columns: 7

Columns & Descriptions:

Column      Description
anime_id    Unique identifier for each anime
name        Title of the anime
genre       Genre(s) of the anime (comma-separated)
type        Format type (TV, Movie, OVA, etc.)
episodes    Number of episodes (0 marks a missing or non-numeric count)
rating      Average user rating (observed range 1.67 to 10)
members     Number of community members who rated or joined the anime

Numerical Features Summary:

Feature     Min     Max         Mean      Std Dev
anime_id    1       34,527      14,058    11,455
episodes    0       1,818       12.04     46.26
rating      1.67    10.0        6.47      1.02
members     5       1,013,917   18,071    54,821

 episodes: Ranges from 0 to 1,818 with a median of 2, so a few very long series pull the mean up to 12.04.

 rating: Ranges from 1.67 to 10; the mean is 6.47 and the middle 50% of anime fall between 5.90 and 7.17.

 members: Varies widely from 5 to over a million (median 1,550), reflecting large differences in anime popularity.

Categorical Insights:

 genre: Contains multiple genres per anime, separated by commas. The diversity is high, with 3,265 unique combinations.

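 type: Includes 7 categories (six formats plus the 'Unknown' placeholder); TV, OVA, and Movie are the most frequent.

Top 5 Most Common Genre Strings (exact combinations, since value_counts was run on the unsplit genre column):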

Hentai

Comedy

Music

Kids

Comedy, Slice of Life

Anime Types Distribution:

TV: 3,787

OVA: 3,311

Movie: 2,348

Special: 1,676

ONA: 659

Music: 488

Unknown: 25

Example Insights:

The most common types are: TV > OVA > Movie.

 The most common genre strings include: Hentai, Comedy, Music, etc.

 Many anime have a very small number of episodes, but a few have hundreds, which skews the episode count.

Popularity (measured by members) is highly skewed: a few anime are extremely popular while many have low engagement."""
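
The top-5 list in the summary counts exact genre strings, so 'Comedy' and 'Comedy, Slice of Life' are tallied separately. A minimal sketch of counting individual genres instead, assuming a small hypothetical `toy_df` in place of `anime_df` (the rows below are made up for illustration and are not taken from anime11.csv):

```python
import pandas as pd

# Hypothetical toy frame standing in for anime_df (illustrative rows only)
toy_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'genre': ['Comedy', 'Comedy, Slice of Life', 'Action, Comedy', 'Action'],
})

# Split each genre string on ', ' and give every genre its own row,
# then count how often each individual genre occurs
genre_counts = (
    toy_df['genre']
    .str.split(', ')
    .explode()
    .value_counts()
)

print(genre_counts)
```

Run against the real data, the same chain would report per-genre frequencies rather than per-combination frequencies, which is usually the more useful view for a multi-label column.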




In [None]:
### Feature Extraction

In [22]:
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler

#  1. Process genres using MultiLabelBinarizer 
# Split the genres by comma and handle missing/unknown
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['genre_list'] = anime_df['genre'].apply(lambda x: x.split(', '))

mlb = MultiLabelBinarizer()
genre_encoded = mlb.fit_transform(anime_df['genre_list'])
genre_df = pd.DataFrame(genre_encoded, columns=mlb.classes_)

# 2. Normalize numerical features 
scaler = MinMaxScaler()
numeric_features = anime_df[['rating', 'members']]
numeric_scaled = scaler.fit_transform(numeric_features)
numeric_df = pd.DataFrame(numeric_scaled, columns=['rating', 'members'])

# 3. Combine genre and numerical features 
features = pd.concat([genre_df, numeric_df], axis=1)

# Final feature set shape
print(f"Combined Feature Set Shape: {features.shape}")
features.head()


Combined Feature Set Shape: (12294, 46)


Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri,rating,members
0,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0.92437,0.197872
1,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0.911164,0.78277
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.909964,0.112689
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0.90036,0.664325
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.89916,0.149186


In [3]:
"""Selected Features:

Genres: Transformed into 44 binary columns (e.g., Action, Adventure, Comedy, plus an 'Unknown' column for missing genres) using MultiLabelBinarizer.

Rating: Normalized between 0 and 1.

Members: Normalized between 0 and 1.

Categorical to Numerical Conversion:

The genre column was split on ', ' and multi-hot encoded, so each anime has a 1 in the column of every genre it belongs to.

Normalization:

Rating and Members were scaled using Min-Max Scaling to align their range with the binary genre features.

Final Feature Set:

The final feature matrix contains 46 features:

44 genre indicators (including 'Unknown')

1 normalized rating

1 normalized members

Example Output Shape: (12294, 46)."""
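
The encoding and scaling steps can be sketched on a tiny frame. This is an illustration only: `toy_df` and `toy_features` are hypothetical names and the values are made up. Passing `index=toy_df.index` is an extra safeguard (not in the original cell) that keeps the pieces aligned when the source frame does not have a default index.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler

# Hypothetical toy data (not taken from anime11.csv)
toy_df = pd.DataFrame({
    'genre': ['Action, Comedy', 'Drama', 'Unknown'],
    'rating': [9.0, 7.0, 5.0],
    'members': [1000, 500, 0],
})

# Multi-hot encode the genres: one 0/1 column per distinct genre
mlb = MultiLabelBinarizer()
genre_df = pd.DataFrame(
    mlb.fit_transform(toy_df['genre'].str.split(', ')),
    columns=mlb.classes_,
    index=toy_df.index,
)

# Scale the numeric columns into [0, 1]
scaler = MinMaxScaler()
numeric_df = pd.DataFrame(
    scaler.fit_transform(toy_df[['rating', 'members']]),
    columns=['rating', 'members'],
    index=toy_df.index,
)

# Concatenate column-wise: genre indicators first, numeric features last
toy_features = pd.concat([genre_df, numeric_df], axis=1)
print(toy_features)
```

Here the toy frame yields four genre columns plus two numeric columns, mirroring how the full dataset yields 44 + 2 = 46.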




In [None]:
### Recommendation System

In [24]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# --- 1. Compute cosine similarity matrix ---
cosine_sim = cosine_similarity(features)

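# --- 2. Create a reverse index for anime name to index mapping ---
# Series.drop_duplicates() compares values (the row indices), not names, so it
# would leave duplicated titles in place; keep the first row per name instead.
anime_indices = pd.Series(anime_df.index, index=anime_df['name'])
anime_indices = anime_indices[~anime_indices.index.duplicated(keep='first')]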

# --- 3. Recommendation function ---
def recommend_anime(anime_name, top_n=5, similarity_threshold=0.5):
    if anime_name not in anime_indices:
        return "Anime not found in the dataset."
    
    idx = anime_indices[anime_name]
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Filter based on similarity threshold
    sim_scores = [(i, score) for i, score in sim_scores if score >= similarity_threshold and i != idx]
    
    # Sort by similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Select top N recommendations
    sim_indices = [i for i, score in sim_scores[:top_n]]
    
    return anime_df['name'].iloc[sim_indices].tolist()

# --- 4. Example Usage ---
recommended = recommend_anime('Steins;Gate', top_n=5, similarity_threshold=0.5)
print("Recommended Animes for 'Steins;Gate':", recommended)


Recommended Animes for 'Steins;Gate': ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Steins;Gate 0', 'Under the Dog']


In [1]:
"""Cosine Similarity Matrix:

A cosine similarity matrix was computed from the processed feature set (46 features per anime, covering genres, rating, and members).

Entry (i, j) of the resulting 12,294 x 12,294 matrix is the similarity score between anime i and anime j based on these features.

Index Mapping:

A mapping from anime names to their DataFrame index was created so that the row of similarity scores for any anime can be looked up by name.

Recommendation Function:

The function recommend_anime(anime_name, top_n, similarity_threshold):

anime_name: The title of the anime for which recommendations are desired.

top_n: Number of similar anime to recommend.

similarity_threshold: Minimum cosine similarity score for an anime to be considered similar.

The function:

Retrieves similarity scores for the given anime.

Filters out the input anime itself and those below the similarity threshold.

Sorts the results by similarity score in descending order.

Returns the top N most similar anime names.

Parameter Tuning:

The similarity_threshold adjusts how strictly the recommendations match the selected anime.

Higher threshold = fewer but more similar recommendations.

Lower threshold = more recommendations but less strictly similar.

The top_n parameter controls the number of recommendations returned.

Example Output:

For the anime 'Steins;Gate', using top_n=5 and similarity_threshold=0.5, the recommended anime are:

Steins;Gate Movie: Fuka Ryouiki no Déjà vu

Steins;Gate: Oukoubakko no Poriomania

Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero

Steins;Gate 0

Under the Dog

These results suggest that the system ranks sequels and spin-offs first, followed by anime with a similar genre, rating, and popularity profile."""
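
The filter, sort, and top-N steps described in the summary can be sketched with NumPy alone. The helper names `cosine_sim_matrix` and `top_similar` and the toy feature rows are hypothetical; the notebook itself uses scikit-learn's `cosine_similarity` and a list-based sort.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T

def top_similar(sim, idx, top_n=5, threshold=0.5):
    """Positions of the top_n rows most similar to row idx, at or above threshold."""
    scores = sim[idx].copy()
    scores[idx] = -np.inf            # exclude the query row itself
    order = np.argsort(-scores, kind='stable')   # highest similarity first
    return [int(i) for i in order if scores[i] >= threshold][:top_n]

# Hypothetical feature rows: three genre flags followed by a scaled rating
toy = np.array([
    [1, 0, 1, 0.9],   # row 0: the query title
    [1, 0, 1, 0.8],   # row 1: same genre flags as row 0
    [0, 1, 0, 0.9],   # row 2: no genre overlap with row 0
    [1, 0, 0, 0.5],   # row 3: partial genre overlap
])
sim = cosine_sim_matrix(toy)
print(top_similar(sim, idx=0, top_n=2, threshold=0.5))
```

Row 1 ranks first because it shares both genre flags with the query, and row 2 is filtered out because its similarity falls below the threshold.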

In [26]:
# A tuple default avoids sharing one mutable list object between calls
def experiment_thresholds(anime_name, thresholds=(0.3, 0.5, 0.7, 0.9)):
    if anime_name not in anime_indices:
        return "Anime not found in the dataset."
    
    idx = anime_indices[anime_name]
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    results = {}
    for threshold in thresholds:
        filtered = [(i, score) for i, score in sim_scores if score >= threshold and i != idx]
        filtered_sorted = sorted(filtered, key=lambda x: x[1], reverse=True)
        recommended_animes = [anime_df['name'].iloc[i] for i, _ in filtered_sorted]
        results[threshold] = {
            'count': len(recommended_animes),
            'recommendations': recommended_animes[:5]  # Show top 5 per threshold
        }
    return results

# --- Example Usage ---
threshold_results = experiment_thresholds('Steins;Gate', thresholds=[0.3, 0.5, 0.7, 0.9])

# Display results
for threshold, data in threshold_results.items():
    print(f"Threshold: {threshold}")
    print(f"Number of Recommendations: {data['count']}")
    print(f"Top 5 Recommendations: {data['recommendations']}")



Threshold: 0.3
Number of Recommendations: 2136
Top 5 Recommendations: ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Steins;Gate 0', 'Under the Dog']

Threshold: 0.5
Number of Recommendations: 320
Top 5 Recommendations: ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Steins;Gate 0', 'Under the Dog']

Threshold: 0.7
Number of Recommendations: 46
Top 5 Recommendations: ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Steins;Gate 0', 'Under the Dog']

Threshold: 0.9
Number of Recommendations: 4
Top 5 Recommendations: ['Steins;Gate Movie: Fuka Ryouiki no Déjà vu', 'Steins;Gate: Oukoubakko no Poriomania', 'Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero', 'Steins;Gate 0']

In [1]:
"""Summary
Objective:
To analyze how adjusting the cosine similarity threshold affects the number of recommendations for a given anime.

Method:

Tested thresholds: 0.3, 0.5, 0.7, 0.9

For each threshold, we:

Counted the number of recommendations generated.

Displayed the top 5 recommendations (where available) for the anime 'Steins;Gate'.

Findings:

Threshold   Number of Recommendations   Top 5 Recommendations
0.3         2,136                       Steins;Gate Movie: Fuka Ryouiki no Déjà vu, Steins;Gate: Oukoubakko no Poriomania, Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero, Steins;Gate 0, Under the Dog
0.5         320                         Same as above
0.7         46                          Same as above
0.9         4                           Steins;Gate Movie: Fuka Ryouiki no Déjà vu, Steins;Gate: Oukoubakko no Poriomania, Steins;Gate: Kyoukaimenjou no Missing Link - Divide By Zero, Steins;Gate 0

Insights:

As the threshold increases, the number of recommendations drops sharply, from 2,136 at 0.3 to 4 at 0.9.

At a threshold of 0.9, only four titles remain, all from the Steins;Gate franchise.

At lower thresholds (0.3), far more anime qualify, but most of them are only loosely related.

Conclusion:

Threshold tuning balances broad suggestions against highly similar content.

For this example, a threshold between 0.5 and 0.7 keeps 46 to 320 candidates, which appears to be a reasonable compromise between recommendation count and similarity strength. Because only the highest-scoring titles are returned, the threshold changes the top 5 only when fewer than five titles pass it."""
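
The per-threshold counts can also be obtained with a vectorized comparison instead of a list comprehension. A minimal sketch, assuming a hypothetical helper `count_above_thresholds` and made-up similarity scores:

```python
import numpy as np

def count_above_thresholds(sim_row, self_idx, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Return {threshold: number of other items with similarity >= threshold}."""
    # Drop the query item's own score before counting
    others = np.delete(np.asarray(sim_row, dtype=float), self_idx)
    return {t: int((others >= t).sum()) for t in thresholds}

# Hypothetical similarity scores of one title against six titles
# (position 0 is the title itself)
toy_row = [1.0, 0.95, 0.72, 0.55, 0.31, 0.10]
counts = count_above_thresholds(toy_row, self_idx=0)
print(counts)
```

With the notebook's data, passing `cosine_sim[idx]` and `idx` should give the same counts as the loop above, since both apply the same `>=` filter and exclude the query title.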



In [None]:
### Evaluation

In [29]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# --- 1. Load Dataset ---
anime_df = pd.read_csv('anime11.csv')

# --- 2. Data Cleaning ---
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['episodes'] = anime_df['episodes'].replace('Unknown', 0).astype(int)
anime_df['rating'] = anime_df['rating'].fillna(anime_df['rating'].mean())

# --- 3. Feature Extraction ---
anime_df['genre_list'] = anime_df['genre'].apply(lambda x: x.split(', '))

mlb = MultiLabelBinarizer()
genre_encoded = mlb.fit_transform(anime_df['genre_list'])
genre_df = pd.DataFrame(genre_encoded, columns=mlb.classes_)

scaler = MinMaxScaler()
numeric_features = anime_df[['rating', 'members']]
numeric_scaled = scaler.fit_transform(numeric_features)
numeric_df = pd.DataFrame(numeric_scaled, columns=['rating', 'members'])

features = pd.concat([genre_df, numeric_df], axis=1)

# --- 4. Split into Train and Test ---
train_df, test_df = train_test_split(anime_df, test_size=0.2, random_state=42)

train_features = features.iloc[train_df.index]
test_features = features.iloc[test_df.index]

# --- 5. Cosine Similarity ---
train_cosine_sim = cosine_similarity(train_features)

train_anime_indices = pd.Series(train_df.index, index=train_df['name']).drop_duplicates()

# --- 6. Evaluation Function ---
def evaluate_recommendations(test_data, similarity_matrix, index_mapping, top_n=5, threshold=0.5):
    y_true = []
    y_pred = []

    for anime_name in test_data['name']:
        if anime_name not in index_mapping:
            continue

        idx = index_mapping[anime_name]
        sim_scores = list(enumerate(similarity_matrix[idx]))
        sim_scores = [(i, score) for i, score in sim_scores if score >= threshold and i != idx]
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        recommended_indices = [i for i, _ in sim_scores[:top_n]]
        recommended_genres = features.iloc[recommended_indices].idxmax(axis=1)
        target_genre = features.iloc[idx].idxmax()

        hit = any(target_genre == rec_genre for rec_genre in recommended_genres)

        y_true.append(1)  # assume relevance exists
        y_pred.append(1 if hit else 0)

    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    return precision, recall, f1

# --- 7. Run Evaluation ---
precision, recall, f1 = evaluate_recommendations(test_df, train_cosine_sim, train_anime_indices, top_n=5, threshold=0.5)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")


Precision: 0.00
Recall: 0.00
F1-score: 0.00


In [None]:
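"""Reasons for Zero Scores:
Test Titles Are Missing from the Training Index:
train_anime_indices only contains titles from the training split, so a test title is evaluated only if the same name also appears in the training set. With 12,292 unique names among 12,294 rows, almost every test title is skipped.

Index Mismatch:
The mapping stores original DataFrame labels, while train_cosine_sim is indexed by position within the training split, so idx and the recommended indices do not point at the same rows of features.

Constant Labels:
y_true is always 1, so the reported precision and recall measure only a hit rate, not relevance judged against independent ground truth.

Dominant Genre Matching is Insufficient:
Anime often have multiple genres, and idxmax returns only the first genre column set to 1, so partial matches on other genres are missed.

Imbalanced Data:
Some genres are rare, so their dominant genre may not be well-represented in recommendations.

Threshold or Top-N Might Be Too Restrictive:
A high similarity threshold or too few recommendations (top_n=5) may eliminate relevant anime."""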



In [None]:
### Interview Questions

In [None]:
"""1. Can you explain the difference between user-based and item-based collaborative filtering?
User-Based Collaborative Filtering:
Recommends items to a user based on preferences of similar users.
For example, if User A and User B have rated similar items similarly, the system recommends items liked by User B to User A.

Item-Based Collaborative Filtering:
Recommends items similar to those a user has already liked or rated highly.
For example, if a user likes Item X, and Item X is similar to Item Y (because other users rated them similarly), then Item Y is recommended.

Key Difference:

User-Based: Finds similar users first.

Item-Based: Finds similar items first.

2. What is collaborative filtering, and how does it work?
Collaborative Filtering:
A recommendation technique that suggests items based on the interactions or preferences of many users, without needing explicit content information about items.

How it Works:
It relies on the principle that users with similar behaviors or preferences tend to like similar items. Collaborative filtering builds a matrix of user-item interactions (like ratings) and:

Identifies either similar users (user-based) or similar items (item-based).

Recommends items based on these similarities.

There are two main approaches:

Memory-based: Directly uses rating data to find similarities.

Model-based: Uses machine learning models like matrix factorization (e.g., SVD) to predict ratings."""
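
The user-based versus item-based distinction can be illustrated on a tiny rating matrix. This is a memory-based sketch under stated assumptions: the `ratings` values are invented, 0 is taken to mean "not rated", and `cosine_rows` and `predict_item_based` are hypothetical helpers rather than any library's API.

```python
import numpy as np

# Hypothetical user-item rating matrix: rows are users, columns are items,
# and 0 means the user has not rated the item
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_rows(M):
    """Cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    unit = M / norms
    return unit @ unit.T

user_sim = cosine_rows(ratings)      # user-based: compare users (rows)
item_sim = cosine_rows(ratings.T)    # item-based: compare items (columns)

def predict_item_based(user, item):
    """Similarity-weighted average of the ratings the user has already given."""
    rated = ratings[user] > 0
    weights = item_sim[item, rated]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ ratings[user, rated] / weights.sum())

prediction = predict_item_based(user=0, item=2)
print(round(prediction, 2))
```

User 0 rated the item most similar to item 2 poorly, so the predicted rating for item 2 is low. A user-based variant would instead weight other users' ratings of item 2 by the corresponding row of `user_sim`.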