### Import necessary library

In [1]:
# pandas for loading and manipulating the dataset
import pandas as pd

### Load the dataset

In [2]:
file_path = '/content/anime.csv'
data = pd.read_csv(file_path)

In [3]:
# Display the first 5 rows to understand the structure of the data
print("First 5 rows of the dataset:")
data.head()

First 5 rows of the dataset:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


# **Data Preprocessing**

### Handle missing values

In [4]:
# Check for missing values in each column
missing_values = data.isnull().sum()

In [7]:
# Display the result to see which columns have missing values
print("Missing values in each column:")
missing_values

Missing values in each column:


column,missing_count
anime_id,0
name,0
genre,62
type,25
episodes,0
rating,230
members,0
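
The cell above only counts the gaps. One way the three affected columns could be filled is sketched below on a toy frame. The `"Unknown"` placeholder and the median fill are assumptions made for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns that contain gaps in the real data
toy = pd.DataFrame({
    "genre": ["Action, Comedy", np.nan, "Drama"],
    "type": ["TV", "Movie", np.nan],
    "rating": [8.0, np.nan, 6.0],
})

cleaned = toy.copy()
# Assumption: a literal "Unknown" label is an acceptable placeholder for text columns
cleaned["genre"] = cleaned["genre"].fillna("Unknown")
cleaned["type"] = cleaned["type"].fillna("Unknown")
# Assumption: the median is a reasonable stand-in for a missing rating
cleaned["rating"] = cleaned["rating"].fillna(cleaned["rating"].median())

print(cleaned.isnull().sum())
```

Dropping the incomplete rows with `dropna` is the other common option. Which one fits depends on how much data can be given up.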


### Explore the dataset

In [8]:
# Check the shape of the dataset (number of rows and columns)
shape = data.shape
print(f"\nDataset contains {shape[0]} rows and {shape[1]} columns.")


Dataset contains 12294 rows and 7 columns.


In [10]:
# Check data types of each column
data_types = data.dtypes
print("\nData types of each column:")
data_types


Data types of each column:


column,dtype
anime_id,int64
name,object
genre,object
type,object
episodes,object
rating,float64
members,int64
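
The `episodes` column is reported as `object`, which usually means at least one value is not a plain number. The sketch below shows one way to coerce such a column. The `"Unknown"` placeholder in the toy series is an assumption; the real column may hold a different non-numeric value.

```python
import pandas as pd

# Toy stand-in for the 'episodes' column; "Unknown" is a hypothetical placeholder value
toy_episodes = pd.Series(["1", "64", "Unknown", "24"], name="episodes")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
episodes_numeric = pd.to_numeric(toy_episodes, errors="coerce")

print(episodes_numeric)
```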


In [12]:
# Summary statistics for numerical columns
summary_stats = data.describe()
print("\nSummary statistics for numerical columns:")
summary_stats


Summary statistics for numerical columns:


Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


# **Feature Extraction**

In [14]:
# Let's inspect the columns and choose some meaningful features.

# Display the column names to understand what features are available
print("\nAvailable columns in the dataset:")
data.columns


Available columns in the dataset:


Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')

### Selecting relevant features

In [15]:
# For the purpose of this task, we can assume that 'rating' and 'genre' columns might be useful for similarity computation.

# Let's assume 'genre' is a categorical feature and 'rating' is a numerical feature.
# If necessary, adjust this based on the actual dataset content.
selected_features = ['genre', 'rating']  # Replace these with actual relevant feature names from your dataset

### Handle categorical features

In [16]:
# Check that the 'genre' column exists before encoding it
if 'genre' in data.columns:
    # One-hot encode 'genre'. get_dummies treats each full string (e.g. "Sci-Fi, Thriller")
    # as a single category, so every distinct genre combination gets its own 'genre_...' column.
    data_encoded = pd.get_dummies(data, columns=['genre'])
else:
    print("\n'genre' column not found in the dataset.")
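
Because each `genre` value is a comma-separated list, an alternative is one indicator column per individual genre rather than per combination. Here is a minimal sketch on toy data, assuming the separator is a comma followed by a space, as in the sample rows. The names `toy`, `genre_flags` and `toy_encoded` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Drama, Romance", "Action, Drama", np.nan],
})

# Split each string on ", " and create one 0/1 column per individual genre.
# Rows with a missing genre become all zeros.
genre_flags = toy["genre"].str.get_dummies(sep=", ").add_prefix("genre_")

# Attach the indicator columns in place of the original text column
toy_encoded = pd.concat([toy.drop(columns=["genre"]), genre_flags], axis=1)

print(toy_encoded)
```

With this layout, two anime that share some but not all genres have overlapping indicator vectors, which is what a similarity measure needs.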

### Normalizing numerical features

Importing library

In [17]:
# We'll use Min-Max scaling for normalization, which scales values to a range between 0 and 1.

from sklearn.preprocessing import MinMaxScaler

In [18]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()

In [19]:
# Check if 'rating' exists and normalize it
if 'rating' in data.columns:
    data['rating_normalized'] = scaler.fit_transform(data[['rating']])
else:
    print("\n'rating' column not found in the dataset.")
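
Min-Max scaling maps each value `x` to `(x - min) / (max - min)`. The short check below uses three ratings taken from the outputs in this notebook (the minimum 1.67, the maximum 10.0, and 9.37 for the first row) and compares the scaler against the formula applied by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column vector of ratings: dataset minimum, first row's rating, dataset maximum
ratings = np.array([[1.67], [9.37], [10.0]])

scaled = MinMaxScaler().fit_transform(ratings)

# The same transformation written out explicitly
manual = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The middle value comes out at roughly 0.92437, matching the `rating_normalized` entry shown for the first row.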

### Display final dataset with processed features

In [21]:
# Inspect `data` after normalization. It now carries 'rating_normalized'; the one-hot genre columns are stored separately in data_encoded.
print("\nModified dataset after feature extraction:")
data.head()


Modified dataset after feature extraction:


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,rating_normalized
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,0.92437
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665,0.911164
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262,0.909964
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572,0.90036
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266,0.89916


# **Recommendation System**

### Import necessary libraries for cosine similarity

In [22]:
# We will use cosine similarity from the sklearn library to calculate similarity between anime based on the selected features
from sklearn.metrics.pairwise import cosine_similarity

### Prepare feature set for similarity computation

In [24]:
# Collect the feature columns: every column whose name contains 'genre_' plus 'rating_normalized'.
# Note that the one-hot 'genre_' columns were written to data_encoded, not to data, so the
# comprehension below matches nothing in data.columns and the feature list ends up holding
# only 'rating_normalized'.
if 'rating_normalized' in data.columns:
    # Build the list of feature column names
    feature_columns = [col for col in data.columns if 'genre_' in col]  # Genre-related columns present in data
    feature_columns.append('rating_normalized')  # Add normalized rating

    # Extract the features for all anime into a new DataFrame
    feature_matrix = data[feature_columns]
else:
    print("\nNormalized 'rating' column not found. Please ensure that it exists.")

In [35]:
# Fill missing values in the feature matrix with 0
feature_matrix = feature_matrix.fillna(0)
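
A feature matrix that holds both genre indicators and the normalized rating can be assembled by joining the two pieces column-wise. This is a sketch on toy values; `genre_flags` and `feature_matrix_sketch` are hypothetical names, and filling a missing rating with 0 simply mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy genre indicators (one row per anime) and a normalized rating with one gap
genre_flags = pd.DataFrame({"genre_Action": [1, 0, 1], "genre_Drama": [0, 1, 1]})
rating_normalized = pd.Series([0.9, np.nan, 0.4], name="rating_normalized")

# Join on the shared row index, then replace missing values with 0
feature_matrix_sketch = pd.concat([genre_flags, rating_normalized], axis=1).fillna(0)

print(feature_matrix_sketch)
```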

### Cosine similarity calculation

In [36]:
# We'll use the cosine similarity function to compute similarity scores between the target anime and all others
# Function to recommend similar anime
def recommend_anime(target_anime_index, feature_matrix, threshold=0.5):
    """
    Recommend a list of anime based on cosine similarity.

    Parameters:
    target_anime_index (int): The index of the anime in the dataset to use as the target for recommendations.
    feature_matrix (DataFrame): A DataFrame of the feature set used for computing similarity.
    threshold (float): A threshold value for filtering recommendations (0 to 1). Higher values give fewer recommendations.

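    Returns:
    A tuple (similar_anime_indices, similarity_scores): the indices of anime whose cosine similarity to the target exceeds the threshold, and the array of similarity scores for every anime.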
    """

    # Calculate the cosine similarity between the target anime and all others
    # Cosine similarity is computed between the target anime's features and the features of every other anime
    similarity_scores = cosine_similarity([feature_matrix.iloc[target_anime_index]], feature_matrix).flatten()


    # Filter the results based on threshold
    # We can filter the similarity scores to only return anime with scores greater than the threshold
    similar_anime_indices = [i for i, score in enumerate(similarity_scores) if score > threshold and i != target_anime_index]

    return similar_anime_indices, similarity_scores
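
A threshold can return thousands of matches in no particular order. A common variant ranks the candidates by score and keeps only the best `n`. The helper below, `top_n_similar`, is a hypothetical sketch of that idea on a toy matrix, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_n_similar(target_index, features, n=3):
    """Hypothetical helper: indices and scores of the n rows most similar to the target."""
    features = np.asarray(features, dtype=float)
    scores = cosine_similarity(features[target_index:target_index + 1], features).ravel()
    scores[target_index] = -np.inf  # make sure the target never recommends itself
    order = np.argsort(-scores, kind="stable")[:n]  # highest score first
    return order.tolist(), scores[order].tolist()

# Toy multi-hot features: rows 0 and 1 are identical, row 2 overlaps partly, row 3 not at all
toy_features = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])

indices, scores = top_n_similar(0, toy_features, n=2)
print(indices, scores)
```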

### Testing recommendation function

In [37]:
# Let's pick an example anime (say at index 0) and get a list of similar anime
target_anime_index = 0  # You can change this to test with different anime
threshold = 0.7  # You can adjust this to get more or fewer recommendations

In [38]:
# Get recommendations
similar_anime, similarity_scores = recommend_anime(target_anime_index, feature_matrix, threshold)

### Display recommended anime and their similarity scores

In [39]:
# For simplicity, we'll display the indices of the recommended anime.
print(f"\nAnime at index {target_anime_index} is similar to the following anime (indices):")
for anime_index in similar_anime:
    print(f"Anime Index: {anime_index}, Similarity Score: {similarity_scores[anime_index]}")

Streaming output truncated to the last 5000 lines.
Anime Index: 7063, Similarity Score: 1.0
Anime Index: 7064, Similarity Score: 1.0
Anime Index: 7065, Similarity Score: 1.0
Anime Index: 7066, Similarity Score: 1.0
Anime Index: 7067, Similarity Score: 1.0
Anime Index: 7068, Similarity Score: 1.0
Anime Index: 7069, Similarity Score: 1.0
Anime Index: 7070, Similarity Score: 1.0
Anime Index: 7071, Similarity Score: 1.0
Anime Index: 7072, Similarity Score: 1.0
Anime Index: 7073, Similarity Score: 1.0
Anime Index: 7074, Similarity Score: 1.0
Anime Index: 7075, Similarity Score: 1.0
Anime Index: 7076, Similarity Score: 1.0
Anime Index: 7077, Similarity Score: 1.0
Anime Index: 7078, Similarity Score: 1.0
Anime Index: 7079, Similarity Score: 1.0
Anime Index: 7080, Similarity Score: 1.0
Anime Index: 7081, Similarity Score: 1.0
Anime Index: 7082, Similarity Score: 1.0
Anime Index: 7083, Similarity Score: 1.0
Anime Index: 7084, Similarity Score: 1.0
Anime Index: 7085, Similarity Sco
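
A likely explanation for the uniform 1.0 scores: cosine similarity measures only the angle between vectors. If the feature matrix has a single non-negative column, every non-zero row points in the same direction. The toy check below illustrates this with made-up values.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One feature per item; the last item has the value 0
single_feature = np.array([[0.92], [0.50], [0.01], [0.0]])

# Compare the first item against all items
scores = cosine_similarity(single_feature[:1], single_feature).ravel()

print(scores)
```

Every positive value scores 1.0 against the target regardless of magnitude, and the all-zero row scores 0. Differences only appear once the matrix has several columns.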

# **Evaluation**

### Importing necessary libraries for evaluation

In [40]:
# We'll use train_test_split for splitting the data, and precision, recall, and F1-score metrics from sklearn for evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

### Split dataset into training and testing sets

In [42]:
# Split the anime indices 80/20 so the recommender can be evaluated on a held-out 20% sample.
# The similarity-based recommender has no training step, so only test_indices is used below.

# Use the DataFrame index as the anime identifier and create a random split
train_indices, test_indices = train_test_split(data.index, test_size=0.2, random_state=42)

### Simulating ground truth for evaluation

In [43]:
# In real systems, you might have user-item interactions or a ground-truth set of similar anime.
# For simplicity, we simulate the "ground truth" from the cosine similarity function itself:
# the first k anime (in dataset order) whose similarity to the target is above 0.

def get_ground_truth(anime_index, k=5):
    """
    Simulate ground truth for a given anime.

    recommend_anime returns matching indices in dataset order, not ranked by score,
    so this takes the first k indices with similarity > 0 rather than the k highest-scoring ones.
    """
    similar_anime, similarity_scores = recommend_anime(anime_index, feature_matrix, threshold=0)
    return similar_anime[:k]  # First k matching indices, used as the simulated ground truth
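
Taking the first `k` entries of an unsorted list is not the same as taking the `k` best-scoring ones. The sketch below contrasts the two on made-up scores; `top_k_by_score` is a hypothetical helper.

```python
def top_k_by_score(candidate_indices, similarity_scores, k=5):
    """Hypothetical helper: the k candidates with the highest similarity score."""
    ranked = sorted(candidate_indices, key=lambda i: similarity_scores[i], reverse=True)
    return ranked[:k]

# Made-up scores for five items; candidates are listed in dataset order
scores = [1.0, 0.2, 0.9, 0.0, 0.6]
candidates = [1, 2, 4]

first_two = candidates[:2]                               # dataset order
best_two = top_k_by_score(candidates, scores, k=2)       # ranked by score

print(first_two, best_two)
```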

### Evaluating the recommendation system

In [None]:
# We'll evaluate the system by checking how many of the top recommended anime are in the ground truth
# For each test anime, we compare the system's recommendations with the ground truth and compute precision, recall, and F1-score

In [44]:
def evaluate_recommendations(test_indices, k=5, threshold=0.7):
    """
    Evaluate the recommendation system on the test set using precision, recall, and F1-score.

    Parameters:
    test_indices (list): A list of test set anime indices.
    k (int): Number of relevant anime in the ground truth.
    threshold (float): The similarity threshold for recommendations.

    Returns:
    Precision, Recall, F1-score for the recommendations.
    """
    precisions = []
    recalls = []
    f1_scores = []

    for test_index in test_indices:
        # Get ground truth (top k similar anime)
        ground_truth = get_ground_truth(test_index, k=k)

        # Get recommendations (anime with similarity score above threshold)
        recommended_anime, _ = recommend_anime(test_index, feature_matrix, threshold=threshold)

        # Convert both ground truth and recommendations to sets for comparison
        ground_truth_set = set(ground_truth)
        recommended_set = set(recommended_anime)

        # Calculate precision, recall, and F1-score for this anime
        if recommended_set:
            precision = len(ground_truth_set & recommended_set) / len(recommended_set)
            recall = len(ground_truth_set & recommended_set) / len(ground_truth_set)
        else:
            precision = 0
            recall = 0

        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    # Return the average precision, recall, and F1-score across all test items
    return {
        "Precision": sum(precisions) / len(precisions),
        "Recall": sum(recalls) / len(recalls),
        "F1-score": sum(f1_scores) / len(f1_scores)
    }
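
The per-item arithmetic inside the loop can be isolated in a small function. The worked example uses toy numbers, not values from the dataset, to show how a very long recommendation list drives precision down while recall stays high. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(recommended, relevant):
    """Hypothetical helper: set-based precision, recall, and F1 for one target item."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 1000 recommended items that happen to include all 5 relevant ones
precision, recall, f1 = precision_recall_f1(
    recommended=range(1000),
    relevant=[3, 7, 11, 19, 42],
)

print(precision, recall, f1)
```

All five relevant items are found, so recall is 1.0, but they make up only 5 of the 1000 recommendations, so precision is 0.005 and F1 stays close to it.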

### Run the evaluation

In [45]:
# Let's run the evaluation for our recommendation system on the test set.
evaluation_results = evaluate_recommendations(test_indices, k=5, threshold=0.7)

### Display evaluation results

In [47]:
print("\nEvaluation Results:")
print(f"Precision: {evaluation_results['Precision']:.4f}")
print(f"Recall: {evaluation_results['Recall']:.4f}")
print(f"F1-score: {evaluation_results['F1-score']:.4f}")


Evaluation Results:
Precision: 0.0004
Recall: 0.9809
F1-score: 0.0008


# **Interview Questions**

**1. Can you explain the difference between user-based and item-based collaborative filtering?**

User-based collaborative filtering recommends items to a user by identifying other users with similar preferences and suggesting items that those similar users liked. It relies on the idea that if users A and B have rated items similarly in the past, user A may enjoy an item that user B has liked but A hasn't tried yet. On the other hand, item-based collaborative filtering focuses on the relationships between items rather than users. It recommends items based on their similarity to items the user has already liked or rated highly. For example, if a user liked one movie, item-based filtering would suggest movies that are similar to the one they liked, based on the preferences of all users.
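
The contrast can be made concrete with a tiny ratings matrix: user-based filtering compares rows (users), while item-based filtering compares columns (items). The sketch below uses made-up ratings and treats 0 as "not rated", which is a simplifying assumption. `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (simplifying assumption)
ratings = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [0, 1, 5],
], dtype=float)

def cosine_matrix(m):
    """Hypothetical helper: pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1, norms)  # leave all-zero rows untouched
    return unit @ unit.T

user_sim = cosine_matrix(ratings)      # user-based: similarity between users
item_sim = cosine_matrix(ratings.T)    # item-based: similarity between items

print(np.round(user_sim, 3))
print(np.round(item_sim, 3))
```

Users 0 and 1 rate the first two items alike, so they come out as near neighbours. Items 0 and 1 are liked by the same users, so they come out as similar items.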

**2. What is collaborative filtering, and how does it work?**

Collaborative filtering is a recommendation technique that predicts a user's preferences from the behavior and preferences of other users. It works by exploiting patterns of shared behavior: users with similar tastes are grouped together, and recommendations are drawn from what those similar users did. Collaborative filtering can be user-based, where recommendations come from users with similar preferences, or item-based, where items similar to the ones a user has liked are recommended. It does not rely on the content of the items but on user interactions, such as ratings, clicks, or purchases, to make predictions.
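
As a rough illustration of the prediction step, the sketch below estimates a missing rating as a similarity-weighted average of what other users gave the same item. The ratings are made up, and `predict_user_based` is a hypothetical helper. It assumes strictly positive ratings so the vector norms are never zero.

```python
import numpy as np

# Rows are users, columns are items; NaN marks an item the user has not rated
ratings = np.array([
    [5.0, 4.0, np.nan],   # user 0 has not rated item 2
    [5.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])

def predict_user_based(ratings, user, item):
    """Hypothetical helper: similarity-weighted average of other users' ratings for item."""
    weights, values = [], []
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, item]):
            continue
        # Compare the two users only on items both of them rated
        shared = ~np.isnan(ratings[user]) & ~np.isnan(ratings[other])
        if not shared.any():
            continue
        a, b = ratings[user, shared], ratings[other, shared]
        similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        weights.append(similarity)
        values.append(ratings[other, item])
    if not weights:
        return float("nan")
    return float(np.average(values, weights=weights))

prediction = predict_user_based(ratings, user=0, item=2)
print(round(prediction, 3))
```

User 1 rates the shared items more like user 0 than user 2 does, so the prediction is pulled toward user 1's rating of 2 rather than user 2's rating of 5.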