# Recommendation System

## Objective:
The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset. 
## Dataset:
Use the Anime Dataset which contains information about various anime, including their titles, genres,No.of episodes and user ratings etc.


## Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.



In [3]:
import pandas as pd

# Load the dataset
anime_df = pd.read_csv('anime.csv')

# Display basic information about the dataset
print(anime_df.info())
anime_df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [4]:
# Check for missing values
print(anime_df.isnull().sum())

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [5]:
anime_df.dropna(subset=['rating', 'genre'], inplace=True)

In [10]:
anime_df.isnull().sum()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [11]:
# Summary statistics
print(anime_df.describe())

# Explore unique genres
print(anime_df['genre'].unique())

           anime_id        rating       members
count  12017.000000  12017.000000  1.201700e+04
mean   13638.001165      6.478264  1.834888e+04
std    11231.076675      1.023857  5.537250e+04
min        1.000000      1.670000  1.200000e+01
25%     3391.000000      5.890000  2.250000e+02
50%     9959.000000      6.570000  1.552000e+03
75%    23729.000000      7.180000  9.588000e+03
max    34519.000000     10.000000  1.013917e+06
['Drama, Romance, School, Supernatural'
 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen'
 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen' ...
 'Action, Comedy, Hentai, Romance, Supernatural' 'Hentai, Sports'
 'Hentai, Slice of Life']


### Feature Extraction:
##### a.Convert categorical features into numerical representations (e.g., using TF-IDF for genres):

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Convert genres into a numerical representation using TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')
anime_df['genre'] = anime_df['genre'].fillna('')
tfidf_matrix = tfidf.fit_transform(anime_df['genre'])

# Normalize the rating column
scaler = MinMaxScaler()
anime_df['rating'] = scaler.fit_transform(anime_df[['rating']])


##### b.Combine the TF-IDF matrix with the normalized ratings:

In [13]:
from scipy.sparse import hstack

# Combine the TF-IDF matrix with the normalized ratings
features_matrix = hstack([tfidf_matrix, anime_df[['rating']]])


## Task 3: Recommendation System
.Compute cosine similarity and create a recommendation function:

In [20]:
# Compute cosine similarity matrix
cosine_sim = cosine_similarity(features_matrix, features_matrix)

# Function to get recommendations
def get_recommendations(title, cosine_sim=cosine_sim):
    try:
        # Get the index of the anime that matches the title
        idx = anime_df[anime_df['name'] == title].index[0]

        # Get the pairwise similarity scores of all anime with that anime
        sim_scores = list(enumerate(cosine_sim[idx]))

        # Sort the anime based on the similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # Get the scores of the 10 most similar anime
        sim_scores = sim_scores[1:11]

        # Get the anime indices
        anime_indices = [i[0] for i in sim_scores]

        # Return the top 10 most similar anime
        return anime_df['name'].iloc[anime_indices]
    except IndexError:
        return "Anime not found in the dataset."



## Evaluation
##### a.Split the dataset into training and testing sets (for user-item interaction evaluation, we'll need to simulate this):

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

# Simulate user-anime interaction data
np.random.seed(42)
user_anime_interactions = pd.DataFrame({
    'user_id': np.random.randint(1, 1001, 10000),
    'anime_id': np.random.randint(0, len(anime_df), 10000),
    'rating': np.random.randint(1, 6, 10000)
})

# Merge to get anime names
user_anime_interactions = user_anime_interactions.merge(anime_df[['anime_id', 'name']], left_on='anime_id', right_on='anime_id')

# Train-test split
train, test = train_test_split(user_anime_interactions, test_size=0.2, random_state=42)


##### b.Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.


In [22]:
# For each user in the test set, get the ground truth and the predicted recommendations
def evaluate_recommendations(train, test, k=10):
    y_true = []
    y_pred = []

    for user_id in test['user_id'].unique():
        # Get the anime watched by the user in the training set
        watched_anime_train = train[train['user_id'] == user_id]['name'].tolist()

        # Get the anime watched by the user in the test set (ground truth)
        watched_anime_test = test[test['user_id'] == user_id]['name'].tolist()
        y_true.extend([1] * len(watched_anime_test))

        # Get recommendations for each watched anime
        user_recommendations = []
        for anime in watched_anime_train:
            recommended_anime = get_recommendations(anime)
            user_recommendations.extend(recommended_anime)

        # Flatten and get the top k recommendations
        user_recommendations = list(set(user_recommendations))
        y_pred.extend([1 if anime in user_recommendations else 0 for anime in watched_anime_test])

    precision = precision_score(y_true, y_pred, average='binary')
    recall = recall_score(y_true, y_pred, average='binary')
    f1 = f1_score(y_true, y_pred, average='binary')

    return precision, recall, f1

# Evaluate the recommendations
precision, recall, f1 = evaluate_recommendations(train, test)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")


Precision: 1.0
Recall: 0.0036231884057971015
F1-Score: 0.007220216606498195
