## 1. Building Basic Similiraty Matrix Model

A **Similarity Matrix Model** is a powerful tool used to measure the similarity between different data points or items, based on some metric or criterion. In this model, a similarity matrix (often denoted as a square matrix) is constructed where each element represents the similarity between two data points. This concept is widely applied in various domains like collaborative filtering for recommendation systems, clustering, text mining, and more.

### Key Components of a Similarity Matrix Model:

1. **Data Points/Items**: These are the entities between which similarity is measured. For example, in an anime recommender system, each data point could be an anime show, and the similarity matrix would measure how similar one anime is to another based on features such as genre, user ratings, etc.

2. **Similarity Metric**: The matrix is populated using a chosen similarity metric. Common metrics include:
   - **Cosine Similarity**: Measures the cosine of the angle between two vectors representing items. Useful when the magnitude of vectors isn’t as important as their direction.
   - **Euclidean Distance**: Measures the straight-line distance between two vectors. Lower values imply higher similarity.
   - **Pearson Correlation**: Measures how linearly related two items are.
   - **Jaccard Similarity**: Used for binary or set data, it measures the size of the intersection divided by the size of the union of the sets.

3. **Matrix Construction**: A square matrix is constructed where each element (i, j) represents the similarity between item \(i\) and item \(j\). The diagonal of the matrix typically has values of 1, since each item is identical to itself.

### Example Workflow for Building a Similarity Matrix Model:

1. **Preprocessing**:
   - Normalize or standardize the data (e.g., scaling feature vectors).
   - If dealing with textual data, you might use techniques like TF-IDF to vectorize the data.

2. **Compute Similarity**:
   - Use a similarity metric (e.g., cosine similarity) to compute the pairwise similarity between all items in your dataset.
   - This produces a square matrix where each element (i, j) shows how similar item \(i\) is to item \(j\).

3. **Post-Processing**:
   - You can threshold the similarity values to focus on the most relevant similarities (e.g., only considering items with a similarity score above 0.5).

4. **Usage**:
   - **Clustering**: Grouping similar items based on high similarity scores.
   - **Recommendation**: In a recommendation system, you can use the similarity matrix to suggest items that are most similar to what the user has previously interacted with.
   - **Visualization**: Heatmaps can be used to visualize similarity matrices, helping to understand item clusters or relationships.


### Applications:
- **Anime Recommender Systems**: Using anime features like genres, ratings, and tags to recommend similar shows.
- **Collaborative Filtering**: In a user-item matrix, a similarity matrix of users or items can be used to make recommendations by finding the most similar users or items.
- **Clustering**: You can apply algorithms like hierarchical clustering to a similarity matrix to group similar data points.


### 1.1 Creating Sparse Matrix
A **Sparse Matrix** is a matrix in which most of the elements are zero. In many applications involving large datasets, matrices can be extremely large but contain mostly zero values. Instead of storing all the zeros explicitly, sparse matrices allow us to store only the non-zero values, which can drastically reduce memory usage and computational costs.

### Key Properties of Sparse Matrices:
1. **Sparsity**: A matrix is considered sparse if the proportion of zero values to non-zero values is high. In other words, most elements are zero.
   
2. **Storage Efficiency**: Sparse matrices are stored using specific data structures that only store non-zero elements and their indices. This improves both memory efficiency and speed in operations.

3. **Performance**: Many linear algebra operations (like matrix multiplication or solving systems of equations) can be optimized for sparse matrices, making computations faster than with dense matrices.

### Applications:
- **Recommendation Systems**: User-item interaction matrices are often sparse because users rate or interact with only a small fraction of available items (e.g., movies, products, anime, etc.).
- **Graph Theory**: Adjacency matrices for large graphs are typically sparse, especially when the graph has fewer edges relative to the number of vertices.
- **Machine Learning**: Large feature matrices (like those used in natural language processing for text) are often sparse, as only a small number of features (e.g., words) are present in any given document.

### Representation of Sparse Matrices:
To efficiently store sparse matrices, different formats are used:

1. **Compressed Sparse Row (CSR)**:
   - Stores non-zero values row by row.
   - Three arrays are used: one for values, one for column indices, and one for row offsets.
   - Efficient for row slicing and matrix-vector multiplication.

2. **Compressed Sparse Column (CSC)**:
   - Similar to CSR, but stores non-zero elements column by column.
   - Efficient for column slicing.

3. **Coordinate List (COO)**:
   - Stores the row indices, column indices, and values of the non-zero elements.
   - Useful for constructing sparse matrices before converting them to CSR or CSC for more efficient computations.

4. **Dictionary of Keys (DOK)**:
   - Uses a dictionary where keys are tuples of (row, column) and values are the non-zero elements.
   - Efficient for incremental matrix construction, but slower for large matrices once built.


### Sparse Matrix in Recommendation Systems:
For example, in an **Anime Recommender System**, you could have a sparse matrix where:
- Rows represent users.
- Columns represent anime shows.
- Each entry (i, j) in the matrix is the rating given by user \(i\) to anime \(j\).
  
Most users won't have rated every anime, so this matrix is likely to be sparse. You can then use matrix factorization techniques (like SVD or collaborative filtering) on this sparse matrix to make recommendations.

### Advantages of Sparse Matrices:
- **Memory Efficiency**: Since we only store non-zero elements, it saves a lot of memory compared to dense matrices.
- **Computational Efficiency**: Operations like matrix multiplication, solving linear systems, and decompositions can be made more efficient by leveraging the sparsity of the matrix.

### Use Cases:
- **Large-scale Data Analysis**: Sparse matrices are frequently used in fields where the data size is very large but the number of meaningful (non-zero) entries is small.
- **Text Processing**: Term-document matrices in natural language processing are usually sparse, as any single document only contains a small subset of all possible words.
  
Would you like help with a specific use case of sparse matrices in your project?


Creating the sparse matrix both train and test dataset because it will be more efficient way to store such a huge matrix as most of the users has given very few rating.

#### 1.1.1 Creating Sparse Matrix for Training Dataset

In [3]:
import os
import random
import pandas as pd
import numpy as np
from tqdm import tqdm

import matplotlib 
# matplotlib.use('QtAgg')
# %matplotlib.useinline
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import seaborn as sns
sns.set_style('whitegrid')
from scipy import sparse
from scipy.sparse import csr_matrix

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
import pickle
from scipy import sparse
from scipy.sparse.linalg import svds
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor


import surprise as sp
from surprise import accuracy
from surprise.prediction_algorithms.knns import KNNBasic
from surprise.prediction_algorithms.knns import KNNWithMeans
from surprise.prediction_algorithms.knns import KNNBaseline
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms.matrix_factorization import NMF

In [4]:
df=pd.read_csv('/Users/nishchal_mac/Desktop/Data_Science/Anime_Recommender_System/notebooks/merged_df.csv')

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19073095 entries, 0 to 19073094
Data columns (total 14 columns):
 #   Column               Dtype 
---  ------               ----- 
 0   username             object
 1   anime_id             int64 
 2   my_watched_episodes  int64 
 3   my_score             int64 
 4   my_status            int64 
 5   my_last_updated      object
 6   title                object
 7   type                 object
 8   source               object
 9   episodes             int64 
 10  studio               object
 11  genre                object
 12  user_id              int64 
 13  gender               object
dtypes: int64(6), object(8)
memory usage: 2.0+ GB


In [49]:
df['user_id'] = df['user_id'].astype('object')
df['anime_id'] = df['anime_id'].astype('object')

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19073095 entries, 0 to 19073094
Data columns (total 14 columns):
 #   Column               Dtype 
---  ------               ----- 
 0   username             object
 1   anime_id             object
 2   my_watched_episodes  int64 
 3   my_score             int64 
 4   my_status            int64 
 5   my_last_updated      object
 6   title                object
 7   type                 object
 8   source               object
 9   episodes             int64 
 10  studio               object
 11  genre                object
 12  user_id              object
 13  gender               object
dtypes: int64(4), object(10)
memory usage: 2.0+ GB


In [51]:
# sorting the dataframe dased on my_last_updated column
df.sort_values(by='my_last_updated', inplace=True)

# spliting dataframe into train and test dataframe
df_train = df.iloc[:int(df.shape[0]*0.80)]
df_test = df.iloc[int(df.shape[0]*0.80):]

print("df_train shape :", df_train.shape)
print("df_test shape :", df_test.shape)

df_train shape : (15258476, 14)
df_test shape : (3814619, 14)


In [53]:
# Creating the Train Sparse Matrix of df_train with my_score, user_id and anime_user_id columns
sparse_matrix_train = sparse.csr_matrix((df_train.my_score.values, (df_train.user_id.values, df_train.anime_id.values)))  
sparse_matrix_train

<5533462x33559 sparse matrix of type '<class 'numpy.int64'>'
	with 15258476 stored elements in Compressed Sparse Row format>

In [54]:
# Printing the Sparsity of Train Sparse Matrix
row = len(np.unique(df_train.user_id))
column = len(np.unique(df_train.anime_id))
count = sparse_matrix_train.count_nonzero()

print('The Sparsity Of Train matrix : ', (1-count/(row*column))*100, '%')

The Sparsity Of Train matrix :  97.24943747451307 %


In [55]:
# saving the train sparse matrix
sparse.save_npz('sparse_matrix_train.npz', sparse_matrix_train)

#### 1.1.2 Creating Test Sparse Matrix

In [56]:
# Creating the Test Sparse Matrix of df_test with my_score, user_id and anime_user_id columns
sparse_matrix_test = sparse.csr_matrix((df_test.my_score.values, (df_test.user_id.values, df_test.anime_id.values)))
sparse_matrix_test

<7250031x37861 sparse matrix of type '<class 'numpy.int64'>'
	with 3814619 stored elements in Compressed Sparse Row format>

In [57]:
# Printing the Sparsity of Test Sparse Matrix
row = len(np.unique(df_train.user_id))
column = len(np.unique(df_train.anime_id))
count = sparse_matrix_test.count_nonzero()

print('The Sparsity Of Test matrix : ', (1-count/(row*column))*100, '%')

The Sparsity Of Test matrix :  99.31235936862826 %


In [59]:
# saving the test sparse matrix
sparse.save_npz('sparse_matrix_test.npz', sparse_matrix_test)

In [60]:
# Loading both Train and Test Sparse Matrix
sparse_matrix_train = sparse.load_npz('/Users/nishchal_mac/Desktop/Data_Science/Anime_Recommender_System/notebooks/sparse_matrix_train.npz')
sparse_matrix_test = sparse.load_npz('/Users/nishchal_mac/Desktop/Data_Science/Anime_Recommender_System/notebooks/sparse_matrix_test.npz')

### 1.2 Creating Anime - Anime Similarity Matrix
The goal is to compute the similarity between different anime shows based on certain features (e.g., genres, user ratings, synopsis, or other metadata). This matrix will help recommend similar anime shows to users.

we will use **cosine similarity** 

**Cosine Similarity** is a metric used to measure how similar two non-zero vectors are, regardless of their magnitude. It calculates the cosine of the angle between the two vectors in a multi-dimensional space. The value of cosine similarity ranges from -1 to 1:

- **1**: Indicates that the two vectors are identical (pointing in the same direction).
- **0**: Indicates that the two vectors are orthogonal (no similarity).
- **-1**: Indicates that the two vectors are diametrically opposed (pointing in opposite directions).

The formula for cosine similarity is:

$
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}
$

where:
- $A$ and $B$ are the vectors,
- $A \cdot B$ is the dot product of the vectors,
- $\|A\|$ and $\|B\|$ are the magnitudes (norms) of the vectors.

Cosine similarity is commonly used in various applications, including text analysis, recommendation systems, and clustering, because it effectively captures the direction of the vectors, making it robust against differences in magnitude.



In [61]:
# Computing Anime Anime Similarity Matrix using Cosine Similarity on Train Sparse Matrix
anime_anime_similarity_matrix = cosine_similarity(X = sparse_matrix_train.T, dense_output = False)

# Getting the top 50 similar animes for each anime and storing them in the dictionary
similar_anime_dict = dict()
anime_id = np.unique(df_train['anime_id'].values)
for id in anime_id:
    similar_anime = anime_anime_similarity_matrix[id].toarray().ravel().argsort()[::-1][1:]
    similar_anime_dict[id] = similar_anime[:10]

# Getting the anime id and their anime title and storing it in the dictonary 
anime_id_and_title_dict = dict()
for id in tqdm(anime_id):
    title = df_train['title'][df_train['anime_id']==id].values
    if len(title) > 0:
        anime_id_and_title_dict[id] = title[0]
    else:
        anime_id_and_title_dict[id] = title

# saving anime_id_and_title_dict dictonary
a_file = open("anime_id_and_title_dict.pkl", "wb")
pickle.dump(anime_id_and_title_dict, a_file)
a_file.close()

100%|██████████| 5594/5594 [40:40<00:00,  2.29it/s]


In [62]:
# printing the Top 10 similar Anime for a particular anime
anime_id = 1
top_anime = similar_anime_dict[anime_id][:10]
print('In Anime-Anime Similarity Matrix the Top 10 similar Animes for \'{}\' Anime are :'.format(anime_id_and_title_dict[anime_id]))
for i in top_anime:
    print(anime_id_and_title_dict[i])

In Anime-Anime Similarity Matrix the Top 10 similar Animes for 'Cowboy Bebop' Anime are :
Cowboy Bebop: Tengoku no Tobira
Samurai Champloo
Trigun
Neon Genesis Evangelion
FLCL
Tengen Toppa Gurren Lagann
Fullmetal Alchemist
Akira
Black Lagoon
Ghost in the Shell


In [63]:
similar_anime_dict.keys()

dict_keys([1, 5, 6, 7, 8, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 173, 174, 175, 177, 178, 180, 181, 182, 183, 184, 185, 186, 187, 189, 190, 193, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 212, 216, 218, 219, 221, 222, 223, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 248, 249, 250, 251, 252, 25

In [64]:
def similar_anime_find(anime_id):
    # Retrieve the top 10 similar anime IDs
    top_anime = similar_anime_dict[anime_id][:10]
    
    # Print the title of the queried anime
    print(f"\n🌟 Top 10 Similar Anime for '{anime_id_and_title_dict[anime_id]}' 🌟")
    print("=" * 50)
    
    # Iterate through the top similar anime IDs and print their titles
    for i, similar_id in enumerate(top_anime, start=1):
        print(f"{i}. {anime_id_and_title_dict[similar_id]}")
    
    print("=" * 50)


In [66]:
similar_anime_find(5)


🌟 Top 10 Similar Anime for 'Cowboy Bebop: Tengoku no Tobira' 🌟
1. Cowboy Bebop
2. Ghost in the Shell
3. Akira
4. Trigun
5. Samurai Champloo
6. FLCL
7. Neon Genesis Evangelion
8. Neon Genesis Evangelion: The End of Evangelion
9. Ghost in the Shell: Stand Alone Complex
10. Mononoke Hime


### 1.3 Sampling Data
As the computational time required to compute even a simple matrix factorization on User-Anime matrix will be very high, so we will sample the data by randomly selecting 10k users.

In [67]:
df = pd.concat([df_train, df_test], ignore_index=True)

# Finding all unique user IDs
df_unique_user_id = pd.DataFrame()
df_unique_user_id['user_id'] = np.unique(df['user_id'].values)

# Sampling users randomly
df_user_sample = df_unique_user_id.sample(n=10000, random_state=42)  # Added random_state for reproducibility

# Creating a complete DataFrame for sampled users
df_sample = pd.merge(df_user_sample, df, on='user_id')

# Sorting the sample DataFrame with respect to 'my_last_updated'
df_sample.sort_values(by='my_last_updated', inplace=True)

# Splitting the sample data into train and test sample DataFrames
df_train_sample = df_sample.iloc[:int(df_sample.shape[0] * 0.80)]
df_test_sample = df_sample.iloc[int(df_sample.shape[0] * 0.80):]

# Printing the shapes of the train and test sample DataFrames
print('df_train_sample shape:', df_train_sample.shape)
print('df_test_sample shape:', df_test_sample.shape)


df_train_sample shape: (1435525, 14)
df_test_sample shape: (358882, 14)


In [68]:
df_train_sample.head(2)

Unnamed: 0,user_id,username,anime_id,my_watched_episodes,my_score,my_status,my_last_updated,title,type,source,episodes,studio,genre,gender
1070303,695,linoleumxx,430,1,10,2,2006-10-04,Fullmetal Alchemist: The Conqueror of Shamballa,Movie,Manga,1,Bones,"Military, Comedy, Historical, Drama, Fantasy, ...",Male
955588,912,Keiyori,120,26,7,2,2006-10-08,Fruits Basket,TV,Manga,26,Studio Deen,"Slice of Life, Comedy, Drama, Romance, Fantasy...",Male


In [69]:
df_test_sample.head(2)

Unnamed: 0,user_id,username,anime_id,my_watched_episodes,my_score,my_status,my_last_updated,title,type,source,episodes,studio,genre,gender
1663352,4538160,MashouMax,31414,24,5,2,2016-06-26,Nijiiro Days,TV,Manga,24,Production Reed,"Comedy, Romance, School, Shoujo, Slice of Life",Male
581427,4485194,scorpion905,31964,13,5,2,2016-06-26,Boku no Hero Academia,TV,Manga,13,Bones,"Action, Comedy, School, Shounen, Super Power",Male


### 1.4 Precision@k Metric Evaluation
**Precision@k** is a performance metric commonly used in information retrieval and recommendation systems to evaluate the relevance of the top $ k $ items recommended to a user. It measures the proportion of relevant items in the top $ k $ recommendations compared to the total number of items in that list.

### Formula:
The formula for Precision@k is:

$
\text{Precision@k} = \frac{\text{Number of Relevant Items in Top } k}{k}
$

Where:
- **Number of Relevant Items in Top $ k $**: This is the count of items among the top $ k $ recommended items that are relevant to the user.
- $ k $: The number of top recommendations being evaluated.

### Steps to Calculate Precision@k:
1. **Generate Recommendations**: Obtain the top $ k $ recommended items for a user.
2. **Identify Relevant Items**: Determine which of these $ k $ items are relevant based on the user’s preferences or past interactions.
3. **Count Relevant Items**: Count how many of the top $ k $ recommendations are actually relevant.
4. **Calculate Precision**: Use the formula to compute Precision@k.

### Example:
Suppose you have a user for whom the recommendation system suggests the following 5 items (i.e., $ k = 5 $):

- Recommended Items: [Item A, Item B, Item C, Item D, Item E]

Assuming the relevant items for this user are:

- Relevant Items: [Item A, Item C, Item F, Item G]

Among the recommended items, only **Item A** and **Item C** are relevant.

- Number of Relevant Items in Top $ k  = 2 $ (Item A and Item C)
- $ k = 5 $

Thus, the Precision@5 would be calculated as:

$
\text{Precision@5} = \frac{2}{5} = 0.4
$

### Interpretation:
- A higher Precision@k indicates that a larger proportion of the top $ k $ recommendations are relevant to the user, suggesting better performance of the recommendation system.
- Precision@k can be used alongside other metrics like **Recall@k** and **F1-Score** for a comprehensive evaluation of the recommendation system's effectiveness.



In [70]:
# function for computing Precision@k Evaluation metric function for machine learning model
# The code is modified but mainly taken from the oficial surprise library website : https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k
from collections import defaultdict

def ml_precision_recall_at_k(y, y_pred, user_list, k=10, threshold = 7):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for i in range(len(y)):
        user_est_true[user_list[i]].append((y_pred[i], y[i]))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        
        # Number of relevant items
        n_rel = np.sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = np.sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = np.sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

### 1.5 Error Metric Function

An **Error Metric Function** is a computational function that evaluates the accuracy of predictions made by a regression model by calculating various statistical metrics. These metrics provide insights into how well the model performs by quantifying the difference between the predicted values and the actual target values.

#### Common Error Metrics

1. **Mean Absolute Error (MAE)**:
   - **Definition**: MAE measures the average absolute difference between predicted values and actual values. It provides a straightforward interpretation of how far off predictions are from the actual outcomes.
   - **Formula**: 
     $
     \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}|
     $
   - **Where**:
     - $y_{i} $: Actual target value
     - $ \hat{y}_{i} $: Predicted target value
     - $ n $: Total number of predictions

2. **Mean Squared Error (MSE)**:
   - **Definition**: MSE calculates the average of the squares of the errors, giving more weight to larger errors. It is sensitive to outliers, making it useful when larger errors are particularly undesirable.
   - **Formula**: 
     $
     \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2
     $

3. **Root Mean Squared Error (RMSE)**:
   - **Definition**: RMSE is the square root of MSE and provides a measure of the average magnitude of the errors in the same units as the target variable. It is useful for understanding how spread out the residuals are.
   - **Formula**:
     $
     \text{RMSE} = \sqrt{\text{MSE}}
     $

4. **R-squared (R²)**:
   - **Definition**: R² represents the proportion of variance in the dependent variable that can be explained by the independent variables. It provides an indication of goodness of fit and ranges from 0 to 1, where a value closer to 1 indicates a better fit.
   - **Formula**:
     $
     R^2 = 1 - \frac{\sum (y_{i} - \hat{y}_{i})^2}{\sum (y_{i} - \bar{y})^2}
     $
   - **Where**:
     - $ \bar{y} $: Mean of actual target values

### Purpose of the Error Metric Function
The purpose of an Error Metric Function is to:
- Quantify the performance of a regression model.
- Provide insights into the accuracy and reliability of the model's predictions.
- Assist in model evaluation and selection, helping to identify areas for improvement in predictive accuracy.



In [71]:
# function to get rmse and mape given actual and predicted ratings..
def get_error_metrics(y_true, y_pred):
    rmse = np.sqrt(np.mean([ (y_true[i] - y_pred[i])**2 for i in range(len(y_pred)) ]))
    mape = np.mean(np.abs( (y_true - y_pred)/y_true )) * 100
    return rmse, mape



## 2. Content-Based Filtering

In **Content-Based Filtering**, we will use various anime features to create detailed profiles. The features include:

- **Anime Type** (e.g., TV, Movie, OVA)
- **Anime Source** (e.g., Manga, Light Novel, Original)
- **Anime Studio** (e.g., Kyoto Animation, Madhouse)
- **Anime Genre** (e.g., Action, Romance, Fantasy)

### Step 1: Creating Anime Profiles
We will create **Anime Profiles** by applying **One-Hot Encoding** to these categorical features. This encoding helps convert each feature into a numerical vector, allowing us to represent each anime in a multi-dimensional feature space.

### Step 2: Building User Profiles
To create **User Profiles**, we will:
- Sum up the **anime profile vectors** for all the anime a user has watched.
- Multiply each anime vector by the **rating** given by the user to weigh their preferences.

This process ensures that the user's profile captures their preferences based on both the types of anime they watch and their ratings.

### Step 3: Finding Recommendations
Finally, we will apply **Cosine Similarity** between the **User Profile** and **Anime Profiles** to identify recommendations. This similarity score helps us find the anime that best align with each user's preferences, ensuring personalized recommendations.



### 2.1 Creating Anime Profile
Creating Anime Profile by computing one hot encoding for different categorical features of anime.

#### Mathematical Explanation of One-Hot Encoding

Let's assume that we have a dataset containing **N** anime shows, and we want to create feature vectors using the following categorical attributes:
- **Type**: (e.g., TV, Movie, OVA)
- **Source**: (e.g., Manga, Light Novel, Original)
- **Studio**: (e.g., Kyoto Animation, Madhouse)
- **Genre**: (e.g., Action, Romance, Fantasy)

##### One-Hot Encoding Process

For each categorical feature, **One-Hot Encoding** creates a binary vector that represents the presence or absence of each possible category. 

###### Step 1: Define Feature Space
Let's say:
- **Type** has $ T $ unique values: $ \{t_1, t_2, \ldots, t_T\} $.
- **Source** has $ S $ unique values: $ \{s_1, s_2, \ldots, s_S\} $.
- **Studio** has $ U $ unique values: $ \{u_1, u_2, \ldots, u_U\} $.
- **Genre** has $ G $ unique values: $ \{g_1, g_2, \ldots, g_G\} $.

###### Step 2: Encode Each Feature
- Each anime's **type** is encoded as a vector of length $ T $:
  $
  \text{Type Vector} = [x_1, x_2, \ldots, x_T]
  $
  where $ x_i = 1 $ if the anime's type matches $t_i $; otherwise, $ x_i = 0 $.

- Similarly, each anime's **source** is encoded as a vector of length $ S $:
  $
  \text{Source Vector} = [y_1, y_2, \ldots, y_S]
  $
  where $ y_j = 1 $ if the anime's source matches $ s_j $; otherwise, $ y_j = 0 $.

- Each anime's **studio** is encoded as a vector of length $ U $:
  $
  \text{Studio Vector} = [z_1, z_2, \ldots, z_U]
  $
  where $ z_k = 1 $ if the anime's studio matches $ u_k $; otherwise, $ z_k = 0 $.

- Each anime's **genre** is encoded as a vector of length $ G $:
  $
  \text{Genre Vector} = [w_1, w_2, \ldots, w_G]
  $
  where $ w_l = 1 $ if the anime has the genre $ g_l $; otherwise, $ w_l = 0 $.

#### Step 3: Combine Encoded Vectors
Each anime can be represented as a concatenated vector of all these encoded features:

$
\text{Anime Profile Vector} = [x_1, x_2, \ldots, x_T, y_1, y_2, \ldots, y_S, z_1, z_2, \ldots, z_U, w_1, w_2, \ldots, w_G]
$

- The length of this vector is $T + S + U + G $.
- Each anime in the dataset is now represented by a unique binary vector of this length.

#### Example:
If an anime is a **TV show** (type), based on a **Manga** (source), produced by **Kyoto Animation** (studio), and belongs to **Action** and **Fantasy** (genres), the one-hot encoded vectors would look like:

- Type Vector (for TV): $[1, 0, 0]$ if there are 3 types: TV, Movie, OVA.
- Source Vector (for Manga): $[1, 0, 0]$ if there are 3 sources: Manga, Light Novel, Original.
- Studio Vector (for Kyoto Animation): $[0, 1, 0, 0]$ if there are 4 studios: Madhouse, Kyoto Animation, Studio Ghibli, others.
- Genre Vector (for Action and Fantasy): $[1, 0, 1, 0]$ if there are 4 genres: Action, Romance, Fantasy, Comedy.

Thus, the complete **Anime Profile Vector** would be:

$
[1, 0, 0, \; 1, 0, 0, \; 0, 1, 0, 0, \; 1, 0, 1, 0]
$

This vector representation allows us to compute similarities between anime based on their types, sources, studios, and genres, using methods like **Cosine Similarity**. It effectively captures the multi-dimensional characteristics of each anime in a way that is suitable for content-based recommendations.

In [72]:
# selecting categorical features from df_train_sample dataframe
df_train_anime_profile = df_train_sample.drop(['user_id','username','my_status','my_score','my_last_updated','gender'], axis = 1)
df_train_anime_profile = df_train_anime_profile.drop_duplicates(subset = 'anime_id')
df_train_anime_profile.head(2)

Unnamed: 0,anime_id,my_watched_episodes,title,type,source,episodes,studio,genre
1070303,430,1,Fullmetal Alchemist: The Conqueror of Shamballa,Movie,Manga,1,Bones,"Military, Comedy, Historical, Drama, Fantasy, ..."
955588,120,26,Fruits Basket,TV,Manga,26,Studio Deen,"Slice of Life, Comedy, Drama, Romance, Fantasy..."


In [73]:
# selecting categorical features from df_test_sample dataframe
df_test_anime_profile = df_test_sample[['anime_id', 'title', 'type', 'source', 'studio', 'genre', 'episodes']]

# Concatenating df_test_anime_profile with df_train_anime_profile
df_test_anime_profile = pd.concat([df_test_anime_profile, df_train_anime_profile], ignore_index=True)

# Dropping duplicates based on the 'anime_id' column
df_test_anime_profile = df_test_anime_profile.drop_duplicates(subset='anime_id')

# Displaying the first 2 rows
df_test_anime_profile.head(2)


Unnamed: 0,anime_id,title,type,source,studio,genre,episodes,my_watched_episodes
0,31414,Nijiiro Days,TV,Manga,Production Reed,"Comedy, Romance, School, Shoujo, Slice of Life",24,
1,31964,Boku no Hero Academia,TV,Manga,Bones,"Action, Comedy, School, Shounen, Super Power",13,


In [74]:
# creating categorical encoding on 'type' feature
type_vectorizer = CountVectorizer(lowercase=False)
type_vectorizer.fit(df_train_anime_profile['type'].values)

# Using the fitted CountVectorizer to convert the text to vectors
train_type_enc = type_vectorizer.transform(df_train_anime_profile['type'].values)
test_type_enc = type_vectorizer.transform(df_test_anime_profile['type'].values)
user_train_type_enc = type_vectorizer.transform(df_train_sample['type'].values)

print("After vectorizations:")
print("Train type encoding shape:", train_type_enc.shape)
print("Test type encoding shape:", test_type_enc.shape)
print("User train type encoding shape:", user_train_type_enc.shape)
print("Feature names:", type_vectorizer.get_feature_names_out())

After vectorizations:
Train type encoding shape: (5180, 6)
Test type encoding shape: (6166, 6)
User train type encoding shape: (1435525, 6)
Feature names: ['Movie' 'Music' 'ONA' 'OVA' 'Special' 'TV']


In [75]:
# creating categorical encoding on 'source' feature
source_vectorizer = CountVectorizer(lowercase = False, token_pattern="[\w\-\w\s]+")
source_vectorizer.fit(df_train_anime_profile['source'].values)

# we use the fitted CountVectorizer to convert the text to vector
train_source_enc = source_vectorizer.transform(df_train_anime_profile['source'].values)
test_source_enc = source_vectorizer.transform(df_test_anime_profile['source'].values)
user_train_source_enc = source_vectorizer.transform(df_train_sample['source'].values)

print("After vectorizations :")
print(train_source_enc.shape)
print(test_source_enc.shape)
print(user_train_source_enc.shape)
print(source_vectorizer.get_feature_names_out())

After vectorizations :
(5180, 15)
(6166, 15)
(1435525, 15)
['4-koma manga' 'Book' 'Card game' 'Digital manga' 'Game' 'Light novel'
 'Manga' 'Music' 'Novel' 'Original' 'Other' 'Picture book' 'Radio'
 'Visual novel' 'Web manga']


In [76]:
# creating categorical encoding on 'studio' feature
studio_vectorizer = CountVectorizer(lowercase = False, token_pattern = '[^,\s][^\,]*[^,\s]+')
studio_vectorizer.fit(df_train_anime_profile['studio'].values)

# we use the fitted CountVectorizer to convert the text to vector
train_studio_enc = studio_vectorizer.transform(df_train_anime_profile['studio'].values)
test_studio_enc = studio_vectorizer.transform(df_test_anime_profile['studio'].values)
user_train_studio_enc = studio_vectorizer.transform(df_train_sample['studio'].values)

print("After vectorizations :")
print(train_studio_enc.shape)
print(test_studio_enc.shape)
print(user_train_studio_enc.shape)
print(studio_vectorizer.get_feature_names_out())

After vectorizations :
(5180, 356)
(6166, 356)
(1435525, 356)
['10Gauge' '3xCube' '81 Produce' '8bit' 'A-1 Pictures' 'A-Real' 'A.C.G.T.'
 'ACC Production' 'AIC' 'AIC A.S.T.A.' 'AIC Build' 'AIC Classic'
 'AIC Frontier' 'AIC Plus+' 'AIC Spirits' 'AIC Takarazuka' 'APPP' 'AT-2'
 'AXsiZ' 'Actas' 'Agent 21' 'Ajia-Do' 'Amber Film Works' 'Amuse'
 'An DerCen' 'Animaruya' 'Animate Film' 'Animation Do'
 'Anime Antenna Iinkai' 'Anime R' 'Annapuru' 'Anpro' 'Arcs Create' 'Arms'
 'Artland' 'Artmic' 'Asahi Production' 'Ascension' 'Ashi Production'
 'Asread' 'Aubec' 'Bandai Namco Pictures' 'Barnum Studio' 'BeSTACK'
 'Bee Media' 'Bee Train' 'Beijing Huihuang Animation Company' 'Big Bang'
 'Blue Cat' 'Bones' 'Brain&#039;s Base' 'BreakBottle' 'Bridge' 'Buemon'
 'C-Station' 'C2C' 'Chaos Project' 'Charaction' 'ChuChu' 'Circle Tribute'
 'CoMix Wave Films' 'Code' 'Collaboration Works' 'Connect'
 'Cookie Jar Entertainment' 'Creators in Pack' 'D.A.S.T.' 'DAX Production'
 'DLE' 'Daewon Media' 'Dai Nippon Printin

In [77]:
# creating categorical encoding on 'genre' feature
genre_vectorizer = CountVectorizer(lowercase = False, token_pattern = '[^,\s][^\,]*[^,\s]*')
genre_vectorizer.fit(df_train_anime_profile['genre'].values)

# we use the fitted CountVectorizer to convert the text to vector
train_genre_enc = genre_vectorizer.transform(df_train_anime_profile['genre'].values)
test_genre_enc = genre_vectorizer.transform(df_test_anime_profile['genre'].values)
user_train_genre_enc = genre_vectorizer.transform(df_train_sample['genre'].values)

print("After vectorizations :")
print(train_genre_enc.shape)
print(test_genre_enc.shape)
print(user_train_genre_enc.shape)
print(genre_vectorizer.get_feature_names_out())

After vectorizations :
(5180, 43)
(6166, 43)
(1435525, 43)
['Action' 'Adventure' 'Cars' 'Comedy' 'Dementia' 'Demons' 'Drama' 'Ecchi'
 'Fantasy' 'Game' 'Harem' 'Hentai' 'Historical' 'Horror' 'Josei' 'Kids'
 'Magic' 'Martial Arts' 'Mecha' 'Military' 'Music' 'Mystery' 'Parody'
 'Police' 'Psychological' 'Romance' 'Samurai' 'School' 'Sci-Fi' 'Seinen'
 'Shoujo' 'Shoujo Ai' 'Shounen' 'Shounen Ai' 'Slice of Life' 'Space'
 'Sports' 'Super Power' 'Supernatural' 'Thriller' 'Vampire' 'Yaoi' 'Yuri']


In [78]:
# merging sparse matrices
from scipy.sparse import hstack
train_anime_encoded = hstack((train_type_enc, train_source_enc, train_studio_enc, train_genre_enc)).tocsr()
test_anime_encoded = hstack((test_type_enc, test_source_enc, test_studio_enc, test_genre_enc)).tocsr()
user_train_anime_encoded = hstack((user_train_type_enc, user_train_source_enc, user_train_studio_enc, user_train_genre_enc)).tocsr()

from scipy import sparse
sparse.save_npz("train_anime_encoded.npz", train_anime_encoded)
sparse.save_npz("test_anime_encoded.npz", test_anime_encoded)
sparse.save_npz("user_train_anime_encoded.npz", user_train_anime_encoded)

print("Final Data matrix shape :")
print(train_anime_encoded.shape)
print(test_anime_encoded.shape)
print(user_train_anime_encoded.shape)

Final Data matrix shape :
(5180, 420)
(6166, 420)
(1435525, 420)


### 2.2 Creating User Profile
Creating User Profile by multiplying given user's rating for a particular anime to that anime's Anime Profile and then adding all anime profile that a user has watched in training dataset.

In [95]:
sample_user_list = np.unique(df_train_sample['user_id'].values) #finding all the unique users in training dataframe
user_profile = dict() 

for user in tqdm(sample_user_list):
    # finding index of all the the rows that are related to a particular user in df_train_sample dataframe
    users_watched_anime_index = df_train_sample[df_train_sample['user_id'] == user].index 

    # storing all the rows that are related to a particular user in a seperate dataframe
    user_df = df_train_sample[df_train_sample['user_id'] == user]
    user_rating = user_df['my_score'].values #storing rating given by user

    user_vec = np.zeros(user_train_anime_encoded.shape[1]) # initializing the user profile array
    for ind,val in enumerate(users_watched_anime_index):
        # adding all the anime profile for a particular user by multiplying it with given user rating
        user_vec += user_train_anime_encoded[val].toarray()[0]*int(user_rating[ind]) 
    user_profile[user] = user_vec #storing user profile vector 

# saving anime_id_and_title_dict dictonary
a_file = open("user_profile.pkl", "wb")
pickle.dump(user_profile, a_file)
a_file.close()

print('\n',len(user_profile))
print(len(user_profile[list(user_profile.keys())[0]]))

100%|██████████| 9358/9358 [15:46<00:00,  9.89it/s]


 9358
420





In [80]:
df_train_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1435525 entries, 0 to 1435524
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   user_id              1435525 non-null  object
 1   username             1435525 non-null  object
 2   anime_id             1435525 non-null  object
 3   my_watched_episodes  1435525 non-null  int64 
 4   my_score             1435525 non-null  int64 
 5   my_status            1435525 non-null  int64 
 6   my_last_updated      1435525 non-null  object
 7   title                1435525 non-null  object
 8   type                 1435525 non-null  object
 9   source               1435525 non-null  object
 10  episodes             1435525 non-null  int64 
 11  studio               1435525 non-null  object
 12  genre                1435525 non-null  object
 13  gender               1435525 non-null  object
dtypes: int64(4), object(10)
memory usage: 153.3+ MB


### 2.3 Computing Content Based Filtering

In [96]:
sample_train_user_list = np.unique(df_train_sample['user_id'].values)
train_anime_id_index = df_train_anime_profile['anime_id'].values 
train_user_matrix = []

for user in tqdm(sample_train_user_list):
    user_profile_vec = user_profile[user] #getting the user profile vector for given user
    user_profile_normalize = normalize(user_profile_vec.reshape(1,-1), norm = 'l2') #normalizing the user profile vector
    # computing cosine similarity between normalize user profile vector and anime profile matrix 
    similarity_vec = cosine_similarity(user_profile_normalize, train_anime_encoded)[0] 
    scaler = MinMaxScaler(feature_range=(1, 10))
    train_user_matrix.append(scaler.fit_transform(similarity_vec.reshape(-1, 1)).ravel())

train_user_matrix = np.array(train_user_matrix)

train_content_based_df = pd.DataFrame(train_user_matrix, index = sample_train_user_list, columns = train_anime_id_index) 
train_content_based_df.head()

100%|██████████| 9358/9358 [00:06<00:00, 1410.95it/s]


Unnamed: 0,430,120,740,32,30,66,1117,190,110,1278,...,31762,33487,32248,32900,33486,32886,33157,32380,31845,32483
428,5.931717,8.55953,6.386029,4.386866,7.228165,6.847352,4.268894,6.721202,9.569857,5.362233,...,2.050114,8.268426,3.424702,5.303054,9.076635,1.945635,3.93577,5.140764,8.512641,5.741527
695,7.321112,5.606562,3.892753,5.02816,5.306478,2.853213,6.559739,5.217733,6.967034,4.396349,...,3.169461,4.79596,3.853817,5.145888,6.532351,2.107744,2.353541,4.823348,5.563852,2.986037
912,7.245228,9.006248,6.795368,4.162287,6.140316,5.797765,6.090947,7.55352,9.361696,6.320187,...,2.62389,8.39838,3.402755,4.732173,8.587942,2.051528,3.620208,6.061532,7.866729,5.050527
989,6.602171,7.341199,5.513319,4.580462,6.805389,5.159543,5.253944,6.055821,8.838747,4.952259,...,2.584631,6.918514,3.609645,4.634405,8.360091,2.283742,2.990375,4.947609,7.0936,4.649763
1051,6.283988,7.081221,5.856252,3.119775,5.250646,4.678558,5.844137,6.112891,8.266858,5.008335,...,1.961269,6.565186,3.366351,4.896213,7.679534,1.509858,2.544086,6.571741,7.238567,3.732427


In [97]:
# getting predicted value for train dataset
sample_train_user = df_train_sample['user_id'].values
sample_train_anime = df_train_sample['anime_id'].values
y_train_pred_content_based = []

for i in tqdm(range(len(sample_train_user))):
    y_train_pred_content_based.append(train_content_based_df[sample_train_anime[i]].loc[sample_train_user[i]])


100%|██████████| 1435525/1435525 [00:08<00:00, 171778.90it/s]


In [45]:
df_test_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 358882 entries, 1663352 to 601493
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   user_id              358882 non-null  int64 
 1   username             358882 non-null  object
 2   anime_id             358882 non-null  int64 
 3   my_watched_episodes  358882 non-null  int64 
 4   my_score             358882 non-null  int64 
 5   my_status            358882 non-null  int64 
 6   my_last_updated      358882 non-null  object
 7   title                358882 non-null  object
 8   type                 358882 non-null  object
 9   source               358882 non-null  object
 10  episodes             358882 non-null  int64 
 11  studio               358882 non-null  object
 12  genre                358882 non-null  object
 13  gender               358882 non-null  object
dtypes: int64(6), object(8)
memory usage: 41.1+ MB


In [105]:
unique_test_user_list = np.unique(df_test_sample['user_id'].values)
test_anime_id_index = df_test_anime_profile['anime_id'].values 
test_user_list = df_test_sample['user_id'].values


# List to keep track of users for whom ratings are generated
users_with_profiles = []

# Iterate through each unique test user and compute their predicted ratings for each anime
test_user_matrix = []
for user in tqdm(unique_test_user_list):
    if user in user_profile:  # Check if the user has a profile
        user_profile_vec = user_profile[user]  # Get the user profile vector
        user_profile_normalize = normalize(user_profile_vec.reshape(1, -1), norm='l2')  # Normalize the user profile vector

        # Compute cosine similarity between normalized user profile vector and anime profile matrix 
        similarity_vec = cosine_similarity(user_profile_normalize, test_anime_encoded)[0] 
        scaler = MinMaxScaler(feature_range=(1, 10))
        test_user_matrix.append(scaler.fit_transform(similarity_vec.reshape(-1, 1)).ravel())

        # Track the user ID for which the profile was used
        users_with_profiles.append(user)

# Convert to numpy array
test_user_matrix = np.array(test_user_matrix)

# Create the dataframe containing the predicted ratings
test_content_based_df = pd.DataFrame(test_user_matrix, index=users_with_profiles, columns=test_anime_id_index)
test_content_based_df.head()



100%|██████████| 5376/5376 [00:03<00:00, 1739.63it/s]


Unnamed: 0,31414,31964,30988,31498,31798,8876,10048,31376,31430,31478,...,32388,7915,31972,32666,9855,5022,32852,32851,32707,30923
695,3.946936,6.532351,2.353541,2.853213,4.832925,3.567933,4.615015,4.83978,5.42896,5.535432,...,2.169716,2.290292,3.060559,3.238727,1.13169,5.351572,1.394933,1.0,3.939599,5.237314
1407,9.253653,8.817487,5.516589,8.364878,7.399338,5.758249,4.285059,8.748518,6.863551,8.504037,...,2.370214,3.772245,1.628267,4.896734,1.500461,6.744111,1.601196,1.0,5.251518,3.466653
1969,4.989311,7.279861,2.914194,4.182975,6.745306,3.84445,5.08764,5.759184,6.844267,6.624397,...,2.735955,3.027036,1.637079,4.026966,1.424719,7.782899,1.321509,1.033708,4.438736,4.006742
2560,6.602021,7.765791,3.520494,5.60614,7.875374,4.223619,4.875935,7.010855,7.578374,7.754976,...,2.505099,2.857333,1.896329,4.126445,1.412984,8.34121,1.697407,1.0,4.467425,4.138681
2699,8.789539,8.569408,5.289699,7.076413,6.681221,6.694208,5.144757,8.084158,6.300544,7.94758,...,1.906666,4.298936,1.670342,4.242636,3.070106,5.363914,1.797539,1.0,4.528042,2.876957


In [107]:
# getting predicted values for test dataset
# Getting predicted values for the test dataset
new_sample_test_user = df_test_sample['user_id'].values
new_sample_test_anime = df_test_sample['anime_id'].values

y_test_pred_content_based = []
for i in tqdm(range(len(new_sample_test_user))):
    user = new_sample_test_user[i]
    anime = new_sample_test_anime[i]
    
    # Check if the user and anime exist in the predicted ratings DataFrame
    if user in test_content_based_df.index and anime in test_content_based_df.columns:
        # If both exist, append the predicted rating
        y_test_pred_content_based.append(test_content_based_df.at[user, anime])
    else:
        # If either the user or anime is not found, use a default value (e.g., the mean rating)
        y_test_pred_content_based.append(test_content_based_df.values.mean())

# Convert to a numpy array if needed for further processing
y_test_pred_content_based = np.array(y_test_pred_content_based)


100%|██████████| 358882/358882 [10:40<00:00, 560.69it/s] 


In [110]:
# getting train and test rmse and mape value
y_train=df_train_sample['my_score'].values
y_test=df_test_sample['my_score'].values
print("In Content Based Filtering Model applied with Cosine Similarity : ")
rmse_train, mape_train = get_error_metrics(y_train.astype(float), y_train_pred_content_based)
print('Train RMSE : ', rmse_train)
print('Train MAPE : ', mape_train)

rmse_test, mape_test = get_error_metrics(y_test.astype(float), y_test_pred_content_based)
print('\nTest RMSE : ', rmse_test)
print('Test MAPE : ', mape_train)

In Content Based Filtering Model applied with Cosine Similarity : 
Train RMSE :  2.2542872200013564
Train MAPE :  27.515183176461594

Test RMSE :  2.6149221390471076
Test MAPE :  27.515183176461594


In [111]:
# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = ml_precision_recall_at_k(y_train.astype(float), y_train_pred_content_based, sample_train_user, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = ml_precision_recall_at_k(y_train.astype(float), y_train_pred_content_based, sample_train_user, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('In Content Based Filtering Model applied with Cosine Similarity : ')
print('Train precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Train precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = ml_precision_recall_at_k(y_test.astype(float), y_test_pred_content_based, new_sample_test_user, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = ml_precision_recall_at_k(y_test.astype(float), y_test_pred_content_based, new_sample_test_user, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('\nTest precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Test precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

In Content Based Filtering Model applied with Cosine Similarity : 
Train precisions@5 : 0.7129105221913514
Train precisions@10 : 0.7011613015940863

Test precisions@5 : 0.5029575892857143
Test precisions@10 : 0.4979884879298942


## 3. Collaborative Filtering
Collaborative filtering is a technique for recommendation systems that suggests items based on user preferences or behaviors. There are two main types:

1. **User-Based Collaborative Filtering**: Recommends items to a user based on the preferences of similar users.

2. **Item-Based Collaborative Filtering**: Recommends items similar to those the user has already liked.

**Key Steps**:
- Calculate similarity (e.g., using cosine similarity or Pearson correlation).
- Find similar users/items.
- Predict ratings for unseen items.
- Recommend items with high predicted ratings.

**Challenges** include the cold start problem, data sparsity, and scalability. Combining it with other methods (hybrid approach) can improve accuracy.

### Precision Metric Function for Surprise Library

In [112]:
# function for computing Precision@k Evaluation metric function 
# The code is taken from the oficial surprise library website : https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold = 7):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
   
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        
        # Number of relevant items
        n_rel = np.sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = np.sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = np.sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

### 3.1 Memory Based Collaborative Filtering
Memory-based collaborative filtering is a type of recommendation system that relies on historical user-item interactions to make predictions about user preferences.

#### 3.1.1 KNNBasic using Surprise Library
In the context of collaborative filtering, a basic **K-Nearest Neighbors (KNN) algorithm** can be used to predict how a user might rate a specific anime based on the ratings of similar anime that the user has already rated. This method involves finding similar items (in this case, anime) by computing the similarity between them and using this information to predict the rating.

### Predicted Rating of KNNBasic (Anime-Anime Similarity)

In KNN-based collaborative filtering, we predict the rating of a user $ u $ for an anime $ i $ by analyzing the set of **K similar anime** (neighbors) that the user has previously rated. The similarity between anime is typically measured using a metric like the **Pearson correlation coefficient**, which quantifies how similarly two items (anime) are rated across users.

#### Steps:

1. **Find K similar anime:**
   - For the anime $i $ that user $u $ has not rated, identify **K similar anime** that have already been rated by user $ u $. These similar anime are referred to as the **neighbors** of $ i $.
   - The similarity between anime $i $ and another anime $ j $ is computed using the **Pearson correlation coefficient**, which reflects how users tend to rate these two anime similarly.
   
   $
   \text{sim}(i, j) = \frac{\sum_{v \in U} (r_{v,i} - \bar{r_i})(r_{v,j} - \bar{r_j})}{\sqrt{\sum_{v \in U}(r_{v,i} - \bar{r_i})^2} \sqrt{\sum_{v \in U}(r_{v,j} - \bar{r_j})^2}}
   $
   
   - Here, $ \text{sim}(i, j) $ is the similarity between anime $ i $ and $j $, $ r_{v,i} $ is the rating given by user $ v $ to anime $ i $, and $ \bar{r_i} $ and $ \bar{r_j} $ are the average ratings for anime $ i $ and $ j $ across all users in set $U $, respectively.

2. **Weighted rating prediction:**
   - Once the K most similar anime have been identified, we predict the rating $ \hat{r_{u,i}} $ that user $ u $ would give to anime $i $ by taking a **weighted average** of user $ u $'s ratings for the similar anime $ j $.
   - The weight of each rating is determined by the similarity between anime $ i $ and $ j $ (i.e., $ \text{sim}(i, j) $.

   $
   \hat{r_{u,i}} = \frac{\sum_{j \in K} \text{sim}(i, j) \cdot r_{u,j}}{\sum_{j \in K} |\text{sim}(i, j)|}
   $

   - Here, $ r_{u,j} $ is the rating given by user $ u $ to anime $j $, and the summation is over the set of K similar anime.

### Pearson Correlation Coefficient

The **Pearson correlation coefficient** is commonly used in collaborative filtering because it captures how two anime are similarly rated by multiple users. It normalizes the ratings by subtracting the mean, ensuring that differences in user rating scales are taken into account. This allows the algorithm to focus on the patterns in user preferences rather than absolute rating values.


In [114]:
reader = sp.reader.Reader(rating_scale = (1, 10)) # reading scales
train_data = sp.Dataset.load_from_df(df_train_sample[['user_id', 'anime_id', 'my_score']], reader)

trainset = train_data.build_full_trainset()
testset = list(zip(df_test_sample.user_id.values, df_test_sample.anime_id.values, df_test_sample.my_score.values.astype(float)))

knn_basic = KNNBasic(sim_options = {'user_based' : False, 'name': 'pearson_baseline'})
knn_basic.fit(trainset)

train_predictions = knn_basic.test(trainset.build_testset())
test_predictions = knn_basic.test(testset)

y_train = []
y_train_pred_knnbasic = []
for uid, _, true_r, est, _ in train_predictions:
    y_train.append(true_r)
    y_train_pred_knnbasic.append(est)

y_test = []
y_test_pred_knnbasic = []
for uid, _, true_r, est, _ in test_predictions:
    y_test.append(true_r)
    y_test_pred_knnbasic.append(est)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


In [115]:
reader = sp.reader.Reader(rating_scale = (1, 10)) # reading scales
train_data = sp.Dataset.load_from_df(df_train_sample[['user_id', 'anime_id', 'my_score']], reader)

trainset = train_data.build_full_trainset()
testset = list(zip(df_test_sample.user_id.values, df_test_sample.anime_id.values, df_test_sample.my_score.values.astype(float)))

knn_basic = KNNBasic(sim_options = {'user_based' : False, 'name': 'cosine'})
knn_basic.fit(trainset)

train_predictions = knn_basic.test(trainset.build_testset())
test_predictions = knn_basic.test(testset)

y_train = []
y_train_pred_knnbasic = []
for uid, _, true_r, est, _ in train_predictions:
    y_train.append(true_r)
    y_train_pred_knnbasic.append(est)

y_test = []
y_test_pred_knnbasic = []
for uid, _, true_r, est, _ in test_predictions:
    y_test.append(true_r)
    y_test_pred_knnbasic.append(est)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [116]:
# getting train and test rmse and mape value
print('In KNNBasic with Anime-Anime Similarity using Surprise Library : ')
rmse_train, mape_train = get_error_metrics(np.array(y_train).astype(float), y_train_pred_knnbasic)
print('Train RMSE : ', rmse_train)
print('Train MAPE : ', mape_train)

rmse_test, mape_test = get_error_metrics(np.array(y_test).astype(float), y_test_pred_knnbasic)
print('\nTest RMSE : ', rmse_test)
print('Test MAPE : ', mape_train)

In KNNBasic with Anime-Anime Similarity using Surprise Library : 
Train RMSE :  1.2210911290558444
Train MAPE :  16.288323101477214

Test RMSE :  1.648345303068194
Test MAPE :  16.288323101477214


In [117]:
# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(train_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(train_predictions, k=5, threshold = 8) 
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('In KNNBasic with Anime-Anime Similarity using Surprise Library : ')
print('Train precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Train precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(test_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(test_predictions, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('\nTest precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Test precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

In KNNBasic with Anime-Anime Similarity using Surprise Library : 
Train precisions@5 : 0.7961690532164992
Train precisions@10 : 0.7873868636291164

Test precisions@5 : 0.5176091269841271
Test precisions@10 : 0.5122459313586545


#### 3.1.2 KNNBaseline usine Surprise Library
In the context of collaborative filtering, the **KNNBaseline** algorithm is an enhancement of the basic K-Nearest Neighbors (KNN) approach. It incorporates a **baseline rating** into the prediction to account for the inherent biases in user and item ratings. This baseline adjusts for factors like a user's general tendency to rate items higher or lower than average, and certain anime being rated more highly across the board.

### Predicted Rating of KNNBaseline (Anime-Anime Similarity)

The goal of **KNNBaseline** is to predict the rating a user $ u $ would give to an anime $ i $, by considering both the similarity between anime and adjusting for biases in user and anime rating behavior. The baseline rating is factored into the prediction, which helps in producing more accurate recommendations.

#### Steps:

1. **Baseline estimate:**
   - The first step is to compute a **baseline estimate** $ b_{u,i} $ for the rating of anime $ i $ by user $ u $. This baseline is calculated as:

   $
   b_{u,i} = \mu + b_u + b_i
   $

   - Where:
     - $ \mu $ is the **overall average rating** across all users and all anime.
     - $ b_u $ is the **user bias**, which represents how much user $ u $ tends to rate higher or lower than the overall average.
     - $ b_i $ is the **item bias** for anime $ i $, which reflects how anime $ i $ is generally rated by users compared to the average.

2. **Find K similar anime:**
   - Once the baseline rating is determined, we proceed to find the **K nearest neighbors** for anime $ i $, which are the most similar anime that user $ u $ has already rated.
   - The similarity between anime $ i $ and $ j $ is computed using the **Pearson correlation coefficient**, which reflects how similarly these two anime are rated by users.

   $
   \text{sim}(i, j) = \frac{\sum_{v \in U} (r_{v,i} - \bar{r_i})(r_{v,j} - \bar{r_j})}{\sqrt{\sum_{v \in U}(r_{v,i} - \bar{r_i})^2} \sqrt{\sum_{v \in U}(r_{v,j} - \bar{r_j})^2}}
   $

   - Here, $ \text{sim}(i, j) $ is the Pearson correlation coefficient between anime $ i $ and $ j $, $ r_{v,i} $ and $ r_{v,j} $ are the ratings given by user $ v $ to anime $ i $ and $ j $, and $\bar{r_i} $ and $ \bar{r_j} $ are the average ratings for anime $ i $ and $ j $.

3. **Weighted rating prediction:**
   - After identifying the K most similar anime, the predicted rating $ \hat{r_{u,i}} $ for anime $i $ by user $ u $ is calculated by combining the baseline rating $ b_{u,i} $ with the weighted deviations from the baseline for the similar anime.
   
   $
   \hat{r_{u,i}} = b_{u,i} + \frac{\sum_{j \in K} \text{sim}(i, j) \cdot (r_{u,j} - b_{u,j})}{\sum_{j \in K} |\text{sim}(i, j)|}
   $

   - Here, $ r_{u,j} $ is the rating given by user $ u $ to anime $ j $, and $ b_{u,j} $ is the baseline estimate for anime $ j $.
   - The term $ (r_{u,j} - b_{u,j}) $ represents how much the actual rating deviates from the baseline, and the similarity $ \text{sim}(i, j) $ is used as a weight.

### Pearson Correlation Coefficient

The **Pearson correlation coefficient** is used to measure the similarity between two anime by looking at how similarly they are rated by users. It adjusts for differences in rating scales and focuses on how the ratings vary from the mean, making it a robust measure for collaborative filtering tasks.



In [118]:
reader = sp.reader.Reader(rating_scale = (1, 10))
train_data = sp.Dataset.load_from_df(df_train_sample[['user_id', 'anime_id', 'my_score']], reader)

trainset = train_data.build_full_trainset()
testset = list(zip(df_test_sample.user_id.values, df_test_sample.anime_id.values, df_test_sample.my_score.values.astype(float)))

knn_baseline = KNNBaseline(sim_options = {'user_based' : False, 'name': 'pearson_baseline'})
knn_baseline.fit(trainset)
train_predictions = knn_baseline.test(trainset.build_testset())
test_predictions = knn_baseline.test(testset)

y_train = []
y_train_pred_knnbaseline = []
for uid, _, true_r, est, _ in train_predictions:
    y_train.append(true_r)
    y_train_pred_knnbaseline.append(est)

y_test = []
y_test_pred_knnbaseline = []
for uid, _, true_r, est, _ in test_predictions:
    y_test.append(true_r)
    y_test_pred_knnbaseline.append(est)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


In [119]:
# getting train and test rmse and mape value
print('In KNNBaseline with Anime-Anime Similarity using Surprise Library : ')
rmse_train, mape_train = get_error_metrics(np.array(y_train).astype(float), y_train_pred_knnbaseline)
print('Train RMSE : ', rmse_train)
print('Train MAPE : ', mape_train)

rmse_test, mape_test = get_error_metrics(np.array(y_test).astype(float), y_test_pred_knnbaseline)
print('\nTest RMSE : ', rmse_test)
print('Test MAPE : ', mape_train)

In KNNBaseline with Anime-Anime Similarity using Surprise Library : 
Train RMSE :  0.8269366969918814
Train MAPE :  10.453006031667678

Test RMSE :  1.4929552473623746
Test MAPE :  10.453006031667678


In [120]:
# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(train_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(train_predictions, k=5, threshold = 8) 
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('In KNNBaseline with Anime-Anime Similarity using Surprise Library : ')
print('Trainprecisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Train precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(test_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(test_predictions, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('\nTest precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Test precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

In KNNBaseline with Anime-Anime Similarity using Surprise Library : 
Trainprecisions@5 : 0.9873904680487283
Train precisions@10 : 0.9848784844136416

Test precisions@5 : 0.7638950892857143
Test precisions@10 : 0.7495219051162132


#### 3.1.3 KNNwithMeans using Surprise Library
In collaborative filtering, the **KNNWithMeans** algorithm builds upon basic KNN by incorporating the mean rating of each user into the prediction process. This approach helps in adjusting for individual differences in rating scales, as some users tend to give higher or lower ratings on average. By considering how much a user’s rating deviates from their personal average, the algorithm aims to generate more accurate recommendations.

### Predicted Rating of KNNWithMeans (Anime-Anime Similarity)

The **KNNWithMeans** algorithm predicts the rating that a user $ u $ would give to an anime $ i $ by looking at the **K similar anime** that user $ u $ has already rated. It adjusts the prediction based on how much user $ u $'s ratings of those similar anime deviate from their personal mean rating.

#### Steps:

1. **Calculate the mean rating**:
   - First, the mean rating $ \bar{r_u} $ of user $ u $ across all the anime they have rated is calculated:
     $
     \bar{r_u} = \frac{\sum_{i \in I_u} r_{u,i}}{|I_u|}
     $
     - Where $I_u $ is the set of anime rated by user $ u $, and $ r_{u,i} $ is the rating user $ u $ gave to anime $ i $.

2. **Find K similar anime**:
   - For the anime $ i $ that user $ u $ hasn’t rated, identify the **K similar anime** (neighbors) that user $ u $ has rated. The similarity between anime $ i $ and another anime $ j $ is computed using the **Pearson correlation coefficient**:
     $
     \text{sim}(i, j) = \frac{\sum_{v \in U} (r_{v,i} - \bar{r_i})(r_{v,j} - \bar{r_j})}{\sqrt{\sum_{v \in U}(r_{v,i} - \bar{r_i})^2} \sqrt{\sum_{v \in U}(r_{v,j} - \bar{r_j})^2}}
     $
     - Where $ \text{sim}(i, j) $ is the similarity between anime $ i $ and $ j $, and $ \bar{r_i} $, $ \bar{r_j} $ are the average ratings for anime $ i $ and $ j $, respectively.

3. **Weighted rating prediction**:
   - The predicted rating $ \hat{r_{u,i}} $ for anime $ i $ by user $ u $ is computed by combining user $ u $'s mean rating $ \bar{r_u} $ with a weighted sum of the deviations from the mean ratings for the K similar anime:
     $
     \hat{r_{u,i}} = \bar{r_u} + \frac{\sum_{j \in K} \text{sim}(i, j) \cdot (r_{u,j} - \bar{r_u})}{\sum_{j \in K} |\text{sim}(i, j)|}
     $
     - Here, $ r_{u,j} $ is the rating that user $ u $ gave to anime $ j $, and $ \bar{r_u} $ is the mean rating for user $ u $.
     - The term $(r_{u,j} - \bar{r_u}) $ represents how much user $ u $’s rating for anime $ j $ deviates from their personal average, and $ \text{sim}(i, j) $ is used to weight these deviations based on the similarity between anime $ i $ and $ j $.


In [121]:
reader = sp.reader.Reader(rating_scale = (1, 10))
train_data = sp.Dataset.load_from_df(df_train_sample[['user_id', 'anime_id', 'my_score']], reader)

trainset = train_data.build_full_trainset()
testset = list(zip(df_test_sample.user_id.values, df_test_sample.anime_id.values, df_test_sample.my_score.values.astype(float)))

knn_with_means = KNNWithMeans(sim_options = {'user_based' : False, 'name': 'pearson_baseline'})
knn_with_means.fit(trainset)
train_predictions = knn_with_means.test(trainset.build_testset())
test_predictions = knn_with_means.test(testset)

y_train = []
y_train_pred_knnwithmeans = []
for uid, _, true_r, est, _ in train_predictions:
    y_train.append(true_r)
    y_train_pred_knnwithmeans.append(est)

y_test = []
y_test_pred_knnwithmeans = []
for uid, _, true_r, est, _ in test_predictions:
    y_test.append(true_r)
    y_test_pred_knnwithmeans.append(est)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


In [122]:
# getting train and test rmse and mape value
print('In KNNwithMeans with Anime-Anime Similarity using Surprise Library : ')
rmse_train, mape_train = get_error_metrics(np.array(y_train).astype(float), y_train_pred_knnwithmeans)
print('Train RMSE : ', rmse_train)
print('Train MAPE : ', mape_train)

rmse_test, mape_test = get_error_metrics(np.array(y_test).astype(float), y_test_pred_knnwithmeans)
print('\nTest RMSE : ', rmse_test)
print('Test MAPE : ', mape_train)

In KNNwithMeans with Anime-Anime Similarity using Surprise Library : 
Train RMSE :  0.8288563883232001
Train MAPE :  10.45231319076021

Test RMSE :  1.6421681080832946
Test MAPE :  10.45231319076021


In [123]:
# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(train_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(train_predictions, k=5, threshold = 8) 
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('In KNNwithMeans with Anime-Anime Similarity using Surprise Library : ')
print('precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(test_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(test_predictions, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('\nTest precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Test precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

In KNNwithMeans with Anime-Anime Similarity using Surprise Library : 
precisions@5 : 0.9870966018379995
precisions@10 : 0.9848349769486763

Test precisions@5 : 0.63515625
Test precisions@10 : 0.6359549201625094


### 3.2 Model Based Collaborative Filtering

### 3.2.1 Singular Value Decomposition (SVD) using Surprise Library
In the **Surprise Library**, **Singular Value Decomposition (SVD)** is a matrix factorization technique used in collaborative filtering. The **SVD model** aims to predict the rating a user $ u $ would give to an anime $ i $ by factoring both users and items (anime) into lower-dimensional latent factor spaces. It combines the overall average rating, user and item biases, and latent factors to generate predictions.

### Predicted Rating of SVD in Surprise Library

The predicted rating $ \hat{r_{u,i}} $ for user $ u $ on anime $ i $ in the **SVD model** is expressed as:

$
\hat{r_{u,i}} = \mu + b_u + b_i + q_i^T p_u
$

Where:
- $\mu $ is the **global average rating** across all users and items in the training data.
- $ b_u $ is the **user bias**, representing how much user $ u $ tends to rate items higher or lower compared to the average.
- $ b_i $ is the **item bias** (anime bias), representing how much anime $i $ tends to be rated higher or lower than the average.
- $ p_u $ is the **user vector** representing user $ u $'s preferences in a **latent factor space**.
- $ q_i $ is the **item vector** representing anime $ i $ in the same latent factor space.

#### Explanation:

1. **Global Average ($ \mu $)**:
   - $ \mu $ is the **overall mean rating** of all items in the training data, providing a baseline prediction when no other information is available.

2. **User Bias ($ b_u $)**:
   - Each user has their own bias, which reflects how their ratings deviate from the global average. For example, some users may tend to rate items more generously (positive bias), while others may rate more harshly (negative bias).

3. **Item Bias ($ b_i $)**:
   - Each anime (item) also has a bias, reflecting how it is generally rated across all users. Popular anime may have a positive bias, while less popular anime may have a negative bias.

4. **Latent Factor Interaction ($ q_i^T p_u $)**:
   - The core of the SVD model involves **latent factors**. The vector $ p_u $ represents user $ u $'s preferences, and the vector $ q_i $ represents anime $ i $'s characteristics. The dot product $ q_i^T p_u $ measures how well the anime’s characteristics align with the user’s preferences in the latent factor space.



In [125]:
reader = sp.reader.Reader(rating_scale = (1, 10))
train_data = sp.Dataset.load_from_df(df_train_sample[['user_id', 'anime_id', 'my_score']], reader)

trainset = train_data.build_full_trainset()
testset = list(zip(df_test_sample.user_id.values, df_test_sample.anime_id.values, df_test_sample.my_score.values.astype(float)))

svd = SVD()
svd.fit(trainset)
train_predictions = svd.test(trainset.build_testset())
test_predictions = svd.test(testset)
test_predictions = svd.test(testset)

y_train = []
y_train_pred_svd = []
for uid, _, true_r, est, _ in train_predictions:
    y_train.append(true_r)
    y_train_pred_svd.append(est)

y_test = []
y_test_pred_svd = []
for uid, _, true_r, est, _ in test_predictions:
    y_test.append(true_r)
    y_test_pred_svd.append(est)

In [126]:
# getting train and test rmse and mape value
print('In SVD using Surprise Library on Training Dataset : ')
rmse_train, mape_train = get_error_metrics(np.array(y_train).astype(float), y_train_pred_svd)
print('Train RMSE : ', rmse_train)
print('Train MAPE : ', mape_train)

rmse_test, mape_test = get_error_metrics(np.array(y_test).astype(float), y_test_pred_svd)
print('\nTest RMSE : ', rmse_test)
print('Test MAPE : ', mape_train)

In SVD using Surprise Library on Training Dataset : 
Train RMSE :  0.793124235789393
Train MAPE :  9.790630605637768

Test RMSE :  1.489476886006374
Test MAPE :  9.790630605637768


In [127]:
# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(train_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(train_predictions, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('In SVD using Surprise Library on Training Dataset : ')
print('precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = precision_recall_at_k(test_predictions, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = precision_recall_at_k(test_predictions, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('\nTest precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Test precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

In SVD using Surprise Library on Training Dataset : 
precisions@5 : 0.9885000356201468
precisions@10 : 0.9850103637665083

Test precisions@5 : 0.7590680803571429
Test precisions@10 : 0.744277771872638


### Creating X_train and X_test Dataset for Machine Learning Model
Creating X_train and X_test Dataset for Machine Learning Model with the help of Singular Value Decomposition (SVD) and representing each User id and Anime id as vector extracted from SVD.

In [128]:
# converting sample train dataframe to pivoted matrix dataframe with user id as rows and anime id as columns
users_anime_pivot_matrix_df = df_train_sample.pivot(index='user_id', columns='anime_id', values='my_score').fillna(0)
users_anime_pivot_matrix_df.head()

anime_id,1,5,6,7,8,15,16,17,18,19,...,33247,33338,33352,33358,33372,33417,33420,33421,33486,33487
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
428,8.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
912,4.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
989,9.0,8.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1051,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [129]:
# storing all the anime ids in sequence as per df_train_sample dataframe
anime_ids = list(users_anime_pivot_matrix_df.columns) 

# storing all the user ids in sequence as per df_train_sample dataframe
users_ids = list(users_anime_pivot_matrix_df.index)

# storing users_anime_pivot_matrix_df dataframe in matrix form
users_anime_matrix = users_anime_pivot_matrix_df.astype(float)

# converting the sparse 'users_anime_pivot_matrix' matrix in csr matrix
users_anime_sparse_matrix = csr_matrix(users_anime_matrix)
users_anime_sparse_matrix

<9358x5180 sparse matrix of type '<class 'numpy.float64'>'
	with 1435525 stored elements in Compressed Sparse Row format>

In [130]:
#The number of factors to factor the user-item matrix.
number_of_factor = 50
U, sigma, Vt = svds(users_anime_sparse_matrix, k = number_of_factor)
sigma = np.diag(sigma)
print('The Shape of U, sigma and Vt are :')
print('u : ', U.shape)
print('sigma : ', sigma.shape)
print('Vt : ', Vt.shape)

The Shape of U, sigma and Vt are :
u :  (9358, 50)
sigma :  (50, 50)
Vt :  (50, 5180)


In [131]:
# storing user_id, anime_id and rating from train and test dataframe in list
train_sample_user_list = df_train_sample['user_id'].values
train_sample_anime_list = df_train_sample['anime_id'].values
y_train = df_train_sample['my_score'].values

test_sample_user_list = df_test_sample['user_id'].values
test_sample_anime_list = df_test_sample['anime_id'].values
y_test = df_test_sample['my_score'].values

In [132]:
# representing each user_id and anime_id as vector in training dataset
train_user_vec = []
train_anime_vec = []
Vt_trans = Vt.T
for ind in tqdm(range(len(train_sample_user_list))):
    user_vec = U[users_ids.index(train_sample_user_list[ind])]
    anime_vec = Vt_trans[anime_ids.index(train_sample_anime_list[ind])]
    train_user_vec.append(user_vec)
    train_anime_vec.append(anime_vec)

train_user_vec = np.array(train_user_vec)
train_anime_vec = np.array(train_anime_vec)

# checking the shape of user and anime_id vector representation in testing dataset
print("train_user_vec shape", train_user_vec.shape)
print("train_anime_vec shape", train_anime_vec.shape)

100%|██████████| 1435525/1435525 [01:09<00:00, 20702.98it/s]


train_user_vec shape (1435525, 50)
train_anime_vec shape (1435525, 50)


In [135]:
# Representing each user_id and anime_id as a vector in the testing dataset
test_user_vec = []
test_anime_vec = []
Vt_trans = Vt.T
train_sample_anime_list_unique = np.unique(train_sample_anime_list)

# Iterate through each test user and anime
for ind in tqdm(range(len(test_sample_user_list))):
    # Get user_id and anime_id
    user_id = test_sample_user_list[ind]
    anime_id = test_sample_anime_list[ind]

    # Check if the user exists in the training user list
    if user_id in users_ids:
        user_vec = U[users_ids.index(user_id)]
    else:
        user_vec = np.zeros(U.shape[1])  # Assign a zero vector if the user was not in the training set

    # Check if the anime exists in the training anime list
    if anime_id in train_sample_anime_list_unique:
        anime_vec = Vt_trans[anime_ids.index(anime_id)]
    else:
        anime_vec = np.zeros(Vt_trans.shape[1])  # Assign a zero vector if the anime was not in the training set

    test_user_vec.append(user_vec)
    test_anime_vec.append(anime_vec)

# Convert lists to arrays
test_user_vec = np.array(test_user_vec)
test_anime_vec = np.array(test_anime_vec)

# Checking the shape of user and anime_id vector representation in testing dataset
print("test_user_vec shape:", test_user_vec.shape)
print("test_anime_vec shape:", test_anime_vec.shape)


100%|██████████| 358882/358882 [00:55<00:00, 6427.86it/s]


test_user_vec shape: (358882, 50)
test_anime_vec shape: (358882, 50)


In [136]:
# Normalizing User Vector for both train and test dataset
normalizer = Normalizer()
normalizer.fit(train_user_vec)

train_user_vec_norm = normalizer.transform(train_user_vec)
test_user_vec_norm = normalizer.transform(test_user_vec)

# Normalizing Anime Vector for both train and test dataset
normalizer = Normalizer()
normalizer.fit(train_anime_vec)

train_anime_vec_norm = normalizer.transform(train_anime_vec)
test_anime_vec_norm = normalizer.transform(test_anime_vec)

# merging user vec and anime vec to create X_train and X_test
X_train = np.hstack((train_user_vec_norm, train_anime_vec_norm))
X_test = np.hstack((test_user_vec_norm, test_anime_vec_norm))

# saving train, test numpy array
np.save('X_train', X_train)
np.save('X_test', X_test)
np.save('y_train', y_train)
np.save('y_test', y_test)

In [137]:
# loading train, test numpy array
X_train = np.load('X_train.npy')
X_test = np.load('X_test.npy')
y_train = np.load('y_train.npy', allow_pickle = True)
y_test = np.load('y_test.npy', allow_pickle = True)

#### Evaluation Metrics for Collaborative Filtering Machine Learning Model

In [138]:
# function to get rmse and mape given actual and predicted ratings
def get_rmse_metrics(y_true, y_pred):
    rmse = np.sqrt(np.mean([ (y_true[i] - y_pred[i])**2 for i in range(len(y_pred)) ]))
    return rmse

### 3.2.2 Linear Regression on Model Based Collaborative Filtering

In [139]:
# Training Linear Regression Model and finding RMSE and MAPE value
linear_reg = LinearRegression(n_jobs = -1)
linear_reg.fit(X_train, y_train)
y_train_pred = linear_reg.predict(X_train)

print("In Linear Regression Model :")
rmse_train, mape_train = get_error_metrics(y_train, y_train_pred)
print('Train RMSE : ', rmse_train)
print('Train MAPE : ', mape_train)

y_test_pred = linear_reg.predict(X_test)
rmse_test, mape_test = get_error_metrics(y_true = y_test, y_pred = y_test_pred)
print('\nTest RMSE : ', rmse_test)
print('Test MAPE : ', mape_train)

In Linear Regression Model :
Train RMSE :  1.5014612917286474
Train MAPE :  20.38889403864671

Test RMSE :  1.7230795338635085
Test MAPE :  20.38889403864671


In [140]:
# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = ml_precision_recall_at_k(y_train.astype(float), y_train_pred.astype(float), train_sample_user_list, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = ml_precision_recall_at_k(y_train.astype(float), y_train_pred, train_sample_user_list, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('In Linear Regression Model applied on Collaborative Filtering : ')
print('Train precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Train precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

# calculating precision@10 and precision@5 metric for Training Dataset
precisions_at_10, recalls_at_10 = ml_precision_recall_at_k(y_test.astype(float), y_test_pred.astype(float), test_sample_user_list, k=10, threshold = 8)
precisions_at_5, recalls_at_5 = ml_precision_recall_at_k(y_test.astype(float), y_test_pred, test_sample_user_list, k=5, threshold = 8)
precisions_at_10 = np.array(list(precisions_at_10.values()))
precisions_at_5 = np.array(list(precisions_at_5.values()))
print('\nTest precisions@5 :', np.sum(precisions_at_5)/len(precisions_at_5))
print('Test precisions@10 :', np.sum(precisions_at_10)/len(precisions_at_10))

In Linear Regression Model applied on Collaborative Filtering : 
Train precisions@5 : 0.8566485003918216
Train precisions@10 : 0.84411822326708

Test precisions@5 : 0.7151227678571429
Test precisions@10 : 0.7018003295068027


### 3.2.3 Support Vector Regression (SVR) on Model Based Collaborative Filtering

In [None]:
# finding best parameter for Support Vector Regression Machine with the help of RandomizedSearchCV
params = {'C' : [0.001, 0.01, 0.1, 1, 10, 100]}
svm = LinearSVR(random_state = 42)
random_search = RandomizedSearchCV(svm, param_distributions = params, scoring = make_scorer(get_rmse_metrics, greater_is_better = False), cv = 3, n_jobs = -1)
random_search.fit(X_train, y_train)
print('Best Params: ', random_search.best_params_)