# Item-based Models for Netflix

Written by Jaume

In this file we are going to proceed with the building in the models. Machine learning models to build recommender systems. The models that are going to be built are Collaborative filtering using Item-based rating prediction (ItemKNN) and Item-based classification (ItemKNN).

Collaborative filtering is a technique used in recommendation systems to predict or classify items based on the preferences or behavior of similar users or items. Item-based Collaborative Filtering (CF) focuses on the similarity between items rather than users. There are two main approaches within item-based CF: rating prediction and classification.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.core.display import HTML # to pretty print pandas df and be able to copy them over (e.g. to ppt slides)

We import the necessary libraries for the modelling:

- Pandas: Data manipulation and analysis library in Python, primarily for structured data using Series and DataFrame.

- NumPy: Fundamental library for numerical computing with support for large arrays and mathematical functions.

- os: Python module for interacting with the operating system, used for file and directory operations.

In [2]:
# We import the necessary libraries
import pandas as pd
import numpy as np
import os

In [3]:
# We print the directory where the file is located
print(os.getcwd())

c:\Users\Jaume\Documents\MDDB\SDM\SDfM---Jaume-and-Stijn 2\SDfM---Jaume-and-Stijn


In [4]:
# We set the directory to the cleaned folder
os.listdir(os.path.join('.', 'cleaned'))

['final_sample30_parquet',
 'final_sample50_parquet',
 'final_sample5_parquet',
 'final_strat_sample_parquet',
 'movielens_parquet',
 'movielens_parquet_small',
 'netflix_parquet',
 'sample_tenth_netflix',
 'strat_sample_movielens',
 'strat_sample_netflix']

In [5]:
# We read the final_sample file and store it in a dataframe
df = pd.read_parquet('cleaned/strat_sample_netflix')

In [6]:
# We print shape of the dataframe
df.shape

(80, 7)

In [7]:
# We print the first 5 rows of the dataframe
df.head()

Unnamed: 0,movieId,review_data,genres,year,title,review_stratum,num_reviews
234,235,"[{'date': 2004-03-22, 'rating': 4.0, 'userId':...","[Drama, Romance]",1996,Unhook the Stars,Q2,442
95,96,"[{'date': 2005-07-25, 'rating': 4.0, 'userId':...",[Documentary],2000,Inside the Space Station,Q2,488
1685,1686,"[{'date': 2004-03-22, 'rating': 5.0, 'userId':...",,1998,Riding the Rails: American Experience,Q2,551
1047,1048,"[{'date': 2002-06-29, 'rating': 3.0, 'userId':...","[Documentary, Music]",1997,Year of the Horse: Neil Young & Crazy Horse Live,Q2,804
1447,1448,"[{'date': 2004-05-02, 'rating': 4.0, 'userId':...","[Animation, Action, Sci-Fi]",1999,Blue Submarine 6,Q2,503


In [8]:
# We print the columns of the dataframe
df.columns

Index(['movieId', 'review_data', 'genres', 'year', 'title', 'review_stratum',
       'num_reviews'],
      dtype='object')

In [9]:
# Explode the review_data column
exploded_df = df.explode('review_data').reset_index(drop=True)

In [10]:
exploded_df.head()

exploded_df.shape

# We get the first row of exploded_df
exploded_df.iloc[0]

Unnamed: 0,movieId,review_data,genres,year,title,review_stratum,num_reviews
0,235,"{'date': 2004-03-22, 'rating': 4.0, 'userId': ...","[Drama, Romance]",1996,Unhook the Stars,Q2,442
1,235,"{'date': 2004-10-09, 'rating': 1.0, 'userId': ...","[Drama, Romance]",1996,Unhook the Stars,Q2,442
2,235,"{'date': 2005-01-26, 'rating': 4.0, 'userId': ...","[Drama, Romance]",1996,Unhook the Stars,Q2,442
3,235,"{'date': 2005-10-26, 'rating': 4.0, 'userId': ...","[Drama, Romance]",1996,Unhook the Stars,Q2,442
4,235,"{'date': 2005-04-06, 'rating': 3.0, 'userId': ...","[Drama, Romance]",1996,Unhook the Stars,Q2,442


(530497, 7)

movieId                                                         235
review_data       {'date': 2004-03-22, 'rating': 4.0, 'userId': ...
genres                                             [Drama, Romance]
year                                                           1996
title                                              Unhook the Stars
review_stratum                                                   Q2
num_reviews                                                     442
Name: 0, dtype: object

In [11]:
# Convert the exploded dictionary column into separate columns
expanded_review_data = pd.json_normalize(exploded_df['review_data'])


In [12]:
expanded_review_data.head()

Unnamed: 0,date,rating,userId
0,2004-03-22,4.0,1575876
1,2004-10-09,1.0,1862608
2,2005-01-26,4.0,1447354
3,2005-10-26,4.0,466862
4,2005-04-06,3.0,429382


In [13]:
# Concatenate the expanded data with the original DataFrame
df = pd.concat([exploded_df.drop('review_data', axis=1), expanded_review_data], axis=1)

# Verify the shape of the final DataFrame
print("DataFrame shape:", df.shape)


DataFrame shape: (530497, 9)


In [14]:
df.head()

Unnamed: 0,movieId,genres,year,title,review_stratum,num_reviews,date,rating,userId
0,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-03-22,4.0,1575876
1,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-10-09,1.0,1862608
2,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-01-26,4.0,1447354
3,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-10-26,4.0,466862
4,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-04-06,3.0,429382


In [15]:
# Check if there are userIDs that haven't rated movies at all
df['userId'].value_counts().sort_values(ascending=False)

# Check if there are movieIDs that haven't been rated at all
df['movieId'].value_counts().sort_values(ascending=False)

userId
305344     80
387418     79
2439493    73
1664010    72
2118461    72
           ..
361149      1
1978374     1
2451648     1
51372       1
1188337     1
Name: count, Length: 204662, dtype: int64

movieId
1561    61019
658     44001
1267    42113
1571    40841
143     38362
        ...  
1337      108
1152       96
119        96
1764       94
317        94
Name: count, Length: 80, dtype: int64

In [16]:
# Check if there are NaN values in the rating column
df['rating'].isna().sum()

0

In [17]:
df.head()

Unnamed: 0,movieId,genres,year,title,review_stratum,num_reviews,date,rating,userId
0,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-03-22,4.0,1575876
1,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-10-09,1.0,1862608
2,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-01-26,4.0,1447354
3,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-10-26,4.0,466862
4,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-04-06,3.0,429382


Now we get rid of the cold start problem, we are not gonna give recommendation to the users if they haven't reviewed at least 3 movies.

In [18]:
# Group the DataFrame by 'userId' and count the number of ratings per user
user_rating_counts = df.groupby('userId').size()

# Filter out users who have rated less than 5 movies
active_users = user_rating_counts[user_rating_counts >= 5].index.tolist()

# Select rows from the original DataFrame for active users
df = df[df['userId'].isin(active_users)]

# Display the filtered DataFrame
df.head()


Unnamed: 0,movieId,genres,year,title,review_stratum,num_reviews,date,rating,userId
0,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-03-22,4.0,1575876
2,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-01-26,4.0,1447354
3,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2005-10-26,4.0,466862
5,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-07-14,3.0,2259898
6,235,"[Drama, Romance]",1996,Unhook the Stars,Q2,442,2004-01-27,4.0,1977959


In [19]:
# Verify the shape of the final DataFrame
print("DataFrame shape:", df.shape)

DataFrame shape: (220335, 9)


We check the number of different movies that we have in the dataset and the number of unique users we have.

In [20]:
# We print the number of unique users and movies
print(df['userId'].nunique())
print(df['movieId'].nunique())

30349
80


We see that we have `20770` unique users and `80` movies, a significant number to train the model and recommend apropiate movies.

In [21]:
df.columns

Index(['movieId', 'genres', 'year', 'title', 'review_stratum', 'num_reviews',
       'date', 'rating', 'userId'],
      dtype='object')

In [22]:
# We print the number of unique users and movies
print(df['userId'].nunique())
print(df['movieId'].nunique())

30349
80


In [23]:
df['genres']

0                          [Drama, Romance]
2                          [Drama, Romance]
3                          [Drama, Romance]
5                          [Drama, Romance]
6                          [Drama, Romance]
                        ...                
530485    [Crime, Drama, Mystery, Thriller]
530486    [Crime, Drama, Mystery, Thriller]
530487    [Crime, Drama, Mystery, Thriller]
530491    [Crime, Drama, Mystery, Thriller]
530493    [Crime, Drama, Mystery, Thriller]
Name: genres, Length: 220335, dtype: object

In [24]:
df.head

<bound method NDFrame.head of         movieId                             genres  year             title  \
0           235                   [Drama, Romance]  1996  Unhook the Stars   
2           235                   [Drama, Romance]  1996  Unhook the Stars   
3           235                   [Drama, Romance]  1996  Unhook the Stars   
5           235                   [Drama, Romance]  1996  Unhook the Stars   
6           235                   [Drama, Romance]  1996  Unhook the Stars   
...         ...                                ...   ...               ...   
530485      330  [Crime, Drama, Mystery, Thriller]  1998       Wild Things   
530486      330  [Crime, Drama, Mystery, Thriller]  1998       Wild Things   
530487      330  [Crime, Drama, Mystery, Thriller]  1998       Wild Things   
530491      330  [Crime, Drama, Mystery, Thriller]  1998       Wild Things   
530493      330  [Crime, Drama, Mystery, Thriller]  1998       Wild Things   

       review_stratum  num_review

We have to `One-Hot-Encode` the genre columns to have 0 and 1 in them if the movie has that genre. We also have to distinguish the columns that are part of the genres from the ones that aren't since we are going to also use the genres in this model to calculate the similarity between items.

In [25]:
# One-hot encode the 'genres' column
df_genres = df['genres'].str.join('|').str.get_dummies()

# Rename the columns with 'genre_' prefix
df_genres = df_genres.add_prefix('genre_')

# Concatenate the encoded columns with the original DataFrame
df = df.join(df_genres)

# Drop the original 'genres' column
df.drop('genres', axis=1, inplace=True)

# Display the modified DataFrame
df.head()

Unnamed: 0,movieId,year,title,review_stratum,num_reviews,date,rating,userId,genre_Action,genre_Adventure,...,genre_Horror,genre_Music,genre_Musical,genre_Mystery,genre_Romance,genre_Sci-Fi,genre_Short,genre_Sport,genre_Thriller,genre_Western
0,235,1996,Unhook the Stars,Q2,442,2004-03-22,4.0,1575876,0,0,...,0,0,0,0,1,0,0,0,0,0
2,235,1996,Unhook the Stars,Q2,442,2005-01-26,4.0,1447354,0,0,...,0,0,0,0,1,0,0,0,0,0
3,235,1996,Unhook the Stars,Q2,442,2005-10-26,4.0,466862,0,0,...,0,0,0,0,1,0,0,0,0,0
5,235,1996,Unhook the Stars,Q2,442,2004-07-14,3.0,2259898,0,0,...,0,0,0,0,1,0,0,0,0,0
6,235,1996,Unhook the Stars,Q2,442,2004-01-27,4.0,1977959,0,0,...,0,0,0,0,1,0,0,0,0,0


In [26]:
# We print the number of unique users and movies
print(df['userId'].nunique())
print(df['movieId'].nunique())

30349
80


# Item-based Rating Prediction (ItemKNN)

### Step 1.1: User-Item matrix construction
The first thing we have to do is build the user-item matrix:

- Choose a similarity metric to calculate the similarity between items. Common metrics include cosine similarity, Pearson correlation coefficient, and Jaccard similarity. We are going to use the first one of them, the cosine similarity.

##### Libraries Used
- **pandas**: A powerful data manipulation library in Python.
- **numpy**: A library for numerical computing in Python.

In [27]:
import pandas as pd
import numpy as np

# Convert userId column to integer type if needed
if df['userId'].dtype != int:
    df['userId'] = df['userId'].astype(int)

# Get all unique user IDs and movie IDs
all_user_ids = np.unique(df['userId'])
all_movie_ids = np.unique(df['movieId'])

# Determine the number of unique users and movies
num_users = len(all_user_ids)
num_movies = len(all_movie_ids)

# Create a dictionary to store ratings for each user
user_ratings = {}

# Iterate through the DataFrame and populate the user_ratings dictionary
for _, row in df.iterrows():
    user_id = row['userId']
    movie_id = row['movieId']
    rating = row['rating']
    
    # Initialize the user's ratings list if not already present
    if user_id not in user_ratings:
        user_ratings[user_id] = np.zeros(num_movies)
    
    # Assign the rating to the corresponding movie index
    user_ratings[user_id][np.where(all_movie_ids == movie_id)[0][0]] = rating

# Create user-item matrix from the dictionary
item_user_matrix = pd.DataFrame(user_ratings.values(), index=user_ratings.keys(), columns=all_movie_ids)

# Convert user-item matrix to NumPy array for faster computation
item_user_array = item_user_matrix.to_numpy()


In [28]:
item_user_matrix.head()

Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
1575876,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1447354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
466862,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2259898,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1977959,0.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,4.0,...,4.0,4.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,3.0


In [29]:
item_user_array

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 2., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Train-test split

The train-val-test split is a technique used in machine learning to evaluate the performance of a model. It involves dividing the dataset into three subsets: the training set, the validation set, and the test set.

In this model we are going to devide the user-item interaction matrix into three since in this one, we already have all the items and we can already calculate the similarity for all the items with the train set.

The training set is used to train the model and optimize its parameters.
The validation set is used to fine-tune the model and select the best hyperparameters. The test set is used to evaluate the final performance of the model on unseen data.

By using a train-val-test split, we can assess the model's performance on unseen data and ensure that it generalizes well to new examples. It helps prevent overfitting and provides a more reliable estimate of the model's performance in real-world scenarios.

For the split we will use the `sklearn` library, more specifacally the `train_test_split` function. This function is commonly used in machine learning workflows to split a dataset into two subsets: one for training the model and another for testing the model's performance.

In [30]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
train_val, test = train_test_split(item_user_array, test_size=0.2, random_state=42)

# Split the training set into training and validation sets
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

# Print the shapes of the datasets
print("Training set shape:", train.shape)
print("Validation set shape:", val.shape)
print("Test set shape:", test.shape)

# We are also going to do the split for the matrix df
# Split the user-item matrix into training and test sets
train_val_matrix, test_matrix = train_test_split(item_user_matrix, test_size=0.2, random_state=42)

# Split the training set matrix into training and validation sets
train_matrix, val_matrix = train_test_split(train_val_matrix, test_size=0.2, random_state=42)

# We print a ' ' to give some space inbetween lines
print(' ')

# Print the shapes of the matrix datasets
print("Training set matrix shape:", train_matrix.shape)
print("Validation set matrix shape:", val_matrix.shape)
print("Test set matrix shape:", test_matrix.shape)


Training set shape: (19423, 80)
Validation set shape: (4856, 80)
Test set shape: (6070, 80)
 
Training set matrix shape: (19423, 80)
Validation set matrix shape: (4856, 80)
Test set matrix shape: (6070, 80)


In [31]:
train_matrix

Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
843201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
531747,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2143797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2153721,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
834933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1547655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1408468,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2355940,0.0,0.0,0.0,4.0,0.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,3.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0
428168,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [32]:
train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 4., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 3., 0.]])

We define mapping dictionaries that will be used to locate where the movies and users are in the different matrices created (train, df, test, item similarity...)

In [33]:
# Adjusting the mapping to start indexing from 0
user_id_to_index = {user_id: i for i, user_id in enumerate(item_user_matrix.index)}
index_to_user_id = {i: user_id for i, user_id in enumerate(item_user_matrix.index)}

# Create a mapping from movie IDs to indices
movie_id_to_index = {movie_id: i for i, movie_id in enumerate(item_user_matrix.columns)}
index_to_movie_id = {i: movie_id for i, movie_id in enumerate(item_user_matrix.columns)}

### Step 1.2: Item-Genre matrix

In the following code snippet we will construct the item-genre matrix, that will consist of all the items in the dataset (we will input the ones in train to not use the test set, since we have all of them in train) and the genres of each of the items, with this matrix we will be able to compute using the cosine similarity the item-item similarity matrix based on the genres.

In [34]:
import numpy as np

# We create item-genre matrix
def create_item_genre_matrix(df, train):
    """
    Create item-genre matrix based on genre information in the dataframe.

    Parameters:
        df (pandas.DataFrame): DataFrame containing movie genre information.
        train (pandas.DataFrame): DataFrame containing user-item ratings in the train set.

    Returns:
        numpy.ndarray: Item-genre matrix.
    """
    # Extract genre columns from the dataframe
    genre_columns = [col for col in df.columns if col.startswith('genre_')]
    
    # Extract item IDs
    item_ids = train.columns
    
    # Initialize item-genre matrix with zeros
    item_genre_matrix = np.zeros((len(genre_columns), len(item_ids)))
    
    # Fill the matrix with genre information
    for i, row in df.iterrows():
        item_id = row['movieId']  # Assuming 'movieId' is the column containing item IDs
        if item_id in item_ids:
            item_index = np.where(item_ids == item_id)[0][0]
            genres = row[genre_columns].values
            item_genre_matrix[:, item_index] = genres
    
    return item_genre_matrix

# We create the item_genre matrix with the train set and the original df
item_genre_matrix = create_item_genre_matrix(df, train_matrix)


In [35]:
item_genre_matrix

item_genre_matrix.shape

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

(21, 80)

### Step 2: Calculate Similarities

In this step of the model we are going to define the `cosine_similarity` function, that will take in two vectors and will calculate the similarity between them.

In [36]:
# Calculate cosine similarity between items using NumPy functions
def cosine_similarity(a, b):
    """
    Calculate cosine similarity between two vectors.

    Parameters:
        a (numpy.ndarray): First vector.
        b (numpy.ndarray): Second vector.

    Returns:
        float: Cosine similarity between the two vectors.
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity



#### Collaborative Filtering with Cosine Similarity

This code snippet demonstrates how to perform item-based collaborative filtering using cosine similarity. Collaborative filtering is a technique commonly used in recommendation systems to predict a user's preferences for items based on the preferences of similar users/items.


In [37]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def calculate_item_similarity_matrix(data):
    """
    Calculate item-item similarity matrix based on user-item ratings using cosine similarity.

    Parameters:
        data (numpy.ndarray or scipy.sparse matrix): User-item ratings matrix.

    Returns:
        numpy.ndarray: Item-item similarity matrix.
    """
    # Calculate cosine similarity between items
    item_similarity_matrix = cosine_similarity(data.T)
    np.fill_diagonal(item_similarity_matrix, 1)  # Set diagonal values to 1
    return item_similarity_matrix

# We calculate the item similarity using the train user-item matrix data
item_similarity_matrix_train = calculate_item_similarity_matrix(train)
print(item_similarity_matrix_train)


[[1.         0.01768747 0.0116242  ... 0.02849381 0.03972115 0.01476601]
 [0.01768747 1.         0.11724002 ... 0.04434947 0.07198231 0.0609452 ]
 [0.0116242  0.11724002 1.         ... 0.02106872 0.02703408 0.10719682]
 ...
 [0.02849381 0.04434947 0.02106872 ... 1.         0.11826213 0.06285302]
 [0.03972115 0.07198231 0.02703408 ... 0.11826213 1.         0.04448452]
 [0.01476601 0.0609452  0.10719682 ... 0.06285302 0.04448452 1.        ]]


Now we have to repeat the same process but for the item-genres matrix to have a matrix with the similarity between items.

In [38]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

def calculate_item_similarity_genres(item_genre_matrix):
    """
    Calculate item-item similarity based on genre information using cosine similarity.

    Parameters:
        item_genre_matrix (numpy.ndarray): Matrix containing item-genre information.

    Returns:
        numpy.ndarray: Matrix containing item-item similarity scores.
    """
    # Convert the item-genre matrix to a sparse representation
    sparse_item_genre_matrix = sparse.csr_matrix(item_genre_matrix)
    
    # Calculate cosine similarity between items
    item_similarity_matrix = cosine_similarity(sparse_item_genre_matrix.T)
    
    # Set diagonal values to 1
    np.fill_diagonal(item_similarity_matrix, 1)
    
    return item_similarity_matrix

# Example usage:
# Assuming item_genre_matrix is the item-genre matrix
item_similarity_genres_matrix = calculate_item_similarity_genres(item_genre_matrix)
print(item_similarity_genres_matrix)


[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


In [39]:
item_similarity_genres_matrix.shape

(80, 80)

In [40]:
movie_id_to_index

{39: 0,
 96: 1,
 119: 2,
 143: 3,
 160: 4,
 181: 5,
 188: 6,
 208: 7,
 227: 8,
 235: 9,
 263: 10,
 273: 11,
 303: 12,
 317: 13,
 330: 14,
 339: 15,
 358: 16,
 386: 17,
 398: 18,
 426: 19,
 437: 20,
 443: 21,
 516: 22,
 521: 23,
 560: 24,
 567: 25,
 582: 26,
 599: 27,
 636: 28,
 650: 29,
 658: 30,
 675: 31,
 720: 32,
 739: 33,
 749: 34,
 819: 35,
 855: 36,
 945: 37,
 986: 38,
 996: 39,
 1048: 40,
 1057: 41,
 1063: 42,
 1089: 43,
 1126: 44,
 1129: 45,
 1152: 46,
 1194: 47,
 1195: 48,
 1233: 49,
 1262: 50,
 1267: 51,
 1284: 52,
 1296: 53,
 1298: 54,
 1337: 55,
 1349: 56,
 1352: 57,
 1358: 58,
 1364: 59,
 1373: 60,
 1376: 61,
 1377: 62,
 1413: 63,
 1448: 64,
 1489: 65,
 1523: 66,
 1561: 67,
 1571: 68,
 1575: 69,
 1589: 70,
 1686: 71,
 1730: 72,
 1764: 73,
 1819: 74,
 1820: 75,
 1852: 76,
 1857: 77,
 1902: 78,
 1949: 79}

In [41]:
print(index_to_user_id)

{0: 1575876, 1: 1447354, 2: 466862, 3: 2259898, 4: 1977959, 5: 507921, 6: 1224299, 7: 933168, 8: 1027819, 9: 786312, 10: 917679, 11: 587812, 12: 1733406, 13: 1509395, 14: 1537854, 15: 1931026, 16: 921345, 17: 1777211, 18: 380354, 19: 2604976, 20: 789969, 21: 1927580, 22: 1070629, 23: 664887, 24: 938804, 25: 2370879, 26: 1030857, 27: 1662714, 28: 478176, 29: 1203445, 30: 2170930, 31: 357482, 32: 519965, 33: 2035299, 34: 1792741, 35: 164845, 36: 2237185, 37: 187322, 38: 596474, 39: 2439493, 40: 1047830, 41: 1863627, 42: 1395967, 43: 1850615, 44: 2305067, 45: 2415589, 46: 238740, 47: 174783, 48: 1386053, 49: 2439537, 50: 817339, 51: 1795069, 52: 930113, 53: 1449738, 54: 2243690, 55: 2389367, 56: 2077564, 57: 684876, 58: 2488357, 59: 302344, 60: 1473980, 61: 570159, 62: 657641, 63: 1531029, 64: 2057248, 65: 937813, 66: 322451, 67: 156622, 68: 1621533, 69: 738062, 70: 1045053, 71: 1219323, 72: 1903324, 73: 2365581, 74: 818752, 75: 1322507, 76: 1380915, 77: 2198070, 78: 1060057, 79: 524142, 

In [42]:
print(user_id_to_index)

{1575876: 0, 1447354: 1, 466862: 2, 2259898: 3, 1977959: 4, 507921: 5, 1224299: 6, 933168: 7, 1027819: 8, 786312: 9, 917679: 10, 587812: 11, 1733406: 12, 1509395: 13, 1537854: 14, 1931026: 15, 921345: 16, 1777211: 17, 380354: 18, 2604976: 19, 789969: 20, 1927580: 21, 1070629: 22, 664887: 23, 938804: 24, 2370879: 25, 1030857: 26, 1662714: 27, 478176: 28, 1203445: 29, 2170930: 30, 357482: 31, 519965: 32, 2035299: 33, 1792741: 34, 164845: 35, 2237185: 36, 187322: 37, 596474: 38, 2439493: 39, 1047830: 40, 1863627: 41, 1395967: 42, 1850615: 43, 2305067: 44, 2415589: 45, 238740: 46, 174783: 47, 1386053: 48, 2439537: 49, 817339: 50, 1795069: 51, 930113: 52, 1449738: 53, 2243690: 54, 2389367: 55, 2077564: 56, 684876: 57, 2488357: 58, 302344: 59, 1473980: 60, 570159: 61, 657641: 62, 1531029: 63, 2057248: 64, 937813: 65, 322451: 66, 156622: 67, 1621533: 68, 738062: 69, 1045053: 70, 1219323: 71, 1903324: 72, 2365581: 73, 818752: 74, 1322507: 75, 1380915: 76, 2198070: 77, 1060057: 78, 524142: 79, 

In [43]:
item_user_array.shape

(30349, 80)

In [44]:
item_user_matrix.shape

(30349, 80)

In [45]:
item_similarity_matrix_train.shape

(80, 80)

### Matrices Summation:

Once we have both similarity matrices (the one from the ratings and the one from the genres) we can add them with a weighted sumation.

In [46]:
import numpy as np

def combine_similarity_matrices(similarity_matrix1, similarity_matrix2, weight1=0.5, weight2=0.5):
    # Check if the matrices have the same shape
    if similarity_matrix1.shape != similarity_matrix2.shape:
        raise ValueError("Matrices must have the same shape.")
    
    # Combine matrices using weighted summation directly without creating an intermediate array
    np.multiply(similarity_matrix1, weight1, out=similarity_matrix1)
    np.multiply(similarity_matrix2, weight2, out=similarity_matrix2)
    np.add(similarity_matrix1, similarity_matrix2, out=similarity_matrix1)
    return similarity_matrix1

# Define weights for each matrix
weight_ratings = 0.5
weight_genres = 0.5

# Combine the similarity matrices
item_similarity_matrix_train = combine_similarity_matrices(item_similarity_matrix_train, item_similarity_genres_matrix, weight_ratings, weight_genres)

# Print or use the combined similarity matrix as needed
print("Combined Similarity Matrix:\n", item_similarity_matrix_train)


Combined Similarity Matrix:
 [[1.         0.00884374 0.0058121  ... 0.01424691 0.01986057 0.00738301]
 [0.00884374 1.         0.55862001 ... 0.02217473 0.03599116 0.0304726 ]
 [0.0058121  0.55862001 1.         ... 0.01053436 0.01351704 0.05359841]
 ...
 [0.01424691 0.02217473 0.01053436 ... 1.         0.05913106 0.03142651]
 [0.01986057 0.03599116 0.01351704 ... 0.05913106 1.         0.02224226]
 [0.00738301 0.0304726  0.05359841 ... 0.03142651 0.02224226 1.        ]]


### Step 3: Neighborhood Selection
- Determine the neighborhood size, i.e., the number of most similar items to consider when predicting ratings for a target item.
- Select the most similar items for each item in the dataset based on their calculated similarities. This forms the neighborhood for each item.

#### Item-Based Neighborhoods and Ratings Aggregation

This code following snippet enhances the previous item-based collaborative filtering approach by considering ratings aggregation within the item neighborhoods.

##### Steps:

1. **Defining Neighborhood Size**:
   - The variable `neighborhood_size` determines the number of most similar items to consider in the neighborhood.

2. **Initializing Data Structure**:
   - An empty dictionary `item_neighborhoods` is initialized to store the neighborhoods for each item.

3. **Iterating Over Items**:
   - For each movie in the dataset:
     - All ratings for the current movie are extracted from the DataFrame (`df`).
     - Ratings aggregation is performed. In this example, the average rating for the movie is computed, but other aggregation methods can be used.
     - The similarity scores for the current movie are retrieved from the precomputed `item_similarity_matrix`.
     - Similarity scores are sorted in descending order, and the indices of the most similar items (excluding itself) are obtained.
     - These indices are converted back to movie IDs, forming the neighborhood for the current item.
     - The neighborhood for the current item is stored in the `item_neighborhoods` dictionary.

4. **Output**:
   - `item_neighborhoods`: A dictionary where keys are movie IDs, and values are lists of movie IDs representing the neighborhood of each item. Each movie's neighborhood includes movies with similar ratings and content.

##### Note:
- This approach considers both similarity in ratings and content (as captured by cosine similarity) when building item neighborhoods.
- Aggregating ratings within item neighborhoods helps in providing more personalized recommendations.


In [47]:
import numpy as np

def neighborhood_selection(item_user_array, index_to_movie_id, item_similarity_matrix, neighborhood_size=5):
    """
    Selects item neighborhoods based on similarity scores.

    Parameters:
        item_user_array (numpy.ndarray): Matrix containing user-item ratings.
        index_to_movie_id (dict): Dictionary mapping item indices to movie IDs.
        item_similarity_matrix (numpy.ndarray): Matrix containing item-item similarity scores.
        neighborhood_size (int): Number of most similar items to include in the neighborhood.

    Returns:
        dict: Dictionary containing item neighborhoods for each movie.
    """
    # Initialize an empty dictionary to store item neighborhoods
    item_neighborhoods = {}

    # Iterate over each item (movie) index in the dataset
    for movie_index in range(item_user_array.shape[1]):
        # Convert the item index to movie ID
        movie_id = index_to_movie_id[movie_index]

        # Extract all ratings for the current movie
        movie_ratings = item_user_array[:, movie_index]

        # Aggregate ratings (e.g., compute the mean rating)
        movie_avg_rating = np.mean(movie_ratings)

        # Retrieve similarity scores for the current movie
        similarity_scores = item_similarity_matrix[movie_index]

        # Sort similarity scores in descending order and get indices of most similar items
        most_similar_indices = np.argsort(similarity_scores)[::-1][1:neighborhood_size+1]

        # Convert indices back to movie IDs to form the neighborhood
        neighborhood = [index_to_movie_id[idx] for idx in most_similar_indices]

        # Store the neighborhood for the current item in the item_neighborhoods dictionary
        item_neighborhoods[movie_id] = neighborhood

    return item_neighborhoods

# Example usage:
item_neighborhoods_train = neighborhood_selection(item_user_array, index_to_movie_id, item_similarity_matrix_train)
print(item_neighborhoods_train)


{39: [1296, 986, 739, 1364, 650], 96: [1764, 119, 1063, 720, 1575], 119: [1764, 96, 1063, 720, 1575], 143: [330, 658, 996, 599, 1571], 160: [386, 516, 1902, 582, 986], 181: [1296, 516, 986, 1561, 227], 188: [720, 96, 119, 1764, 1063], 208: [1902, 636, 1571, 1267, 658], 227: [1561, 1852, 1489, 636, 1296], 235: [1194, 303, 398, 516, 986], 263: [1730, 437, 1561, 658, 1267], 273: [1561, 636, 330, 1262, 143], 303: [235, 398, 1194, 516, 986], 317: [567, 1152, 739, 1364, 303], 330: [143, 599, 1262, 1298, 1571], 339: [1575, 1764, 303, 119, 945], 358: [599, 1364, 330, 1589, 317], 386: [516, 986, 235, 599, 1129], 398: [303, 1194, 235, 516, 986], 426: [1267, 749, 1089, 1561, 227], 437: [1337, 1730, 675, 263, 1820], 443: [1129, 1358, 658, 1352, 516], 516: [986, 398, 181, 235, 386], 521: [1949, 119, 1852, 1820, 303], 560: [1902, 658, 1819, 1857, 1267], 567: [317, 1152, 1364, 739, 358], 582: [1352, 516, 1561, 1358, 986], 599: [330, 143, 1262, 358, 1589], 636: [1561, 1262, 273, 749, 227], 650: [1377,

### Step 4: Rating Prediction

- For each target item and user pair where the user hasn't rated the target item:
Identify the neighborhood of similar items to the target item.
- Predict the rating for the target item using a weighted average of the ratings of the items in its neighborhood, where the weights are the similarities between the items and the target item.
- Adjust the prediction based on the user's average rating or other normalization techniques, if necessary.

In [48]:
# Initialize an empty DataFrame to store predicted ratings
predicted_ratings_train = pd.DataFrame(index=train_matrix.index, columns=train_matrix.columns)

# Iterate over each user-item pair in the training set
for user_id, user_ratings in train_matrix.iterrows():
    for movie_id, rating in user_ratings.items():
        
        # Check if the movie has a neighborhood defined
        if movie_id in item_neighborhoods_train:
            neighborhood = item_neighborhoods_train[movie_id]
            
            # Filter out movies from the neighborhood that the user has rated in the training set
            filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0]
            
            # Check if there are valid indices in the filtered neighborhood
            if len(filtered_neighborhood) > 0:
                # Calculate the predicted rating for the target movie based on the neighborhood
                neighbor_ratings = [user_ratings.loc[neighbor_movie_id] for neighbor_movie_id in filtered_neighborhood]
                
                # Calculate the mean rating using only the movies in the neighborhood that have been rated by the user
                predicted_rating = np.mean(neighbor_ratings)
            else:
                # If the filtered neighborhood is empty, assign the mean rating of all movies rated by the user
                predicted_rating = user_ratings[user_ratings != 0].mean()
            
            # Assign the predicted rating to the corresponding cell in the DataFrame
            predicted_ratings_train.at[user_id, movie_id] = predicted_rating

# Fill NaN values with mean ratings across all users
predicted_ratings_train.fillna(predicted_ratings_train.mean().mean(), inplace=True)

# Display the head of the predicted_ratings DataFrame
predicted_ratings_train.head()


Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
843201,3.857143,4.0,4.0,4.666667,3.857143,4.0,4.0,3.333333,3.5,3.857143,...,4.5,3.857143,4.0,4.0,5.0,3.857143,4.0,4.0,5.0,3.857143
531747,4.0,4.0,4.0,4.5,4.0,2.0,4.0,4.333333,3.0,4.0,...,5.0,4.0,3.5,4.0,4.0,4.0,2.0,4.0,4.5,4.0
2143797,4.0,4.0,4.0,3.333333,3.777778,3.777778,4.0,3.5,3.0,3.777778,...,3.0,3.777778,3.777778,4.0,4.0,3.777778,3.777778,3.5,3.0,3.777778
2153721,2.692308,4.0,4.0,2.5,3.0,4.0,4.0,1.75,1.0,4.0,...,2.75,2.692308,2.0,4.0,2.333333,2.692308,2.692308,1.5,2.75,3.0
834933,3.0,5.0,5.0,2.666667,3.0,3.0,5.0,2.666667,2.0,3.0,...,2.5,3.0,3.0,5.0,3.0,3.0,3.0,2.5,3.0,3.0


In [49]:
# We count the NaN values in the predicted_ratings_train dataframe
predicted_ratings_train.isnull().sum().sum()

0

We are now going to define a function that gets the top recomendations per user using the predicted ratings.

In [50]:
def get_top_recommendations(user_id, predicted_ratings, df):
    """
    Get top movie recommendations for a given user using predicted ratings.

    Parameters:
        user_id (int): ID of the user for whom recommendations are to be generated.
        predicted_ratings (pd.DataFrame): DataFrame containing predicted ratings for users and movies.
        df (pd.DataFrame): DataFrame containing movie ratings.

    Returns:
        list: Top movie titles recommended for the user.
    """
    # Ensure that predicted_ratings is a DataFrame
    if not isinstance(predicted_ratings, pd.DataFrame):
        raise ValueError("predicted_ratings must be a pandas DataFrame")

    # Check if the user ID exists in the predicted ratings DataFrame's index
    if user_id not in predicted_ratings.index:
        # If the user ID doesn't exist, return an empty list
        return ["The user doesn't exist"]

    # Get the predicted ratings for the user
    user_predicted_ratings = predicted_ratings.loc[user_id]
    
    # Filter out the movies that the user has already seen
    seen_movies = set(df[df['userId'] == user_id]['movieId'])
    unseen_movies = [movie_id for movie_id in user_predicted_ratings.index if movie_id not in seen_movies]
    
    # Check if there are unseen movies with predicted ratings
    if not unseen_movies:
        # If all movies have been seen, return an empty list
        return []
    
    # Sort the unseen movies by predicted rating in descending order
    sorted_unseen_movies = user_predicted_ratings[unseen_movies].sort_values(ascending=False)
    
    # Get the top 5 movie titles
    top_movie_ids = sorted_unseen_movies.head(5).index.tolist()
    
    # Get the unique movie titles corresponding to the top movie IDs
    top_movie_titles = set(df[df['movieId'].isin(top_movie_ids)]['title'])
    
    return list(top_movie_titles)[:5]  # Return only the first 5 unique movie titles


And we do a loop through all the users in the train data to get their recommended movies.

In [51]:
top_recommendations = []

# Iterate over each row (user) in the train_matrix
for user_id in train_matrix.index:
    # Get the actual user ID corresponding to the user index
    recommendations = get_top_recommendations(user_id, predicted_ratings_train, df)
    top_recommendations.append(recommendations)

# Output: List of top movie recommendations for each user in the train matrix
top_recommendations

# print size of top_recommendations
len(top_recommendations)


[['The Faculty',
  'Madonna: The Girlie Show: Live Down Under',
  'Species II',
  'The Bat / House on Haunted Hill',
  'Yojimbo'],
 ['The Land Before Time VII: Stone of Cold Fire',
  'Doctor Who: Ghost Light',
  'The Duel at Silver Creek',
  'Yojimbo',
  'Shoot to Kill'],
 ['Empires: Martin Luther',
  'Love Reinvented',
  'Princess Caraboo',
  'The Duel at Silver Creek',
  'Robin Hood: Prince of Thieves'],
 ['Unhook the Stars',
  'Year of the Horse: Neil Young & Crazy Horse Live',
  'The Trip',
  'In the Realm of the Senses',
  'Hemp Revolution'],
 ['Travel the World by Train: Africa',
  'Dead Birds',
  'Finest Hour: The Battle of Britain',
  'Year of the Horse: Neil Young & Crazy Horse Live',
  'Inside the Space Station'],
 ['Stir Crazy',
  'Dr. Dolittle 2',
  'Foul King',
  'Mutiny on the Bounty',
  'No Ordinary Love'],
 ['Fall',
  'Uncovered: The Whole Truth About the Iraq War',
  'Princess Caraboo',
  'In the Realm of the Senses',
  'No Ordinary Love'],
 ['Taxi',
  'Uncovered: The 

19423

### Step 5: Model Evaluation

We evaluate the performance of the ItemKNN algorithm using appropriate evaluation metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)

In [52]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_model(train_matrix, predicted_ratings):
    """
    Evaluate the model's performance on the training data.

    Parameters:
        train_matrix (numpy.ndarray): Item-user matrix from the training data.
        predicted_ratings (pandas.DataFrame or numpy.ndarray): Predicted ratings DataFrame or array for the training data.

    Returns:
        float: Mean Absolute Error (MAE) on the training data.
        float: Root Mean Squared Error (RMSE) on the training data.
    """
    # Convert predicted_ratings to a numpy array if it's a DataFrame
    if isinstance(predicted_ratings, pd.DataFrame):
        predicted_ratings = predicted_ratings.to_numpy()

    # Ensure train_matrix and predicted_ratings have the same shape
    assert train_matrix.shape == predicted_ratings.shape, "Shapes of train_matrix and predicted_ratings are not consistent."

    # Initialize lists to store true and predicted ratings
    true_ratings = []
    pred_ratings = []

    # Iterate over each user and their ratings
    for user_id, user_ratings in enumerate(train_matrix):
        for movie_id, rating in enumerate(user_ratings):
            # Skip unrated movies
            if rating == 0:
                continue

            # Check if the indices are within the bounds of the predicted ratings array
            if user_id < predicted_ratings.shape[0] and movie_id < predicted_ratings.shape[1]:
                # Get the predicted rating for the corresponding movie
                pred_rating = predicted_ratings[user_id, movie_id]

                # Append the true and predicted ratings
                true_ratings.append(rating)
                pred_ratings.append(pred_rating)

    # Convert lists to numpy arrays
    true_ratings = np.array(true_ratings)
    pred_ratings = np.array(pred_ratings)

    # Calculate evaluation metrics
    mae = mean_absolute_error(true_ratings, pred_ratings)
    rmse = np.sqrt(mean_squared_error(true_ratings, pred_ratings))

    return mae, rmse

# Assuming train is the numpy array or DataFrame representing the item-user matrix
# Assuming predicted_ratings_train is the DataFrame or numpy array representing the predicted ratings for the training data

# Evaluate the model
mae, rmse = evaluate_model(train, predicted_ratings_train)

print("Mean Absolute Error (MAE) on Training Data:", mae)
print("Root Mean Squared Error (RMSE) on Training Data:", rmse)


Mean Absolute Error (MAE) on Training Data: 0.8400665344382344
Root Mean Squared Error (RMSE) on Training Data: 1.1193579613185807


### Step 6: Parameter Tuning
- Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of the ItemKNN algorithm.
- Use cross-validation or other techniques to tune these parameters and avoid overfitting.

1. **Import Necessary Libraries:**
   - We import the required libraries for performing grid search cross-validation (`GridSearchCV`), creating custom scorers (`make_scorer`), and utilizing the `NearestNeighbors` algorithm.

2. **Define Cosine Similarity Function:**
   - We define a custom function `cosine_similarity` to compute the cosine similarity between two vectors. This function calculates the dot product of the vectors and divides it by the product of their norms.

3. **Define Custom Scorer:**
   - We create a custom scorer `cosine_similarity_scorer` using `make_scorer`, which enables us to use cosine similarity as the scoring metric during grid search cross-validation.

4. **Define Parameter Grid:**
   - We specify a parameter grid `param_grid` containing the hyperparameters to be tuned. In this case, we're tuning the number of neighbors (`n_neighbors`) and the distance metric (`metric`) for the `NearestNeighbors` algorithm.

5. **Initialize NearestNeighbors Model:**
   - We initialize the `NearestNeighbors` model without specifying any hyperparameters.

6. **Create GridSearchCV Object:**
   - We create a `GridSearchCV` object named `grid_search` with the specified parameter grid, cross-validation strategy (5-fold cross-validation), and custom scoring metric (`cosine_similarity_scorer`).

7. **Fit the Data:**
   - We fit the `item_user_matrix` data to the `grid_search` object to perform hyperparameter tuning. `item_user_matrix` typically contains the item-item similarity matrix computed using collaborative filtering techniques.

8. **Get Best Hyperparameters:**
   - After fitting the data, we retrieve the best hyperparameters selected by the grid search using the `best_params_` attribute of the `grid_search` object.

9. **Print Best Parameters:**
   - Finally, we print the best hyperparameters obtained from the grid search.



`scikit-learn (sklearn)`:

- Scikit-learn is a popular machine learning library in Python that provides simple and efficient tools for data analysis and modeling.
- It includes various modules for tasks such as classification, regression, clustering, dimensionality reduction, and model selection.
- The GridSearchCV class from scikit-learn is used for hyperparameter tuning through grid search along with cross-validation.
- The make_scorer function allows us to create a custom scoring function for use with GridSearchCV.
- The NearestNeighbors class provides functionality for unsupervised nearest neighbors learning, which can be used for tasks such as finding k-nearest neighbors for a given data point.

In [53]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import make_scorer
import numpy as np

# Define a custom scorer based on cosine similarity defined above
cosine_similarity_scorer = make_scorer(cosine_similarity)

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(train_matrix)  # train_matrix contains item-user matrix for the train set

# Get the best hyperparameters
best_params = grid_search.best_params_

print("Best parameters:", best_params)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() 

Best parameters: {'metric': 'cosine', 'n_neighbors': 5}


This code sets up a GridSearchCV object to perform hyperparameter tuning using the train set (train_matrix). It explores different combinations of hyperparameters specified in the param_grid, evaluates them using 5-fold cross-validation (cv=5), and uses the cosine_similarity scorer to optimize the model's performance based on cosine similarity. Finally, it prints the best hyperparameters found during the search.

The best parameter found is the one we trained the model with, so we don't have to train the model again.

We have to perform the calculation of the predicted ratings for the validation set to generate predicted ratings in order to evaluate the performance of our trained model on a validation dataset. 

If we computed the similarity matrix again specifically for the validation set, it would essentially mean that we are using a different set of similarity measures for predicting ratings compared to what we used during training. This approach could lead to inconsistencies and potentially degrade the performance of our model. Here's what could happen:

1. **Inconsistency**: The similarity measures computed for the train_val set might differ from those computed for the training set due to variations in the data. As a result, the predicted ratings based on these new similarity measures may not align well with the predictions made during training, leading to inconsistency in the model's behavior.

2. **Overfitting**: Computing a new similarity matrix specifically for the train_val set might lead to overfitting on the validation data. The model may capture noise or idiosyncrasies present in the train_val set, which may not generalize well to unseen data.

3. **Increased Complexity**: Computing the similarity matrix again for the train_val set adds computational complexity and redundancy, especially if the similarity computation process is resource-intensive. This can result in longer training times and increased resource utilization.

Overall, it's generally recommended to use the same similarity measures or neighborhood definitions for both training and validation sets to ensure consistency and generalizability of the model.

In [54]:
# Initialize an empty DataFrame to store predicted ratings for train_val
predicted_ratings_val = pd.DataFrame(index=val_matrix.index, columns=val_matrix.columns)

# Iterate over each user and their ratings in the train_val set
for user_id, user_ratings in val_matrix.iterrows():
    for movie_id, rating in user_ratings.items():
        
        # Check if the movie has a neighborhood defined
        if movie_id in item_neighborhoods_train:
            neighborhood = item_neighborhoods_train[movie_id]
            
            # Filter out movies from the neighborhood that the user has rated in the train set
            filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0]
            
            # Check if there are valid indices in the filtered neighborhood
            if len(filtered_neighborhood) > 0:
                # Calculate the predicted rating for the target movie based on the neighborhood
                neighbor_ratings = [user_ratings.loc[neighbor_movie_id] for neighbor_movie_id in filtered_neighborhood]
                
                # Calculate the mean rating using only the movies in the neighborhood that have been rated by the user
                predicted_rating = np.mean(neighbor_ratings)
            else:
                # If the filtered neighborhood is empty, assign the mean rating of all movies rated by the user
                predicted_rating = user_ratings[user_ratings != 0].mean()
            
            # Assign the predicted rating to the corresponding cell in the DataFrame
            predicted_ratings_val.at[user_id, movie_id] = predicted_rating

# Fill NaN values with mean ratings across all users
predicted_ratings_val.fillna(predicted_ratings_val.mean().mean(), inplace=True)


In [55]:
predicted_ratings_val.head()

predicted_ratings_val.shape

Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
2551487,4.0,4.0,4.0,4.0,4.0,5.0,4.0,2.5,5.0,4.0,...,4.333333,4.0,4.0,4.0,5.0,4.0,5.0,4.0,4.0,4.0
2230701,4.0,3.666667,3.666667,5.0,4.0,3.0,3.666667,5.0,2.0,4.0,...,4.0,3.666667,2.0,3.666667,4.333333,3.666667,2.0,5.0,4.0,3.666667
1842201,3.5,3.5,3.5,4.0,4.0,2.0,3.5,4.25,3.0,3.5,...,4.0,3.5,3.5,3.5,3.5,3.5,2.0,4.0,4.0,3.5
300026,2.5,4.0,4.0,2.5,2.0,2.0,4.0,2.0,2.0,2.0,...,3.0,2.5,2.0,4.0,3.0,2.5,2.0,2.5,3.0,2.5
2237945,4.8,4.8,4.8,4.0,4.8,5.0,4.8,4.0,5.0,4.8,...,4.8,4.8,5.0,4.8,4.0,5.0,5.0,4.0,4.8,4.8


(4856, 80)

Now we evaluate the model with the Validation item-user matrix with the recently calculated predicted ratings for the validation set.

In [56]:
# Now, we can evaluate the model using the evaluate_model function
mae, rmse = evaluate_model(val_matrix.to_numpy(), predicted_ratings_val)

print("Mean Absolute Error (MAE) on Validation Data:", mae)
print("Root Mean Squared Error (RMSE) on Validation Data:", rmse)


Mean Absolute Error (MAE) on Validation Data: 0.83611864627124
Root Mean Squared Error (RMSE) on Validation Data: 1.1194509428839041


The results show that the model performs even better than the training set. This can be due to the fact the the overall opinion in the dataset is the same as the one in the training dataset. Meaning that the predicted ratings in the validation set are very similar to the true ones in the val set because there is like a popularity bias of the items users like the same items. 

### Retraining of the model with the Train-Val set

Once the have calculated and evaluated the model on the validation data, we have retrain the similarity matrix for the train-val data. This means that we have to recalculate the similarity between items now using the reviews in train and validation and also the genre similarity matrix between items which remains the same, since the items in `item_similarity_genres_matrix` don't change.

In [57]:
# After the model evaluation on the validation set, we recalculate the similarity matrix
# of the model on the train-val data
item_similarity_matrix_train_val = calculate_item_similarity_matrix(train_val)

# We combine the similarity matrices (the retrained one and the genres)
item_similarity_matrix_train_val = combine_similarity_matrices(item_similarity_matrix_train_val, item_similarity_genres_matrix)

# We calculate the item neighborhoods for the train_val data 
item_neighborhoods_train_val = neighborhood_selection(item_user_array, index_to_movie_id, item_similarity_matrix_train_val)

### Step 7: Deployment

Once we have come up with the best parameters possible and trained the model with the whole train_validation set, we will test it on the test set.

In [58]:
test_matrix

Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
1415733,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
75976,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0
2102053,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2418087,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
61472,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1153098,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0
141745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0
279485,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0
1448220,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
# Check if there are NaN values in the test matrix
test_matrix.isnull().sum().sum()

0

In [60]:
# Now, we can evaluate the model's predicted ratings on the test set

# Initialize an empty DataFrame to store predicted ratings
predicted_ratings_test = pd.DataFrame(index=test_matrix.index, columns=test_matrix.columns)

# Iterate over each user-item pair in the training set
for user_id, user_ratings in test_matrix.iterrows():
    for movie_id, rating in user_ratings.items():
    
        
        # Check if the movie has a neighborhood defined
        if movie_id in item_neighborhoods_train_val:
            neighborhood = item_neighborhoods_train_val[movie_id]
            
            # Filter out movies from the neighborhood that the user has rated in the training set
            filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0]
            
            # Check if there are valid indices in the filtered neighborhood
            if len(filtered_neighborhood) > 0:
                # Calculate the predicted rating for the target movie based on the neighborhood
                neighbor_ratings = [user_ratings.loc[neighbor_movie_id] for neighbor_movie_id in filtered_neighborhood]
                
                # Calculate the mean rating using only the movies in the neighborhood that have been rated by the user
                predicted_rating = np.mean(neighbor_ratings)
            else:
                # If the filtered neighborhood is empty, assign the mean rating of all movies rated by the user
                predicted_rating = user_ratings[user_ratings != 0].mean()
            
            # Assign the predicted rating to the corresponding cell in the DataFrame
            predicted_ratings_test.at[user_id, movie_id] = predicted_rating

# Fill NaN values with mean ratings across all users
predicted_ratings_test.fillna(predicted_ratings_test.mean().mean(), inplace=True)

# Now, we can evaluate the model's predicted ratings on the test set
mae, rmse = evaluate_model(test_matrix.to_numpy(), predicted_ratings_test.to_numpy())

print("Mean Absolute Error (MAE) on Test Data using the Ratings given to the Test Set:", mae)
print("Root Mean Squared Error (RMSE) on Test Data using the Ratings given to the Test Set:", rmse)


Mean Absolute Error (MAE) on Test Data using the Ratings given to the Test Set: 0.8367954827827105
Root Mean Squared Error (RMSE) on Test Data using the Ratings given to the Test Set: 1.1063145099909513


Overall the performance of the model seems very good, as it is performing very well on predicting data for unseen users showcasing a small MAE and RMSE.

In [61]:
top_recommendations_test = []

# Iterate over each row (user) in the test_matrix
for user_id in test_matrix.index:
    # Get the actual user ID corresponding to the user index
    recommendations = get_top_recommendations(user_id, predicted_ratings_test, df)
    top_recommendations_test.append(recommendations)

# Output: List of top movie recommendations for each user in the test matrix
top_recommendations_test

# print size of top_recommendations
print("Size of top_recommendations_test:", len(top_recommendations_test))



[['Dead Birds',
  'Finest Hour: The Battle of Britain',
  'Year of the Horse: Neil Young & Crazy Horse Live',
  'Inside the Space Station',
  'The Life and Times of Hank Greenberg'],
 ['Fall',
  "What's the Matter with Helen? / Whoever Slew Auntie Roo? (Double Feature)",
  'Princess Caraboo',
  'Dynasty: Season 1',
  'Voltage Fighter Gowcaizer'],
 ['Year of the Horse: Neil Young & Crazy Horse Live',
  'The Life and Times of Hank Greenberg',
  'Sea of Love',
  'Hemp Revolution',
  'The Program'],
 ['Jo Jo Dancer, Your Life is Calling',
  'American Adobo',
  'Mutiny on the Bounty',
  'Monsoon Wedding',
  "Recess: School's Out"],
 ['Dragon Ball: Piccolo Jr. Saga: Part 1',
  'The Duel at Silver Creek',
  'In the Realm of the Senses',
  'Dragon Ball: Tournament Saga',
  'Monsoon Wedding'],
 ['Charisma',
  "Recess: School's Out",
  'The Greatest Lover',
  'The Last Shot',
  'Treading Water'],
 ['Empires: Martin Luther',
  'Unhook the Stars',
  'The Trip',
  'In the Realm of the Senses',
  'N

Size of top_recommendations_test: 6070


# Item-based Classification (ItemKNN)

### Step 1: Data preparation

We have created already the user-item interactions matrix (`item_user_matrix`) in the first part of the Regression model above.

Therefore, this step is already done in this model.

For this step we are going to resuse the `item_user_matrix` that we created in the first model built. 

### Step 2: Compute Item Similarity

The similarity matrices both for the genres (`item_similarity_genres_matrix`) and for the ratings (`item_similarity_matrix_train` and `item_similarity_matrix_train_val`) were already created in Step 2 of the Regression model above.


### Step 3: Define items neighborhood

- Determine the neighborhood size, i.e., the number of most similar items to consider when predicting ratings for a target item.
- Select the most similar items for each item in the dataset based on their calculated similarities. This forms the neighborhood for each item.

#### Item-Based Neighborhoods and Ratings Aggregation

This code following snippet enhances the previous item-based collaborative filtering approach by considering ratings aggregation within the item neighborhoods.

##### Steps:

1. **Defining Neighborhood Size**:
   - The variable `k` determines the number of most similar items to consider in the neighborhood.

2. **Initializing Data Structure**:
   - An empty dictionary `item_neighborhoods_classification_train` is initialized to store the neighborhoods for each item.

3. **Iterating Over Items**:
   - For each movie in the dataset:
     - All ratings for the current movie are extracted from the DataFrame (`df`).
     - Ratings aggregation is performed. In this example, the average rating for the movie is computed, but other aggregation methods can be used.
     - The similarity scores for the current movie are retrieved from the precomputed `item_similarity_matrix_train`.
     - Similarity scores are sorted in descending order, and the indices of the most similar items (excluding itself) are obtained.
     - These indices are converted back to movie IDs, forming the neighborhood for the current item.
     - The neighborhood for the current item is stored in the `item_neighborhoods` dictionary.

4. **Output**:
   - `item_neighborhoods`: A dictionary where keys are movie IDs, and values are lists of movie IDs representing the neighborhood of each item. Each movie's neighborhood includes movies with similar ratings and content.

##### Note:
- This approach considers both similarity in ratings and content (as captured by cosine similarity) when building item neighborhoods.
- Aggregating ratings within item neighborhoods helps in providing more personalized recommendations.

In [62]:
import numpy as np

def item_neighborhood_selection(similarity_matrix, k=None, threshold=None, item_ids=None):
    """
    Select a subset of similar items for each item based on similarity matrix.

    Parameters:
        similarity_matrix (numpy.ndarray): Item-item similarity matrix.
        k (int): Number of similar items to select (optional).
        threshold (float): Similarity threshold for selecting similar items (optional).
        item_ids (list): List of item IDs corresponding to rows/columns of the similarity matrix.

    Returns:
        dict: Dictionary containing similar items for each item.
    """
    num_items = similarity_matrix.shape[0]
    item_neighborhood = {}

    for i in range(num_items):
        if k is not None:
            # Select top-k similar items (excluding the item itself)
            similar_items_indices = np.argsort(similarity_matrix[i])[::-1][:k]
        elif threshold is not None:
            # Select items with similarity above threshold (excluding the item itself)
            similar_items_indices = np.where(similarity_matrix[i] > threshold)[0]

        # Remove the item itself from the neighborhood
        similar_items_indices = similar_items_indices[similar_items_indices != i]

        # Get the item ID corresponding to the current index
        current_item_id = item_ids[i] if item_ids is not None else i

        # Get the item IDs for the neighborhood
        neighborhood_item_ids = [item_ids[index] for index in similar_items_indices]

        # Store the similar items with the item's ID as the key
        item_neighborhood[current_item_id] = neighborhood_item_ids

    return item_neighborhood

# We have an item-item similarity matrix 'item_similarity_matrix_train'
# and we want to select top-5 similar items for each item
k = 5
item_neighborhoods_classification_train = item_neighborhood_selection(item_similarity_matrix_train, k=k, item_ids=index_to_movie_id)

# Check if there are any items with empty neighborhoods
empty_neighborhoods = []

for item_id, neighborhood in item_neighborhoods_classification_train.items():
    if not neighborhood:
        empty_neighborhoods.append(item_id)

if empty_neighborhoods:
    print("Items with empty neighborhoods:", empty_neighborhoods)
else:
    print("No items with empty neighborhoods found.")


No items with empty neighborhoods found.


In [63]:
len(item_neighborhoods_classification_train)

80

### Step 4: Ratings classification selection

This is the step that differs the most from the regression model. 

In the code below, repeats the process from the rating prediction model, but in this case, we select the most similar movies for each item and we try selecting the most similar rating available, this is where the classification part comes from. We will classify the items per neighborhood and select the most similar item in that neighborhood that has an available rating.

In [64]:
import numpy as np
import pandas as pd

def recommend_movies(train_matrix, item_neighborhoods, movie_id_to_index, num_movies=5):
    """
    Recommend top movies for each user based on item neighborhood.

    Parameters:
        train_matrix (np.ndarray or pd.DataFrame): Matrix containing user-item ratings in the train set.
        item_neighborhoods (dict): Dictionary containing similar items for each item.
        movie_id_to_index (dict): Dictionary mapping movie IDs to their corresponding indices in the train_matrix.
        num_movies (int): Number of movies to recommend for each user.

    Returns:
        pd.DataFrame: DataFrame containing predicted ratings for recommended movies for each user.
    """
    # Convert train_matrix to DataFrame if it's a numpy array
    if isinstance(train_matrix, np.ndarray):
        train_matrix = pd.DataFrame(train_matrix, index=np.arange(train_matrix.shape[0]), columns=np.arange(train_matrix.shape[1]))

    # Initialize an empty DataFrame to store predicted ratings
    predicted_ratings = pd.DataFrame(index=train_matrix.index, columns=train_matrix.columns)

    # Iterate over each user-item pair in the training set
    for user_id, user_ratings in train_matrix.iterrows():
        for movie_id, rating in user_ratings.items():
            
            # Skip if the rating is non-zero (indicating a rating given by the user)
            if rating != 0:
                continue
            
            # Check if the movie has a neighborhood defined
            if movie_id in item_neighborhoods:
                neighborhood = item_neighborhoods[movie_id]
                
                # Initialize a flag to track if the rating has been predicted
                rating_predicted = False
                
                # Iterate over the items in the neighborhood
                for neighbor_movie_id in neighborhood:
                    # Check if the neighbor movie has been rated by the user
                    if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0:
                        # Use the rating of the closest item in the neighborhood
                        predicted_rating = user_ratings.loc[neighbor_movie_id]
                        rating_predicted = True
                        break
                
                # If none of the items in the neighborhood have been rated, use the user's average rating
                if not rating_predicted:
                    if user_ratings[user_ratings != 0].empty:
                        # If the user hasn't rated any movies, use the global mean rating
                        predicted_rating = train_matrix[train_matrix != 0].mean().mean()
                    else:
                        # Use the average of the ratings given by the user
                        predicted_rating = user_ratings[user_ratings != 0].mean()
                
                # Assign the predicted rating to the corresponding cell in the DataFrame
                predicted_ratings.at[user_id, movie_id] = predicted_rating

    # Fill NaN values with mean ratings across all users
    predicted_ratings.fillna(predicted_ratings.mean().mean(), inplace=True)

    # Set the index of the DataFrame as the user ID of the train set
    predicted_ratings.index = train_matrix.index

    # Display the head of the predicted_ratings DataFrame
    print(predicted_ratings.head())

    return predicted_ratings


# Recommend movies for each user in the train set
predicted_ratings_classification_train = recommend_movies(train_matrix, item_neighborhoods_classification_train, movie_id_to_index)

# Print the shape of the predicted ratings matrix
print("Shape of predicted ratings matrix:", predicted_ratings_classification_train.shape)


             39    96    119       143       160       181   188   208   227   \
843201   3.857143   4.0   4.0  5.000000  3.857143  4.000000   4.0   3.0   4.0   
531747   4.000000   4.0   4.0  5.000000  4.000000  2.000000   4.0   4.0   2.0   
2143797  4.000000   4.0   4.0  3.000000  3.777778  3.777778   4.0   3.0   3.0   
2153721  2.692308   4.0   4.0  3.490495  4.000000  4.000000   4.0   2.0   1.0   
834933   3.000000   5.0   5.0  1.000000  3.000000  3.000000   5.0   3.0   2.0   

             235   ...  1589      1686      1730  1764  1819      1820  \
843201   3.857143  ...   5.0  3.857143  4.000000   4.0   5.0  3.857143   
531747   4.000000  ...   5.0  4.000000  2.000000   4.0   4.0  4.000000   
2143797  3.777778  ...   3.0  3.777778  3.777778   4.0   4.0  3.777778   
2153721  4.000000  ...   3.0  2.692308  2.692308   4.0   2.0  2.692308   
834933   3.000000  ...   4.0  3.000000  3.000000   5.0   3.0  3.000000   

             1852  1857      1902      1949  
843201   4.000000   5.

  predicted_ratings.fillna(predicted_ratings.mean().mean(), inplace=True)


In [65]:
type(train_matrix)

pandas.core.frame.DataFrame

In [66]:
predicted_ratings_classification_train

Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
843201,3.857143,4.000000,4.000000,5.000000,3.857143,4.000000,4.000000,3.000000,4.00,3.857143,...,5.0,3.857143,4.000000,4.000000,5.000000,3.857143,4.000000,5.00,5.000000,3.857143
531747,4.000000,4.000000,4.000000,5.000000,4.000000,2.000000,4.000000,4.000000,2.00,4.000000,...,5.0,4.000000,2.000000,4.000000,4.000000,4.000000,2.000000,4.00,5.000000,4.000000
2143797,4.000000,4.000000,4.000000,3.000000,3.777778,3.777778,4.000000,3.000000,3.00,3.777778,...,3.0,3.777778,3.777778,4.000000,4.000000,3.777778,3.777778,4.00,3.000000,3.777778
2153721,2.692308,4.000000,4.000000,3.490495,4.000000,4.000000,4.000000,2.000000,1.00,4.000000,...,3.0,2.692308,2.692308,4.000000,2.000000,2.692308,2.692308,2.00,3.490495,3.000000
834933,3.000000,5.000000,5.000000,1.000000,3.000000,3.000000,5.000000,3.000000,2.00,3.000000,...,4.0,3.000000,3.000000,5.000000,3.000000,3.000000,3.000000,3.00,3.490495,3.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1547655,3.500000,4.000000,4.000000,5.000000,3.500000,3.500000,4.000000,3.500000,3.50,3.000000,...,3.5,3.500000,3.500000,4.000000,3.500000,3.500000,3.500000,3.50,3.500000,3.500000
1408468,3.166667,3.166667,3.166667,3.000000,3.166667,4.000000,3.166667,3.000000,4.00,3.166667,...,3.0,3.166667,4.000000,3.166667,3.000000,3.166667,4.000000,3.00,3.000000,3.166667
2355940,4.333333,4.333333,4.333333,3.490495,4.000000,5.000000,4.333333,3.490495,5.00,4.333333,...,4.0,4.333333,3.490495,4.333333,3.490495,4.333333,5.000000,5.00,3.490495,4.333333
428168,3.750000,5.000000,5.000000,3.490495,4.000000,4.000000,5.000000,3.750000,3.75,4.000000,...,3.0,3.750000,3.750000,5.000000,3.490495,3.750000,3.750000,3.75,3.000000,3.750000


### Recommendations Generation

In [67]:
import pandas as pd

# Convert predicted_ratings_classification_train to a pandas DataFrame if it's a numpy array
if isinstance(predicted_ratings_classification_train, np.ndarray):
    # Use the length of train_matrix as the number of rows and items as columns
    num_users, num_items = train_matrix.shape
    predicted_ratings_classification_train = pd.DataFrame(predicted_ratings_classification_train, index=range(num_users), columns=range(num_items))

# Initialize a dictionary to store top recommendations for each user
top_recommendations_per_user = {}

# Iterate over each user in the train matrix
for user_id in train_matrix.index:
    # Get the top recommendations for the current user using predicted_ratings_classification_train
    recommendations = get_top_recommendations(user_id, predicted_ratings_classification_train, df)
    
    # Store the recommendations in the dictionary
    top_recommendations_per_user[user_id] = recommendations

# Print the top 5 recommendations for each user
for user_id, recommendations in top_recommendations_per_user.items():
    print(f"User {user_id} recommendations:", recommendations)


User 843201 recommendations: ['The Faculty', 'Species II', 'Hellraiser V: Inferno', 'The Bat / House on Haunted Hill', 'All in the Family: Season 4']
User 531747 recommendations: ['The Land Before Time VII: Stone of Cold Fire', 'The Duel at Silver Creek', 'Star Trek: Enterprise: Season 3', 'Mutiny on the Bounty', 'Yojimbo']
User 2143797 recommendations: ['Love Reinvented', 'Year of the Horse: Neil Young & Crazy Horse Live', 'Inside the Space Station', 'Hemp Revolution', 'The Life and Times of Hank Greenberg']
User 2153721 recommendations: ['Year of the Horse: Neil Young & Crazy Horse Live', 'The Trip', 'Inside the Space Station', 'Hemp Revolution', 'No Ordinary Love']
User 834933 recommendations: ['Travel the World by Train: Africa', 'Year of the Horse: Neil Young & Crazy Horse Live', 'Uncovered: The Whole Truth About the Iraq War', 'Inside the Space Station', 'Hemp Revolution']
User 1871358 recommendations: ['Rabbit-Proof Fence', 'Jo Jo Dancer, Your Life is Calling', 'The Duel at Silv

### Step 5: Model Evaluation (Item KNN Classification)

In [68]:
import numpy as np
import pandas as pd

# Ensure predicted_ratings_classification_train is a numpy array
if isinstance(predicted_ratings_classification_train, pd.DataFrame):
    predicted_ratings_classification_train = predicted_ratings_classification_train.to_numpy()

# Assuming train_matrix is the numpy array representing the item-user matrix

# Ensure train_matrix is a 2D array
if isinstance(train_matrix, pd.DataFrame):
    train_matrix = train_matrix.to_numpy()

if train_matrix.ndim == 1:
    train_matrix = np.expand_dims(train_matrix, axis=0)

# Evaluate the model
mae, rmse = evaluate_model(train, predicted_ratings_classification_train)

print("Mean Absolute Error (MAE) on Training Data:", mae)
print("Root Mean Squared Error (RMSE) on Training Data:", rmse)


Mean Absolute Error (MAE) on Training Data: 0.884425308764031
Root Mean Squared Error (RMSE) on Training Data: 1.057199366144889


### Step 6: Parameter Tuning

1. **Import Necessary Libraries:**
   - We import the required libraries for performing grid search cross-validation (`GridSearchCV`), creating custom scorers (`make_scorer`), and utilizing the `NearestNeighbors` algorithm.

2. **Define Cosine Similarity Function:**
   - We define a custom function `cosine_similarity` to compute the cosine similarity between two vectors. This function calculates the dot product of the vectors and divides it by the product of their norms.

3. **Define Custom Scorer:**
   - We create a custom scorer `cosine_similarity_scorer` using `make_scorer`, which enables us to use cosine similarity as the scoring metric during grid search cross-validation.

4. **Define Parameter Grid:**
   - We specify a parameter grid `param_grid` containing the hyperparameters to be tuned. In this case, we're tuning the number of neighbors (`n_neighbors`) and the distance metric (`metric`) for the `NearestNeighbors` algorithm.

5. **Initialize NearestNeighbors Model:**
   - We initialize the `NearestNeighbors` model without specifying any hyperparameters.

6. **Create GridSearchCV Object:**
   - We create a `GridSearchCV` object named `grid_search` with the specified parameter grid, cross-validation strategy (5-fold cross-validation), and custom scoring metric (`cosine_similarity_scorer`).

7. **Fit the Data:**
   - We fit the `item_user_matrix` data to the `grid_search` object to perform hyperparameter tuning. `item_user_matrix` typically contains the item-item similarity matrix computed using collaborative filtering techniques.

8. **Get Best Hyperparameters:**
   - After fitting the data, we retrieve the best hyperparameters selected by the grid search using the `best_params_` attribute of the `grid_search` object.

9. **Print Best Parameters:**
   - Finally, we print the best hyperparameters obtained from the grid search.



In [69]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import make_scorer
import numpy as np

# Define a custom scorer based on cosine similarity defined above
cosine_similarity_scorer = make_scorer(cosine_similarity)

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(train_matrix)  # val_matrix contains item-user matrix

# Get the best hyperparameters
best_params = grid_search.best_params_

print("Best parameters:", best_params)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() 

Best parameters: {'metric': 'cosine', 'n_neighbors': 5}


The hyperparameter tuning gives the same results as for the Ratings prediction model as expected. We are going to now evaluate the model with the best found parameters (which we have already trained to model on) on the validation set.

We are now going to obtain again the ratings for the full val set to evaluate the performance. 

In [70]:
# Function to recalculate predictions for a given matrix using the trained model
def recalculate_predictions(matrix, item_neighborhoods):
    """
    Recalculate predictions for a given matrix using the trained model.

    Parameters:
        matrix (pd.DataFrame): Matrix containing user-item ratings.

    Returns:
        pd.DataFrame: DataFrame containing recalculated predicted ratings for each user.
    """
    # Initialize an empty DataFrame to store recalculated predicted ratings
    recalculated_predictions = pd.DataFrame(index=matrix.index, columns=matrix.columns)

    # Iterate over each user-item pair in the matrix
    for user_id, user_ratings in matrix.iterrows():
        for movie_id, rating in user_ratings.items():
            # Skip if the rating is non-zero (indicating a rating given by the user)
            if rating != 0:
                continue

            # Check if the movie has a neighborhood defined
            if movie_id in item_neighborhoods:
                neighborhood = item_neighborhoods[movie_id]

                # Initialize a flag to track if the rating has been predicted
                rating_predicted = False

                # Iterate over the items in the neighborhood
                for neighbor_movie_id in neighborhood:
                    # Check if the neighbor movie has been rated by the user
                    if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0:
                        # Use the rating of the closest item in the neighborhood
                        predicted_rating = user_ratings.loc[neighbor_movie_id]
                        rating_predicted = True
                        break

                # If none of the items in the neighborhood have been rated, use the user's average rating
                if not rating_predicted:
                    if user_ratings[user_ratings != 0].empty:
                        # If the user hasn't rated any movies, use the global mean rating
                        predicted_rating = matrix[matrix != 0].mean().mean()
                    else:
                        # Use the average of the ratings given by the user
                        predicted_rating = user_ratings[user_ratings != 0].mean()

                # Assign the predicted rating to the corresponding cell in the DataFrame
                recalculated_predictions.at[user_id, movie_id] = predicted_rating

    # Fill NaN values with mean ratings across all users
    recalculated_predictions.fillna(recalculated_predictions.mean().mean(), inplace=True)

    return recalculated_predictions


# Recalculate predictions for the validation set (train_val_matrix)
recalculated_val_predictions = recalculate_predictions(val_matrix, item_neighborhoods_classification_train)


  recalculated_predictions.fillna(recalculated_predictions.mean().mean(), inplace=True)


In [71]:
recalculated_val_predictions

Unnamed: 0,39,96,119,143,160,181,188,208,227,235,...,1589,1686,1730,1764,1819,1820,1852,1857,1902,1949
2551487,4.000000,4.000000,4.000000,3.488925,4.000000,5.000,4.000000,2.000000,5.000000,4.000000,...,5.000000,4.000000,5.000000,4.000000,5.000000,4.000000,5.000000,4.000000,3.000000,4.000000
2230701,4.000000,3.666667,3.666667,3.488925,3.666667,4.000,3.666667,5.000000,2.000000,3.666667,...,4.000000,3.666667,2.000000,3.666667,3.488925,3.666667,2.000000,5.000000,4.000000,3.666667
1842201,3.500000,3.500000,3.500000,3.000000,4.000000,2.000,3.500000,3.488925,2.000000,3.500000,...,5.000000,3.500000,2.000000,3.500000,3.500000,3.500000,2.000000,4.000000,3.488925,3.500000
300026,2.500000,4.000000,4.000000,3.488925,2.000000,2.000,4.000000,2.000000,2.000000,2.000000,...,3.000000,2.500000,2.000000,4.000000,3.000000,2.500000,2.000000,2.500000,3.000000,2.500000
2237945,4.800000,4.800000,4.800000,4.800000,4.800000,5.000,4.800000,4.000000,5.000000,4.800000,...,4.800000,4.800000,5.000000,4.800000,4.000000,5.000000,5.000000,4.000000,4.800000,4.800000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2006011,4.200000,4.200000,4.200000,3.488925,4.200000,4.200,4.200000,5.000000,4.200000,4.200000,...,4.000000,4.200000,4.200000,4.200000,4.000000,4.200000,4.200000,5.000000,5.000000,4.200000
2603700,3.642857,4.000000,4.000000,3.000000,3.000000,3.000,4.000000,3.000000,4.000000,3.000000,...,3.488925,3.642857,4.000000,4.000000,4.000000,3.642857,4.000000,4.000000,3.488925,3.642857
822391,2.000000,1.833333,1.833333,1.833333,1.000000,2.000,1.833333,3.488925,1.833333,1.000000,...,1.833333,1.833333,1.833333,1.833333,1.833333,1.833333,1.833333,1.833333,1.833333,1.833333
1988919,3.125000,3.125000,3.125000,3.488925,3.000000,3.125,3.125000,3.000000,4.000000,3.125000,...,4.000000,3.125000,3.125000,3.125000,3.000000,3.125000,3.125000,3.000000,3.488925,3.125000


In [72]:
# Evaluate the model
mae, rmse = evaluate_model(val, recalculated_val_predictions)

print("Mean Absolute Error (MAE) on Validation Data:", mae)
print("Root Mean Squared Error (RMSE) on Validation Data:", rmse)

Mean Absolute Error (MAE) on Validation Data: 0.8852248027335127
Root Mean Squared Error (RMSE) on Validation Data: 1.0611505968899115


### Step 7: Deployment

Once we have all the metrics, we have to recalculate the ratings for the test set using the calculated `item_similarity_matrix_train_val`, which we created in the Ratings prediction model. After getting the latest item neighborhoods we are going to repeat the same process (get the closest rating available) for the test set and evaluate the model.

In [73]:
# We have to reobtain the neighborhood for the train_val data
# We will use the already calculated item-item similarity matrix for train_val data
item_neighborhoods_classification_train_val = item_neighborhood_selection(item_similarity_matrix_train_val, k=5, item_ids=index_to_movie_id)

# Recalculate predictions for the validation set (test matrix)
recalculated_test_predictions = recalculate_predictions(test_matrix, item_neighborhoods_classification_train_val)

  recalculated_predictions.fillna(recalculated_predictions.mean().mean(), inplace=True)


We also have to evaluate the metrics for the test set.

In [74]:
# Evaluate the model
mae, rmse = evaluate_model(test, recalculated_test_predictions)

print("Mean Absolute Error (MAE) on Test Data:", mae)
print("Root Mean Squared Error (RMSE) on Test Data:", rmse)

Mean Absolute Error (MAE) on Test Data: 0.8897540405278962
Root Mean Squared Error (RMSE) on Test Data: 1.0637357424496807


## Conclusions

After thoroughly conducting the process of training, validating, and testing both models, it is evident that the regression model outperforms the alternative. This conclusion is drawn based on its ability to utilize the averages of available ratings effectively, enabling it to generate accurate predictions for movie recommendations. 

Through comprehensive evaluation, the regression model consistently demonstrated superior performance metrics compared to the Classification model considered in this study. Its prediction works better than the selection of the clossest neighbor in the classification model and it offers a robust mechanism for predicting user preferences and recommending relevant movies. Therefore, in the context of this study, the regression model emerges as the preferred choice for movie recommendation systems.

### Summary Results:

| Netflix | MAE Validation | RMSE Validation | MAE Testing | RMSE Testing |
|-----------------|-----------------|-----------------|-----------------|-----------------|
| Model with Genres - Rating Prediction | 0.83611864627124 | 1.1194509428839041 | 0.8367954827827105 | 1.1063145099909513 |
| Model with Genres - Classification | 0.8852248027335127 | 1.0611505968899115 | 0.8897540405278962 | 1.0637357424496807 |
| Model without Genres - Rating Prediction | 0.8629111739590473 | 1.0295637022864497 | 0.8606621073281191 | 1.0307602501482604 |
| Model without Genres - Classification| 0.8628879023045837 | 1.0295661879159739 | 0.8606565845456754 | 1.0307653650000774 |

### Interpreting the summary results:

The results present performance metrics for different models trained and tested on the Netflix dataset.

Model with Genres - Rating Prediction: This model achieves the lowest MAE but slightly higher RMSE compared to the Classification Model with Genres in both validation and testing phases. While it performs well in terms of MAE, indicating lower average prediction errors, the higher RMSE suggests that the predictions may be more variable or less accurate in capturing the spread of the actual ratings. Further analysis is needed to understand the trade-offs between MAE and RMSE.

Model with Genres - Classification: While this model performs reasonably well, it has slightly higher MAE values compared to the rating prediction model with genres and than both of the ones without genres. Overall, it isn't the best performing one in any of the categories.

Model without Genres - Rating Prediction: Despite lacking genre information, this model still performs competitively, with MAE and RMSE values close to those of the model with genres in both validation and testing phases.

Model without Genres - Classification: Despite lacking genre information, this model still performs competitively, with MAE and RMSE values close to those of the model with genres in both validation and testing phases.

Based on the summary results, the model with genre information for rating prediction appears to be the most effective in predicting ratings on the Netflix dataset. However, the model without genre information also shows competitive performance, indicating that genre information may not be essential for accurate rating prediction. Further analysis and experimentation may be required to determine the most suitable model for specific use cases or datasets, but overall, it seems that the genres aren't providing as much improvement as they have for the Movielens dataset.