# Item-based Models

In this file we are going to proceed with the building in the models. Machine learning models to build recommender systems. The models that are going to be built are Collaborative filtering using Item-based rating prediction (ItemKNN) and Item-based classification (ItemKNN).

Collaborative filtering is a technique used in recommendation systems to predict or classify items based on the preferences or behavior of similar users or items. Item-based Collaborative Filtering (CF) focuses on the similarity between items rather than users. There are two main approaches within item-based CF: rating prediction and classification.

In [184]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.core.display import HTML # to pretty print pandas df and be able to copy them over (e.g. to ppt slides)

In [185]:
# We import the necessary libraries
import pandas as pd
import numpy as np
import os

In [186]:
# We print the directory where the file is located
print(os.getcwd())

c:\Users\Jaume\Documents\MDDB\SDM\SDfM---Jaume-and-Stijn 2\SDfM---Jaume-and-Stijn


In [187]:
# We set the directory to the cleaned folder
os.listdir(os.path.join('.', 'cleaned'))

['final_sample5_parquet',
 'movielens_parquet',
 'netflix_parquet',
 'unpacked_reviews_df_100k.parquet']

In [188]:
# We read the final_sample file and store it in a dataframe
df = pd.read_parquet('cleaned/final_sample5_parquet')

In [189]:
# We print shape of the dataframe
df.shape

(4665, 26)

In [190]:
# We print the first 5 rows of the dataframe
df.head()

Unnamed: 0,movieId,title,year,userId,rating,date,Drama,Action,Sci-Fi,Comedy,...,Adventure,Fantasy,IMAX,Animation,Musical,Horror,Film-Noir,Western,Mystery,(no genres listed)
0,45635,"Notorious Bettie Page, The",2005,414,3.0,2008-07-15,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,45635,"Notorious Bettie Page, The",2005,474,3.0,2006-12-08,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1373,Star Trek V: The Final Frontier,1989,19,1.0,2000-08-08,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1373,Star Trek V: The Final Frontier,1989,42,4.0,2001-07-27,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1373,Star Trek V: The Final Frontier,1989,51,5.0,2009-01-02,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [191]:
# We print the columns of the dataframe
df.columns

Index(['movieId', 'title', 'year', 'userId', 'rating', 'date', 'Drama',
       'Action', 'Sci-Fi', 'Comedy', 'Crime', 'Thriller', 'Romance', 'War',
       'Documentary', 'Children', 'Adventure', 'Fantasy', 'IMAX', 'Animation',
       'Musical', 'Horror', 'Film-Noir', 'Western', 'Mystery',
       '(no genres listed)'],
      dtype='object')

In [192]:
# We print the number of unique users and movies
print(df['userId'].nunique())
print(df['movieId'].nunique())


532
487


# Item-based Rating Prediction (ItemKNN)

### Step 1: User-Item matrix construction
The first thing we have to do is build the user-item matrix:

- Choose a similarity metric to calculate the similarity between items. Common metrics include cosine similarity, Pearson correlation coefficient, and Jaccard similarity.
- Calculate the similarity between each pair of items based on the ratings provided by users. This will result in an item-item similarity matrix.

#### Collaborative Filtering with Cosine Similarity

This code snippet demonstrates how to perform item-based collaborative filtering using cosine similarity. Collaborative filtering is a technique commonly used in recommendation systems to predict a user's preferences for items based on the preferences of similar users/items.

##### Libraries Used
- **pandas**: A powerful data manipulation library in Python.
- **numpy**: A library for numerical computing in Python.
- **sklearn.metrics.pairwise.cosine_similarity**: A function from scikit-learn used to compute the cosine similarity between vectors.

In [193]:
import pandas as pd
import numpy as np

# Assuming df is your DataFrame with the specified format
# Convert userId column to integer type if needed
if df['userId'].dtype != int:
    df['userId'] = df['userId'].astype(int)

# Get all unique user IDs and movie IDs
all_user_ids = np.unique(df['userId'])
all_movie_ids = np.unique(df['movieId'])

# Create a DataFrame with all combinations of user IDs and movie IDs
all_user_movie_pairs = np.array(np.meshgrid(all_user_ids, all_movie_ids)).T.reshape(-1, 2)
df_pairs = pd.DataFrame(all_user_movie_pairs, columns=['userId', 'movieId'])

# Merge the original DataFrame with all_user_movie_pairs to fill missing ratings with 0
df_filled = pd.merge(df_pairs, df, on=['userId', 'movieId'], how='left').fillna(0)

# Create user-item matrix using pivot_table
item_user_matrix = pd.pivot_table(df_filled, values='rating', index='userId', columns='movieId', fill_value=0)

# Convert user-item matrix to NumPy array for faster computation
item_user_array = item_user_matrix.to_numpy()


In [194]:
item_user_matrix.head()

movieId,1,4,15,30,43,89,104,108,122,146,...,174479,174551,175475,176371,176389,177593,179813,181413,185029,186587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [195]:
item_user_array

array([[4. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [2.5, 0. , 0. , ..., 0. , 0. , 0. ],
       [3. , 0. , 0. , ..., 0. , 0. , 0. ],
       [5. , 0. , 0. , ..., 0. , 0. , 0. ]])

### Train-test split

The train-val-test split is a technique used in machine learning to evaluate the performance of a model. It involves dividing the dataset into three subsets: the training set, the validation set, and the test set.

The training set is used to train the model and optimize its parameters.
The validation set is used to fine-tune the model and select the best hyperparameters.
The test set is used to evaluate the final performance of the model on unseen data.
By using a train-val-test split, we can assess the model's performance on unseen data and ensure that it generalizes well to new examples. It helps prevent overfitting and provides a more reliable estimate of the model's performance in real-world scenarios.

In [196]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
train_val, test = train_test_split(item_user_array, test_size=0.2, random_state=42)

# Split the training set into training and validation sets
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

# Print the shapes of the datasets
print("Training set shape:", train.shape)
print("Validation set shape:", val.shape)
print("Test set shape:", test.shape)

# We are also going to do the split for the matrix df
# Split the user-item matrix into training and test sets
train_val_matrix, test_matrix = train_test_split(item_user_matrix, test_size=0.2, random_state=42)

# Split the training set matrix into training and validation sets
train_matrix, val_matrix = train_test_split(train_val_matrix, test_size=0.2, random_state=42)

# We print a ' ' to give some space inbetween lines
print(' ')

# Print the shapes of the matrix datasets
print("Training set matrix shape:", train_matrix.shape)
print("Validation set matrix shape:", val_matrix.shape)
print("Test set matrix shape:", test_matrix.shape)


Training set shape: (340, 487)
Validation set shape: (85, 487)
Test set shape: (107, 487)
 
Training set matrix shape: (340, 487)
Validation set matrix shape: (85, 487)
Test set matrix shape: (107, 487)


In [197]:
train_matrix

movieId,1,4,15,30,43,89,104,108,122,146,...,174479,174551,175475,176371,176389,177593,179813,181413,185029,186587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
428,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
517,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
160,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
326,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
416,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [198]:
train

array([[0., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [199]:
# Calculate cosine similarity between items using NumPy functions
def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity

# Calculate item-item similarity matrix for train data
item_similarity_matrix_train = np.zeros((train.shape[1], train.shape[1]))

for i in range(train.shape[1]):
    for j in range(i, train.shape[1]):
        item_similarity_matrix_train[i, j] = cosine_similarity(train[:, i], train[:, j])
        item_similarity_matrix_train[j, i] = item_similarity_matrix_train[i, j]

# Create a mapping from movie IDs to indices
movie_id_to_index = {movie_id: i for i, movie_id in enumerate(item_user_matrix.index)}
index_to_movie_id = {i: movie_id for movie_id, i in movie_id_to_index.items()}

# Create a mapping from user IDs to indices
user_id_to_index = {user_id: i for i, user_id in enumerate(item_user_matrix.columns)}
index_to_user_id = {i: user_id for user_id, i in user_id_to_index.items()}


  similarity = dot_product / (norm_a * norm_b)


In [200]:
movie_id_to_index

{1: 0,
 2: 1,
 3: 2,
 4: 3,
 5: 4,
 6: 5,
 7: 6,
 9: 7,
 10: 8,
 11: 9,
 12: 10,
 13: 11,
 14: 12,
 15: 13,
 16: 14,
 17: 15,
 18: 16,
 19: 17,
 20: 18,
 21: 19,
 22: 20,
 23: 21,
 24: 22,
 25: 23,
 27: 24,
 28: 25,
 29: 26,
 30: 27,
 31: 28,
 32: 29,
 33: 30,
 34: 31,
 36: 32,
 38: 33,
 39: 34,
 40: 35,
 41: 36,
 42: 37,
 43: 38,
 44: 39,
 45: 40,
 46: 41,
 47: 42,
 48: 43,
 50: 44,
 51: 45,
 52: 46,
 54: 47,
 57: 48,
 58: 49,
 59: 50,
 62: 51,
 63: 52,
 64: 53,
 65: 54,
 66: 55,
 67: 56,
 68: 57,
 69: 58,
 70: 59,
 71: 60,
 73: 61,
 74: 62,
 75: 63,
 76: 64,
 78: 65,
 79: 66,
 80: 67,
 82: 68,
 83: 69,
 84: 70,
 85: 71,
 86: 72,
 87: 73,
 88: 74,
 89: 75,
 90: 76,
 91: 77,
 92: 78,
 93: 79,
 95: 80,
 96: 81,
 97: 82,
 98: 83,
 99: 84,
 100: 85,
 101: 86,
 103: 87,
 104: 88,
 105: 89,
 106: 90,
 107: 91,
 109: 92,
 110: 93,
 111: 94,
 112: 95,
 113: 96,
 114: 97,
 115: 98,
 116: 99,
 117: 100,
 118: 101,
 119: 102,
 121: 103,
 122: 104,
 123: 105,
 124: 106,
 125: 107,
 129: 108,
 130

In [201]:
print(index_to_user_id)

{0: 1, 1: 4, 2: 15, 3: 30, 4: 43, 5: 89, 6: 104, 7: 108, 8: 122, 9: 146, 10: 290, 11: 303, 12: 305, 13: 346, 14: 353, 15: 358, 16: 363, 17: 389, 18: 393, 19: 416, 20: 423, 21: 478, 22: 504, 23: 522, 24: 551, 25: 556, 26: 605, 27: 627, 28: 640, 29: 694, 30: 708, 31: 728, 32: 790, 33: 810, 34: 835, 35: 901, 36: 953, 37: 1003, 38: 1006, 39: 1037, 40: 1040, 41: 1046, 42: 1207, 43: 1261, 44: 1266, 45: 1271, 46: 1272, 47: 1358, 48: 1373, 49: 1388, 50: 1391, 51: 1398, 52: 1415, 53: 1432, 54: 1446, 55: 1447, 56: 1507, 57: 1518, 58: 1564, 59: 1572, 60: 1580, 61: 1597, 62: 1625, 63: 1665, 64: 1757, 65: 1769, 66: 1816, 67: 1882, 68: 1940, 69: 1955, 70: 1964, 71: 1969, 72: 1995, 73: 2004, 74: 2013, 75: 2037, 76: 2046, 77: 2071, 78: 2107, 79: 2114, 80: 2148, 81: 2160, 82: 2171, 83: 2177, 84: 2273, 85: 2302, 86: 2310, 87: 2320, 88: 2324, 89: 2325, 90: 2336, 91: 2344, 92: 2356, 93: 2358, 94: 2366, 95: 2378, 96: 2381, 97: 2521, 98: 2540, 99: 2590, 100: 2606, 101: 2672, 102: 2746, 103: 2757, 104: 2803,

In [202]:
print(user_id_to_index)

{1: 0, 4: 1, 15: 2, 30: 3, 43: 4, 89: 5, 104: 6, 108: 7, 122: 8, 146: 9, 290: 10, 303: 11, 305: 12, 346: 13, 353: 14, 358: 15, 363: 16, 389: 17, 393: 18, 416: 19, 423: 20, 478: 21, 504: 22, 522: 23, 551: 24, 556: 25, 605: 26, 627: 27, 640: 28, 694: 29, 708: 30, 728: 31, 790: 32, 810: 33, 835: 34, 901: 35, 953: 36, 1003: 37, 1006: 38, 1037: 39, 1040: 40, 1046: 41, 1207: 42, 1261: 43, 1266: 44, 1271: 45, 1272: 46, 1358: 47, 1373: 48, 1388: 49, 1391: 50, 1398: 51, 1415: 52, 1432: 53, 1446: 54, 1447: 55, 1507: 56, 1518: 57, 1564: 58, 1572: 59, 1580: 60, 1597: 61, 1625: 62, 1665: 63, 1757: 64, 1769: 65, 1816: 66, 1882: 67, 1940: 68, 1955: 69, 1964: 70, 1969: 71, 1995: 72, 2004: 73, 2013: 74, 2037: 75, 2046: 76, 2071: 77, 2107: 78, 2114: 79, 2148: 80, 2160: 81, 2171: 82, 2177: 83, 2273: 84, 2302: 85, 2310: 86, 2320: 87, 2324: 88, 2325: 89, 2336: 90, 2344: 91, 2356: 92, 2358: 93, 2366: 94, 2378: 95, 2381: 96, 2521: 97, 2540: 98, 2590: 99, 2606: 100, 2672: 101, 2746: 102, 2757: 103, 2803: 104,

In [203]:
item_user_array.shape

(532, 487)

In [204]:
item_user_matrix.shape

(532, 487)

In [205]:
# We print the item_similarity_matrix
item_similarity_matrix_train

array([[1.        , 0.0143568 , 0.04177427, ...,        nan,        nan,
               nan],
       [0.0143568 , 1.        , 0.26332321, ...,        nan,        nan,
               nan],
       [0.04177427, 0.26332321, 1.        , ...,        nan,        nan,
               nan],
       ...,
       [       nan,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan,        nan,        nan, ...,        nan,        nan,
               nan]])

In [206]:
item_similarity_matrix_train.shape

(487, 487)

### Step 2: Neighborhood Selection
- Determine the neighborhood size, i.e., the number of most similar items to consider when predicting ratings for a target item.
- Select the most similar items for each item in your dataset based on their calculated similarities. This forms the neighborhood for each item.

#### Item-Based Neighborhoods and Ratings Aggregation

This code following snippet enhances the previous item-based collaborative filtering approach by considering ratings aggregation within the item neighborhoods.

##### Steps:

1. **Defining Neighborhood Size**:
   - The variable `neighborhood_size` determines the number of most similar items to consider in the neighborhood.

2. **Initializing Data Structure**:
   - An empty dictionary `item_neighborhoods` is initialized to store the neighborhoods for each item.

3. **Iterating Over Items**:
   - For each movie in the dataset:
     - All ratings for the current movie are extracted from the DataFrame (`df`).
     - Ratings aggregation is performed. In this example, the average rating for the movie is computed, but other aggregation methods can be used.
     - The similarity scores for the current movie are retrieved from the precomputed `item_similarity_matrix`.
     - Similarity scores are sorted in descending order, and the indices of the most similar items (excluding itself) are obtained.
     - These indices are converted back to movie IDs, forming the neighborhood for the current item.
     - The neighborhood for the current item is stored in the `item_neighborhoods` dictionary.

4. **Output**:
   - `item_neighborhoods`: A dictionary where keys are movie IDs, and values are lists of movie IDs representing the neighborhood of each item. Each movie's neighborhood includes movies with similar ratings and content.

##### Note:
- This approach considers both similarity in ratings and content (as captured by cosine similarity) when building item neighborhoods.
- Aggregating ratings within item neighborhoods helps in providing more personalized recommendations.
- The choice of rating aggregation method (e.g., mean, median) can impact the quality of recommendations and may need to be adjusted based on the characteristics of the dataset and user preferences.


In [207]:
# Step 2: Neighborhood Selection

# Define the neighborhood size
neighborhood_size = 5

# Initialize an empty dictionary to store item neighborhoods
item_neighborhoods = {}

# Iterate over each item (movie) index in the dataset
for movie_index in range(item_user_array.shape[1]):
    # Convert the item index to movie ID
    movie_id = index_to_movie_id[movie_index]
    
    # Extract all ratings for the current movie
    movie_ratings = item_user_array[:, movie_index]  # Changed to item_user_array
    
    # Aggregate ratings (e.g., compute the mean rating)
    movie_avg_rating = np.mean(movie_ratings)
    
    # Retrieve similarity scores for the current movie
    similarity_scores = item_similarity_matrix_train[movie_index]
    
    # Sort similarity scores in descending order and get indices of most similar items
    most_similar_indices = np.argsort(similarity_scores)[::-1][1:neighborhood_size+1]
    
    # Convert indices back to movie IDs to form the neighborhood
    neighborhood = [index_to_movie_id[idx] for idx in most_similar_indices]
    
    # Store the neighborhood for the current item in the item_neighborhoods dictionary
    item_neighborhoods[movie_id] = neighborhood

# Output: item_neighborhoods dictionary containing neighborhoods for each item
print(item_neighborhoods)


{1: [357, 465, 344, 467, 342], 2: [317, 451, 183, 453, 182], 3: [422, 417, 149, 412, 409], 4: [248, 162, 164, 436, 167], 5: [441, 448, 223, 224, 445], 6: [164, 417, 412, 409, 403], 7: [357, 344, 465, 342, 467], 9: [477, 444, 182, 441, 183], 10: [149, 418, 417, 159, 412], 11: [254, 453, 228, 451, 248], 12: [436, 441, 344, 342, 444], 13: [482, 310, 441, 293, 444], 14: [412, 324, 453, 323, 455], 15: [422, 417, 412, 409, 403], 16: [286, 451, 323, 453, 318], 17: [436, 159, 160, 162, 164], 18: [477, 444, 182, 441, 183], 19: [412, 89, 403, 91, 391], 20: [436, 448, 445, 444, 223], 21: [558, 417, 412, 409, 403], 22: [477, 160, 162, 164, 422], 23: [214, 182, 183, 184, 185], 24: [286, 418, 417, 412, 149], 25: [436, 448, 445, 444, 217], 27: [317, 459, 332, 330, 328], 28: [196, 441, 182, 183, 436], 29: [422, 441, 436, 432, 182], 30: [467, 324, 453, 323, 455], 31: [293, 455, 456, 327, 325], 32: [441, 449, 448, 217, 445], 33: [476, 342, 332, 330, 465], 34: [432, 445, 444, 196, 441], 36: [254, 453, 22

### Step 4: Rating Prediction

- For each target item and user pair where the user hasn't rated the target item:
Identify the neighborhood of similar items to the target item.
- Predict the rating for the target item using a weighted average of the ratings of the items in its neighborhood, where the weights are the similarities between the items and the target item.
- Adjust the prediction based on the user's average rating or other normalization techniques, if necessary.

In [208]:
# Train your model and create the item_neighborhoods dictionary using the training data (steps 1-3)

# Initialize an empty DataFrame to store predicted ratings
predicted_ratings = pd.DataFrame(index=train_matrix.index, columns=train_matrix.columns)

# Iterate over each user-item pair in the training set
for user_id, user_ratings in train_matrix.iterrows():
    for movie_id, rating in user_ratings.items():
        
        # Skip if the rating is non-zero (indicating a rating given by the user)
        if rating != 0:
            continue
        
        # Check if the movie has a neighborhood defined
        if movie_id in item_neighborhoods:
            neighborhood = item_neighborhoods[movie_id]
            
            # Filter out movies from the neighborhood that the user has rated in the training set
            filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0]
            
            # Check if there are valid indices in the filtered neighborhood
            if len(filtered_neighborhood) > 0:
                # Calculate the predicted rating for the target movie based on the neighborhood
                neighbor_ratings = [user_ratings.loc[neighbor_movie_id] for neighbor_movie_id in filtered_neighborhood]
                predicted_rating = np.mean(neighbor_ratings)
            else:
                # If the filtered neighborhood is empty, assign the mean rating of all movies for the user in the training set
                predicted_rating = user_ratings.mean()
            
            # Assign the predicted rating to the corresponding cell in the DataFrame
            predicted_ratings.at[user_id, movie_id] = predicted_rating

# Fill NaN values with mean ratings across all users
predicted_ratings.fillna(predicted_ratings.mean().mean(), inplace=True)

# Display the head of the predicted_ratings DataFrame
predicted_ratings.head()


  predicted_ratings.fillna(predicted_ratings.mean().mean(), inplace=True)


movieId,1,4,15,30,43,89,104,108,122,146,...,174479,174551,175475,176371,176389,177593,179813,181413,185029,186587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
428,0.063655,0.063655,0.063655,0.063655,0.063655,0.063655,0.056225,0.056225,0.063655,0.063655,...,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225
517,0.056225,0.084189,0.084189,0.084189,0.084189,0.084189,0.084189,0.056225,0.084189,0.084189,...,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225
197,0.00616,0.00616,0.00616,0.00616,0.00616,0.00616,0.00616,0.056225,0.00616,0.00616,...,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225
160,0.056225,0.131417,0.131417,0.131417,0.131417,0.131417,0.131417,0.056225,0.131417,0.131417,...,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225
67,0.025667,0.025667,0.025667,0.025667,0.025667,0.025667,0.025667,0.056225,0.025667,0.025667,...,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225,0.056225


In [209]:
# We count the NaN values in the predicted_ratings dataframe
predicted_ratings.isnull().sum().sum()

0

In [210]:
def get_top_recommendations(user_id, predicted_ratings, df):
    """
    Get top movie recommendations for a given user using predicted ratings.

    Parameters:
        user_id (int): ID of the user for whom recommendations are to be generated.
        predicted_ratings (pd.DataFrame): DataFrame containing predicted ratings for users and movies.
        df (pd.DataFrame): DataFrame containing movie ratings.

    Returns:
        list: Top movie titles recommended for the user.
    """
    # Ensure that predicted_ratings is a DataFrame
    if not isinstance(predicted_ratings, pd.DataFrame):
        raise ValueError("predicted_ratings must be a pandas DataFrame")

    # Check if the user ID exists in the predicted ratings DataFrame's index
    if user_id not in predicted_ratings.index:
        # If the user ID doesn't exist, return an empty list
        return ["The user doesn't exist"]

    # Get the predicted ratings for the user
    user_predicted_ratings = predicted_ratings.loc[user_id]
    
    # Filter out the movies that the user has already seen
    seen_movies = set(df[df['userId'] == user_id]['movieId'])
    unseen_movies = [movie_id for movie_id in user_predicted_ratings.index if movie_id not in seen_movies]
    
    # Check if there are unseen movies with predicted ratings
    if not unseen_movies:
        # If all movies have been seen, return an empty list
        return []
    
    # Sort the unseen movies by predicted rating in descending order
    sorted_unseen_movies = user_predicted_ratings[unseen_movies].sort_values(ascending=False)
    
    # Get the top 5 movie titles
    top_movie_ids = sorted_unseen_movies.head(5).index.tolist()
    
    # Get the unique movie titles corresponding to the top movie IDs
    top_movie_titles = set(df[df['movieId'].isin(top_movie_ids)]['title'])
    
    return list(top_movie_titles)[:5]  # Return only the first 5 unique movie titles


In [211]:
top_recommendations = []

# Iterate over each row (user) in the train_matrix
for user_id in train_matrix.index:
    # Get the actual user ID corresponding to the user index
    recommendations = get_top_recommendations(user_id, predicted_ratings, df)
    top_recommendations.append(recommendations)

# Output: List of top movie recommendations for each user in the train matrix
top_recommendations

# print size of top_recommendations
len(top_recommendations)


[['Waiting to Exhale ',
  'Romper Stomper ',
  'Toy Story ',
  'Backbeat ',
  'War Room, The '],
 ['Waiting to Exhale ',
  'Cutthroat Island ',
  'Nightmare Before Christmas, The ',
  'War Room, The ',
  'Crow, The '],
 ['42nd Street ',
  "Block Party (a.k.a. Dave Chappelle's Block Party) ",
  'Hell Ride ',
  'All the Boys Love Mandy Lane ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Waiting to Exhale ',
  'Romper Stomper ',
  'Backbeat ',
  'Cutthroat Island ',
  'War Room, The '],
 ['42nd Street ',
  'Get Him to the Greek ',
  'Hell Ride ',
  'All the Boys Love Mandy Lane ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Wuthering Heights ',
  'Mona Lisa Smile ',
  'Rachel Getting Married ',
  'Hell Ride ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Waiting to Exhale ',
  'Cutthroat Island ',
  'Nightmare Before Christmas, The ',
  'War Room, The ',
  'Higher Learning '],
 ['Waiting to Exhale ',
  'Romper Stomper ',
  'Cutthroat Island ',
  'War Room, The ',
  'Crow, The '],
 ['42nd S

340

In [212]:
print("Last version")

Last version


### Step 5: Model Evaluation
- We evaluate the performance of your ItemKNN algorithm using appropriate evaluation metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), or others.
- Split your dataset into training and testing sets to assess the model's predictive accuracy on unseen data.

In [213]:
train.shape

predicted_ratings.shape

(340, 487)

(340, 487)

In [214]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_model(train_matrix, predicted_ratings):
    """
    Evaluate the model's performance on the training data.

    Parameters:
        train_matrix (numpy.ndarray): Item-user matrix from the training data.
        predicted_ratings (pandas.DataFrame): Predicted ratings DataFrame for the training data.

    Returns:
        float: Mean Absolute Error (MAE) on the training data.
        float: Root Mean Squared Error (RMSE) on the training data.
    """
    # Ensure train_matrix and predicted_ratings have the same shape
    assert train_matrix.shape == predicted_ratings.shape, "Shapes of train_matrix and predicted_ratings are not consistent."

    # Convert predicted_ratings DataFrame to numpy array
    predicted_ratings_array = predicted_ratings.to_numpy()

    # Flatten the arrays
    train_ratings = train_matrix.flatten()
    predicted_ratings = predicted_ratings_array.flatten()

    # Filter out NaN values
    valid_indices = ~np.isnan(train_ratings) & ~np.isnan(predicted_ratings)
    train_ratings = train_ratings[valid_indices]
    predicted_ratings = predicted_ratings[valid_indices]

    # Calculate evaluation metrics
    mae = mean_absolute_error(train_ratings, predicted_ratings)
    rmse = np.sqrt(mean_squared_error(train_ratings, predicted_ratings))

    return mae, rmse

# Assuming train is the numpy array representing the item-user matrix
# Assuming predicted_ratings is the DataFrame representing the predicted ratings for the training data

# Evaluate the model
mae, rmse = evaluate_model(train, predicted_ratings)

print("Mean Absolute Error (MAE) on Training Data:", mae)
print("Root Mean Squared Error (RMSE) on Training Data:", rmse)


Mean Absolute Error (MAE) on Training Data: 0.11541909354372394
Root Mean Squared Error (RMSE) on Training Data: 0.4785515075368112


### Step 6: Parameter Tuning
- Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of your ItemKNN algorithm.
- Use cross-validation or other techniques to tune these parameters and avoid overfitting.

1. **Import Necessary Libraries:**
   - We import the required libraries for performing grid search cross-validation (`GridSearchCV`), creating custom scorers (`make_scorer`), and utilizing the `NearestNeighbors` algorithm.

2. **Define Cosine Similarity Function:**
   - We define a custom function `cosine_similarity` to compute the cosine similarity between two vectors. This function calculates the dot product of the vectors and divides it by the product of their norms.

3. **Define Custom Scorer:**
   - We create a custom scorer `cosine_similarity_scorer` using `make_scorer`, which enables us to use cosine similarity as the scoring metric during grid search cross-validation.

4. **Define Parameter Grid:**
   - We specify a parameter grid `param_grid` containing the hyperparameters to be tuned. In this case, we're tuning the number of neighbors (`n_neighbors`) and the distance metric (`metric`) for the `NearestNeighbors` algorithm.

5. **Initialize NearestNeighbors Model:**
   - We initialize the `NearestNeighbors` model without specifying any hyperparameters.

6. **Create GridSearchCV Object:**
   - We create a `GridSearchCV` object named `grid_search` with the specified parameter grid, cross-validation strategy (5-fold cross-validation), and custom scoring metric (`cosine_similarity_scorer`).

7. **Fit the Data:**
   - We fit the `item_user_matrix` data to the `grid_search` object to perform hyperparameter tuning. `item_user_matrix` typically contains the item-item similarity matrix computed using collaborative filtering techniques.

8. **Get Best Hyperparameters:**
   - After fitting the data, we retrieve the best hyperparameters selected by the grid search using the `best_params_` attribute of the `grid_search` object.

9. **Print Best Parameters:**
   - Finally, we print the best hyperparameters obtained from the grid search.



`scikit-learn (sklearn)`:

- Scikit-learn is a popular machine learning library in Python that provides simple and efficient tools for data analysis and modeling.
- It includes various modules for tasks such as classification, regression, clustering, dimensionality reduction, and model selection.
- The GridSearchCV class from scikit-learn is used for hyperparameter tuning through grid search along with cross-validation.
- The make_scorer function allows you to create a custom scoring function for use with GridSearchCV.
- The NearestNeighbors class provides functionality for unsupervised nearest neighbors learning, which can be used for tasks such as finding k-nearest neighbors for a given data point.

In [215]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import make_scorer
import numpy as np

# Define a custom scorer based on cosine similarity defined above
cosine_similarity_scorer = make_scorer(cosine_similarity)

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(train_val_matrix)  # train_val_matrix contains item-user matrix

# Get the best hyperparameters
best_params = grid_search.best_params_

print("Best parameters:", best_params)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() 

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() 

Best parameters: {'metric': 'cosine', 'n_neighbors': 5}


This code sets up a GridSearchCV object to perform hyperparameter tuning using the validation set (train_val_matrix). It explores different combinations of hyperparameters specified in the param_grid, evaluates them using 5-fold cross-validation (cv=5), and uses the cosine_similarity scorer to optimize the model's performance based on cosine similarity. Finally, it prints the best hyperparameters found during the search.

We have to perform the calculation of the predicted ratings for the train-val set to generate predicted ratings in order to evaluate the performance of our trained model on a validation dataset. 

If we computed the similarity matrix again specifically for the train_val set, it would essentially mean that we are using a different set of similarity measures for predicting ratings compared to what we used during training. This approach could lead to inconsistencies and potentially degrade the performance of our model. Here's what could happen:

1. **Inconsistency**: The similarity measures computed for the train_val set might differ from those computed for the training set due to variations in the data. As a result, the predicted ratings based on these new similarity measures may not align well with the predictions made during training, leading to inconsistency in the model's behavior.

2. **Overfitting**: Computing a new similarity matrix specifically for the train_val set might lead to overfitting on the validation data. The model may capture noise or idiosyncrasies present in the train_val set, which may not generalize well to unseen data.

3. **Increased Complexity**: Computing the similarity matrix again for the train_val set adds computational complexity and redundancy, especially if the similarity computation process is resource-intensive. This can result in longer training times and increased resource utilization.

Overall, it's generally recommended to use the same similarity measures or neighborhood definitions for both training and validation sets to ensure consistency and generalizability of the model.

In [216]:
# Assuming you have already trained your model and obtained the item_neighborhoods dictionary

# Initialize an empty DataFrame to store predicted ratings for train_val
predicted_ratings_train_val = pd.DataFrame(index=train_val_matrix.index, columns=train_val_matrix.columns)

# Iterate over each user and their ratings in the train_val set
for user_id, user_ratings in train_val_matrix.iterrows():
    for movie_id, rating in user_ratings.items():
        # Skip if the rating is non-zero (indicating a rating given by the user)
        if rating != 0:
            continue
        
        # Check if the movie has a neighborhood defined
        if movie_id in item_neighborhoods:
            neighborhood = item_neighborhoods[movie_id]
            
            # Filter out movies from the neighborhood that the user has rated in the train set
            filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0]
            
            # Check if there are valid indices in the filtered neighborhood
            if len(filtered_neighborhood) > 0:
                # Calculate the predicted rating for the target movie based on the neighborhood
                neighbor_ratings = [user_ratings.loc[neighbor_movie_id] for neighbor_movie_id in filtered_neighborhood]
                predicted_rating = np.mean(neighbor_ratings)
            else:
                # If the filtered neighborhood is empty, assign the mean rating of all movies for the user in the train set
                predicted_rating = user_ratings.mean()
            
            # Assign the predicted rating to the corresponding cell in the DataFrame
            predicted_ratings_train_val.at[user_id, movie_id] = predicted_rating

# Fill NaN values with mean ratings across all users
predicted_ratings_train_val.fillna(predicted_ratings_train_val.mean().mean(), inplace=True)


  predicted_ratings_train_val.fillna(predicted_ratings_train_val.mean().mean(), inplace=True)


In [217]:
predicted_ratings_train_val.head()

predicted_ratings_train_val.shape

movieId,1,4,15,30,43,89,104,108,122,146,...,174479,174551,175475,176371,176389,177593,179813,181413,185029,186587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
344,0.019507,0.019507,0.019507,0.019507,0.019507,0.019507,0.019507,0.054999,0.019507,0.019507,...,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999
465,0.041068,0.041068,0.041068,0.041068,0.041068,0.041068,0.041068,0.054999,0.041068,0.041068,...,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999
515,0.010267,0.010267,0.010267,0.010267,0.010267,0.010267,0.010267,0.054999,0.010267,0.010267,...,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999
357,0.054999,0.185832,0.185832,0.185832,0.185832,0.185832,0.185832,0.054999,0.185832,0.185832,...,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999
164,0.010267,0.010267,0.010267,0.010267,0.010267,0.010267,0.010267,0.054999,0.010267,0.010267,...,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999,0.054999


(425, 487)

Now we evaluate the model with the train-val item-user matrix.

In [218]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Calculate item-item similarity matrix for train data
item_similarity_matrix_train = np.zeros((train.shape[1], train.shape[1]))

for i in range(train.shape[1]):
    for j in range(i, train.shape[1]):
        item_similarity_matrix_train[i, j] = cosine_similarity(train[:, i], train[:, j])
        item_similarity_matrix_train[j, i] = item_similarity_matrix_train[i, j]

# Assuming train_val_matrix and predicted_ratings_train_val are properly defined

# Now, you can evaluate the model using the evaluate_model function
mae, rmse = evaluate_model(train_val_matrix.to_numpy(), predicted_ratings_train_val)

print("Mean Absolute Error (MAE) on Train-Validation Data:", mae)
print("Root Mean Squared Error (RMSE) on Train-Validation Data:", rmse)


  similarity = dot_product / (norm_a * norm_b)


Mean Absolute Error (MAE) on Train-Validation Data: 0.11270438375369393
Root Mean Squared Error (RMSE) on Train-Validation Data: 0.4715288573983273


### Step 7: Deployment

Once we have come up with the best parameters possible and trained the model with the whole train_validation set, we will test it on the test set.

In [219]:
# Assuming you have already trained your model and obtained the item_neighborhoods dictionary

# Now, we can evaluate the model's predicted ratings on the test set

# Initialize an empty DataFrame to store predicted ratings for the test set
predicted_ratings_test = pd.DataFrame(index=test_matrix.index, columns=test_matrix.columns)

# Iterate over each user and their ratings in the test set
for user_id, user_ratings in test_matrix.iterrows():
    for movie_id, rating in user_ratings.items():
        # Skip if the rating is non-zero (indicating a rating given by the user)
        if rating != 0:
            continue
        
        # Check if the movie has a neighborhood defined
        if movie_id in item_neighborhoods:
            neighborhood = item_neighborhoods[movie_id]
            
            # Filter out movies from the neighborhood that the user has rated in the train set
            filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in user_ratings.index and user_ratings.loc[neighbor_movie_id] != 0]
            
            # Check if there are valid indices in the filtered neighborhood
            if len(filtered_neighborhood) > 0:
                # Calculate the predicted rating for the target movie based on the neighborhood
                neighbor_ratings = [user_ratings.loc[neighbor_movie_id] for neighbor_movie_id in filtered_neighborhood]
                predicted_rating = np.mean(neighbor_ratings)
            else:
                # If the filtered neighborhood is empty, assign the mean rating of all movies for the user in the train set
                predicted_rating = user_ratings.mean()
            
            # Assign the predicted rating to the corresponding cell in the DataFrame
            predicted_ratings_test.at[user_id, movie_id] = predicted_rating

# Fill NaN values with mean ratings across all users
predicted_ratings_test.fillna(predicted_ratings_test.mean().mean(), inplace=True)

# Now, you can evaluate the model's predicted ratings on the test set
mae, rmse = evaluate_model(test_matrix.to_numpy(), predicted_ratings_test)

print("Mean Absolute Error (MAE) on Test Data using Predicted Ratings from Test Set:", mae)
print("Root Mean Squared Error (RMSE) on Test Data using Predicted Ratings from Test Set:", rmse)


Mean Absolute Error (MAE) on Test Data using Predicted Ratings from Test Set: 0.1279818499213286
Root Mean Squared Error (RMSE) on Test Data using Predicted Ratings from Test Set: 0.4995626957908083


  predicted_ratings_test.fillna(predicted_ratings_test.mean().mean(), inplace=True)


In [220]:
test_matrix

movieId,1,4,15,30,43,89,104,108,122,146,...,174479,174551,175475,176371,176389,177593,179813,181413,185029,186587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
561,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
122,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
559,5.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0,1.5,0.0
350,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
610,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
441,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [221]:
top_recommendations_test = []

# Iterate over each row (user) in the test_matrix
for user_id in test_matrix.index:
    # Get the actual user ID corresponding to the user index
    recommendations = get_top_recommendations(user_id, predicted_ratings_test, df)
    top_recommendations_test.append(recommendations)

# Output: List of top movie recommendations for each user in the test matrix
top_recommendations_test

# print size of top_recommendations
print("Size of top_recommendations_test:", len(top_recommendations_test))



[['Wuthering Heights ',
  'Mona Lisa Smile ',
  'Rachel Getting Married ',
  'Hell Ride ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Waiting to Exhale ',
  'Romper Stomper ',
  'Backbeat ',
  'Cutthroat Island ',
  'War Room, The '],
 ['Waiting to Exhale ',
  'Romper Stomper ',
  'Toy Story ',
  'War Room, The ',
  'Crow, The '],
 ['Wuthering Heights ',
  'Rachel Getting Married ',
  'Hell Ride ',
  'I Know Who Killed Me ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Get Him to the Greek ',
  'Hell Ride ',
  'Anna Karenina ',
  'All the Boys Love Mandy Lane ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Wuthering Heights ',
  'Mona Lisa Smile ',
  'Rachel Getting Married ',
  'Hell Ride ',
  'Heart of a Dog (Sobachye serdtse) '],
 ['Waiting to Exhale ',
  'Romper Stomper ',
  'Cutthroat Island ',
  'Nightmare Before Christmas, The ',
  'Crow, The '],
 ['42nd Street ',
  "Block Party (a.k.a. Dave Chappelle's Block Party) ",
  'Hell Ride ',
  'All the Boys Love Mandy Lane ',
  'Heart o

Size of top_recommendations_test: 107


# Item-based Classification (ItemKNN)

### Step 1: Data preparation
- Ensure you have a dataset with user-item interactions. Each interaction should include the user ID, item ID, and the corresponding label or class for classification.
- Preprocess the data as needed, including handling missing values, encoding categorical variables, and splitting into training and testing sets.

For this step we are going to resuse the `item_user_matrix` that we created in the first model built. 

### Step 2: Compute Item Similarity


In [222]:
item_similarity_matrix_train.shape

(487, 487)

In [223]:
import numpy as np

def item_neighborhood_selection(similarity_matrix, k=None, threshold=None, item_ids=None):
    """
    Select a subset of similar items for each item based on similarity matrix.

    Parameters:
        similarity_matrix (numpy.ndarray): Item-item similarity matrix.
        k (int): Number of similar items to select (optional).
        threshold (float): Similarity threshold for selecting similar items (optional).
        item_ids (list): List of item IDs corresponding to rows/columns of the similarity matrix.

    Returns:
        dict: Dictionary containing similar items for each item.
    """
    num_items = similarity_matrix.shape[0]
    item_neighborhood = {}

    for i in range(num_items):
        if k is not None:
            # Select top-k similar items (excluding the item itself)
            similar_items_indices = np.argsort(similarity_matrix[i])[::-1][:k]
        elif threshold is not None:
            # Select items with similarity above threshold (excluding the item itself)
            similar_items_indices = np.where(similarity_matrix[i] > threshold)[0]

        # Remove the item itself from the neighborhood
        similar_items_indices = similar_items_indices[similar_items_indices != i]

        # Get the item ID corresponding to the current index
        current_item_id = item_ids[i] if item_ids is not None else i

        # Get the item IDs for the neighborhood
        neighborhood_item_ids = [item_ids[index] for index in similar_items_indices]

        # Store the similar items with the item's ID as the key
        item_neighborhood[current_item_id] = neighborhood_item_ids

    return item_neighborhood

# Example usage:
# Assuming you have an item-item similarity matrix 'item_similarity_matrix_train'
# and you want to select top-5 similar items for each item
k = 5
item_neighborhood_classification = item_neighborhood_selection(item_similarity_matrix_train, k=k, item_ids=index_to_movie_id)

print(item_neighborhood_classification)


{1: [559, 357, 465, 344, 467], 2: [559, 317, 451, 183, 453], 3: [559, 422, 417, 149, 412], 4: [559, 248, 162, 164, 436], 5: [559, 441, 448, 223, 224], 6: [559, 164, 417, 412, 409], 7: [559, 357, 344, 465, 342], 9: [559, 477, 444, 182, 441], 10: [559, 149, 418, 417, 159], 11: [559, 254, 453, 228, 451], 12: [559, 436, 441, 344, 342], 13: [559, 482, 310, 441, 293], 14: [559, 412, 324, 453, 323], 15: [559, 422, 417, 412, 409], 16: [559, 286, 451, 323, 453], 17: [559, 436, 159, 160, 162], 18: [559, 477, 444, 182, 441], 19: [559, 412, 89, 403, 91], 20: [559, 436, 448, 445, 444], 21: [559, 558, 417, 412, 409], 22: [559, 477, 160, 162, 164], 23: [559, 214, 182, 183, 184], 24: [559, 286, 418, 417, 412], 25: [559, 436, 448, 445, 444], 27: [559, 317, 459, 332, 330], 28: [559, 196, 441, 182, 183], 29: [559, 422, 441, 436, 432], 30: [559, 467, 324, 453, 323], 31: [559, 293, 455, 456, 327], 32: [559, 441, 449, 448, 217], 33: [559, 476, 342, 332, 330], 34: [559, 432, 445, 444, 196], 36: [559, 254, 45

In [224]:
len(item_neighborhood_classification)

487

In [225]:
df.columns

Index(['movieId', 'title', 'year', 'userId', 'rating', 'date', 'Drama',
       'Action', 'Sci-Fi', 'Comedy', 'Crime', 'Thriller', 'Romance', 'War',
       'Documentary', 'Children', 'Adventure', 'Fantasy', 'IMAX', 'Animation',
       'Musical', 'Horror', 'Film-Noir', 'Western', 'Mystery',
       '(no genres listed)'],
      dtype='object')

In [226]:
def recommend_movies(user_id, train_matrix, item_neighborhood, num_movies=5):
    """
    Recommend top movies for a given user based on item neighborhood.

    Parameters:
        user_id (int): ID of the user for whom movies are to be recommended.
        train_matrix (pd.DataFrame): DataFrame containing user-item ratings in the train set.
        item_neighborhood (dict): Dictionary containing similar items for each item.
        num_movies (int): Number of movies to recommend for the user.

    Returns:
        list: List of movie IDs recommended for the user.
    """
    # Get positively rated items by the user
    positively_rated_items = train_matrix.loc[user_id][train_matrix.loc[user_id] > 3].index.tolist()
    
    # If the user hasn't positively rated any items
    if not positively_rated_items:
        # Get the highest rated unseen movies for the user
        unseen_movies = [movie_id for movie_id in train_matrix.columns if movie_id not in train_matrix.loc[user_id].index]
        highest_rated_unseen_movies = train_matrix[unseen_movies].mean().sort_values(ascending=False).index.tolist()
        
        # Return the top num_movies highest rated unseen movies
        return highest_rated_unseen_movies[:num_movies]
    
    predicted_movies = []
    for item_id in positively_rated_items:
        # Get similar items from the neighborhood
        similar_items = item_neighborhood.get(item_id, [])
        
        # Exclude items already rated by the user
        similar_unrated_items = [item for item in similar_items if item not in train_matrix.loc[user_id].index]
        
        # Add predicted movies to the list
        predicted_movies.extend(similar_unrated_items)
    
    # Return top num_movies predicted movies for the user
    return predicted_movies[:num_movies]

# Example usage:
# Predict movies for each user in the train set
train_set_recommendations = {}
for user_id in train_matrix.index:
    train_set_recommendations[user_id] = recommend_movies(user_id, train_matrix, item_neighborhoods)


In [227]:
# Print the unique users in train_set_recommendations
print("Unique users in train_set_recommendations:", len(train_set_recommendations))

# Print the unique users in train_matrix
print("Unique users in train_matrix:", len(train_matrix.index))
print("User IDs in train_matrix:", train_matrix.index)


Unique users in train_set_recommendations: 340
Unique users in train_matrix: 340
User IDs in train_matrix: Index([428, 517, 197, 160,  67,  32, 234, 381, 333, 116,
       ...
       360, 130, 119, 331, 367,  51, 267, 326, 416, 168],
      dtype='int64', name='userId', length=340)


In [228]:
train_set_predictions

{428: [181, 153, 154, 155, 156],
 517: [311, 403, 298, 405, 296],
 197: [],
 160: [311, 403, 298, 405, 296],
 67: [],
 32: [],
 234: [311, 403, 298, 405, 296],
 381: [311, 403, 298, 405, 296],
 333: [],
 116: [],
 78: [311, 403, 298, 405, 296],
 317: [],
 95: [413, 381, 153, 269, 377],
 607: [311, 403, 298, 405, 296],
 508: [],
 300: [],
 372: [],
 279: [276, 399, 287, 285, 284],
 23: [],
 402: [],
 544: [],
 121: [311, 403, 298, 405, 296],
 159: [311, 403, 298, 405, 296],
 357: [311, 403, 298, 405, 296],
 551: [],
 241: [],
 282: [311, 403, 298, 405, 296],
 444: [181, 153, 154, 155, 156],
 219: [311, 403, 298, 405, 296],
 598: [],
 165: [],
 58: [181, 153, 154, 155, 156],
 153: [],
 21: [311, 403, 298, 405, 296],
 445: [],
 212: [],
 424: [],
 583: [],
 434: [311, 403, 298, 405, 296],
 540: [],
 437: [333, 136, 365, 138, 361],
 301: [],
 534: [311, 403, 298, 405, 296],
 188: [],
 290: [311, 403, 298, 405, 296],
 286: [],
 200: [311, 403, 298, 405, 296],
 382: [311, 403, 298, 405, 296]

In [230]:
print("STUCK")

STUCK


In [229]:
def get_movie_titles(movie_ids, df, movie_id_to_index):
    """
    Get movie titles based on movie IDs.

    Parameters:
        movie_ids (list): List of movie IDs.
        df (pd.DataFrame): DataFrame containing movie information.
        movie_id_to_index (dict): Mapping dictionary from movie ID to index.

    Returns:
        list: List of movie titles corresponding to the given movie IDs.
    """
    # Convert movie IDs to indices using the mapping dictionary
    movie_indices = [movie_id_to_index[movie_id] for movie_id in movie_ids]
    
    # Get movie titles based on movie indices
    return df.iloc[movie_indices]['title'].tolist()

# Display predictions for all users in the train set movie by movie
for user_id, predicted_movies in train_set_predictions.items():
    print("Predicted movies for user", user_id, ":")
    
    # Get all predicted movie titles for the current user
    movie_titles = get_movie_titles(predicted_movies, df, movie_id_to_index)
    
    # Print each movie title
    for movie_title in movie_titles:
        print("   -", movie_title)


Predicted movies for user 428 :
   - Life Is Beautiful (La Vita è bella) 
   - Life Is Beautiful (La Vita è bella) 
   - Life Is Beautiful (La Vita è bella) 
   - Life Is Beautiful (La Vita è bella) 
   - Life Is Beautiful (La Vita è bella) 
Predicted movies for user 517 :
   - Cider House Rules, The 
   - City Slickers 
   - Romper Stomper 
   - City Slickers 
   - Romper Stomper 
Predicted movies for user 197 :
Predicted movies for user 160 :
   - Cider House Rules, The 
   - City Slickers 
   - Romper Stomper 
   - City Slickers 
   - Romper Stomper 
Predicted movies for user 67 :
Predicted movies for user 32 :
Predicted movies for user 234 :
   - Cider House Rules, The 
   - City Slickers 
   - Romper Stomper 
   - City Slickers 
   - Romper Stomper 
Predicted movies for user 381 :
   - Cider House Rules, The 
   - City Slickers 
   - Romper Stomper 
   - City Slickers 
   - Romper Stomper 
Predicted movies for user 333 :
Predicted movies for user 116 :
Predicted movies for user 78

KeyError: 395

### Step 3: Class Label Propagation

In [None]:
import pandas as pd

def rating_propagation(item_neighborhood, item_user_matrix):
    """
    Propagate ratings from similar items to the target item using averaging.

    Parameters:
        item_neighborhood (dict): Dictionary containing similar items for each item.
        item_user_matrix (pandas.DataFrame): User-item matrix containing user ratings.

    Returns:
        dict: Dictionary containing propagated ratings for each item.
    """
    propagated_ratings = {}

    for target_item_index, similar_item_indices in item_neighborhood.items():
        ratings = []
        for similar_item_index in similar_item_indices:
            if similar_item_index < item_user_matrix.shape[0]:
                rating = item_user_matrix.iloc[similar_item_index].mean()  # Mean rating for similar item
                ratings.append(rating)
            else:
                print(f"Index {similar_item_index} is out of bounds.")

        propagated_rating = sum(ratings) / len(ratings) if ratings else None
        propagated_ratings[target_item_index] = propagated_rating

    return propagated_ratings

# Example usage:
# Assuming you have item_neighborhood and train DataFrame with user-item interactions
# where each cell contains a user's rating for an item
propagated_ratings = rating_propagation(item_neighborhood_classification, item_user_matrix)

print(propagated_ratings)


{1: 0.05462012320328543, 2: 0.12464065708418892, 3: 0.024435318275154005, 4: 0.027104722792607804, 5: 0.12628336755646818, 6: 0.016837782340862424, 7: 0.05462012320328542, 9: 0.12710472279260782, 10: 0.0188911704312115, 11: 0.05420944558521561, 12: 0.01540041067761807, 13: 0.019507186858316223, 14: 0.037166324435318275, 15: 0.02340862422997947, 16: 0.08788501026694046, 17: 0.039014373716632446, 18: 0.12710472279260782, 19: 0.08829568788501026, 20: 0.1238193018480493, 21: 0.018891170431211503, 22: 0.0837782340862423, 23: 0.09589322381930185, 24: 0.018275154004106776, 25: 0.15790554414784394, 27: 0.07371663244353183, 28: 0.08542094455852156, 29: 0.11047227926078029, 30: 0.03839835728952772, 31: 0.04229979466119097, 32: 0.1568788501026694, 33: 0.04784394250513346, 34: 0.03449691991786448, 36: 0.05420944558521561, 38: 0.04784394250513346, 39: 0.01642710472279261, 40: 0.2117043121149897, 41: 0.04784394250513348, 42: 0.08788501026694044, 43: 0.14086242299794663, 44: 0.0595482546201232, 45: 0

### Step 4: Classify Items

- For each item in the testing set, find its nearest neighbors based on the precomputed item neighborhoods.
- Determine the class labels of these neighbors.
- Use a strategy such as majority voting to assign the class label to the item.

In [None]:
def perform_classification(item_neighborhood, propagated_ratings):
    """
    Perform classification by aggregating the propagated ratings from similar items.

    Parameters:
        item_neighborhood (dict): Dictionary containing similar items for each item.
        propagated_ratings (dict): Propagated ratings for each item.

    Returns:
        dict: Dictionary containing classified ratings for each item.
    """
    classified_ratings = {}

    for target_item_index, similar_item_indices in item_neighborhood.items():
        rating_sum = 0
        count = 0
        for similar_item_index in similar_item_indices:
            propagated_rating = propagated_ratings.get(similar_item_index)
            if propagated_rating is not None:
                rating_sum += propagated_rating
                count += 1
        
        if count > 0:
            classified_rating = rating_sum / count
            classified_ratings[target_item_index] = classified_rating
        else:
            # If there are no similar items or propagated ratings available, assign NaN
            classified_ratings[target_item_index] = float('nan')

    return classified_ratings

# Example usage:
classified_ratings = perform_classification(item_neighborhood_classification, propagated_ratings)


In [None]:
classified_ratings

{1: 0.07876796714579055,
 2: 0.07420944558521561,
 3: 0.09722792607802876,
 4: 0.07757700205338809,
 5: 0.06587268993839836,
 6: 0.08262833675564682,
 7: 0.07876796714579055,
 9: 0.057987679671457903,
 10: 0.08670431211498973,
 11: 0.0811498973305955,
 12: 0.08665297741273101,
 13: 0.08205338809034907,
 14: 0.12438398357289526,
 15: 0.09227926078028748,
 16: 0.10989733059548254,
 17: 0.08554414784394251,
 18: 0.057987679671457903,
 19: 0.11190965092402466,
 20: 0.0726488706365503,
 21: 0.09137577002053388,
 22: 0.07659137577002054,
 23: 0.04813141683778234,
 24: 0.08983572895277207,
 25: 0.0679671457905544,
 27: 0.03388090349075975,
 28: 0.07708418891170432,
 29: 0.07823408624229979,
 30: 0.09881930184804927,
 31: 0.08393223819301847,
 32: 0.06275154004106775,
 33: 0.06365503080082136,
 34: 0.08634496919917864,
 36: 0.0811498973305955,
 38: 0.06365503080082136,
 39: 0.08283367556468173,
 40: 0.0426694045174538,
 41: 0.04799794661190965,
 42: 0.06955852156057495,
 43: 0.0724435318275154

In [None]:
# We check for the NaN values in the classified_ratings
pd.Series(classified_ratings).isnull().sum()

0

### Step 6: Recommendation Generation

Once the model is trained and evaluated, you can use it to generate item recommendations for users. For a given user, recommend items that are highly rated or predicted to be liked based on the model's predictions.

In [None]:
train

array([[0., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
import pandas as pd

def predict_user_ratings_and_recommendations(item_user_matrix, item_neighborhood, classified_ratings, movies_df, top_n=5):
    """
    Predict ratings for users based on the majority class label in the item neighborhood.
    Also, generate item recommendations for each user.

    Parameters:
        item_user_matrix (numpy.ndarray): User-item matrix containing ratings or interactions.
        item_neighborhood (dict): Dictionary containing similar items for each item.
        classified_ratings (dict): Classified ratings for each item.
        movies_df (pandas.DataFrame): DataFrame containing movie titles and corresponding IDs.
        top_n (int): Number of top recommendations to generate (default is 5).

    Returns:
        tuple: Tuple containing predicted ratings for each user and recommended items.
    """
    user_ratings = {}
    user_recommendations = {}

    for user_id, user_interactions in enumerate(item_user_matrix):
        ratings = {}
        recommendations = []

        # Iterate over the items interacted by the user
        for item_id, interaction in enumerate(user_interactions):
            # Check if the item has a classified rating and is in the item neighborhood
            if item_id in classified_ratings and item_id in item_neighborhood:
                # Get the classified rating for the item
                classified_rating = classified_ratings[item_id]
                # Add the classified rating to the user's ratings
                ratings[item_id] = classified_rating
                # Add the item to recommendations
                recommendations.extend(item_neighborhood[item_id][:top_n])

        # Determine the top recommended items for the user
        recommendations = list(set(recommendations))  # Remove duplicates
        recommendations = [item_id for item_id in recommendations if item_id not in user_interactions]  # Remove items already interacted by the user
        recommendations_with_titles = [(item_id, movies_df.loc[item_id, 'title']) for item_id in recommendations]  # Retrieve movie titles using movie IDs
        user_recommendations[user_id] = recommendations_with_titles[:top_n]

        # Assign the predicted ratings to the user
        user_ratings[user_id] = ratings

    return user_ratings, user_recommendations

# Example usage:
predicted_user_ratings, user_recommendations = predict_user_ratings_and_recommendations(train, item_neighborhood_classification, classified_ratings, df)

print("Top 5 recommended movies for users:", user_recommendations)


Top 5 recommended movies for users: {0: [(75, 'Alice in Wonderland '), (77, 'Alice in Wonderland '), (97, 'Mr. Woodcock '), (126, 'Life Is Beautiful (La Vita è bella) '), (135, 'Life Is Beautiful (La Vita è bella) ')], 1: [(75, 'Alice in Wonderland '), (77, 'Alice in Wonderland '), (97, 'Mr. Woodcock '), (126, 'Life Is Beautiful (La Vita è bella) '), (135, 'Life Is Beautiful (La Vita è bella) ')], 2: [(75, 'Alice in Wonderland '), (77, 'Alice in Wonderland '), (97, 'Mr. Woodcock '), (126, 'Life Is Beautiful (La Vita è bella) '), (135, 'Life Is Beautiful (La Vita è bella) ')], 3: [(75, 'Alice in Wonderland '), (77, 'Alice in Wonderland '), (97, 'Mr. Woodcock '), (126, 'Life Is Beautiful (La Vita è bella) '), (135, 'Life Is Beautiful (La Vita è bella) ')], 4: [(75, 'Alice in Wonderland '), (77, 'Alice in Wonderland '), (97, 'Mr. Woodcock '), (126, 'Life Is Beautiful (La Vita è bella) '), (135, 'Life Is Beautiful (La Vita è bella) ')], 5: [(75, 'Alice in Wonderland '), (77, 'Alice in Wond

In [None]:
# we check if we have a user with id 0
df[df['userId'] == 1]

Unnamed: 0,movieId,title,year,userId,rating,date,Drama,Action,Sci-Fi,Comedy,...,Adventure,Fantasy,IMAX,Animation,Musical,Horror,Film-Noir,Western,Mystery,(no genres listed)
193,423,Blown Away,1994,1,3.0,2000-07-30,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1500,3034,Robin Hood,1973,1,5.0,2000-07-30,0,0,0,1,...,1,0,0,1,1,0,0,0,0,0
1617,2366,King Kong,1933,1,4.0,2000-07-30,0,1,0,0,...,1,1,0,0,0,1,0,0,0,0
2146,1625,"Game, The",1997,1,5.0,2000-07-30,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2411,1580,Men in Black (a.k.a. MIB),1997,1,3.0,2000-07-30,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3016,3053,"Messenger: The Story of Joan of Arc, The",1999,1,5.0,2000-07-30,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3215,4006,Transformers: The Movie,1986,1,4.0,2000-07-30,0,0,1,0,...,1,0,0,1,0,0,0,0,0,0
3468,1,Toy Story,1995,1,4.0,2000-07-30,0,0,0,1,...,1,1,0,1,0,0,0,0,0,0
3714,2273,Rush Hour,1998,1,4.0,2000-07-30,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4388,2046,Flight of the Navigator,1986,1,4.0,2000-07-30,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0


In [None]:
print("Predicted ratings for users:", predicted_user_ratings)
print("Recommended items for users:", user_recommendations)

Predicted ratings for users: {0: {1: 0.07876796714579055, 2: 0.07420944558521561, 3: 0.09722792607802876, 4: 0.07757700205338809, 5: 0.06587268993839836, 6: 0.08262833675564682, 7: 0.07876796714579055, 9: 0.057987679671457903, 10: 0.08670431211498973, 11: 0.0811498973305955, 12: 0.08665297741273101, 13: 0.08205338809034907, 14: 0.12438398357289526, 15: 0.09227926078028748, 16: 0.10989733059548254, 17: 0.08554414784394251, 18: 0.057987679671457903, 19: 0.11190965092402466, 20: 0.0726488706365503, 21: 0.09137577002053388, 22: 0.07659137577002054, 23: 0.04813141683778234, 24: 0.08983572895277207, 25: 0.0679671457905544, 27: 0.03388090349075975, 28: 0.07708418891170432, 29: 0.07823408624229979, 30: 0.09881930184804927, 31: 0.08393223819301847, 32: 0.06275154004106775, 33: 0.06365503080082136, 34: 0.08634496919917864, 36: 0.0811498973305955, 38: 0.06365503080082136, 39: 0.08283367556468173, 40: 0.0426694045174538, 41: 0.04799794661190965, 42: 0.06955852156057495, 43: 0.0724435318275154, 44:

### Step 5: Evaluate Model Performance

- Assess the performance of the ItemKNN classifier using evaluation metrics such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC).
- Compare the performance of different models with varying neighborhood sizes or similarity metrics to identify the optimal configuration.

In [None]:
import numpy as np

def evaluate_model_performance(train_matrix, classified_ratings):
    """
    Evaluate the performance of the model using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) 
    on the train split.

    Parameters:
        train_matrix (numpy.ndarray): Item-user matrix from the training data.
        classified_ratings (dict): Dictionary containing classified ratings for each item.

    Returns:
        tuple: Tuple containing MAE and RMSE
    """
    # Convert train matrix DataFrame to NumPy array
    train_matrix_array = train_matrix.to_numpy()

    # Flatten the train matrix
    actual_ratings = train_matrix_array.flatten()

    # Filter out NaN values
    valid_indices = ~np.isnan(actual_ratings)
    actual_ratings = actual_ratings[valid_indices]

    # Convert classified ratings dictionary to numpy array
    predicted_ratings = np.array([classified_ratings.get(i, np.nan) for i in range(len(actual_ratings))])

    # Filter out NaN values in predicted ratings
    valid_predicted_indices = ~np.isnan(predicted_ratings)
    actual_ratings = actual_ratings[valid_predicted_indices]
    predicted_ratings = predicted_ratings[valid_predicted_indices]

    # Calculate Mean Absolute Error (MAE)
    mae = np.mean(np.abs(actual_ratings - predicted_ratings))

    # Calculate Root Mean Squared Error (RMSE)
    rmse = np.sqrt(np.mean((actual_ratings - predicted_ratings) ** 2))

    return mae, rmse

# Example usage:
# Assuming train_matrix and classified_ratings are properly defined
mae, rmse = evaluate_model_performance(train_matrix, classified_ratings)
print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)


Mean Absolute Error (MAE): 0.15816425558708486
Root Mean Squared Error (RMSE): 0.5390368607291904


### Step 7: Parameter Tuning

Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of your ItemKNN algorithm.
Use techniques like cross-validation to tune these parameters and avoid overfitting.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Define a function to compute cosine similarity
def cosine_similarity(X, Y):
    """
    This function computes the cosine similarity between two vectors X and Y.

    Parameters:
        X (ndarray): First vector
        Y (ndarray): Second vector

    Returns:
        float: Cosine similarity between X and Y
    """
    return np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

# Define a custom scorer based on cosine similarity
cosine_similarity_scorer = make_scorer(cosine_similarity)

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model_classification = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model_classification, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(item_user_matrix)  # item_user_matrix contains item-item similarity matrix

# Get the best hyperparameters
best_params_classification = grid_search.best_params_

print("Best parameters:", best_params_classification)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() 

Best parameters: {'metric': 'cosine', 'n_neighbors': 5}


### Step 7: Deployment

- If you want to predict the class labels for new items, repeat steps 4 and 5 using the entire dataset (training + testing) to build the final model.
- Use the trained model to predict the class labels for new items based on their nearest neighbors.