# Item-based Models

This notebook builds two machine learning models for recommender systems, both based on collaborative filtering: item-based rating prediction (ItemKNN) and item-based classification (ItemKNN).

Collaborative filtering is a technique used in recommendation systems to predict or classify items based on the preferences or behavior of similar users or items. Item-based Collaborative Filtering (CF) focuses on the similarity between items rather than users. There are two main approaches within item-based CF: rating prediction and classification.

In [16]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.core.display import HTML # to pretty print pandas df and be able to copy them over (e.g. to ppt slides)

In [17]:
# We import the necessary libraries
import pandas as pd
import numpy as np
import os

In [18]:
# We print the directory where the file is located
print(os.getcwd())

c:\Users\Jaume\Documents\MDDB\SDM\SDfM---Jaume-and-Stijn


In [19]:
# We set the directory to the cleaned folder
os.listdir(os.path.join('.', 'cleaned'))

['final_sample5_parquet', 'movielens_parquet', 'netflix_parquet']

In [20]:
# We read the final_sample file and store it in a dataframe
df = pd.read_parquet('cleaned/final_sample5_parquet')

In [21]:
# We print shape of the dataframe
df.shape

(4665, 26)

In [22]:
# We print the first 5 rows of the dataframe
df.head()

Unnamed: 0,movieId,title,year,userId,rating,date,Drama,Action,Sci-Fi,Comedy,...,Adventure,Fantasy,IMAX,Animation,Musical,Horror,Film-Noir,Western,Mystery,(no genres listed)
0,45635,"Notorious Bettie Page, The",2005,414,3.0,2008-07-15,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,45635,"Notorious Bettie Page, The",2005,474,3.0,2006-12-08,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1373,Star Trek V: The Final Frontier,1989,19,1.0,2000-08-08,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1373,Star Trek V: The Final Frontier,1989,42,4.0,2001-07-27,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1373,Star Trek V: The Final Frontier,1989,51,5.0,2009-01-02,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# We print the columns of the dataframe
df.columns

Index(['movieId', 'title', 'year', 'userId', 'rating', 'date', 'Drama',
       'Action', 'Sci-Fi', 'Comedy', 'Crime', 'Thriller', 'Romance', 'War',
       'Documentary', 'Children', 'Adventure', 'Fantasy', 'IMAX', 'Animation',
       'Musical', 'Horror', 'Film-Noir', 'Western', 'Mystery',
       '(no genres listed)'],
      dtype='object')

In [24]:
# We print the number of unique users and movies
print(df['userId'].nunique())
print(df['movieId'].nunique())


532
487


### Train-Test validation split

In [25]:
from sklearn.model_selection import train_test_split

# Split the dataframe into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Split the train set into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)

# Print the shapes of the train, validation, and test sets
print("Train set shape:", train_df.shape)
print("Validation set shape:", val_df.shape)
print("Test set shape:", test_df.shape)


Train set shape: (2985, 26)
Validation set shape: (747, 26)
Test set shape: (933, 26)


# Item-based Rating Prediction (ItemKNN)

### Step 1: User-Item matrix construction
We first build the user-item matrix and then derive item similarities from it:

- Choose a similarity metric to calculate the similarity between items. Common metrics include cosine similarity, Pearson correlation coefficient, and Jaccard similarity.
- Calculate the similarity between each pair of items based on the ratings provided by users. This will result in an item-item similarity matrix.
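
As a rough illustration of the three metrics named above, the following sketch compares two toy item rating vectors. The vectors, the helper names, and the convention that 0 means "not rated" are assumptions made for this example only.

```python
import numpy as np

# Two toy item vectors: each entry is one user's rating, 0 = not rated (assumption)
item_a = np.array([5.0, 3.0, 0.0, 1.0])
item_b = np.array([4.0, 0.0, 0.0, 1.0])

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def pearson_sim(a, b):
    """Pearson correlation over users who rated both items (0 if undefined)."""
    both = (a > 0) & (b > 0)
    if both.sum() < 2 or a[both].std() == 0 or b[both].std() == 0:
        return 0.0
    return float(np.corrcoef(a[both], b[both])[0, 1])

def jaccard_sim(a, b):
    """Overlap of the sets of users who rated each item, ignoring rating values."""
    rated_a, rated_b = a > 0, b > 0
    union = (rated_a | rated_b).sum()
    return float((rated_a & rated_b).sum() / union) if union > 0 else 0.0

print(cosine_sim(item_a, item_b), pearson_sim(item_a, item_b), jaccard_sim(item_a, item_b))
```

Cosine uses the rating values of all users, Pearson only looks at co-rating users, and Jaccard ignores the values entirely, so the three can rank the same item pairs differently.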

#### Collaborative Filtering with Cosine Similarity

This code snippet demonstrates how to perform item-based collaborative filtering using cosine similarity. Collaborative filtering is a technique commonly used in recommendation systems to predict a user's preferences for items based on the preferences of similar users/items.

##### Libraries Used
- **pandas**: A powerful data manipulation library in Python.
- **numpy**: A library for numerical computing in Python.
- **sklearn.metrics.pairwise.cosine_similarity**: A function from scikit-learn used to compute the cosine similarity between vectors (the cells below implement the same measure directly with NumPy).
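
The scikit-learn function listed above can build a whole item-item matrix in one call. The sketch below shows this on a made-up user-item array; `toy_ratings` and the alias `sk_cosine_similarity` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Toy user-item matrix: rows are users, columns are items, 0 = not rated (assumption)
toy_ratings = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
])

# cosine_similarity compares rows, so transpose to compare item columns
toy_item_similarity = sk_cosine_similarity(toy_ratings.T)
print(toy_item_similarity.shape)
```

The transpose matters: without it the result would be a user-user matrix rather than an item-item one.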

In [37]:
import pandas as pd
import numpy as np

# Assuming df is your DataFrame with the specified format
# Convert userId column to integer type if needed
if df['userId'].dtype != int:
    df['userId'] = df['userId'].astype(int)

# Get all unique user IDs and movie IDs
all_user_ids = np.unique(df['userId'])
all_movie_ids = np.unique(df['movieId'])

# Create a DataFrame with all combinations of user IDs and movie IDs
all_user_movie_pairs = np.array(np.meshgrid(all_user_ids, all_movie_ids)).T.reshape(-1, 2)
df_pairs = pd.DataFrame(all_user_movie_pairs, columns=['userId', 'movieId'])

# Merge the original DataFrame with all_user_movie_pairs to fill missing ratings with 0
df_filled = pd.merge(df_pairs, df, on=['userId', 'movieId'], how='left').fillna(0)

# Create user-item matrix using pivot_table
user_item_matrix = pd.pivot_table(df_filled, values='rating', index='userId', columns='movieId', fill_value=0)

# Convert user-item matrix to NumPy array for faster computation
user_item_array = user_item_matrix.to_numpy()
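
For comparison, a shorter construction is sketched below on a toy frame with made-up values. When every user and every movie appears at least once in the ratings, `pivot_table` with `fill_value=0` should produce the same kind of matrix without first enumerating all user-movie pairs.

```python
import pandas as pd

# Toy ratings in long format (hypothetical values)
toy_df = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become 0
toy_matrix = toy_df.pivot_table(values='rating', index='userId',
                                columns='movieId', fill_value=0)
print(toy_matrix)
```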


### Train-test split

The train-val-test split is a technique used in machine learning to evaluate the performance of a model. It involves dividing the dataset into three subsets: the training set, the validation set, and the test set.

- The training set is used to train the model and optimize its parameters.
- The validation set is used to fine-tune the model and select the best hyperparameters.
- The test set is used to evaluate the final performance of the model on unseen data.

By using a train-val-test split, we can assess the model's performance on unseen data and ensure that it generalizes well to new examples. It helps prevent overfitting and provides a more reliable estimate of the model's performance in real-world scenarios.
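
To see where the split sizes come from, the sketch below applies the same two `train_test_split` calls to a stand-in array with as many entries as the sample has rating rows (4665). Only the row count is taken from this notebook; the array itself is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 4665 rating rows of the sample (only the row count matters here)
rows = np.arange(4665)

# 20% held out for testing, then 20% of the remainder for validation
train_val_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
train_rows, val_rows = train_test_split(train_val_rows, test_size=0.2, random_state=42)

print(len(train_rows), len(val_rows), len(test_rows))
```

Applying `test_size=0.2` twice leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing.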

In [45]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
train_val, test = train_test_split(user_item_array, test_size=0.2, random_state=42)

# Split the training set into training and validation sets
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

# Print the shapes of the datasets
print("Training set shape:", train.shape)
print("Validation set shape:", val.shape)
print("Test set shape:", test.shape)


Training set shape: (340, 487)
Validation set shape: (85, 487)
Test set shape: (107, 487)


In [47]:
# Calculate cosine similarity between items using NumPy functions
def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    # An item with no ratings in the training rows has a zero norm;
    # return 0 instead of dividing by zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product / (norm_a * norm_b)

# Calculate item-item similarity matrix for train data
item_similarity_matrix_train = np.zeros((train.shape[1], train.shape[1]))

for i in range(train.shape[1]):
    for j in range(i, train.shape[1]):
        item_similarity_matrix_train[i, j] = cosine_similarity(train[:, i], train[:, j])
        item_similarity_matrix_train[j, i] = item_similarity_matrix_train[i, j]

# Later cells refer to the matrix by this shorter name
item_similarity_matrix = item_similarity_matrix_train

# Create a mapping from movie IDs to indices
movie_id_to_index = {movie_id: i for i, movie_id in enumerate(user_item_matrix.columns)}
index_to_movie_id = {i: movie_id for movie_id, i in movie_id_to_index.items()}

# Create a mapping from user IDs to indices
user_id_to_index = {user_id: i for i, user_id in enumerate(user_item_matrix.index)}
index_to_user_id = {i: user_id for user_id, i in user_id_to_index.items()}


In [48]:
print(index_to_user_id)

{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 9, 8: 10, 9: 11, 10: 12, 11: 13, 12: 14, 13: 15, 14: 16, 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 29, 27: 30, 28: 31, 29: 32, 30: 33, 31: 34, 32: 36, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43, 39: 44, 40: 45, 41: 46, 42: 47, 43: 48, 44: 50, 45: 51, 46: 52, 47: 54, 48: 57, 49: 58, 50: 59, 51: 62, 52: 63, 53: 64, 54: 65, 55: 66, 56: 67, 57: 68, 58: 69, 59: 70, 60: 71, 61: 73, 62: 74, 63: 75, 64: 76, 65: 78, 66: 79, 67: 80, 68: 82, 69: 83, 70: 84, 71: 85, 72: 86, 73: 87, 74: 88, 75: 89, 76: 90, 77: 91, 78: 92, 79: 93, 80: 95, 81: 96, 82: 97, 83: 98, 84: 99, 85: 100, 86: 101, 87: 103, 88: 104, 89: 105, 90: 106, 91: 107, 92: 109, 93: 110, 94: 111, 95: 112, 96: 113, 97: 114, 98: 115, 99: 116, 100: 117, 101: 118, 102: 119, 103: 121, 104: 122, 105: 123, 106: 124, 107: 125, 108: 129, 109: 130, 110: 131, 111: 132, 112: 134, 113: 135, 114: 136, 115: 137, 116: 138, 117: 139, 118: 140, 119: 141, 12

In [49]:
print(user_id_to_index)

{1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 9: 7, 10: 8, 11: 9, 12: 10, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15, 18: 16, 19: 17, 20: 18, 21: 19, 22: 20, 23: 21, 24: 22, 25: 23, 27: 24, 28: 25, 29: 26, 30: 27, 31: 28, 32: 29, 33: 30, 34: 31, 36: 32, 38: 33, 39: 34, 40: 35, 41: 36, 42: 37, 43: 38, 44: 39, 45: 40, 46: 41, 47: 42, 48: 43, 50: 44, 51: 45, 52: 46, 54: 47, 57: 48, 58: 49, 59: 50, 62: 51, 63: 52, 64: 53, 65: 54, 66: 55, 67: 56, 68: 57, 69: 58, 70: 59, 71: 60, 73: 61, 74: 62, 75: 63, 76: 64, 78: 65, 79: 66, 80: 67, 82: 68, 83: 69, 84: 70, 85: 71, 86: 72, 87: 73, 88: 74, 89: 75, 90: 76, 91: 77, 92: 78, 93: 79, 95: 80, 96: 81, 97: 82, 98: 83, 99: 84, 100: 85, 101: 86, 103: 87, 104: 88, 105: 89, 106: 90, 107: 91, 109: 92, 110: 93, 111: 94, 112: 95, 113: 96, 114: 97, 115: 98, 116: 99, 117: 100, 118: 101, 119: 102, 121: 103, 122: 104, 123: 105, 124: 106, 125: 107, 129: 108, 130: 109, 131: 110, 132: 111, 134: 112, 135: 113, 136: 114, 137: 115, 138: 116, 139: 117, 140: 118, 141: 119, 14

In [54]:
user_item_array.shape

(532, 487)

In [51]:
# We print the item_similarity_matrix
item_similarity_matrix

array([[1.        , 0.03557272, 0.10452764, ..., 0.        , 0.09204237,
        0.06805615],
       [0.03557272, 1.        , 0.15733562, ..., 0.        , 0.        ,
        0.        ],
       [0.10452764, 0.15733562, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.09204237, 0.        , 0.        , ..., 0.        , 1.        ,
        0.63644583],
       [0.06805615, 0.        , 0.        , ..., 0.        , 0.63644583,
        1.        ]])

In [None]:
item_similarity_matrix.shape

(487, 487)

### Step 2: Neighborhood Selection
- Determine the neighborhood size, i.e., the number of most similar items to consider when predicting ratings for a target item.
- Select the most similar items for each item in your dataset based on their calculated similarities. This forms the neighborhood for each item.

#### Item-Based Neighborhoods and Ratings Aggregation

The following code snippet extends the previous item-based collaborative filtering approach by building a neighborhood for each item and computing an aggregate rating per item.

##### Steps:

1. **Defining Neighborhood Size**:
   - The variable `neighborhood_size` determines the number of most similar items to consider in the neighborhood.

2. **Initializing Data Structure**:
   - An empty dictionary `item_neighborhoods` is initialized to store the neighborhoods for each item.

3. **Iterating Over Items**:
   - For each movie in the dataset:
     - All ratings for the current movie are extracted from the user-item array (`user_item_array`).
     - Ratings aggregation is performed. In this example, the average rating for the movie is computed (unrated entries are stored as 0 and therefore pull this average down), but other aggregation methods can be used.
     - The similarity scores for the current movie are retrieved from the precomputed `item_similarity_matrix`.
     - Similarity scores are sorted in descending order, and the indices of the most similar items (excluding itself) are obtained.
     - These indices are converted back to movie IDs, forming the neighborhood for the current item.
     - The neighborhood for the current item is stored in the `item_neighborhoods` dictionary.

4. **Output**:
   - `item_neighborhoods`: A dictionary where keys are movie IDs, and values are lists of movie IDs representing the neighborhood of each item. Each movie's neighborhood contains the movies whose rating patterns across users are most similar to its own.

##### Note:
- This approach builds neighborhoods from similarity in rating patterns only (as captured by cosine similarity between item rating vectors); content features such as genres are not used.
- Aggregating ratings within item neighborhoods helps in providing more personalized recommendations.
- The choice of rating aggregation method (e.g., mean, median) can impact the quality of recommendations and may need to be adjusted based on the characteristics of the dataset and user preferences.


In [None]:
# Step 2: Neighborhood Selection

# Define the neighborhood size
neighborhood_size = 10

# Initialize an empty dictionary to store item neighborhoods
item_neighborhoods = {}

# Iterate over each item (movie) index in the dataset
for movie_index in range(user_item_matrix.shape[1]):
    # Convert the item index to movie ID
    movie_id = index_to_movie_id[movie_index]
    
    # Extract all ratings for the current movie (unrated entries are 0)
    movie_ratings = user_item_array[:, movie_index]
    
    # Aggregate ratings (e.g., compute the mean rating); kept for reference,
    # the neighborhood selection below does not use it
    movie_avg_rating = np.mean(movie_ratings)
    
    # Retrieve similarity scores for the current movie
    similarity_scores = item_similarity_matrix[movie_index]
    
    # Sort similarity scores in descending order, drop the movie itself,
    # and keep the indices of the most similar items
    ranked_indices = np.argsort(similarity_scores)[::-1]
    most_similar_indices = ranked_indices[ranked_indices != movie_index][:neighborhood_size]
    
    # Convert indices back to movie IDs to form the neighborhood
    neighborhood = [index_to_movie_id[idx] for idx in most_similar_indices]
    
    # Store the neighborhood for the current item in the item_neighborhoods dictionary
    item_neighborhoods[movie_id] = neighborhood

# Output: item_neighborhoods dictionary containing neighborhoods for each item
print(item_neighborhoods)


{1: [4306, 1580, 104, 1391, 551, 953, 708, 2324, 2302, 5991], 4: [146, 835, 1006, 43, 416, 122, 89, 708, 358, 15], 15: [303, 416, 3295, 27178, 8933, 6516, 835, 353, 3641, 146], 30: [3792, 108, 3586, 1415, 363, 2540, 835, 6184, 2988, 2177], 43: [4292, 4113, 1006, 3965, 6603, 1507, 1398, 2757, 4495, 5440], 89: [504, 87192, 26686, 835, 694, 1432, 65810, 605, 3269, 1037], 104: [1, 1580, 2804, 551, 1391, 4306, 708, 7325, 4018, 2302], 108: [3792, 3586, 1415, 363, 30, 2540, 835, 6184, 2988, 2177], 122: [416, 5247, 5984, 65810, 504, 26686, 3680, 4065, 87192, 2378], 146: [2590, 6386, 835, 4822, 1006, 4, 43, 8982, 416, 89], 290: [1816, 3169, 4458, 1040, 5322, 5424, 4926, 6033, 6816, 346], 303: [6595, 2803, 15, 4822, 1882, 67186, 4545, 2606, 4487, 6615], 305: [8933, 6516, 27178, 3295, 2177, 3641, 4289, 728, 2962, 7092], 346: [1816, 4831, 4111, 3792, 3586, 108, 835, 1415, 3308, 3169], 353: [551, 1391, 1266, 1580, 1261, 5991, 1, 2968, 1037, 15], 358: [835, 5424, 4458, 4926, 5322, 6033, 1040, 6816, 

### Step 4: Rating Prediction

- For each target item and user pair where the user hasn't rated the target item, identify the neighborhood of similar items to the target item.
- Predict the rating for the target item using a weighted average of the ratings of the items in its neighborhood, where the weights are the similarities between the items and the target item. The cell below uses the simpler unweighted mean of the user's ratings on the neighborhood items.
- Adjust the prediction based on the user's average rating or other normalization techniques, if necessary.
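
The similarity-weighted average described above can be sketched as follows. The function name `predict_rating`, the toy arrays, and the convention that 0 marks a missing rating are assumptions for illustration, not the notebook's implementation.

```python
import numpy as np

def predict_rating(user_ratings, similarities, neighbor_indices):
    """Similarity-weighted average of the user's ratings on the neighbor items.

    user_ratings: 1-D array of one user's ratings (0 = not rated, an assumption).
    similarities: 1-D array of similarities between the target item and every item.
    neighbor_indices: indices of the target item's neighbors.
    Returns None when the user rated none of the neighbors.
    """
    neighbor_indices = np.asarray(neighbor_indices, dtype=int)
    ratings = user_ratings[neighbor_indices]
    weights = similarities[neighbor_indices]
    rated = ratings > 0
    weight_sum = np.abs(weights[rated]).sum()
    if weight_sum == 0:
        return None
    return float((weights[rated] * ratings[rated]).sum() / weight_sum)

toy_user = np.array([4.0, 0.0, 2.0, 5.0])
toy_sims = np.array([0.5, 1.0, 0.25, 0.0])   # similarities to target item 1
print(predict_rating(toy_user, toy_sims, [0, 2, 3]))
```

Returning `None` when no neighbor was rated leaves the choice of fallback, such as the user's mean rating, to the caller.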

In [None]:
# Train your model and create the item_neighborhoods dictionary using the training data (steps 1-3)

# Initialize an empty DataFrame to store predicted ratings
# (train_df repeats user and movie IDs, so the row and column labels contain duplicates)
predicted_ratings = pd.DataFrame(index=train_df['userId'], columns=train_df['movieId'])

# Iterate over each user-item pair in the training set
for index, row in train_df.iterrows():
    user_id = row['userId']
    movie_id = row['movieId']
    
    # Check if the movie has a neighborhood defined
    if movie_id in item_neighborhoods:
        neighborhood = item_neighborhoods[movie_id]
        
        # Keep only the neighborhood movies that the user has rated in the training set
        filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in train_df[train_df['userId'] == user_id]['movieId'].values]
        
        # Check if there are ratings available in the filtered neighborhood
        if filtered_neighborhood:
            # Calculate the predicted rating for the target movie based on the neighborhood
            predicted_rating = np.mean([train_df[(train_df['userId'] == user_id) & (train_df['movieId'] == neighbor_movie_id)]['rating'].values[0] for neighbor_movie_id in filtered_neighborhood])
        else:
            # If the filtered neighborhood is empty, assign the mean rating of all movies for the user in the training set
            predicted_rating = np.mean(train_df[train_df['userId'] == user_id]['rating'].values)
        
        # Assign the predicted rating to the corresponding cell in the DataFrame
        predicted_ratings.at[user_id, movie_id] = predicted_rating

# predicted_ratings now contains predicted ratings for items in the train_df set
predicted_ratings.head()


userId \ movieId,4306,1,89864,85881,100507,3034,8604,551,2968,4321,...,2378,7154,97913,605,8405,64614,2171,3301,5991,26150
370,2.75,,,,,,,,,,...,,,,,,,,,,
112,,3.75,,,,,,3.5,,,...,,,,,,,,,,
62,3.642857,,4.5,,,,,,,,...,,,,,,,,,,
89,3.0,,2.5,4.416667,4.333333,,,,,,...,,,,,,,,,,
89,3.0,,2.5,4.416667,4.333333,,,,,,...,,,,,,,,,,


In [None]:
# We count the NaN values in the predicted_ratings dataframe
predicted_ratings.isnull().sum().sum()

11217611

In [None]:
def get_top_recommendations(user_id):

    """
    Get top movie recommendations for a given user using predicted ratings.

    Parameters:
        user_id (int): ID of the user for whom recommendations are to be generated.

    Returns:
        list: Top movie titles recommended for the user.
    """

    # Get the positional index of the user
    user_index = user_id_to_index[user_id]
    
    # Select the row of predicted_ratings at that position
    # (rows of predicted_ratings follow the order of train_df)
    user_ratings = predicted_ratings.iloc[user_index]
    
    # Select the movies that have no predicted rating in this row (NaN entries)
    unrated_movies = user_ratings[user_ratings.isna()].index
    
    # Sort the selected movies by predicted rating in descending order
    # (their values are all NaN, so this ordering carries no rating information)
    sorted_unrated_movies = user_ratings.loc[unrated_movies].sort_values(ascending=False)
    
    # Initialize a list to store the top movie titles
    top_movie_titles = []
    
    # Initialize a set to keep track of movie IDs that have been added to the recommendations
    added_movie_ids = set()
    
    # Iterate over the sorted unrated movies
    for movie_id in sorted_unrated_movies.index:
        # Check if the movie ID has already been added to the recommendations
        if movie_id not in added_movie_ids:
            # Get the movie title corresponding to the movie ID and append it to the list
            top_movie_titles.append(df.loc[df['movieId'] == movie_id, 'title'].values[0])
            
            # Add the movie ID to the set of added movie IDs
            added_movie_ids.add(movie_id)
            
            # If we have already added 5 recommendations, stop iterating
            if len(top_movie_titles) == 5:
                break
    
    return top_movie_titles
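
Since the function above ranks entries that have no predicted rating, a variant that scores unrated items directly from the similarity matrix may be useful. The sketch below is one possible approach on toy data; `top_n_unrated`, the toy arrays, and the convention that 0 means "not rated" are assumptions.

```python
import numpy as np

def top_n_unrated(user_ratings, item_similarity, n=5):
    """Rank the items a user has not rated by a similarity-weighted score.

    user_ratings: 1-D array of one user's ratings (0 = not rated, an assumption).
    item_similarity: square item-item similarity matrix.
    Returns the indices of up to n unrated items, best first.
    """
    rated = user_ratings > 0
    # Weighted sum of the user's ratings, normalised by the similarity mass of rated items
    weight_sums = np.abs(item_similarity[:, rated]).sum(axis=1)
    raw_scores = item_similarity[:, rated] @ user_ratings[rated]
    scores = np.divide(raw_scores, weight_sums,
                       out=np.zeros_like(raw_scores), where=weight_sums > 0)
    scores[rated] = -np.inf            # never recommend what was already rated
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked if not rated[i]][:n]

toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.9],
    [0.2, 0.2, 0.9, 1.0],
])
toy_user = np.array([5.0, 0.0, 0.0, 1.0])    # rated items 0 and 3
print(top_n_unrated(toy_user, toy_similarity, n=2))
```

The returned indices can be mapped back to movie IDs and titles with a lookup such as `index_to_movie_id`.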


In [None]:
top_recommendations = []

# Iterate over the actual user IDs stored in the index_to_user_id mapping
for actual_user_id in index_to_user_id.values():
    recommendations = get_top_recommendations(actual_user_id)
    top_recommendations.append(recommendations)

# Output: List of top movie recommendations for each user
top_recommendations



[['City Slickers ',
  'Truth About Cats & Dogs, The ',
  'Source Code ',
  'Unforgiven ',
  'Perfect Blue '],
 ['Men in Black (a.k.a. MIB) ',
  'Truth About Cats & Dogs, The ',
  'Source Code ',
  'Children of Men ',
  'Perfect Blue '],
 ['Men in Black (a.k.a. MIB) ',
  'City Slickers ',
  'Source Code ',
  'Unforgiven ',
  'Children of Men '],
 ['City Slickers ',
  'Truth About Cats & Dogs, The ',
  'Unforgiven ',
  'Children of Men ',
  'Perfect Blue '],
 ['Men in Black (a.k.a. MIB) ',
  'Truth About Cats & Dogs, The ',
  'Source Code ',
  'Children of Men ',
  'Perfect Blue '],
 ['City Slickers ',
  'Truth About Cats & Dogs, The ',
  'Source Code ',
  'Unforgiven ',
  'Perfect Blue '],
 ['City Slickers ',
  'Truth About Cats & Dogs, The ',
  'Source Code ',
  'Unforgiven ',
  'Children of Men '],
 ['Men in Black (a.k.a. MIB) ',
  'City Slickers ',
  'Truth About Cats & Dogs, The ',
  'Source Code ',
  'Unforgiven '],
 ['Unforgiven ',
  'Absolutely Anything ',
  "Hitchhiker's Guide t

### Step 5: Model Evaluation
- Evaluate the performance of your ItemKNN algorithm using appropriate evaluation metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), or others.
- Split your dataset into training and testing sets to assess the model's predictive accuracy on unseen data.
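
The metrics named above can be computed from paired actual and predicted ratings as in the following sketch. The helper name `regression_metrics` and the four toy ratings are made up for illustration.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R-squared) for paired actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R-squared compares the squared error with the variance of the actual ratings
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = 1 - np.sum(residuals ** 2) / ss_tot if ss_tot > 0 else float('nan')
    return float(mae), float(rmse), float(r2)

print(regression_metrics([4.0, 2.0, 5.0, 3.0], [3.5, 2.5, 4.0, 3.0]))
```

RMSE penalizes large errors more strongly than MAE, so reporting both gives a sense of how much of the error comes from a few bad predictions.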

In [None]:
import numpy as np

def evaluate_model(test_data, item_neighborhoods, user_item_matrix):
    """
    Evaluate Item-based Rating Prediction (ItemKNN) model using MAE, RMSE, and R-squared.
    
    Parameters:
        test_data (DataFrame): Testing data with 'userId', 'movieId' and 'rating' columns.
        item_neighborhoods (dict): Dictionary containing item neighborhoods generated by the model.
        user_item_matrix (DataFrame): User-item matrix containing ratings (0 = not rated).
    
    Returns:
        float: Mean Absolute Error (MAE)
        float: Root Mean Squared Error (RMSE)
        float: Coefficient of determination (R-squared)
    """
    # Work on a NumPy array so rows and columns can be addressed by position
    ratings = user_item_matrix.to_numpy()
    
    actual_ratings = []
    predictions = []
    
    # Iterate over the rows of the test data; iterating over a DataFrame directly
    # yields its column names, so itertuples is used to get one interaction at a time
    for row in test_data[['userId', 'movieId', 'rating']].itertuples(index=False):
        # Skip users that are not part of the user-item matrix
        if row.userId not in user_id_to_index:
            continue
        user_ratings = ratings[user_id_to_index[row.userId]]
        
        # Ratings the user gave to the neighbors of the current item
        neighborhood = item_neighborhoods.get(row.movieId, [])
        neighborhood_indices = np.array([movie_id_to_index[m] for m in neighborhood], dtype=int)
        neighbor_ratings = user_ratings[neighborhood_indices]
        neighbor_ratings = neighbor_ratings[neighbor_ratings > 0]
        
        if neighbor_ratings.size > 0:
            # Predict the rating as the mean of the user's ratings in the neighborhood
            predicted_rating = neighbor_ratings.mean()
        else:
            # If the user has not rated any item in the neighborhood, use the user's mean rating
            rated = user_ratings[user_ratings > 0]
            if rated.size == 0:
                continue
            predicted_rating = rated.mean()
        
        actual_ratings.append(row.rating)
        predictions.append(predicted_rating)
    
    # Handle case where there are no predictions
    if not actual_ratings:
        return float('nan'), float('nan'), float('nan')
    
    actual = np.array(actual_ratings, dtype=float)
    residuals = actual - np.array(predictions, dtype=float)
    
    # Calculate Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(np.square(residuals)))
    total_variance = np.sum(np.square(actual - actual.mean()))
    r_squared = 1 - np.sum(np.square(residuals)) / total_variance if total_variance > 0 else float('nan')
    
    return mae, rmse, r_squared


# Example usage:
mae, rmse, r_squared = evaluate_model(test_df, item_neighborhoods, user_item_matrix)
print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)
print("Coefficient of determination (R-squared):", r_squared)

### Step 6: Parameter Tuning
- Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of your ItemKNN algorithm.
- Use cross-validation or other techniques to tune these parameters and avoid overfitting.

1. **Import Necessary Libraries:**
   - We import the required libraries for performing grid search cross-validation (`GridSearchCV`), creating custom scorers (`make_scorer`), and utilizing the `NearestNeighbors` algorithm.

2. **Define Cosine Similarity Function:**
   - We define a custom function `cosine_similarity` to compute the cosine similarity between two vectors. This function calculates the dot product of the vectors and divides it by the product of their norms.

3. **Define Custom Scorer:**
   - We create a custom scorer `cosine_similarity_scorer` using `make_scorer`, which enables us to use cosine similarity as the scoring metric during grid search cross-validation.

4. **Define Parameter Grid:**
   - We specify a parameter grid `param_grid` containing the hyperparameters to be tuned. In this case, we're tuning the number of neighbors (`n_neighbors`) and the distance metric (`metric`) for the `NearestNeighbors` algorithm.

5. **Initialize NearestNeighbors Model:**
   - We initialize the `NearestNeighbors` model without specifying any hyperparameters.

6. **Create GridSearchCV Object:**
   - We create a `GridSearchCV` object named `grid_search` with the specified parameter grid, cross-validation strategy (5-fold cross-validation), and custom scoring metric (`cosine_similarity_scorer`).

7. **Fit the Data:**
   - We fit the `user_item_matrix` data to the `grid_search` object to perform hyperparameter tuning. `user_item_matrix` contains the user-item ratings built in Step 1, not the item-item similarity matrix.

8. **Get Best Hyperparameters:**
   - After fitting the data, we retrieve the best hyperparameters selected by the grid search using the `best_params_` attribute of the `grid_search` object.

9. **Print Best Parameters:**
   - Finally, we print the best hyperparameters obtained from the grid search.



`scikit-learn (sklearn)`:

- Scikit-learn is a popular machine learning library in Python that provides simple and efficient tools for data analysis and modeling.
- It includes various modules for tasks such as classification, regression, clustering, dimensionality reduction, and model selection.
- The GridSearchCV class from scikit-learn is used for hyperparameter tuning through grid search along with cross-validation.
- The make_scorer function allows you to create a custom scoring function for use with GridSearchCV.
- The NearestNeighbors class provides functionality for unsupervised nearest neighbors learning, which can be used for tasks such as finding k-nearest neighbors for a given data point.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Define a function to compute cosine similarity
def cosine_similarity(X, Y):
    """
    This function computes the cosine similarity between two vectors X and Y.

    Parameters:
        X (ndarray): First vector
        Y (ndarray): Second vector

    Returns:
        float: Cosine similarity between X and Y
    """
    return np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

# Define a custom scorer based on cosine similarity
cosine_similarity_scorer = make_scorer(cosine_similarity)

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(user_item_matrix)  # user_item_matrix holds the user-item ratings

# Get the best hyperparameters
best_params = grid_search.best_params_

print("Best parameters:", best_params)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Best parameters: {'metric': 'cosine', 'n_neighbors': 5}
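
The `TypeError` above arises because a scorer built with `make_scorer` expects true labels (`y_true`), while `NearestNeighbors` is fitted here without any targets. With every candidate failing to score, the reported best parameters are likely just the first grid entry rather than a measured optimum. A minimal alternative sketch, assuming held-out validation ratings are available, loops over neighborhood sizes and compares validation RMSE. All names and the toy data below are hypothetical.

```python
import numpy as np

def item_cosine_matrix(ratings):
    """Item-item cosine similarity for a user-item array (all-zero items get similarity 0)."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0
    normalised = ratings / norms
    return normalised.T @ normalised

def validation_rmse(train, val_triples, k):
    """RMSE on held-out (user, item, rating) triples using the k most similar items."""
    sim = item_cosine_matrix(train)
    np.fill_diagonal(sim, -np.inf)          # an item is not its own neighbor
    errors = []
    for user, item, rating in val_triples:
        neighbors = np.argsort(sim[item])[::-1][:k]
        rated = neighbors[train[user, neighbors] > 0]
        if rated.size > 0:
            prediction = train[user, rated].mean()
        else:
            # Fall back to the user's mean rating, then to the global mean
            user_rated = train[user][train[user] > 0]
            prediction = user_rated.mean() if user_rated.size else train[train > 0].mean()
        errors.append(rating - prediction)
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy training matrix (users x items, 0 = not rated) and held-out validation ratings
toy_train = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 1.0, 1.0],
    [0.0, 5.0, 2.0, 0.0],
    [1.0, 1.0, 5.0, 4.0],
])
toy_val = [(0, 2, 1.0), (1, 1, 4.0), (2, 0, 5.0)]

# Evaluate each candidate neighborhood size and keep the one with the lowest RMSE
results = {k: validation_rmse(toy_train, toy_val, k) for k in (1, 2, 3)}
best_k = min(results, key=results.get)
print(results, best_k)
```

The same loop could be extended to other similarity metrics by swapping out `item_cosine_matrix`.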


### Step 7: Deployment

Once you're satisfied with the performance of your ItemKNN model, deploy it in your recommendation system to provide personalized recommendations to users based on their preferences and behaviors.

# Item-based Classification (ItemKNN)

### Step 1: Data preparation
- Ensure you have a dataset with user-item interactions. Each interaction should include the user ID, item ID, and the corresponding label or class for classification.
- Preprocess the data as needed, including handling missing values, encoding categorical variables, and splitting into training and testing sets.
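
If the dataset only carries numeric ratings, one possible way to obtain a class label is to threshold them. The cut-off of 3.5, the column name `liked`, and the toy frame below are assumptions for illustration; the notebook does not prescribe a labelling rule here.

```python
import pandas as pd

# Toy interactions (hypothetical values)
toy_df = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating':  [4.5, 2.0, 3.5],
})

LIKE_THRESHOLD = 3.5  # hypothetical cut-off between "liked" (1) and "not liked" (0)
toy_df['liked'] = (toy_df['rating'] >= LIKE_THRESHOLD).astype(int)
print(toy_df)
```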

For this step we are going to reuse the `user_item_matrix` that we created for the first model.

### Step 2: Compute Item Similarity


In [None]:
import numpy as np

def item_neighborhood_selection(similarity_matrix, k=None, threshold=None):
    """
    Select a subset of similar items for each item based on similarity matrix.

    Parameters:
        similarity_matrix (numpy.ndarray): Item-item similarity matrix.
        k (int): Number of similar items to select (optional).
        threshold (float): Similarity threshold for selecting similar items (optional).

    Returns:
        dict: Dictionary containing similar items for each item.
    """
    num_items = similarity_matrix.shape[0]
    item_neighborhood = {}

    for i in range(num_items):
        if k is not None:
            # Rank all items by similarity, drop the item itself, then keep the top k
            ranked_indices = np.argsort(similarity_matrix[i])[::-1]
            similar_items_indices = ranked_indices[ranked_indices != i][:k]
        elif threshold is not None:
            # Select items with similarity above threshold (excluding the item itself)
            similar_items_indices = np.where(similarity_matrix[i] > threshold)[0]
            similar_items_indices = similar_items_indices[similar_items_indices != i]
        else:
            raise ValueError("Either k or threshold must be provided")

        item_neighborhood[i] = similar_items_indices

    return item_neighborhood

# Example usage:
# Assuming you have an item-item similarity matrix 'item_similarity_matrix'
# and you want to select top-5 similar items for each item
k = 5
item_neighborhood = item_neighborhood_selection(item_similarity_matrix, k=k)


### Step 3: Class Label Propagation

In [None]:
print(item_neighborhood)

{0: array([166,  60,   6,  50,  24], dtype=int64), 1: array([ 9, 34, 38,  4, 19], dtype=int64), 2: array([ 11,  19, 126, 284, 271], dtype=int64), 3: array([145,   7, 134,  52,  16], dtype=int64), 4: array([165, 155,  38, 149, 231], dtype=int64), 5: array([ 22, 379, 282,  34,  29], dtype=int64), 6: array([  0,  60, 105,  24,  50], dtype=int64), 7: array([145, 134,  52,  16,   3], dtype=int64), 8: array([ 19, 193, 213, 341,  22], dtype=int64), 9: array([ 99, 225,  34, 182,  38], dtype=int64), 10: array([ 66, 124, 169,  40, 196], dtype=int64), 11: array([230, 104,   2, 182,  67], dtype=int64), 12: array([271, 228, 284, 126,  83], dtype=int64), 13: array([ 66, 184, 154, 145, 134], dtype=int64), 14: array([24, 50, 44, 60, 43], dtype=int64), 15: array([ 34, 197, 169, 185, 196], dtype=int64), 16: array([145, 153, 134, 140,   7], dtype=int64), 17: array([ 10,  82, 167,  85,  42], dtype=int64), 18: array([241, 183, 177,  73, 451], dtype=int64), 19: array([ 34,   8,   2,   5, 380], dtype=int64),

In [None]:
def class_label_propagation(item_neighborhood, user_item_matrix):
    """
    Propagate class labels from similar items to the target item using majority voting.

    Parameters:
        item_neighborhood (dict): Dictionary containing similar items for each item.
        user_item_matrix (pandas.DataFrame): User-item matrix containing class labels.

    Returns:
        dict: Dictionary containing propagated class labels for each item.
    """
    propagated_labels = {}

    for item_id, similar_items in item_neighborhood.items():
        label_counts = {}
        for similar_item_id in similar_items:
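            # item_neighborhood holds positional indices, while this test compares them with the
            # column labels of user_item_matrix; it only matches when the two happen to coincide
            if similar_item_id in user_item_matrix.columns:
                label = user_item_matrix[similar_item_id].mode().iloc[0]  # Most frequent value in the item's column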
                label_counts[label] = label_counts.get(label, 0) + 1

        if label_counts:
            propagated_label = max(label_counts, key=label_counts.get)
            propagated_labels[item_id] = propagated_label

    return propagated_labels

# Example usage:
propagated_labels = class_label_propagation(item_neighborhood, user_item_matrix)

print(propagated_labels)


{1: 0.0, 11: 0.0, 14: 0.0, 28: 0.0, 29: 0.0, 38: 0.0, 44: 0.0, 49: 0.0, 63: 0.0, 72: 0.0, 74: 0.0, 78: 0.0, 89: 0.0, 97: 0.0, 106: 0.0, 108: 0.0, 109: 0.0, 114: 0.0, 119: 0.0, 123: 0.0, 130: 0.0, 144: 0.0, 148: 0.0, 156: 0.0, 157: 0.0, 158: 0.0, 159: 0.0, 167: 0.0, 170: 0.0, 186: 0.0, 194: 0.0, 204: 0.0, 222: 0.0, 226: 0.0, 233: 0.0, 237: 0.0, 252: 0.0, 266: 0.0, 273: 0.0, 278: 0.0, 295: 0.0, 298: 0.0, 301: 0.0, 314: 0.0, 320: 0.0, 322: 0.0, 324: 0.0, 348: 0.0, 350: 0.0, 367: 0.0, 374: 0.0, 377: 0.0, 384: 0.0, 388: 0.0, 389: 0.0, 390: 0.0, 393: 0.0, 394: 0.0, 396: 0.0, 399: 0.0, 404: 0.0, 406: 0.0, 410: 0.0, 412: 0.0, 414: 0.0, 426: 0.0, 430: 0.0, 434: 0.0, 446: 0.0, 449: 0.0, 451: 0.0, 460: 0.0, 463: 0.0, 466: 0.0, 467: 0.0, 468: 0.0, 481: 0.0}
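
The neighborhoods are keyed by matrix position, so a variant that maps positions back to item labels and ignores unrated entries may be useful. The sketch below is one possible way to do that, not the notebook's method: `propagate_labels_by_position`, the toy arrays, and the convention that 0 means "not rated" are all assumptions. It pools the observed ratings of every neighbor and takes the most frequent value.

```python
from collections import Counter
import numpy as np

def propagate_labels_by_position(item_neighborhood, ratings, item_ids):
    """
    Majority vote over the observed ratings of each item's neighbors.

    item_neighborhood maps item positions to arrays of neighbor positions,
    ratings is a users x items array, and item_ids[j] is the label of column j.
    """
    ratings = np.asarray(ratings, dtype=float)
    propagated = {}
    for position, neighbors in item_neighborhood.items():
        votes = Counter()
        for neighbor in neighbors:
            column = ratings[:, neighbor]
            observed = column[column != 0]  # assumption: 0 marks "not rated"
            votes.update(observed.tolist())
        if votes:
            # Key the result by the item's label rather than its position
            propagated[item_ids[position]] = votes.most_common(1)[0][0]
    return propagated

# Toy data: 3 users x 3 items with hypothetical item ids
toy_ratings = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 0.0],
    [0.0, 4.0, 2.0],
])
toy_item_ids = [101, 202, 303]
toy_neighborhood = {0: np.array([1]), 1: np.array([0, 2]), 2: np.array([1])}
toy_propagated = propagate_labels_by_position(toy_neighborhood, toy_ratings, toy_item_ids)
print(toy_propagated)
```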


### Step 4: Classify Items

- For each item in the testing set, find its nearest neighbors based on the precomputed item neighborhoods.
- Determine the class labels of these neighbors.
- Use a strategy such as majority voting to assign the class label to the item.

In [None]:
def perform_classification(item_neighborhood, propagated_labels):
    """
    Perform classification by assigning the most frequent class label in the neighborhood to each item.

    Parameters:
        item_neighborhood (dict): Dictionary containing similar items for each item.
        propagated_labels (dict): Propagated class labels for each item.

    Returns:
        dict: Dictionary containing classified labels for each item.
    """
    classified_labels = {}

    for item_id, similar_items in item_neighborhood.items():
        label_counts = {}
        for similar_item_id in similar_items:
            propagated_label = propagated_labels.get(similar_item_id)
            if propagated_label is not None:
                label_counts[propagated_label] = label_counts.get(propagated_label, 0) + 1

        if label_counts:
            classified_label = max(label_counts, key=label_counts.get)
            classified_labels[item_id] = classified_label

    return classified_labels

# Example usage:
classified_labels = perform_classification(item_neighborhood, propagated_labels)
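
Majority voting treats every neighbor equally. A common alternative is to weight each neighbor's vote by its similarity to the target item. The sketch below illustrates that idea with a hypothetical helper `weighted_vote`. The labels and similarity values are made up for the example.

```python
from collections import defaultdict

def weighted_vote(neighbor_labels, neighbor_similarities):
    """Return the label with the largest total similarity among the neighbors."""
    scores = defaultdict(float)
    for label, similarity in zip(neighbor_labels, neighbor_similarities):
        scores[label] += similarity
    return max(scores, key=scores.get)

# Two weak "disliked" neighbors are outweighed by one strong "liked" neighbor
toy_vote = weighted_vote(["liked", "disliked", "disliked"], [0.9, 0.3, 0.2])
print(toy_vote)
```

In this toy case a plain majority vote would return "disliked", while the weighted vote returns "liked" because the single close neighbor carries more weight.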


### Step 5: Evaluate Model Performance

- Assess the performance of the ItemKNN classifier using evaluation metrics such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC).
- Compare the performance of different models with varying neighborhood sizes or similarity metrics to identify the optimal configuration.

In [None]:
def evaluate_model_performance(test_df, predicted_ratings):
    """
    Evaluate the performance of the model using Mean Absolute Error (MAE), R-squared error, and R-squared metric.
    
    Parameters:
        test_df (pandas.DataFrame): DataFrame containing the testing data with columns 'userId', 'movieId', and 'rating'.
        predicted_ratings (pandas.DataFrame): DataFrame containing the predicted ratings for each user-item pair.
    
    Returns:
        float: Mean Absolute Error (MAE)
        float: R-squared error
        float: R-squared metric
    """
    # Initialize variables to store sum of absolute errors and sum of squared errors
    total_absolute_error = 0
    total_squared_error = 0
    
    # Count of total predictions
    count_predictions = 0
    
    # Initialize variables for R-squared calculation
    actual_ratings = []
    predicted_ratings_list = []
    
    # Iterate over each user-item pair in the test data
    for index, row in test_df.iterrows():
        user_id = row['userId']
        movie_id = row['movieId']
        actual_rating = row['rating']
        
        # Check that both the user ID and the movie ID exist in the predicted ratings
        if user_id in predicted_ratings.index and movie_id in predicted_ratings.columns:
            # Get the predicted rating
            predicted_rating = predicted_ratings.loc[user_id, movie_id]
            
            # Calculate absolute and squared errors
            absolute_error = abs(actual_rating - predicted_rating)
            squared_error = (actual_rating - predicted_rating) ** 2
            
            # Accumulate errors
            total_absolute_error += absolute_error
            total_squared_error += squared_error
            
            # Increment count of predictions
            count_predictions += 1
            
            # Append actual and predicted ratings for R-squared calculation
            actual_ratings.append(actual_rating)
            predicted_ratings_list.append(predicted_rating)
    
    # Guard against an empty overlap between test_df and predicted_ratings
    if count_predictions == 0:
        raise ValueError("No (userId, movieId) pair from test_df was found in predicted_ratings.")

    # Calculate Mean Absolute Error (MAE)
    mae = total_absolute_error / count_predictions

    # Calculate R-squared as 1 - SSR / SST
    mean_actual_rating = sum(actual_ratings) / len(actual_ratings)
    sst = sum((rating - mean_actual_rating) ** 2 for rating in actual_ratings)
    ssr = total_squared_error  # Sum of squared residuals accumulated in the loop above
    r_squared_error = 1 - (ssr / sst) if sst > 0 else float("nan")

    # The "R-squared metric" is built from the same residuals, so it equals r_squared_error
    r_squared = r_squared_error

    return mae, r_squared_error, r_squared
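
The function above reports regression metrics, whereas the bullets for this step mention accuracy, precision, recall and F1-score. As a minimal sketch of those classification metrics for binary labels, the hypothetical helper `binary_classification_metrics` is written in plain Python. The toy labels are made up, and how ratings would be binarized (for example "liked" versus "not liked") is left as an assumption.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = liked, 0 = not liked (assumption)
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_metrics = binary_classification_metrics(toy_true, toy_pred)
print(toy_metrics)
```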


### Step 6: Recommendation Generation

Once the model is trained and evaluated, you can use it to generate item recommendations for users. For a given user, recommend items that are highly rated or predicted to be liked based on the model's predictions.

In [None]:
def predict_user_labels_and_recommendations(user_item_matrix, item_neighborhood, classified_labels, top_n=5):
    """
    Predict class labels for users based on the majority class label in the item neighborhood.
    Also, generate item recommendations for each user.

    Parameters:
        user_item_matrix (pandas.DataFrame): User-item matrix containing ratings or interactions.
        item_neighborhood (dict): Dictionary containing similar items for each item.
        classified_labels (dict): Classified labels for each item.
        top_n (int): Number of top recommendations to generate (default is 5).

    Returns:
        tuple: Tuple containing predicted class labels for each user and recommended items.
    """
    user_labels = {}
    user_recommendations = {}

    for user_id, user_ratings in user_item_matrix.iterrows():
        label_counts = {}
        recommendations = []

        # Iterate over every item column in the user's row (rated or not)
        for item_id, rating in user_ratings.items():
            # Check if the item has a classified label and is in the item neighborhood
            if item_id in classified_labels and item_id in item_neighborhood:
                # Get the classified label for the item
                classified_label = classified_labels[item_id]
                # Increment the count of the classified label in the neighborhood
                label_counts[classified_label] = label_counts.get(classified_label, 0) + 1
                # Add the item to recommendations
                recommendations.extend(item_neighborhood[item_id][:top_n])
        
        # Determine the predicted label for the user based on the majority class label in the neighborhood
        if label_counts:
            predicted_label = max(label_counts, key=label_counts.get)
            user_labels[user_id] = predicted_label
            # Drop candidates whose id matches an item column (user_ratings.index lists every item, not only the rated ones)
            recommendations = [item_id for item_id in recommendations if item_id not in user_ratings.index]
            user_recommendations[user_id] = recommendations[:top_n]

    return user_labels, user_recommendations

# Example usage:
predicted_user_labels, user_recommendations = predict_user_labels_and_recommendations(user_item_matrix, item_neighborhood, classified_labels)
print("Predicted class labels for users:", predicted_user_labels)
print("Recommended items for users:", user_recommendations)


Predicted class labels for users: {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0, 22: 0.0, 23: 0.0, 24: 0.0, 25: 0.0, 27: 0.0, 28: 0.0, 29: 0.0, 30: 0.0, 31: 0.0, 32: 0.0, 33: 0.0, 34: 0.0, 36: 0.0, 38: 0.0, 39: 0.0, 40: 0.0, 41: 0.0, 42: 0.0, 43: 0.0, 44: 0.0, 45: 0.0, 46: 0.0, 47: 0.0, 48: 0.0, 50: 0.0, 51: 0.0, 52: 0.0, 54: 0.0, 57: 0.0, 58: 0.0, 59: 0.0, 62: 0.0, 63: 0.0, 64: 0.0, 65: 0.0, 66: 0.0, 67: 0.0, 68: 0.0, 69: 0.0, 70: 0.0, 71: 0.0, 73: 0.0, 74: 0.0, 75: 0.0, 76: 0.0, 78: 0.0, 79: 0.0, 80: 0.0, 82: 0.0, 83: 0.0, 84: 0.0, 85: 0.0, 86: 0.0, 87: 0.0, 88: 0.0, 89: 0.0, 90: 0.0, 91: 0.0, 92: 0.0, 93: 0.0, 95: 0.0, 96: 0.0, 97: 0.0, 98: 0.0, 99: 0.0, 100: 0.0, 101: 0.0, 103: 0.0, 104: 0.0, 105: 0.0, 106: 0.0, 107: 0.0, 109: 0.0, 110: 0.0, 111: 0.0, 112: 0.0, 113: 0.0, 114: 0.0, 115: 0.0, 116: 0.0, 117: 0.0, 118: 0.0, 119: 0.0, 121: 0.0, 122: 0.0, 123: 0.0
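
An alternative way to generate recommendations is to score each unrated item by the similarity-weighted sum of the ratings the user has already given, then keep the top scores. The sketch below assumes a precomputed item-item similarity array and treats 0 as "not rated". The helper name `recommend_for_user` and the toy values are hypothetical.

```python
import numpy as np

def recommend_for_user(user_ratings, item_similarity, top_n=5):
    """Rank unrated items by the similarity-weighted sum of the user's ratings."""
    user_ratings = np.asarray(user_ratings, dtype=float)
    item_similarity = np.asarray(item_similarity, dtype=float)
    scores = item_similarity @ user_ratings
    scores[user_ratings != 0] = -np.inf  # assumption: 0 marks "not rated"
    ranked = np.argsort(scores)[::-1]
    # Return item positions, skipping the items the user already rated
    return [int(j) for j in ranked[:top_n] if np.isfinite(scores[j])]

# Toy similarity between 4 items and one user's ratings
toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])
toy_user = [5.0, 0.0, 0.0, 1.0]
toy_recommendations = recommend_for_user(toy_user, toy_similarity, top_n=2)
print(toy_recommendations)
```

The returned values are column positions. Mapping them back to movie ids would require the column labels of the rating matrix.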

### Step 7: Parameter Tuning

Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of your ItemKNN algorithm.
Use techniques like cross-validation to tune these parameters and avoid overfitting.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Define a function to compute cosine similarity
def cosine_similarity(X, Y):
    """
    This function computes the cosine similarity between two vectors X and Y.

    Parameters:
        X (ndarray): First vector
        Y (ndarray): Second vector

    Returns:
        float: Cosine similarity between X and Y
    """
    return np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

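# Define a custom scorer based on cosine similarity
# Note: make_scorer wraps a metric that is called as metric(y_true, y_pred). Because fit() below
# receives no y, scikit-learn calls the scorer without y_true, which raises the TypeError shown in
# the output, so the reported best parameters are not a meaningful selection.
cosine_similarity_scorer = make_scorer(cosine_similarity)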

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model_classification = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model_classification, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(user_item_matrix)  # user_item_matrix holds the user-item ratings (rows: users, columns: items); no y is passed

# Get the best hyperparameters
best_params_classification = grid_search.best_params_

print("Best parameters:", best_params_classification)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'


Best parameters: {'metric': 'cosine', 'n_neighbors': 5}
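
Since `NearestNeighbors` does not predict anything by itself, one possible alternative is a manual grid loop that hides some known ratings, predicts them with an item-based KNN, and keeps the configuration with the lowest MAE. The sketch below uses a single random hold-out split instead of cross-validation. All helper names, the toy data, the 0 = "not rated" convention, and the `1 / (1 + distance)` conversion for the euclidean case are assumptions.

```python
import itertools
import numpy as np

def item_similarity(ratings, metric):
    """Item-item similarity from a users x items array for the given metric."""
    if metric == "cosine":
        norms = np.linalg.norm(ratings, axis=0)
        normalized = ratings / np.where(norms == 0, 1.0, norms)
        return normalized.T @ normalized
    if metric == "euclidean":
        sq = (ratings ** 2).sum(axis=0)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * ratings.T @ ratings, 0.0)
        return 1.0 / (1.0 + np.sqrt(d2))  # turn a distance into a similarity
    raise ValueError(f"Unknown metric: {metric}")

def predict_rating(train, similarity, user, item, k):
    """Similarity-weighted mean of the user's ratings on the k most similar rated items."""
    rated = np.flatnonzero(train[user])
    rated = rated[rated != item]
    if rated.size == 0:
        return None
    top = rated[np.argsort(similarity[item, rated])[::-1][:k]]
    weights = similarity[item, top]
    if weights.sum() <= 0:
        return float(train[user, top].mean())
    return float(weights @ train[user, top] / weights.sum())

def tune_item_knn(ratings, param_grid, holdout_fraction=0.2, seed=0):
    """Return (best_params, best_mae) over the grid using one random hold-out split."""
    ratings = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    users, items = np.nonzero(ratings)  # assumption: 0 marks "not rated"
    n_test = max(1, int(holdout_fraction * users.size))
    test_idx = rng.choice(users.size, size=n_test, replace=False)
    train = ratings.copy()
    train[users[test_idx], items[test_idx]] = 0.0  # hide the held-out ratings

    best_params, best_mae = None, np.inf
    for metric, k in itertools.product(param_grid["metric"], param_grid["n_neighbors"]):
        similarity = item_similarity(train, metric)
        errors = []
        for u, i in zip(users[test_idx], items[test_idx]):
            prediction = predict_rating(train, similarity, u, i, k)
            if prediction is not None:
                errors.append(abs(ratings[u, i] - prediction))
        if errors and np.mean(errors) < best_mae:
            best_params = {"metric": metric, "n_neighbors": k}
            best_mae = float(np.mean(errors))
    return best_params, best_mae

# Toy data: 30 users x 8 items with random ratings in 0..5
toy_ratings = np.random.default_rng(42).integers(0, 6, size=(30, 8)).astype(float)
toy_grid = {"n_neighbors": [2, 4], "metric": ["cosine", "euclidean"]}
best_params, best_mae = tune_item_knn(toy_ratings, toy_grid)
print(best_params, round(best_mae, 3))
```

A single split gives a noisy estimate. Repeating the loop over several seeds and averaging the MAE would be closer in spirit to cross-validation.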


### Step 8: Deployment

- If you want to predict the class labels for new items, repeat steps 4 and 5 using the entire dataset (training + testing) to build the final model.
- Use the trained model to predict the class labels for new items based on their nearest neighbors.
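
As a sketch of the second point, a new item could be labeled by a majority vote among its most similar known items. The example assumes the new item already has a few ratings, so that it can be compared with existing items by cosine similarity. The helper `classify_new_item`, the toy vectors, and the labels are hypothetical.

```python
from collections import Counter
import numpy as np

def classify_new_item(new_item_vector, item_vectors, item_labels, k=5):
    """Label a new item by majority vote among its k most cosine-similar known items."""
    new_item_vector = np.asarray(new_item_vector, dtype=float)
    item_vectors = np.asarray(item_vectors, dtype=float)  # one row per known item
    norms = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(new_item_vector)
    similarities = item_vectors @ new_item_vector / np.where(norms == 0, 1.0, norms)
    nearest = np.argsort(similarities)[::-1][:k]
    return Counter(item_labels[j] for j in nearest).most_common(1)[0][0]

# Toy data: 4 known items described by the ratings of 3 users
toy_item_vectors = [
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
    [0.0, 0.0, 4.0],
]
toy_item_labels = ["liked", "liked", "disliked", "disliked"]
toy_new_label = classify_new_item([5.0, 5.0, 1.0], toy_item_vectors, toy_item_labels, k=3)
print(toy_new_label)
```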