## Item-based Models

In this file we are going to proceed with the building in the models. Machine learning models to build recommender systems. The models that are going to be built are Collaborative filtering using Item-based rating prediction (ItemKNN) and Item-based classification (ItemKNN).

Collaborative filtering is a technique used in recommendation systems to predict or classify items based on the preferences or behavior of similar users or items. Item-based Collaborative Filtering (CF) focuses on the similarity between items rather than users. There are two main approaches within item-based CF: rating prediction and classification.

In [8]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.core.display import HTML # to pretty print pandas df and be able to copy them over (e.g. to ppt slides)

In [9]:
# We import the necessary libraries
import pandas as pd
import numpy as np
import os

In [10]:
# We print the directory where the file is located
print(os.getcwd())

c:\Users\Jaume\Documents\MDDB\SDM\SDfM---Jaume-and-Stijn


In [11]:
# We set the directory to the cleaned folder
os.listdir(os.path.join('.', 'cleaned'))

['final_sample5_parquet', 'movielens_parquet', 'netflix_parquet']

In [12]:
# We read the final_sample file and store it in a dataframe
df = pd.read_parquet('cleaned/final_sample5_parquet')

In [13]:
# We print shape of the dataframe
df.shape

(4665, 26)

In [14]:
# We print the first 5 rows of the dataframe
df.head()

Unnamed: 0,movieId,title,year,userId,rating,date,Drama,Action,Sci-Fi,Comedy,...,Adventure,Fantasy,IMAX,Animation,Musical,Horror,Film-Noir,Western,Mystery,(no genres listed)
0,45635,"Notorious Bettie Page, The",2005,414,3.0,2008-07-15,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,45635,"Notorious Bettie Page, The",2005,474,3.0,2006-12-08,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1373,Star Trek V: The Final Frontier,1989,19,1.0,2000-08-08,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1373,Star Trek V: The Final Frontier,1989,42,4.0,2001-07-27,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1373,Star Trek V: The Final Frontier,1989,51,5.0,2009-01-02,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# We print the number of unique users and movies
print(df['userId'].nunique())
print(df['movieId'].nunique())


532
487


# Item-based Rating Prediction (ItemKNN)

### Step 1: User-Item matrix construction
The first thing we have to do is build the user-item matrix:

- Choose a similarity metric to calculate the similarity between items. Common metrics include cosine similarity, Pearson correlation coefficient, and Jaccard similarity.
- Calculate the similarity between each pair of items based on the ratings provided by users. This will result in an item-item similarity matrix.

In [16]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assuming df is your DataFrame with the specified format
# We'll first pivot the DataFrame to get the user-item matrix
user_item_matrix = df.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)

# Drop any duplicate rows or columns
user_item_matrix = user_item_matrix.loc[~user_item_matrix.index.duplicated(keep='first')]
user_item_matrix = user_item_matrix.loc[:, ~user_item_matrix.columns.duplicated()]

# Calculate cosine similarity between items
item_similarity_matrix = cosine_similarity(user_item_matrix.T)

# Create a mapping from movie IDs to indices
movie_id_to_index = {movie_id: i for i, movie_id in enumerate(user_item_matrix.columns)}
index_to_movie_id = {i: movie_id for movie_id, i in movie_id_to_index.items()}

# Create a mapping from user IDs to indices
user_id_to_index = {user_id: i for i, user_id in enumerate(user_item_matrix.index)}
index_to_user_id = {i: user_id for user_id, i in user_id_to_index.items()}

# Determine the neighborhood size (number of most similar items to consider)
neighborhood_size = 5  # Adjust this value as needed

# Initialize an empty dictionary to store the neighborhoods for each item
item_neighborhoods = {}

# Iterate over each item in your dataset
for movie_id in user_item_matrix.columns:
    # Get the index corresponding to the current movie ID
    movie_index = movie_id_to_index[movie_id]
    
    # Get the similarity scores for the current item (movie)
    item_similarities = item_similarity_matrix[movie_index]
    
    # Sort the similarity scores in descending order and get the indices of the most similar items
    most_similar_indices = np.argsort(item_similarities)[::-1][1:neighborhood_size+1]
    
    # Convert indices back to movie IDs
    most_similar_movie_ids = [index_to_movie_id[idx] for idx in most_similar_indices]
    
    # Store the neighborhood for the current item
    item_neighborhoods[movie_id] = most_similar_movie_ids

# item_neighborhoods now contains the neighborhood for each item


In [28]:
user_item_matrix.head()
user_item_matrix.shape

movieId,1,4,15,30,43,89,104,108,122,146,...,174479,174551,175475,176371,176389,177593,179813,181413,185029,186587
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


(531, 487)

In [18]:
# We print the item_similarity_matrix
item_similarity_matrix

array([[1.        , 0.03557272, 0.10452764, ..., 0.        , 0.09204237,
        0.06805615],
       [0.03557272, 1.        , 0.15733562, ..., 0.        , 0.        ,
        0.        ],
       [0.10452764, 0.15733562, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.09204237, 0.        , 0.        , ..., 0.        , 1.        ,
        0.63644583],
       [0.06805615, 0.        , 0.        , ..., 0.        , 0.63644583,
        1.        ]])

### Step 2: Neighborhood Selection
- Determine the neighborhood size, i.e., the number of most similar items to consider when predicting ratings for a target item.
- Select the most similar items for each item in your dataset based on their calculated similarities. This forms the neighborhood for each item.


In [19]:
# Determine the neighborhood size (number of most similar items to consider)
neighborhood_size = 5  # Adjust this value as needed

# Initialize an empty dictionary to store the neighborhoods for each item
item_neighborhoods = {}

# Iterate over each item in your dataset
for movie_id in user_item_matrix.columns:
    # Get all ratings for the current movie ID
    movie_ratings = df[df['movieId'] == movie_id]
    
    # Aggregate the ratings (e.g., take the average)
    aggregated_rating = movie_ratings['rating'].mean()
    
    # Get the index corresponding to the current movie ID
    movie_index = movie_id_to_index[movie_id]
    
    # Get the similarity scores for the current item (movie)
    item_similarities = item_similarity_matrix[movie_index]
    
    # Sort the similarity scores in descending order and get the indices of the most similar items
    most_similar_indices = np.argsort(item_similarities)[::-1][1:neighborhood_size+1]
    
    # Convert indices back to movie IDs
    most_similar_movie_ids = [index_to_movie_id[idx] for idx in most_similar_indices]
    
    # Store the neighborhood for the current item
    item_neighborhoods[movie_id] = most_similar_movie_ids

# item_neighborhoods now contains the neighborhood for each item


### Step 3: Train-test Split
Data splitting is essential for evaluating the performance of your recommendation system and preventing overfitting. Here's where you should perform data splitting:

In [20]:
from sklearn.model_selection import train_test_split

# Assuming df is your DataFrame with the specified format
# Split the data into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)


### Step 4: Rating Prediction

- For each target item and user pair where the user hasn't rated the target item:
Identify the neighborhood of similar items to the target item.
- Predict the rating for the target item using a weighted average of the ratings of the items in its neighborhood, where the weights are the similarities between the items and the target item.
- Adjust the prediction based on the user's average rating or other normalization techniques, if necessary.

In [29]:
# Train your model and create the item_neighborhoods dictionary using the training data (steps 1-3)

# Initialize an empty DataFrame to store predicted ratings
predicted_ratings = pd.DataFrame(index=test_df['userId'], columns=test_df['movieId'])

# Iterate over each user-item pair in the testing set
for index, row in test_df.iterrows():
    user_id = row['userId']
    movie_id = row['movieId']
    
    # Check if the movie has a neighborhood defined
    if movie_id in item_neighborhoods:
        neighborhood = item_neighborhoods[movie_id]
        
        # Filter out movies from the neighborhood that the user has rated in the training set
        filtered_neighborhood = [neighbor_movie_id for neighbor_movie_id in neighborhood if neighbor_movie_id in train_df[train_df['userId'] == user_id]['movieId'].values]
        
        # Check if there are ratings available in the filtered neighborhood
        if filtered_neighborhood:
            # Calculate the predicted rating for the target movie based on the neighborhood
            predicted_rating = np.mean([train_df[(train_df['userId'] == user_id) & (train_df['movieId'] == neighbor_movie_id)]['rating'].values[0] for neighbor_movie_id in filtered_neighborhood])
        else:
            # If the filtered neighborhood is empty, assign the mean rating of all movies for the user in the training set
            predicted_rating = np.mean(train_df[train_df['userId'] == user_id]['rating'].values)
        
        # Assign the predicted rating to the corresponding cell in the DataFrame
        predicted_ratings.at[user_id, movie_id] = predicted_rating

# predicted_ratings now contains predicted ratings for items in the testing set


  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


In [None]:
def get_top_recommendations(user_id):
    # Get the index of the user
    user_index = user_id_to_index[user_id]
    
    # Get the predicted ratings for the user
    user_ratings = predicted_ratings.loc[user_index]
    
    # Filter out movies that the user has already rated
    unrated_movies = user_ratings[user_ratings.isna()].index
    
    # Sort the ratings of unrated movies in descending order
    sorted_unrated_movies = user_ratings[unrated_movies].sort_values(ascending=False)
    
    # Initialize a list to store the top movie titles
    top_movie_titles = []
    
    # Initialize a set to keep track of movie IDs that have been added to the recommendations
    added_movie_ids = set()
    
    # Iterate over the sorted unrated movies
    for movie_id in sorted_unrated_movies.index:
        # Check if the movie ID has already been added to the recommendations
        if movie_id not in added_movie_ids:
            # Get the movie title corresponding to the movie ID and append it to the list
            top_movie_titles.append(df.loc[df['movieId'] == movie_id, 'title'].values[0])
            
            # Add the movie ID to the set of added movie IDs
            added_movie_ids.add(movie_id)
            
            # If we have already added 5 recommendations, stop iterating
            if len(top_movie_titles) == 5:
                break
    
    return top_movie_titles

In [None]:
# Example: Get top recommendations for user with ID 2
user_id = 2
top_recommendations = get_top_recommendations(user_id)
print("Top recommendations for user", user_id, ":")
print(top_recommendations)


Top recommendations for user 2 :
['Evil Dead II (Dead by Dawn) ', 'Kramer vs. Kramer ', 'Spirited Away (Sen to Chihiro no kamikakushi) ', 'Star Trek V: The Final Frontier ', 'Short Circuit ']


### Step 5: Model Evaluation
- Evaluate the performance of your ItemKNN algorithm using appropriate evaluation metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), or others.
- Split your dataset into training and testing sets to assess the model's predictive accuracy on unseen data.

### Step 6: Parameter Tuning
- Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of your ItemKNN algorithm.
- Use cross-validation or other techniques to tune these parameters and avoid overfitting.

1. **Import Necessary Libraries:**
   - We import the required libraries for performing grid search cross-validation (`GridSearchCV`), creating custom scorers (`make_scorer`), and utilizing the `NearestNeighbors` algorithm.

2. **Define Cosine Similarity Function:**
   - We define a custom function `cosine_similarity` to compute the cosine similarity between two vectors. This function calculates the dot product of the vectors and divides it by the product of their norms.

3. **Define Custom Scorer:**
   - We create a custom scorer `cosine_similarity_scorer` using `make_scorer`, which enables us to use cosine similarity as the scoring metric during grid search cross-validation.

4. **Define Parameter Grid:**
   - We specify a parameter grid `param_grid` containing the hyperparameters to be tuned. In this case, we're tuning the number of neighbors (`n_neighbors`) and the distance metric (`metric`) for the `NearestNeighbors` algorithm.

5. **Initialize NearestNeighbors Model:**
   - We initialize the `NearestNeighbors` model without specifying any hyperparameters.

6. **Create GridSearchCV Object:**
   - We create a `GridSearchCV` object named `grid_search` with the specified parameter grid, cross-validation strategy (5-fold cross-validation), and custom scoring metric (`cosine_similarity_scorer`).

7. **Fit the Data:**
   - We fit the `user_item_matrix` data to the `grid_search` object to perform hyperparameter tuning. `user_item_matrix` typically contains the item-item similarity matrix computed using collaborative filtering techniques.

8. **Get Best Hyperparameters:**
   - After fitting the data, we retrieve the best hyperparameters selected by the grid search using the `best_params_` attribute of the `grid_search` object.

9. **Print Best Parameters:**
   - Finally, we print the best hyperparameters obtained from the grid search.



In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Define a function to compute cosine similarity
def cosine_similarity(X, Y):
    return np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

# Define a custom scorer based on cosine similarity
cosine_similarity_scorer = make_scorer(cosine_similarity)

# Define the parameter grid
param_grid = {
    'n_neighbors': [5, 10, 15, 30, 40],
    'metric': ['cosine', 'euclidean']
}

# Initialize NearestNeighbors model
knn_model = NearestNeighbors()

# Create the GridSearchCV object
grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring=cosine_similarity_scorer)

# Fit the data to perform hyperparameter tuning
grid_search.fit(user_item_matrix)  # user_item_matrix contains item-item similarity matrix

# Get the best hyperparameters
best_params = grid_search.best_params_

print("Best parameters:", best_params)


Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Traceback (most recent call last):
  File "c:\Users\Jaume\anaconda3\envs\SDM\Lib\site-packages\sklearn\model_selection\_validation.py", line 980, in _score
    scores = scorer(estimator, X_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() 

Best parameters: {'metric': 'cosine', 'n_neighbors': 5}


### Step 7: Deployment

Once you're satisfied with the performance of your ItemKNN model, deploy it in your recommendation system to provide personalized recommendations to users based on their preferences and behaviors.

# Item-based Classification (ItemKNN)

### Step 1: Data preparation
- Ensure you have a dataset with user-item interactions. Each interaction should include the user ID, item ID, and the corresponding label or class for classification.
- Preprocess the data as needed, including handling missing values, encoding categorical variables, and splitting into training and testing sets.

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assuming df is your DataFrame with the specified format
# We'll first pivot the DataFrame to get the user-item matrix
user_item_matrix_classification = df.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)

# Drop any duplicate rows or columns
user_item_matrix_classification = user_item_matrix_classification.loc[~user_item_matrix_classification.index.duplicated(keep='first')]
user_item_matrix_classification = user_item_matrix_classification.loc[:, ~user_item_matrix_classification.columns.duplicated()]

# Calculate cosine similarity between items
item_similarity_matrix = cosine_similarity(user_item_matrix_classification.T)

# Create a mapping from movie IDs to indices
movie_id_to_index = {movie_id: i for i, movie_id in enumerate(user_item_matrix_classification.columns)}
index_to_movie_id = {i: movie_id for movie_id, i in movie_id_to_index.items()}

# Create a mapping from user IDs to indices
user_id_to_index = {user_id: i for i, user_id in enumerate(user_item_matrix_classification.index)}
index_to_user_id = {i: user_id for user_id, i in user_id_to_index.items()}

# Determine the neighborhood size (number of most similar items to consider)
neighborhood_size = 5  # Adjust this value as needed

# Initialize an empty dictionary to store the neighborhoods for each item
item_neighborhoods = {}

# Iterate over each item in your dataset
for movie_id in user_item_matrix_classification.columns:
    # Get the index corresponding to the current movie ID
    movie_index = movie_id_to_index[movie_id]
    
    # Get the similarity scores for the current item (movie)
    item_similarities = item_similarity_matrix[movie_index]
    
    # Sort the similarity scores in descending order and get the indices of the most similar items
    most_similar_indices = np.argsort(item_similarities)[::-1][1:neighborhood_size+1]
    
    # Convert indices back to movie IDs
    most_similar_movie_ids = [index_to_movie_id[idx] for idx in most_similar_indices]
    
    # Store the neighborhood for the current item
    item_neighborhoods[movie_id] = most_similar_movie_ids

# item_neighborhoods now contains the neighborhood for each item


### Step 2: Compute Item Similarity

- Use the training data to calculate the similarity between items. You can use similarity measures like cosine similarity, Pearson correlation coefficient, or Jaccard similarity.
- Construct an item-item similarity matrix where each element represents the similarity between two items.

### Step 3: Define Nearest Neighbors

- Determine the neighborhood size, i.e., the number of most similar items to consider for each item.
- For each item, identify its nearest neighbors based on the item-item similarity matrix.

### Step 4: Classify Items

- For each item in the testing set, find its nearest neighbors based on the precomputed item neighborhoods.
- Determine the class labels of these neighbors.
- Use a strategy such as majority voting to assign the class label to the item.

### Step 5: Evaluate Model Performance

- Assess the performance of the ItemKNN classifier using evaluation metrics such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC).
- Compare the performance of different models with varying neighborhood sizes or similarity metrics to identify the optimal configuration.

### Step 6: Parameter Tuning

Experiment with different parameters such as similarity threshold, neighborhood size, and similarity metric to optimize the performance of your ItemKNN algorithm.
Use techniques like cross-validation to tune these parameters and avoid overfitting.

### Step 7: Deployment

- If you want to predict the class labels for new items, repeat steps 4 and 5 using the entire dataset (training + testing) to build the final model.
- Use the trained model to predict the class labels for new items based on their nearest neighbors.