In [17]:
import pandas as pd
import numpy as np

In [18]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.display import HTML # to pretty-print pandas DataFrames and be able to copy them over (e.g. to ppt slides)

In [19]:
netflix_df = pd.read_parquet('cleaned/netflix_parquet')
movielens_df = pd.read_parquet('cleaned/movielens_parquet')

In [20]:
# netflix_df = netflix_df[netflix_df['review_data'].apply(lambda x: len(x) if x is not None else 0) > 500]
netflix_df = netflix_df[netflix_df['review_data'].apply(lambda x: 30 <= len(x) <= 350 if x is not None else False)]
movielens_df = movielens_df[movielens_df['review_data'].apply(lambda x: 30 <= len(x) <= 350 if x is not None else False)]

In [21]:
n_rows = 10

In [22]:
df = (netflix_df.sample(n=n_rows,random_state=42))[['movieId','review_data']]
df
df2 = (movielens_df.sample(n=n_rows,random_state=42))[['movieId','review_data']]
df2

Unnamed: 0,movieId,review_data
648,649,"[{'date': 2002-01-09, 'rating': 1.0, 'userId':..."
84,85,"[{'date': 2005-07-11, 'rating': 4.0, 'userId':..."
926,927,"[{'date': 2005-12-05, 'rating': 3.0, 'userId':..."
734,735,"[{'date': 2005-07-06, 'rating': 4.0, 'userId':..."
1336,1337,"[{'date': 2005-06-08, 'rating': 3.0, 'userId':..."
1522,1523,"[{'date': 2005-05-17, 'rating': 5.0, 'userId':..."
967,968,"[{'date': 2004-11-08, 'rating': 5.0, 'userId':..."
978,979,"[{'date': 2005-08-27, 'rating': 5.0, 'userId':..."
1252,1253,"[{'date': 2005-09-17, 'rating': 1.0, 'userId':..."
1725,1726,"[{'date': 2005-12-16, 'rating': 4.0, 'userId':..."


Unnamed: 0,movieId,review_data
1932,2029,"[{'date': 2000-01-18, 'rating': 4.0, 'userId':..."
16102,89305,"[{'date': 2011-12-19, 'rating': 4.0, 'userId':..."
18076,101088,"[{'date': 2020-05-10, 'rating': 2.5, 'userId':..."
12563,64285,"[{'date': 2009-01-29, 'rating': 4.5, 'userId':..."
546,554,"[{'date': 2000-03-20, 'rating': 1.0, 'userId':..."
6352,6517,"[{'date': 2009-12-14, 'rating': 2.5, 'userId':..."
5133,5264,"[{'date': 2020-03-17, 'rating': 2.0, 'userId':..."
23832,131050,"[{'date': 2015-07-08, 'rating': 5.0, 'userId':..."
3871,3991,"[{'date': 2001-07-05, 'rating': 3.0, 'userId':..."
7591,8157,"[{'date': 2009-02-06, 'rating': 4.5, 'userId':..."


In [23]:
review_data = df['review_data'].values
user_ids = np.concatenate([np.array([entry['userId'] for entry in row]) for row in review_data])
ratings = np.concatenate([np.array([entry['rating'] for entry in row]) for row in review_data])
movieIds = np.concatenate([[movieId] * len(row) for movieId, row in zip(df['movieId'], review_data)])
len(user_ids)
len(np.unique(movieIds))

review_data2 = df2['review_data'].values
user_ids2 = np.concatenate([np.array([entry['userId'] for entry in row]) for row in review_data2])
ratings2 = np.concatenate([np.array([entry['rating'] for entry in row]) for row in review_data2])
movieIds2 = np.concatenate([[movieId] * len(row) for movieId, row in zip(df2['movieId'], review_data2)])
len(user_ids2)
len(np.unique(movieIds2))

1932

10

1000

10

### Set-up user-item matrix
First we will create a user-item matrix which records all the user-item interactions.


### `create_user_item_matrix` Function Explanation

### Steps:
1. **Extract Review Data**:
   - Extract the review data from the provided DataFrame, which contains user IDs, ratings, and movie IDs.

2. **Create User and Movie IDs Arrays**:
   - Extract user IDs, ratings, and movie IDs from the review data and concatenate them into separate arrays.
   - Generate dictionaries to map user IDs and movie IDs to unique indices in the user-item matrix.

3. **Initialize User-Item Matrix**:
   - Determine the dimensions of the user-item matrix based on the number of unique users and movies.
   - Initialize an empty user-item matrix filled with NaN values.

4. **Populate User-Item Matrix**:
   - Iterate through the review data and populate the user-item matrix with ratings.
   - Map user and movie IDs to their corresponding indices in the matrix and insert the ratings.

5. **Return Results**:
   - Return the user-item matrix along with dictionaries mapping user and movie IDs to indices, and arrays containing user and movie IDs.
  
### Functions Used and Purpose:

- **`np.concatenate()`**: Used to concatenate arrays containing user IDs, ratings, and movie IDs extracted from the review data.
- **`np.unique()`**: Used to find the unique user IDs and movie IDs in the review data.
- **`enumerate()`**: Used to assign a consecutive index to each unique user ID and movie ID when building the mapping dictionaries.
- **`np.full()`**: Used to initialize the user-item matrix with NaN values, which mark missing ratings.
- **`zip()`**: Used to iterate over user IDs, movie IDs, and ratings in parallel.
- **Indexing and Slicing**: Used to access and modify elements in arrays and matrices.
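
As a small illustration of the steps above, the sketch below builds a user-item matrix from three flat toy arrays. The toy values and variable names are assumptions made for this example only, and it uses `np.unique(..., return_inverse=True)` as one possible vectorized alternative to the dictionary lookup, not the approach taken in the function below.

```python
import numpy as np

# Toy data (hypothetical): one entry per individual rating
toy_user_ids = np.array([10, 20, 10, 30])
toy_movie_ids = np.array([1, 1, 2, 2])
toy_ratings = np.array([4.0, 3.0, 5.0, 1.0])

# return_inverse gives, for every rating, the row/column index of its user/movie
unique_users, user_idx = np.unique(toy_user_ids, return_inverse=True)
unique_movies, movie_idx = np.unique(toy_movie_ids, return_inverse=True)

# Start from an all-NaN matrix so that missing ratings stay distinguishable
toy_matrix = np.full((len(unique_users), len(unique_movies)), np.nan)
toy_matrix[user_idx, movie_idx] = toy_ratings

print(toy_matrix)
```

Rows follow the sorted unique user IDs (10, 20, 30) and columns the sorted unique movie IDs (1, 2); any user-movie pair without a rating remains NaN.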

In [24]:
def create_user_item_matrix(train_test_val_set):
    """
    Creates a user-item matrix from the provided dataset containing review data.

    Parameters:
    train_test_val_set (DataFrame): DataFrame with a 'movieId' column and a 'review_data' column;
                                    each 'review_data' entry is a list of dictionaries with the
                                    keys 'userId' and 'rating' (other keys, such as 'date', are ignored).

    Returns:
    user_item_matrix (numpy.ndarray): 2-D array of shape (n_users, n_movies) holding each user's rating
                                      for each movie; entries without a rating are NaN.
    
    user_id_dict (dict): Dictionary mapping user IDs to unique indices in the user-item matrix.
    
    movie_id_dict (dict): Dictionary mapping movie IDs to unique indices in the user-item matrix.
    
    user_ids (numpy.ndarray): Flat array with the user ID of every individual rating (one entry per rating).
    
    movie_ids (numpy.ndarray): Flat array with the movie ID of every individual rating (one entry per rating).

    """
    review_data = train_test_val_set['review_data'].values
    user_ids = np.concatenate([np.array([entry['userId'] for entry in row]) for row in review_data])
    ratings = np.concatenate([np.array([entry['rating'] for entry in row]) for row in review_data])
    movieIds = np.concatenate([[movieId] * len(row) for movieId, row in zip(train_test_val_set['movieId'], review_data)])

    # create dictionaries that map user IDs and movie IDs to row and column indices
    user_id_dict = {user_id: index for index, user_id in enumerate(np.unique(user_ids))}
    movie_id_dict = {movie_id: index for index, movie_id in enumerate(np.unique(movieIds))}

    # initialize an empty user-item matrix
    user_count = len(user_id_dict)
    movie_count = len(movie_id_dict)
    user_item_matrix = np.full((user_count, movie_count), np.nan)

    # populate the user-item matrix with ratings from the dataset
    for user_id, movie_id, rating in zip(user_ids, movieIds, ratings):
        user_index = user_id_dict[user_id]
        movie_index = movie_id_dict[movie_id]
        user_item_matrix[user_index, movie_index] = rating

    return user_item_matrix, user_id_dict, movie_id_dict, user_ids, movieIds

### Preprocessing of ratings in user-item matrix:
A simple option would be to fill the missing values with 0s, but that can create issues for recommendation engines.

If we were to fill a NaN with 0, we would incorrectly imply that the user greatly disliked the movie. Instead, we center each user's ratings around 0 by subtracting the row average and then fill in the missing values with 0. This means the missing data is replaced with a neutral score, equal to the user's own average.

### `computing_neutral_scores` Function Explanation

### Functions Used and Purpose:
- **`np.nanmean()`**: Used to calculate the average rating for each user while handling NaN (missing) values.
  - **`axis=1`**: Specifies that the calculation is done along the rows (i.e., for each user).
- **`np.nan_to_num()`**: Used to fill in missing data (NaN) with zeros while preserving non-NaN values.
- **`ndarray.reshape(-1, 1)`**: Used to turn the 1-D array of user averages into a column vector so that it broadcasts across each row during subtraction.
- **Indexing and Slicing**: Used to access elements in arrays and matrices.

### Steps:
1. **Calculate Average Ratings**:
   - Use `np.nanmean()` to compute the average rating for each user along the rows of the user-item matrix. This handles missing ratings (NaN) gracefully, computing the mean while ignoring NaN values.

2. **Center Ratings Around 0**:
   - Subtract the average ratings from each user's ratings in the user-item matrix. This centers each user's ratings around 0, effectively removing the user bias from the ratings.

3. **Fill Missing Data with Zeros**:
   - Use `np.nan_to_num()` to replace missing data (NaN) with zeros while preserving the existing non-NaN values. This ensures that missing ratings are treated neutrally (i.e., as if the user had given the item their own average rating).

4. **Return Normalized User Ratings**:
   - Return the resulting normalized user ratings matrix, where missing ratings have been replaced with zeros and each user's ratings are centered around 0.
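
The steps above can be traced on a tiny matrix. The sketch below uses made-up toy values (an assumption for illustration) and the same three NumPy calls described in this section.

```python
import numpy as np

# Toy user-item matrix (hypothetical): 2 users, 3 movies, NaN = not rated
toy = np.array([[4.0, 5.0, np.nan],
                [1.0, np.nan, 3.0]])

# Step 1: per-user mean, ignoring NaN -> [4.5, 2.0]
row_means = np.nanmean(toy, axis=1)

# Step 2: subtract each user's mean from that user's row (column-vector broadcast)
centered = toy - row_means.reshape(-1, 1)

# Step 3: replace the remaining NaN entries with the neutral score 0
filled = np.nan_to_num(centered, nan=0.0)

print(filled)
```

One consequence worth keeping in mind: after this step, a rating that happens to equal the user's average also becomes 0, so it can no longer be told apart from a missing rating.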

In [25]:
def computing_neutral_scores(user_item_matrix):
    """
    Compute neutral scores for user-item interactions in a user-item matrix.

    Parameters:
    user_item_matrix (numpy.ndarray): Matrix representing users' ratings for items (movies).

    Returns:
    user_ratings_matrix_normed (numpy.ndarray): Matrix representing users' ratings normalized with neutral scores.
    """
    # Calculate the average rating for each user
    avg_ratings = np.nanmean(user_item_matrix, axis=1)

    # Center each user's ratings around 0
    user_ratings_matrix_centered = user_item_matrix - avg_ratings.reshape(-1, 1)

    # Fill in the missing data with 0s
    user_ratings_matrix_normed = np.nan_to_num(user_ratings_matrix_centered, nan=0)

    return user_ratings_matrix_normed

### Compute similarity:
Cosine similarity is often used to measure the similarity between users based on their preferences or ratings for items (in this case, movies). Cosine similarity ranges from -1 to 1, where:

- 1 indicates perfect similarity,
- 0 indicates no similarity, and
- -1 indicates perfect dissimilarity.

### Interpretation:

- **Positive Cosine Similarity**: Users are similar in their preferences or ratings for movies.
- **Zero Cosine Similarity**: Users have no similarity in their preferences.
- **Negative Cosine Similarity**: Users are dissimilar in their preferences, tending towards opposite ratings for movies.

### Practical Implication:

If one user likes certain types of movies, the other user tends to dislike them, or vice versa. In other words, users with negative cosine similarities have contrasting preferences, making them less suitable for recommending movies to each other.
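
To make the three cases concrete, here is a minimal sketch of standard cosine similarity (with the Euclidean norm) on toy vectors of centered ratings. The helper name `cosine_similarity` and the vectors are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Standard cosine similarity: dot product divided by the product of Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same taste, different rating intensity -> similarity 1
same = cosine_similarity(np.array([1.0, -1.0]), np.array([2.0, -2.0]))

# No overlap in rated movies -> similarity 0
unrelated = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite taste -> similarity -1
opposite = cosine_similarity(np.array([1.0, -1.0]), np.array([-1.0, 1.0]))

print(same, unrelated, opposite)
```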

___

To see how similar users are, we will compute the similarity between them. I will use cosine similarity as the similarity measure. To reduce the computational cost, the Manhattan norm will be used instead of the Euclidean norm.

### Explanation `calculate_user_similarity_manhattan` Function

This function calculates the cosine similarity matrix between users based on their ratings using the Manhattan norm.

1. **Thresholding**: First, the function applies thresholding to the user ratings matrix. Ratings below the threshold are set to 0, ensuring that only significant ratings are considered.

2. **Dot Product Calculation**: It then computes the dot product of each pair of row vectors (users) in the thresholded matrix. This represents the similarity between users based on their common rated items.

3. **Norm Calculation**: Next, it calculates the norms (magnitude) of each row vector, considering only values above the threshold. This step prepares for the normalization process.

4. **Normalization**: The dot products are divided by the norms of the corresponding row vectors, effectively normalizing the similarity values. This step ensures that users with a large number of ratings are not favored over users with fewer ratings.

5. **Setting Diagonal to 0**: Finally, the diagonal elements of the similarity matrix are set to 0 to avoid self-similarity, as a user's rating should not be compared to itself.

### Explanation of NumPy Functions

- **np.dot**: Computes the dot product of arrays. Here, it calculates the dot product of the thresholded user ratings matrix with its transpose, resulting in the similarity matrix.
  
- **np.where**: With three arguments, selects elements from one of two inputs depending on a condition. It's used here to keep ratings at or above the threshold and replace the rest with 0.
  
- **np.sum**: Computes the sum of array elements. It calculates the norms of each row vector after thresholding, which are then used for normalization.
  
- **np.abs**: Computes the absolute value element-wise. Used to ensure positive values for norms calculation.
  
- **np.fill_diagonal**: Fills the diagonal of an array with a specified value. It's used to set diagonal elements of the similarity matrix to 0 to avoid self-similarity.
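
To show what swapping the norm does, the sketch below compares Manhattan (L1) and Euclidean (L2) normalization of the same dot products on a toy matrix. It deliberately leaves out the thresholding step, and the matrix values and variable names are assumptions for illustration.

```python
import numpy as np

# Toy matrix of centered ratings (hypothetical): 3 users, 3 movies
m = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.0, -0.5],
              [0.0, 2.0, 0.0]])

dots = m @ m.T                      # pairwise dot products between users

l1 = np.abs(m).sum(axis=1)          # Manhattan norm of each row
l2 = np.linalg.norm(m, axis=1)      # Euclidean norm of each row

sim_l1 = dots / np.outer(l1, l1)    # normalization used in this notebook
sim_l2 = dots / np.outer(l2, l2)    # standard cosine similarity

print(np.round(sim_l1, 3))
print(np.round(sim_l2, 3))
```

Because the Manhattan norm of a vector is never smaller than its Euclidean norm, the L1-normalized values are never larger in magnitude than the cosine values. In particular, a user's similarity to themselves is 1 under the Euclidean norm but can be below 1 under the Manhattan norm, so the two measures are related but not interchangeable.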

In [26]:
def calculate_user_similarity_manhattan(user_ratings_matrix, threshold):
    """
    Calculate user similarity as a dot product normalized by Manhattan (L1) norms.

    Parameters:
    user_ratings_matrix (numpy.ndarray): Matrix representing users' ratings for items (movies).
    threshold (float): Threshold value for considering ratings in the similarity calculation.

    Returns:
    similarity_matrix (numpy.ndarray): Matrix representing similarity between users, normalized by Manhattan (L1) norms.

    The Manhattan norm-based similarity measure is calculated as follows:
    1. Compute the dot product of each pair of row vectors in the user_ratings_matrix, considering only values at or above the threshold.
    2. Calculate the Manhattan norm of each row vector, considering only values at or above the threshold.
    3. Replace zero norms with a small value to avoid division by zero.
    4. Calculate the similarity matrix using broadcasting, where the similarity between users i and j is given by the dot product
       divided by the product of their norms.
    5. Set diagonal elements to 0 to avoid self-similarity.

    """
    # Keep only ratings at or above the threshold; everything else becomes 0
    thresholded = np.where(user_ratings_matrix >= threshold, user_ratings_matrix, 0)

    # Dot product of each thresholded row with each row of the original matrix
    # (note: only the left-hand operand is thresholded)
    dot_products = np.dot(thresholded, user_ratings_matrix.T)
    
    # Manhattan (L1) norm of each thresholded row
    norms = np.sum(np.abs(thresholded), axis=1)
    
    # Replace zero norms with a small value to avoid division by zero
    norms[norms == 0] = 1e-8
    
    # Calculate similarity matrix using broadcasting
    similarity_matrix = dot_products / (norms[:, None] * norms)
    
    # Set diagonal elements to 0 to avoid self-similarity
    np.fill_diagonal(similarity_matrix, 0)
    
    return similarity_matrix

## UserKNN classifier:
In contrast to the userKNN regressor, we will now recommend items based on majority voting, which works as follows:

Based on the cosine similarity, the nearest neighbours are selected, just like in the KNN regressor. Then, instead of computing the average rating per item and recommending the items with the highest averages, the KNN classifier predicts whether the outcome will be positive or negative, meaning a positive or negative review. Using a **majority vote**, the classifier checks which items the nearest neighbours rated positively or negatively and recommends the items with the most positive votes.
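
A minimal sketch of the voting idea on toy data: each row holds one neighbour's normalized ratings, each column is a candidate movie, and a rating at or above the threshold counts as a positive vote. The array values and variable names are assumptions for illustration, not taken from the datasets.

```python
import numpy as np

# Toy normalized ratings (hypothetical): 3 neighbours (rows) x 2 candidate movies (columns)
neighbor_ratings = np.array([[1.0, -0.5],
                             [0.8, 0.2],
                             [-1.0, 0.9]])
threshold = 0.5
k = neighbor_ratings.shape[0]

# Count positive votes per movie
positive_votes = (neighbor_ratings >= threshold).sum(axis=0)

# A movie is classified as 'positive' if more than half of the neighbours voted for it
liked = positive_votes > k / 2

# Rank candidate movies by number of positive votes, highest first
ranking = np.argsort(-positive_votes)

print(positive_votes, liked, ranking)
```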

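### `recommend_movies_classification` Function Explanation

### Steps:
1. **Look Up the Target User**:
   - Map the user ID to its row index using `user_id_dict`.

2. **Select Neighbours**:
   - Sort the target user's row of the similarity matrix in descending order and take `k` neighbours.

3. **Find Unrated Movies**:
   - Treat entries equal to 0 in the normalized ratings matrix as movies the target user has not rated.

4. **Vote**:
   - For each unrated movie, a neighbour with a normalized rating at or above `threshold` casts a positive vote, and any other neighbour casts a negative vote. The movie is classified as positive if the positive votes outnumber the negative ones.

5. **Recommend**:
   - Collect the movies that pass the majority check in index order, stopping once `top_n` movie IDs have been gathered.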

In [27]:
def recommend_movies_classification(user_id, user_ratings_matrix_normed, user_similarity_matrix, movie_id_dict, user_id_dict, threshold=0.5, k=1, top_n=5):
    """
    Recommend movies to a user based on the KNN classifier approach.

    Parameters:
    user_id (int): ID of the target user.
    user_ratings_matrix_normed (numpy.ndarray): Matrix representing users' ratings normalized with neutral scores.
    user_similarity_matrix (numpy.ndarray): Matrix representing similarity between users.
    movie_id_dict (dict): Dictionary mapping movie IDs to unique indices in the user-item matrix.
    user_id_dict (dict): Dictionary mapping user IDs to unique indices in the user-item matrix.
    threshold (float): Threshold value for considering positive ratings.
    k (int): Number of neighbors to consider.
    top_n (int): Number of movies to recommend.

    Returns:
    recommended_movies (list): List of recommended movie IDs.
    """

    # Find the target user's index in the similarity matrix
    user_index = user_id_dict[user_id]

    # Find indices of k most similar users (excluding the target user)
    similar_users_indices = np.argsort(user_similarity_matrix[user_index])[::-1][1:k+1]

    # Find movies not rated by the target user
    # unrated_movies = np.where(np.isnan(user_ratings_matrix_normed[user_index]))[0]
    unrated_movies = np.where(user_ratings_matrix_normed[user_index] == 0)[0]


    # Predict whether the target user will like or dislike each unrated movie
    predicted_classes = []
    for movie_index in unrated_movies:
        neighbor_ratings = user_ratings_matrix_normed[similar_users_indices, movie_index]
        positive_votes = np.sum(neighbor_ratings >= threshold)
        negative_votes = np.sum(neighbor_ratings < threshold)
        predicted_class = 'positive' if positive_votes > negative_votes else 'negative'
        predicted_classes.append(predicted_class)

    # Count the votes for each movie class
    unique_classes, class_counts = np.unique(predicted_classes, return_counts=True)

    # Recommend movies with the majority predicted class
    recommended_movies = []
    for movie_index, predicted_class in zip(unrated_movies, predicted_classes):
        if class_counts[np.where(unique_classes == predicted_class)[0][0]] >= k // 2 + 1:  # Majority voting
            movie_id = list(movie_id_dict.keys())[list(movie_id_dict.values()).index(movie_index)]
            recommended_movies.append(movie_id)
            if len(recommended_movies) == top_n:
                break

    return recommended_movies
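
Because the similarity matrix has its diagonal set to 0, the target user is not guaranteed to come first after sorting, so skipping the first sorted index may not remove them. Below is a minimal sketch of a neighbour selection that excludes the target user explicitly. `top_k_neighbors` is a hypothetical helper written for this illustration, not part of the notebook, and the toy similarity row is made up.

```python
import numpy as np

def top_k_neighbors(similarity_row, user_index, k):
    """Return the indices of the k most similar users, never including user_index."""
    sims = np.asarray(similarity_row, dtype=float).copy()
    sims[user_index] = -np.inf          # make sure the target user sorts last
    order = np.argsort(sims)[::-1]      # most similar first
    return order[:k]

# Toy similarity row for user 0 (self-similarity already set to 0)
toy_row = np.array([0.0, 0.3, -0.2, 0.9])
neighbors = top_k_neighbors(toy_row, user_index=0, k=2)
print(neighbors)  # [3 1]
```

On the same toy row, `np.argsort(toy_row)[::-1][1:3]` would give `[1, 0]`: it drops the most similar user (index 3) and keeps the target user (index 0).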

### See a first batch of recommendations:

Using the functions above, the following recommendations are generated for each dataset:

`Netflix dataset:`

In [28]:
user_item_matrix, user_id_dict, movie_id_dict, user_ids, movieIds = create_user_item_matrix(df)
user_ratings_matrix_normed = computing_neutral_scores(user_item_matrix)
user_similarity_matrix_manhattan = calculate_user_similarity_manhattan(user_ratings_matrix_normed, threshold=0.5) # TODO: explain why the threshold is set to 0.5
user_ratings_matrix_classified = np.where(user_ratings_matrix_normed >= 0.5, 'positive', 'negative')

In [29]:
# Example usage:
user_id = user_ids[1]  # Specify the target user ID
recommended_movies = recommend_movies_classification(user_id, user_ratings_matrix_normed, user_similarity_matrix_manhattan, movie_id_dict, user_id_dict, threshold=0.5, k=1, top_n=5)
print("Recommended Movies for User", user_id, ":", recommended_movies)

Recommended Movies for User 2012897 : [85, 649, 735, 927, 968]


`MovieLens dataset:`

In [30]:
user_item_matrix2, user_id_dict2, movie_id_dict2, user_ids2, movieIds2 = create_user_item_matrix(df2)
user_ratings_matrix_normed2 = computing_neutral_scores(user_item_matrix2)
user_similarity_matrix_manhattan2 = calculate_user_similarity_manhattan(user_ratings_matrix_normed2, threshold=0.5) # TODO: explain why the threshold is set to 0.5

In [31]:
# Example usage:
user_id2 = user_ids2[1]  # Specify the target user ID
recommended_movies2 = recommend_movies_classification(user_id2, user_ratings_matrix_normed2, user_similarity_matrix_manhattan2, movie_id_dict2, user_id_dict2, threshold=0.5, k=1, top_n=5)
print("Recommended Movies for User", user_id2, ":", recommended_movies2)

Recommended Movies for User 144354 : [554, 2029, 3991, 5264, 6517]


Even though the recommended movieIds are the same, the predicted ratings differ somewhat, which already indicates a difference between the two userKNN models.

## Baseline performance

To assess performance, we will compare the original user-item interactions with the predicted ones. To do so, we will generate a complete matrix of predicted ratings and compare it with the original one.
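
One possible way to carry out such a comparison is to score only the entries where an original rating exists. The sketch below uses RMSE for that; the choice of metric, the helper name `masked_rmse`, and the toy matrices are all assumptions for illustration, since the notebook has not yet fixed an evaluation measure at this point.

```python
import numpy as np

def masked_rmse(original, predicted):
    """RMSE over the entries that have an observed rating in `original` (NaN = missing)."""
    observed = ~np.isnan(original)
    diff = original[observed] - predicted[observed]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy matrices (hypothetical): NaN marks a missing original rating
toy_original = np.array([[4.0, np.nan],
                         [np.nan, 2.0]])
toy_predicted = np.array([[3.0, 5.0],
                          [1.0, 2.0]])

rmse = masked_rmse(toy_original, toy_predicted)
print(rmse)
```

Only the two observed entries contribute here (errors of 1 and 0); predictions for unobserved entries are ignored because there is nothing to compare them with.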