## **Libraries needed to run the notebook**

In [1]:
# All the libraries needed to run the Notebook

from itertools import chain
import pandas as pd
from functions import extract_top_movies_per_user
from functions import minhash_signature
from functions import our_hash_function
from functions import lsh
from functions import jaccard_similarity
from functions import find_nearest_neighbors
from functions import recommend_movies


# **1. Recommendation system** 


The exercise asks us to gather the top 10 movies clicked by each user, so we implemented a function named "extract_top_movies_per_user". This function takes the path of the dataset as input and returns a new dataset with the following details for each user:

- *user_id:* The unique identifier of the user
- *title:* The title of the movie
- *genres:* The genres of the movie
- *click_count:* The number of times the user clicked on the movie

The function uses pandas to read the dataset, groups the data by user_id, title, and genres, calculates the click count for each movie, and then selects the top 10 movies with the highest click counts for each user. The resulting dataset is sorted by user_id and click_count.
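
The steps described above can be sketched as follows. This is a minimal illustration, not the actual code in `functions.py`: the name `top_movies_sketch`, its parameters, and the toy data are hypothetical, and it takes an already loaded DataFrame instead of a file path.

```python
import pandas as pd

def top_movies_sketch(clicks: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return up to n most-clicked movies per user (illustrative sketch)."""
    # One row per (user, movie), with the number of clicks as a new column
    counts = (
        clicks.groupby(["user_id", "title", "genres"])
        .size()
        .reset_index(name="click_count")
    )
    # Sort so that each user's most-clicked movies come first
    counts = counts.sort_values(["user_id", "click_count"], ascending=[True, False])
    # Keep the first n rows of every user group
    return counts.groupby("user_id").head(n)

# Tiny made-up clickstream, only for illustration
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "title": ["A", "A", "B", "C"],
    "genres": ["Drama", "Drama", "Comedy", "Action"],
})
result = top_movies_sketch(toy, n=1)
print(result)
```

Note that `groupby` drops rows whose grouping keys are missing by default, so movies without a genre would be excluded in this sketch.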


In [2]:
top_movies_per_user = extract_top_movies_per_user('vodclickstream_uk_movies_03.csv')

# Printing only the first 10 rows to provide an overview of the dataset
top_movies_per_user.head(10)

| Unnamed: 0 | user_id | title | genres | click_count |
|---|---|---|---|---|
| 0 | 00004e2862 | Hannibal | Crime, Drama, Thriller | 1 |
| 6 | 000052a0a0 | Looper | Action, Drama, Sci-Fi, Thriller | 9 |
| 3 | 000052a0a0 | Frailty | Crime, Drama, Thriller | 3 |
| 5 | 000052a0a0 | Jumanji | Adventure, Comedy, Family, Fantasy | 3 |
| 7 | 000052a0a0 | Resident Evil | Action, Horror, Sci-Fi | 2 |
| 1 | 000052a0a0 | Ant-Man | Action, Adventure, Comedy, Sci-Fi | 1 |
| 2 | 000052a0a0 | Drive Angry | Action, Fantasy, Thriller | 1 |
| 4 | 000052a0a0 | Green Room | Horror, Music, Thriller | 1 |
| 8 | 000052a0a0 | Resident Evil: Retribution | Action, Horror, Sci-Fi, Thriller | 1 |
| 9 | 000052a0a0 | The Big Lebowski | Comedy, Crime, Sport | 1 |


## **1.2 Minhash Signatures**

To calculate the MinHash signatures, we first built a *shingle matrix* with one binary vector per user. We enumerated all possible genres, assigned a unique integer to each of them, and created a list for each user. In this list, position *i* is set to 1 if the genre with index *i* in the list of all possible genres appears among the user's preferred genres; otherwise, it is set to 0.
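
As a small illustration of the binary vector described above, here is a sketch. The helper name `genre_vector` and the toy mapping are hypothetical, and it assumes each entry of the user's list is a comma-separated genre string like the ones in the table above.

```python
def genre_vector(user_genres, genre_dict):
    """Build a 0/1 list with one position per known genre (illustrative sketch)."""
    vector = [0] * len(genre_dict)
    for entry in user_genres:
        # Each entry is assumed to look like "Crime, Drama, Thriller"
        for genre in entry.split(", "):
            vector[genre_dict[genre]] = 1
    return vector

# Toy mapping from genre to position, only for illustration
toy_genre_dict = {"Action": 0, "Comedy": 1, "Drama": 2, "Thriller": 3}
vec = genre_vector(["Drama, Thriller", "Action"], toy_genre_dict)
print(vec)  # [1, 0, 1, 1]
```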

After building the shingle matrix, we proceeded to calculate the MinHash signature using the minhash_signature function. This function returns a list representing the MinHash signature for the given user based on their preferred genres. The input parameters for this function include:
- *user_genres:* List of preferred genres for the user
- *num_hashes:* Number of hash functions to use
- *genre_dict:* Dictionary mapping genres to unique integers
- *num_total_genres:* Total number of unique genres in the dataset

As requested, we did not use any pre-built hash function; the hash function is implemented from scratch. We chose 100 hash functions, a common choice for MinHashing: with ideal hash functions, the standard error of the estimated Jaccard similarity is at most 0.5/√k for k hashes, i.e. 0.05 for k = 100.
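
To make the idea concrete, here is a minimal MinHash sketch. It is not the notebook's `minhash_signature` or `our_hash_function`, whose code is not shown here: the linear hash family `(a * x + b) % PRIME`, the helper names, and the seed are assumptions chosen for illustration.

```python
import random

PRIME = 2_147_483_647  # the Mersenne prime 2**31 - 1, used as the modulus

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for the assumed hash family h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_sketch(genre_ids, hash_params):
    """Signature = for each hash function, the minimum hash over the user's genre ids.

    genre_ids must be a non-empty collection of integers.
    """
    return [min((a * x + b) % PRIME for x in genre_ids) for a, b in hash_params]

params = make_hash_params(100)
sig_a = minhash_sketch({0, 2, 3}, params)
sig_b = minhash_sketch({0, 2, 3, 5}, params)

# The fraction of matching positions estimates the Jaccard similarity (true value: 3/4)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(params)
print(estimate)
```

Both users must be hashed with the same `params`; otherwise the signature positions are not comparable.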

In [3]:
# Dictionary that keeps track of each user's preferred genres
user_genres_dict = top_movies_per_user.groupby('user_id')['genres'].apply(list).to_dict()

# Collect all unique genres in the dataset and assign a unique integer to each one;
# the minhash function uses this mapping to build the shingle matrix for each user
all_genres = sorted(set(chain.from_iterable(
    genre.split(', ') if isinstance(genre, str) else genre
    for genre in top_movies_per_user['genres']
)))
genre_dict = {genre: i for i, genre in enumerate(all_genres)}
num_total_genres = len(all_genres)
    
# Using the defined function to calculate the MinHash signatures for all users
num_hashes = 100
minhash_signatures = {user_id: minhash_signature(genres, num_hashes, genre_dict, num_total_genres) for user_id, genres in user_genres_dict.items()}


## **1.3 Locality-Sensitive Hashing (LSH)**

Now that we have a dictionary containing the signatures for each user, we can utilize the Locality-Sensitive Hashing (LSH) algorithm to identify users with similar interests. In this algorithm, users with similar signatures are grouped into the same bucket. This approach helps in reducing the number of users for which we need to calculate the Jaccard similarity.
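
The bucketing step can be sketched as below, assuming the standard banding technique: each signature is split into bands, and users whose signatures agree on every row of at least one band land in the same bucket. The function name `lsh_buckets_sketch` and the toy signatures are hypothetical; the notebook's own `lsh` function may differ.

```python
from collections import defaultdict

def lsh_buckets_sketch(signatures, bands, rows_per_band):
    """Group users whose signatures are identical on at least one band."""
    buckets = defaultdict(set)
    for user_id, sig in signatures.items():
        for band in range(bands):
            start = band * rows_per_band
            chunk = tuple(sig[start:start + rows_per_band])
            # The band index is part of the key so different bands never collide
            buckets[(band, chunk)].add(user_id)
    return buckets

# Made-up 4-value signatures, only for illustration
toy_signatures = {
    "u1": [1, 2, 3, 4],
    "u2": [1, 2, 9, 9],
    "u3": [7, 7, 7, 7],
}
buckets = lsh_buckets_sketch(toy_signatures, bands=2, rows_per_band=2)

# Candidates for u1: everyone sharing at least one bucket with u1
candidates = {u for members in buckets.values() if "u1" in members for u in members} - {"u1"}
print(candidates)  # {'u2'}
```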

To find the two users most similar to the input user, we calculate the Jaccard similarity between the input user and every other user in the buckets where the input user is present. We then select the two users with the highest similarity scores. This process is done through the 'find_nearest_neighbors' function, which first executes the LSH algorithm to organize users into buckets based on their MinHash signatures and then calculates the similarity scores to find the **two most similar users**. The function takes the following inputs:
- *target_user:* The user for which we want to find the most similar users
- *minhash_signatures:* The dictionary containing the MinHash signatures for each user
- *bands:* The number of bands used in the LSH algorithm
- *rows_per_band:* The number of rows per band in the LSH algorithm
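
The scoring step can be sketched as follows. This is only an illustration: the helper names are hypothetical, and it assumes the Jaccard similarity is computed on sets of genres, whereas the notebook's `jaccard_similarity` and `find_nearest_neighbors` may compare other representations such as the signatures themselves.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_two_sketch(target, candidates, genre_sets):
    """Rank candidate users by similarity to the target and keep the best two."""
    ranked = sorted(
        candidates,
        key=lambda user: jaccard(genre_sets[target], genre_sets[user]),
        reverse=True,
    )
    return ranked[:2]

# Made-up genre sets, only for illustration
toy_genre_sets = {
    "t": {"Action", "Drama"},
    "a": {"Action", "Drama"},
    "b": {"Action"},
    "c": {"Comedy"},
}
neighbors = top_two_sketch("t", ["a", "b", "c"], toy_genre_sets)
print(neighbors)  # ['a', 'b']
```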

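We chose the number of bands and the number of rows per band (6 and 10 in the cell below) so that the bucketing is not too restrictive, i.e. so that reasonably similar users still have a good chance of sharing at least one bucket.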
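
To see how these two parameters affect restrictiveness, one can use the standard banding analysis: if two users have Jaccard similarity s, they share at least one bucket with probability 1 - (1 - s^r)^b for b bands of r rows, and the similarity at which this curve rises steeply is approximately (1/b)^(1/r), which is about 0.84 for 6 bands of 10 rows. The sketch below assumes the notebook's `lsh` follows this standard scheme; the function names are hypothetical.

```python
def candidate_probability(s, bands, rows_per_band):
    """Probability that two users with Jaccard similarity s share at least one bucket."""
    return 1 - (1 - s ** rows_per_band) ** bands

def approx_threshold(bands, rows_per_band):
    """Approximate similarity at which the probability curve rises steeply."""
    return (1 / bands) ** (1 / rows_per_band)

# Inspect the curve for the parameters used in this notebook
for s in (0.5, 0.7, 0.8, 0.9):
    print(s, round(candidate_probability(s, bands=6, rows_per_band=10), 3))
print("threshold:", round(approx_threshold(6, 10), 3))
```

Using more bands with fewer rows each lowers the threshold and makes the bucketing less restrictive, at the cost of more candidate pairs to score.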

In [34]:
# Target user is the user to whom we want to recommend movies
target_user = '9966728c1c'
bands = 6
rows_per_band = 10
    
most_similar_users = find_nearest_neighbors(target_user, minhash_signatures, bands, rows_per_band)

print(f"The two most similar users to {target_user} are: {most_similar_users}")


The two most similar users to 9966728c1c are: ['004f54b636', '00ca88823f']


Now that we have identified the two most similar users, we can recommend movies as specified by the exercise. Movies that these two users have in common are chosen first, ordered by number of clicks. If more recommendations are needed, we add the most clicked movies of the most similar user, followed by those of the second most similar user.
We did this with the "recommend_movies" function, which takes as input:
- *most_similar_users:* list containing the two most similar users
- *top_movies_per_user:* DataFrame containing information about the top movies per user

It returns the list of recommended movies, in order.
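
The ordering rule described above can be sketched as follows. This is not the notebook's `recommend_movies`: the function name, the `max_recs=5` cap, ranking common movies by combined clicks, and breaking ties alphabetically are all assumptions made for illustration.

```python
import pandas as pd

def recommend_sketch(similar_users, top_movies, max_recs=5):
    """Common movies first, then the rest of each user's movies, all ordered by clicks."""
    first, second = similar_users
    # Per-user Series mapping title -> click count
    clicks = {
        user: top_movies[top_movies["user_id"] == user].groupby("title")["click_count"].sum()
        for user in (first, second)
    }
    common = set(clicks[first].index) & set(clicks[second].index)
    # Common titles ranked by combined clicks (assumption), ties broken alphabetically
    ranked = sorted(common, key=lambda t: (-(clicks[first][t] + clicks[second][t]), t))
    # Then the remaining titles of the most similar user, followed by the other user
    for user in (first, second):
        own = clicks[user].drop(labels=list(common))
        ranked += sorted(own.index, key=lambda t: (-own[t], t))
    return ranked[:max_recs]

# Made-up data, only for illustration
toy_movies = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "title": ["X", "Y", "Z", "X", "W"],
    "click_count": [1, 5, 2, 3, 4],
})
recs = recommend_sketch(["a", "b"], toy_movies)
print(recs)  # ['X', 'Y', 'Z', 'W']
```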

In [35]:
recommended_movies = recommend_movies(most_similar_users, top_movies_per_user)

print(f"Recommended movies for user \"{target_user}\":")
for movie in recommended_movies:
    print(movie)

{'004f54b636': {'The Martian', 'Bright', "Daddy's Home", 'Ip Man 3'}, '00ca88823f': {'Ip Man 2', 'The Great Gatsby', '21 Jump Street', 'Ip Man', 'Ip Man 3'}}
Recommended movies for user "9966728c1c":
Ip Man 3
Bright
Daddy's Home
The Martian
The Great Gatsby


As we can see from the above example, if the two most similar users have movies in common, the algorithm chooses those first, ordered by number of clicks. Then, if there are not yet enough recommendations, it chooses movies from the most similar user first and then from the second most similar user.

In [33]:
target_user = '1ec083a8bd'

most_similar_users = find_nearest_neighbors(target_user, minhash_signatures, bands, rows_per_band)
print(f"The two most similar users to {target_user} are: {most_similar_users}")

recommended_movies = recommend_movies(most_similar_users, top_movies_per_user)

print(f"Recommended movies for user \"{target_user}\":")
for movie in recommended_movies:
    print(movie)

{'058fae3544': {"Child's Play", 'Before I Wake'}, '06d8685ddb': {'IBOY', 'Cold Skin', 'Hardcore'}}
Recommended movies for user "1ec083a8bd":
Child's Play
Before I Wake
Cold Skin
Hardcore
IBOY


As we can see from the second example, if there are no common movies, the algorithm chooses movies directly in order of number of clicks, first from the most similar user and then from the second most similar user.