## **Libraries needed to run the notebook**

In [88]:
# All the libraries needed to run the Notebook

from itertools import chain
import pandas as pd


# **1. Recommendation system** 


The exercise asks for the top 10 movies clicked by each user, so I implemented a function named "extract_top_movies_per_user". This function takes the path of the dataset as input and returns a new dataset with the following details for each user:

- *user_id:* The unique identifier of the user
- *title:* The title of the movie
- *genres:* The genres of the movie
- *click_count:* The number of times the user clicked on the movie

The function uses pandas to read the dataset (only its first 10,000 rows are kept), groups the data by user_id, title, and genres, counts the clicks for each movie, and then selects the 10 movies with the highest click counts for each user. The resulting dataset is sorted by user_id in ascending order and by click_count in descending order.
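
The count-then-truncate logic described above can be sketched in plain Python, which is handy for checking the pandas result by hand on a small sample. This is only an illustration: the click log and the helper name `top_titles_per_user` are hypothetical and are not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical click log: one (user_id, title) pair per click
clicks = [
    ("u1", "Movie A"), ("u1", "Movie B"), ("u1", "Movie A"),
    ("u2", "Movie C"), ("u1", "Movie C"), ("u1", "Movie A"),
]

def top_titles_per_user(clicks, n=10):
    """Return {user_id: [(title, click_count), ...]} with at most n titles per user."""
    counts = defaultdict(Counter)
    for user_id, title in clicks:
        counts[user_id][title] += 1
    # most_common(n) returns the n most clicked titles, highest count first
    return {user: counter.most_common(n) for user, counter in counts.items()}

top = top_titles_per_user(clicks, n=2)
for user, titles in top.items():
    print(user, titles)
```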


In [89]:
def extract_top_movies_per_user(file_path):
    # Read the dataset from the input path, keeping only the first 10,000 rows
    df = pd.read_csv(file_path).head(10000)

    # Grouping by user_id, title, and genres to count the number of clicks for each movie for each user
    user_movie_counts = df.groupby(['user_id', 'title', 'genres']).size().reset_index(name='click_count')

    # Sorting based on the number of clicks for each user
    user_movie_counts.sort_values(by=['user_id', 'click_count'], ascending=[True, False], inplace=True)

    # Keep only the top 10 clicked movies for each user; the rows are already
    # sorted by user_id and by click_count in descending order, and
    # groupby().head() preserves that order
    result = user_movie_counts.groupby('user_id').head(10)

    return result

In [90]:
top_movies_per_user = extract_top_movies_per_user('vodclickstream_uk_movies_03.csv')

# Printing only the first 10 rows to provide an overview of the dataset
top_movies_per_user.head(10)

| | user_id | title | genres | click_count |
|---|---|---|---|---|
| 0 | 0005d9a8f4 | Joe and Caspar Hit the Road | Documentary, Adventure, Comedy | 1 |
| 1 | 001991be8a | Star Trek: First Contact | Action, Adventure, Drama, Sci-Fi, Thriller | 2 |
| 2 | 0029f6bb1e | 6 Bullets | Action, Crime, Drama, Thriller | 1 |
| 3 | 0029f6bb1e | A Thousand Words | Comedy, Drama | 1 |
| 4 | 0029f6bb1e | Adore | Drama, Romance | 1 |
| 5 | 0029f6bb1e | American Heist | Action, Crime, Drama, Thriller | 1 |
| 6 | 0029f6bb1e | Anger Management | Comedy | 1 |
| 7 | 0029f6bb1e | Bee Movie | Animation, Adventure, Comedy, Family | 1 |
| 8 | 0029f6bb1e | Big Mommas: Like Father, Like Son | Action, Comedy, Crime | 1 |
| 9 | 0029f6bb1e | Damage | Action, Drama | 1 |


## **1.2 MinHash Signatures**

To calculate the MinHash signature, we started the process by building a *shingle matrix* for each user. This matrix was built by enumerating all possible genres, assigning unique integers to them, and creating a list for each user. In this list, each position was set to 1 if the genre at that position in the list of all possible genres was present in the user's list of preferred genres; otherwise, the position was set to zero.
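
As a small illustration of this encoding, the sketch below builds the binary vector for one user. The four-genre vocabulary and the helper name `genre_vector` are hypothetical and only serve the example.

```python
# Hypothetical genre vocabulary, sorted so that the integer ids are reproducible
all_genres = ["Action", "Comedy", "Drama", "Sci-Fi"]
genre_dict = {genre: i for i, genre in enumerate(all_genres)}

def genre_vector(user_genres, genre_dict):
    """Binary vector with a 1 at the position of every genre the user prefers."""
    vector = [0] * len(genre_dict)
    for genre in user_genres:
        vector[genre_dict[genre]] = 1
    return vector

vector = genre_vector(["Drama", "Action"], genre_dict)
print(vector)  # [1, 0, 1, 0]
```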

After building the shingle matrix, we proceeded to calculate the MinHash signature using the minhash_signature function. This function returns a list representing the MinHash signature for the given user based on their preferred genres. The input parameters for this function include:
- *user_genres:* List of preferred genres for the user
- *num_hashes:* Number of hash functions to use
- *genre_dict:* Dictionary mapping genres to unique integers
- *num_total_genres:* Total number of unique genres in the dataset

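As requested, the hash function is implemented from scratch, without relying on any existing hash function. We chose 100 hash functions, a common choice for MinHash: with k independent hash functions, the standard MinHash estimate of the Jaccard similarity has a standard error of at most 0.5/sqrt(k), which is 0.05 for k = 100.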
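
For reference, here is a minimal sketch of the textbook MinHash estimator, in which the Jaccard similarity of two sets is approximated by the fraction of signature positions where the two signatures agree. It is not the notebook's implementation: the hash family `(a * x + b) % PRIME` and the names `make_hash_params`, `signature` and `estimate_jaccard` are assumptions made for this example only.

```python
import random

PRIME = 2_147_483_647  # the prime 2**31 - 1, larger than any element id used here

def make_hash_params(num_hashes, seed=0):
    """Draw one (a, b) pair per hash function h(x) = (a * x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def signature(element_ids, params):
    """MinHash signature of a non-empty set of integer ids: one minimum per hash function."""
    return [min((a * x + b) % PRIME for x in element_ids) for a, b in params]

def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures hold the same value."""
    matches = sum(1 for v1, v2 in zip(sig1, sig2) if v1 == v2)
    return matches / len(sig1)

params = make_hash_params(100)
sig_a = signature({0, 2, 5}, params)
sig_b = signature({0, 2, 7}, params)
# The true Jaccard similarity of the two sets is 2/4 = 0.5;
# the estimate printed below should land in that neighborhood.
print(estimate_jaccard(sig_a, sig_b))
```

Comparing signatures position by position, rather than as unordered sets of values, is what ties the estimate to the Jaccard similarity of the original sets.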

In [91]:
def minhash_signature(user_genres, num_hashes, genre_dict, num_total_genres):

    # user_genres is a list of genre strings; only the first string is used here,
    # split into individual genres to build the shingle matrix
    user_genres = user_genres[0].strip().split(', ')

    # Creating the shingle matrix using the genre dictionary and the user genres
    shingle_matrix = [0]*num_total_genres
    for genre in user_genres:
        # Setting to 1 only the positions of the preferred genres
        shingle_matrix[genre_dict.get(genre)] = 1

    # Now let's calculate the MinHash signature: first we initialize it with positive infinity values
    minhash_signature = [float('inf')] * num_hashes
    # We iterate over the rows of the shingle matrix and, if the genre is present in the user's preferences, we hash the row with each hash function
    for i in range(num_total_genres):
        if shingle_matrix[i] == 1:
            for j in range(num_hashes):
                # Vary the seed to obtain different hash values
                hash_value = hash_function(i + j)
                
                # Update the MinHash signature in position j by taking the minimum hash value
                minhash_signature[j] = min(minhash_signature[j], hash_value)

    return minhash_signature

# Hash function created not using already implemented hash functions, as requested
def hash_function(x):
    hash_value = 17  # Arbitrary initial value

    while x:
        digit = x % 10
        hash_value = (hash_value * 31) + digit  # Multiplication by an arbitrary prime and addition with the digit
        x //= 10  # Removing the last digit

    return hash_value

In [92]:
# Dictionary that keeps track of each user's preferred genres
user_genres_dict = top_movies_per_user.groupby('user_id')['genres'].apply(list).to_dict()

# Making a sorted list of all unique genres in the dataset and assigning a unique integer to each one; the minhash function uses this mapping to build the shingle matrix of each user
all_genres = sorted(list(set(chain.from_iterable([genre.split(', ') if isinstance(genre, str) else genre for genre in top_movies_per_user['genres']]))))
genre_dict = {genre: i for i, genre in enumerate(all_genres)}
num_total_genres = len(all_genres)
    
# Using the defined function to calculate the MinHash signatures for all users
num_hashes = 100
minhash_signatures = {user_id: minhash_signature(genres, num_hashes, genre_dict, num_total_genres) for user_id, genres in user_genres_dict.items()}


## **1.3 Locality-Sensitive Hashing (LSH)**

Now that we have a dictionary containing the signatures for each user, we can utilize the Locality-Sensitive Hashing (LSH) algorithm to identify users with similar interests. In this algorithm, users with similar signatures are grouped into the same bucket. This approach helps in reducing the number of users for which we need to calculate the Jaccard similarity.

To find the two users most similar to the input user, we calculate the Jaccard similarity between the input user and every other user in the same bucket, computed on the sets of values in their MinHash signatures. We then select the two users with the highest similarity scores. This process is done through the 'find_nearest_neighbors' function, which first executes the LSH algorithm to organize users into buckets based on their MinHash signatures and then calculates the similarity scores to find the **two most similar users**. The function takes the following inputs:
- *target_user:* The user for which we want to find the most similar users
- *minhash_signatures:* The dictionary containing the MinHash signatures for each user
- *bands:* The number of bands used in the LSH algorithm
- *rows_per_band:* The number of rows per band in the LSH algorithm

We selected 6 bands of 10 rows each (covering the first 60 of the 100 signature values), a combination chosen so that the similarity requirement between users is not too restrictive.
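
To get a feel for how the number of bands and rows shapes the bucketing, the sketch below evaluates the textbook banding formula: if each signature row agrees with probability s, two users share all rows of at least one band with probability 1 - (1 - s^r)^b. This assumes independent rows and the standard banding scheme, so it is only a rough guide to the behavior of the code in this notebook. The function name `candidate_probability` is hypothetical.

```python
def candidate_probability(similarity, bands, rows_per_band):
    """Probability that two signatures agree on every row of at least one band,
    assuming each row agrees independently with probability `similarity`."""
    return 1 - (1 - similarity ** rows_per_band) ** bands

bands, rows_per_band = 6, 10

# Rule-of-thumb similarity at which the curve rises most steeply: (1/b)^(1/r)
threshold = (1 / bands) ** (1 / rows_per_band)
print(f"approximate threshold: {threshold:.3f}")

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"s = {s:.2f} -> {candidate_probability(s, bands, rows_per_band):.3f}")
```

With 6 bands of 10 rows the approximate threshold is about 0.84. Under these assumptions, using fewer rows per band or more bands lowers it.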

In [93]:
def lsh(minhash_signatures, num_bands, band_size):

    # Global dictionary collecting the buckets of all bands
    buckets = {}

    for band in range(num_bands):
        band_start = band * band_size
        band_end = (band + 1) * band_size

        # Buckets for the current band only
        band_buckets = {}

        # Iterating over users and hashing each user's slice of the signature for this band
        for user_id, signature in minhash_signatures.items():
            band_slice = signature[band_start:band_end]
            band_hash = hash(tuple(band_slice))

            # Appending the user to the bucket if it already exists, otherwise creating it
            if band_hash in band_buckets:
                band_buckets[band_hash].append(user_id)
            else:
                band_buckets[band_hash] = [user_id]

        # Merging band-specific buckets into the global buckets
        for bucket, users_in_bucket in band_buckets.items():
            if bucket in buckets:
                buckets[bucket].extend(users_in_bucket)
            else:
                buckets[bucket] = users_in_bucket

    return buckets

# Simple function that applies the Jaccard similarity formula to two sets
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))

    # Two empty sets have an empty union; their similarity is defined as 0 here
    if union > 0:
        result = intersection / union
    else:
        result = 0

    return result

# Function to find the top 2 most similar users
def find_nearest_neighbors(target_user, minhash_signatures, bands, rows_per_band):
    
    # Apply LSH to divide users in buckets
    similar_users = lsh(minhash_signatures, bands, rows_per_band)

    # Find the bucket in which the target user is present
    target_bucket = None
    for _, users_in_bucket in similar_users.items():
        if target_user in users_in_bucket:
            target_bucket = users_in_bucket
            break

    if target_bucket is None:
        print(f"User {target_user} not found in buckets.")
        return []

    # Calculate the Jaccard similarity between the sets of signature values of the target user and of each other user in the same bucket
    jaccard_similarities = {}
    for user_id in target_bucket:
        if user_id != target_user:
            set1 = set(minhash_signatures[target_user])
            set2 = set(minhash_signatures[user_id])
            similarity = jaccard_similarity(set1, set2)
            jaccard_similarities[user_id] = similarity

    # And find the top 2 users with the highest similarity
    nearest_neighbors = sorted(jaccard_similarities.items(), key=lambda x: x[1], reverse=True)[:2]

    # Lastly, extract only the user IDs without similarity scores
    nearest_neighbors = [user_id for user_id, _ in nearest_neighbors]

    return nearest_neighbors

In [94]:
# Target user is the user to whom we want to recommend movies
target_user = '0823fa7f47'
bands = 6
rows_per_band = 10
    
most_similar_users = find_nearest_neighbors(target_user, minhash_signatures, bands, rows_per_band)

print(f"The two most similar users to {target_user} are: {most_similar_users}")


The two most similar users to 0823fa7f47 are: ['0449970712', '04f987c36b']


Now that we have identified the two most similar neighbors, we can recommend movies as specified by the exercise. We first pick the movies that these two users have in common, ordered by number of clicks. If these are not enough, we add the most clicked movies of the most similar user, followed by those of the other user.

We do this with the "recommend_movies" function, which takes as input:
- *most_similar_users:* list containing the two most similar users
- *top_movies_per_user:* DataFrame containing information about the top movies per user

It returns the ordered list of recommended movies, at most five.
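
The ordering rule can be illustrated on plain dictionaries before reading the pandas version. This is a simplified sketch under stated assumptions: the helper `order_recommendations` and the click counts are hypothetical, and common movies are ranked here by the combined clicks of the two neighbors only.

```python
def order_recommendations(first_clicks, second_clicks, limit=5):
    """first_clicks / second_clicks: hypothetical {title: click_count} dicts for
    the most similar and the second most similar user."""
    common = set(first_clicks) & set(second_clicks)
    # Common titles first, ranked by the combined clicks of the two neighbors
    ranked = sorted(common, key=lambda t: first_clicks[t] + second_clicks[t], reverse=True)
    # Then each neighbor's remaining titles, most clicked first
    for clicks in (first_clicks, second_clicks):
        for title in sorted(clicks, key=clicks.get, reverse=True):
            if title not in ranked:
                ranked.append(title)
    return ranked[:limit]

# Hypothetical click counts for the two neighbors
first = {"Criminal": 1, "Skiptrace": 3, "World War Z": 2}
second = {"Criminal": 2, "Adore": 1}
print(order_recommendations(first, second))
# ['Criminal', 'Skiptrace', 'World War Z', 'Adore']
```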

In [106]:
def recommend_movies(most_similar_users, top_movies_per_user):

    common_movies = set()
    user_movies_dict = {}
    recommended_movies = []

    # Finding the clicked movies for each neighbor
    for neighbor_id in most_similar_users:
        user_movies = set(top_movies_per_user[top_movies_per_user['user_id'] == neighbor_id]['title'])
        user_movies_dict[neighbor_id] = user_movies

    print(user_movies_dict)

    # Finding first the common movies among the neighbors and adding them if present, given the first condition to add movies
    common_movies = set.intersection(*user_movies_dict.values())

    if common_movies:
        common_movies_df = top_movies_per_user[top_movies_per_user['title'].isin(common_movies)]
        recommended_movies = common_movies_df.groupby('title')['click_count'].sum().sort_values(ascending=False).head(5).index.tolist()

    # If there are no more common movies and the recommended movies are less than 5, recommend the most clicked movies by the most similar neighbor first
    if len(recommended_movies) < 5:
        
        most_similar_neighbor = most_similar_users[0]
        most_similar_neighbor_movies = top_movies_per_user[top_movies_per_user['user_id'] == most_similar_neighbor].sort_values(by='click_count', ascending=False)['title'].tolist()
        # Deleting movies already present in recommended_movies
        most_similar_neighbor_movies = [movie for movie in most_similar_neighbor_movies if movie not in recommended_movies]
        remaining_recommendations = most_similar_neighbor_movies[:5 - len(recommended_movies)]
        recommended_movies += remaining_recommendations

        # If there are still less than 5 recommendations, add movies from the second neighbor based on clicks
        if len(recommended_movies) < 5:
            second_neighbor = most_similar_users[1]
            second_neighbor_movies = top_movies_per_user[top_movies_per_user['user_id'] == second_neighbor].sort_values(by='click_count', ascending=False)['title'].tolist()
            # Deleting movies already present in recommended_movies
            second_neighbor_movies = [movie for movie in second_neighbor_movies if movie not in recommended_movies]    
            remaining_recommendations = second_neighbor_movies[:5 - len(recommended_movies)]
            recommended_movies += remaining_recommendations



    return recommended_movies

In [107]:
recommended_movies = recommend_movies(most_similar_users, top_movies_per_user)

print(f"Recommended movies for user \"{target_user}\":")
for movie in recommended_movies:
    print(movie)

{'0449970712': {'Skiptrace', 'Some Kind of Hero', 'Winter in Wartime', 'The Water Diviner', 'The Man from the Future', 'Criminal', 'Return to the USS Atlanta: Defender of Guadalcanal', 'World War Z'}, '04f987c36b': {'Criminal'}}
Recommended movies for user "0823fa7f47":
Criminal
Return to the USS Atlanta: Defender of Guadalcanal
Skiptrace
Some Kind of Hero
The Man from the Future
