# Homework 4 - Recommendation systems and clustering everywhere

#### Group 9

<div style="float: left;">
    <table>
        <tr>
            <th>Student</th>
            <th>GitHub</th>
            <th>Matricola</th>
            <th>E-Mail</th>
        </tr>
        <tr>
            <td>André Leibrant</td>
            <td>JesterProphet</td>
            <td>2085698</td>
            <td>leibrant.2085698@studenti.uniroma1.it</td>
        </tr>
    </table>
</div>

#### Import Libraries and Modules

In [1]:
from typing import List

import pandas as pd
import random

In [2]:
pd.set_option("display.max_colwidth", None)

## 1. Recommendation system
Implementing a recommendation system is critical for businesses and digital platforms that want to thrive in today's competitive environment. These systems use data-driven personalization to tailor content, products, and services to individual user preferences. This personalization improves user engagement, satisfaction, retention, and revenue through increased sales and cross-selling opportunities. In this section, you will attempt to implement a recommendation system by identifying similar users' preferences and recommending movies they watch to the user under study.

To be more specific, you will implement your version of the [LSH algorithm](https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/), which will take as input the user's preferred genre of movies, find the most similar users to this user, and recommend the most watched movies by those who are more similar to the user.

**Data:** The data you will be working with can be found [here](https://www.kaggle.com/datasets/vodclickstream/netflix-audience-behaviour-uk-movies).

Looking at the data, you can see that there is data available for each user for the movies the user <u>clicked on</u>. Gather the **title** and **genre** of the **maximum top 10 movies** that each user clicked on regarding the **number of clicks**.

---

First we read the data into a Pandas DataFrame. Then we do a little bit of preprocessing:
1. Dropping the row ID column (the first column) because we don't need it
2. Converting the `datetime` column to a datetime type
3. Converting the `release_date` column to a datetime type (values that cannot be parsed become `NaT`)
4. Splitting the `genres` column into a new column `genre_list` that holds a list with all genres

In [3]:
# Read data into a Pandas DataFrame
dataset = pd.read_csv("vodclickstream_uk_movies_03.csv")

In [4]:
# Drop Row ID column
dataset.drop(dataset.columns[0], axis=1, inplace=True)

# Convert datetime column to a date
dataset.datetime = pd.to_datetime(dataset.datetime)

# Convert release_date column to a date
dataset.release_date = pd.to_datetime(dataset.release_date, errors="coerce")

# Extract genres column to a new column genre_list that includes a list with all genres
dataset["genre_list"] = dataset.genres.apply(lambda row: [word.strip() for word in row.split(",")])

In the next step we define a function `top_10_movies` that takes a `user_id` and extracts the 10 movies the user clicked on most. We group by the columns `user_id`, `movie_id`, `title`, and `genres` and count the rows of each group in a new aggregation column called `clicks`. At the end we sort by the column `clicks` in descending order and return only the top 10 entries with the columns `title`, `genres`, and `clicks`.

In [5]:
def top_10_movies(user_id: str) -> pd.DataFrame:
    """
    Function that takes a user id and returns the top 10 most clicked movies by the user.

    Args:
        user_id (str): User ID.

    Returns:
        top_10_movies (pd.DataFrame): Pandas DataFrame that includes the top 10 most clicked movies by the user.
    """
    
    # Filter all movies for given user
    movies = dataset[dataset["user_id"] == user_id]

    # Extract the top 10 movies regarding the number of clicks for the given user
    movies = movies.groupby(["user_id", "movie_id", "title", "genres"]).size().reset_index(name="clicks")
    top_10_movies = movies.sort_values(by="clicks", ascending=False).head(10)[["title", "genres", "clicks"]]

    # Reset the index of the dataframe
    top_10_movies.reset_index(drop=True,inplace=True)
    
    return top_10_movies

This is an example output for the user `b15926c011`.

In [6]:
# Given user id
user_id = "b15926c011"

# Retrieve top 10 movies of given user regarding the clicks
top_10_movies(user_id)

| | title | genres | clicks |
|----------|----------|----------|----------|
| 0 | Wild Child | Comedy, Drama, Romance | 23 |
| 1 | Zapped | Comedy, Family, Fantasy | 19 |
| 2 | Barely Lethal | Action, Comedy | 18 |
| 3 | #Horror | Crime, Drama, Horror, Mystery, Thriller | 16 |
| 4 | Innocence | Fantasy, Horror, Mystery, Romance, Thriller | 13 |
| 5 | You Get Me | Crime, Drama, Romance, Thriller | 12 |
| 6 | Minor Details | Adventure, Family, Mystery | 12 |
| 7 | The 5th Wave | Action, Adventure, Sci-Fi, Thriller | 10 |
| 8 | IBOY | Action, Crime, Sci-Fi, Thriller | 10 |
| 9 | Molly Moon and the Incredible Book of Hypnotism | Adventure, Family, Fantasy | 10 |
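
The function above filters the dataset for one user at a time. If the top movies are needed for every user, as the task statement asks, the same grouping can be done once for the whole click log. The following is a minimal sketch on toy data; `clicks_log` and `top_n_movies_per_user` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy click log; the real data comes from the Kaggle CSV.
clicks_log = pd.DataFrame({
    "user_id":  ["u1", "u1", "u1", "u1", "u1", "u1", "u2", "u2"],
    "movie_id": ["m1", "m1", "m1", "m2", "m2", "m3", "m1", "m1"],
    "title":    ["A", "A", "A", "B", "B", "C", "A", "A"],
    "genres":   ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Action", "Drama", "Drama"],
})


def top_n_movies_per_user(log: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return, for every user, the n movies with the most clicks."""
    # One row per (user, movie) with the number of clicks
    counts = (
        log.groupby(["user_id", "movie_id", "title", "genres"])
           .size()
           .reset_index(name="clicks")
    )
    # Most clicked movies first within each user
    counts = counts.sort_values(["user_id", "clicks"], ascending=[True, False])
    # Keep the first n rows of every user
    return counts.groupby("user_id").head(n).reset_index(drop=True)


top_2 = top_n_movies_per_user(clicks_log, n=2)
print(top_2)
```

Calling it with `n=10` on the full dataset would give one table with at most 10 rows per user.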


### 1.2 Minhash Signatures
Using the movie genre and user_ids, try to implement your min-hash signatures so that users with similar interests in a genre appear in the same bucket.

**Important note:** You must write your minhash function from scratch. You are not permitted to use any already implemented hash functions. Read the class materials and, if necessary, conduct an internet search. The description of hash functions in the [book](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) may be helpful as a reference.

---

First we create a set `unique_genres` that includes the unique values of all genres across our data. We use the `%%capture` cell magic to suppress the output of the cell.

In [7]:
%%capture

# Collect all unique genres in a set
unique_genres = set()
dataset["genre_list"].apply(lambda row: [unique_genres.add(value) for value in row])

Based on this set we create a new column `one_hot_genre_list` that contains only 0s and 1s. It is a binary representation of the genres of a movie: a 1 in the i-th position means that the movie has the i-th genre in the iteration order of `unique_genres`.

In [8]:
# Encode the genre list (which we treat as our shingles) of every movie to a one-hot list
dataset["one_hot_genre_list"] = dataset["genre_list"].apply(
    lambda genre_list: [1 if genre in genre_list else 0 for genre in unique_genres]
)
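
Because `unique_genres` is a `set` of strings, its iteration order is not guaranteed to be the same in a new Python session (string hashing is randomized by default), so position i may refer to a different genre after a restart. The sketch below shows one way to pin the positions, assuming alphabetical order is acceptable; `build_genre_index` and `one_hot` are hypothetical helpers, not part of the notebook.

```python
from typing import Dict, Iterable, List


def build_genre_index(genre_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Map every genre to a fixed position, using alphabetical order."""
    vocabulary = sorted({genre for genres in genre_lists for genre in genres})
    return {genre: position for position, genre in enumerate(vocabulary)}


def one_hot(genres: List[str], genre_index: Dict[str, int]) -> List[int]:
    """Encode a genre list as a 0/1 vector over the fixed vocabulary."""
    vector = [0] * len(genre_index)
    for genre in genres:
        vector[genre_index[genre]] = 1
    return vector


# Toy genre lists for illustration only
toy_genre_lists = [["Comedy", "Drama"], ["Action", "Thriller"], ["Drama"]]
genre_index = build_genre_index(toy_genre_lists)

print(genre_index)                               # {'Action': 0, 'Comedy': 1, 'Drama': 2, 'Thriller': 3}
print(one_hot(["Comedy", "Drama"], genre_index)) # [0, 1, 1, 0]
```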

In the next step we define our minhash function in the following way: We decided on a similarity signature based on 12 different hash functions, all defined in the function `hash_function`, which takes the row index of a genre and returns its 12 hash values. The function `minhash` takes the binary representation of the genres of a movie in `one_hot_genre_list` and keeps, for every hash function, the smallest hash value over all genres that are set to 1; these 12 minima form the similarity signature. The modulus of every hash function is based on the highest value the term before the modulo can take: we took the closest or second-closest prime number larger than that value to make our hash functions as precise as possible and avoid collisions.

In [9]:
def hash_function(element: int) -> List[int]:
    """
    Function that takes the row index of a genre and returns its hash values based on 12 different hash functions.

    Args:
        element (int): Row index of a genre in the one-hot genre list.

    Returns:
        hashes_result (List[int]): Hash function results.
    """
    
    # Create an empty list of size 12
    hashes_result = [None]*12
    
    # Calculate the 12 hash values of the given row index
    hashes_result[0] = (element + 1) % 29
    hashes_result[1] = (3*element + 1) % 83
    hashes_result[2] = (2*element + 4) % 59
    hashes_result[3] = (3*element - 1) % 83
    hashes_result[4] = (element << 1) % 59
    hashes_result[5] = (element >> 1) % 19
    hashes_result[6] = (element << 2) % 109
    hashes_result[7] = (element >> 2) % 11
    hashes_result[8] = (element << 3) % 211
    hashes_result[9] = (element >> 3) % 7
    hashes_result[10] = (element << 4) % 421
    hashes_result[11] = (element >> 4) % 5
    
    return hashes_result


def minhash(genre_list: List) -> List:
    """
    Function that takes a list of genres based on a binary representation and returns its minhash similarity
    signature.

    Args:
        genre_list (List): List of genres based on a binary representation.

    Returns:
        similarity_signature (List): Minhash Similarity Signature.
    """
    
    # Create list of size 12 and inf as the default value
    similarity_signature = [float("inf")]*12

    # Iterate through every element of the genre list
    for row_index, genre in enumerate(genre_list):
        
        # Calculate the hash values of the current genre and skip otherwise
        if genre == 1:
            
            # Retrieve hash values of current genre based on the row index
            hashes_result = hash_function(row_index)
            
            # Only update the similarity signature if a new hash value is smaller than the current one
            for i in range(0, 12):
                similarity_signature[i] = min(similarity_signature[i], hashes_result[i])

    return similarity_signature
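
The reason minhash signatures are useful is that the fraction of positions in which two signatures agree approximates the Jaccard similarity of the underlying sets. The sketch below illustrates this on two toy sets of row indices. It swaps in a generic hash family `(a*x + b) % p` with made-up coefficients, so it is not the set of 12 functions defined above, and with only six hash functions the estimate is rough.

```python
from typing import List, Set

# Hypothetical hash family h(x) = (a*x + b) % PRIME; the prime and the
# coefficients are illustrative assumptions, not the notebook's functions.
PRIME = 31
COEFFICIENTS = [(1, 1), (3, 1), (5, 2), (7, 3), (11, 5), (13, 7)]


def toy_minhash(active_rows: Set[int]) -> List[int]:
    """Return one minimum per hash function over the rows that are set to 1."""
    return [min((a * row + b) % PRIME for row in active_rows) for a, b in COEFFICIENTS]


def jaccard(left: Set[int], right: Set[int]) -> float:
    """Exact Jaccard similarity of two sets."""
    return len(left & right) / len(left | right)


def signature_agreement(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions with identical values."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Row indices of the genres that are set to 1 for two toy movies
movie_a = {9, 15, 18}
movie_b = {9, 15, 23}

print(jaccard(movie_a, movie_b))  # 0.5 (2 shared rows out of 4 distinct rows)
print(signature_agreement(toy_minhash(movie_a), toy_minhash(movie_b)))
```

With more hash functions the agreement fraction tends to get closer to the exact Jaccard value.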

Using `minhash` we create a new column `minhash` transforming the binary representation in `one_hot_genre_list` to the similarity signature of every movie.

**Note:** We are decreasing from `n = len(unique_genres)` elements to 12!

In [10]:
# Create the minhash similarity signature for every movie
dataset["minhash"] = dataset["one_hot_genre_list"].apply(lambda genre_list: minhash(genre_list))

Finally we define our LSH function `lsh`, which splits every similarity signature into `b` buckets (bands) of `r` rows each and maps every bucket to a single value with the same hash function. We decided to use Python's built-in `hash` function on the tuple of values in each bucket. The retrieved value is then reduced modulo the prime number 997.

In [11]:
def lsh(similarity_signature: List, num_buckets: int, num_rows: int) -> List:
    """
    Function that transforms a taken similarity signature based on the given number of buckets and number of rows
    to the final band of hashes using the same hash function for every bucket.

    Args:
        similarity_signature (List): Minhash Similarity Signature.
        num_buckets (int): Number of buckets the Similarity Signature is divided into.
        num_rows (int): Number of rows every bucket has.

    Returns:
        band_hashes (List): Final band of hashes.
    """
    
    # Set seed for reproducibility
    random.seed(42)

    # Create empty list
    band_hashes = []

    # Iterate through every bucket
    for bucket_start in range(0, len(similarity_signature), num_rows):
        
        # Extract current bucket
        bucket = similarity_signature[bucket_start:bucket_start+num_rows]
        
        # Calculate band hash for current bucket
        band_hash = hash(tuple(bucket)) % 997
        
        # Append hash value to final band
        band_hashes.append(band_hash)

    return band_hashes

Using `lsh` we create a new column `lsh_bands` transforming the Minhash Similarity Signature of every movie to the final LSH band.

**Note:** We are decreasing from `n = 12` elements to 4!
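
The choice of `b` and `r` controls how similar two signatures must be before they are likely to collide. Under the standard banding analysis (two items become candidates if at least one band is identical, and the hash functions behave independently), two signatures that agree in a fraction `s` of their positions collide with probability `1 - (1 - s^r)^b`. The sketch below evaluates this for `b = 4` and `r = 3`; `candidate_probability` is a hypothetical helper. Note that the similarity search in section 1.3 compares the complete list of band hashes, which is a stricter rule than "at least one band".

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that at least one of `bands` bands of `rows` rows matches."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Evaluate the curve for the configuration used in this notebook
for s in (0.2, 0.5, 0.8, 1.0):
    print(s, round(candidate_probability(s, bands=4, rows=3), 3))
```

Increasing `rows` makes a match per band harder, while increasing `bands` gives more chances for a match.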

In [12]:
# Given number of buckets and number of rows
num_buckets = 4
num_rows = 3

# Create LSH band for every movie
dataset["lsh_bands"] = dataset["minhash"].apply(lambda similarity_signature: lsh(similarity_signature,
                                                                                 num_buckets,
                                                                                 num_rows))

In the following output you can see the dataset with all newly created columns.

In [13]:
dataset

| | datetime | duration | title | genres | release_date | movie_id | user_id | genre_list | one_hot_genre_list | minhash | lsh_bands |
|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 0 | 2017-01-01 01:15:09 | 0.0 | Angus, Thongs and Perfect Snogging | Comedy, Drama, Romance | 2008-07-25 | 26bd5987e8 | 1dea19f6fe | [Comedy, Drama, Romance] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | [10, 28, 22, 26, 18, 4, 36, 2, 72, 1, 144, 0] | [633, 444, 12, 337] |
| 1 | 2017-01-01 13:56:02 | 0.0 | The Curse of Sleeping Beauty | Fantasy, Horror, Mystery, Thriller | 2016-06-02 | f26ed2675e | 544dcbc510 | [Fantasy, Horror, Mystery, Thriller] | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0] | [4, 10, 10, 8, 6, 1, 12, 0, 24, 0, 48, 0] | [509, 720, 23, 637] |
| 2 | 2017-01-01 15:17:47 | 10530.0 | London Has Fallen | Action, Thriller | 2016-03-04 | f77e500e7a | 7cbcc791bf | [Action, Thriller] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0] | [23, 67, 48, 65, 44, 11, 88, 5, 176, 2, 352, 1] | [314, 318, 336, 587] |
| 3 | 2017-01-01 16:04:13 | 49.0 | Vendetta | Action, Drama | 2015-06-12 | c74aec7673 | ebf43c36b6 | [Action, Drama] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0] | [10, 28, 22, 26, 18, 4, 36, 2, 72, 1, 144, 0] | [633, 444, 12, 337] |
| 4 | 2017-01-01 19:16:37 | 0.0 | The SpongeBob SquarePants Movie | Animation, Action, Adventure, Comedy, Family, Fantasy | 2004-11-19 | a80d6fc2aa | a57c992287 | [Animation, Action, Adventure, Comedy, Family, Fantasy] | [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0] | [2, 4, 6, 2, 2, 0, 4, 0, 8, 0, 16, 0] | [456, 205, 387, 463] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671731 | 2019-06-30 21:37:08 | 851.0 | Oprah Presents When They See Us Now | Talk-Show | 2019-06-12 | 43cd23f30f | 57501964fd | [Talk-Show] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] | [21, 61, 44, 59, 40, 10, 80, 5, 160, 2, 320, 1] | [528, 439, 642, 244] |
| 671732 | 2019-06-30 21:49:34 | 91157.0 | HALO Legends | Animation, Action, Adventure, Family, Sci-Fi | 2010-02-16 | febf42d55f | d4fcb079ba | [Animation, Action, Adventure, Family, Sci-Fi] | [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0] | [2, 4, 6, 2, 2, 0, 4, 0, 8, 0, 16, 0] | [456, 205, 387, 463] |
| 671733 | 2019-06-30 22:00:44 | 0.0 | Pacific Rim | Action, Adventure, Sci-Fi | 2013-07-12 | 7b15e5ada1 | 4a14a2cd5a | [Action, Adventure, Sci-Fi] | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0] | [2, 4, 6, 2, 2, 0, 4, 0, 8, 0, 16, 0] | [456, 205, 387, 463] |
| 671734 | 2019-06-30 22:04:23 | 0.0 | ReMastered: The Two Killings of Sam Cooke | Documentary, Music | 2019-02-08 | 52d49c515a | 0b8163ea4b | [Documentary, Music] | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [8, 22, 18, 20, 14, 3, 28, 1, 56, 0, 112, 0] | [912, 665, 787, 981] |


### 1.3 Locality-Sensitive Hashing (LSH)
Now that your buckets are ready, it's time to ask a few queries. We will provide you with some user_ids and ask you to recommend at **most five movies** to the user to watch based on the movies clicked by similar users.

To recommend at most five movies given a user_id, use the following procedure:
1. Identify the <u>two most similar</u> users to this user.
2. If these two users have any movies **in common**, recommend those movies based on the total number of clicks by these users.
3. If there are **no more common** movies, try to propose the most clicked movies by the **most similar user first**, followed by the other user.

**Note:** At the end of the process, we expect to see at most five movies recommended to the user.

**Example:** assume you've identified user **A** and **B** as the most similar users to a single user, and we have the following records on these users:

- User A with 80% similarity
- User B with 50% similarity

| user | movie title | #clicks |
|----------|----------|----------|
| A | Wild Child | 20 |
| A | Innocence | 10 |
| A | Coin Heist | 2 |
| B | Innocence | 30 |
| B | Coin Heist | 15 |
| B | Before I Fall | 30 |
| B | Beyond Skyline | 8 |
| B | The Amazing Spider-Man | 5 |

**Recommended movies** in order:
- Innocence
- Coin Heist
- Wild Child
- Before I Fall
- Beyond Skyline
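
The three-step procedure can be checked against the example with a few lines of plain Python. This is a minimal sketch under the assumption that each user's clicks are available as a dictionary from title to click count; `recommend` is a hypothetical helper, not part of the assignment.

```python
from typing import Dict, List


def recommend(most_similar: Dict[str, int], second_similar: Dict[str, int], limit: int = 5) -> List[str]:
    """Rank common movies by total clicks, then each user's own movies by clicks."""
    common = set(most_similar) & set(second_similar)

    # Step 2: movies both users clicked on, ordered by the sum of their clicks
    ranked_common = sorted(common, key=lambda t: most_similar[t] + second_similar[t], reverse=True)

    # Step 3: remaining movies, most similar user first, each ordered by clicks
    def own_ranking(clicks: Dict[str, int]) -> List[str]:
        return sorted((t for t in clicks if t not in common), key=lambda t: clicks[t], reverse=True)

    ranked = ranked_common + own_ranking(most_similar) + own_ranking(second_similar)
    return ranked[:limit]


# Click counts from the example table (user A is the more similar one)
user_a = {"Wild Child": 20, "Innocence": 10, "Coin Heist": 2}
user_b = {"Innocence": 30, "Coin Heist": 15, "Before I Fall": 30,
          "Beyond Skyline": 8, "The Amazing Spider-Man": 5}

print(recommend(user_a, user_b))
# ['Innocence', 'Coin Heist', 'Wild Child', 'Before I Fall', 'Beyond Skyline']
```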

---

We use the final column `lsh_bands` to retrieve the users most similar to a given user. The function `get_two_most_similar_users` defines similarity in the following way: First we extract the unique `lsh_bands` of the given user. Then we retrieve all movies that have **exactly** the same `lsh_bands`. Finally we group by `user_id` and compute two aggregated values: the total `movie_count` of each user and the `similarity` score, which is `lsh_bands_count` (the number of the given user's unique `lsh_bands` that the other user shares) divided by the size of `user_lsh_bands`. We use `movie_count` as a tie breaker if the `similarity` score is the same. At the end we return only the top-ranked entries; the code keeps the first five rows, of which the first two are the two most similar users.

In [14]:
def get_two_most_similar_users(user_id: str) -> pd.DataFrame:
    """
    Function that takes a user and returns the two most similar users based on the movie genres.

    Args:
        user_id (str): User ID.

    Returns:
        similar_users (pd.DataFrame): Pandas DataFrame with the two most similar users.
    """
    
    # Retrieve the unique lsh bands of the given user
    user_lsh_bands = dataset[dataset["user_id"] == user_id].drop_duplicates(subset=["lsh_bands"])
    user_lsh_bands = user_lsh_bands["lsh_bands"].tolist()

    # Create empty list
    similar_movies = []
    
    # Iterate through every unique lsh band
    for user_lsh_band in user_lsh_bands:
        
        # Only save the movies which have exactly the same lsh band
        similar_movies.append(dataset[dataset["lsh_bands"].apply(lambda lsh_band: lsh_band == user_lsh_band)])

    # Transform result into a Pandas DataFrame
    similar_movies = pd.concat(similar_movies)
    
    # Group by the user_id and lsh_bands and retrieve movie_count
    similar_movies["lsh_bands"] = similar_movies["lsh_bands"].apply(tuple)
    similar_users = similar_movies.groupby(["user_id", "lsh_bands"]).size().reset_index(name="movie_count")
    
    # Group by the user_id and retrieve lsh_bands_count and movie_count for every user
    similar_users = similar_users.groupby("user_id").agg({"lsh_bands": "count", "movie_count": "sum"})
    similar_users = similar_users.rename(columns={"lsh_bands": "lsh_bands_count",
                                                  "movie_count": "movie_count"})
    
    # Calculate the similarity score for every user
    similar_users["similarity"] = (similar_users["lsh_bands_count"] / len(user_lsh_bands)).round(2)
    
    # Sort by the highest similarity score and use movie_count as tie breaker
    similar_users = similar_users.sort_values(by=["similarity", "movie_count"], ascending=[False, False])
    
    # Remove the given user from the results and keep only the five top-ranked users
    similar_users = similar_users[similar_users.index != user_id].head(5)
    
    return similar_users
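
The function above scans the whole DataFrame once per unique band list of the user. A common alternative in LSH is to build a bucket index once and look up candidates directly. The sketch below illustrates the idea on toy data, assuming each user is represented by the band lists of the movies they clicked on; `build_band_index` and `candidates` are hypothetical names, and the sketch uses the usual "share at least one band" rule instead of the exact match on all bands used in this notebook.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_band_index(user_bands: Dict[str, List[List[int]]]) -> Dict[Tuple[int, int], Set[str]]:
    """Map every (band position, band hash) pair to the users that have it."""
    index: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
    for user, band_lists in user_bands.items():
        for bands in band_lists:
            for position, band_hash in enumerate(bands):
                index[(position, band_hash)].add(user)
    return index


def candidates(user: str, user_bands: Dict[str, List[List[int]]],
               index: Dict[Tuple[int, int], Set[str]]) -> Set[str]:
    """Return all other users that share at least one band with the given user."""
    found: Set[str] = set()
    for bands in user_bands[user]:
        for position, band_hash in enumerate(bands):
            found |= index[(position, band_hash)]
    found.discard(user)
    return found


# Toy band lists (one list of 4 band hashes per clicked movie)
toy_user_bands = {
    "u1": [[633, 444, 12, 337]],
    "u2": [[633, 444, 12, 337], [456, 205, 387, 463]],
    "u3": [[509, 720, 12, 637]],
    "u4": [[314, 318, 336, 587]],
}
band_index = build_band_index(toy_user_bands)

print(sorted(candidates("u1", toy_user_bands, band_index)))  # ['u2', 'u3']
```

Here `u3` becomes a candidate because it shares the hash 12 in the third band, which the exact-match rule would not accept.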

In the following output you can see the most similar users for the user `b15926c011`. The function returns five rows; the first two are the two most similar users.

In [15]:
# Given user id
user_id = "b15926c011"

# Retrieve the two most similar users given the user
two_most_similar_users = get_two_most_similar_users(user_id)
two_most_similar_users

| user_id | lsh_bands_count | movie_count | similarity |
|----------|----------|----------|----------|
| 779343a3ea | 13 | 484 | 0.81 |
| 89fbb087f3 | 13 | 278 | 0.81 |
| 322f2bd4d4 | 13 | 172 | 0.81 |
| 9eaa2fc081 | 13 | 142 | 0.81 |
| e259fd87b7 | 12 | 162 | 0.75 |


In the last step we retrieve all movies clicked by the users returned above. We add the similarity score from the previous result because we first recommend the movies that several of these users clicked on and then the movies from the most similar user, using the clicks as a tie breaker. In the final result we return the top 5 movies.

In [16]:
# Retrieve all movies from the most similar users
movies = dataset[dataset["user_id"].isin(two_most_similar_users.index.tolist())]

# Merge the movies with the previous table to add the similarity score
movies = pd.merge(two_most_similar_users, movies, on="user_id")

# Group by user_id, movie_id, and similarity and calculate new aggregated value clicks
movies = movies.groupby(["user_id", "movie_id", "similarity"]).size().reset_index(name="clicks")
movies = movies.groupby("movie_id").agg({"user_id": "count", "clicks": "sum", "similarity": "sum"})
movies = movies.rename(columns={"user_id": "common"})

# Replace the similarity score with the similarity score of the most similar user if the score is larger than that
threshold_similarity = two_most_similar_users.loc[two_most_similar_users.index[0], "similarity"]
movies.loc[movies["similarity"] > threshold_similarity,"similarity"] = threshold_similarity

# Order by the common column first to recommend the movies which several users clicked on and then the movies from
# the user with the highest similarity and clicks as the tie breaker and return only the first 5 movies
movies = movies.sort_values(by=["common", "similarity", "clicks"], ascending=[False, False, False]).head(5)

# Retrieve the title of every movie and set movie_id as the index
top_5_similar_movies = dataset[dataset["movie_id"].isin(movies.index.tolist())]
top_5_similar_movies = top_5_similar_movies.drop_duplicates(subset=["movie_id"]).loc[:, ["movie_id", "title"]]
top_5_similar_movies = top_5_similar_movies.reset_index(drop=True,inplace=False).set_index("movie_id")
top_5_similar_movies

| movie_id | title |
|----------|----------|
| ed2f7aad6a | Natural Selection |
| fb1e870bea | 6 Years |
| 080f74d403 | Knock, Knock |
| b9ef0f30ff | Rip Tide |
| 4884acf862 | Candy Jar |
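
Filtering with `isin` returns rows in the order in which they appear in `dataset`, not in the ranked order of `movies`. If the order of the recommendations should follow the ranking, the titles can be looked up through a movie-indexed Series instead. The following is a minimal sketch on toy data; `ranked_ids` and `catalog` are hypothetical stand-ins for `movies.index` and `dataset`.

```python
import pandas as pd

# Hypothetical ranking (best first) and a toy catalog with repeated clicks
ranked_ids = ["m3", "m1", "m2"]
catalog = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3", "m1"],
    "title": ["Alpha", "Beta", "Gamma", "Alpha"],
})

# One title per movie, indexed by movie_id
titles = catalog.drop_duplicates(subset=["movie_id"]).set_index("movie_id")["title"]

# Reindex by the ranked ids so the output follows the ranking order
ranked_titles = titles.reindex(ranked_ids)

print(ranked_titles.tolist())  # ['Gamma', 'Alpha', 'Beta']
```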
