# 103089 - Data mining

<center><img src="media/M-UdL2.png"  width="300" alt="Universitat de Lleida"></center>

## Activity 2: Cosine similarity for movie comparison 

In this exercise you have to implement the following in a Python notebook using the Spark framework:

1. The distributed (map/reduce) algorithm of slide "3.7" (in the notebook "8-Item-to-Items-globalfiltering-recommenders-py3-sshow.ipynb") for computing the cosine similarity of a set of products with negative and positive ratings. The input is an RDD (or a Spark DataFrame, which is also distributed) of ratings in this format:

- (userID,movieID,rating)

2. The computation of the cosine similarity (with the previous algorithm) of all pairs of movies from the files provided with this exercise:

- filtered50movies.csv

- filtered100movies.csv

- filtered150movies.csv

- filtered200movies.csv

Each file contains ratings for a different set of movies, and the movies in a smaller file are always a subset of those in any larger file. We provide files of different sizes in case you run into memory issues on your computer, so use the largest file you can handle. While testing your code you can of course use the smallest file, or even a smaller subset of filtered50movies.csv.

3. Show on the screen the "top 10" most similar pairs, using the movie names found in the movies file.
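
For reference, the quantity that the cells of this notebook compute for a pair of movies $p_1$ and $p_2$ can be written as below. This is a sketch of the formula as the code implements it, assuming that all three sums run only over the set $U_{12}$ of users who rated both movies. The symbols $U_{12}$ and $r_{u,p}$ (the rating of user $u$ for movie $p$) are our own notation, not taken from the slides.

$$
\mathrm{sim}(p_1, p_2) = \frac{\sum_{u \in U_{12}} r_{u,p_1}\, r_{u,p_2}}{\sqrt{\sum_{u \in U_{12}} r_{u,p_1}^{2}}\;\sqrt{\sum_{u \in U_{12}} r_{u,p_2}^{2}}}
$$

Each user in $U_{12}$ contributes one term to each of the three sums, which is what makes the computation easy to express as a map step followed by a reduce step.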

### Cell 1: Initialization of PySpark

This cell initializes PySpark, the Python API for Spark, a big data processing framework. It creates a local SparkSession that uses all available cores and takes the SparkContext (sparkContext) from it.

In [None]:
import math

import pyspark.sql

# Run Spark locally, using all available cores.
SPARK_ENDPOINT = "local[*]"
sparkSession = pyspark.sql.SparkSession.builder.master(SPARK_ENDPOINT).getOrCreate()
sparkContext = sparkSession.sparkContext
sparkSession  # Display the session summary as the cell output.

### Cell 2: Loading Data (movies.csv)
We load the movies.csv dataset into a DataFrame and print it, then convert it into an RDD of (movieID, title, genres) tuples, casting the movie ID to an integer.

In [None]:
moviesDataFrame = sparkSession.read.csv("dataset/movies.csv", header = True)
moviesDataFrame.show()
moviesRdd = moviesDataFrame.rdd.map(lambda x: (int(x[0]), str(x[1]), str(x[2])))

### Cell 3: Loading Data (filtered50movies.csv)
We load the filtered50movies.csv dataset into a DataFrame, convert it into an RDD of (userID, movieID, rating) tuples with numeric types for distributed processing, and print the DataFrame.

In [None]:
filteredMoviesDataFrame = sparkSession.read.csv("dataset/filtered50movies.csv", header = True)
filteredRdd = filteredMoviesDataFrame.rdd.map(lambda x: (int(x[0]), int(x[1]), float(x[2])))
filteredMoviesDataFrame.show()

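### Cell 4: Calculation of the cosine similarity for all pairs
We take the Cartesian product of the ratings RDD with itself and keep only the pairs in which both ratings come from the same user and the first movie ID is smaller than the second, so each pair of movies appears once per user. We then create a DataFrame from the filtered RDD to display it.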

In [None]:
# Pair every rating with every other rating (Cartesian product of the RDD with itself).
cartesian_rdd = filteredRdd.cartesian(filteredRdd)

def filter_pair(pair):
    # Keep pairs rated by the same user; product1 < product2 keeps each movie pair once per user.
    (user1, product1, _), (user2, product2, _) = pair
    return user1 == user2 and product1 < product2

processed_cartesian_rdd = cartesian_rdd.filter(filter_pair)

cartesian_dataframe = sparkSession.createDataFrame(processed_cartesian_rdd, ["(u, p1, r1)", "(u, p2, r2)"])

cartesian_dataframe.show()
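
The Cartesian product compares every rating with every other rating, even though only ratings from the same user survive the filter. As an illustration of what the surviving pairs look like, here is a minimal plain-Python sketch (not Spark code) that builds the same pairs by grouping the ratings by user first. The function name same_user_pairs and the toy ratings are hypothetical, and the sketch assumes a small in-memory list rather than an RDD.

```python
from collections import defaultdict
from itertools import combinations

def same_user_pairs(ratings):
    """Return ((user, p1, r1), (user, p2, r2)) pairs with p1 < p2 for each user."""
    by_user = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append((movie, rating))
    pairs = []
    for user, rated in by_user.items():
        # Sorting by movie ID makes the first movie of each combination the smaller one.
        for (p1, r1), (p2, r2) in combinations(sorted(rated), 2):
            if p1 < p2:  # Skips duplicate ratings of the same movie, if any.
                pairs.append(((user, p1, r1), (user, p2, r2)))
    return pairs

# Hypothetical toy ratings in the (userID, movieID, rating) format.
toy_ratings = [(1, 10, 2.0), (1, 20, -1.0), (2, 10, 1.0), (2, 20, 1.0), (2, 30, -2.0)]
toy_pairs = same_user_pairs(toy_ratings)
```

In Spark, a comparable effect could probably be obtained by keying the ratings by user and joining the RDD with itself, but that variant is not part of this notebook.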


### Cell 5: Mapping user-product ratings
We map every pair of ratings from the same user (u) to the values it contributes to the final cosine similarity between p1 and p2: the product r1 * r2 and the squares r1^2 and r2^2.

In [None]:
userProductRatingsRdd = processed_cartesian_rdd.map(lambda x: (
    (x[0][1], x[1][1]), 
    (x[0][2] * x[1][2], pow(x[0][2], 2), pow(x[1][2], 2))
))
userProductRatingsDataFrame = sparkSession.createDataFrame(userProductRatingsRdd, ["(p1, p2)", "(r1 * r2, r1^2, r2^2)"])
userProductRatingsDataFrame.show()

### Cell 6: Reduce key-value pairs
We then reduce all the previous key-value pairs that share the same key (p1, p2) by summing their values component by component.

In [None]:
step2Rdd = userProductRatingsRdd.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
step2DataFrame = sparkSession.createDataFrame(step2Rdd, ["(p1, p2)", "(pra1,2 + prb1,2, ra1^2 + rb1^2, ra2^2 + rb2^2)"])
step2DataFrame.show()
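
To make the map and reduce steps concrete, the following plain-Python sketch (not Spark code) applies them to two hypothetical same-user pairs. The helper names map_pair and reduce_by_key are our own; reduce_by_key only imitates what reduceByKey does with the summing lambda used above.

```python
from collections import defaultdict

def map_pair(pair):
    """Map one same-user pair to ((p1, p2), (r1 * r2, r1^2, r2^2))."""
    (_, p1, r1), (_, p2, r2) = pair
    return (p1, p2), (r1 * r2, r1 ** 2, r2 ** 2)

def reduce_by_key(mapped):
    """Sum, component by component, all values that share the same (p1, p2) key."""
    sums = defaultdict(lambda: (0.0, 0.0, 0.0))
    for key, (dot, sq1, sq2) in mapped:
        acc = sums[key]
        sums[key] = (acc[0] + dot, acc[1] + sq1, acc[2] + sq2)
    return dict(sums)

# Hypothetical pairs: users 1 and 2 both rated movies 10 and 20.
toy_pairs = [
    ((1, 10, 2.0), (1, 20, -1.0)),
    ((2, 10, 1.0), (2, 20, 1.0)),
]
toy_sums = reduce_by_key(map_pair(p) for p in toy_pairs)
# toy_sums == {(10, 20): (-1.0, 5.0, 2.0)}
```

The three sums are the dot product 2.0 * (-1.0) + 1.0 * 1.0, the squared ratings of movie 10 (4.0 + 1.0) and the squared ratings of movie 20 (1.0 + 1.0).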

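### Cell 7: Compute cosine similarity
Finally, we compute the cosine similarity between each pair of movies from the sums produced by the reduction. The identifiers in the code keep the name CosineDistance, but the value is a similarity: a higher value means the movies are more similar.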

In [None]:
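def calculate_cosine_distance(data):
    # data is ((p1, p2), (sum of r1 * r2, sum of r1^2, sum of r2^2)).
    movie1_id, movie2_id = data[0]
    dot_product, sum_squares1, sum_squares2 = data[1]
    denominator = math.sqrt(sum_squares1) * math.sqrt(sum_squares2)
    # If every shared rating of one movie is 0 the cosine is undefined; return 0.0 instead of dividing by zero.
    cosine_distance = dot_product / denominator if denominator else 0.0
    return (movie1_id, movie2_id, cosine_distance)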

cosineRdd = step2Rdd.map(calculate_cosine_distance)
cosineDataFrame = sparkSession.createDataFrame(cosineRdd, ["Movie1Id", "Movie2Id", "CosineDistance"])
cosineDataFrame.show()
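
As a small worked example of this last step, the plain-Python sketch below evaluates the formula on hand-picked sums. The function name cosine_from_sums and the input numbers are hypothetical, and returning 0.0 for a zero denominator is an assumption of this sketch rather than something the assignment specifies.

```python
import math

def cosine_from_sums(dot_product, sum_squares1, sum_squares2):
    """Cosine similarity from the three reduced sums; 0.0 when it is undefined."""
    denominator = math.sqrt(sum_squares1) * math.sqrt(sum_squares2)
    return dot_product / denominator if denominator else 0.0

# Two users in common: dot product -1.0, squared sums 5.0 and 2.0.
similarity = cosine_from_sums(-1.0, 5.0, 2.0)  # -1 / sqrt(10)

# One user in common with ratings 2.0 and 3.0: the result is exactly 1.0.
single_user = cosine_from_sums(2.0 * 3.0, 2.0 ** 2, 3.0 ** 2)
```

The second call shows a property worth keeping in mind when reading the results: a pair of movies with a single user in common always scores 1 or -1 (unless a rating is 0), so such pairs can dominate the top of the ranking.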

### Cell 8: Print results 

We join each pair of movies with the movie titles, sort the pairs by cosine similarity in descending order, and show the 10 most similar pairs.

In [None]:
movies1_df = sparkSession.createDataFrame(moviesRdd, ["MovieId1", "Title1", "Genres1"])
movies2_df = sparkSession.createDataFrame(moviesRdd, ["MovieId2", "Title2", "Genres2"])

joined_df = (cosineDataFrame
    .join(movies1_df, cosineDataFrame.Movie1Id == movies1_df.MovieId1)
    .join(movies2_df, cosineDataFrame.Movie2Id == movies2_df.MovieId2))

cosine_with_titles_df = (joined_df
    .select(cosineDataFrame.Movie1Id, movies1_df.Title1, cosineDataFrame.Movie2Id, movies2_df.Title2, cosineDataFrame.CosineDistance)
    .withColumnRenamed("Title1", "MovieTitle1")
    .withColumnRenamed("Title2", "MovieTitle2")
    .orderBy(cosineDataFrame.CosineDistance.desc()))

cosine_with_titles_df.show(10)
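
The same ranking can be sketched in plain Python (not Spark code) for a small in-memory list, selecting the best pairs first and looking up the titles only for those. The function name top_similar_pairs, the toy titles and the toy similarity values are all hypothetical.

```python
import heapq

def top_similar_pairs(similarities, titles, k=10):
    """Return the k (title1, title2, similarity) rows with the highest similarity."""
    best = heapq.nlargest(k, similarities, key=lambda row: row[2])
    # Fall back to the movie ID as text when a title is missing.
    return [(titles.get(m1, str(m1)), titles.get(m2, str(m2)), sim) for m1, m2, sim in best]

# Hypothetical titles and similarity values.
toy_titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
toy_similarities = [(10, 20, -0.32), (10, 30, 0.87), (20, 30, 0.15)]
top2 = top_similar_pairs(toy_similarities, toy_titles, k=2)
# top2 == [("Movie A", "Movie C", 0.87), ("Movie B", "Movie C", 0.15)]
```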
