## CSC8101 - Practical 07 Feb 2023

#### Exercise 2 - Movies Dataset - Take home

Two input [datasets](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset):

- `ratings` dataset: Movie ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
- `movies` dataset: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset.

Each of these datasets is read into a DataFrame below.

##### Task 1

1. How many partitions does each dataset have?
2. How big is each dataset? (Report number of rows)
3. Repartition the `ratings` dataset by key `movieID` across `100` partitions.
4. Verify that the `ratings` dataset now has `100` partitions.

Docs:
- [Repartition](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html)

##### Task 2

Write a data pipeline function that takes as input the two datasets above and outputs the `N` most popular films (by rating), for a given `genre`, for a given `decade` (specified by its start year, e.g. `1980` for 80s and `2000` for 2000s).

> Example function run: `movies_pipeline(movies, ratings, N = 10, genre = "Comedy", decade = 2010)`

Run your function for the following parameter inputs and report your findings. Set `N = 10` throughout:

- `genre = "Thriller"`, `decade = 1980`
- `genre = "Drama"`, `decade = 2000`
- `genre = "Comedy"`, `decade = 2010`

Helpful docs:

- [DataFrame quickstart](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html?highlight=select)
- [withColumn](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn)
- [select](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html#pyspark.sql.DataFrame.select)
- [orderBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html?highlight=orderby#pyspark.sql.DataFrame.orderBy)
- [join](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html?highlight=join#pyspark.sql.DataFrame.join)
- [filter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html?highlight=filter#pyspark.sql.DataFrame.filter)

In [0]:
import pyspark.sql.functions as FN
import pyspark.sql.types as TP

# Task 1

# File location and type
ratings_file_location = "/FileStore/tables/movies/ratings.csv"
movies_file_location = "/FileStore/tables/movies/movies_metadata.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","



# The applied options are for CSV files. For other file types, these will be ignored.
ratings = spark.read.format(file_type) \
          .option("inferSchema", infer_schema) \
          .option("header", first_row_is_header) \
          .option("sep", delimiter) \
          .load(ratings_file_location)

movies_metadata = spark.read.format(file_type) \
                  .option("inferSchema", infer_schema) \
                  .option("header", first_row_is_header) \
                  .option("sep", delimiter) \
                  .load(movies_file_location)

In [0]:
display(ratings.take(5))

In [0]:
# printSchema() prints directly and returns None, so it does not need display()
movies_metadata.printSchema()

In [0]:
movies_metadata[['genres']].take(1)

In [0]:
# Initial pre-processing

## Select relevant columns
movies = movies_metadata[['id','original_title','genres','release_date']]

## sample data
display(movies.take(5))

In [0]:
# Notice that genres is a string, which is not very useful as-is, so we need to parse it into a structure that can be manipulated.
# In this case that is an array of structs with fields ('id', 'name')


# schema for 'genres' column in movies metadata dataset
genres_schema = TP.ArrayType(
    TP.StructType([
        TP.StructField("id", TP.IntegerType()),
        TP.StructField("name", TP.StringType())
    ])
)

# Now we overwrite column 'genres', parsing the string into a data structure that we can manipulate
movies = movies.withColumn("genres", FN.from_json(movies.genres, genres_schema))


In [0]:
# printSchema() prints directly and returns None, so it does not need display()
movies.printSchema()
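
To make the parsing step concrete without a Spark session, here is a minimal plain-Python sketch of the same idea: turn the `genres` string into a list of records, then keep only the names. The sample string is an assumption, written in the style of the dataset rather than read from the file, and `ast.literal_eval` stands in for `FN.from_json` purely for illustration.

```python
import ast

# Illustrative value in the style of the `genres` column (assumed, not read from the file).
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# Parse the string into a list of dicts, mirroring the array-of-structs schema above.
parsed = ast.literal_eval(raw_genres)

# Keep only the genre names, which is what Task 2 filters on.
genre_names = [entry["name"] for entry in parsed]
print(genre_names)
```

Once the column holds a real array, per-element operations such as extracting `name` or testing membership become straightforward.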

## Task 1 - Partition `ratings` dataset

1. How many partitions does each dataset have?
2. How big is each dataset? (Report number of rows)
3. Repartition the `ratings` dataset by key `movieID` across `100` partitions.
4. Verify that the `ratings` dataset now has `100` partitions.
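
As a rough picture of what repartitioning by key means, the sketch below assigns rows to partitions in plain Python. It is a simplification under stated assumptions: the rows are invented, and `assign_partition` is a hypothetical helper that uses a plain modulo, whereas Spark hashes the key before taking the modulo. The property it illustrates is the same: all rows with the same `movieID` land in the same partition.

```python
from collections import defaultdict

def assign_partition(key: int, num_partitions: int) -> int:
    """Toy partitioner (hypothetical): Spark hashes the key first, this uses a plain modulo."""
    return key % num_partitions

# Hypothetical (userId, movieId, rating) rows, invented for illustration.
sample_ratings = [
    (1, 110, 1.0),
    (2, 110, 4.0),
    (3, 858, 5.0),
    (4, 1221, 5.0),
    (5, 858, 4.5),
]

# Group rows by the partition index their key maps to.
partitions = defaultdict(list)
for user_id, movie_id, rating in sample_ratings:
    partitions[assign_partition(movie_id, 100)].append((user_id, movie_id, rating))

for index in sorted(partitions):
    print(index, partitions[index])
```

Because the assignment depends only on the key, a later `groupBy('movieID')` can work on each partition without moving rows for the same film between partitions.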

In [0]:
# write your solution here

In [0]:
# 1
print("# 1")
print(f"Number of partitions (ratings): {ratings.rdd.getNumPartitions()}")
print(f"Number of partitions (movies): {movies.rdd.getNumPartitions()}")

# 2
print("# 2")
print("Number of rows (ratings): {t:,}".format(t = ratings.count()))
print("Number of rows (movies): {t:,}".format(t = movies.count()))

# 3
print("# 3")
ratings_100part = ratings.repartition(100, "movieID")

# 4
print("# 4")
print(f"Number of partitions (ratings): {ratings_100part.rdd.getNumPartitions()}")
print(f"Number of partitions (movies): {movies.rdd.getNumPartitions()}")

## Task 2 - Pipeline

Write a data pipeline function that takes as input the two datasets above and outputs the `N` most popular films (by rating), for a given `genre`, for a given `decade` (specified by its start year, e.g. `1980` for 80s and `2000` for 2000s).

In [0]:
def movies_pipeline(movies_df, ratings_df, N = 10, genre = "Comedy", decade = 1980):
    # write your solution here
    # develop your solution in separate cells before implementing this function    
    pass

#### 2a. Begin by further pre-processing the movies dataframe

Save the output of this sequence of operations into a new variable.

- Extract the name of each genre in column `genres`
- Convert date string to datetime structure in column `release_date`
- Remove movies with null date values
- Create new column `year` from `release_date`
- Drop the `release_date` column

Docs:

- [Operations on columns - Pyspark functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html) (used within `withColumn` or `select`)
- [Operations on DataFrames](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)
- [withColumn](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn)
- [filter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html?highlight=filter#pyspark.sql.DataFrame.filter)
- [drop](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.drop.html#pyspark.sql.DataFrame.drop)
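
The date-handling steps can be sketched for a single value in plain Python. This is only an illustration: `release_year` is a hypothetical helper, and the assumption is that unparseable dates should become `None` so they can be filtered out, which matches the intent of the "remove movies with null date values" step.

```python
from datetime import datetime

def release_year(release_date):
    """Return the year of a 'yyyy-MM-dd' date string, or None if it cannot be parsed."""
    if release_date is None:
        return None
    try:
        return datetime.strptime(release_date, "%Y-%m-%d").year
    except ValueError:
        return None

# Hypothetical inputs: one valid date, one malformed value, one missing value.
sample_dates = ["1995-10-30", "not a date", None]

# Parse each value, then drop the ones that could not be parsed.
years = [release_year(d) for d in sample_dates]
valid_years = [y for y in years if y is not None]
print(valid_years)
```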

In [0]:
processed = (
    movies.withColumn('genres', FN.transform(movies.genres, lambda x: x['name'])) # extract genre name
      .withColumn('release_date', FN.to_timestamp(movies.release_date, "yyyy-MM-dd")) # date string to datetime structure
      .filter(FN.col('release_date').isNotNull()) # remove movies with null dates
      .withColumn('year', FN.year(FN.col('release_date'))) # extract year from date
      .drop('release_date')
)

processed.take(5)

#### 2b. Filter the processed movies dataset based on parameters `genre` and `decade`

Save the output of this sequence of operations into a new variable.

Docs:

- [Operations on columns - Pyspark functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html) (used within `withColumn` or `select`)
- [filter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html?highlight=filter#pyspark.sql.DataFrame.filter)

Hints:

- Each film can have multiple genres. Which PySpark function tells you whether an array contains a given element?
- How can we check that a given year falls within a decade? There is a simple arithmetic formula.
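
One possible reading of the decade hint, sketched in plain Python: a year belongs to a decade when it lies between 0 and 9 years after the decade's start year. The function names `in_decade` and `decade_of` are hypothetical and exist only for this illustration.

```python
def in_decade(year: int, decade: int) -> bool:
    """Return True when `year` falls in the ten years starting at `decade`."""
    return 0 <= year - decade < 10

def decade_of(year: int) -> int:
    """Round a year down to the start of its decade using integer division."""
    return (year // 10) * 10

print(in_decade(1989, 1980))  # inside the 80s
print(in_decade(1990, 1980))  # first year of the 90s, so outside the 80s
print(decade_of(1987))
```

Either form works as a filter: compare the difference against the range 0 to 9, or compare `decade_of(year)` with the requested start year.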

In [0]:
genre = "Thriller"
decade = 1980

subset = processed.filter(FN.array_contains(FN.col('genres'), genre))\
                  .filter(
                     (FN.col('year') - decade >= 0) & 
                     (FN.col('year') - decade < 10))

subset.take(5)

#### 2c. Calculate the average rating of each film in the ratings dataset

Save the output of this operation into a new variable.

Docs:

- [groupBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html#pyspark.sql.DataFrame.groupBy)
- [agg](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.agg.html#pyspark.sql.DataFrame.agg)
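
For intuition, grouping by film and averaging amounts to keeping a running total and a count per key. The sketch below does this in plain Python on invented `(movieId, rating)` pairs; it illustrates the computation, not how Spark executes it.

```python
from collections import defaultdict

# Hypothetical (movieId, rating) pairs, invented for illustration.
sample_ratings = [(110, 1.0), (110, 4.0), (858, 5.0), (858, 4.5), (1221, 5.0)]

# Accumulate the sum and the number of ratings for each film.
totals = defaultdict(float)
counts = defaultdict(int)
for movie_id, rating in sample_ratings:
    totals[movie_id] += rating
    counts[movie_id] += 1

# Average rating per film: total divided by count.
avg_rating = {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}
print(avg_rating)
```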

In [0]:
avg_ratings = ratings.groupBy('movieID').agg(FN.avg(ratings.rating).alias('avg_rating'))

#### 2d. Join the result of 2b with 2c, order by average rating (descending order) and select the top N

Docs:

- [join](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html?highlight=join#pyspark.sql.DataFrame.join)
- [orderBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html?highlight=orderby#pyspark.sql.DataFrame.orderBy)
- [select](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html#pyspark.sql.DataFrame.select)
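
The join, sort and top-N steps can be pictured in plain Python as follows. All data and the helper `top_n` are hypothetical; the point is that an inner join keeps only films that appear on both sides, so a film with no ratings, or a rating with no matching film, drops out before the ranking.

```python
# Hypothetical inputs, invented for illustration.
movie_subset = [
    {"id": 1, "original_title": "Film A", "year": 1983},
    {"id": 2, "original_title": "Film B", "year": 1985},
    {"id": 3, "original_title": "Film C", "year": 1988},
    {"id": 4, "original_title": "Film D", "year": 1989},  # has no ratings
]
avg_ratings = {1: 3.5, 2: 4.2, 3: 2.9, 99: 5.0}  # movieId -> average rating

def top_n(movies, ratings, n):
    """Inner-join movies with ratings on id, sort by rating (descending), keep n rows."""
    joined = [
        {
            "original_title": m["original_title"],
            "year": m["year"],
            "avg_rating": ratings[m["id"]],
        }
        for m in movies
        if m["id"] in ratings  # inner join: skip films without ratings
    ]
    joined.sort(key=lambda row: row["avg_rating"], reverse=True)
    return joined[:n]

top_two = top_n(movie_subset, avg_ratings, 2)
for row in top_two:
    print(row)
```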

In [0]:
N = 10

result = subset.join(avg_ratings, subset.id == avg_ratings.movieID, how = "inner")\
               .orderBy(FN.col('avg_rating').desc())\
               .select(['original_title', 'year', 'avg_rating'])\
               .take(N)

display(result)

#### 2e. Put everything together in a function

In [0]:
def movies_pipeline(movies_df, ratings_df, N = 10, genre = "Comedy", decade = 1980):
    # write your solution here
    # develop your solution in separate cells before implementing this function
    
    # further pre-processing of movies dataframe
    movies_processed = (
      movies_df.withColumn('genres', FN.transform(movies_df.genres, lambda x: x['name'])) # extract genre name (use the function argument, not the global `movies`)
               .withColumn('release_date', FN.to_timestamp(movies_df.release_date, "yyyy-MM-dd")) # date string to datetime structure
               .filter(FN.col('release_date').isNotNull()) # remove movies with null dates
               .withColumn('year', FN.year(FN.col('release_date'))) # extract year from date
               .drop('release_date')
    )
    
    # filter movies based on parameters genre and decade
    movie_subset = movies_processed.filter(FN.array_contains(FN.col('genres'), genre))\
                                   .filter(
                                        (FN.col('year') - decade >= 0) & 
                                        (FN.col('year') - decade < 10))
    
    # ratings dataset: calculate the average rating for each movie
    avg_ratings = ratings_df.groupBy('movieID').agg(FN.avg(ratings_df.rating).alias('avg_rating'))
    
    # join with average ratings, order by average rating and select top N
    out = movie_subset.join(avg_ratings, movie_subset.id == avg_ratings.movieID, how = "inner")\
                      .orderBy(FN.col('avg_rating').desc())\
                      .select(['original_title', 'year', 'avg_rating'])\
                      .take(N)
    
    return out

In [0]:
display(
    movies_pipeline(movies, ratings, genre = "Thriller", decade = 1980, N = 10)
)

| original_title | year | avg_rating |
| --- | --- | --- |
| Scarface | 1983 | 4.0893090376222485 |
| Lethal Weapon 2 | 1989 | 4.06918449197861 |
| The Falcon and the Snowman | 1985 | 4.013636363636364 |
| 48 Hrs. | 1982 | 3.884030583809391 |
| Apartment Zero | 1988 | 3.857142857142857 |
| Garde à vue | 1981 | 3.8126293995859215 |
| Star Trek II: The Wrath of Khan | 1982 | 3.795300982800983 |
| Die Hard | 1988 | 3.763978263978264 |
| Black Rain | 1989 | 3.694737696930323 |
| Messenger of Death | 1988 | 3.681395348837209 |
