# Fun exercises on Spark using the MovieLens Dataset

We are going to use the [MovieLens](https://grouplens.org/datasets/movielens/) dataset for these exercises. The dataset is non-trivial in size and should expand to about 1 GB on your hard drive.

Download and unzip [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/) for this analysis.

Either ensure the data is in the ```"./data/ml-25m"``` folder or update the data path below.

**Citation**:  
*F. Maxwell Harper and Joseph A. Konstan.* 2015.  
The MovieLens Datasets: History and Context.  
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>  

You got this.  


In [1]:
# Step 1: initialize findspark
import findspark
findspark.init()

In [2]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession
pyspark.__version__

'3.3.0'

In [3]:
# Step 3: Create a spark session

# 'local[1]' runs Spark on 1 core of the local machine; use 'local[N]' for N cores or 'local[*]' for all available cores
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = SparkSession \
    .builder \
    .master('local[1]') \
    .appName("Analyzing Movielens Data") \
    .getOrCreate()

# spark

# ...to read and load the data *correctly*

This is typically the first problem you need to work out. You'll see.  
  
If you've downloaded and unzipped the data, you'll see that some of the files are quite large (genome-scores.csv is 400+ MB, ratings.csv is 600+ MB).  

So before we start loading the data to explore further, let's go through the [readme](https://files.grouplens.org/datasets/movielens/ml-25m-README.html) file to build a strategy for loading and analyzing data without clogging up the system.  

In real life, you'll either have to load files in small chunks to work out a strategy, or rely on a defined schema for the data.  
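
One cheap way to "load a small chunk" before involving Spark is to read only the first few lines of a file in plain Python. The sketch below is a minimal illustration; the helper name `peek_csv` is hypothetical, and the path in the usage comment assumes the default data folder used in this notebook.

```python
import csv
from itertools import islice


def peek_csv(path, num_rows=5, encoding="utf-8"):
    """Return the header and the first num_rows data rows without reading the whole file."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None)           # first line is the header row
        rows = list(islice(reader, num_rows))  # stop after num_rows data rows
    return header, rows


# Example usage (the path is an assumption; adjust it to where the data lives):
# header, rows = peek_csv("./data/ml-25m/ratings.csv")
# print(header)
# for row in rows:
#     print(row)
```

Looking at a handful of raw rows like this is usually enough to decide which column types a hand-written schema should declare.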

Here's the list of files (as of Aug 2022) that you get when you unzip the dataset:
1. movies.csv - list of movies with at least one rating.  
1. links.csv - IDs to generate links to the movie listing on imdb.com and themoviedb.org  
1. ratings.csv - Each line of this file after the header row represents one rating of one movie by one user.  
    Header: ```userId,movieId,rating,timestamp```  
1. tags.csv - Each line of this file after the header row represents one tag applied to one movie by one user.  
    Header: ```userId,movieId,tag,timestamp```  
1. Tag Genome: The tag genome contains tag relevance scores for movies. See [this](http://files.grouplens.org/papers/tag_genome.pdf)  
	1. genome-tags.csv - A list of tags  
	1. genome-scores.csv - Each movie in the genome has a relevance score value for every tag in the genome  
1. README.txt - Check out the README.txt for more details about the files.  

## formatting and encoding

From the Readme file, we have the following observations about the data:
1. Each file is a CSV with a single header row
1. Separator char is ```,```
1. Escape char is ```"```
1. Encoding is UTF-8

Let's set these options when reading the CSV files.

In [4]:
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
# where possible, let's avoid inferSchema and declare the schema explicitly
# 
schema_movies = StructType([
    StructField('movieId', StringType(), False),
    StructField('title', StringType(), False),
    StructField('genres', StringType(), True)    
    ])
# 
schema_ratings = StructType([
    StructField('userId', StringType(), False),
    StructField('movieId', StringType(), False),
    StructField('rating', IntegerType(), True),
    StructField('timestamp', StringType(), True)
    ])
# 
schema_tags = StructType([
    StructField('userId', StringType(), False),
    StructField('movieId', StringType(), False),
    StructField('tag', StringType(), True),
    StructField('timestamp', StringType(), True)
    ])
# 
schema_genome_tags = ''
schema_genome_scores = ''

In [5]:
datalocation = "./data/ml-25m/"
movies_file = datalocation + 'movies.csv'
ratings_file = datalocation + 'ratings.csv'
tags_file = datalocation + 'tags.csv'
genome_tags_file = datalocation + 'genome-tags.csv'
genome_scores_file = datalocation + 'genome-scores.csv'

In [6]:
movies_raw = spark.read.format('csv') \
    .option('encoding', 'UTF-8') \
    .option('header', True) \
    .option('sep', ',') \
    .option('escape','\"') \
    .schema(schema_movies) \
    .load(movies_file)

In [7]:
movies_raw.show(10,False)

+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
|6      |Heat (1995)                       |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                    |Comedy|Romance                             |
|8      |Tom and Huck (1995)               |Adventure|Children                         |
|9      |Sudden Death

In [8]:
ratings_raw = spark.read.format('csv') \
    .option('encoding', 'UTF-8') \
    .option('header', True) \
    .option('sep', ',') \
    .option('escape','\"') \
    .schema(schema_ratings) \
    .load(ratings_file)

In [9]:
ratings_raw.show(10, False)

+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|1     |296    |null  |1147880044|
|1     |306    |null  |1147868817|
|1     |307    |null  |1147868828|
|1     |665    |null  |1147878820|
|1     |899    |null  |1147868510|
|1     |1088   |null  |1147868495|
|1     |1175   |null  |1147868826|
|1     |1217   |null  |1147878326|
|1     |1237   |null  |1147868839|
|1     |1250   |null  |1147868414|
+------+-------+------+----------+
only showing top 10 rows



In [10]:
tags_raw = spark.read.format('csv') \
    .option('encoding', 'UTF-8') \
    .option('header', True) \
    .option('sep', ',') \
    .option('escape','\"') \
    .schema(schema_tags) \
    .load(tags_file)

In [11]:
tags_raw.show(10, False)

+------+-------+-----------------------+----------+
|userId|movieId|tag                    |timestamp |
+------+-------+-----------------------+----------+
|3     |260    |classic                |1439472355|
|3     |260    |sci-fi                 |1439472256|
|4     |1732   |dark comedy            |1573943598|
|4     |1732   |great dialogue         |1573943604|
|4     |7569   |so bad it's good       |1573943455|
|4     |44665  |unreliable narrators   |1573943619|
|4     |115569 |tense                  |1573943077|
|4     |115713 |artificial intelligence|1573942979|
|4     |115713 |philosophical          |1573943033|
|4     |115713 |tense                  |1573943042|
+------+-------+-----------------------+----------+
only showing top 10 rows



Let's load the other files later.

Here are the exercises we'll do with this data next.

# Problem Set 1  - ```tags.csv```

1. List all unique tags found in ```tags.csv```  
    * Also print the execution plan  
    * *[think]* If there are multiple ways of doing this, compare the execution plans  
  
1. Remove wrapping quotation marks from tags
    * so ```"A Christmas Carol"``` becomes ```A Christmas Carol```
    * sort all tags lexically  

1. Which movies have the most tags? 
    * List movieIds in descending order of the number of tags associated  

1. Which users have added the most tags?
    * List userIds in descending order of the number of tags created  

1. Which users have tagged the most movies?

1. We want to find out whether there were days of higher activity during the tagging exercise or whether the tagging output was more or less consistent. 
    * Convert timestamps to Day-Month-Year. 
    * Find the date range (min-date, max-date) during which the tagging activity took place.
    * Plot the number of movies tagged per day during the date range

1. We want to find out how many users were active on each day of the tagging activity. 
    * Plot the number of users who tagged at least one movie on each day of the tagging activity date range

In [12]:
# unique tags found in tags.csv - method 1, using distinct()
distinct_tags = tags_raw.select('tag').distinct()
# let's do the explaining later, so it's easy to compare methods
# distinct_tags.explain(True)
# show 5 rows, do not truncate
distinct_tags.show(5, False)

+-----------+
|tag        |
+-----------+
|anime      |
|art        |
|traveling  |
|Banksy     |
|smart drugs|
+-----------+
only showing top 5 rows



In [13]:
# unique tags found in tags.csv - method 2, using groupBy()
# an aggregator like count() results in a dataframe
distinct_tags2 = tags_raw.groupBy('tag').count()
# let's do the explaining later, so it's easy to compare methods
# distinct_tags2.explain(True)
# show 5 rows, no truncate
distinct_tags2.show(5,False)

+-----------+-----+
|tag        |count|
+-----------+-----+
|anime      |1531 |
|art        |305  |
|traveling  |14   |
|Banksy     |19   |
|smart drugs|17   |
+-----------+-----+
only showing top 5 rows



In [14]:
# ignore case when sorting values
from pyspark.sql.functions import col, lower


# Problem Set 2  - ```movies.csv```

1. Extract the year of release in movies.csv into a new column year_of_release  

1. Prepare a year-wise list of movies: list all the movies released in 1995, then 1996, and so on...   

1. List all unique genres found in ```movies.csv```, ordered lexically, case insensitive  

1. Prepare a genre-wise list of movies: list all the movies for 'Crime', for 'Romance', and so on...

1. Add another column num_genres and list total number of genres associated with each film  

1. Find number of films associated with each genre - absolute_frequency_of_genre  

1. Find out whether any movie has actual genres associated with it and also has ```(no genres listed)```; if so, find out how many such movies exist in the dataset

*[think]*: Is there a 'variety' metric? sum of absolute frequencies divided by total absolute frequency?


In [15]:
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

In [16]:
# unique genres found in movies.csv
movie_genres = movies_raw.select(
    explode( # convert each element in an array to a new row
        split( # split the data on pipe and create an array
            movies_raw.genres, r"\|"  # the pattern is a regex, so the pipe is escaped; raw string avoids an invalid Python escape
        )
    ).alias('genre')
)

In [17]:
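genre_freq = movie_genres.groupBy('genre').count()
genre_freq.show()
# TODO: rename the column 'count' to 'freq', e.g. with .withColumnRenamed('count', 'freq')
# note: DataFrame.alias('freq') would alias the whole DataFrame, not rename the column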

+------------------+-----+
|             genre|count|
+------------------+-----+
|             Crime| 5319|
|           Romance| 7719|
|          Thriller| 8654|
|         Adventure| 4145|
|             Drama|25606|
|               War| 1874|
|       Documentary| 5605|
|           Fantasy| 2731|
|           Mystery| 2925|
|           Musical| 1054|
|         Animation| 2929|
|         Film-Noir|  353|
|(no genres listed)| 5062|
|              IMAX|  195|
|            Horror| 5989|
|           Western| 1399|
|            Comedy|16870|
|          Children| 2935|
|            Action| 7348|
|            Sci-Fi| 3595|
+------------------+-----+



In [19]:
# spark.stop()

# Problem Set 3  - ```ratings.csv```

1. Find the number of films for each rating value: the number of films that have at least one rating of 1, the number that have at least one rating of 2, and so on...  

1. List user-IDs in order of number of films they have rated, descending.  

1. Are there users who have given multiple ratings to the same film?  

# Problem Set 4  - mixing things up, ```movies + ratings```

1. Prepare a list of highly rated movies, present this list by year of release and sorted in alphabetical order by movie title.  
    * "Highly Rated" = movies with at least 3 instances where users have rated the film a 4 or a 5
    * Expected Columns in the output: ```year of release, movie title, # of 4s, # of 5s```  
    
1. Another approach to 'highly rated', prepare a list of 'highly rated' movies
    * "Highly Rated" = sum of 4 and 5 ratings is the highest across all years
    * Sort this list by year of release

1. Which genres have received the highest number of ratings?

1. *[think]* Can we find "Late Bloomers" or "Cult Films"? 
    * Films that were not highly rated during the year of release, or were not well rated initially, but whose ratings improved over time. 
    * How can we rank these in descending order of "Cult Status"?


# Problem Set 5  
Bonus!
1. Cross-check tags from the tag genome, insert tag_genome_id and relevance score, save file as tags_with_relevance.csv  

1. Just fun with string operations: Prepare a list of movies that have at least two vowels except 'e' - sort the list by month and year of video release.  
   * Expected Columns in the output: ```year of video release, month of video release, movie title, # of vowels that are not e```  