# Exercise #3 - Movies Exercises
Because you loved doing movie-related exercises with MapReduce and Spark, we will continue the tradition, this time with SparkSQL =D

## Task 1 - Find the movies with the lowest average rating
In this task, you'll have to answer the question: **"What are the 10 worst movies in the MovieLens dataset?"**

We're going to define the "worst movies" as the ones with the **lowest average rating** among movies that were rated by **at least 10 users**.

Use the `FileStore/tables/ratings.csv` file to complete this task.

In [3]:
# Write your code here
from pyspark.sql import functions as F

df_ratings = spark.read.option("header", "true").option("inferSchema", "true").csv('FileStore/tables/ratings.csv')
df_movies_with_avg_rating = df_ratings\
                              .groupby("movieId")\
                              .agg(F.avg("rating").alias('avg_rating'), F.count("userId").alias('users_count'))\
                              .filter('users_count >= 10')

df_movies_with_avg_rating.sort("avg_rating", ascending=True).show(10)
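
The group, aggregate, filter, and sort steps above can be traced on a handful of rows in plain Python. This is only a sketch for checking the logic by hand: the tuples are made up, and the `MIN_RATINGS` threshold is lowered to 2 (the exercise uses 10) so the sample stays small.

```python
from collections import defaultdict

# Made-up (userId, movieId, rating) tuples, not taken from ratings.csv
ratings = [
    (1, 10, 1.0), (2, 10, 2.0),
    (1, 20, 4.0), (2, 20, 5.0), (3, 20, 3.0),
    (1, 30, 0.5),  # only one rating, so the threshold drops this movie
]
MIN_RATINGS = 2  # hypothetical value for the sample; the exercise uses 10

# groupby("movieId"): collect every rating per movie
by_movie = defaultdict(list)
for _user, movie, rating in ratings:
    by_movie[movie].append(rating)

# agg(avg, count) followed by the count filter
stats = [
    (movie, sum(rs) / len(rs), len(rs))
    for movie, rs in by_movie.items()
    if len(rs) >= MIN_RATINGS
]

# sort by average rating, lowest first
worst_first = sorted(stats, key=lambda s: s[1])
print(worst_first)
```

Movie 30 has the lowest rating in the sample, but it never reaches the result because it has a single rating. That is the effect of the minimum-count rule.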

## Task 2 - Find the movie names with the best average rating
This task is similar to task #1, with two main differences:
1. This time we want the top 10 movies (with the highest ratings and at least 10 user ratings).
2. We want the names of the movies and their average ratings, not their IDs. Hint: You'll have to join the result with the `movies` dataset in `/FileStore/tables/movies.csv`.

In [5]:
# Write your code here
from pyspark.sql import functions as F

df_movies = spark.read.option("header", "true").option("inferSchema", "true").csv('FileStore/tables/movies.csv')
df_movienames_with_avg_ratings = df_movies_with_avg_rating\
                                    .join(df_movies, 'movieId')
df_movienames_with_avg_ratings\
    .sort("avg_rating", ascending=False)\
    .select("title", "avg_rating")\
    .show(10)

## Task 3 - Movies Genre
In this task we are also going to use both the `ratings.csv` and `movies.csv` datasets to answer the following questions:

1. What are all the different genres of the movies?
2. How many movies are in each genre? Show it in a 'Pie' graph in Databricks.
3. What's the average rating given for each genre?

#### Hints
1. You can convert DFs to RDDs and back to make the job easier.
2. After reading the datasets into DFs, convert the movies DF into an RDD, extract the genres with a map function, and convert the result back into a DF with the genre column as an Array.
3. Don't forget about the `explode` method in SparkSQL; you might find it useful.
4. You can create a DF from an RDD with the method `rdd.toDF()` and the `Row` object from `pyspark.sql`.
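
To make the split-then-explode idea in the hints concrete, here is a minimal plain-Python sketch of what happens to each row. It does not use Spark. The sample rows are made up, and the pipe-delimited `genres` format is assumed from the way the solution splits that column.

```python
# Made-up rows in the shape (movieId, title, genres); not real dataset rows
movies = [
    (1, 'Toy Story (1995)', 'Adventure|Animation|Children'),
    (2, 'Heat (1995)', 'Action|Crime'),
]

# Step 1: split the pipe-delimited string into a list (the "Array" column)
with_array = [(mid, title, genres.split('|')) for mid, title, genres in movies]

# Step 2: "explode" the list, producing one output row per genre
exploded = [(mid, title, g) for mid, title, gs in with_array for g in gs]

# Distinct genres, sorted only to make the output stable
distinct_genres = sorted({g for _mid, _title, g in exploded})
print(exploded)
print(distinct_genres)
```

Two input rows become five exploded rows, one per (movie, genre) pair. This is the shape the per-genre counts and averages are computed on.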

In [7]:
# Write your code here
from pyspark.sql import functions as F
from pyspark.sql import Row

df_movies = spark.read.option("header", "true").option("inferSchema", "true").csv('FileStore/tables/movies.csv')
df_ratings = spark.read.option("header", "true").option("inferSchema", "true").csv('FileStore/tables/ratings.csv')

# 1 - What are all the different genres of the movies?
df_movies_with_genres_as_array = df_movies\
                                        .rdd\
                                        .map(lambda r: Row(movieId=r['movieId'], title=r['title'], genres=r['genres'].split('|')))\
                                        .toDF()
df_movies_with_genres_exploded = df_movies_with_genres_as_array\
                                                        .select("movieId", "title", F.explode("genres").alias("genre"))

df_movies_with_genres_exploded\
                            .select("genre")\
                            .distinct()\
                            .show()

In [8]:
# 2 - How many movies are in each genre?
df_genres_count = df_movies_with_genres_exploded\
                      .groupby('genre')\
                      .agg(F.count('*').alias('movies count'))
display(df_genres_count)
# Create a 'Pie' visualization

In [9]:
# 3 - What's the average rating given for each genre?
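# We first calculate the average rating for each movie, then explode each movie's genres and average those per-movie averages for each genre.
# Note: the DF reused here only contains movies with at least 10 ratings, and every movie counts equally regardless of how many ratings it has. There are other ways to solve this.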

# DF from task #2 and some logic from this task.
df_movienames_with_avg_ratings\
  .rdd\
  .map(lambda r: Row(title=r['title'], avg_rating=r['avg_rating'], genres=r['genres'].split('|')))\
  .toDF()\
  .select("title", "avg_rating", F.explode("genres").alias("genre"))\
  .groupby('genre')\
  .agg(F.avg("avg_rating").alias("avg_genre_rating"))\
  .show()
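
The solution above averages per-movie averages, so every movie carries the same weight within its genre. Another reasonable reading of the question averages the raw ratings directly, so every rating carries the same weight. The plain-Python sketch below uses made-up data to show that the two readings can give different numbers. It illustrates the arithmetic only and is not a replacement for the Spark code.

```python
from collections import defaultdict

# Made-up sample: (movie, genres, rating)
ratings = [
    ('A', ['Comedy'], 5.0),
    ('A', ['Comedy'], 5.0),
    ('A', ['Comedy'], 5.0),
    ('B', ['Comedy', 'Drama'], 2.0),
]

# Approach 1: average per movie first, then average those values per genre
per_movie = defaultdict(list)
genres_of = {}
for movie, genres, rating in ratings:
    per_movie[movie].append(rating)
    genres_of[movie] = genres
movie_avg = {m: sum(rs) / len(rs) for m, rs in per_movie.items()}

by_genre_movie_avgs = defaultdict(list)
for movie, avg in movie_avg.items():
    for genre in genres_of[movie]:
        by_genre_movie_avgs[genre].append(avg)
avg_of_avgs = {g: sum(v) / len(v) for g, v in by_genre_movie_avgs.items()}

# Approach 2: attach every single rating to each of its genres, then average
by_genre_ratings = defaultdict(list)
for _movie, genres, rating in ratings:
    for genre in genres:
        by_genre_ratings[genre].append(rating)
avg_of_ratings = {g: sum(v) / len(v) for g, v in by_genre_ratings.items()}

print(avg_of_avgs)
print(avg_of_ratings)
```

For Comedy, the first approach gives (5.0 + 2.0) / 2 = 3.5, while the second gives (5 + 5 + 5 + 2) / 4 = 4.25. Which one is "the" genre rating depends on whether heavily rated movies should dominate.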

## Task 4 - Years Analysis
In this task we are also going to use both the `ratings.csv` and `movies.csv` datasets to answer the following questions:

1. How many movies were there in each year? Sort them in descending order and display a 'Bar' graph with Databricks.
2. Which year has the best average rating? Only count years with at least 10 ratings.

In [11]:
# Write your code here
from pyspark.sql import functions as F
from pyspark.sql import Row

df_movies = spark.read.option("header", "true").option("inferSchema", "true").csv('FileStore/tables/movies.csv')
df_ratings = spark.read.option("header", "true").option("inferSchema", "true").csv('FileStore/tables/ratings.csv')

# 1 - How many movies were there in each year?
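# Raw string avoids invalid-escape warnings in Python; the pattern takes the 4-digit year in parentheses at the end of the title.
# Titles without such a year get an empty string in the 'year' column.
df_years_extracted = df_movies.withColumn('year', F.regexp_extract('title', r'\((\d{4})\)\s*$', 1))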
df_years_by_count = df_years_extracted\
                                    .groupby('year')\
                                    .count()\
                                    .sort('count', ascending=False)

#df_years_by_count.show()
display(df_years_by_count)
# Display as a 'Bar' graph
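
A year-extraction pattern is easy to sanity-check outside Spark with Python's `re` module before running it on the full dataset. The sketch below assumes the year appears as four digits in parentheses at the end of the title. The pattern, the helper name `extract_year`, and the sample titles are illustrative assumptions rather than anything taken from the dataset. Spark uses Java regular expressions, which should behave the same way for a pattern this simple.

```python
import re

# Hypothetical pattern: a four-digit year in parentheses at the end of the title
YEAR_PATTERN = re.compile(r'\((\d{4})\)\s*$')

def extract_year(title):
    """Return the trailing year as a string, or '' when there is none.

    Returning '' mirrors regexp_extract, which yields an empty string
    when the pattern does not match.
    """
    match = YEAR_PATTERN.search(title)
    return match.group(1) if match else ''

# Made-up titles covering the awkward cases
sample_titles = [
    'Toy Story (1995)',
    '(500) Days of Summer (2009)',  # leading parenthesized number is not the year
    'Movie Without Year',
    'Trailing Space (2001) ',
]
years = [extract_year(t) for t in sample_titles]
print(years)
```

A looser pattern such as `\((\d+)\)` would pick up `500` from the second title, which is why the sketch anchors the match to the end of the string and requires exactly four digits.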

In [12]:
# 2 - Which year has the best average rating?
df_years_extracted\
              .join(df_ratings, 'movieId')\
              .groupby('year')\
              .agg(F.avg('rating').alias('avg_rating'), F.count('*').alias('count'))\
              .filter('count >= 10')\
              .sort('avg_rating', ascending=False)\
              .show()