# Exercise #4 - Netflix Recommendations
As you may remember, we implemented several MapReduce algorithms on the `MovieLens` dataset.
In this exercise you will solve the same problems using Spark.

## Task 1 - Ranking Breakdown
Develop a Spark application that counts how many ratings of 5, 4, 3, 2 and 1 were given.

Use as input the file `FileStore/tables/ratings.csv`.

Remember:
1. If a rating has a decimal part, round it up. For example, 4.5 is rounded up to 5.0.
2. Skip the header row (the line whose rating field is the word 'rating') when loading the data.
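The two rules above can be checked locally before running anything on Spark. The following is a minimal plain-Python sketch, not the Spark solution itself: the helper name `parse_rating`, the sample lines and the exact header text are assumptions made up for illustration.

```python
import math
from collections import Counter

def parse_rating(line):
    """Return the rounded-up rating of a CSV line, or None for the header row."""
    user_id, movie_id, rating, timestamp = line.split(',')
    if rating == 'rating':  # header row: its rating field is the literal word 'rating'
        return None
    return math.ceil(float(rating))  # 4.5 -> 5, 3.0 -> 3

# Hypothetical sample lines, invented for this sketch (header text is assumed)
sample_lines = [
    'userId,movieId,rating,timestamp',
    '1,31,2.5,1260759144',
    '1,1029,3.0,1260759179',
    '2,10,4.5,835355493',
    '2,17,5.0,835355681',
]

# Drop the header, round every rating up, then count occurrences per value
rounded = [r for r in map(parse_rating, sample_lines) if r is not None]
breakdown = Counter(rounded)
print(sorted(breakdown.items()))
```

In a Spark job the same parsing function would typically be applied per line, with the counting done by a key-based reduction instead of `Counter`.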

In [3]:
# Write your code here
import math
from operator import add

def parse_line(line):
  (userID, movieID, rating, timestamp) = line.split(',')
  if rating != 'rating':
    return [(math.ceil(float(rating)), 1)]
  else:
    return []

sc.textFile('FileStore/tables/ratings.csv')\
  .flatMap(parse_line)\
  .reduceByKey(add)\
  .takeOrdered(5)

## Task 2 - Movies By Popularity
1. Write a Spark job that ranks movies (by movie ID) by their popularity, measured as the average rating.
2. Print the top 20 movies as (movieID, average rating) pairs.
3. Include only movies with more than 10 ratings.
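The three steps above boil down to a per-movie sum and count, a threshold on the count, and a sort on the average. Here is a small plain-Python sketch of that logic, assuming the ratings are already parsed into `(movie_id, rating)` pairs; the function name `top_movies_by_average` and the sample data are hypothetical.

```python
from collections import defaultdict

def top_movies_by_average(ratings, min_count=10, top_n=20):
    """Rank movies by average rating, keeping only those with more than min_count ratings."""
    totals = defaultdict(lambda: [0.0, 0])  # movie_id -> [sum of ratings, number of ratings]
    for movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    # Filter first, so movies with too few ratings never reach the ranking
    averages = [(m, s / c) for m, (s, c) in totals.items() if c > min_count]
    return sorted(averages, key=lambda pair: -pair[1])[:top_n]

# Hypothetical data: 'B' has only 3 ratings and must not appear in the result
sample = [('A', 4.0)] * 11 + [('B', 5.0)] * 3 + [('C', 3.0)] * 12
result = top_movies_by_average(sample)
print(result)
```

Carrying a running sum and count per movie, rather than a list of all its ratings, keeps the per-key state constant in size, which matters when the same idea is expressed as a reduction over a large dataset.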

In [5]:
# Write your code here
import math

def parse_line(line):
  # Each line has the form: userID,movieID,rating,timestamp
  (userID, movieID, rating, timestamp) = line.split(',')
  if rating != 'rating':  # skip the CSV header row
    # Emit (movieID, (rating, 1)) so that sums and counts can be combined per movie
    return [(movieID, (math.ceil(float(rating)), 1))]
  else:
    return []

sc.textFile('FileStore/tables/ratings.csv')\
  .flatMap(parse_line)\
  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))\
  .filter(lambda kv: kv[1][1] > 10)\
  .mapValues(lambda sum_count: sum_count[0] / sum_count[1])\
  .takeOrdered(20, key=lambda x: -x[1])

## Task 3 - Netflix Recommendations
This task is similar to Exercise #3 in MapReduce.

For each movie, find the top 10 movies with the highest similarity (and their 'similarity value').

Use the file: `/FileStore/tables/ratings_small.csv`.

This time there is no need to merge the files or build a table from the output.
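The solution cell below treats the similarity of two movies as the number of users who rated both of them. Under that assumption, the pair-counting step can be sketched in plain Python as follows; the function name `co_rating_counts` and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import combinations

def co_rating_counts(user_movies):
    """Count, for every unordered movie pair, how many users rated both movies."""
    counts = Counter()
    for movies in user_movies.values():
        # sorted(set(...)) gives each pair one canonical order (smaller ID first)
        # and guards against a movie listed twice for the same user
        for m1, m2 in combinations(sorted(set(movies)), 2):
            counts[(m1, m2)] += 1
    return counts

# Hypothetical input: user -> list of movie IDs that user rated
sample = {'u1': [1, 2, 3], 'u2': [2, 3], 'u3': [3, 1]}
pairs = co_rating_counts(sample)
print(sorted(pairs.items()))
```

Note that the number of pairs grows quadratically with the number of movies a single user rated, which is worth keeping in mind for users with very long rating histories.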

In [7]:
# Write your code here
from operator import add

def parse_line(line):
  (userID, movieID, rating, timestamp) = line.split(',')
  if rating != 'rating':
    return [(userID, [int(movieID)])]
  else:
    return []

def movies_permutations(user_movies_pair):
  permutations = []
  user, movies = user_movies_pair
  for i in range(len(movies)):
    for j in range(i+1, len(movies)):
      m1 = movies[i]
      m2 = movies[j]
      if m1 <= m2:
        permutations.append(((m1,m2), 1))
      else:
        permutations.append(((m2,m1),1))
  return permutations
  
def sim_between_two_movies(movies_pair_and_simval):
  pair, simval = movies_pair_and_simval
  m1, m2 = pair
  return [(m1, [(m2, simval)]), (m2, [(m1, simval)])]
  
def top_10_similar_movies(movie2_simval_values):
  # This might be dangerous if the number of values is large. Can you think of another solution?
  # One option is a min-heap bounded to 10 entries (or heapq.nlargest), which avoids sorting the whole list.
  sorted_pairs = sorted(movie2_simval_values, key=lambda x: x[1], reverse=True)
  # Slicing past the end of a list is safe, so no explicit length check is needed
  return sorted_pairs[:10]
  
sc.textFile('FileStore/tables/ratings_small.csv')\
  .flatMap(parse_line)\
  .reduceByKey(add)\
  .flatMap(movies_permutations)\
  .reduceByKey(add)\
  .flatMap(sim_between_two_movies)\
  .reduceByKey(add)\
  .mapValues(top_10_similar_movies)\
  .take(20)
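
The concern about large value lists in `top_10_similar_movies` can be addressed with a bounded selection instead of a full sort. The following is a minimal sketch using `heapq.nlargest` from the Python standard library; the function name `top_k_similar` and the sample data are hypothetical, and this is only one possible approach.

```python
import heapq

def top_k_similar(neighbours, k=10):
    """Return the k (movie_id, similarity) pairs with the highest similarity."""
    # nlargest keeps at most k candidates at a time instead of sorting everything
    return heapq.nlargest(k, neighbours, key=lambda pair: pair[1])

# Hypothetical neighbours of one movie: (other movie ID, similarity value)
neighbours = [(2, 1), (3, 7), (4, 3), (5, 9)]
top_2 = top_k_similar(neighbours, k=2)
print(top_2)
```

This only reduces the cost of the final selection. The list of neighbours is still assembled per movie beforehand, so avoiding that entirely would require keeping a bounded top-10 structure during the aggregation step itself.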