# Tutorial: Taming Big Data With Apache Spark and Python - Hands On!
## Exercise 5 (Part 2) - Popular Movies

### Setup

findspark

The `findspark` package locates the Spark installation and adds PySpark to the Python path, which avoids many problems with Python not finding Spark.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!mv spark-2.4.5-bin-hadoop2.7 spark-2.4.5

In [None]:
import os
# Install java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null 

!pip install -q findspark
 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-2.4.5"
!java -version

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)


In [None]:
!git clone https://github.com/bangkit-pambudi/resource-spark.git

Cloning into 'resource-spark'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 38 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (38/38), done.


In [None]:
import findspark
findspark.init()

Load Libraries

In [None]:
from pyspark import SparkConf, SparkContext

Set the file path

In [None]:
data_folder = "/content/resource-spark/data/ml-100k/"

Define a function that builds the lookup table from movie ID to movie name. This dictionary is what gets broadcast later.

In [None]:
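def loadMovieNames():
    """Return a dict mapping movie ID (int) to movie name (str)."""
    movieNames = {}
    # the MovieLens item file is ISO-8859-1 encoded, not UTF-8, so set the encoding explicitly
    with open(data_folder + "u.ITEM", encoding="ISO-8859-1") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]  # key: movie ID, value: movie name
    return movieNames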
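
The parsing step inside `loadMovieNames` can be tried without the data set. The sketch below is a minimal illustration, assuming pipe-delimited lines whose first two fields are the movie ID and the title. The `parse_movie_names` helper and the `sample_lines` values are made up for this example and are not part of the notebook.

```python
def parse_movie_names(lines):
    """Build a {movie_id: title} dict from pipe-delimited lines."""
    movie_names = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        movie_names[int(fields[0])] = fields[1]
    return movie_names


# Hypothetical sample lines, shaped like the records in the item file
sample_lines = [
    "1|Movie A (1995)|01-Jan-1995\n",
    "2|Movie B (1995)|01-Jan-1995\n",
]

names = parse_movie_names(sample_lines)
print(names[1])
```

Only the first two fields are used, so any extra columns after the title are ignored.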

Create the Spark Context

In [None]:
# configure your Spark context; master node is local machine
conf = SparkConf().setMaster("local").setAppName("PopularMovies")

# create a spark context object
sc = SparkContext(conf = conf)

### Load the Data

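Broadcast the movie names so that each executor receives one read-only copy of the dictionary instead of having it shipped with every task.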

In [None]:
nameDict = sc.broadcast(loadMovieNames())

In [None]:
# path to file of interest
file_to_open = data_folder + "u.data"

# load the file; textFile reads it line by line, so each line becomes one element of the RDD
# (note: the variable name `input` shadows Python's built-in input() function)
input = sc.textFile(file_to_open)

Inspect the RDD. Each line holds four tab-separated fields:

*User ID, Movie ID, Rating, Timestamp*

In [None]:
input.top(5)

['99\t98\t5\t885679596',
 '99\t978\t3\t885679382',
 '99\t975\t3\t885679472',
 '99\t963\t3\t885679998',
 '99\t931\t2\t886780147']
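
`top(5)` returns the five largest elements rather than the first five lines of the file, and because each element is a string, "largest" means lexicographic order. The plain-Python sketch below mimics that ordering on three of the lines shown above. It does not call Spark, and the variable names are illustrative.

```python
# Three of the lines from the output above, deliberately out of order
lines = [
    "99\t931\t2\t886780147",
    "99\t98\t5\t885679596",
    "99\t978\t3\t885679382",
]

# Strings compare character by character, so "98" sorts after "978"
# because '8' is greater than '7' at the first differing position.
largest_first = sorted(lines, reverse=True)

for line in largest_first:
    user_id, movie_id, rating, timestamp = line.split("\t")
    print(user_id, movie_id, rating, timestamp)
```

To see the first lines of the file in their original order, `take(5)` is the usual alternative.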

### Formatting

For each row (x), split the line on whitespace and take the second field (the movie ID) as an integer. Each movie ID is paired with the value 1, giving (movieID, 1) tuples that can later be summed to count ratings.

In [None]:
movies = input.map(lambda x: (int(x.split()[1]), 1))

Reduce movies to unique keys by summing the values of all pairs that share a key. The result has one (movieID, count) pair per movie, where count is the number of ratings (i.e., the frequency).

In [None]:
movieCounts = movies.reduceByKey(lambda x, y: x + y)
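
To see what the map and `reduceByKey` steps compute, the following plain-Python sketch reproduces them on a few made-up rating lines. It only mirrors the logic on a single machine, whereas Spark applies the same pairing and summing across partitions. The sample data and variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical rating lines: userID, movieID, rating, timestamp (tab-separated)
sample_lines = [
    "10\t242\t3\t881250949",
    "11\t302\t3\t891717742",
    "12\t242\t4\t878887116",
    "13\t51\t2\t880606923",
    "14\t242\t5\t886397596",
]

# Equivalent of the map step: (movieID, 1) for every rating
pairs = [(int(line.split()[1]), 1) for line in sample_lines]

# Equivalent of reduceByKey with lambda x, y: x + y
counts = defaultdict(int)
for movie_id, one in pairs:
    counts[movie_id] += one

print(dict(counts))
```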

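Swap key and value so that the count becomes the key, then sort by it. `sortByKey()` sorts in ascending order by default, so the most-rated movies come last.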

In [None]:
flipped = movieCounts.map(lambda x: (x[1],x[0]))
sortedMovies = flipped.sortByKey()
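
The flip-then-sort pattern can be checked on a small list. This sketch uses plain Python with made-up counts, and sorting on the first tuple item stands in for `sortByKey` here.

```python
# Hypothetical (movieID, count) pairs
movie_counts = [(242, 3), (302, 1), (51, 2)]

# Swap to (count, movieID), as in the map step above
flipped = [(count, movie_id) for movie_id, count in movie_counts]

# Sort ascending by count, the first item of each tuple
sorted_movies = sorted(flipped, key=lambda pair: pair[0])

print(sorted_movies)
```

With ascending order, the most-rated movie is the last element of the list.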

### Results

In [None]:
# for each (count, movieID) pair in sortedMovies, look up the movie name in nameDict and emit (movieName, count)
sortedMoviesWithNames = sortedMovies.map(lambda countMovie : (nameDict.value[countMovie[1]], countMovie[0]))
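
The lookup in this map can be pictured with an ordinary dictionary. The sketch below is a plain-Python stand-in, assuming a made-up `name_dict` in place of the broadcast value; the titles and counts are hypothetical.

```python
# Hypothetical lookup table and sorted (count, movieID) pairs
name_dict = {302: "Movie A", 51: "Movie B", 242: "Movie C"}
sorted_movies = [(1, 302), (2, 51), (3, 242)]

# Same shape as the Spark map: (count, movieID) -> (movieName, count)
with_names = [(name_dict[movie_id], count) for count, movie_id in sorted_movies]

print(with_names)
```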

In [None]:
results = sortedMoviesWithNames.collect()

In [None]:
for result in results:
    print(result)
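
Because the results are sorted in ascending order, the most popular movies are printed last. A minimal sketch of showing only the top entries, most popular first, is given below; the `results` list here is made up and `top_n` is a hypothetical parameter.

```python
# Hypothetical (movieName, count) results in ascending order of count
results = [("Movie A", 1), ("Movie B", 2), ("Movie C", 3), ("Movie D", 5)]

top_n = 2  # how many of the most-rated movies to show

# Take the last top_n entries and reverse them so the largest count comes first
most_popular = results[-top_n:][::-1]

for name, count in most_popular:
    print(f"{name}: {count} ratings")
```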