# Tutorial: Taming Big Data With Apache Spark and Python - Hands On!
## Exercise 1.0 - Frequency of Movie Ratings

### FindSpark
findspark locates your Spark installation and adds PySpark to Python's import path, which avoids many common setup problems.

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!mv spark-2.4.5-bin-hadoop2.7 spark-2.4.5

In [15]:
import os
# Install Java 8 (headless OpenJDK), which Spark needs to run
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null 

!pip install -q findspark
 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-2.4.5"
!java -version

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)


In [25]:
!git clone https://github.com/bangkit-pambudi/resource-spark.git

Cloning into 'resource-spark'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 38 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (38/38), done.


In [16]:
import findspark
findspark.init()
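
If `findspark.init()` fails, a common cause is an environment variable that is unset or points to the wrong place. The check below is a minimal sketch and not part of findspark: `check_spark_env` is a hypothetical helper that reports which of the variables set earlier (`JAVA_HOME`, `SPARK_HOME`) are missing or refer to a directory that does not exist.

```python
import os


def check_spark_env(env=None, required=("JAVA_HOME", "SPARK_HOME")):
    """Return a list of problems; an empty list means nothing was found.

    Hypothetical helper for illustration. It only inspects the mapping
    it is given (os.environ by default) and changes nothing.
    """
    env = os.environ if env is None else env
    problems = []
    for name in required:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


# Run before findspark.init() to see what, if anything, needs fixing
for problem in check_spark_env():
    print(problem)
```

An empty result does not guarantee that Spark will start, only that both directories exist.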

### Setup

In [17]:
from pyspark import SparkConf, SparkContext
import collections

Definitions
* SparkConf: holds the configuration parameters a Spark application needs to run locally or on a cluster
    * setMaster: sets the master URL to connect to
    * setAppName: sets the application name
* SparkContext: the main entry point to Spark functionality; it represents the connection to a Spark cluster
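
The value passed to setMaster is a master URL. In local mode, "local" runs Spark with a single worker thread, "local[K]" with K worker threads, and "local[*]" with as many worker threads as the machine has logical cores. The sketch below only builds these strings; `local_master` is a hypothetical helper, not a Spark API.

```python
def local_master(threads=None):
    """Build a Spark master URL for local mode.

    threads=None -> "local[*]" (one worker thread per logical core)
    threads=1    -> "local"    (a single worker thread)
    threads=K    -> "local[K]" (K worker threads)
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return "local" if threads == 1 else f"local[{threads}]"


# Possible use with SparkConf (assumes pyspark is importable):
#   conf = SparkConf().setMaster(local_master(4)).setAppName("RatingsHistogram")
print(local_master())
print(local_master(1))
print(local_master(4))
```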

Set the file path

In [28]:
data_folder = "/content/resource-spark/data/ml-100k/"

### Create the Spark Context

In [19]:
# configure the Spark context; "local" runs Spark on this machine with a single worker thread
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")

# create the Spark context object
sc = SparkContext(conf=conf)

### Frequency of Movie Ratings
The data file was downloaded earlier as part of the cloned resource-spark repository.

In [29]:
# path to file of interest
file_to_open = data_folder + "u.data"

# load the file; textFile returns an RDD in which each line of the file is one element
lines = sc.textFile(file_to_open)

# transform the RDD
# split each row on whitespace and take the third field (the rating)
ratings = lines.map(lambda x: x.split()[2])

# call an action on RDD
# counts the number of times each unique value occurs
result = ratings.countByValue()

# use the collections module to build a dictionary ordered by rating
sortedResults = collections.OrderedDict(sorted(result.items()))

# print out each key and value in the ordered dictionary
for key, value in sortedResults.items():
    print("%s %i" % (key, value))

1 6110
2 11370
3 27145
4 34174
5 21201
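
To see what the map and countByValue steps compute without starting Spark, the same logic can be sketched in plain Python. The sample rows are made up for illustration and follow the tab-separated layout of u.data (user ID, movie ID, rating, timestamp). The sketch also converts each rating to int, which is a departure from the cell above: string keys sort correctly for the single-digit ratings 1 to 5, but on a wider scale string sorting would place "10" before "2".

```python
from collections import Counter

# Made-up rows in the u.data layout: user id, movie id, rating, timestamp
sample_lines = [
    "1\t10\t3\t881250949",
    "2\t11\t5\t891717742",
    "3\t10\t3\t878887116",
    "4\t12\t1\t880606923",
    "5\t11\t5\t886397596",
    "6\t13\t4\t884182806",
]

# Plain-Python counterpart of lines.map(lambda x: x.split()[2]),
# with an added int conversion so keys sort numerically
ratings = [int(line.split()[2]) for line in sample_lines]

# Plain-Python counterpart of ratings.countByValue(): tally each distinct rating
result = Counter(ratings)

# Sort by rating; a plain dict preserves insertion order in Python 3.7+
sorted_results = dict(sorted(result.items()))

for rating, count in sorted_results.items():
    print(f"{rating} {count}")
```

Unlike this list-based version, the Spark pipeline counts values in parallel across partitions and only returns the final tallies to the driver.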
