# Tutorial: Taming Big Data With Apache Spark and Python - Hands On!
## Exercise 3.2 - Filtering Maximum Temperatures

### Setup

FindSpark

`findspark` locates the Spark installation and adds PySpark to `sys.path`, which avoids many problems with Python not finding Spark.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!mv spark-2.4.5-bin-hadoop2.7 spark-2.4.5

In [None]:
import os
# Install Java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null 

!pip install -q findspark
 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-2.4.5"
!java -version

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)


In [None]:
!git clone https://github.com/bangkit-pambudi/resource-spark.git

Cloning into 'resource-spark'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 38 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (38/38), done.


In [None]:
import findspark
findspark.init()

Load Libraries

In [None]:
from pyspark import SparkConf, SparkContext

Set the file path

In [None]:
data_folder = "/content/resource-spark/data/"

Create the Spark Context

In [None]:
# configure your Spark context; master node is local machine
conf = SparkConf().setMaster("local").setAppName("MaxTemperatures")

# create a spark context object
sc = SparkContext(conf = conf)

### Load the Data

In [None]:
# path to file of interest
file_to_open = data_folder + "1800.csv"

# load the file; textFile returns an RDD in which each element is one line of the file
lines = sc.textFile(file_to_open)

Inspect the RDD

In [None]:
lines.top(5)

['ITE00100554,18001231,TMIN,25,,,E,',
 'ITE00100554,18001231,TMAX,50,,,E,',
 'ITE00100554,18001230,TMIN,31,,,E,',
 'ITE00100554,18001230,TMAX,50,,,E,',
 'ITE00100554,18001229,TMIN,16,,,E,']
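
Note that `top(5)` returns the five largest elements in descending order, not the first five rows of the file, which is why the output starts with the last dates of 1800. To peek at rows in stored order, `take(5)` is the usual choice. The difference can be sketched in plain Python, without Spark; the list `rows` and its ordering are assumptions made up for illustration, using rows in the same layout as the output above.

```python
# Hypothetical mini-dataset in the same layout as 1800.csv (the order is assumed)
rows = [
    'ITE00100554,18001229,TMIN,16,,,E,',
    'ITE00100554,18001231,TMAX,50,,,E,',
    'ITE00100554,18001230,TMIN,31,,,E,',
]

# Roughly what rdd.take(2) gives: the first elements in stored order
first_two = rows[:2]

# Roughly what rdd.top(2) gives: the largest elements, in descending order
largest_two = sorted(rows, reverse=True)[:2]

print(first_two)
print(largest_two)
```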

### Define a Parse Line Function

In [None]:
def parseLine(line):
    fields = line.split(',')  # split on commas
    stationID = fields[0]  # first element
    entryType = fields[2]  # third element
    # fourth element is in tenths of a degree Celsius; convert to Fahrenheit
    temperature = float(fields[3]) * 0.1 * (9.0/5.0) + 32.0
    return (stationID, entryType, temperature)
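
As a quick check of the conversion, the sketch below applies the same arithmetic to one of the sample rows shown earlier. It is plain Python and needs no Spark context; `parse_line` is a renamed standalone copy of the function above, written only so the block runs on its own.

```python
def parse_line(line):
    """Standalone copy of the notebook's parse function, for illustration only."""
    fields = line.split(',')
    station_id = fields[0]
    entry_type = fields[2]
    # The raw value is in tenths of a degree Celsius
    celsius = float(fields[3]) * 0.1
    fahrenheit = celsius * (9.0 / 5.0) + 32.0
    return (station_id, entry_type, fahrenheit)

sample = 'ITE00100554,18001231,TMAX,50,,,E,'
station, entry, temp_f = parse_line(sample)
print(station, entry, round(temp_f, 2))
```

A raw value of 50 is 5.0 degrees Celsius, which works out to about 41 degrees Fahrenheit.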

### Transformations

Parse each line into a (stationID, entryType, temperature) tuple

In [None]:
parsedLines = lines.map(parseLine)

parsedLines.top(5)

[('ITE00100554', 'TMIN', 75.38),
 ('ITE00100554', 'TMIN', 74.84),
 ('ITE00100554', 'TMIN', 74.84),
 ('ITE00100554', 'TMIN', 74.30000000000001),
 ('ITE00100554', 'TMIN', 74.30000000000001)]

Filter by entryType = TMAX

In [None]:
maxTemps = parsedLines.filter(lambda x: x[1] == "TMAX")

Collect the stationID and temperature from the filtered set. This drops the entryType column, which is now always 'TMAX'.

In [None]:
stationTemps = maxTemps.map(lambda x: (x[0], x[2]))

stationTemps.top(5)

[('ITE00100554', 90.14000000000001),
 ('ITE00100554', 89.42),
 ('ITE00100554', 88.34),
 ('ITE00100554', 87.80000000000001),
 ('ITE00100554', 87.62)]
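
The filter and map steps can also be sketched on an ordinary Python list, which makes it easy to see what each one keeps. The tuples in `parsed` are hypothetical values in the `(stationID, entryType, temperature)` layout, and the sketch tests the entry type with exact equality.

```python
# Hypothetical parsed records: (stationID, entryType, temperature in F)
parsed = [
    ('ITE00100554', 'TMIN', 36.5),
    ('ITE00100554', 'TMAX', 41.0),
    ('EZE00100082', 'TMAX', 38.3),
]

# Same idea as parsedLines.filter(...): keep only the TMAX records
max_only = [rec for rec in parsed if rec[1] == 'TMAX']

# Same idea as maxTemps.map(...): drop the entryType field
station_temps = [(rec[0], rec[2]) for rec in max_only]

print(station_temps)
```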

Aggregate by stationID, taking the maximum temperature

In [None]:
maxTemps = stationTemps.reduceByKey(lambda x, y: max(x,y))

maxTemps.top(5)

[('ITE00100554', 90.14000000000001), ('EZE00100082', 90.14000000000001)]
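
To make the aggregation concrete, here is a minimal plain-Python stand-in for what `reduceByKey` computes: values that share a key are combined pairwise with the supplied function. The helper `reduce_by_key` and the sample `pairs` are hypothetical and exist only for this sketch. Spark itself combines values within each partition first and then merges the partial results, so the function passed in should be associative and commutative, which `max` is.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for RDD.reduceByKey on a list of (key, value) pairs."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are combined with func
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# Hypothetical (stationID, temperature) pairs
pairs = [
    ('ITE00100554', 41.0),
    ('EZE00100082', 38.3),
    ('ITE00100554', 90.14),
    ('EZE00100082', 90.14),
]

max_by_station = dict(reduce_by_key(pairs, lambda x, y: max(x, y)))
print(max_by_station)
```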

### Actions

Print out the results

In [None]:
results = maxTemps.collect()

for result in results:
    print(result[0] + "\t{:.2f}F".format(result[1]))

ITE00100554	90.14F
EZE00100082	90.14F
