# Tutorial: Taming Big Data With Apache Spark and Python - Hands On!
## Exercise 6 - Popular Super Hero

### Setup

FindSpark

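The findspark package locates the Spark installation and adds it to the Python path, which avoids many of the problems that arise when the system cannot find Spark.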

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!mv spark-2.4.5-bin-hadoop2.7 spark-2.4.5

In [None]:
import os
# Install java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null 

!pip install -q findspark
 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-2.4.5"
!java -version

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)


In [None]:
!git clone https://github.com/bangkit-pambudi/resource-spark.git

Cloning into 'resource-spark'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 38 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (38/38), done.


In [None]:
import findspark
findspark.init()

Load Libraries

In [None]:
from pyspark import SparkConf, SparkContext

Set the file path

In [None]:
data_folder = "/content/resource-spark/data/"

Create the Spark Context

In [None]:
# configure your Spark context; master node is local machine
conf = SparkConf().setMaster("local").setAppName("PopularHero")

# create a spark context object
sc = SparkContext(conf = conf)

### Load the Data

In [None]:
# paths to the files of interest
file01_to_open = data_folder + "marvel-names.txt" # hero IDs and names
file02_to_open = data_folder + "marvel-graph.txt" # each row: a hero ID followed by the IDs of the heroes it appeared with
# a hero's co-appearances may span multiple rows

# load the files; textFile returns an RDD in which each line of the file is one element
names = sc.textFile(file01_to_open)
lines = sc.textFile(file02_to_open)

Define functions.

In [None]:
# split a row on whitespace; return (hero ID, number of hero IDs that follow it on the row)
def countCoOccurences(line):
    elements = line.split()
    return (int(elements[0]), len(elements) - 1)

# split a row on the double quote; return (hero ID, name encoded as UTF-8 bytes)
def parseNames(line):
    fields = line.split('"')
    return (int(fields[0]), fields[1].encode("utf8"))
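
To see what the two parsers return, here is a minimal sketch that runs them on plain strings, without Spark. The sample rows are invented for illustration and are not taken from the data files, the snake_case function names are hypothetical stand-ins for the functions above, and this variant keeps the name as a str instead of encoding it.

```python
# Stand-alone versions of the two parsers, for illustration only.
def count_co_occurrences(line):
    # First token is the hero ID; every remaining token is a co-appearing hero.
    elements = line.split()
    return (int(elements[0]), len(elements) - 1)

def parse_name(line):
    # Splitting on the double quote leaves the ID before it and the name after it.
    fields = line.split('"')
    return (int(fields[0]), fields[1])

# Invented sample rows (assumption: same layout as the real files).
graph_row = "10 20 30 40"
name_row = '10 "HERO TEN"'

print(count_co_occurrences(graph_row))  # (10, 3)
print(parse_name(name_row))             # (10, 'HERO TEN')
```

Note that int() tolerates the trailing space left in front of the opening quote, so no explicit strip is needed.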

In [None]:
namesRdd = names.map(parseNames) # key/value RDD of (hero ID, name)

### Formatting

Map each row of lines to a key/value pair: the hero ID (the first element of the row) and the number of hero IDs that follow it on that row.

In [None]:
pairings = lines.map(countCoOccurences)

Reduce to one entry per hero ID by summing the values of all pairs that share a key. This aggregation is needed because a hero's co-appearances can span multiple rows, so the same key may occur more than once in pairings.

In [None]:
totalFriendsByCharacter = pairings.reduceByKey(lambda x, y : x + y)
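
The effect of reduceByKey can be sketched in plain Python. This is only an illustration of the semantics on a small in-memory list, not how Spark implements it; reduce_by_key is a hypothetical helper and the sample pairs are invented.

```python
def reduce_by_key(pairs, func):
    """Merge the values of all pairs that share a key using func."""
    merged = {}
    for key, value in pairs:
        # First occurrence stores the value; later ones are combined with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hero 1 appears on two rows, so its counts must be added together.
pairings = [(1, 3), (2, 1), (1, 2)]
totals = reduce_by_key(pairings, lambda x, y: x + y)
print(totals)  # {1: 5, 2: 1}
```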

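Flip each pair so that the co-appearance count becomes the key and the hero ID becomes the value. Tuples are compared element by element, so max() then returns the pair with the highest count.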

In [None]:
flipped = totalFriendsByCharacter.map(lambda xy: (xy[1],xy[0]))
mostPopular = flipped.max()
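
The flip works because Python compares tuples element by element, starting with the first. A small sketch on an invented list shows this, along with an alternative that leaves the pairs unflipped and passes a key function to max. PySpark's RDD.max also accepts a key argument, so the same idea should carry over, though that variant was not run here.

```python
# Invented (hero ID, total co-appearances) pairs.
totals = [(1, 5), (2, 9), (3, 9)]

# Flip to (count, hero ID) so the count is compared first.
flipped = [(count, hero_id) for hero_id, count in totals]
most_popular = max(flipped)
print(most_popular)  # (9, 3)

# Alternative: keep the original order and compare on the count only.
most_popular_by_key = max(totals, key=lambda pair: pair[1])
print(most_popular_by_key)  # (2, 9)
```

The two approaches break ties differently: the flipped version falls back to the larger hero ID, while the key version returns the first maximal pair it encounters.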

### Results

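Convert the hero ID stored in mostPopular to a superhero name by looking it up in namesRdd. lookup returns a list of all values for the key, so take the first element.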

In [None]:
mostPopularName = namesRdd.lookup(mostPopular[1])[0]
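
The need for the trailing [0] is easier to see with a plain-Python stand-in for lookup. This is a sketch of the behaviour only; the lookup helper and the sample names are hypothetical.

```python
def lookup(pairs, key):
    """Return every value whose key matches, as a list."""
    return [value for k, value in pairs if k == key]

# Invented (hero ID, name) pairs.
names = [(1, "HERO ONE"), (2, "HERO TWO")]

matches = lookup(names, 2)
print(matches)          # ['HERO TWO']
print(matches[0])       # HERO TWO
print(lookup(names, 99))  # []
```

An ID that is missing from the names file gives an empty list, in which case indexing with [0] raises an IndexError.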

In [None]:
# parseNames stored the name as UTF-8 bytes, so decode it before printing
print(mostPopularName.decode("utf8") + " is the most popular superhero, with " +
      str(mostPopular[0]) + " co-appearances.")

CAPTAIN AMERICA is the most popular superhero, with 1933 co-appearances.
