# Tutorial: Taming Big Data With Apache Spark and Python - Hands On!
## Exercise 6 - Popular Super Hero

### Setup

FindSpark

findspark adds the Spark installation at the given path to Python's import path, which avoids most problems with your system not finding Spark.

In [1]:
import findspark
findspark.init('c:/users/andy/spark')

Load Libraries

In [2]:
from pyspark import SparkConf, SparkContext

Set the file path

In [3]:
data_folder = "C:/Users/Andy/Dropbox/FactoryFloor/Repositories/Tutorial_Udemy_SparkPython/Course_Resources/"

Create the Spark Context

In [4]:
# configure your Spark context; master node is local machine
conf = SparkConf().setMaster("local").setAppName("PopularHero")

# create a spark context object
sc = SparkContext(conf = conf)

### Load the Data

In [5]:
# paths to the files of interest
file01_to_open = data_folder + "marvel-names.txt" # maps each hero ID to a hero name
file02_to_open = data_folder + "marvel-graph.txt" # each line: a hero ID followed by the IDs of heroes it appeared with
# a hero's co-appearances may span multiple lines

# load the files; textFile returns an RDD with one element per line of the file
names = sc.textFile(file01_to_open)
lines = sc.textFile(file02_to_open)

Define functions.

In [24]:
# split a line on whitespace; return (hero ID, number of IDs that follow it on the line)
def countCoOccurences(line):
    elements = line.split()
    return (int(elements[0]), len(elements) - 1)

# split a line on the double quote; return (hero ID, hero name as UTF-8 bytes)
def parseNames(line):
    fields = line.split('\"')
    return (int(fields[0]), fields[1].encode("utf8"))

In [25]:
namesRdd = names.map(parseNames) # key/value RDD of (hero ID, name)
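
To see what the two parsers return, they can be run on hand-written lines without Spark. This is a minimal sketch: the functions are copies of the ones defined above, and the two sample lines are made up and only mimic the layout of the data files.

```python
# Copies of the notebook's two parsers, so this cell runs without Spark.
def countCoOccurences(line):
    elements = line.split()
    return (int(elements[0]), len(elements) - 1)

def parseNames(line):
    fields = line.split('"')
    return (int(fields[0]), fields[1].encode("utf8"))

# Hypothetical sample lines that only mimic the layout of the two files.
sample_graph_line = "10 20 30 40"
sample_names_line = '42 "EXAMPLE HERO"'

graph_pair = countCoOccurences(sample_graph_line)
name_pair = parseNames(sample_names_line)

print(graph_pair)  # (10, 3)
print(name_pair)   # (42, b'EXAMPLE HERO')
```

The name comes back as bytes because of the encode call, which is why it prints with a b'' prefix under Python 3.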

### Formatting

Map each line to a key/value pair: the hero ID (the first element) and the number of hero IDs that follow it on that line.

In [27]:
pairings = lines.map(countCoOccurences)

Reduce to one entry per hero ID by summing the counts of all pairs that share a key. The aggregation is needed because a hero's co-appearances can span multiple lines.

In [28]:
totalFriendsByCharacter = pairings.reduceByKey(lambda x, y : x + y)
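
The effect of reduceByKey can be imitated in plain Python to make the aggregation concrete. This is only an analogy under stated assumptions: reduce_by_key is a hypothetical helper, not a Spark function, and the sample pairs are made up.

```python
# Hypothetical (hero ID, count) pairs; hero 10 appears on two lines.
sample_pairs = [(10, 3), (20, 1), (10, 2)]

def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        # First sighting of a key stores the value; later ones are combined.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

totals = reduce_by_key(sample_pairs, lambda x, y: x + y)
print(totals)  # {10: 5, 20: 1}
```

The two entries for hero 10 collapse into one total, which is what happens to a hero whose co-appearances are split across lines of the graph file.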

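Swap each pair so the count becomes the key and the hero ID the value. Tuples compare element by element, so max() then returns the pair with the highest count.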

In [29]:
flipped = totalFriendsByCharacter.map(lambda xy: (xy[1],xy[0]))
mostPopular = flipped.max()
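
The flip works because Python compares tuples element by element. A small sketch on made-up pairs, using plain lists rather than RDDs, shows this and an alternative that leaves the pairs unflipped:

```python
# Hypothetical (hero ID, total co-appearances) pairs.
totals = [(10, 5), (20, 1), (30, 7)]

# Flip to (count, hero ID): tuples compare element by element,
# so max() ranks by count first.
flipped = [(count, hero_id) for hero_id, count in totals]
most_popular = max(flipped)

# Same hero without flipping, by telling max() which field to compare.
most_popular_by_key = max(totals, key=lambda pair: pair[1])

print(most_popular)         # (7, 30)
print(most_popular_by_key)  # (30, 7)
```

PySpark's RDD.max also accepts a key argument, so totalFriendsByCharacter.max(key=lambda xy: xy[1]) should find the same hero without the flip. That variant was not run here, and it returns (hero ID, count), so the indexes used in the later cells would need to be swapped.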

### Results

Look up the hero ID stored in mostPopular to get the super hero's name. lookup returns a list of every value stored under that key, so take its first element.

In [30]:
mostPopularName = namesRdd.lookup(mostPopular[1])[0]

In [31]:
# parseNames stores names as UTF-8 bytes, so decode before printing
print(mostPopularName.decode("utf8") + " is the most popular superhero, with " +
      str(mostPopular[0]) + " co-appearances.")

CAPTAIN AMERICA is the most popular superhero, with 1933 co-appearances.
