# Who's the most popular superhero?

By 'popular', I don't mean popular among comic readers. Here, popularity is defined by how many times a character appears alongside other superheroes in comic books. We'll imagine that if, for example, Spider-Man appears in a comic with Captain America, they are friends. So the superhero with the most friends (the one who appears with the most other superheroes) will be considered the most 'popular'.

We'll be using two different files: 

* Marvel-Graph.txt: each line starts with a superhero's ID, followed by the IDs of all the superheroes he or she has appeared with in comic books.
* Marvel-Names.txt: contains the ID and name of every superhero.
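
To make the two formats concrete, here is a minimal sketch in plain Python (no Spark) that parses one line of each file. The sample lines and function names are invented for illustration, and the layout is an assumption inferred from the parsing code in the next cell: whitespace-separated IDs in the graph file, and an ID followed by a double-quoted name in the names file.

```python
# Hypothetical sample lines; real IDs and names come from the two data files.
graph_line = "101 7 42 256"
names_line = '101 "EXAMPLE HERO"'

def parse_graph_line(line):
    """Return (hero ID, list of IDs that hero has appeared with)."""
    ids = [int(token) for token in line.split()]
    return ids[0], ids[1:]

def parse_names_line(line):
    """Return (hero ID, name), taking the name from between the double quotes."""
    fields = line.split('"')
    return int(fields[0]), fields[1]

hero_id, co_stars = parse_graph_line(graph_line)
name_id, name = parse_names_line(names_line)
print(hero_id, co_stars)
print(name_id, name)
```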

In [8]:
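def countCoOccurences(line):
    # The first element is the hero's ID; every remaining element is a co-appearance.
    elements = line.split()
    return (int(elements[0]), len(elements) - 1)

def parseNames(line):
    # Split on the double quote: the ID comes before it, the name after it.
    fields = line.split('\"')
    # Note: in Python 3, encode() returns bytes, hence the b'...' prefix in the output.
    return (int(fields[0]), fields[1].encode("utf8"))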
    

names = sc.textFile('/Users/jacquesthibodeau/big-data-datasets/Marvel-Names.txt')
namesRDD = names.map(parseNames)

lines = sc.textFile('/Users/jacquesthibodeau/big-data-datasets/Marvel-Graph.txt')
pairings = lines.map(countCoOccurences)

totalFriendsByCharacter = pairings.reduceByKey(lambda x, y : x + y)
flipped = totalFriendsByCharacter.map(lambda x : (x[1], x[0]))

mostPopularHeroID = flipped.max()

mostPopularHeroName = namesRDD.lookup(mostPopularHeroID[1])[0]

print(str(mostPopularHeroName) + " is the most popular superhero, with " + \
     str(mostPopularHeroID[0]) + " co-appearances.")

b'CAPTAIN AMERICA' is the most popular superhero, with 1933 co-appearances.


There we have it: Captain America is the most popular superhero.
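
The same counting logic can be checked without Spark. The sketch below is a plain-Python analogue of the `reduceByKey`, flip, and `max` steps, run on a few invented lines. It sums the per-line counts for each ID (so an ID that appears on more than one line is still handled) and then picks the largest total. The data and the helper name `most_connected` are hypothetical.

```python
from collections import defaultdict

# Invented graph lines: hero ID first, then the IDs it co-appears with.
sample_lines = [
    "1 2 3 4",
    "2 1",
    "1 5",      # the same hero ID may show up on more than one line
    "3 1 2",
]

def most_connected(lines):
    """Return (total co-appearances, hero ID) for the best-connected hero."""
    totals = defaultdict(int)
    for line in lines:
        elements = line.split()
        # Same count as countCoOccurences: everything after the first ID.
        totals[int(elements[0])] += len(elements) - 1
    # Put the count first so max() compares by count, as the flipped RDD does.
    return max((count, hero_id) for hero_id, count in totals.items())

print(most_connected(sample_lines))  # (4, 1)
```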

---

Now we will use breadth-first search (BFS) to find the degrees of separation between two superheroes.
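
As a reminder of what BFS computes, here is a minimal single-machine sketch on a small invented graph. It is not the distributed version in the next cell; it only illustrates that BFS explores neighbors level by level, so the first time the target is reached gives the smallest number of hops. The graph and the function name are hypothetical.

```python
from collections import deque

# Invented adjacency lists: hero ID -> IDs it has appeared with.
graph = {
    1: [2, 3],
    2: [1, 4],
    3: [1],
    4: [2, 5],
    5: [4],
}

def degrees_of_separation(graph, start, target):
    """Return the fewest hops from start to target, or None if unreachable."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return distances[current]
        for neighbor in graph.get(current, []):
            if neighbor not in distances:  # not visited yet
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return None

print(degrees_of_separation(graph, 1, 5))  # path 1 -> 2 -> 4 -> 5, so 3
```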

In [None]:
# The characters we wish to find the degrees of separation between:
startCharacterID = 5306
targetCharacterID = 14

# We use what is called an accumulator. It signals us when we have
# found the target we are looking for during our BFS traversal.
hitCounter = sc.accumulator(0)

def convertToBFS(line):
    fields = line.split()
    heroID = int(fields[0])
    connections = []
    for connection in fields[1:]:
        connections.append(int(connection))
    
    # initial conditions for BFS
    color = 'WHITE'
    distance = 9999
    
    if (heroID == startCharacterID):
        color = 'GRAY'
        distance = 0
    
    return (heroID, (connections, distance, color)) # key/value pair

def createStartingRDD():
    inputFile = sc.textFile('/Users/jacquesthibodeau/big-data-datasets/Marvel-Graph.txt')
    return inputFile.map(convertToBFS)

def bfsMap(node):
    characterID = node[0]
    data = node[1]
    connections = data[0]
    distance = data[1]
    color = data[2]
    
    results = []
    
    # If this node needs to be expanded...
    if (color == 'GRAY'):
        for connection in connections:
            newCharacterID = connection
            newDistance = distance + 1
            newColor = 'GRAY'
            if (targetCharacterID == connection):
                hitCounter.add(1)
                
            newEntry = (newCharacterID, ([], newDistance, newColor))
            results.append(newEntry)
            
        # We've processed this node, so we color it black
        color = 'BLACK'
        
    # Emit the input node so we don't lose it.
    # append() takes a single argument, so the key/value pair is wrapped in a tuple.
    results.append((characterID, (connections, distance, color)))
    return results

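def bfsReduce(data1, data2):
    # Assumed implementation, following the reduce stage described in the
    # loop below: merge two entries for the same character by combining
    # their connections, keeping the shortest distance and the darkest color.
    darkness = {'WHITE': 0, 'GRAY': 1, 'BLACK': 2}
    connections = data1[0] + data2[0]
    distance = min(data1[1], data2[1])
    color = data1[2] if darkness[data1[2]] >= darkness[data2[2]] else data2[2]
    return (connections, distance, color)

# Start from the graph with only the start character colored GRAY.
iterationRdd = createStartingRDD()

for iteration in range(0, 10):
    print("Running BFS iteration# " + str(iteration+1))
    
    # Create new vertices as needed to darken or reduce distances in the
    # reduce stage. If we encounter the node we're looking for as a GRAY
    # node, increment our accumulator to signal that we're done.
    mapped = iterationRdd.flatMap(bfsMap)
    
    # Note that the mapped.count() action here forces the RDD to be evaluated,
    # and that's the only reason our accumulator is actually updated.
    print("Processing " + str(mapped.count()) + " values.")
    
    if (hitCounter.value > 0):
        print("Hit the target character! From " + str(hitCounter.value) \
             + " different direction(s)")
        break
    
    # The reducer combines all entries generated for each character ID,
    # preserving the darkest color and the shortest distance.
    iterationRdd = mapped.reduceByKey(bfsReduce)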
    

pairingsBFS = lines.map(convertToBFS)
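
To see how one round of the map and reduce stages transforms the graph, here is a minimal single-machine sketch in plain Python on an invented three-hero chain. The helper names (`expand`, `merge`, `run_round`) and the data are hypothetical, and the merge rule (combine connections, keep the shortest distance and the darkest color) is an assumption based on the reduce stage described in the loop comments.

```python
INFINITY = 9999  # same "unknown distance" placeholder as the cell above

def expand(node):
    """Map step: a GRAY node emits a GRAY entry per neighbor, then turns BLACK."""
    hero_id, (connections, distance, color) = node
    emitted = []
    if color == 'GRAY':
        for neighbor in connections:
            emitted.append((neighbor, ([], distance + 1, 'GRAY')))
        color = 'BLACK'
    emitted.append((hero_id, (connections, distance, color)))
    return emitted

def merge(a, b):
    """Reduce step: combine connections, keep shortest distance and darkest color."""
    darkness = {'WHITE': 0, 'GRAY': 1, 'BLACK': 2}
    connections = a[0] + b[0]
    distance = min(a[1], b[1])
    color = max(a[2], b[2], key=darkness.get)
    return (connections, distance, color)

def run_round(nodes):
    """One BFS iteration over a list of (hero ID, data) pairs."""
    merged = {}
    for node in nodes:
        for hero_id, data in expand(node):
            merged[hero_id] = merge(merged[hero_id], data) if hero_id in merged else data
    return sorted(merged.items())

# Invented three-hero chain 1 - 2 - 3, starting the search at hero 1.
nodes = [
    (1, ([2], 0, 'GRAY')),
    (2, ([1, 3], INFINITY, 'WHITE')),
    (3, ([2], INFINITY, 'WHITE')),
]
for _ in range(2):
    nodes = run_round(nodes)
print(nodes)
```

After two rounds, hero 3 carries distance 2 and is GRAY, meaning it has been reached but not yet expanded, while heroes 1 and 2 are BLACK.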