In [1]:
## this notebook assumes that you have numpy, matplotlib and networkx in your python environment
## if not, run "pip install" or your favorite package manager to install them first

import csv
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline


In [25]:
# download this file http://syntagmatic.github.io/exposedata/marvel/data/hero-network.csv
# CHANGE THE NEXT LINE TO YOUR FILE PATH
marvel_hero_file = './graph-dataset/hero-network.csv'

In [26]:
# borrowing functions and processing from here 
# https://medium.com/@jmandalihan.msds2023/exploring-the-marvel-universe-through-network-analysis-a-starters-guide-88eb40e85c47

# Load data from the CSV file

def load_data_from_csv(file_path):
    data = []
    with open(file_path, newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # Skip the header row
        for row in reader:
            data.append((row[0], row[1]))
    return data


In [29]:

raw_data = load_data_from_csv(marvel_hero_file)

len(raw_data)

574466

In [30]:
# Create an undirected graph
hero_graph = nx.Graph()

# Add edges from the dataset
for edge in raw_data:
    hero_graph.add_edge(edge[0], edge[1])
    
print('total number nodes and edges', hero_graph.number_of_nodes(), hero_graph.number_of_edges() ) 
print ('average number of edges per node', hero_graph.number_of_edges()/hero_graph.number_of_nodes())
print ('density of the Hero social network', nx.density(hero_graph))


total number nodes and edges 6426 167219
average number of edges per node 26.022253345782758
density of the Hero social network 0.00810031232553549


In [31]:
# Create a directed graph
hero_di_graph = nx.DiGraph()

# Add edges from the dataset
for edge in raw_data:
    hero_di_graph.add_edge(edge[0], edge[1])
    
print('total number nodes and edges:', hero_di_graph.number_of_nodes(), hero_di_graph.number_of_edges() ) 
print ('average number of edges per node', hero_di_graph.number_of_edges()/hero_di_graph.number_of_nodes())
print ('density of the Hero social network', nx.density(hero_di_graph))
## why is the number of edges smaller than the number of lines in the CSV?

total number nodes and edges: 6426 224181
average number of edges per node 34.88655462184874
density of the Hero social network 0.005429813948925874


## Question 1

NetworkX's built-in adjacency-list reader can also read this file: `graph_from_adj = nx.read_adjlist(marvel_hero_file)`. 
BUT the outcome is different from that of our custom CSV reader. How is it different? What commands/functions from the shell or networkx can you use to expose the differences?

What could have caused this discrepancy? 
Examine the file content, explain the differences, and determine which version you would use. 
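
One way to probe the discrepancy is to feed a few lines to both parsers and compare the node names they produce. The sketch below uses `nx.parse_adjlist`, which applies the same line format as `read_adjlist` to an iterable of strings. The two sample lines are hypothetical and only imitate a quoted, comma-separated layout; check the real file to see whether it looks like this.

```python
import csv
import networkx as nx

# Hypothetical sample lines in a quoted "name","name" layout (not copied from the file).
sample_lines = [
    '"LITTLE, ABNER","PRINCESS ZANDA"',
    '"BLACK PANTHER","PRINCESS ZANDA"',
]

# Adjacency-list parsing: with the default delimiter each line is split on whitespace,
# the first token is a node and every remaining token is one of its neighbours.
g_adj = nx.parse_adjlist(sample_lines)

# CSV parsing: quotes and commas are honoured, so each row yields two full names.
g_csv = nx.Graph()
for row in csv.reader(sample_lines):
    g_csv.add_edge(row[0], row[1])

print("adjlist nodes:", sorted(g_adj.nodes()))
print("csv nodes:    ", sorted(g_csv.nodes()))
```

Comparing `sorted(graph.nodes())[:20]` for the two full graphs in the same way should make any such name splitting visible.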


## Question 2

Compare how many lines are in the input CSV file with how many nodes and edges are in each of the NetworkX graphs. 

Why is the number of edges of the two graphs different?
Why is the number of edges smaller than the number of lines in the CSV?

hint: shell commands such as this may provide a hint `sort graph-dataset/hero-network.csv | uniq | wc -l`

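`sort`: sorts the lines of the file and writes them out in order.
`uniq`: removes adjacent duplicate lines, so after `sort` only the unique lines remain.
According to the shell command, the CSV has 224181 unique lines,
which is the same as the number of edges in the directed graph: repeated lines collapse into a single edge.
The undirected graph has fewer edges (167219), probably because a pair that is listed in both directions (A,B and B,A) gives two directed edges but only one undirected edge.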
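
The counting argument can be checked on a tiny example. This is a minimal sketch with a made-up edge list (`toy_pairs` is hypothetical, not taken from the dataset): it counts raw rows, unique ordered pairs, and unique unordered pairs, and compares them with what `nx.DiGraph` and `nx.Graph` report.

```python
import networkx as nx

# Made-up rows: one exact duplicate and one reversed pair.
toy_pairs = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]

n_rows = len(toy_pairs)                                # every row in the file
n_ordered = len(set(toy_pairs))                        # unique (source, target) pairs
n_unordered = len({frozenset(p) for p in toy_pairs})   # direction ignored

di_graph = nx.DiGraph(toy_pairs)   # duplicate rows collapse into one directed edge
un_graph = nx.Graph(toy_pairs)     # A-B and B-A also collapse into one edge

print(n_rows, n_ordered, n_unordered)
print(di_graph.number_of_edges(), un_graph.number_of_edges())
```

Applying the same three counts to `raw_data` gives a direct way to confirm the explanation on the real file.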

In [22]:
# Who are the most connected nodes? Does this list surprise you?

degree_tuple = sorted(hero_graph.degree(), key=lambda x: x[1], reverse=True)

print(degree_tuple[:15])

[('CAPTAIN AMERICA', 1908), ('SPIDER-MAN/PETER PAR', 1737), ('IRON MAN/TONY STARK ', 1522), ('THING/BENJAMIN J. GR', 1416), ('MR. FANTASTIC/REED R', 1379), ('WOLVERINE/LOGAN ', 1371), ('HUMAN TORCH/JOHNNY S', 1361), ('SCARLET WITCH/WANDA ', 1325), ('THOR/DR. DONALD BLAK', 1289), ('BEAST/HENRY &HANK& P', 1267), ('VISION ', 1241), ('INVISIBLE WOMAN/SUE ', 1236), ('HAWK', 1175), ('WASP/JANET VAN DYNE ', 1091), ('ANT-MAN/DR. HENRY J.', 1082)]


## Question 3

Wait ... my favorite hero is Black Panther / Wong / whoever. 
Write a few lines of code that print the top friends (e.g. top 10 by degree) of any (your favorite) character. 

In [38]:
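# put your code here 
# Example character: 'LITTLE, ABNER'
# Collect the character's neighbours, look up each neighbour's degree in the whole graph,
# and print the 15 best-connected neighbours.
neighbours_of_ABNER = hero_graph.neighbors('LITTLE, ABNER')
degrees = hero_graph.degree(neighbours_of_ABNER)
sorted_neighbours = sorted(degrees, key=lambda x: x[1], reverse=True)
print(sorted_neighbours[:15])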

[('IRON MAN/TONY STARK ', 1522), ('THOR/DR. DONALD BLAK', 1289), ('JARVIS, EDWIN ', 986), ("BLACK PANTHER/T'CHAL", 711), ('BINARY/CAROL DANVERS', 537), ('IRON MAN IV/JAMES R.', 465), ('HOGAN, VIRGINIA PEPP', 260), ('JOCASTA', 217), ('FUJIKAWA, RUMIKO', 73), ('MADAME MENACE/SUNSET', 65), ('FIN FANG FOOM/MIDGAR', 62), ('CARNIVORE/COUNT ANDR', 45), ('DECAY II/YOSHIRO HAC', 38), ('TEMPEST II/NICOLETTE', 36), ('INFERNO III/SAMANTHA', 36)]


## Question 4

Who is in the "power center" of the Marvel universe? 
Obtain the mutual connections amongst the top $N_0=20$ characters **by their degree** (computed above). 
* How many edges are there between these characters?
* Are these characters all connected, or does the Marvel network break into several pieces if we only look at the top characters? 
* Does the set of top characters change if we look at the *directed graph* instead?

In [39]:
# put your code here 
degree_tuple = sorted(hero_graph.degree(), key=lambda x: x[1], reverse=True)
degree_tuple[:20]
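
A possible next step, sketched on a small made-up graph: take the top nodes from the sorted degree list, build the induced subgraph with `Graph.subgraph`, then count its edges and connected components. The helper `top_n_subgraph` and the graph `toy` are hypothetical names used only for illustration.

```python
import networkx as nx

def top_n_subgraph(graph, n0):
    """Return the subgraph induced by the n0 highest-degree nodes."""
    ranked = sorted(graph.degree(), key=lambda x: x[1], reverse=True)
    top_nodes = [node for node, _ in ranked[:n0]]
    return graph.subgraph(top_nodes)

# Made-up graph: A, B and C have degree 3, the other nodes have lower degree.
toy = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
                ("B", "D"), ("C", "E"), ("E", "F")])

core = top_n_subgraph(toy, 3)
print("edges among top nodes:", core.number_of_edges())
print("connected:", nx.is_connected(core))
print("number of pieces:", nx.number_connected_components(core))
```

On the real data this would be `top_n_subgraph(hero_graph, 20)`. The same helper accepts `hero_di_graph`, where `degree()` is the sum of in- and out-degree; for a directed graph `nx.is_connected` is not defined, so `nx.is_weakly_connected` would be the call to use.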


## Question 5 Transitivity and clustering


* Compute the transitivity of the whole Marvel hero graph. 
* Compute the clustering coefficient of each node, examine the nodes with the top clustering coefficients, and also the clustering coefficients of the top nodes (by degree). 
* Do the above for the top-K subgraph. 
  * Compared to the whole graph, are the transitivity and clustering coefficients larger or smaller? Why? 
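
The two quantities can be illustrated on a small made-up graph before running them on the hero network. This is only a sketch: `toy` is a hypothetical triangle with one extra pendant node, small enough to check by hand.

```python
import networkx as nx

# Made-up graph: triangle A-B-C plus a pendant node D attached to A.
toy = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")])

# Transitivity: 3 * (number of triangles) / (number of connected triples).
# Here there is 1 triangle and 5 connected triples, so 3 / 5.
t = nx.transitivity(toy)

# Local clustering coefficient for every node, returned as a dict {node: value}.
cc = nx.clustering(toy)

# Nodes ranked by clustering coefficient, highest first.
top_cc = sorted(cc.items(), key=lambda x: x[1], reverse=True)

print("transitivity:", t)
print("clustering:", top_cc)
```

For the notebook's graphs the same calls would be `nx.transitivity(hero_graph)` and `nx.clustering(hero_graph)`, and likewise on the top-K subgraph. Note that low-degree nodes can reach a coefficient of 1.0 very easily, which is worth keeping in mind when reading the top of the ranking.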

In [None]:
# put your code here 

## Question 6 Weighted network

One possible interpretation for the duplicated edges is that the data is collected over different comic issues / films, and characters appearing together across different publications will have their edges represented multiple times. 

* Now generate a weighted network using `raw_data` read from csv above. 
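
One way this could be done, under two assumptions that are mine rather than the dataset's: direction is ignored (A,B and B,A count as the same pair), and the weight of an edge is the number of rows in which the pair appears. The list `toy_pairs` is made up; on the real data it would be replaced by `raw_data`.

```python
from collections import Counter
import networkx as nx

# Made-up rows standing in for raw_data.
toy_pairs = [("A", "B"), ("B", "A"), ("A", "B"), ("A", "C")]

# Count each unordered pair; sorting the two names makes (A, B) and (B, A) the same key.
pair_counts = Counter(tuple(sorted(pair)) for pair in toy_pairs)

# Store the count as the 'weight' attribute of the edge.
weighted = nx.Graph()
for (u, v), count in pair_counts.items():
    weighted.add_edge(u, v, weight=count)

print(list(weighted.edges(data="weight")))
```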


In [None]:
# put your code here 

* What is the maximum weight of an edge in the entire Marvel network, between which two heroes? 
* Does the top K=50 network get more interesting using weighted edges? What is the degree distribution like for the topK network?
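
A sketch of how these could be answered once a weighted graph exists. It assumes the edge attribute is called `weight` and uses a small made-up graph: the heaviest edge is found with `max` over `edges(data="weight")`, and the degree distribution is a count of how many nodes have each degree.

```python
from collections import Counter
import networkx as nx

# Made-up weighted graph.
weighted = nx.Graph()
weighted.add_weighted_edges_from([("A", "B", 3), ("A", "C", 1),
                                  ("B", "C", 2), ("C", "D", 1)])

# Heaviest edge: each item is (u, v, weight), so compare on the third entry.
u, v, w = max(weighted.edges(data="weight"), key=lambda e: e[2])
print("heaviest edge:", u, v, w)

# Weighted degree: sum of the weights of the edges at each node.
strength = dict(weighted.degree(weight="weight"))
print("weighted degree:", strength)

# Degree distribution: how many nodes have each (unweighted) degree.
degree_hist = Counter(dict(weighted.degree()).values())
print("degree distribution:", sorted(degree_hist.items()))
```

For the top-K network, one option is to rank nodes by weighted degree instead of plain degree and then apply the same counts to the induced subgraph.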

In [None]:
# put your code here

## Question 7 Your own network question

What computational question do you want to ask about the Marvel network? 
Try asking one that can be answered by running some algorithm over it. 