<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/3_2_social_graph_communities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing Massive Graphs - Marvel Universe
In this exercise we will use a dataset built from the comics and movies of the Marvel Universe. The dataset can be found [here](https://www.kaggle.com/datasets/csanhueza/the-marvel-universe-social-network). It contains 40,616 characters and 65,870 relationships connecting them. We will use it to uncover its hidden characteristics, how the different characters influence each other, and their community structure.

## About the Dataset
The Marvel dataset is composed of a list of co-occurrences of superheroes. For example, every time Spider-Man appears in a comic book with Captain America, there is a line with both their names. Additionally, the characters have numerous properties that can be associated with each node.
The dataset is divided into three files:
- nodes.csv: Contains two columns (node, type), indicating the name and the type (comic, hero) of the nodes.
- edges.csv: Contains two columns (hero, comic), indicating in which comics the heroes appear.
- hero-network.csv: Contains the network of heroes that appear together in the comics. This file was originally taken from http://syntagmatic.github.io/exposedata/marvel/

First, let's download the dataset and look at some samples from the data:

In [None]:
!pip install opendatasets
!pip install pyvis

In [2]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/csanhueza/the-marvel-universe-social-network", force=True)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: mahnaznmz
Your Kaggle Key: ··········
Downloading the-marvel-universe-social-network.zip to ./the-marvel-universe-social-network


100%|██████████| 2.72M/2.72M [00:00<00:00, 158MB/s]

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


In [None]:
# Read data from /content/the-marvel-universe-social-network
# Read nodes.csv, edges.csv and hero-network.csv into pandas
nodes_df = <your-code-here>
edges_df = <your-code-here>
hero_net_df = <your-code-here>

print("Sample Nodes:------------------------")
display(nodes_df.head())

print("Sample Edges:------------------------")
display(edges_df.head())

print("Sample Hero Network:-----------------")
display(hero_net_df.head())

Now let's use this data to construct an undirected graph and look at some of its basic statistics:
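As a minimal sketch of this step, the snippet below builds an undirected graph from a tiny, made-up co-occurrence table. The rows and the name `toy_df` are illustrative assumptions; only the column names `hero1` and `hero2` mirror the real file.

```python
import networkx as nx
import pandas as pd

# Toy co-occurrence table (made-up rows, same column names as hero-network.csv)
toy_df = pd.DataFrame({
    "hero1": ["A", "A", "B", "A"],
    "hero2": ["B", "C", "C", "B"],  # the A-B pair appears twice on purpose
})

# nx.Graph() is undirected and keeps at most one edge per pair of nodes,
# so the repeated A-B row does not create a second edge
toy_G = nx.from_pandas_edgelist(
    toy_df, source="hero1", target="hero2", create_using=nx.Graph()
)

print(toy_G.number_of_nodes(), toy_G.number_of_edges())  # 3 3
```

Because repeated rows collapse into a single edge, the number of edges in the graph can be smaller than the number of rows in the DataFrame.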

In [6]:
import networkx as nx

# Construct an undirected graph from hero_net_df
# Hint: use nx.from_pandas_edgelist and pass hero_net_df as the first arg
# for the source parameter use 'hero1' and for target use 'hero2'
# set create_using=nx.Graph() in the inputs
G = <your-code-here>


Now let's look at some basic graph statistics together.

In [None]:
# Graph statistics

# Print number of edges
# Hint: use either G.size() or nx.number_of_edges(G) (feel free to try both)
print("Number of edges:", <your-code-here>)

# Print number of nodes
# Hint: use either G.order() or nx.number_of_nodes(G) (feel free to try both)
print("Number of nodes:", <your-code-here>)

# The density for undirected graphs is d = 2m / (n(n-1)), where n is
# the number of nodes and m is the number of edges
# Print the density of the graph
# Hint: use nx.density(G)
print("Network Density:", <your-code-here>)

As we can see, our network is sparse rather than dense.
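To make the density formula concrete, here is a small worked check on a toy path graph (not the Marvel data). It computes d = 2m / (n(n-1)) by hand and compares it with `nx.density`.

```python
import networkx as nx

# Toy graph: a path 0 - 1 - 2 - 3
toy_G = nx.path_graph(4)

n = toy_G.number_of_nodes()  # 4 nodes
m = toy_G.number_of_edges()  # 3 edges

# Density of an undirected graph: edges present / edges possible
manual_density = 2 * m / (n * (n - 1))  # 2*3 / (4*3) = 0.5

print(manual_density, nx.density(toy_G))
```

A density of 1 would mean every pair of nodes is connected; values close to 0 indicate a sparse graph.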

Let's use the graph structure to answer several questions.


### Q1. Who is the most friendly superhero?
We can translate this question to `who is the superhero with the highest number of friends?`. Answering it with a DataFrame alone would not be trivial, because for each hero we would need to count the distinct friends both when the hero appears in the `hero1` column and when it appears in the `hero2` column. With a graph object it is relatively easy: we simply need to find the node with the maximal degree.
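The idea can be sketched on a toy graph with made-up node names (an assumption for illustration, not the exercise solution on the real data): the node with the most neighbours is the one with the highest degree.

```python
import networkx as nx

# Toy friendship graph with made-up names
toy_G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

# G.degree() yields (node, degree) pairs; dict() turns them into a mapping
degrees = dict(toy_G.degree())

# Sort by degree, highest first
top = sorted(degrees.items(), key=lambda x: x[1], reverse=True)
most_connected, friends = top[0]

print(most_connected, friends)  # A 3
```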

In [None]:
# Get each node's degree
# Hint: use G.degree(); it returns a view of (node, degree) pairs
node_degrees = <your-code-here>

# Sort the nodes by degree, highest first, and take the top 5
# Hint: use the sorted function and pass node_degrees,
# use lambda x: x[1] for key and reverse=True
# use [:5]
display(<your-code-here>)

So, using the degree, we discovered that the `most friendly` superhero is `Captain America`, who is connected to 1908 heroes.

Let's use seaborn to plot the graph's degree distribution:
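Before plotting, it may help to see what a degree distribution is on a tiny example. The sketch below uses a toy star graph (an assumption for illustration) and counts how many nodes have each degree.

```python
from collections import Counter

import networkx as nx

# Toy star graph: one centre node connected to 4 leaves
toy_G = nx.star_graph(4)

# Count how many nodes have each degree value
degree_counts = Counter(d for _, d in toy_G.degree())
print(degree_counts)

# networkx offers the same information as a list indexed by degree
histogram = nx.degree_histogram(toy_G)
print(histogram)  # [0, 4, 0, 0, 1]
```

Here four nodes have degree 1 and a single hub has degree 4. A histogram of the real graph shows the same kind of information at a much larger scale.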

In [None]:
fig, ax = plt.subplots(figsize=(15,5))

# Plot the degree distribution
# Hint: use sns.histplot and pass [v for v in dict(node_degrees).values()] as input
# use kde=True (compute a kernel density estimate to smooth the distribution and show a line)
# use stat="density" (normalize such that the total area of the histogram equals 1)
# don't forget to pass ax=ax
<your-code-here>

ax.set_title('Degree Distribution')
ax.set_xlabel('Degree')


From the above plot, we can see that many nodes have low degrees, i.e. these heroes are connected to only a few other heroes. Let's create a subgraph without these nodes:
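The filtering step can be sketched on a toy graph first. The node names and the threshold of 1 are assumptions chosen to keep the example small; the exercise itself uses a threshold of 50.

```python
import networkx as nx

# Toy graph with made-up names; E hangs off D with a single connection
toy_G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

threshold = 1  # illustrative value only

# Keep the nodes whose degree in the full graph exceeds the threshold
keep = [n for n, d in toy_G.degree() if d > threshold]

# subgraph() keeps only the edges whose two endpoints were both kept
toy_sub = toy_G.subgraph(keep)

print(sorted(toy_sub.nodes()), toy_sub.number_of_edges())
```

Note that degrees are measured in the original graph. After filtering, a kept node can have a lower degree in the subgraph: here D drops from 2 to 1 because its neighbour E was removed.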

In [None]:
# Let's create a list with the nodes that have degree > 50
# Hint: use [n for n, d in dict(node_degrees).items() if d > 50]
selected_nodes_list = <your-code-here>

# Create a subgraph with only the nodes from the above list
# Hint: use G.subgraph and pass selected_nodes_list
sub_G = <your-code-here>

print(sub_G)

As we can see, this reduces the size of the graph.

### Q2: What is the largest community in the graph?
One of the useful things about graphs as data structures is the ability to separate them into communities, i.e., disjoint groups of nodes that are more densely connected to each other than to the rest of the graph. Let's use [Clauset-Newman-Moore greedy modularity maximization](https://arxiv.org/abs/cond-mat/0408187) to separate the graph into communities.
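To get a feel for what the algorithm returns, here is a minimal sketch on a toy barbell graph: two fully connected groups of 5 nodes joined by a single edge. The toy graph is an assumption for illustration; the real graph is far larger and takes much longer to process.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two 5-node cliques connected by one bridging edge
toy_G = nx.barbell_graph(5, 0)

# The result is a list of node sets, one set per community
communities = greedy_modularity_communities(toy_G)

sizes = sorted(len(c) for c in communities)
covered = set().union(*communities)

print(len(communities), sizes)
```

Every node ends up in exactly one community, so the community sizes add up to the number of nodes in the graph.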

In [None]:
%%time
from networkx.algorithms.community import greedy_modularity_communities

# Apply the Community Detection Algorithm [be patient this can take a while]
# Hint: use greedy_modularity_communities and pass sub_G
cc = greedy_modularity_communities(sub_G)

# print the number of communities
# Hint: use len(cc)
print(<your-code-here>)

Using the community detection algorithm, we detected 6 communities of different sizes. Let's view the distribution of the community sizes:

In [None]:
# Let's get the number of nodes inside each community as their size
# Hint: use [len(c) for c in cc]
community_size_list = [len(c) for c in cc]

# Let's use a histogram to show the distribution of community sizes
fig, ax = plt.subplots()

# Hint: use ax.hist and pass community_size_list
<your-code-here>

ax.set_title("Community Sizes Distribution")
ax.set_xlabel("Community Size")
ax.set_ylabel("#Communities")

We can see that 2 communities are considerably smaller than the others. Let's take a look at these two communities:

In [None]:
# Let's select smaller communities to investigate
# Hint: use [c for c in cc if len(c)<50]
selected_community_list = <your-code-here>

len(selected_community_list)

Let's plot both communities with a helper function:

In [47]:
from IPython.display import display, HTML # We need this because of colab
from pyvis.network import Network

# A helper function to plot sub-graphs
def plot_subgraph(g, nodes_list, title):
  # select sub-graph based on the given nodes_list
  # Hint: use g.subgraph and pass nodes_list
  sub_g = <your-code-here>

  # Create vis network
  net = Network(height="700px", width="75%", notebook=True, cdn_resources='in_line')

  # Now load the networkx graph into the vis network
  net.force_atlas_2based()
  net.show_buttons(filter_=['physics'])
  net.from_nx(sub_g, default_node_size=20)

  # Let's visualize the network using pyvis
  net.show(f"{title}.html")
  display(HTML(f"{title}.html"))

In [None]:
# Let's plot the first subgraph
# Hint: use the helper function plot_subgraph and pass sub_G,
# pass selected_community_list[0] to show only the first community,
# and set a title, e.g. title='Community-1'
<your-code-here>