<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NetworkX dataset examples

---

Apart from functions for generating particular types of graphs, NetworkX ships with several benchmark social network datasets. Because they contain relatively few nodes and links, they are ideal for practising the concepts learned in the lesson.

### 1. Zachary's Karate Club graph

[Zachary's Karate Club](https://www.jstor.org/stable/3629752?seq=1#page_scan_tab_contents) graph is the result of a social science field study that recorded which of the 34 members of a university karate club interacted outside the club. It is a standard benchmark for detecting communities in graphs: during the study, a conflict between the administrator and the instructor of the club led to a split, with some members forming a new club.

If you are the first to mention this dataset at a network conference, you can win a trophy and become a member of the *Zachary Karate Club Club*:

<img src="./assets/images/Karate_Club_Trophy.jpg" style="width:500px;height:400px;">
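
The split described above is stored in the dataset itself. The following is a minimal sketch, assuming the version of the graph shipped with NetworkX, where each node carries a `club` attribute naming the faction the member ended up in. The variable names are illustrative only.

```python
from collections import Counter

import networkx as nx

G = nx.karate_club_graph()

# Basic size of the network: members (nodes) and observed interactions (edges).
n_members = G.number_of_nodes()
n_links = G.number_of_edges()

# Each node has a "club" attribute recording which faction the member joined.
club_of = nx.get_node_attributes(G, "club")
faction_sizes = Counter(club_of.values())

print(f"{n_members} members, {n_links} links")
print(faction_sizes)
```

These faction labels can serve as ground truth when checking the output of a community detection algorithm.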


### 2. The Florentine families graph

This dataset records how the most important Florentine families of the early 15th century were connected by marriage (Ronald L. Breiger and Philippa E. Pattison, "Cumulated social roles: The duality of persons and their algebras", Social Networks, Volume 8, Issue 3, September 1986, Pages 215-256). The period covered is around the time the Medici family was expelled from Florence, before it returned to power, and the families represented here are known to have shown opposition to the Medici family.

### 3. The Davis Southern Women graph

This dataset was collected by Davis and Gardner in the 1930s. It records the observed attendance of 18 Southern women at 14 social events, presumably comparable to modern meetups (A. Davis, B. B. Gardner, M. R. Gardner, 1941. Deep South. University of Chicago Press, Chicago, IL).
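
Because this graph links women to events rather than women to each other, it has two kinds of nodes. The sketch below assumes the NetworkX version of the dataset, which marks the two node sets with a `bipartite` node attribute (0 for women, 1 for events), and separates them.

```python
import networkx as nx

G = nx.davis_southern_women_graph()

# Split the nodes by the "bipartite" attribute: 0 marks women, 1 marks events.
women = [n for n, d in G.nodes(data=True) if d["bipartite"] == 0]
events = [n for n, d in G.nodes(data=True) if d["bipartite"] == 1]

# In a bipartite graph every edge joins a node from one set to a node from the other.
crosses_sets = all((u in women) != (v in women) for u, v in G.edges())

print(len(women), "women,", len(events), "events")
print("every edge links a woman to an event:", crosses_sets)
```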

## For each of the graphs, find 

* the graph info
* the list of nodes
* the list of edges
* the degree distribution (plot a histogram)
* the clustering coefficients (for each node and the average, plot a histogram)
* the shortest paths for the nodes and their average length
* the vertex and edge betweenness rankings
* plot the graph
* use the Girvan-Newman algorithm for community detection
* use hierarchical clustering for community detection
    - do you see any differences in the community structure resulting from the two algorithms?
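
As a starting point, the tasks above can be sketched for the Karate Club graph as follows. This is one possible approach, not a reference solution: the choice of shortest-path length as the distance for hierarchical clustering and of average linkage are assumptions made here, and all variable names are illustrative. Plotting is left out; the degree counts returned by `nx.degree_histogram` can be passed to `plt.bar`, and `nx.draw` draws the graph.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import girvan_newman
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

G = nx.karate_club_graph()
nodes = sorted(G.nodes())

# Graph info, nodes and edges.
print(G)
print(list(G.nodes())[:5], list(G.edges())[:5])

# Degree distribution: entry k counts the nodes with degree k.
degree_counts = nx.degree_histogram(G)

# Clustering coefficient per node and on average.
clustering = nx.clustering(G)
avg_clustering = nx.average_clustering(G)

# Shortest path lengths (hop counts) between all pairs, and their average.
path_len = dict(nx.all_pairs_shortest_path_length(G))
avg_path_length = nx.average_shortest_path_length(G)

# Betweenness rankings, highest first.
node_btw = nx.betweenness_centrality(G)
edge_btw = nx.edge_betweenness_centrality(G)
top_nodes = sorted(node_btw, key=node_btw.get, reverse=True)[:5]
top_edges = sorted(edge_btw, key=edge_btw.get, reverse=True)[:5]

# Girvan-Newman: the first item of the generator is the first split.
gn_split = [sorted(c) for c in next(girvan_newman(G))]

# Hierarchical clustering on the matrix of shortest-path lengths (an assumed
# distance; cosine or Jaccard distances on adjacency rows are alternatives).
dist = np.array([[path_len[u][v] for v in nodes] for u in nodes], dtype=float)
Z = hierarchy.linkage(squareform(dist), method="average")
labels = hierarchy.fcluster(Z, t=2, criterion="maxclust")

print("average clustering:", round(avg_clustering, 3))
print("average path length:", round(avg_path_length, 3))
print("top betweenness nodes:", top_nodes)
print("Girvan-Newman community sizes:", [len(c) for c in gn_split])
print("hierarchical cluster sizes:", np.bincount(labels)[1:])
```

Comparing `gn_split` and `labels` with the `club` node attribute shows how closely each method recovers the actual split.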

**Hints:** 
- load the datasets with the commands below
- the last graph is bipartite - obtain the clustering coefficients and shortest paths for the projections on top and bottom nodes
- for the Karate Club graph, hierarchical clustering with two clusters should give you communities representing the actual split that occurred in the group
- for the bipartite graph, consider communities of the whole graph and the graphs obtained by projecting on the two groups
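
The hint about projections can be sketched as follows, assuming the `bipartite` node attribute that NetworkX sets on the Davis graph. Two women are linked in the women projection if they attended at least one common event, and two events are linked in the events projection if at least one woman attended both. The `summary` dictionary is a hypothetical helper for collecting the results.

```python
import networkx as nx
from networkx.algorithms import bipartite

G = nx.davis_southern_women_graph()

# Recover the two node sets from the "bipartite" attribute (0 = women, 1 = events).
women = {n for n, d in G.nodes(data=True) if d["bipartite"] == 0}
events = set(G) - women

# One-mode projections onto each node set.
G_women = bipartite.projected_graph(G, women)
G_events = bipartite.projected_graph(G, events)

summary = {}
for name, P in [("women", G_women), ("events", G_events)]:
    # The average path length is only defined for a connected graph.
    avg_path = nx.average_shortest_path_length(P) if nx.is_connected(P) else None
    summary[name] = {
        "nodes": P.number_of_nodes(),
        "edges": P.number_of_edges(),
        "avg_clustering": nx.average_clustering(P),
        "avg_path_length": avg_path,
    }

for name, stats in summary.items():
    print(name, stats)
```

The same community detection steps used for the other graphs can then be applied to `G`, `G_women` and `G_events` separately.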

In [1]:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
import networkx as nx
from networkx.algorithms import bipartite
from scipy.cluster import hierarchy
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, jaccard
plt.style.use('ggplot')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```

In [2]:
```python
G_karate = nx.karate_club_graph()
G_florentine = nx.florentine_families_graph()
G_davis = nx.davis_southern_women_graph()
```