<h1 align="center">TD 1: Introduction to graph theory with NetworkX</h1>

<h3 align="center">Vincent Gauthier</h3>
<h4 align="center">vincent.gauthier@telecom-sudparis.eu</h4>

<img src="./Images/webmap.jpg"></img>

# Getting Started

## Installing Python (Anaconda)
We recommend using the Anaconda Python distribution, as it bundles together many useful packages (including numpy, matplotlib, and IPython) that we will use throughout this class. To install Anaconda, please see [here](http://docs.continuum.io/anaconda/install). Make sure you get the version compatible with your OS (Mac, Linux, or Windows). If you'd like a more comprehensive document, please check out CS109's guide [here](https://github.com/cs109/content/wiki/Installing-Python).

## Python Shell
The Python Interactive Shell is a light-weight way to explore Python. It's commonly used for debugging, quick tasks, and as a calculator! To get to the shell, open Terminal and type `python`. More instructions are located [here](https://docs.python.org/2/tutorial/interpreter.html). The following sections will be done in the shell to allow you to get a feel for Python.

# Python Basics
## Lists

Lists are basic Python structures that function as arrays. They can store all types of data. You can index a specific element using `[]` brackets, which comes in handy very frequently.

In [1]:
# Notebook styling
from IPython.core.display import HTML

def css_styling():
    styles = open("./Styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

In [None]:
# Variable assignment
a = 3
b = 2
print(a + b)  # 5

c = []
d = [1,2]

In [None]:
# List basics
print('List basics:')
list_a = []
list_b = [1,2,3,4,5,6]
print('list b:', list_b)
second_elt_in_list_b = list_b[1]  # remember, we index from 0
print('second element in list b:', second_elt_in_list_b)
tmp = [1,3,4]
sub_list_list_b = [list_b[i] for i in tmp]
print('sub list:', sub_list_list_b)
list_b.append(7)
print('new list b:', list_b)
last_elt_b = list_b[-1]
last_elt_b_2 = list_b[len(list_b)-1]
print('last element two ways:', last_elt_b, last_elt_b_2)
print('\n')  # newline character

# List of lists
print('List of lists:')
list_of_lists_c = [list_a,list_b]
print('list of lists c:', list_of_lists_c)
list_of_lists_c.append([3,4,5])
print('new list of lists c:', list_of_lists_c)
third_list_in_c = list_of_lists_c[2]
print('third list in c:', third_list_in_c)
second_elt_of_third_list_in_c = list_of_lists_c[2][1]
print('second element of third list in c:', second_elt_of_third_list_in_c)
print('\n')

## Conditional Statements

Conditional statements and loops are useful control structures that allow you to structure the way your program will execute. Note that Python cares about whitespace; proper indentation after conditional statements and function definitions (in the next section) is key to avoiding syntax errors, even in the Interactive Shell.

In [None]:
# Conditional statements
if 1 > 0:
    print('1 > 0')
    
if 1 < 0:
    print('1 < 0')
else:
    print('1 !< 0')
    
if 1 < 0:
    print('1 < 0')
elif 1 < 2:
    print('1 < 2')
else:
    print('none of the above')

## Loops

In [None]:
# For loops
for i in range(0,10):
    print(i)
    
# looping through lists
for x in list_b:
    print(x)

# While loops
j = 0
while j < 10:
    print(j)
    j += 1

### List comprehensions

In [None]:
# List comprehensions
list_b = [1,3,4,2,5,6,4,5,7,5,6,4,5,3,1,2]

# Normal for loop
tmp1 = []
for x in list_b:
    if x > 3:
        tmp1.append(x)
print('Normal for loop:', tmp1)

# List comprehension
tmp2 = [x for x in list_b if x > 3]
print('List comprehension:', tmp2)

In [None]:
# Given a list of lists, do the following:
LL = [[1,2],[2,3],[3,4,5],[4,5,6,7],[5,6],[6,7,8,8,9]]

# a. Create a new list first_LL of the first elements of each sublist
first_LL = []
for L in LL:
    first_LL.append(L[0])
    
print(first_LL)

# b. Create a new list last_LL of the last elements of each sublist
last_LL = []
for L in LL:
    last_LL.append(L[-1])  # or L[len(L)-1]
    
print(last_LL)

# c. Print all values in all sublists in the range [4,8]
for L in LL:
    for x in L:
        if x >= 4 and x <= 8:
            print(x)

# d. Print all sublists of length 3 or greater
for L in LL:
    if len(L) >= 3:
        print(L)

In [None]:
# Checking membership

# Lists
list_d = [1,2,3,4,5]

if 1 in list_d:
    print('1 is in list d')

if 6 not in list_d:
    print('6 is not in list d')

## Dictionaries

In Python, dictionaries are essentially hash tables. They store (key,value) pairs and can contain data of all types.

In [None]:
# Dictionary example
my_dict = {}  # declare empty dictionary
my_dict['key1'] = 'value1'  # insert ('key1','value1') pair in dictionary
print('my dict:', my_dict)  # prints entire dictionary
print('first entry:', my_dict['key1'])  # index into dictionaries with keys

# Second dictionary example
dict1 = {}
for x in range(1,21):
    # each dictionary entry is a list of all whole numbers less than that value
    dict1[x] = list(range(0,x))
print('list dict:', dict1)
print('last entry:', dict1[20])


In [None]:
# Dictionaries
# dict.keys() and dict.values() return views of the keys and values in the
# dictionary (in Python 2 they return lists)
dict_a = {'a': 3, 'b': 2, 'c': 1}

if 'a' in dict_a.keys():
    print('a is a key in dict a')
if 'd' not in dict_a.keys():
    print('d is not a key in dict a')
    
if 3 in dict_a.values():
    print('3 is a value in dict a')
if 4 not in dict_a.values():
    print('4 is not a value in dict a')

## Functions

Functions are useful ways of encapsulating blocks of code to perform tasks. Each function is declared with the `def` keyword and takes a list of arguments. Additionally, most functions return something via the `return` keyword, which exits the function.

In [None]:
# Functions are declared with the 'def' keyword followed by the 
# function name and a list of arguments to the function.
def my_function(arg1,arg2):
    tmp = arg1 + arg2
    return tmp

return_value = my_function(1,3)
print('sum:', return_value)

def sum_list_naive(L):
    total = 0  # not named 'sum', which would shadow the built-in function
    for x in L:
        total += x
    return total

print('sum list:', sum_list_naive([1,4,5,3]))

### Exercises 1

Write a function that goes through the following list and finds the smallest element. Do not use any built-in functions such as min(L).

In [3]:
# Find the smallest element of this list
L = [3,1,2,4,5,3,1,4,2,5,6,3,6,7,5,4,7,1,2,4,3,5,6,7,5]

# Pseudocode:
# Initialize current smallest element to the first element in list
# Loop through list, updating current smallest element if necessary

def find_smallest(input_list):
    # your code
    min_elem = input_list[0]
    for elem in input_list:
        if elem < min_elem:
            min_elem = elem
    return min_elem

L_smallest = find_smallest(L)
print(L_smallest)

1


# Representing Graphs

The most common data structures for representing graphs are adjacency matrices and adjacency lists. In general, adjacency matrices are better suited to dense graphs: they take up $O(n^2)$ space regardless of the number of edges, and they support very quick edge lookups. However, iterating over all edges is slow. Adjacency lists are better for sparse graphs and allow quick iteration over all edges in a graph, but checking whether a specific edge exists takes longer. For these examples, we will focus on using adjacency lists.

The most basic way of representing a graph is through a list of lists. In this way, we can say that the $n^{th}$ list represents all of the $n^{th}$ vertex's connections. We will go through an example on the board.

In [None]:
# List of lists representation of a graph
# Graph G: Vertices = {0,1,2,3}, Edges = {(0,1),(1,2),(2,3),(3,0)}

G_LL = [[1,3],[0,2],[1,3],[0,2]]

# Vertex 0's neighbors
print(G_LL[0])

# Vertex 1's neighbors
print(G_LL[1])

Another way of representing a graph is through a dictionary. See the Dictionaries section above for a quick introduction to dictionaries. The keys in this case will be the vertex names, and the corresponding values will be a list of all adjacent vertices. Note that using a dictionary is more convenient for graphs that have strings as labels for each vertex, but using a dictionary is more expensive in terms of memory and time. For the example above, we have the following.

In [None]:
# Dictionary representation of a graph
# Graph G: Vertices = {0,1,2,3}, Edges = {(0,1),(1,2),(2,3),(3,0)}

G_dict = {0:[1,3], 1:[0,2], 2:[1,3], 3:[0,2]}

# Vertex 0's neighbors
print(G_dict[0])

# Vertex 1's neighbors
print(G_dict[1])
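
To make the trade-off between the two representations concrete, here is a minimal sketch that builds both an adjacency list and an adjacency matrix from the edge set of the graph above. The helper names `build_adj_list` and `build_adj_matrix` are illustrative and not part of any library, and the graph is assumed to be undirected with vertices numbered from 0.

```python
# Edges of the 4-cycle used above (undirected)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4

def build_adj_list(n, edges):
    """Return a list of lists: entry u holds the sorted neighbors of u."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)   # undirected: store the edge in both directions
    return [sorted(neighbors) for neighbors in adj]

def build_adj_matrix(n, edges):
    """Return an n x n matrix (list of rows) with 1 where an edge exists."""
    mat = [[0] * n for _ in range(n)]
    for u, v in edges:
        mat[u][v] = 1
        mat[v][u] = 1
    return mat

adj_list = build_adj_list(n, edges)
adj_matrix = build_adj_matrix(n, edges)

# Edge lookup: one index operation in the matrix,
# a scan of a neighbor list in the adjacency list
has_edge_matrix = adj_matrix[0][2] == 1
has_edge_list = 2 in adj_list[0]

print(adj_list)
print(adj_matrix)
print(has_edge_matrix, has_edge_list)
```

The matrix always stores $n^2$ entries, while the adjacency list stores two entries per undirected edge, which is why the list form tends to be preferred for sparse graphs.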

# Plotting Data

In [None]:
%matplotlib nbagg

import matplotlib.pyplot as plt  # use matplotlib's pyplot package

plt.plot([1,3,4],[3,6,7])  # plt.plot draws a line
plt.show()

# Matrix Operations in Python

When dealing with matrices in Python, the general best practice is to use the NumPy package. This codelab will go over how to carry out operations on matrices useful for the next problem set (and in general).

In [None]:
import numpy as np
# numpy provides a 'matrix' class and a 'linalg' submodule that will be useful

mat1 = np.matrix( [[1.,2.,3.],[4.,5.,6.],[7.,8.,9.]] )
mat2 = np.matrix( [[0,1,0],[1,0,0],[0,1,1]] )

print('matrices: \n', mat1, '\n', mat2)

# indexing into matrices
print('first row:\n', mat1[0,:])
print('first column:\n', mat1[:,0])
print('first two rows:\n', mat1[0:2,:])
print('first two cols:\n', mat1[:,0:2])

# matrix attributes
print('shape: ', mat1.shape)
print('size: ', mat1.size)

Here are some matrix operations that you should know: (useful link here: http://sebastianraschka.com/Articles/2014_matlab_vs_numpy.html)

In [None]:
# matrix operations
print('plus: \n', mat1 + mat2)
print('minus: \n', mat1 - mat2)

# element-wise multiplication
print('element-wise multiplication: \n', np.multiply(mat1,mat2))

# matrix multiplication
print('matrix multiplication: \n', np.dot(mat1,mat2))
# equivalently, for np.matrix objects: print(mat1*mat2)

# element-wise division
print('element-wise division: \n', np.divide(mat2,mat1))
# equivalently: print(mat2/mat1)

# raising matrices to powers (element-wise)
print('element-wise power: \n', np.power(mat1,2))

# raising matrices to powers
print('matrix power: \n', np.linalg.matrix_power(mat1,2))

# Breadth First Search

Breadth first search (BFS) is a common and useful graph traversal algorithm. A graph traversal algorithm describes a process for visiting the nodes in a tree or graph, where moving from one node to another is only allowed if there is an edge connecting them. A motivating property of BFS is that it can determine connectivity and shortest distances between nodes in a graph. For this class, we will apply BFS to compute centrality measures and other graph properties. In this lab we will study how BFS runs, how to implement it, and how it can be used.

### Queue
Before we can begin to study BFS, we need to understand the queue data structure. Similar to the non-computer science understanding of the word, a queue is essentially a line. Data, in our case nodes or vertices, enter the queue and are stored there. The operation of adding data to a queue is called enqueue() or push(). The order in which they enter the queue is preserved. When we need to retrieve data, we take the earliest element from the queue and it is removed from the queue. This operation is called dequeue() or pop(). This is referred to as <i>first-in-first-out</i> (FIFO).

In Python, we can implement a queue naively as a list. However, lists were not built for this purpose: removing an element from the front of a list takes time proportional to the length of the list. Alternatives include implementations provided by libraries (which we will be using) or defining your own class and operations. For this lab we will use the `deque` (double-ended queue) data structure from the 'collections' library (https://docs.python.org/2/library/collections.html#collections.deque).

In [5]:
from collections import deque

In [7]:
# initializes the queue with [S, W, M]
queue = deque(["Scarlet", "White", "Mustard"])
queue.append("Plum")
# Now the queue is [S, W, M, P]
queue.popleft()
# Now the queue is [W, M, P]
queue.append("Greene")
# etc
queue.popleft()
queue.popleft()
murderer = queue.popleft()
print(murderer)

# who is Boddy's murderer?
# ANSWER: Plum

Plum
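
As a rough illustration of the "define your own class" option mentioned above, the sketch below wraps `deque` in a small class whose method names follow the enqueue/dequeue vocabulary. The class name `Queue` and its methods are hypothetical and chosen only for this example.

```python
from collections import deque

class Queue:
    """Minimal FIFO queue backed by collections.deque (illustrative only)."""

    def __init__(self, items=()):
        self._items = deque(items)

    def enqueue(self, item):
        self._items.append(item)       # add at the back

    def dequeue(self):
        return self._items.popleft()   # remove from the front

    def __len__(self):
        return len(self._items)

q = Queue(["Scarlet", "White"])
q.enqueue("Mustard")
first = q.dequeue()    # the earliest element leaves first
second = q.dequeue()
print(first, second, len(q))
```

The rest of this lab uses `deque` directly, since `append()` and `popleft()` already provide the two queue operations.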


### Algorithm

Recall that a graph traversal is a route of nodes to follow or traverse, where we can only move between two nodes if there is an edge between them. Breadth first search specifies a process for doing so. In the simplest implementation, BFS visits a given node, adds its neighbors to the queue, then visits the element returned by the queue. To begin the BFS, we start at a given node, then visit all of that node's neighbors, then visit all of one neighbor's neighbors, then another neighbor's neighbors, etc. Whenever we visit a node for the first time, we mark it as visited; otherwise we do nothing when re-visiting a node. The algorithm terminates once we have visited everything reachable from the given node, and we have a list of nodes reachable from that node.

<br />
<div align="center">
<img width=400 src="./Images/BFS.png"></img>
**Algorithm 1: BFS**
</div>


In [8]:
# graph G represented as an adjacency list dictionary, a node v to start at
def simpleBFS(G, v):
    reachable = [-1] * len(G.keys())   # -1 indicates unvisited
    reachable[v-1] = 1                 # note: vertices are assumed to be the integers 1..n, so vertex k is stored at index k-1
    
    queue = deque()
    queue.append(v)
    while len(queue) != 0:           # while we still have things to visit
        current = queue.popleft()
        for node in G[current]:
            if reachable[node-1] == -1:
                reachable[node-1] = 1
                queue.append(node)
    
    return reachable

In [28]:
# G as an adjacency list
# (vertex 6 has no neighbors, which matches the outputs shown below)
G = {1:[2,3], 2:[1,5], 3:[4,1,5], 4:[3], 5:[2,3], 6:[]}

In [29]:
simpleBFS(G, 1)

[1, 1, 1, 1, 1, -1]

<div align="center"> <img width=500 src="./Images/SimpleGraph.png"></img> </div>


Call the graph above $G$ (in the code, nodes $a$ through $f$ are represented by the integers 1 through 6). If we call simpleBFS($G$, $a$), the steps the BFS will take are as follows (assume nodes are added to the queue in alphabetical order):
1. start at $a$, push nodes $b$ and $c$ onto the queue and mark them as visited
2. pop $b$ off the queue, visit it, and add $e$ to the queue
3. pop $c$ off the queue, visit it, and add $d$ to the queue
4. pop $e$ off the queue; since $c$ and $b$ have already been visited, no nodes are added to the queue
5. pop $d$ off the queue; no nodes are added to the queue
6. the queue is now empty, so simpleBFS terminates and returns a list that we can iterate through to see that only $f$ is not reachable from $a$
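
The walk-through above can be reproduced with a small variant of BFS that records the order in which nodes leave the queue. This is only a sketch: the letter-labelled dictionary `G_letters` is an assumed encoding of the figure, inferred from the steps listed above, and `bfs_order` is a hypothetical helper name.

```python
from collections import deque

# Assumed structure of the example graph, inferred from the steps above;
# neighbor lists are in alphabetical order and f has no edges.
G_letters = {
    'a': ['b', 'c'],
    'b': ['a', 'e'],
    'c': ['a', 'd', 'e'],
    'd': ['c'],
    'e': ['b', 'c'],
    'f': [],
}

def bfs_order(G, start):
    """Return the nodes reachable from start, in the order BFS pops them."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        order.append(current)
        for node in G[current]:
            if node not in visited:
                visited.add(node)     # mark when enqueued, as in simpleBFS
                queue.append(node)
    return order

order_from_a = bfs_order(G_letters, 'a')
print(order_from_a)
```

Using a set for `visited` instead of a list indexed by vertex number lets the same function work with string labels.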

## Exercises: 
**Question**: What is the order of traversal for simpleBFS($G$, $f$)? simpleBFS($G$,$b$)?

**Result**:

### Modifications
Armed with an understanding of how BFS operates, we can make improvements on the algorithm so that we can extract more information about the graph in the running of BFS. Consider the following:

In [30]:
# graph G represented as an adjacency list dictionary, a node v to start at
def distBFS(G, v):
    dist = [-1] * len(G.keys())
    dist[v-1] = 0                 # the distance from v to itself is 0
    
    queue = deque()
    queue.append(v)
    while len(queue) != 0:           # while we still have things to visit
        current = queue.popleft()
        for node in G[current]:
            if dist[node-1] == -1:
                dist[node-1] = dist[current-1] + 1
                queue.append(node)
    
    return dist

In [31]:
print(distBFS(G, 1))

[0, 1, 1, 2, 2, -1]


With this implementation of BFS, we can find out not only which vertices of the graph are reachable from $v$, but also how far away each vertex is from $v$, measured in the number of edges on the path. Moreover, the distance returned by distBFS() is provably the length of the shortest path from $v$ to each other vertex in $G$, provided there are no edge weights associated with the graph. However, even if we know the length of the shortest path, it would be nice to know precisely which nodes we need to traverse to get from $v$ to that vertex. We can still do that with BFS. Consider the following:

In [13]:
# graph G represented as an adjacency list dictionary, a node v to start at
def BFS(G, v):
    dist = [-1] * len(G.keys())       
    prev = [None] * len(G.keys())
    dist[v-1] = 0                     
    
    queue = deque()
    queue.append(v)
    while len(queue) != 0:
        current = queue.popleft()
        for node in G[current]:
            if dist[node-1] == -1:
                dist[node-1] = dist[current-1] + 1
                prev[node-1] = current
                queue.append(node)
    
    return dist, prev

In [14]:
print(BFS(G,1))

([0, 1, 1, 2, 2, -1], [None, 1, 1, 3, 2, None])


With this implementation of BFS, we introduce the prev array, which for a given node $w$ keeps track of the node from which we first reached $w$. This node is the <i>predecessor</i> of $w$. Given this information, if we want to find the path from $v$ to $w$, we first inspect the dist array to make sure $w$ is reachable from $v$. If it is reachable, we look up the predecessor of $w$ in the prev array (stored at index $w-1$, since vertices are numbered from 1). If that is not $v$, we look up the predecessor of that node, and so on until we reach $v$.
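
The backtracking procedure just described could be sketched as follows. The function name `reconstruct_path` is hypothetical; the `dist` and `prev` lists are copied from the output of `BFS(G,1)` printed above, so the block runs on its own.

```python
def reconstruct_path(dist, prev, v, w):
    """Return the list of vertices on a shortest path from v to w, or None.

    dist and prev are the arrays returned by BFS(G, v); vertices are the
    integers 1..n, so vertex k is stored at index k-1.
    """
    if dist[w-1] == -1:
        return None                       # w is not reachable from v
    path = [w]
    while path[-1] != v:
        path.append(prev[path[-1]-1])     # step back to the predecessor
    path.reverse()                        # we collected the path backwards
    return path

# Arrays printed above for BFS(G, 1)
dist = [0, 1, 1, 2, 2, -1]
prev = [None, 1, 1, 3, 2, None]

print(reconstruct_path(dist, prev, 1, 4))
print(reconstruct_path(dist, prev, 1, 5))
print(reconstruct_path(dist, prev, 1, 6))
```

The length of each returned path, minus one, should agree with the corresponding entry of `dist`.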

### Exercise

- Walk through BFS(G,B) on the graph above, keeping track of the order visited and the state of the arrays. (We will use the following format: current node, dist, prev):

    * B
    * E
    * C
    * D
    
**Result:**

## Compute Clustering

Recall that the clustering coefficient $C(v)$ of a vertex $v$ is the fraction of pairs of neighbors of $v$ that are themselves connected by an edge. In real-world social networks we usually see high clustering, and therefore we want our random graph models to also have this property. Ultimately we want to find the clustering coefficient of the whole graph, but we begin by calculating the clustering coefficient of a single vertex.
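
Written as a formula, and assuming that $k_v$ denotes the number of neighbors of $v$ and $e_v$ the number of edges among those neighbors (both symbols are introduced here only for illustration), the definition reads:

$$C(v) = \frac{e_v}{\binom{k_v}{2}} = \frac{2\,e_v}{k_v\,(k_v - 1)}, \qquad k_v \ge 2.$$

For example, a vertex with $k_v = 3$ neighbors, among which $e_v = 2$ edges exist, has $C(v) = 2/3$. When $k_v < 2$ there are no pairs of neighbors, and a common convention is to set $C(v) = 0$.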

In [44]:
def vertexCC(G,v):
    v_neighbors = G[v]
    edges = 0.0
    for i in range(len(v_neighbors)):
        for j in range(i+1, len(v_neighbors)):
            if v_neighbors[j] in G[v_neighbors[i]]:
                edges += 1
    if edges == 0.0:
        return 0.0
    else:
        # k*(k-1)/2 pairs of neighbors; note that '^' is bitwise XOR in Python, not a power
        return edges / (len(v_neighbors) * (len(v_neighbors) - 1) / 2)

In [51]:
'''
Use vertexCC() to calculate the clustering coefficient of the entire graph, 
which is the average clustering coefficient of its nodes.
'''
import numpy as np

def graphCC(G):
    CC = []
    for n in G.keys():
        CC.append(vertexCC(G,n))
    return np.array(CC).mean()       
    
graphCC(G)

0.0

In [52]:
G = {1:[2,3], 2:[1,3], 3:[1,2]}
graphCC(G)

1.0

# Networkx

## Overview
NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks.

With NetworkX you can load and store networks in standard and nonstandard data formats, generate many types of random and classic networks, analyze network structure, build network models, design new network algorithms, draw networks, and much more.

## Who uses NetworkX?
The potential audience for NetworkX includes mathematicians, physicists, biologists, computer scientists, and social scientists.

## Goals
NetworkX is intended to provide:

* tools for the study of the structure and dynamics of social, biological, and infrastructure networks,
* a standard programming interface and graph implementation that is suitable for many applications,
* a rapid development environment for collaborative, multidisciplinary projects,
* an interface to existing numerical algorithms and code written in C, C++, and FORTRAN,
* the ability to painlessly slurp in large nonstandard data sets.

For more information, go to the [NetworkX Documentation](http://networkx.github.io/documentation/latest/index.html).

In [60]:
import networkx as nx
%matplotlib nbagg

In [55]:
G = nx.Graph()

# Add each node together with a 'name' attribute
# (the G.node[...] accessor was removed in recent NetworkX releases)
G.add_node(1, name='Paul')
G.add_node(2, name='Pierre')
G.add_node(3, name='Sophie')
G.add_node(4, name='Caroline')

G.add_edge(1,2)
G.add_edge(2,3)
G.add_edge(3,4)

nx.draw_networkx(G)


## Load the graph of the Florentine Families

In [85]:
import csv
import networkx as nx

with open('Data/Florence/FlorentineFamiliesName.csv', 'r') as csvfile:
    nodeReader = csv.reader(csvfile, delimiter=' ')
    line = 0
    labels = {}
    for row in nodeReader:
        if line > 0:
            labels[int(row[0])] = row[1]
        line += 1


# Create the graph
G = nx.Graph()
with open('Data/Florence/FlorentineFamiliesNodes.csv', 'r') as csvfile:
    nodeReader = csv.reader(csvfile, delimiter=' ')
    line = 0
    for row in nodeReader:
        G.add_node(int(row[1]),label=labels[int(row[1])])
        G.add_node(int(row[2]),label=labels[int(row[2])])
        G.add_edge(int(row[1]),int(row[2]), weight=float(row[3]))
    

labels=dict((n,d['label']) for n,d in G.nodes(data=True))

## Plot the Graph of the Florentine Families with labels

[Plotting Graph with networkx](https://networkx.github.io/documentation/latest/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html#networkx.drawing.nx_pylab.draw_networkx)

In [86]:
%matplotlib nbagg

pos=nx.spring_layout(G)
nx.draw_networkx_nodes(G,pos,
                       node_color='r',
                       node_size=500,
                       alpha=0.8)

nx.draw_networkx_edges(G,pos,width=1.0,alpha=0.5)
nx.draw_networkx_labels(G,pos,labels,font_size=9)


## Compute the closeness centrality of the graph

In [89]:
# Closeness centrality of every node (edge weights are ignored unless
# a 'distance' attribute name is passed to closeness_centrality)
closeness = nx.closeness_centrality(G)
for n, c in closeness.items():
    print(labels[n], c)

ACCIAIUOL 0.4
ALBIZZI 0.5
BARBADORI 0.56
BISCHERI 0.4375
CASTELLAN 0.4827586206896552
GINORI 0.5
GUADAGNI 0.5
LAMBERTES 0.42424242424242425
MEDICI 0.6363636363636364
PAZZI 0.4117647058823529
PERUZZI 0.5
RIDOLFI 0.5384615384615384
SALVIATI 0.4117647058823529
STROZZI 0.4827586206896552
TORNABUON 0.5185185185185185
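
To connect this back to the BFS section, the sketch below computes closeness centrality by hand for an unweighted graph: the closeness of $v$ is $(n-1)$ divided by the sum of BFS distances from $v$ to every other node. It assumes the graph is connected and uses a small made-up path graph rather than the Florentine data. The function names are illustrative, and NetworkX may apply additional scaling on disconnected graphs.

```python
from collections import deque

def bfs_distances(G, v):
    """Return a dict mapping each node reachable from v to its BFS distance."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        current = queue.popleft()
        for node in G[current]:
            if node not in dist:
                dist[node] = dist[current] + 1
                queue.append(node)
    return dist

def closeness(G, v):
    """Closeness of v: (n - 1) / sum of distances, assuming G is connected."""
    dist = bfs_distances(G, v)
    total = sum(dist.values())
    if total == 0:
        return 0.0            # v has no reachable neighbors
    return (len(G) - 1) / total

# Hypothetical toy graph: the path 1 - 2 - 3
P3 = {1: [2], 2: [1, 3], 3: [2]}
scores = {v: closeness(P3, v) for v in P3}
print(scores)
```

The middle node of the path is closest to all others, which is the same intuition behind MEDICI obtaining the highest score in the output above.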


## Plot the graph with the node size as a function of the closeness centrality

In [3]:
# TODO