# Networks and Neo4j

This chapter covers the basics of connecting to a Neo4j database, querying it, and analysing the data.  

Before you begin this lesson:

* Read through the Installation Guide 
* Start your Neo4j server (you should be able to open the console in the browser)
    


In [None]:
# Imports 
import sys

#from py2neo import authenticate, neo4j, Graph, Relationship
from py2neo import authenticate, Graph, Relationship
#from py2neo.cypher import CypherWriter
#from py2neo import cypher
import cypher

#import MySQLdb
import pymysql
import numpy

import networkx
# Allows plots to be shown inline 
import matplotlib
%matplotlib inline

# Load the cypher cell magic extension
#%load_ext py2neo.cypher
%load_ext cypher


### Connecting to the Neo4j Database 

There are two ways we are going to connect to our Neo4j database: the first is by using the py2neo module and the second is by using IPython magic commands.

We will begin by creating a connection via py2neo. 

In [None]:
# Set up connection to the local Neo4j database 
# You will need to authenticate your connection; use the next line as an example 
# authenticate("localhost:7474", "<YOUR_USERNAME_HERE> (default is neo4j)", "<YOUR_PASSWORD_HERE>")

database_host = "localhost"
database_port = "7474"
database_username = "neo4j"
database_password = "<YOUR_PASSWORD_HERE>"

# set up authentication parameters
authenticate( database_host + ":" + database_port, database_username, database_password )

# Create a variable for our graph and print our connection information
graph_db = Graph()
print( graph_db )


### Testing our connection
To extract data from our database, we can pass Cypher commands to py2neo using the Graph object instance's `data()` method.

Run the cell below to tell py2neo to return a single node.

In [None]:
graph_db.data( "MATCH (n) RETURN n limit 1" )

## Connecting via Cell Magic 

Since we are using an IPython notebook, we can also extract data from our database using the Cypher IPython magic. Magic commands begin with the '%' symbol.

* A single `%` is a line magic: it applies only to the current line.
* A double `%%` is a cell magic: it applies to the entire cell.

To connect to the database via cell magic, use the following command (the default username is `neo4j`),

**`%%cypher http://<YOUR_USERNAME_HERE>:<YOUR_PASSWORD_HERE>@localhost:7474/db/data`**

followed by a cypher command.

Run the cell below to use cell magic to return a single node.


In [None]:
%%cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data
Match (n) RETURN n limit 1

## Quick Review of Neo4j and Cypher 

When you think of data, you probably imagine an Excel table where each row is an individual observation or data point. 
For example, 

|name | age | employee_id|
|-----|-----|------------|
|Joe  |  34 |   12345    |
|Ann  |  54 |   12346    |
    
In a graph database, that row is more like a ball. 


##### (node: Employee {name:Joe, age: 34, employee_id: 12345})

This ball is called a "node" in Neo4j. In fact, the little parentheses around Joe's information are designed to help the user think of Joe's information as a little ball. 

All the information about Joe is still there, but it's just not in a flat table format. Instead of storing each piece of Joe's information in a column, Joe's information is stored as **properties**, i.e. name, age, employee_id.
Joe's node also has a **label**, *Employee*, to identify this node as belonging to an employee.
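To make the row-to-node idea concrete, here is a minimal sketch in plain Python. The helper name `node_pattern` is made up for this illustration and is not part of py2neo or Neo4j; it only formats a label and a dictionary of properties in the parenthesised style shown above.

```python
import json

def node_pattern(label, properties):
    """Format a label and a property dict as a Cypher-style node pattern."""
    # json.dumps quotes strings and leaves numbers bare, which matches how
    # simple string and integer values are written in a node pattern.
    pairs = ", ".join(
        "{}: {}".format(key, json.dumps(value))
        for key, value in properties.items()
    )
    return "(:{} {{{}}})".format(label, pairs)

# One table row, stored as a dictionary of properties
joe = {"name": "Joe", "age": 34, "employee_id": 12345}
joe_pattern = node_pattern("Employee", joe)
print(joe_pattern)
```

The column names of the table become property keys, and the table itself is replaced by the label.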

Neo4j uses the Cypher Query Language to get information out of the Database. 

##### Components of a simple Cypher Query:
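+ MATCH      
    - Describes the pattern to search for; it plays roughly the role that SELECT ... FROM plays in SQL 
+ (n)        
    - Any node (the n is just a variable, could be any letter) 
+ RETURN     
    - States what the query should give back; needed in every query that reads data 
+ LIMIT      
    - Same as in SQL 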

The following query will return 20 nodes from the database:  
##### graph_db.data("MATCH (n) RETURN n LIMIT 20")

If I wanted to query just Award nodes, I would run this query:  
##### graph_db.data("MATCH (n:Award) RETURN n LIMIT 20")
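The two queries above differ only in the optional label, so they can be generated from one small function. This is only a sketch: `match_nodes_query` is a hypothetical helper written for this lesson, not something py2neo provides, and it simply builds the query text.

```python
def match_nodes_query(label=None, limit=20):
    """Build a 'MATCH ... RETURN n LIMIT x' query, optionally for one label."""
    node = "(n:{})".format(label) if label else "(n)"
    # int() stops a non-numeric limit from slipping into the query text
    return "MATCH {} RETURN n LIMIT {}".format(node, int(limit))

any_nodes = match_nodes_query()
award_nodes = match_nodes_query("Award")
print(any_nodes)
print(award_nodes)
```

The resulting strings can be passed to the Graph object in the same way as a query typed by hand.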


In [None]:
# You can use this cell to Test the Cypher Queries 
graph_db.data( "MATCH (n:Award) RETURN n LIMIT 20" )

### Cypher Examples 

Below are some examples of different Cypher commands.

You can comment and uncomment the different commands to get the same information with py2neo and %cypher.

Note: While both tools extract the same information, they return that data to us slightly differently. In py2neo 3, `data()` returns a list of dictionaries, one per row. %cypher returns a result set object that can be converted to a dataframe.
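As a rough illustration of working with row-oriented results, the sketch below assumes the rows arrive as a list of dictionaries keyed by the names in the RETURN clause. The rows shown are made up for the example and do not come from the course database.

```python
# Made-up rows shaped like a list of dictionaries; real rows would come
# from a query against your own database and hold your own values.
rows = [
    {"e1.employeeid": "100", "e2.employeeid": "200"},
    {"e1.employeeid": "100", "e2.employeeid": "300"},
]

def column(rows, key):
    """Pull one returned column out of a list of row dictionaries."""
    return [row[key] for row in rows]

first_ids = column(rows, "e1.employeeid")
second_ids = column(rows, "e2.employeeid")
print(first_ids, second_ids)
```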

In [None]:
# Collect 20 Employee Nodes - Cypher query via py2neo 3
node_query_single =  graph_db.data( "MATCH (e:Employee) RETURN e LIMIT 20" )

print( node_query_single )

In [None]:
# Collect 20 Employee Nodes - cell magic
node_query_single =  %cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data MATCH (e:Employee) RETURN e LIMIT 20

print( node_query_single )

In [None]:
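# Collect 20 (Award, Employee) pairs. Because the two patterns are not connected,
# every Award is paired with every Employee, so the same node can appear in several rows 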

node_query_multiple =  graph_db.data("MATCH (a:Award), (e:Employee) RETURN a,e LIMIT 20")

#node_query_multiple = %cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data MATCH (a:Award), (e:Employee) RETURN a,e LIMIT 20
print( node_query_multiple )

In [None]:
# Return 20 relationships where an Employee worked on an Award

relationship_query = graph_db.data("MATCH (a:Award) <-[r:WORKED_ON]- (e:Employee) RETURN r LIMIT 20")

#relationship_query = %cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data MATCH (a:Award) <-[r:WORKED_ON]- (e:Employee) RETURN r LIMIT 20
print( relationship_query )

In [None]:
# Pattern Query 
# Return 20 instances of the pattern where two employees worked on the same award. Return only the employeeid of each employee  

pattern_query = graph_db.data("MATCH (e1:Employee) --> (a:Award) <-- (e2:Employee) RETURN e1.employeeid, e2.employeeid LIMIT 20")

#pattern_query = %cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data MATCH (e1:Employee) --> (a:Award) <-- (e2:Employee) RETURN e1.employeeid, e2.employeeid LIMIT 20

print( pattern_query )

## Using Python to Automate Cypher Tasks 

An advantage of using Python to interface with Neo4j is that you can send commands to Neo4j in a saved and reproducible manner. In theory, all the work we did copying and pasting commands into Neo4j in the installation chapter can be replaced with Python commands.

To demonstrate this we will add some new Nodes and Relationships to our graph.

Roke College prides itself on providing research opportunities for its Students. Students are often employed by the university to provide research assistance to faculty members on the awards they work on. 

To fully add the Students to the database we will need to:

1. Load in the student nodes via a csv file 
2. Create a WORKED_ON relationship between the students and the awards on which they work
3. Create a WORKED_WITH relationship between students and their peers who worked on the same awards 
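One way to keep steps like these reproducible is to store the Cypher statements in a list and send them in order. The sketch below is an assumption about how you might organise this, not part of py2neo: `run_statements` is a hypothetical helper, and `fake_run` stands in for a real call such as the Graph object's `data()` method so that the example runs without a database.

```python
def run_statements(run, statements):
    """Send each Cypher statement to `run` in order and collect the results."""
    results = []
    for statement in statements:
        results.append(run(statement))
    return results

# Stand-in for a real database call so the sketch runs on its own
sent = []
def fake_run(statement):
    sent.append(statement)
    return []

steps = [
    "MATCH (n:Student) RETURN count(n)",
    "MATCH (n:Award) RETURN count(n)",
]
outcome = run_statements(fake_run, steps)
print(len(outcome))
```

With a live connection, the same list of statements could be passed along with the real query function instead of `fake_run`.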
    

_NOTE:_

_You may notice that here we are using the MERGE command instead of the CREATE command to create relationships. This is done to prevent duplicate nodes and relationships from being created. A CREATE command will create a new node or relationship regardless of whether that entity already exists. The MERGE command will create that node or relationship only if it does not already exist._
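The difference between the two commands can be imitated in plain Python. This is only an analogy with a list standing in for the database; the functions `create` and `merge` are made up for the illustration and do not talk to Neo4j.

```python
def create(store, item):
    """Like CREATE: always add, even if an identical item is already there."""
    store.append(item)

def merge(store, item):
    """Like MERGE: add only when no identical item exists yet."""
    if item not in store:
        store.append(item)

created, merged = [], []
relationship = ("student_1", "WORKED_ON", "award_1")

# Run the same "command" three times, as happens when a cell is re-run
for _ in range(3):
    create(created, relationship)
    merge(merged, relationship)

print(len(created), len(merged))
```

Re-running a notebook cell is common, which is why the MERGE behaviour is the safer choice for relationships here.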


##### Load in Student Nodes

In [None]:
# store data directory path for re-use
data_directory_path = ""

cypher_string = '''USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM \"file://'''
cypher_string += data_directory_path
cypher_string += '''/student_data.csv\"
AS row CREATE (:Student {employeeid: row.employeeid, position: row.occupation_orig});'''

graph_db.data( cypher_string )

##### Create Worked on Relationships with Awards 

In [None]:
cypher_string = '''USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM \"file://'''
cypher_string += data_directory_path
cypher_string += '''/award_data.csv" AS row
MATCH (a:Award {award_num: row.uniqueawardnumber})
MATCH (s:Student {employeeid: row.employeeid})
MERGE (s)-[r:WORKED_ON]->(a);'''

graph_db.data( cypher_string )

##### Create Worked With relationships between the students and existing staff 

In [None]:
#%%cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data 
cypher_string = '''MATCH (n1)-[:WORKED_ON]->(a:Award)<-[:WORKED_ON]-(n2)
MERGE (n1)-[r:WORKED_WITH]-(n2);'''

graph_db.data( cypher_string )

### If you would like to see the new data, open your Neo4j console. You'll see that we now have a new node label, Student. 

# Networkx
### Plotting and Graph Analysis 

Networkx is a Python module for creating, displaying and analysing graph data. 

We can load data from our Neo4j graph into networkx by performing queries and passing that data to networkx.

Run the cell below to see an example.

In [None]:
#Imports that should really be at the top of the notebook 
#import networkx
# Allows plots to be shown inline 
#import matplotlib
#%matplotlib inline

# Perform a cypher query to get 50 instances where a person worked on an award
# Note, I am using %cypher here because its result format is easier to convert for networkx 
results = %cypher http://neo4j:<YOUR_PASSWORD_HERE>@localhost:7474/db/data MATCH d = (p) -[r]-> (a:Award) RETURN d LIMIT 50

# Convert to graph object
graph = results.get_graph()

# Create a Color Map so the Graph will be colored 
color_map = []
for node in graph.nodes(data=True):
    this_labels = node[1]['labels']
    if 'Employee' in this_labels:
        color_map.append('green')
    elif 'Student' in this_labels:
        color_map.append('yellow')
    elif 'Award' in this_labels:
        color_map.append('red')
    else:
        # Keep one color per node so the list lines up with graph.nodes()
        color_map.append('gray')

# Draw the graph 
networkx.draw(graph, node_color = color_map)

"""
Notes 
Green Nodes are Employees
Yellow Nodes are Students
Red Nodes are Awards
Gray Nodes have none of these labels

"""

## Network Analysis 
The networkx module also has built-in tools to analyse graphs. The following measurements are used to assess how the nodes are related to each other in the graph database.  

### Network Measurements 
This is the vocabulary for studying a network: 

 **Degree Centrality** - counts the number of edges that a node has 
     - Nodes with a high degree of connections usually play an important role in a network 
 **Betweenness** - indicator of a node's centrality in a network 
     - Equal to the number of shortest paths from all vertices to all others that pass through that node 
     - Nodes that occur on many shortest paths between other nodes in the graph have a high betweenness centrality score  
 **Diameter** - the longest shortest path over all pairs of nodes 
     - Often we want to find the shortest distance between two nodes; the diameter is the longest of these shortest paths 
 **Cliques** - a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent  
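To see what two of these measurements mean mechanically, here is a small sketch in plain Python that does not use networkx. The four-node graph is made up for the example, the function names are illustrative, and the diameter function assumes the graph is connected.

```python
from collections import deque

# Made-up undirected graph as an adjacency dictionary: edges 0-1, 1-2, 1-3, 2-3
graph = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}

def degree_centrality(adjacency):
    """Degree of each node divided by the largest possible degree, n - 1."""
    n = len(adjacency)
    return {node: len(nbrs) / float(n - 1) for node, nbrs in adjacency.items()}

def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of edges from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances

def diameter(adjacency):
    """Longest of all shortest paths (assumes a connected graph)."""
    return max(
        max(shortest_path_lengths(adjacency, node).values())
        for node in adjacency
    )

centrality = degree_centrality(graph)
graph_diameter = diameter(graph)
print(centrality, graph_diameter)
```

Node 1 touches every other node, so it gets the highest degree centrality, while the farthest any two nodes are apart is two edges.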
     
     
Let's look at how these measurements look on a sample set of data.
  

#### Run the Cell Below to create one of Networkx's example graphs, the Maze Graph

In [None]:
# Create The graph
maze=networkx.sedgewick_maze_graph()

# Draw the graph
networkx.draw(maze)

print( "Maze Graph" )
print( "-------------" )
print( "Number of Nodes: " + str( maze.number_of_nodes() ) )

### Degree and Centrality

- Counts the number of edges that a node has 
- Nodes with a high degree of connections usually play an important role in a network 
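Degree can also be summarised for a whole graph. The last cell of this section computes a degree centralization score for the maze graph; the sketch below applies the same formula (the sum of each node's gap to the maximum degree, divided by (n - 1)(n - 2)) to two made-up degree lists so the extremes are easy to see. The function name is illustrative.

```python
def degree_centralization(degrees):
    """Sum of (max degree - degree) over all nodes, scaled by (n - 1)(n - 2)."""
    n = len(degrees)
    max_degree = max(degrees)
    return sum(max_degree - d for d in degrees) / float((n - 1) * (n - 2))

# Star with one hub and three leaves: a single node dominates
star_score = degree_centralization([3, 1, 1, 1])

# Ring of four nodes: every node has the same degree
ring_score = degree_centralization([2, 2, 2, 2])

print(star_score, ring_score)
```

A star scores 1 because one node holds all the connections, and a ring scores 0 because the connections are spread evenly.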

In [None]:
# Degree of every node; dict() keeps this working whether degree() returns a dict or a degree view
degrees = list(dict(maze.degree()).values())

# Maximum number of connections  
print( "The Maximum number of Edges is " + str( max(degrees) ) )

# Lowest number of connections 
print( "The Minimum number of Edges is " + str( min(degrees) ) )

# Average number of connections 
print( "The Average number of Edges is " + str( numpy.mean(degrees) ) )

# Median number of connections 
print( "The Median number of Edges is " + str( numpy.median(degrees) ) )

In [None]:
# The degree is divided by the maximum possible number of connections 
# The bigger the number, the more connections 

# Compute the centrality once; list() lets numpy work with the values
centrality_values = list(networkx.degree_centrality(maze).values())

# Maximum normalized degree centrality   
print( "The Maximum Degree Centrality is " + str( max(centrality_values) ) )

# Lowest normalized degree centrality  
print( "The Minimum Degree Centrality is " + str( min(centrality_values) ) )

# Average normalized degree centrality  
print( "The Average Degree Centrality is " + str( numpy.mean(centrality_values) ) )

# Median normalized degree centrality  
print( "The Median Degree Centrality is " + str( numpy.median(centrality_values) ) )

In [None]:
# Centralization: How equal are the nodes? 
# How much variation is there in the centrality scores among the nodes?
# http://cs.brynmawr.edu/Courses/cs380/spring2013/section02/slides/05_Centrality.pdf
# The closer to 1, the more the graph is dominated by a few popular nodes that interact with many nodes
# The closer to zero, the more evenly the interactions are distributed among the nodes 

# dict() keeps this working whether degree() returns a dict or a degree view
all_degrees = list(dict(maze.degree()).values())
max_degree = max(all_degrees)
nodes_num = maze.number_of_nodes()

centrality = sum([max_degree - x for x in all_degrees]) / float(((nodes_num - 1)*(nodes_num - 2)))
print( "The Centralization of this graph is " + str( centrality ) )

### Betweenness
- Equal to the number of shortest paths from all vertices to all others that pass through that node 
- Nodes that occur on many shortest paths between other nodes in the graph have a high betweenness centrality score  

In [None]:
# Compute betweenness once; list() lets numpy work with the values
betweenness_values = list(networkx.betweenness_centrality(maze).values())

print( "The Maximum Betweenness measure is " + str( max(betweenness_values) ) )

print( "The Minimum Betweenness measure is " + str( min(betweenness_values) ) )

print( "The Average Betweenness measure is " + str( numpy.mean(betweenness_values) ) )

print( "The Median Betweenness measure is " + str( numpy.median(betweenness_values) ) )

### Diameter
- Often we want to find the shortest distance between two nodes; the diameter is the longest of these shortest paths 

In [None]:
print( "The Diameter of this graph is " + str( networkx.diameter(maze) ) )


### Cliques

- A clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent.  
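The definition can be checked directly: a set of nodes is a clique when every pair in it is connected. The sketch below uses a made-up graph stored as a dictionary of neighbor sets; `is_clique` is an illustrative helper, not a networkx function.

```python
from itertools import combinations

# Made-up undirected graph: a triangle a-b-c with an extra node d attached to c
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def is_clique(adjacency, nodes):
    """True when every pair of distinct nodes in `nodes` is adjacent."""
    return all(v in adjacency[u] for u, v in combinations(nodes, 2))

triangle_is_clique = is_clique(graph, ["a", "b", "c"])
with_d_is_clique = is_clique(graph, ["b", "c", "d"])
print(triangle_is_clique, with_d_is_clique)
```

The triangle passes because all three pairs are linked, while adding d fails because b and d share no edge.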

In [None]:
print( "The Cliques in the Maze graph are, " + str( list(networkx.find_cliques(maze)) ) )

##### The above list of cliques is a little uninteresting.
Run the code cell below to see the clique list of another built-in graph, the lollipop graph.

You can see that the lollipop graph has a cluster of 10 nodes that make up the "candy" part of the lollipop while the "stem" of the lollipop is a line of nodes.

In [None]:
# Create the Graph 
lolli_g = networkx.lollipop_graph(10,20)

print( "The Cliques in the Lollipop graph are, " + str( list(networkx.find_cliques(lolli_g)) ) )

# Draw the Graph 
networkx.draw(lolli_g)