# Graph Analytics - Basic Node

In this notebook, we're going to get more familiar with our dataset and do some analytics.

In [1]:
from py2neo import Graph

#################################################
# Update the hostname in the connection string
# below to the server you were assigned (see the
# schedule notebook), using one of the hosts
# listed here.
#################################################
# Server 0 - neo4j.dsa.missouri.edu
# Server 1 - neo4j-1.dsa.missouri.edu
# Server 2 - neo4j-2.dsa.missouri.edu
# Server 3 - neo4j-3.dsa.missouri.edu
#################################################

graph = Graph("bolt://wikiread:wikireader@neo4j-1.dsa.missouri.edu:9000")

## Wikipedia Statistics
Our Wikipedia database, all 6.1 GB of it, contains every article and the links between them as of 6/1/18.

It's always good to know how much data we're working with, so we'll start by counting the pages and links.
The [Cypher reference card](https://neo4j.com/docs/cypher-refcard/3.3/) is a useful resource if you need a quick language guide.

Before running those counts, let's drill into the data types of the objects we are chaining together.

In [2]:
# What does a graph.run() return?
page_total = graph.run("MATCH (a) RETURN COUNT(a) as pages")
print(type(page_total))
# a Cursor object, which is a structure with a set of results

<class 'py2neo.database.Cursor'>


You may have encountered cursors in your prior database studies, or you may have overlooked them.

Take a moment to review the concept: https://en.wikipedia.org/wiki/Cursor_%28databases%29
 * Just review Usage and Scrollable cursors for now


In [3]:
# We can convert that into a table
page_total = graph.run("MATCH (a) RETURN COUNT(a) as pages").to_table()
print(type(page_total))

<class 'py2neo.data.Table'>


In [4]:
# Note that letting Jupyter display the table gives different formatting.
page_total = graph.run("MATCH (a) RETURN COUNT(a) as pages").to_table()
page_total

pages
13835767


In [5]:
# We can also convert it to a list of dictionaries, one per row returned
page_total = graph.run("MATCH (a) RETURN COUNT(a) as pages").data()
print(page_total)

[{'pages': 13835767}]


##### Be careful with conversions of cursors to tables.  
If the result set is large, the Python environment can drastically slow down as memory usage increases.
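
One way to limit memory use is to consume the cursor lazily instead of materializing every row. The sketch below assumes only that the cursor is iterable, which py2neo's `Cursor` is. `preview` and `fake_cursor` are hypothetical names, and the generator stands in for a real `graph.run(...)` result so the example runs without a server.

```python
from itertools import islice


def preview(cursor, n=5):
    """Return at most n records, pulling only those n from the cursor."""
    return list(islice(cursor, n))


def fake_cursor(total, pulled):
    """Stand-in for graph.run(...): yields one dict per row and logs each pull."""
    for i in range(total):
        pulled.append(i)
        yield {"title": "Page {}".format(i)}


pulled = []
rows = preview(fake_cursor(1_000_000, pulled), n=3)
print(rows)         # first three rows only
print(len(pulled))  # rows actually produced by the stand-in cursor
```

With a live connection this could be called as `preview(graph.run("MATCH (a) RETURN a.title AS title"), 5)`. Adding a `LIMIT` clause to the Cypher query is usually preferable, since it keeps the server from producing the extra rows at all.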

In [6]:
# How many nodes are in our graph?
#  This line pulls the first row and column from the result table, the page count
page_total = graph.run("MATCH (a) RETURN COUNT(a) as pages").to_table()[0][0]

# How many edges?
outgoing_links = graph.run("MATCH (a)-[r]->(b) RETURN COUNT(r) as links").to_table()[0][0]

print(page_total, "pages with", outgoing_links, "page links.")


13835767 pages with 146045322 page links.


Looks like we have around 13.8M pages and 146M page links.

It's important to keep in mind how you phrase (draw) your edge directions. 
Your first thought may be to query edges of any direction, 
but an outgoing edge is just another node's incoming edge. 
An undirected pattern matches each edge from both ends, so we'd expect this count to be twice as large.

In [8]:
total_links = graph.run("MATCH (a)-[r]-(b) RETURN COUNT(r) as links").to_table()[0][0]

print(total_links, "total links")


292090610 total links


But that number isn't quite right, is it? 
We'd expect 292,090,**644** links since we're counting each edge twice, not 292,090,**610**. 
We're missing 34 or maybe 17 edges, since we'd expect each end of the link to be counted.
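
The arithmetic behind that expectation is short enough to check directly. This sketch only reuses the two counts printed above; the variable names are chosen for illustration so they don't overwrite the notebook's own variables.

```python
outgoing_count = 146045322    # directed count printed by the notebook above
undirected_count = 292090610  # undirected count printed by the notebook above

# If every edge were matched from both ends, the undirected count would double.
expected_count = 2 * outgoing_count
shortfall = expected_count - undirected_count

print(expected_count)  # 292090644
print(shortfall)       # 34
```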

Ideally, our database can do basic math, and there's not much to the query. 
Perhaps we have broken links? 
Our incoming and outgoing totals may not be the same.

In [9]:
incoming_links = graph.run("MATCH (a)<-[r]-(b) RETURN COUNT(r) as links").to_table()[0][0]

print(incoming_links, "incoming page links,", outgoing_links, "outgoing page links,", total_links, "in total")


146045322 incoming page links, 146045322 outgoing page links, 292090610 in total


But they seem to be the same number. The numbers don't lie, or at least shouldn't. 
If these really are the totals, there must be something wrong with our assumptions. 
We have the same number of outgoing edges and incoming edges, 
but our total is slightly less than the sum of the two. 
Do we have duplicates? 
Is there another edge type we're not counting somehow?

We've counted `a-->b`, `a<--b`, and `a--b`, but what about `a-->a`? 
We assumed there would be two distinct ends to our edges, and that's what we've been counting. 
We counted `a-->b` and `a<--b`, and we expected the two together to equal `a--b`, but when `b` is `a`, we end up with `a-->a` and `a<--a`.

`a-->a` and `a<--a` are the same match when we ask for `a--b`, 
so a self-link is only counted once. 
This really could be argued either way, 
but there's not much to gain from arguing with database software. 
Let's see who these troublesome articles are.
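
Before querying the real graph, the counting rule can be checked on a toy example. This is a plain-Python model of how an undirected pattern is matched, not py2neo or Neo4j code; the edge list and variable names are made up for illustration.

```python
# Toy directed edge list; ("C", "C") is a self-loop.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "C")]

directed_count = len(edges)

# Model of (a)-[r]-(b): each edge r can bind (a, b) in either orientation,
# but for a self-loop both orientations are the same binding.
undirected_matches = set()
for r, (src, dst) in enumerate(edges):
    undirected_matches.add((r, src, dst))
    undirected_matches.add((r, dst, src))

self_loops = sum(1 for src, dst in edges if src == dst)

print(directed_count)           # 4
print(len(undirected_matches))  # 7
print(self_loops)               # 1
```

In this model the undirected count is twice the directed count minus the number of self-loops, which has the same shape as the 34-match shortfall in our totals.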


Identify the pages that link to themselves.

In [10]:
# Notice the WHERE clause
data = graph.run("MATCH (a)-->(b) WHERE a = b RETURN a.title").to_table()
print(data)

 a.title                                       
-----------------------------------------------
 Etawah                                        
 Greater Chennai Corporation                   
 Empty calorie                                 
 Sunnybank Rugby                               
 Souths Rugby                                  
 Botswana Telecommunications Corporation       
 Pococí (canton)                               
 London Buses route 98                         
 Star Maa                                      
 King Edward VII School                        
 Melrose–Rugby, Roanoke, Virginia              
 Jalalpur Pirwala                              
 Cobhams Asuquo                                
 Gayaza                                        
 Clapham Town (ward)                           
 Stockwell (ward)                              
 Knight's Hill (ward)                          
 Phillip Ingram                                
 Teru               

Inter-page links are formatted differently than intra-page links in Wikipedia. 
These 34 pages are most likely _style guide violations_ that had not been corrected at the time of download.
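
Since the printed table can be long, a count is a quicker way to check the number of self-links. A minimal sketch, assuming a connected py2neo `Graph` object like the `graph` created at the top of this notebook. `count_self_loops` is a hypothetical helper; `Graph.evaluate` returns the first value of the first record.

```python
def count_self_loops(graph):
    """Count relationships whose start node and end node are the same."""
    # Reusing the variable `a` on both ends restricts the match to self-loops.
    return graph.evaluate("MATCH (a)-[r]->(a) RETURN COUNT(r) AS self_loops")


# Usage with the notebook's connection (requires server access):
# print(count_self_loops(graph))
```

If the explanation above holds, the value returned should equal the shortfall between the doubled directed count and the undirected count.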

With our edge mysteries accounted for, let's look at our nodes.

----------
### <span style="background:yellow">Your Turn</span>

In theory, there are four types of nodes: 
those with no links, those with only incoming links, 
those with only outgoing links, and those with both. 
Let's verify this and see if the sum of all four cases matches our node total.

**As a starter hint: Which set of those will this match?**
```
MATCH (n) WHERE NOT (n)--() RETURN COUNT(n)
```
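
The four cases cover every node because each node either has or lacks incoming links, and independently has or lacks outgoing links. A minimal sketch on a made-up edge list, in plain Python rather than Cypher, shows the partition; all names here are illustrative.

```python
# Toy directed edge list; node "E" has no links at all.
nodes = {"A", "B", "C", "D", "E"}
edges = [("A", "B"), ("B", "C"), ("D", "B")]

has_out = {src for src, _ in edges}  # nodes with at least one outgoing link
has_in = {dst for _, dst in edges}   # nodes with at least one incoming link

none_set = nodes - has_in - has_out
in_only_set = has_in - has_out
out_only_set = has_out - has_in
both_set = has_in & has_out

print(sorted(none_set))      # ['E']
print(sorted(in_only_set))   # ['C']
print(sorted(out_only_set))  # ['A', 'D']
print(sorted(both_set))      # ['B']

# The four sets are disjoint and together cover every node.
total = len(none_set) + len(in_only_set) + len(out_only_set) + len(both_set)
print(total == len(nodes))   # True
```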

In [11]:
# M4:P1:Q1
nodes_none,nodes_in,nodes_out,nodes_in_out = (0,0,0,0)

# Construct where clauses to count these four
#  types of nodes and verify our hypothesis.
# HINT: node and edge connections can be 
#       used in where clauses
# ----------------------------

nodes_none = graph.run("""
MATCH (n) WHERE NOT (n)--() RETURN COUNT(n)
""").to_table()[0][0]


nodes_in = graph.run("""
MATCH (n) WHERE (n)<--() AND NOT (n)-->() RETURN COUNT(n)
""").to_table()[0][0]


nodes_out = graph.run("""
MATCH (n) WHERE (n)-->() AND NOT (n)<--() RETURN COUNT(n)
""").to_table()[0][0]


nodes_in_out = graph.run("""
MATCH (n) WHERE (n)-->() AND (n)<--() RETURN COUNT(n)
""").to_table()[0][0]


query_total = nodes_none + nodes_in + nodes_out + nodes_in_out

print(
    (
     "{} with no links ({:.2f}%)\n"
     "{} with only incoming links ({:.2f}%)\n"
     "{} with only outgoing links ({:.2f}%)\n"
     "{} with both ({:.2f}%)\n{} total, {} expected\n"   
    ).format(
        nodes_none, 
        nodes_none/query_total*100, 
        nodes_in, 
        nodes_in/query_total*100, 
        nodes_out, 
        nodes_out/query_total*100, 
        nodes_in_out, 
        nodes_in_out/query_total*100, 
        query_total, 
        page_total)
)

500848 with no links (3.62%)
301756 with only incoming links (2.18%)
5205846 with only outgoing links (37.63%)
7827317 with both (56.57%)
13835767 total, 13835767 expected



---
As you can see, this is just the tip of the iceberg when it comes to analytics using graph databases.
The lessons continue by diving deeper into link analysis.

# Save your Notebook, then `File > Close and Halt`

---