# Introduction to Neo4j and Cypher

Cypher, originally created for Neo4j, is a query language supported by many graph databases. 
This lab will walk you through the very basics: setting up a connection, querying nodes, and identifying edges.

## Topics covered
 1. Graph Data Structure Concepts
 1. Graph Database (Neo4j)   

## Node Assignments 

* Server `neo4j-1.dsa.lan`: Last name A - G 
* Server `neo4j-2.dsa.lan`: Last name H - M
* Server `neo4j-3.dsa.lan`: Last name N - S
* Server `neo4j-4.dsa.lan`: Last name T - Y
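
If it helps, the assignment list above can be written as a small lookup. This is only a sketch: the name `assigned_server` is hypothetical, and a last name whose first letter falls outside the listed ranges returns `None`, because the list above does not say where it belongs.

```python
# Letter ranges copied from the assignment list above.
ASSIGNMENTS = [
    ("A", "G", "neo4j-1.dsa.lan"),
    ("H", "M", "neo4j-2.dsa.lan"),
    ("N", "S", "neo4j-3.dsa.lan"),
    ("T", "Y", "neo4j-4.dsa.lan"),
]


def assigned_server(last_name):
    """Return the server for a last name, or None if no listed range matches."""
    initial = last_name.strip()[:1].upper()
    for first, last, server in ASSIGNMENTS:
        if initial and first <= initial <= last:
            return server
    return None


print(assigned_server("Hopper"))  # neo4j-2.dsa.lan
```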

## Readings
  * [What is a Graph Database](https://neo4j.com/developer/graph-database/)

### Basic Cypher Query Language
The examples below show basic Cypher queries.

[You can read a complete overview of the Cypher language here](https://neo4j.com/developer/cypher-query-language/)

**A few of its highlights are pasted below:**

__About Cypher__

Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ASCII-art syntax.

It allows us to state what we want to select, insert, update or delete from our graph data without requiring us to describe exactly how to do it.

__Nodes__

Cypher uses ASCII art to represent patterns. We surround nodes with parentheses, which look like circles, e.g. `(node)`. If we later want to refer to the node, we give it a variable like `(p)` for person or `(t)` for thing. In real-world queries, we’ll probably use longer, more expressive variable names like `(person)` or `(thing)`. If the node is not relevant to your question, you can also use empty parentheses `()`.

Usually, the relevant labels of the node are provided to distinguish between entities and optimize execution, like `(p:Person)`.

We might use a pattern like `(person:Person)-->(thing:Thing)` so we can refer to them later, for example, to access properties like `person.name` and `thing.quality`.

__The more general structure is:__

`MATCH (node:Label) RETURN node.property`

`MATCH (node1:Label1)-->(node2:Label2) WHERE node1.propertyA = $value RETURN node2.propertyA, node2.propertyB`

Please note that node-labels, relationship-types and property-names are case-sensitive in Cypher. All the other clauses, keywords and functions are not, but should be cased consistently according to the style used here.
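
Rather than pasting a value into the query text, we can pass it as a named parameter, written with a leading `$` in current Neo4j versions. The sketch below assumes a `graph` object with a `run(query, **parameters)` method, which is how py2neo's `Graph.run` accepts parameters. The helper `things_for` and the stand-in class `RecordingGraph` are hypothetical, and the `Person`/`Thing` labels are only the illustrative ones from the text above, not part of this lab's dataset.

```python
# $name is a query parameter: its value travels separately from the query text.
QUERY = "MATCH (p:Person)-->(t:Thing) WHERE p.name = $name RETURN t.quality"


def things_for(graph, name):
    """Run QUERY on any object exposing run(query, **parameters)."""
    return graph.run(QUERY, name=name)


class RecordingGraph:
    """Stand-in for a real connection; it only records what it is asked to run."""

    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))
        return []


stub = RecordingGraph()
things_for(stub, "Alice")
print(stub.calls[0][1])  # {'name': 'Alice'}
```

With a real connection you would pass the py2neo `graph` object instead of the stub.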

__Relationships__

To fully utilize the power of our graph database we want to express more complex patterns between our nodes. Relationships are basically an arrow --> between two nodes. Additional information can be placed in square brackets inside of the arrow.

__This can be__

* relationship types, like `-[:KNOWS|:LIKE]->`
* a variable name before the colon, like `-[rel:KNOWS]->`
* additional properties, like `-[{since:2010}]->`
* structural information for paths of variable length, like `-[:KNOWS*..4]->`

To access information about a relationship, we can assign it a variable for later reference. It is placed in front of the colon, as in `-[rel:KNOWS]->`, or stands alone, as in `-[rel]->`.
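
To see how those pieces fit inside the square brackets, here is a small illustrative builder. The function `rel_pattern` is hypothetical and exists only to show the ordering (variable, types, hop range, properties); it joins types with a bare `|`, which, as far as we know, later Neo4j releases prefer over `|:`. Treat it as a learning aid, not as a safe way to build queries from untrusted input.

```python
def rel_pattern(var="", types=(), max_hops=None, props=None):
    """Assemble an outgoing relationship pattern such as -[rel:KNOWS]->."""
    body = var
    if types:
        body += ":" + "|".join(types)
    if max_hops is not None:
        body += "*..{}".format(max_hops)
    if props:
        pairs = ", ".join("{}: {!r}".format(k, v) for k, v in props.items())
        body = (body + " {" + pairs + "}").strip()
    return "-[{}]->".format(body)


print(rel_pattern(types=["KNOWS", "LIKE"]))      # -[:KNOWS|LIKE]->
print(rel_pattern(var="rel", types=["KNOWS"]))   # -[rel:KNOWS]->
print(rel_pattern(props={"since": 2010}))        # -[{since: 2010}]->
print(rel_pattern(types=["KNOWS"], max_hops=4))  # -[:KNOWS*..4]->
```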

### Connecting
Connecting to a Neo4j database is slightly easier than connecting to a standard database, because each database contains only one graph. There are no schemas or databases to pick from, just a host, a port, a username, and a password. Bolt is a binary protocol for database connections designed by the Neo4j team, much like the ones used by other database applications. It avoids the overhead of text-based protocols like HTTP.

In [3]:
from py2neo import Graph

#################################################
# Replace the host name in the connection URL
# below with the server you were assigned (see
# the schedule notebook). The available servers
# are listed below.
#################################################
# Server 1 - neo4j-1.dsa.lan
# Server 2 - neo4j-2.dsa.lan
# Server 3 - neo4j-3.dsa.lan
# Server 4 - neo4j-4.dsa.lan
#################################################

graph = Graph("bolt://wikiread:wikireader@neo4j-1.dsa.lan:9000")
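
The connection URL packs the protocol, username, password, host, and port into one string. As a sketch using only Python's standard library, we can take it apart and swap in a different host. The helper `with_host` is hypothetical; it is not part of py2neo.

```python
from urllib.parse import urlparse

# Same URL shape as the connection string above.
BOLT_URL = "bolt://wikiread:wikireader@neo4j-1.dsa.lan:9000"

url = urlparse(BOLT_URL)
print(url.scheme)    # bolt
print(url.username)  # wikiread
print(url.hostname)  # neo4j-1.dsa.lan
print(url.port)      # 9000


def with_host(bolt_url, host):
    """Return bolt_url with only the host name swapped out."""
    parts = urlparse(bolt_url)
    netloc = "{}:{}@{}:{}".format(parts.username, parts.password, host, parts.port)
    return parts._replace(netloc=netloc).geturl()


print(with_host(BOLT_URL, "neo4j-3.dsa.lan"))
# bolt://wikiread:wikireader@neo4j-3.dsa.lan:9000
```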

# Identifying a Node

As you may have guessed from the connection information, we're using some simple Wikipedia data. Each node represents a single Wikipedia article, and each edge is a link from one article to another.

Let's look up a single Wikipedia page in the graph. There's only one type of node in this graph, a Page. Each node has an id number, and each Page contains a title. With this, let's look up a page using a MATCH statement.

In Cypher, a node is identified by a set of parentheses, and its properties by braces. Here we ask the server to look up `p`, a node with the `Page` label and the title `Neo4j`, and return it to us.

In [4]:
data = graph.run("MATCH (p:Page {title: 'Neo4j'}) RETURN p")

print(data.to_table())

 p                                
----------------------------------
 (_6745383:Page {title: 'Neo4j'}) 



In the returned data, we see the node's ID (prefixed with an underscore), its label, and its properties. But this didn't tell us much that we didn't already know.
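
To make the matching rule concrete, here is a toy model in plain Python, with no database involved. It treats each node as an id, a set of labels, and a dictionary of properties; the function `match_nodes` is hypothetical and only mimics what the `MATCH` pattern above asks for. The two sample nodes use ids and titles that appear in this lab's output.

```python
# A toy stand-in for the graph: each node has an id, labels, and properties.
NODES = [
    {"id": 6745383, "labels": {"Page"}, "props": {"title": "Neo4j"}},
    {"id": 9539638, "labels": {"Page"}, "props": {"title": "Apache Giraph"}},
]


def match_nodes(nodes, label=None, **props):
    """Keep nodes that carry the label (if given) and every listed property value."""
    found = []
    for node in nodes:
        if label is not None and label not in node["labels"]:
            continue
        if all(node["props"].get(k) == v for k, v in props.items()):
            found.append(node)
    return found


# Rough equivalent of: MATCH (p:Page {title: 'Neo4j'}) RETURN p
print([n["id"] for n in match_nodes(NODES, label="Page", title="Neo4j")])  # [6745383]
```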

Note above that we ran the query with a filter specifying that we only wanted to look for Page nodes, even though we know the database only contains Page nodes. 
It's a good idea to specify labels, even if there's only one. Just like any other query language, it's important to know both what to ask and how. 

 * Label Filter for `Page` nodes
```
:Page
```

Try running that query without the label filter; it's much slower.

In [5]:
from timeit import default_timer as timer

begin = timer()
graph.run("MATCH (p:Page {title: 'Neo4j'}) RETURN p")
end = timer()
print("With type specifier: {0:0.4f} (s)".format(end-begin))

begin = timer()
graph.run("MATCH (p {title: 'Neo4j'}) RETURN p")
end = timer()
print("Without type specifier: {0:0.4f} (s)".format(end-begin))

With type specifier: 0.0100 (s)
Without type specifier: 8.2775 (s)
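
One plausible reason for the gap, stated as an assumption because this lab does not show the database's schema: Neo4j defines indexes per label, so a lookup on `title` can only use an index when the label is part of the pattern. Without the label, every node has to be checked. The toy comparison below is not Neo4j code; it contrasts a one-by-one scan with a dictionary lookup over made-up titles to show why the difference can be so large.

```python
# 100,000 made-up titles standing in for Page nodes.
titles = ["Page {}".format(i) for i in range(100000)]

# An index maps a property value straight to the matching position.
title_index = {title: pos for pos, title in enumerate(titles)}


def scan(target):
    """Check titles one by one; return (position, comparisons made)."""
    for checked, title in enumerate(titles, start=1):
        if title == target:
            return checked - 1, checked
    return None, len(titles)


def indexed(target):
    """Single dictionary lookup; return the position or None."""
    return title_index.get(target)


print(scan("Page 99999"))     # (99999, 100000)
print(indexed("Page 99999"))  # 99999
```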


# Finding Edges

Looking at a single page doesn't tell you much. Let's see what kind of links are going on.

In Cypher, edges (links) are indicated by lines or arrows between nodes. 
Instead of parentheses, edges are identified using square brackets. 
Links can be filtered by type and properties as well. 
The direction of the arrow indicates the direction of the edge. 
Having neither (or both) arrowheads searches both incoming and outgoing edges.
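
The three direction choices differ only in where the arrowheads go. As a quick illustration, the hypothetical helper `link_pattern` below returns the relationship part of the pattern for each case and prints the full query text. Nothing is sent to a server.

```python
def link_pattern(direction, rel="l:Link"):
    """Return the relationship part of a pattern for 'out', 'in', or 'both'."""
    if direction == "out":
        return "-[{}]->".format(rel)
    if direction == "in":
        return "<-[{}]-".format(rel)
    if direction == "both":
        return "-[{}]-".format(rel)
    raise ValueError("direction must be 'out', 'in', or 'both'")


for d in ("out", "in", "both"):
    # Plain concatenation keeps the braces in the node pattern untouched.
    print("MATCH (p:Page {title: 'Neo4j'})" + link_pattern(d) + "(q:Page) RETURN p.title, q.title")
```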

Here, we're asking for our Neo4j page, `p`, and any outgoing links, `l`, to other pages.

In [6]:
data = graph.run("MATCH (p:Page {title: 'Neo4j'})-[l:Link]->(:Page) RETURN p,l")

print(data.to_table())

 p                                | l                                  
----------------------------------|------------------------------------
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_9539638)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_7546967)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_1496086)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_881262)   
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_8648946)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_12188)    
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_13622)    
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_1917072)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_6911747)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_8241569)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_1970728)  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Li

We got some data back, sure, but it's not actually very useful. In the database, edges are just connections between node id numbers. While edges may have useful properties just like nodes, our links do not.

----

Previously, we asked for every outgoing link, `l`, from our page, `p`, to any arbitrary page. 
Given what you've seen so far, how do you think we could change the following query to get us our page `p` and all pages it connects to?

In [7]:
# This is the query we ran above, commented out for comparison:
# data = graph.run("MATCH (p:Page {title: 'Neo4j'})-[l:Link]->(:Page) RETURN p,l")

# The edited query names the target node q and returns it as well.
# Carefully review the difference between the two lines.
data = graph.run("MATCH (p:Page {title: 'Neo4j'})-[l:Link]->(q:Page) RETURN p,l,q")

print(data.to_table())



 p                                | l                                  | q                                                         
----------------------------------|------------------------------------|-----------------------------------------------------------
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_9539638)  | (_9539638:Page {title: 'Apache Giraph'})                  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_7546967)  | (_7546967:Page {title: 'Nikolaj Nyholm'})                 
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_1496086)  | (_1496086:Page {title: 'United States'})                  
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_881262)   | (_881262:Page {title: 'Remote backup service'})           
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_8648946)  | (_8648946:Page {title: 'Gremlin (programming language)'}) 
 (_6745383:Page {title: 'Neo4j'}) | (_6745383)-[:Link {}]->(_12188)  

Now that we're capturing the page being linked to, we can actually see the relationships between articles. 
The results are still a little cluttered though. 
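
Why did we need to return the target node at all? An edge only records which two node ids it connects; the titles live on the nodes. The toy model below (plain Python, with ids and titles copied from the output above) shows the extra step of resolving an edge's end id to a title. The function name `outgoing_titles` is hypothetical.

```python
# Each edge is stored as (start node id, end node id).
EDGES = [
    (6745383, 9539638),
    (6745383, 7546967),
    (6745383, 1496086),
]

# Titles are properties of nodes, keyed here by node id.
TITLES = {
    6745383: "Neo4j",
    9539638: "Apache Giraph",
    7546967: "Nikolaj Nyholm",
    1496086: "United States",
}


def outgoing_titles(node_id):
    """Follow every edge that starts at node_id and look up the end node's title."""
    return [TITLES[end] for start, end in EDGES if start == node_id]


print(outgoing_titles(6745383))  # ['Apache Giraph', 'Nikolaj Nyholm', 'United States']
```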



# Language Exploration
You can return a property from an object using dot notation (e.g., `a.x`).

Tweak your query from above to return only the titles of the pages.

In [8]:
# This is the query we ran above, commented out for comparison:
# data = graph.run("MATCH (p:Page {title: 'Neo4j'})-[l:Link]->(q:Page) RETURN p,l,q")

# The edited query returns only the title property of each page.
# Carefully review the difference between the two lines.
data = graph.run("MATCH (p:Page {title: 'Neo4j'})-[l:Link]->(q:Page) RETURN p.title,q.title")


print(data.to_table())



 p.title | q.title                        
---------|--------------------------------
 Neo4j   | Apache Giraph                  
 Neo4j   | Nikolaj Nyholm                 
 Neo4j   | United States                  
 Neo4j   | Remote backup service          
 Neo4j   | Gremlin (programming language) 
 Neo4j   | Malmö                          
 Neo4j   | North America                  
 Neo4j   | GPLv3                          
 Neo4j   | Neo Technology                 
 Neo4j   | OrientDB                       
 Neo4j   | Sweden                         
 Neo4j   | Java (programming language)    
 Neo4j   | ArangoDB                       
 Neo4j   | San Francisco Bay Area         
 Neo4j   | Cypher Query Language          
 Neo4j   | ACID                           
 Neo4j   | AGPLv3                         
 Neo4j   | Europe                         
 Neo4j   | Graph database                 
 Neo4j   | Affero General Public License  
 Neo4j   | GNU General Public Li

That should look much better.
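
If we want the titles as an ordinary Python list rather than a printed table, we can loop over the result. The sketch below assumes that iterating over what `graph.run` returns yields records that can be indexed by column name, which is how py2neo cursors are documented to behave. The helper `linked_titles` and the stand-in `StubGraph` are hypothetical; the stub returns canned rows so the example runs without a server.

```python
QUERY = (
    "MATCH (p:Page {title: $title})-[:Link]->(q:Page) "
    "RETURN q.title AS target"
)


def linked_titles(graph, title):
    """Return the titles of pages that `title` links to, as a plain list."""
    return [record["target"] for record in graph.run(QUERY, title=title)]


class StubGraph:
    """Stand-in that returns canned rows instead of querying a database."""

    def run(self, query, **parameters):
        return [{"target": "Apache Giraph"}, {"target": "Nikolaj Nyholm"}]


print(linked_titles(StubGraph(), "Neo4j"))  # ['Apache Giraph', 'Nikolaj Nyholm']
```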

Right now, we're looking up outgoing edges. Let's change that to incoming edges.

In [9]:
# This is the query we ran above, commented out for comparison:
# data = graph.run("MATCH (p:Page {title: 'Neo4j'})-[l:Link]->(q:Page) RETURN p.title,q.title")

# The edited query reverses the arrow so it follows incoming links.
# Carefully review the difference between the two lines.
data = graph.run("MATCH (p:Page {title: 'Neo4j'})<-[l:Link]-(q:Page) RETURN p.title,q.title")



print(data.to_table())



 p.title | q.title                                           
---------|---------------------------------------------------
 Neo4j   | Paradise Papers                                   
 Neo4j   | Linkurious                                        
 Neo4j   | List of TCP and UDP port numbers                  
 Neo4j   | Query language                                    
 Neo4j   | Connection pool                                   
 Neo4j   | Spatial database                                  
 Neo4j   | Paxos (computer science)                          
 Neo4j   | Well-known text                                   
 Neo4j   | List of JVM languages                             
 Neo4j   | List of software under the GNU AGPL               
 Neo4j   | DataNucleus                                       
 Neo4j   | NoSQL                                             
 Neo4j   | Graph database                                    
 Neo4j   | Comparison of structured storage software   


---



As mentioned before, we can also ask for edges in either direction, using either `--` or `<-->`.

In [10]:
data = graph.run("MATCH (a:Page {title: 'Neo4j'})<-[:Link]->(b:Page) RETURN a.title,b.title")

print(data.to_table())

 a.title | b.title                                           
---------|---------------------------------------------------
 Neo4j   | Apache Giraph                                     
 Neo4j   | Nikolaj Nyholm                                    
 Neo4j   | United States                                     
 Neo4j   | Remote backup service                             
 Neo4j   | Gremlin (programming language)                    
 Neo4j   | Malmö                                             
 Neo4j   | North America                                     
 Neo4j   | GPLv3                                             
 Neo4j   | Neo Technology                                    
 Neo4j   | OrientDB                                          
 Neo4j   | Sweden                                            
 Neo4j   | Java (programming language)                       
 Neo4j   | ArangoDB                                          
 Neo4j   | San Francisco Bay Area                      

But since we removed the link information, we can't tell which way each edge goes. Even with it, we'd have to compare id numbers, either by eye or with some code that processes the results.

While you can't ask which direction an edge is going (at least at the time of writing), you can ask which node an edge starts from using `STARTNODE()`. That lets us hack together a direction indicator and get a taste of what can be done with more complex queries. We'll also be experimenting with graph visualization in later notebooks.

In [11]:
data = graph.run("""
MATCH (a:Page {title: 'Neo4j'})<-[l:Link]->(b:Page)

RETURN a.title,
 CASE WHEN STARTNODE(l) = a THEN '-------->' ELSE '<--------' END AS direction,
 b.title""")

print(data.to_table())

 a.title | direction | b.title                                           
---------|-----------|---------------------------------------------------
 Neo4j   | --------> | Apache Giraph                                     
 Neo4j   | --------> | Nikolaj Nyholm                                    
 Neo4j   | --------> | United States                                     
 Neo4j   | --------> | Remote backup service                             
 Neo4j   | --------> | Gremlin (programming language)                    
 Neo4j   | --------> | Malmö                                             
 Neo4j   | --------> | North America                                     
 Neo4j   | --------> | GPLv3                                             
 Neo4j   | --------> | Neo Technology                                    
 Neo4j   | --------> | OrientDB                                          
 Neo4j   | --------> | Sweden                                            
 Neo4j   | --------> | Ja
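
Once the direction is a column in the result, ordinary Python can sort the rows into outgoing and incoming groups. The sketch below assumes each row is a dictionary keyed by column name (py2neo's `Cursor.data()` is expected to return rows in that shape); the function `group_by_direction` is hypothetical. The sample rows are hand-copied from titles that appear in this lab's outgoing and incoming outputs.

```python
from collections import defaultdict


def group_by_direction(rows):
    """Collect b.title values under each direction marker."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["direction"]].append(row["b.title"])
    return dict(groups)


# Sample rows typed in by hand; a real run would supply many more.
rows = [
    {"a.title": "Neo4j", "direction": "-------->", "b.title": "Apache Giraph"},
    {"a.title": "Neo4j", "direction": "<--------", "b.title": "Linkurious"},
    {"a.title": "Neo4j", "direction": "-------->", "b.title": "Graph database"},
    {"a.title": "Neo4j", "direction": "<--------", "b.title": "Graph database"},
]

groups = group_by_direction(rows)
print(groups["-------->"])  # ['Apache Giraph', 'Graph database']
print(groups["<--------"])  # ['Linkurious', 'Graph database']
```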

**That's it for this brief introduction to using Cypher and Neo4j.** 


# Save your Notebook, then `File > Close and Halt`

---