# Module 13: Graph Databases: Neo4j

## Graph Databases

- Store information as __nodes__ (i.e., entities) and __edges__ (i.e., relationships).  Every __node__ and __edge__ is assigned a unique identifier. See a simple example: https://towardsdatascience.com/graph-databases-whats-the-big-deal-ec310b1bc0ed#:~:text=Based%20upon%20the%20concept%20of,that%20expresses%20key%20value%20pairs


- In a graph database, we can define one or more __edges__ (a.k.a., __relationships__) between nodes. Within a graph-based data storage scheme, __nodes__ that are connected by __edges__ will physically "point" to one another, thereby enabling fast searches for information via related nodes.


- __Nodes__ contain __properties__, which are the attributes of the information stored by a given __node__. __Properties__ are typically comprised of __key:value__ pairs. Multiple __properties__ can be contained within a single __node__.


- __Nodes__ can be assigned one or more __labels__. Within a graph database, __labels__ let us treat similar __nodes__ as a "set", so that a graph search can be restricted to the nodes carrying a given label.


- One effective way to think about graph databases is to consider each __node__ as __a noun__, while __edges__ represent __verbs or actions that connect nouns to one another__. Each node within a graph database is represented by a unique identifier as well as one or more __properties__ that pertain to the person, place, or thing represented by the node. Each __edge__ within a graph database is also represented by a unique identifier and can contain one or more properties that describe the direction and characteristics of a relationship between two nodes.


- Since __nodes__ can have one or more __relationships__ (a.k.a. "__edges__") defined between them, graph databases are referred to as being __"Multidimensional"__ databases.


- For an overview of basic graph database terminology and concepts, see this link: https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-labels


- Graph database __performance tends to be very high__ since relationships between nodes do not need to be calculated at the time of a query (as is the case in a relational database). __Performance tends to remain relatively constant as the volume of information stored within a graph database increases__: query execution time is proportional only to the size of the part of the graph traversed to satisfy the query, rather than the overall size of the graph.


- Graphs are __naturally additive__, meaning we can add new kinds of relationships, new nodes, new properties, and new subgraphs to an existing graph structure without disturbing existing queries and application functionality.


- Graph databases are often used for __fraud detection__, __recommendation systems__, __social networks__, __logistics__, and __spatial data__, but usage is currently expanding to many other fields as well.


- Examples of graph database tools include the __Neo4j, InfiniteGraph, OrientDB__, and __FlockDB__ graph database systems.
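
The node, edge, property, and label concepts described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not how Neo4j or any other product stores data: the `Node` and `Edge` classes, the `connect` and `neighbors` helpers, and the counter used for identifiers are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # assumption: a simple counter stands in for unique identifiers


@dataclass(eq=False)
class Node:
    labels: set        # e.g. {"Person"}; a node may carry several labels
    properties: dict   # key:value pairs describing the entity
    id: int = field(default_factory=lambda: next(_ids))
    out_edges: list = field(default_factory=list)  # edges that start at this node


@dataclass(eq=False)
class Edge:
    type: str          # one relationship type per edge
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


def connect(start, rel_type, end, **properties):
    """Create a directed edge and register it on its start node."""
    edge = Edge(rel_type, start, end, properties)
    start.out_edges.append(edge)
    return edge


def neighbors(node, rel_type):
    """Follow the stored edge references; no search over all nodes is needed."""
    return [edge.end for edge in node.out_edges if edge.type == rel_type]


tom = Node({"Person"}, {"name": "Tom Hanks", "born": 1956})
atlas = Node({"Movie", "Book"}, {"title": "Cloud Atlas", "released": 2012})
acted = connect(tom, "ACTED_IN", atlas, roles=["Zachry"])

titles = [n.properties["title"] for n in neighbors(tom, "ACTED_IN")]
print(titles)  # ['Cloud Atlas']
```

Because each node keeps references to its own edges, finding related nodes starts from the node itself rather than from a scan of the whole data set, which is the idea behind the "pointing" behavior described in the bullets above.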

# Graph Databases vs. Relational Databases

- Relational databases can provide a high degree of data integrity if organized correctly. Tables represent __entities__; each table can contain one or more __attributes__; __relationships__ between tables are formed via explicit __primary key__ and __foreign key__ constraints and via shared unique data values across two or more tables.


- By contrast, in a graph database we instantiate __relationships__ between __nodes__ via __edges__ that provide us with direct access to the __properties__ contained within edge-connected __nodes__.


- In relational databases, __joins__ are computed at query time by matching primary keys, foreign keys, and/or other shared unique data values of the rows within the tables to be joined. These operations are computationally expensive, and their cost increases both as the tables grow and as more tables are joined, because more rows must be matched.


- To improve the performance of relational database __join__ statements, it is often necessary to __de-normalize__ the data to some extent, the effect of which is to diminish the overall integrity of the relational data model.


- In graph databases, each __node__ in the model explicitly contains a list of relationship records (a.k.a., __edges__) that represent the node's relationships to other nodes. These relationship records are __organized by type and direction__ and may hold additional attributes. When we run the equivalent of a JOIN operation within a graph database, the graph database uses these pre-built lists to __directly access__ connected __nodes__, thereby eliminating the need for expensive search-and-match computations.


- As a result of eliminating the need for expensive search-and-match computations (i.e., the type of computations required within a relational database model), __graph database performance can be several orders of magnitude more efficient than that of a relational database__ when performing join-intensive queries.


### Data Modeling Differences

An easy-to-follow example of the differences between a fully normalized RDBMS data model and a graph database model for the same data can be seen here: https://neo4j.com/developer/graph-db-vs-rdbms/
Note how within the RDBMS model a total of three tables must be searched and joined to identify all departments to which a person has been assigned, whereas within the graph model the needed information can be accessed directly from the node associated with the desired person. In effect, we've eliminated the need for any join operation since the department information is accessible __directly__ from the "person" node.
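
The person/department comparison above can be sketched in Python as follows. Everything in this block is an assumption made for illustration: the sample rows, the `BELONGS_TO` key, and both helper functions are hypothetical, and the nested-loop join is deliberately naive (a real RDBMS would normally use indexes or a hash join).

```python
# Relational form: three tables, linked only through key values.
person = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
department = [{"id": 10, "name": "IT"}, {"id": 20, "name": "Sales"}]
person_department = [
    {"person_id": 1, "department_id": 10},
    {"person_id": 1, "department_id": 20},
    {"person_id": 2, "department_id": 20},
]


def departments_via_join(person_name):
    """Match key values across all three tables (a naive nested-loop join)."""
    return [
        d["name"]
        for p in person if p["name"] == person_name
        for pd in person_department if pd["person_id"] == p["id"]
        for d in department if d["id"] == pd["department_id"]
    ]


# Graph form: the person node holds direct references to its department nodes.
it = {"labels": {"Department"}, "name": "IT"}
sales = {"labels": {"Department"}, "name": "Sales"}
alice = {"labels": {"Person"}, "name": "Alice", "BELONGS_TO": [it, sales]}


def departments_via_traversal(person_node):
    """Follow the stored references outward from the person node."""
    return [d["name"] for d in person_node["BELONGS_TO"]]


print(departments_via_join("Alice"))     # ['IT', 'Sales']
print(departments_via_traversal(alice))  # ['IT', 'Sales']
```

Both functions return the same answer; the difference is that the traversal touches only the nodes connected to Alice, while the join has to compare key values drawn from every table involved.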



# Graph Database Data Modeling

- Graph databases are not required to follow any pre-defined schema; we can add, delete, and modify __nodes__ and __relationships__ freely at any time and for any reason.


- Start by deciding whether you should model a given aspect of your data as a __node__ (i.e., one or more key:value pairs) or an __edge__ (i.e., a __relationship__ that exists between two nodes).


- Each __node__ within a graph model can contain one or more __properties__ that are defined as __key:value__ pairs, i.e., __nodes__ are somewhat analogous to RDBMS __table rows__, in that each row within an RDBMS table contains one or more __attributes__, with each attribute having a column name that is shared by all rows. Within a graph database, the RDBMS attribute / column name would become the __key__ component of a graph database property's __key:value__ pair.


- Each __node__ within a graph model can be tagged with one or more __labels__. We apply __labels__ to nodes for purposes of grouping similar nodes together as "sets" within our graph database searches. For example, in a graph database comprised of data on movies, we might have nodes representing actors, directors, movie names, movie studios, etc. If so, we could apply appropriate labels to each node, and then use those labels as part of our data retrieval queries. Note that multiple labels could be assigned to any given node, e.g., many actors have also directed films. For an expanded discussion of __labels__, see this link: https://medium.com/neo4j/graph-modeling-labels-71775ff7d121


- When defining __relationships__, try to craft the user-defined "types" of relationships you will instantiate within your graph database such that they represent common verbs or actions that reflect both the characteristics of the data you are modeling AND the types of queries you are likely to apply to the data. For example, within a graph database comprised of data on movies, we might have __relationships__ such as "acted in" or "directed". For an expanded discussion of __relationships__, see this link: https://medium.com/neo4j/graph-data-modeling-all-about-relationships-5060e46820ce


- Each __relationship__ we define between nodes can be of only one user defined relationship "type". However, __relationships__ may have one or more __properties__ associated with them.


- When defining __relationships__ it is also useful to consider the "direction" or __traversal__ in which your queries are likely to search the graph. For example, continuing the movie database example, if we wanted to identify all films in which a given actor had appeared, our graph __traversal__ would likely start by locating the node pertaining to the given actor, and then follow any "acted in" relationships to access the associated "movie" nodes.


- Remember: There is no "best" or "optimal" way in which to organize any data within a graph database; how we organize our data within a graph database should be dependent on our data retrieval needs, e.g., what sorts of queries do we plan to execute against the data?
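
As a rough illustration of the labeling and traversal points above, the sketch below treats each label as a set of node names and stores "acted in" relationships in the direction a typical query would follow them. The `label_index` and `acted_in` structures and the `add_node` helper are hypothetical names chosen for this example, and the sample data is intentionally tiny.

```python
from collections import defaultdict

label_index = defaultdict(set)  # label -> names of the nodes carrying that label
acted_in = defaultdict(list)    # actor name -> movie titles (outgoing "acted in" edges)


def add_node(name, *labels):
    """Register a node under every label assigned to it."""
    for label in labels:
        label_index[label].add(name)


add_node("Tom Hanks", "Actor", "Director")  # one node, two labels
add_node("Hugo Weaving", "Actor")
add_node("Cloud Atlas", "Movie")
add_node("That Thing You Do", "Movie")

acted_in["Tom Hanks"] += ["Cloud Atlas", "That Thing You Do"]
acted_in["Hugo Weaving"] += ["Cloud Atlas"]

# Labels behave like sets, so set operations answer grouping questions.
actor_directors = label_index["Actor"] & label_index["Director"]

# Traversal: start at the actor's node, then follow its "acted in" edges.
hanks_movies = acted_in["Tom Hanks"]

print(actor_directors)  # {'Tom Hanks'}
print(hanks_movies)     # ['Cloud Atlas', 'That Thing You Do']
```

If the most common question were instead "who appeared in this movie?", it could make sense to index the same relationships from the movie side as well, which is the kind of query-driven choice the final bullet describes.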

# Neo4j

- Neo4j is one of the most widely used graph database platforms due to its scalability, performance, and ease of use. Real-world applications of the Neo4j platform can be found within many large businesses, and its use continues to expand.


- As a company, Neo4j is highly user-oriented: The company has developed many effective data migration and API tools that enable the import of data from other types of database systems and computing environments into a Neo4j graph database. Some examples: https://neo4j.com/developer/data-import/


- Prospective users of Neo4j are required to learn __Cypher__, Neo4j's query language. __Cypher__ was designed to apply the logic of SQL statements within the context of a graph database. As such, SQL users should find the use of __Cypher__ to be relatively straightforward.


## Cypher


A detailed tutorial on __Cypher__ syntax can be seen here: https://www.tutorialspoint.com/neo4j/neo4j_cql_introduction.htm

Additional __Cypher__ reference materials are available from Neo4j: https://neo4j.com/developer/cypher/intro-cypher/


### Create a Neo4j Node

A useful narrated explanation of Cypher __CREATE__ syntax can be seen here: https://www.youtube.com/watch?v=3VHgmB0SPxQ

To __create a node having a single label__, we use the following syntax, wherein 'node' is simply a __temporary variable name__ we can use later within the same Cypher statement to refer to the node we've just created (the variable does not persist across separately executed statements), and 'label' is a __label__ we are applying to the new node for purposes of enabling grouping of the new node with other similarly labeled nodes:

- _CREATE (node:label)_ 

An example: 

- CREATE (cloudAtlas:Movie)

This Cypher command creates a new node with one __label__ ("Movie"). We've also created a __temporary variable name__ ("cloudAtlas") which we can use to reference the new node within the same Cypher statement.

To create a node having __multiple labels__:

- _CREATE (node:label1:label2:...labeln)_

An example: 

- CREATE (cloudAtlas:Movie:Book)

This Cypher command creates a new node with two __labels__ ("Movie" and "Book"). We've also created a __temporary variable name__ ("cloudAtlas") which we can use to reference the new node within the same Cypher statement.


To create a node with one label and one or more properties, the syntax is similar to that of a Python dictionary, i.e., we specify a series of key:value pairs separated by commas and enclosed within curly braces:

- _CREATE (node:label { key1: value, key2: value, . . . . . . . . .  })_ 

An example:

- CREATE (cloudAtlas:Movie { title:"Cloud Atlas", released:2012 })

This Cypher command creates a new node with one __label__ ("Movie") and two __properties__ (title:"Cloud Atlas" and released:2012). We've also created a __temporary variable name__ ("cloudAtlas") which we can use to reference the new node within the same Cypher statement.

One more example:

- CREATE (tom:Person { name:"Tom Hanks", born:1956 })

This Cypher command creates a new node with one __label__ ("Person") and two __properties__ (name:"Tom Hanks" and born:1956). We've also created a __temporary variable name__ ("tom") which we can use to reference the new node within the same Cypher statement.
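
When CREATE statements are generated from Python, one possible approach is to build the statement text from the variable name and labels while passing the property values separately as a parameter map. The sketch below is a minimal illustration under that assumption; `build_create_node` and the `$props` parameter name are hypothetical choices for this example, not part of any Neo4j library.

```python
import re

# Conservative pattern for names that are interpolated into the query text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_create_node(variable, labels, properties):
    """Return (query, parameters) for a CREATE statement.

    The variable name and labels become part of the query text, so they are
    validated first; property values travel separately as a parameter map.
    """
    for name in [variable, *labels]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a simple identifier: {name!r}")
    label_part = "".join(f":{label}" for label in labels)
    query = f"CREATE ({variable}{label_part} $props)"
    return query, {"props": dict(properties)}


query, params = build_create_node(
    "cloudAtlas", ["Movie"], {"title": "Cloud Atlas", "released": 2012}
)
print(query)   # CREATE (cloudAtlas:Movie $props)
print(params)  # {'props': {'title': 'Cloud Atlas', 'released': 2012}}
```

Keeping the values out of the query text avoids quoting problems with strings such as titles that contain quotation marks; the resulting pair would be handed to whichever driver is used to execute the statement.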


### Create a Neo4j Relationship

Now that we've created two nodes in the above examples (one pertaining to a movie; one pertaining to an actor), we can define a __relationship__ between them. The Cypher syntax required for creating a __relationship__ is as follows:

- _CREATE (node1)-[:RelationshipType]->(node2)_

As per the syntax, we start by specifying the node at which the relationship originates (i.e., the starting point for a __traversal__ along this relationship), followed by the __-[:RelationshipType]->__ component of the syntax, followed by the node at which the relationship ends. Within the __-[:RelationshipType]->__ component of the syntax we define a name for the type of relationship we are establishing. As noted above, the __relationship types__ we establish within our graphs should represent common verbs or actions that reflect both the characteristics of the data we are modeling AND the types of queries we are likely to apply to the data.

Continuing our example from above, let's establish a __relationship__ between our movie and our actor that is indicative of the fact that the given actor was a cast member of the given movie:

- CREATE (tom)-[:ACTED_IN { roles: ['Zachry']}]->(cloudAtlas)

This Cypher command establishes a relationship type of __ACTED_IN__ between __person__ "tom" and __movie__ "cloudAtlas", and we are assigning one property to our new relationship (__{roles: ['Zachry']}__). Therefore, we are indicating that actor Tom Hanks __ACTED_IN__ the movie Cloud Atlas, wherein he played the role of a character by the name of "Zachry".
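
Variables such as "tom" and "cloudAtlas" are visible only inside the Cypher statement that defines them, so the relationship above needs to be created in the same statement as the two nodes. If the nodes already exist from earlier statements, one common pattern is to look them up with MATCH first and then create the relationship. The Python sketch below builds such a statement; the helper name `build_acted_in` and the parameter names are assumptions made for this example.

```python
def build_acted_in(name, title, roles):
    """Return (query, parameters) linking an existing Person to an existing Movie."""
    query = (
        "MATCH (p:Person {name: $name}), (m:Movie {title: $title}) "
        "CREATE (p)-[:ACTED_IN {roles: $roles}]->(m)"
    )
    return query, {"name": name, "title": title, "roles": list(roles)}


query, params = build_acted_in("Tom Hanks", "Cloud Atlas", ["Zachry"])
print(query)
print(params)  # {'name': 'Tom Hanks', 'title': 'Cloud Atlas', 'roles': ['Zachry']}
```

The MATCH clause binds "p" and "m" to the stored nodes, and the CREATE clause then adds only the relationship, rather than creating two new empty nodes.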



### Create a Small Graph: Movie Info

An example of using a series of Cypher commands to instantiate six nodes + some useful relationships between those six nodes:

https://neo4j.com/docs/getting-started/current/cypher-intro/schema/#cypher-intro-schema-example-graph

Note how the temporary variable names assigned to the nodes are used repeatedly throughout the sequence of Cypher commands.

## Retrieving Data from a Neo4j Graph

Within the Cypher language we make use of the __MATCH__ command for purposes of locating and retrieving data from a Neo4j graph. The generic syntax is as follows:

- _MATCH (result:label)<-[:RelationshipType]-(n)_

- _RETURN result;_

The __RETURN__ command is required for purposes of conveying the retrieved information to your programming environment or Neo4j terminal application.

An example: 

- MATCH (actor:Person { name: "Tom Hanks" })

- RETURN actor;

In the above example, we are limiting our search to a specific node within the nodes labeled as "Person". 

Another example:

- MATCH (:Person {name: 'Robert Zemeckis'})-[:DIRECTED]->(movie:Movie)

- RETURN movie

In this example, we are searching for all movies directed by Robert Zemeckis.
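
To make the meaning of such a pattern concrete, the following Python sketch emulates the Robert Zemeckis query over a small in-memory graph. It is only an illustration of what the pattern selects: the `nodes` and `relationships` structures and the `match` function are hypothetical, and the function scans every relationship, which a graph database would avoid by starting from the matching node.

```python
nodes = {
    1: {"labels": {"Person"}, "name": "Robert Zemeckis"},
    2: {"labels": {"Person"}, "name": "Tom Hanks"},
    3: {"labels": {"Movie"}, "title": "Forrest Gump"},
    4: {"labels": {"Movie"}, "title": "Cast Away"},
}
# Each relationship: (start node id, relationship type, end node id)
relationships = [
    (1, "DIRECTED", 3),
    (1, "DIRECTED", 4),
    (2, "ACTED_IN", 3),
    (2, "ACTED_IN", 4),
]


def match(start_label, start_props, rel_type, end_label):
    """Emulate MATCH (:start_label {props})-[:rel_type]->(x:end_label) RETURN x."""
    results = []
    for start_id, r_type, end_id in relationships:
        start, end = nodes[start_id], nodes[end_id]
        if (
            r_type == rel_type
            and start_label in start["labels"]
            and end_label in end["labels"]
            and all(start.get(k) == v for k, v in start_props.items())
        ):
            results.append(end)
    return results


movies = match("Person", {"name": "Robert Zemeckis"}, "DIRECTED", "Movie")
print([m["title"] for m in movies])  # ['Forrest Gump', 'Cast Away']
```

Each part of the pattern acts as a filter: the label and property map restrict the starting node, the relationship type and arrow restrict which edges are followed, and the second label restricts the nodes that are returned.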

### Other Cypher Commands

The Cypher language contains many commands that are very similar to those found in other query languages, including __WHERE, ORDER BY, LIMIT, DELETE__, etc. A thorough introduction can be found here: https://www.tutorialspoint.com/neo4j/neo4j_cql_introduction.htm

# Migrating Relational Database Content to Neo4j

When deriving a graph model from a relational model, use the following general guidelines:

- An RDBMS __table row__ is analogous to a Neo4j __node__: Each __row__ becomes a __node__ within the graph database.


- Within a Neo4j __node__, the attributes contained within the RDBMS __table row__ become __properties__ of the __node__.


- An RDBMS __table name__ is analogous to a Neo4j __label name__: Each __table name__ becomes a __label__ for the nodes that are derived from the original RDBMS table. __Nodes__ can have more than one __label__.


- An RDBMS __join__ or __foreign key__ is analogous to a Neo4j __relationship__ (a.k.a., __"edge"__). In a graph database, we instantiate __relationships__ between __nodes__ explicitly within the nodes themselves.
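
The guidelines above can be sketched as a small Python transformation. The table contents, the join table `acted_in_rows`, and the decision to drop the surrogate `id` column from the node properties are all assumptions made for this illustration; an actual migration would normally rely on Neo4j's import tooling.

```python
tables = {
    "Person": [{"id": 1, "name": "Tom Hanks", "born": 1956}],
    "Movie": [{"id": 7, "title": "Cloud Atlas", "released": 2012}],
}
# A join table: each row pairs a Person key with a Movie key.
acted_in_rows = [{"person_id": 1, "movie_id": 7}]

nodes = {}
for table_name, rows in tables.items():
    for row in rows:
        # Table name -> label; row -> node; remaining columns -> properties.
        nodes[(table_name, row["id"])] = {
            "labels": {table_name},
            "properties": {k: v for k, v in row.items() if k != "id"},
        }

relationships = [
    # Each foreign-key pair becomes one explicit relationship between two nodes.
    (("Person", r["person_id"]), "ACTED_IN", ("Movie", r["movie_id"]))
    for r in acted_in_rows
]

print(nodes[("Person", 1)]["properties"])  # {'name': 'Tom Hanks', 'born': 1956}
print(relationships)  # [(('Person', 1), 'ACTED_IN', ('Movie', 7))]
```

Note that the join table itself does not become a set of nodes here; its rows exist only to connect two entities, so each one is turned into a relationship instead.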

A guide on how to map an RDBMS schema to a graph database schema can be seen here: https://neo4j.com/developer/relational-to-graph-modeling/

A guide to data modeling in Neo4j can be seen here: https://neo4j.com/developer/guide-data-modeling/

A guide to importing data into Neo4j from a relational database can be seen here: https://neo4j.com/developer/guide-importing-data-and-etl/

## Accessing Neo4j From a Python Environment

Two widely used Neo4j APIs are available for use within a Python environment: __Neo4j-python-driver__ and __Py2Neo__.

### Neo4j-python-driver

Installation instructions for the Anaconda-compatible __Neo4j-python-driver__ API can be seen here:

https://anaconda.org/conda-forge/neo4j-python-driver

A Python shell example of using the __Neo4j-python-driver__ API can be seen here:

https://towardsdatascience.com/neo4j-cypher-python-7a919a372be7
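
As a minimal sketch of what a query through the driver might look like, the block below wraps the earlier "directed by" query in a function that accepts an open session. The function names, the connection URI, and the credentials are placeholders assumed for this example; `main()` additionally requires the `neo4j` package to be installed and a Neo4j server to be running, so it is defined here but not called.

```python
QUERY = (
    "MATCH (:Person {name: $name})-[:DIRECTED]->(movie:Movie) "
    "RETURN movie.title AS title"
)


def movies_directed_by(session, name):
    """Run the query on an open session and collect the returned titles."""
    result = session.run(QUERY, name=name)
    return [record["title"] for record in result]


def main():
    # Placeholder URI and credentials; replace with your own server's values.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            print(movies_directed_by(session, "Robert Zemeckis"))
    finally:
        driver.close()
```

Passing the director's name as the `$name` parameter, rather than formatting it into the query string, keeps the Cypher text fixed and leaves the handling of the value to the driver.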


### Py2Neo

Installation instructions for the Anaconda-compatible __Py2Neo__ API can be seen here:

https://anaconda.org/conda-forge/py2neo

A Jupyter Notebook example of using the Py2Neo API to interact with a Neo4j database can be seen here:

https://medium.com/@technologydata25/connect-neo4j-to-jupyter-notebook-c178f716d6d5