# Py2Neo Quickstart

Neo4j is one of the most flexible and powerful database engines available. It lets you query information, model complex relationships, and detect trends in ways that are awkward or impractical with a standard relational database. That said, its documentation is often disjointed and opaque, which can make it frustrating to work with, particularly if you want to connect to a graph from Python or R. Major changes are often implemented without updates to the official guides, and many of the answers on Stack Overflow are out of date as a result. I love this platform, but using it isn't exactly straightforward.

The notebook below is a simple guide to connecting to a graph with the latest version of py2neo (at the time of writing), loading data into the graph, and reading results back out with Cypher queries.

If you have any questions (or if you find better documentation) hit me up at gray.ian.hunter@gmail or on Twitter @IanHGray.

In [3]:
#Import dependencies
from py2neo import Graph, Node, Relationship
import pandas as pd

In [4]:
#Open a connection to the Neo4j graph over Bolt. In this case, we're using Neo4j Desktop.
graph = Graph("bolt://localhost:7687", auth=('neo4j', 'PASSWORD_HERE'))
#Start from an empty graph. Careful: this deletes every node and relationship in the database.
graph.delete_all()
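
If you'd rather not type the password into the notebook, one option is to read the connection settings from environment variables. This is only a sketch: the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and both helper functions are my own assumptions, not anything py2neo requires.

```python
import os


def neo4j_settings(env=os.environ):
    """Read connection settings from environment variables.

    The variable names are assumptions made for this sketch.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting")
    return uri, (user, password)


def connect(env=os.environ):
    """Open a py2neo Graph using the settings above."""
    # Imported here so neo4j_settings() can be used without py2neo installed.
    from py2neo import Graph

    uri, auth = neo4j_settings(env)
    return Graph(uri, auth=auth)
```

With those variables set, `graph = connect()` would stand in for the hard-coded `Graph(...)` call above.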

Next, create some dummy data. In this case, we're looking at classic rock bands that have considerable overlap
in their membership. Fun fact: Eric Clapton is the only three-time inductee into the Rock & Roll Hall of Fame.
He was inducted for his work with the Yardbirds (an often-overlooked precursor to Led Zeppelin), with Cream, and as a solo artist.

In [5]:
dataframe = pd.DataFrame({'Musician':['Eric Clapton','Jeff Beck','Jimmy Page',\
 'Eric Clapton','Ginger Baker','Jack Bruce','Jeff Beck','Tim Bogert','Carmine Appice',\
 'Jimmy Page','Robert Plant','John Paul Jones','John Bonham'],

'Band':['Yardbirds','Yardbirds','Yardbirds','Cream','Cream','Cream',\
'Jeff Beck Group','Jeff Beck Group','Jeff Beck Group','Led Zeppelin','Led Zeppelin',\
'Led Zeppelin','Led Zeppelin'],
                             
'Instrument':['Guitar','Guitar','Guitar','Guitar','Drums','Bass',\
'Guitar','Bass','Drums','Guitar','Vocal','Keyboard','Drums']})

#Let's print the dataframe to see how it looks in tabular form.
print (dataframe)

           Musician             Band Instrument
0      Eric Clapton        Yardbirds     Guitar
1         Jeff Beck        Yardbirds     Guitar
2        Jimmy Page        Yardbirds     Guitar
3      Eric Clapton            Cream     Guitar
4      Ginger Baker            Cream      Drums
5        Jack Bruce            Cream       Bass
6         Jeff Beck  Jeff Beck Group     Guitar
7        Tim Bogert  Jeff Beck Group       Bass
8    Carmine Appice  Jeff Beck Group      Drums
9        Jimmy Page     Led Zeppelin     Guitar
10     Robert Plant     Led Zeppelin      Vocal
11  John Paul Jones     Led Zeppelin   Keyboard
12      John Bonham     Led Zeppelin      Drums


Neo4j lets us declare uniqueness constraints on node properties, so that no two nodes with the same label
can share the same value for that property. We declare them explicitly here because each musician and each band
should exist only once in the graph, even though the same musician can be connected to several bands.

In [6]:
graph.schema.create_uniqueness_constraint('musician', 'musician_name')
graph.schema.create_uniqueness_constraint('band', 'band_name')

Because this is a relatively small dataset, we can iterate over its elements to insert into the graph.
For larger datasets, it may be necessary to use Neo4j's native data loading procedures.

In [7]:
for index, rows in dataframe.iterrows():
    musician_name = rows['Musician']
    band_name = rows['Band']
    instrument = rows['Instrument']
    
    #Create a Node object with information about the musician. Note that in this newer version of
    #py2neo, we have to explicitly declare __primarylabel__ and __primarykey__, which tell merge
    #which label and property identify the node.
    musician = Node('musician', musician_name = musician_name, instrument = instrument)
    musician.__primarylabel__='musician'
    musician.__primarykey__='musician_name'
    
    #Create a Node object for the band
    band = Node('band', band_name = band_name)
    band.__primarylabel__='band'
    band.__primarykey__='band_name'
    
    #Create a Relationship object, describing the connection
    PLAYED_IN = Relationship.type("PLAYED_IN")
    
    #Now insert nodes and relationship to the graph
    graph.merge(PLAYED_IN(musician, band), 'musician_name', 'band_name')
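
For larger datasets, merging one row at a time like this gets slow. Short of Neo4j's native loaders, a common middle ground is to send rows in batches and let Cypher's `UNWIND` expand each batch server-side. Here is a rough sketch of that idea. The helper names, the query text and the batch size are my own choices, and the `graph` argument is assumed to be the py2neo `Graph` opened earlier (anything with a compatible `run` method will do).

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each at most `size` long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# One statement per batch: UNWIND turns the list parameter into rows.
LOAD_QUERY = """
UNWIND $rows AS row
MERGE (m:musician {musician_name: row.Musician})
SET m.instrument = row.Instrument
MERGE (b:band {band_name: row.Band})
MERGE (m)-[:PLAYED_IN]->(b)
"""


def load_in_batches(graph, rows, batch_size=1000):
    """Run LOAD_QUERY once per batch and return the number of batches sent."""
    batches = 0
    for batch in chunked(rows, batch_size):
        # Keyword arguments to run() are passed to Cypher as query parameters.
        graph.run(LOAD_QUERY, rows=batch)
        batches += 1
    return batches
```

The rows can come straight from pandas, for example `load_in_batches(graph, dataframe.to_dict('records'))`, since the query expects the same `Musician`, `Band` and `Instrument` keys as the dataframe columns.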

Now that the data is stored in Neo4j, we can run Cypher queries against it and return the results as
a pandas DataFrame. In this case, we'll look for musicians who played in both the Yardbirds and Cream.

In [8]:
neo_df = pd.DataFrame(graph.run("""
    MATCH (m:musician)-[:PLAYED_IN]->(y:band {band_name:'Yardbirds'})
    MATCH (m)-[:PLAYED_IN]->(c:band {band_name:'Cream'})
    RETURN m.musician_name, m.instrument
""").data())
print (neo_df)

  m.instrument m.musician_name
0       Guitar    Eric Clapton
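
The band names in the query above are hard-coded. If you want to reuse it for other pairs of bands, Cypher parameters are a safer bet than string formatting. Here is a small sketch: the function name and the parameter names `first_band` and `second_band` are made up for illustration, and `graph` is assumed to be the py2neo `Graph` from earlier.

```python
BOTH_BANDS_QUERY = """
MATCH (m:musician)-[:PLAYED_IN]->(:band {band_name: $first_band})
MATCH (m)-[:PLAYED_IN]->(:band {band_name: $second_band})
RETURN m.musician_name AS musician, m.instrument AS instrument
"""


def musicians_in_both(graph, first_band, second_band):
    """Return a list of dicts, one per musician linked to both bands."""
    cursor = graph.run(
        BOTH_BANDS_QUERY, first_band=first_band, second_band=second_band
    )
    return cursor.data()
```

Calling `pd.DataFrame(musicians_in_both(graph, 'Yardbirds', 'Cream'))` should then give the same kind of table as the cell above, with the `AS` aliases as column names.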


The result is Eric Clapton, because he's the best. But what if we want to add data from another source, which
pairs the bands with the albums they released? 

In [10]:
albums_df = pd.DataFrame({'album':['For Your Love', 'Having a Rave Up with The Yardbirds',\
                                   'Yardbirds', 'Over Under Sideways Down', 'Little Games',\
                                 'Fresh Cream', 'Disraeli Gears', 'Wheels of Fire', 'Goodbye',\
                                 'Led Zeppelin', 'Led Zeppelin II', 'Led Zeppelin III',\
                                   'Led Zeppelin IV', 'Houses of the Holy', 'Physical Graffiti',\
                                 'Presence', 'In Through the Out Door', 'Truth', 'Beck-Ola', 'Rough and Ready',\
                                  'Jeff Beck Group'],
                        'year':[1965, 1965, 1966, 1966, 1967, 1966, 1967, 1968, 1969,\
                               1969, 1969, 1970, 1971, 1973, 1975, 1976, 1979, 1968, 1969, 1971, 1972],
                          
                        'band':['Yardbirds', 'Yardbirds', 'Yardbirds', 'Yardbirds', 'Yardbirds',\
                               'Cream', 'Cream', 'Cream', 'Cream', 'Led Zeppelin', 'Led Zeppelin',\
                                'Led Zeppelin', 'Led Zeppelin', 'Led Zeppelin', 'Led Zeppelin',\
                               'Led Zeppelin', 'Led Zeppelin', 'Jeff Beck Group', 'Jeff Beck Group',\
                               'Jeff Beck Group', 'Jeff Beck Group']})
print(albums_df)

                                  album  year             band
0                         For Your Love  1965        Yardbirds
1   Having a Rave Up with The Yardbirds  1965        Yardbirds
2                             Yardbirds  1966        Yardbirds
3              Over Under Sideways Down  1966        Yardbirds
4                          Little Games  1967        Yardbirds
5                           Fresh Cream  1966            Cream
6                        Disraeli Gears  1967            Cream
7                        Wheels of Fire  1968            Cream
8                               Goodbye  1969            Cream
9                          Led Zeppelin  1969     Led Zeppelin
10                      Led Zeppelin II  1969     Led Zeppelin
11                     Led Zeppelin III  1970     Led Zeppelin
12                      Led Zeppelin IV  1971     Led Zeppelin
13                   Houses of the Holy  1973     Led Zeppelin
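14                    Physical Graffiti  1975     Led Zeppelin
15                             Presence  1976     Led Zeppelin
16              In Through the Out Door  1979     Led Zeppelin
17                                Truth  1968  Jeff Beck Group
18                             Beck-Ola  1969  Jeff Beck Group
19                      Rough and Ready  1971  Jeff Beck Group
20                      Jeff Beck Group  1972  Jeff Beck Group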

We can merge the above dataset into the graph using the same technique as before.
The MERGE operation checks whether a matching node already exists. If it does, the new data is connected
to that node; if it doesn't, a new node is created.

In [11]:
graph.schema.create_uniqueness_constraint('album', 'album_name')

for index, rows in albums_df.iterrows():
    album_name = rows['album']
    band_name = rows['band']
    year = rows['year']
    
    band = Node('band', band_name = band_name)
    band.__primarylabel__='band'
    band.__primarykey__='band_name'
    
    album = Node('album', album_name = album_name, year = year)
    album.__primarylabel__='album'
    album.__primarykey__='album_name'
    
    RELEASED = Relationship.type("RELEASED")
    
    graph.merge(RELEASED(band, album), 'band_name', 'album_name')
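
If the get-or-create behaviour of MERGE is new to you, here is a toy model of it in plain Python. To be clear, this is not how py2neo or Neo4j implement it. It's just a dictionary keyed on a label and a key property, and `merge_node` is a made-up helper.

```python
def merge_node(store, label, key, properties):
    """Toy get-or-create keyed on (label, properties[key]).

    Returns (node, created). An existing node is updated in place.
    """
    identity = (label, properties[key])
    created = identity not in store
    node = store.setdefault(identity, {})
    node.update(properties)
    return node, created


store = {}
_, created_first = merge_node(store, "band", "band_name", {"band_name": "Cream"})
_, created_again = merge_node(store, "band", "band_name", {"band_name": "Cream"})
print(created_first, created_again, len(store))  # True False 1
```

The second call finds the existing "Cream" entry instead of adding a duplicate, which is why the bands loaded earlier are reused when the albums are attached to them.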

The final product should look like this:

In [16]:
%%html
<img src="yardbirds_network.png" width="1000" height="800">

Using Cypher, we can calculate basic statistics about the network, for instance, the musicians who have the greatest number
of connections within the dataset:

In [13]:
degree_df = pd.DataFrame(graph.run("MATCH (m:musician)\
                            RETURN m.musician_name, size((m)-[]-()) AS degree\
                            ORDER BY degree DESC").data())
print(degree_df)

   degree  m.musician_name
0       2     Eric Clapton
1       2        Jeff Beck
2       2       Jimmy Page
3       1     Ginger Baker
4       1       Jack Bruce
5       1       Tim Bogert
6       1   Carmine Appice
7       1     Robert Plant
8       1  John Paul Jones
9       1      John Bonham


As this dataset is focused on the Yardbirds and its subsequent acts, the above results make sense. Each of the three Yardbirds guitarists in the dataset (so not counting the New Yardbirds/proto-Led Zeppelin era) has a degree of 2 because he appears in two bands, while every other musician has a degree of 1.
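
As a sanity check, the same numbers can be worked out without the graph at all. In this dataset the musicians only have PLAYED_IN relationships, so a musician's degree is just the number of distinct bands they appear in. The sketch below counts that in plain Python. `band_counts` is a made-up helper, and the `pairs` list is a hand-copied subset of the musician table.

```python
from collections import defaultdict


def band_counts(pairs):
    """Map each musician to the number of distinct bands they appear in."""
    bands = defaultdict(set)
    for musician, band in pairs:
        bands[musician].add(band)
    return {musician: len(names) for musician, names in bands.items()}


# A subset of the (Musician, Band) rows from the first dataframe.
pairs = [
    ("Eric Clapton", "Yardbirds"), ("Eric Clapton", "Cream"),
    ("Jeff Beck", "Yardbirds"), ("Jeff Beck", "Jeff Beck Group"),
    ("Jimmy Page", "Yardbirds"), ("Jimmy Page", "Led Zeppelin"),
    ("Ginger Baker", "Cream"),
]
counts = band_counts(pairs)
print(counts["Eric Clapton"], counts["Ginger Baker"])  # 2 1
```

To run it over the full table, pass `zip(dataframe['Musician'], dataframe['Band'])` instead of the hand-written list.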

This is a deliberately simple example, but it should demonstrate how Python interacts with the most recent versions of Neo4j and py2neo.

Also, go check out the Yardbirds.