The Neo4j REST interface
========================

We generally want to use a graph database with large and unwieldy datasets - that is more or less the point of the thing. This means that we usually won't be creating or maintaining a graph exclusively, or even primarily, by typing Cypher by hand. It also means that we need a way to get at the data without going through the user interface, pretty as it is.

Neo4j has [a REST API](http://neo4j.com/docs/stable/rest-api.html) - this means that you can use URLs and the `requests` library to do more or less anything you like!

In [1]:
import requests

r = requests.get('http://localhost:7474/db/data/')
r.status_code

401

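A 401 status means "Unauthorized", so we have to authenticate first. The request needs an HTTP Basic `Authorization` header built from the username and password; see the docs.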

In [2]:
import base64

authstring = 'Basic ' + base64.b64encode(
    b"neo4j:YOURPASSWORDGOESHERE").decode('ascii')


r = requests.get('http://localhost:7474/db/data/', 
                 headers={'Authorization': authstring})
r.status_code

200
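
The header construction above can be wrapped in a small helper so that the credentials are assembled in one place. This is only a minimal sketch: the function name `basic_auth_header` is made up for illustration, and the credentials are placeholders. The `requests` library can also build the same header by itself if you pass `auth=('neo4j', password)` instead of `headers=`.

```python
import base64

def basic_auth_header(user, password):
    """Return the value of an HTTP Basic 'Authorization' header."""
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

# Placeholder credentials, for illustration only.
headers = {"Authorization": basic_auth_header("neo4j", "YOURPASSWORDGOESHERE")}
```

The resulting `headers` dictionary can be passed to `requests.get` or `requests.post` exactly as in the cell above.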

In [3]:
r.json()

{'batch': 'http://localhost:7474/db/data/batch',
 'constraints': 'http://localhost:7474/db/data/schema/constraint',
 'cypher': 'http://localhost:7474/db/data/cypher',
 'extensions': {},
 'extensions_info': 'http://localhost:7474/db/data/ext',
 'indexes': 'http://localhost:7474/db/data/schema/index',
 'neo4j_version': '2.3.3',
 'node': 'http://localhost:7474/db/data/node',
 'node_index': 'http://localhost:7474/db/data/index/node',
 'node_labels': 'http://localhost:7474/db/data/labels',
 'relationship_index': 'http://localhost:7474/db/data/index/relationship',
 'relationship_types': 'http://localhost:7474/db/data/relationship/types',
 'transaction': 'http://localhost:7474/db/data/transaction'}

The Neo4j interface was designed to be *discoverable*, meaning that you can start at the root URL and the response tells you where else you can go. Here we can see URLs for the node labels and relationship types, for the node and relationship indexes, for running Cypher, and so on.
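
Because every entry point is just a URL in the root document, a few lines of Python are enough to list them. The sketch below works on a shortened copy of the response shown above rather than on a live server; the helper name `discoverable_links` is an assumption made for this example.

```python
def discoverable_links(service_root):
    """Return only the entries of a service root document that are URLs."""
    return {
        name: value
        for name, value in service_root.items()
        if isinstance(value, str) and value.startswith("http")
    }

# A shortened copy of the service root shown above.
service_root = {
    "extensions": {},
    "neo4j_version": "2.3.3",
    "node_labels": "http://localhost:7474/db/data/labels",
    "relationship_types": "http://localhost:7474/db/data/relationship/types",
    "transaction": "http://localhost:7474/db/data/transaction",
}

for name, url in sorted(discoverable_links(service_root).items()):
    print(name, "->", url)
```

With a running database, `service_root` would simply be the `r.json()` result from the authenticated request.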

In [4]:
r = requests.get('http://localhost:7474/db/data/labels', 
                 headers={'Authorization': authstring})
r.json()

['Movie', 'BOOK', 'Person', 'PERSON', 'BORROWER', 'AUTHOR']

You can execute Cypher statements through the REST API. Reload the Hollywood database if you have deleted it, to get some data in there.

In [6]:
import json

cypher_stmt = 'MATCH (cloudAtlas {title: "Cloud Atlas"}) RETURN cloudAtlas'

payload = json.dumps(
    {'statements': [
        { 'statement': cypher_stmt, "resultDataContents" : [ "REST" ] }
    ]})

r = requests.post('http://localhost:7474/db/data/transaction/commit', 
                  headers={'Authorization': authstring}, data=payload)
r.json()

{'errors': [],
 'results': [{'columns': ['cloudAtlas'],
   'data': [{'rest': [{'all_relationships': 'http://localhost:7474/db/data/node/105/relationships/all',
       'all_typed_relationships': 'http://localhost:7474/db/data/node/105/relationships/all/{-list|&|types}',
       'create_relationship': 'http://localhost:7474/db/data/node/105/relationships',
       'data': {'released': 2012,
        'tagline': 'Everything is connected',
        'title': 'Cloud Atlas'},
       'incoming_relationships': 'http://localhost:7474/db/data/node/105/relationships/in',
       'incoming_typed_relationships': 'http://localhost:7474/db/data/node/105/relationships/in/{-list|&|types}',
       'labels': 'http://localhost:7474/db/data/node/105/labels',
       'metadata': {'id': 105, 'labels': ['Movie']},
       'outgoing_relationships': 'http://localhost:7474/db/data/node/105/relationships/out',
       'outgoing_typed_relationships': 'http://localhost:7474/db/data/node/105/relationships/out/{-list|&|types}'
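
The part of this response we usually care about, the node's own properties, sits several levels down: `results`, then `data`, then `rest`, then `data` again. As a rough sketch of what unpacking it by hand involves, the hypothetical helper `node_properties` below walks that nesting. It is run on a hand-shortened copy of the response above, keeping only the keys it needs.

```python
def node_properties(response):
    """Collect the property dictionaries of all nodes in a response."""
    found = []
    for result in response.get("results", []):
        for row in result.get("data", []):
            for item in row.get("rest", []):
                # Node entries are dictionaries with a 'data' key that
                # holds the properties; skip anything else (e.g. scalars).
                if isinstance(item, dict) and "data" in item:
                    found.append(item["data"])
    return found

# Shortened copy of the response above (the URL entries are left out).
sample = {
    "errors": [],
    "results": [{
        "columns": ["cloudAtlas"],
        "data": [{"rest": [{
            "data": {"released": 2012,
                     "tagline": "Everything is connected",
                     "title": "Cloud Atlas"},
            "metadata": {"id": 105, "labels": ["Movie"]},
        }]}],
    }],
}

print(node_properties(sample)[0]["title"])  # prints Cloud Atlas
```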

But this is quickly getting rather complicated! We will come back to the REST interface to discuss traversals, but in the meantime there is a library to handle this REST interface for us. It is called py2neo and [its documentation is here](http://py2neo.org/v3/).

In [None]:
!pip install py2neo==3b2

In [9]:
from py2neo import Graph
db = Graph(password="YOURPASSWORDGOESHERE")
result = db.run(cypher_stmt)

cloudatlas = result.evaluate()
cloudatlas["title"] 

'Cloud Atlas'

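Here is how you add a node to the graph. A `Node` starts out as a purely local Python object, and it only exists in the database once the transaction that creates it has been committed: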

In [10]:
from py2neo import Node, Relationship

transaction = db.begin()
jennifer = Node("Person", name="Jennifer Lawrence", born=1990)
jennifer

(jennifer_lawrence:Person {born:1990,name:"Jennifer Lawrence"})

In [11]:
db.exists(jennifer)

False

In [12]:
transaction.create(jennifer)
db.exists(jennifer)

False

In [13]:
transaction.commit()
db.exists(jennifer)

True

And here is how you add a relationship.

In [None]:
hungergames = Node("Movie", title="The Hunger Games", released=2012)
transaction = db.begin()
transaction.create(hungergames)
katniss = Relationship(jennifer, 'ACTED_IN', hungergames, 
                       roles=['Katniss Everdeen'])
transaction.create(katniss)
transaction.commit()

Now we can put this together with what we have learned about Python to create a lot of nodes at once. IMDB makes its data freely available as a bunch of text files, and we know how to read text files. We also know how to create nodes. So let's give it a try.

The "actors" and "actresses" lists come with a helpful set of guidelines about their own format:

    "xxxxx"        = a television series
    "xxxxx" (mini) = a television mini-series
    [xxxxx]        = character name
    <xx>           = number to indicate billing position in credits
    (TV)           = TV movie, or made for cable movie
    (V)            = made for video movie (this category does NOT include TV 
                     episodes repackaged for video, guest appearances in 
                     variety/comedy specials released on video, or 
                     self-help/physical fitness videos)

and a single entry looks like this:

	Lawrence, Jennifer (III)	14 Actors Acting (2010) (V)  [Herself]
							16th Annual Critics' Choice Movie Awards (2011) (TV)  [Herself - Presenter]
							17th Annual Screen Actors Guild Awards (2011) (TV)  [Herself]
							18th Annual Critics' Choice Movie Awards (2013) (TV)  [Herself]
							19th Annual Critics' Choice Movie Awards (2014) (TV)  (credit only)  [Herself - Nominee]  <25>
                            
so we have some hints for interpreting it. Anything that is in quotes, or followed by (TV), is a TV show, and anything followed by (V) was made for video. Anything else is a movie, and for now let's just care about movies.
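
To make those format rules concrete, here is a hedged sketch of how a single credit could be picked apart with a regular expression. The names `CREDIT_RE` and `parse_credit` are invented for this example, and it assumes that every credit contains a plain four-digit year in parentheses; other year forms in the real files would need extra handling.

```python
import re

CREDIT_RE = re.compile(r"""
    ^(?P<title>.+?)          # title (lazy, so it stops at the first year)
    \s+\((?P<year>\d{4})\)   # release year in parentheses
    (?P<rest>.*)$            # flags, character name and billing position
""", re.VERBOSE)

def parse_credit(credit):
    """Split one credit into its parts, or return None if it has no year."""
    m = CREDIT_RE.match(credit.strip())
    if m is None:
        return None
    title, rest = m.group("title"), m.group("rest")
    character = re.search(r"\[(.+?)\]", rest)
    billing = re.search(r"<(\d+)>", rest)
    return {
        "title": title,
        "year": int(m.group("year")),
        "tv_series": title.startswith('"'),
        "tv_movie": "(TV)" in rest,
        "video": "(V)" in rest,
        "character": character.group(1) if character else None,
        "billing": int(billing.group(1)) if billing else None,
    }

credit = parse_credit("14 Actors Acting (2010) (V)  [Herself]")
print(credit["title"], credit["year"], credit["video"], credit["character"])
```

A credit counts as a movie in the sense used here when `tv_series`, `tv_movie` and `video` are all false.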

The `movies.list` file lists its titles in a very similar format. An actual movie looks like this:

    A Monster Is Loose in the City (1976)                   1976

whereas a TV episode is more likely to look like this:

    "Una Maid en Manhattan" (2011) {Vida obra y milagro (#1.40)}    2012

So we can go through the movie list, getting all the actual movies and making nodes out of them. We can then go through the taglines list, adding a tagline for each movie. Finally we can go through the actors and actresses list, adding those people who have appeared in the movies we select, and so have a much larger movie graph!



In [32]:
import re

movienodes = []
with open("imdb/movies.list", encoding="ISO-8859-15") as f:
    found_movies_list = False
    for line in f:
        if found_movies_list:
            # Our movie line parsing logic goes here.
            if re.match(r"^\s*$", line):
                # Skip blank lines.
                continue
            elif re.match(r'^"', line):
                # Skip TV series and episodes, whose titles are in quotes.
                continue
            else:
                # Strip off the line ending
                line = line.rstrip()
                lineparts = re.split(r"\t+", line)
                if len(lineparts) != 2:
                    print("Weird line: %s" % line)
                    continue
                title, released = lineparts

                # Now we know the release date is in the second line part.
                # Let's look at the first one - if it has a (TV) or a (V)
                # then we don't care.
                if re.search(r"\(T?V\)", title):
                    continue

                # We can also see that the release date is repeated in the
                # title. We don't care about that. Let's get rid of it.
                reldate = re.compile(r" \(%s\)" % re.escape(released))
                title = reldate.sub('', title)

                # Finally, change the released date to an integer type if it's a year.
                if re.match(r"^\d+$", released):
                    released = int(released)

                # Now we can make our graph node.
                newmovie = Node("Movie", title=title, released=released)
                movienodes.append(newmovie)
        else:
            # Everything before the "======" header line is preamble.
            if line.find("======") > -1:
                found_movies_list = True

print("Done, with %d movies" % len(movienodes))

Weird line: --------------------------------------------------------------------------------
Done, with 950430 movies
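
The per-line logic in the loop above can also be pulled out into a small function, which makes it possible to check it against sample lines without reading the whole file. This is a sketch of the same rules, not a replacement for the cell above: the name `parse_movie_line` is invented here, and the sample lines assume that title and year are separated by tabs, as the split in the loop implies.

```python
import re

def parse_movie_line(line):
    """Return (title, released) for a movie line, or None to skip it."""
    line = line.rstrip()
    if not line or line.startswith('"'):
        return None                      # blank line or TV series/episode
    lineparts = re.split(r"\t+", line)
    if len(lineparts) != 2:
        return None                      # not a "title <tabs> year" line
    title, released = lineparts
    if re.search(r"\(T?V\)", title):
        return None                      # TV movie or made-for-video
    # Drop the year that is repeated in the title.
    title = title.replace(" (%s)" % released, "")
    if re.fullmatch(r"\d+", released):
        released = int(released)
    return title, released

print(parse_movie_line("A Monster Is Loose in the City (1976)\t\t\t1976\n"))
# prints ('A Monster Is Loose in the City', 1976)
```

Unlike the loop, this version silently skips malformed lines such as the row of dashes instead of printing them.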


In [33]:
print(movienodes[125632])

(d8af168:Movie {released:2010,title:"C'est comme ça que ça finit"})
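
With this many nodes it may be sensible to write them to the database in several smaller transactions rather than one enormous one. The sketch below shows one way to cut the list into batches; the helper name `batches` and the batch size of 10000 are arbitrary choices made for illustration, and the database part is left as a comment because it needs a running server.

```python
def batches(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage with the objects defined earlier in this notebook (needs a running
# database, so it is left commented out here):
#
# for batch in batches(movienodes, 10000):
#     transaction = db.begin()
#     for node in batch:
#         transaction.create(node)
#     transaction.commit()

print([len(b) for b in batches(list(range(25)), 10)])  # prints [10, 10, 5]
```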
