# Storing WDB data in Neo4j with py2neo

![png](http://d20tdhwx2i89n1.cloudfront.net/image/upload/t_next_gen_article_large_767/btp7c4imyevfdt9icxlo.jpg)

http://localhost:7474/

## py2neo

py2neo is one of Neo4j's Python drivers. It offers a fully-featured interface for interacting with your data in Neo4j.  
http://py2neo.org/v3/#

In [67]:
import py2neo 
print py2neo.__version__

3.1.2


In [68]:
!echo 'learner' | sudo -S pip install py2neo --upgrade

[sudo] password for learner: Requirement already up-to-date: py2neo in /usr/local/lib/python2.7/dist-packages
Cleaning up...


## Connect
Connect to Neo4j with the Graph class.

In [69]:
from py2neo import Graph

graph = Graph()

## Nodes
Create nodes with the Node class. The first argument is the node's label. The remaining arguments are an arbitrary number of node properties, given as key-value pairs.

In [70]:
#!pwd
#!ls -lrt /home/learner/notebooks/neo4j
#!ls -lrt /home/learner/notebooks/neo4j/scripts
#!ls -lrt /home/learner/notebooks/SWP-ScotWitchProject
#!ls -lrt /home/learner/notebooks/SWP-ScotWitchProject/scripts
#!cp -r /home/learner/notebooks/neo4j/scripts /home/learner/notebooks/SWP-ScotWitchProject/.

#!cp -r /home/learner/notebooks/neo4j/figures /home/learner/notebooks/SWP-ScotWitchProject/.
#!cp -r /home/learner/notebooks/neo4j/images /home/learner/notebooks/SWP-ScotWitchProject/.
#!cp -r /home/learner/notebooks/neo4j/util /home/learner/notebooks/SWP-ScotWitchProject/.


In [71]:
from py2neo import Node

#### Initialize the database

In [72]:
graph.delete_all()

#### We select only four accused from the file to load. This achieves two things:  
1. The graphs stay readable.
2. The system is not overloaded.

In [73]:
!cat /home/learner/notebooks/data/WDB_Accused.txt             \
    | perl -pe 's/^(\r\n)//g'                                 \
    | perl -pe 's/([^:]..)\r\n/\1\[CR\]/g'                    \
    | sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'        \
    | grep -E "A/EGD/142\"|A/EGD/143\"|A/EGD/1212|A/EGD/1729" \
    | cut -d "|" -f 1,4,5,6 | sed 's/"//g'                      \
    | sed -e '/|$/d'                                          \
    | sed -e '/||/d'                                          \
    > /home/learner/notebooks/data/WDB_Accused_loadNeo4j_py2neo.txt

!wc /home/learner/notebooks/data/WDB_Accused.txt
!wc /home/learner/notebooks/data/WDB_Accused_loadNeo4j_py2neo.txt
!echo "\nWDB_Accused records to load:"
!cat /home/learner/notebooks/data/WDB_Accused_loadNeo4j_py2neo.txt

  3228  23093 645767 /home/learner/notebooks/data/WDB_Accused.txt
  4   4 137 /home/learner/notebooks/data/WDB_Accused_loadNeo4j_py2neo.txt

WDB_Accused records to load:
A/EGD/1212|Katherine|Wilson|Katherine
A/EGD/142|Annas|Erskine|Anne
A/EGD/143|Issobell|Erskine|Isobel
A/EGD/1729|Margret|Jackson|Margaret
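
For readers who prefer to stay in Python, the same selection can be sketched with the standard `csv` module instead of the `perl`/`sed` pipeline. This is only a sketch under assumptions, not the notebook's method: `select_accused` and the `A/DEMO/1` row are hypothetical, the sample rows are abbreviated to six columns, and the file is assumed to be comma-separated with double-quoted fields, as the pipeline above suggests.

```python
import csv
import io

# The four accused selected in the notebook.
WANTED_REFS = {"A/EGD/142", "A/EGD/143", "A/EGD/1212", "A/EGD/1729"}


def select_accused(csv_text, wanted=WANTED_REFS, columns=(0, 3, 4, 5)):
    """Return pipe-delimited records for the wanted refs.

    Mirrors `cut -f 1,4,5,6` (0-based columns 0, 3, 4, 5) and drops
    records with an empty selected field, like the two `sed ... d` steps.
    """
    selected = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= max(columns) or row[0] not in wanted:
            continue
        fields = [row[i] for i in columns]
        if all(fields):
            selected.append("|".join(fields))
    return selected


# Abbreviated sample rows; "A/DEMO/1" is a hypothetical record with an
# empty last name, included only to show the empty-field filter.
sample = (
    '"A/EGD/142","EGD","142","Annas","Erskine","Anne"\n'
    '"A/EGD/10","EGD","10","Mareon","Quheitt","Marion"\n'
    '"A/DEMO/1","DEMO","1","Jane","","Jane"\n'
)
filtered = select_accused(sample, wanted=WANTED_REFS | {"A/DEMO/1"})
print(filtered)
```

Parsing with `csv.reader` handles commas inside quoted fields, which is what the `sed` loop in the pipeline works around by hand.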


#### Create a uniqueness constraint to avoid loading the same record twice

In [74]:
%load_ext cypher

The cypher extension is already loaded. To reload it, use:
  %reload_ext cypher


In [75]:
%cypher CREATE CONSTRAINT ON (a:Accused) ASSERT a.AccusedRef IS UNIQUE

0 rows affected.
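
With the constraint in place, Neo4j rejects a second `Accused` node with the same `AccusedRef`. A complementary check, sketched here as an assumption rather than as part of the original notebook, is to look for repeated references in the pipe-delimited file before loading. The helper name `duplicated_refs` is hypothetical, and the third sample record is repeated on purpose.

```python
from collections import Counter


def duplicated_refs(lines, sep="|"):
    """Return the sorted references (first field) that occur more than once."""
    counts = Counter(line.split(sep)[0] for line in lines if line.strip())
    return sorted(ref for ref, n in counts.items() if n > 1)


records = [
    "A/EGD/1212|Katherine|Wilson|Katherine",
    "A/EGD/142|Annas|Erskine|Anne",
    "A/EGD/142|Annas|Erskine|Anne",  # deliberately repeated for the demo
]
dupes = duplicated_refs(records)
print(dupes)
```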


#### Run the load, creating one node per accused

In [76]:
WDB_Accused_dataPath = '/home/learner/notebooks/data/WDB_Accused_loadNeo4j_py2neo.txt'
n = 0
# The with block closes the file even if the load stops early.
with open(WDB_Accused_dataPath, "r") as WDB_Accused_csvFile:
    for WDB_Accused_csvLine in WDB_Accused_csvFile:
        # Split once per line and strip the trailing newline.
        fields = WDB_Accused_csvLine.rstrip("\r\n").split("|")
        try:
            graph.create(Node("Accused"
                          , AccusedRef = fields[0]
                          , FirstName = fields[1]
                          , LastName = fields[2]
                          , NodeLabel = fields[1] + "_" + fields[2] + "_Accused"
                         )
                    )
            n = n + 1
        except py2neo.GraphError:
            # Raised, for example, when the uniqueness constraint is violated.
            print("Error creating node for %s" % fields[0])
            break
print("%d nodes created." % n)

4 nodes created.
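
The mapping from one pipe-delimited record to node properties can be isolated and checked without a running database. The sketch below is illustrative: `accused_properties` is a hypothetical helper, not part of py2neo or of the original notebook.

```python
def accused_properties(line):
    """Build the property dictionary for one pipe-delimited Accused record."""
    ref, first_name, last_name = line.rstrip("\r\n").split("|")[:3]
    return {
        "AccusedRef": ref,
        "FirstName": first_name,
        "LastName": last_name,
        "NodeLabel": first_name + "_" + last_name + "_Accused",
    }


props = accused_properties("A/EGD/142|Annas|Erskine|Anne\n")
print(props["NodeLabel"])
# With py2neo available, the dictionary could be unpacked as keyword
# arguments when building the node: Node("Accused", **props)
```

Keeping the parsing separate from `graph.create` makes it easy to verify that the generated `NodeLabel` matches what the verification queries return.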


#### Verify the result
#### How many nodes have been created?

In [77]:
%%cypher
MATCH (n)
RETURN count(1)

1 rows affected.


count(1)
4


#### What do they contain?

In [78]:
%%cypher
MATCH (Acusado:Accused)
RETURN Acusado

4 rows affected.


Acusado
"{u'AccusedRef': u'A/EGD/1212', u'NodeLabel': u'Katherine_Wilson_Accused', u'FirstName': u'Katherine', u'LastName': u'Wilson'}"
"{u'AccusedRef': u'A/EGD/142', u'NodeLabel': u'Annas_Erskine_Accused', u'FirstName': u'Annas', u'LastName': u'Erskine'}"
"{u'AccusedRef': u'A/EGD/143', u'NodeLabel': u'Issobell_Erskine_Accused', u'FirstName': u'Issobell', u'LastName': u'Erskine'}"
"{u'AccusedRef': u'A/EGD/1729', u'NodeLabel': u'Margret_Jackson_Accused', u'FirstName': u'Margret', u'LastName': u'Jackson'}"


### Check the load graphically

In [79]:
from scripts.vis import draw

options = { "Accused": "NodeLabel"
          }
draw(graph, options, physics=True)

#### Next we create the nodes for their relatives

In [80]:
!cat /home/learner/notebooks/data/WDB_Accused_family.txt \
    | perl -pe 's/^(\r\n)//g' \
    | perl -pe 's/([^:]..)\r\n/\1\[CR\]/g' \
    | grep -E "A/EGD/142\"|A/EGD/143\"|A/EGD/1212|A/EGD/1729" \
    | sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta' \
    | cut -d "|" -f 1,4,5,14,15,16 | sed 's/"//g'         \
    | sed -e '/|$/d'                                   \
    | sed -e '/||/d'                                   \
    > /home/learner/notebooks/data/WDB_Accused_family_loadNeo4j_py2neo.txt

In [81]:
!echo "Relatives to include and their relationship to each accused:"
!cat /home/learner/notebooks/data/WDB_Accused_family_loadNeo4j_py2neo.txt

Relatives to include and their relationship to each accused:
AF/JO/195|Stewart|Thomas|Husband|A/EGD/1729|jhm
AF/JO/696|Erskine|Johnne|Nephew|A/EGD/142|jhm
AF/JO/697|Erskine|Alexnder|Nephew|A/EGD/142|jhm
AF/JO/698|Erskine|Johnne|Nephew|A/EGD/143|jhm
AF/JO/699|Erskine|Alexander|Nephew|A/EGD/143|jhm
AF/LA/334|Stewart|Annabell|In-law|A/EGD/1729|LEM
AF/LA/335|Stewart|John|In-law|A/EGD/1729|LEM
AF/LA/336|Stewart|Hugh|In-law|A/EGD/1729|LEM
AF/LA/337|Mathie|Jonet|In-law|A/EGD/1729|LEM
AF/LA/338|Stewart|John|In-law|A/EGD/1729|LEM
AF/LA/44|Ruchheid|James|Son|A/EGD/1212|LEM
AF/LA/583|Erskine|Robert|Brother|A/EGD/142|LEM
AF/LA/584|Erskine|Robert|Brother|A/EGD/143|LEM
AF/LA/716|Ruchheid|Margaret|Daughter|A/EGD/1212|LEM
AF/LA/717|Ruchheid|Thomas|Husband|A/EGD/1212|LEM
AF/LA/718|Wilson|Margaret|Sister|A/EGD/1212|LEM
AF/LA/719|Home|John|Son-in-law|A/EGD/1212|LEM
AF/LA/720|Prat|Johne|Brother-in-law|A/EGD/1212|LEM
AF/LA/75|Erskine|Helene|Sister|A/EGD/142|LEM
AF/LA/76|Erskine|Issobell|Sister|A/EGD/142|LEM
AF/LA/77|

In [82]:
%cypher CREATE CONSTRAINT ON (af:Accused_family) ASSERT af.Accused_familyRef IS UNIQUE

0 rows affected.


In [83]:
WDB_Accused_family_dataPath = '/home/learner/notebooks/data/WDB_Accused_family_loadNeo4j_py2neo.txt'
n = 0
# The with block closes the file even if the load stops early.
with open(WDB_Accused_family_dataPath, "r") as WDB_Accused_family_csvFile:
    for WDB_Accused_family_csvLine in WDB_Accused_family_csvFile:
        # Split once per line and strip the trailing newline.
        fields = WDB_Accused_family_csvLine.rstrip("\r\n").split("|")
        try:
            graph.create(Node("Accused_family"
                          , Accused_familyRef = fields[0]
                          , SurName           = fields[1]
                          , FirstName         = fields[2]
                          , Relationship      = fields[3]
                          , AccusedRef        = fields[4]
                          , NodeLabel         = fields[1] + "_" + fields[2]
                                        + "_" + "Accused_family"
                         )
                    )
            n = n + 1
        except py2neo.GraphError:
            # Report the family record that failed, not the last Accused line.
            print("Error creating node for Accused_familyRef: %s" % fields[0])
            break

print("%d nodes created." % n)

24 nodes created.


#### Verify the result: how many nodes do we have now?

In [84]:
%%cypher
MATCH (n)
RETURN count(1)

1 rows affected.


count(1)
28


#### How many "accused"?

In [85]:
%%cypher
MATCH (Acusado:Accused)
RETURN count(1) as ACUSADOS

1 rows affected.


ACUSADOS
4


#### How many "relatives"?

In [86]:
%%cypher
MATCH (Familiar:Accused_family)
RETURN count(1) as FAMILIARES

1 rows affected.


FAMILIARES
24


In [87]:
%%cypher
MATCH (Familiar:Accused_family)
RETURN Familiar

24 rows affected.


Familiar
"{u'AccusedRef': u'A/EGD/1729', u'SurName': u'Stewart', u'FirstName': u'Thomas', u'Relationship': u'Husband', u'Accused_familyRef': u'AF/JO/195', u'NodeLabel': u'Stewart_Thomas_Accused_family'}"
"{u'AccusedRef': u'A/EGD/142', u'SurName': u'Erskine', u'FirstName': u'Johnne', u'Relationship': u'Nephew', u'Accused_familyRef': u'AF/JO/696', u'NodeLabel': u'Erskine_Johnne_Accused_family'}"
"{u'AccusedRef': u'A/EGD/142', u'SurName': u'Erskine', u'FirstName': u'Alexnder', u'Relationship': u'Nephew', u'Accused_familyRef': u'AF/JO/697', u'NodeLabel': u'Erskine_Alexnder_Accused_family'}"
"{u'AccusedRef': u'A/EGD/143', u'SurName': u'Erskine', u'FirstName': u'Johnne', u'Relationship': u'Nephew', u'Accused_familyRef': u'AF/JO/698', u'NodeLabel': u'Erskine_Johnne_Accused_family'}"
"{u'AccusedRef': u'A/EGD/143', u'SurName': u'Erskine', u'FirstName': u'Alexander', u'Relationship': u'Nephew', u'Accused_familyRef': u'AF/JO/699', u'NodeLabel': u'Erskine_Alexander_Accused_family'}"
"{u'AccusedRef': u'A/EGD/1729', u'SurName': u'Stewart', u'FirstName': u'Annabell', u'Relationship': u'In-law', u'Accused_familyRef': u'AF/LA/334', u'NodeLabel': u'Stewart_Annabell_Accused_family'}"
"{u'AccusedRef': u'A/EGD/1729', u'SurName': u'Stewart', u'FirstName': u'John', u'Relationship': u'In-law', u'Accused_familyRef': u'AF/LA/335', u'NodeLabel': u'Stewart_John_Accused_family'}"
"{u'AccusedRef': u'A/EGD/1729', u'SurName': u'Stewart', u'FirstName': u'Hugh', u'Relationship': u'In-law', u'Accused_familyRef': u'AF/LA/336', u'NodeLabel': u'Stewart_Hugh_Accused_family'}"
"{u'AccusedRef': u'A/EGD/1729', u'SurName': u'Mathie', u'FirstName': u'Jonet', u'Relationship': u'In-law', u'Accused_familyRef': u'AF/LA/337', u'NodeLabel': u'Mathie_Jonet_Accused_family'}"
"{u'AccusedRef': u'A/EGD/1729', u'SurName': u'Stewart', u'FirstName': u'John', u'Relationship': u'In-law', u'Accused_familyRef': u'AF/LA/338', u'NodeLabel': u'Stewart_John_Accused_family'}"


#### And graphically?

In [88]:
from scripts.vis import draw

options = {   "Accused": "NodeLabel"
            , "Accused_family": "NodeLabel"
          }
draw(graph, options, physics=True)

### Create the relationships

In [89]:
%%cypher
MATCH (a:Accused),(af:Accused_family)
WHERE a.AccusedRef = af.AccusedRef
CREATE (a)<-[r: Relationship { name: af.Relationship , label: af.Relationship }]-(af)

48 properties set.
24 relationships created.
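
The Cypher statement above is in effect a join on `AccusedRef`: every `Accused_family` node whose `AccusedRef` matches an `Accused` node gets one relationship carrying two properties (`name` and `label`), which is why 24 relationships set 48 properties. The pure-Python sketch below illustrates that join. `link_family` is a hypothetical helper, and the `AF/DEMO/1` row is invented to show a relative with no matching accused.

```python
def link_family(accused_refs, family_rows):
    """Return (family_ref, relationship, accused_ref) for rows whose accused exists."""
    known = set(accused_refs)
    return [
        (family_ref, relationship, accused_ref)
        for family_ref, relationship, accused_ref in family_rows
        if accused_ref in known
    ]


accused_refs = ["A/EGD/1212", "A/EGD/142", "A/EGD/143", "A/EGD/1729"]
family_rows = [
    ("AF/JO/195", "Husband", "A/EGD/1729"),  # from the listing above
    ("AF/LA/44", "Son", "A/EGD/1212"),       # from the listing above
    ("AF/DEMO/1", "Sister", "A/DEMO/9"),     # hypothetical, no matching accused
]
links = link_family(accused_refs, family_rows)
print(links)
```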


In [90]:
from scripts.vis import draw

options = {   "Accused": "NodeLabel"
            , "Accused_family": "NodeLabel"
          }
draw(graph, options, physics=True)

#### I could not get the relationship name to appear in this graphical representation, but the label or name assigned to each relationship can be displayed in the Neo4j browser
<img src="http://localhost:8001/files/SWP-ScotWitchProject/images/graph.png">

### Next we run the full loads in order to reproduce some of the queries run against the earlier databases

In [91]:
!ls -lrt /usr/share/neo4j

total 16
drwxr-xr-x 2 neo4j adm  4096 Mar  7 18:19 tools
drwxr-xr-x 3 neo4j adm  4096 Mar  7 18:19 bin
drwxr-xr-x 2 neo4j adm  4096 Mar  7 18:19 lib
drwxrwxrwx 2 root  root 4096 Apr  8 08:23 import


In [92]:
!echo 'learner' | sudo -S -u root mkdir /usr/share/neo4j/import

[sudo] password for learner: mkdir: cannot create directory '/usr/share/neo4j/import': File exists


In [93]:
!echo 'learner' | sudo -S -u root chmod a+rw /usr/share/neo4j/import

[sudo] password for learner: 

### The files first had to be copied to the **import** directory so that they can be loaded with LOAD CSV

In [94]:
!cp /home/learner/notebooks/data/WDB_Accused.txt /usr/share/neo4j/import/.

In [95]:
%%cypher
// assert correct line count
LOAD CSV FROM "file:/WDB_Accused.txt" AS line
RETURN count(*);

1 rows affected.


count(*)
3219
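
Note that `wc` reported 3228 lines for `WDB_Accused.txt`, while `LOAD CSV` counts 3219 rows. One plausible explanation, offered here as an assumption that was not verified against the file, is that some quoted fields contain line breaks (the `perl` step that rewrites `\r\n` as `[CR]` points in that direction). A CSV parser treats such a field as part of a single record. The sample data below is hypothetical.

```python
import csv
import io

# Two CSV records; the first has a quoted field that spans two lines.
raw = (
    '"A/DEMO/1","first line of a note\nsecond line of the same note"\n'
    '"A/DEMO/2","a note on one line"\n'
)

physical_lines = raw.count("\n")              # what `wc -l` would count
records = list(csv.reader(io.StringIO(raw)))  # what a CSV parser sees

print(physical_lines, len(records))
```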


In [96]:
%%cypher
// check first few raw lines
LOAD CSV FROM "file:/WDB_Accused.txt" AS line WITH line
RETURN line
LIMIT 5;

5 rows affected.


line
"[u'A/EGD/10', u'EGD', u'10', u'Mareon', u'Quheitt', u'Marion', u'White', None, None, None, u'Female', None, u'0', u'0', u'Sammuelston', u'P/JO/3539', u'Haddington', u'Haddington', None, None, None, None, None, None, None, None, None, u'SMD', u'15/5/2001 11:06:51', u'jhm', u'9/8/2002 11:40:51']"
"[u'A/EGD/100', u'EGD', u'100', u'Thom', u'Cockburn', u'Thomas', u'Cockburn', None, None, None, u'Male', None, u'0', u'0', None, None, None, u'Haddington', None, None, None, None, None, None, None, None, None, u'SMD', u'15/5/2001 11:06:51', u'jhm', u'2/10/2002 10:32:30']"
"[u'A/EGD/1000', u'EGD', u'1000', u'Christian', u'Aitkenhead', u'Christine', u'Aikenhead', None, None, None, u'Female', None, u'0', u'0', u'Rottinraw', None, None, u'Dumfries', None, None, None, None, None, u'Married', None, None, None, u'SMD', u'15/5/2001 11:06:51', u'jhm', u'1/10/2002 10:48:12']"
"[u'A/EGD/1001', u'EGD', u'1001', u'Janet', u'Ireland', u'Janet', u'Ireland', None, None, None, u'Female', None, u'0', u'0', u'Rottinraw', None, None, u'Dumfries', None, None, None, None, None, u'Widowed', None, None, None, u'SMD', u'15/5/2001 11:06:51', u'jhm', u'1/10/2002 10:49:00']"
"[u'A/EGD/1002', u'EGD', u'1002', u'Agnes', u'Hendersoun', u'Agnes', u'Henderson', None, None, None, u'Female', None, u'0', u'0', None, u'P/ST/1446', u'Stirling', u'Stirling', None, None, None, None, None, None, None, None, None, u'SMD', u'15/5/2001 11:06:51', u'jhm', u'1/10/2002 10:50:07']"


In [97]:
%%cypher
// Remove the existing data
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r

24 relationship deleted.
28 nodes deleted.


In [98]:
%%cypher
USING PERIODIC COMMIT
LOAD CSV FROM "file:/WDB_Accused.txt" AS row
CREATE (:Accused { AccusedRef: row[0]
                 , FirstName: row[3]
                 , LastName: row[4]
                 , FirstName_LastName: row[3] + " " + row[4]
                 });

3219 labels added.
3219 nodes created.
12869 properties set.


### And the case information

In [99]:
!cp /home/learner/notebooks/data/WDB_Case.txt /usr/share/neo4j/import/.

In [100]:
%%cypher
USING PERIODIC COMMIT
LOAD CSV FROM "file:/WDB_Case.txt" AS row
CREATE (:Case { CaseRef: row[0]
              , AccusedRef: row[4]
});

3413 labels added.
3413 nodes created.
6625 properties set.


In [101]:
%%cypher
MATCH (n {AccusedRef: 'A/LA/3244'})
RETURN n.FirstName_LastName,n.CaseRef,n.AccusedRef

2 rows affected.


n.FirstName_LastName,n.CaseRef,n.AccusedRef
Cristeane Jak,,A/LA/3244
,C/LA/3406,A/LA/3244


#### And now the relationships between accused and cases

In [102]:
%%cypher
// Delete the relationships
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE r

0 rows affected.


In [103]:
%%cypher
MATCH (a:Accused),(c:Case)
WHERE a.AccusedRef = c.AccusedRef
CREATE (a)-[r:WasAccusedInCase]->(c)

3212 relationships created.


In [104]:
!cp /home/learner/notebooks/data/WDB_Accused_family.txt /usr/share/neo4j/import/.

In [105]:
%%cypher
USING PERIODIC COMMIT
LOAD CSV FROM "file:/WDB_Accused_family.txt" AS row
CREATE (:Accused_family { Accused_familyRef: row[0]
                        , Surname: row[3]
                        , Firstname: row[4]
                        , Surname_Firstname: row[3] + ", " + row[4]
                        , Relationship: row[13]
                        , AccusedRef: row[14]
});

951 labels added.
951 nodes created.
5603 properties set.


## And the family relationships

In [106]:
%%cypher
// Delete the relationships
MATCH (a:Accused),(af:Accused_family)
MATCH (a)-[r]-(af)
DELETE r

0 rows affected.


In [107]:
%%cypher
MATCH (a:Accused),(af:Accused_family)
WHERE a.AccusedRef = af.AccusedRef
CREATE (a)<-[r: Relationship { label:af.Relationship } ]-(af)

924 properties set.
951 relationships created.


## Accused with more than 5 relatives

In [108]:
%%cypher
MATCH (Accused:Accused)<-[:Relationship]-(Accused_family:Accused_family)
WITH Accused, count(Accused_family) AS Relatives
WHERE Relatives > 5
RETURN Accused.FirstName_LastName, Accused.LastName, Accused.AccusedRef

4 rows affected.


Accused.FirstName_LastName,Accused.LastName,Accused.AccusedRef
Annas Erskine,Erskine,A/EGD/142
Issobell Erskine,Erskine,A/EGD/143
Margret Jackson,Jackson,A/EGD/1729
Katherine Wilson,Wilson,A/EGD/1212
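
The `WITH ... count(...) ... WHERE Relatives > 5` step is a group-and-filter. The sketch below shows the same idea in plain Python. `accused_with_many_relatives` is a hypothetical helper, and the input is only an excerpt of the family listing shown earlier, so its counts are not the full-dataset counts.

```python
from collections import Counter


def accused_with_many_relatives(family_rows, threshold=5):
    """Return accused refs that have more than `threshold` relatives."""
    counts = Counter(accused_ref for _family_ref, accused_ref in family_rows)
    return sorted(ref for ref, n in counts.items() if n > threshold)


# Excerpt of (Accused_familyRef, AccusedRef) pairs from the earlier listing.
excerpt = [
    ("AF/JO/195", "A/EGD/1729"), ("AF/LA/334", "A/EGD/1729"),
    ("AF/LA/335", "A/EGD/1729"), ("AF/LA/336", "A/EGD/1729"),
    ("AF/LA/337", "A/EGD/1729"), ("AF/LA/338", "A/EGD/1729"),
    ("AF/JO/698", "A/EGD/143"), ("AF/JO/699", "A/EGD/143"),
    ("AF/LA/584", "A/EGD/143"),
]
many = accused_with_many_relatives(excerpt)
print(many)
```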


#### Relatives of the four accused above (A/EGD/142, A/EGD/143, A/EGD/1212, A/EGD/1729)

In [109]:
%%cypher
MATCH (Accused:Accused)<-[:Relationship]-(Accused_family:Accused_family)
WHERE Accused.AccusedRef in [ "A/EGD/142" , "A/EGD/143", "A/EGD/1212", "A/EGD/1729"]
RETURN Accused_family.Firstname + ' ' + Accused_family.Surname + ' is the ' + Accused_family.Relationship
    + ' of ' + Accused.FirstName + ' ' + Accused.LastName as RelationShip
ORDER BY Accused.AccusedRef

28 rows affected.


RelationShip
James Ruchheid is the Son of Katherine Wilson
John Home is the Son-in-law of Katherine Wilson
Margaret Wilson is the Sister of Katherine Wilson
Thomas Ruchheid is the Husband of Katherine Wilson
Margaret Ruchheid is the Daughter of Katherine Wilson
Johne Prat is the Brother-in-law of Katherine Wilson
""
Robert Erskine is the Brother of Annas Erskine
Alexnder Erskine is the Nephew of Annas Erskine
Johnne Erskine is the Nephew of Annas Erskine


## Trying to use the power of Neo4j to detect new, interesting relationships
### Add a relationship between accused with the same last name

In [111]:
%%cypher
MATCH (a:Accused), (b:Accused)
WHERE a.LastName = b.LastName
  AND a.AccusedRef < b.AccusedRef
CREATE (a)-[r1:sameLastname { label:"same Lastname" } ]->(b)

6030 properties set.
6030 relationships created.
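
The condition `a.AccusedRef < b.AccusedRef` keeps each unordered pair once and excludes self-pairs, so a surname shared by k accused contributes k(k-1)/2 relationships. The following sketch reproduces that pairing in plain Python. `same_last_name_pairs` is a hypothetical helper, and the input is limited to the four accused loaded earlier.

```python
from collections import defaultdict
from itertools import combinations


def same_last_name_pairs(accused):
    """Return sorted (ref_a, ref_b) pairs, ref_a < ref_b, sharing a last name."""
    by_last_name = defaultdict(list)
    for ref, last_name in accused:
        if last_name is None:
            continue  # in Cypher, null = null is not true, so nulls never match
        by_last_name[last_name].append(ref)
    pairs = []
    for refs in by_last_name.values():
        # combinations over the sorted refs yields each pair once, in order
        pairs.extend(combinations(sorted(refs), 2))
    return sorted(pairs)


accused = [
    ("A/EGD/1212", "Wilson"),
    ("A/EGD/142", "Erskine"),
    ("A/EGD/143", "Erskine"),
    ("A/EGD/1729", "Jackson"),
]
pairs = same_last_name_pairs(accused)
print(pairs)
```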


In [117]:
from scripts.vis import draw

options = { "Accused": "FirstName_LastName"
          , "Accused_family": "Surname_Firstname"
          , "Case": "CaseRef"
          }

draw(graph, options, physics=True, limit=256)

### Zooming in a little, the surnames *Young*, *Stewart*, *Anderson* and *Hunter* stand out
### We also see many cases with no relationship to any accused (unconnected yellow nodes)
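
The counts reported by the loads are consistent with this observation: 3413 `Case` nodes with two properties each would give 6826 properties, but only 6625 were set, a difference of 201, and 3413 cases minus 3212 `WasAccusedInCase` relationships is also 201. This suggests, without proving it, that the unconnected cases are mostly those with an empty `AccusedRef`. The sketch below shows how such orphan cases could be listed in plain Python. `cases_without_accused` and the `DEMO` rows are hypothetical.

```python
def cases_without_accused(cases, accused_refs):
    """Return case refs whose AccusedRef is missing or matches no accused."""
    known = set(accused_refs)
    return [
        case_ref
        for case_ref, accused_ref in cases
        if accused_ref is None or accused_ref not in known
    ]


cases = [
    ("C/LA/3406", "A/LA/3244"),  # pair taken from the query output above
    ("C/DEMO/1", None),          # hypothetical case with an empty AccusedRef
    ("C/DEMO/2", "A/DEMO/404"),  # hypothetical reference to an unknown accused
]
orphans = cases_without_accused(cases, ["A/LA/3244"])
print(orphans)
```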

#### Below is a diagram of the entities and relationships used in this exercise
<img src="http://localhost:8001/files/SWP-ScotWitchProject/images/Neo4j.png">

### Conclusion:
### I believe Neo4j can contribute a great deal to this dataset when it comes to detecting new relationships in the data, beyond those identified and captured in the original MS Access model.
### Kinship is one example, but I can think of many more that I hope to pursue at some point for the publication of this work:
* The states an accused passes through, from the denunciation to the execution of the sentence.  
* Geographic information could be included to trace the accused's itinerary on a map. For example, if the accused's place of residence differs from the place where the trial was held, a transfer is implied.
* Identifying relationships among the participants in the trials. Orchestrated denunciations against a particular person or family could be detected.