# Graph creation and entity reconciliation (Datashare + Neo4j + OpenRefine + OLDB)

## Resources (TODO)

## Setup

### Demo

Follow the instructions to install [Poetry](https://python-poetry.org/docs/#installation), then set up the demo from the repo root:

```bash
poetry install
```

### OpenRefine

Follow the instructions to install [OpenRefine](https://openrefine.org/download).


### Neo4j

Start Neo4j using Docker:
```bash
docker compose up -d
```

Check that the service is running via:
```bash
docker ps
```
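
`docker ps` only shows that the container is up. To confirm that Neo4j is actually accepting connections, you can probe its ports. The sketch below uses only the Python standard library and assumes that the compose file publishes Neo4j's default HTTP port (7474) and Bolt port (7687) on `localhost`; adjust the ports if your setup differs. `is_port_open` is a hypothetical helper, not part of this repo.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Example usage (assumed default ports, see above):
#   is_port_open("localhost", 7474)  # HTTP, used by the Neo4j Browser
#   is_port_open("localhost", 7687)  # Bolt
```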

### Datashare

Download [Datashare](https://datashare.icij.org/) and install it.

If you struggle with the installation, have a look at Datashare's [documentation](https://icij.gitbook.io/datashare) and follow the installation instructions for your OS.



## Add documents to Datashare and extract entities
<br>

### Download the documents
TODO: put the correct download URL here


Download the data sample and place it at the root of the repository. Then extract the archive:
TODO: update the path here

```bash
tar xzvf .... 
``` 

Now, let's have a quick look at the corpus. It's composed of:
- 3 emails (`.eml`) with embedded documents
- some articles from ICIJ's website **in many different formats**: `html`, `png` (screenshots), `pdf`...
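
To get a quick overview of the formats in the corpus, you can count files by extension. This is a minimal sketch using only the standard library; `count_extensions` is a hypothetical helper, and it assumes the archive extracts to the `cyprus` directory used as Datashare's data home in this README.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root: str) -> Counter:
    """Count files under `root` (recursively) by lowercased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Example usage (assumes the sample was extracted to ./cyprus):
#   for ext, n in count_extensions("cyprus").most_common():
#       print(f"{ext}\t{n}")
```

Note that this only counts top-level files on disk: documents embedded in the emails are not visible until Datashare extracts them.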

<br>

### Start Datashare

Start Datashare and set the sample document directory as Datashare's data home: 
```bash
DS_DOCKER_NEO4J_HOST=localhost datashare -m EMBEDDED -d $(pwd)/cyprus
```
You can now navigate to [http://localhost:8080](http://localhost:8080) and use Datashare.

<br>

### Add documents

Follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/analyze-documents) to add new documents to Datashare.

You can now see that:
- documents can be previewed
- Datashare extracted text content from all kinds of documents, including images
- document content is now searchable using [Datashare's search](https://icij.gitbook.io/datashare/usage/search-documents)

<br>

### Extract named entities from documents

Follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/analyze-documents#extract-names-of-people-organizations-and-locations) to detect named entities found inside documents.

Make sure to detect:
- people, organizations and locations
- as well as email addresses

You can now see:
- that the `Entity` tab of documents is populated with entities
- that it's possible to search documents by named entities

<br>

## Create the Neo4j graph


Follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/create-the-neo4j-graph/run-datashare-with-the-neo4j-plugin) to install the Neo4j plugin.

Restart Datashare:
```bash
DS_DOCKER_NEO4J_HOST=localhost datashare -m EMBEDDED -d $(pwd)/cyprus
```

Then follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/create-the-neo4j-graph/create-and-update-the-graph) to create the Neo4j graph.

<br>

## Explore the graph

Open the Neo4j Browser at [http://localhost:7474/browser/](http://localhost:7474/browser/).

Visualize the conversation:
```cypher
MATCH (emailAddress:EMAIL)--(doc:Document)
RETURN emailAddress, doc
```

<center>
    <img src="./images/email_conv.jpg" width="600"/>
</center>

Notice:
- that some email addresses have `:SENT` or `:RECEIVED` relationships, which were parsed from email headers
- that some email addresses only have `:APPEARS_IN` relationships, because they were found in the email content
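
The distinction between the two kinds of addresses can be sketched in a few lines of Python: given (address, relationship type) pairs, for example exported from a Cypher query, split the addresses into those seen in headers and those seen only in content. The sample rows and the `split_by_origin` helper are made up for illustration; only the relationship type names come from the graph.

```python
from collections import defaultdict

# Relationship types parsed from email headers.
HEADER_TYPES = {"SENT", "RECEIVED"}


def split_by_origin(rows):
    """Split addresses into (seen in headers, seen only in content).

    `rows` is an iterable of (email_address, relationship_type) pairs.
    """
    types_by_address = defaultdict(set)
    for address, rel_type in rows:
        types_by_address[address].add(rel_type)

    in_headers = {a for a, t in types_by_address.items() if t & HEADER_TYPES}
    content_only = set(types_by_address) - in_headers
    return in_headers, content_only


# Made-up rows for illustration only.
rows = [
    ("alice@example.org", "SENT"),
    ("bob@example.org", "RECEIVED"),
    ("bob@example.org", "APPEARS_IN"),
    ("carol@example.org", "APPEARS_IN"),
]
in_headers, content_only = split_by_origin(rows)
print(sorted(in_headers))    # ['alice@example.org', 'bob@example.org']
print(sorted(content_only))  # ['carol@example.org']
```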

Look at the `:HAS_PARENT` relationships to see how some documents appear as attachments of other documents:

<center>
    <img src="./images/email_attachements.jpg" width="400"/>
</center>

Explore the named entities of email documents and the different entity types: `PERSON`, `ORGANISATION`, `LOCATION` and `EMAIL` (email addresses):
 
<center>
    <img src="./images/email_entities.jpg" width="800"/>
</center>
 
By looking at a given named entity:  
```cypher
MATCH (person:PERSON)
WHERE person.mentionNorm CONTAINS "putin"
RETURN person
```

notice that the same entity can appear multiple times in several documents:

<center>
    <img src="./images/unresolved_entity.jpg" width="600"/>
</center>
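
To see why simple normalization is not enough to merge these duplicates, consider the sketch below. It groups raw mentions by a basic normalized form (lowercased, accents stripped, whitespace collapsed). The `normalize` function is an assumption made for illustration and is not necessarily how Datashare computes `mentionNorm`; the sample mentions are made up.

```python
import unicodedata
from collections import defaultdict


def normalize(mention: str) -> str:
    """Lowercase, strip accents and collapse whitespace (illustrative only)."""
    decomposed = unicodedata.normalize("NFKD", mention)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def group_mentions(mentions):
    """Group raw mentions by their normalized form."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


# Made-up mentions for illustration only.
mentions = ["Vladimir Putin", "vladimir  PUTIN", "Vladímir Putin", "V. Putin", "Putin"]
for norm, raw in group_mentions(mentions).items():
    print(f"{norm!r}: {raw}")
```

With this normalization the first three spellings fall into one group, while `V. Putin` and `Putin` stay separate. Closing that kind of gap is what entity reconciliation is for.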
