# Exploring the ENRON email corpus with Neo4j Bloom and Graph Data Science 

## Introduction

This tutorial will help you replicate some of the explorations made in the second part of the [Datashare's Neo4j Plugin Tutorial](https://www.youtube.com/watch?v=GOQSGpjBMS0), where [Bloom](https://neo4j.com/product/bloom/) and Neo4j [Graph Data Science](https://neo4j.com/product/graph-data-science/) are used together to explore the [Enron email corpus](https://www.cs.cmu.edu/~enron/).

The Enron email corpus is a large dataset of emails released for research purposes after [Enron collapsed](https://en.wikipedia.org/wiki/Enron_scandal) in the early 2000s following widespread use of fraudulent accounting techniques.

The video tutorial showcases how to combine Bloom's powerful UI with Graph Data Science to quickly gain knowledge about the graph created from Datashare using the Neo4j plugin.

Note that the Enron email corpus is **quite large**, so adding documents to Datashare, extracting named entities and creating the graph from Datashare can take a **very long time** (we will try to provide a Neo4j dump in the future to save time).

**This notebook is meant to be followed side by side with the [tutorial video](https://www.youtube.com/watch?v=GOQSGpjBMS0)**. Following it step-by-step with the video, you'll be able to perform Cypher queries in order to update your graph and perform the same explorations as in the tutorial.

Here are some of the explorations you will perform on this data.

We'll use [centrality algorithms](https://neo4j.com/docs/graph-data-science/current/algorithms/centrality/) to identify employees centralizing information at Enron.

<center>
    <img src="./images/enron_central_entities.jpg" width="1000"/>
</center>

<br>

We'll use Bloom's timelines to dynamically visualize exchanges between different actors over time:

<center>
    <img src="./images/enron_timeline.png" width="1000"/>
</center>

<br>

We'll use Bloom’s Pattern Search to isolate exchanges between Enron employees and their financial auditors:
<center>
    <img src="./images/enron_pattern_search.png" width="1000"/>
</center>

<br>



## 1. Setup

### 1.1 Install dependencies

Follow the instructions to install [poetry](https://python-poetry.org/docs/#installation), then set up the demo. From the repo root, run:

```bash
poetry install
```

<br>

### 1.2 Install Bloom

Follow [these instructions](https://neo4j.com/docs/bloom-user-guide/current/) to install Bloom inside the Neo4j service defined in this [Docker Compose file](../docker-compose-bloom.yml).

Neo4j Bloom is paid software, so make sure you have an Enterprise license and a license for Bloom.

<br>

### 1.3 Neo4j

Start Neo4j and Bloom using Docker. From the repo root, run:
```bash
docker compose -f docker-compose-bloom.yml up -d
```

and verify that the service is running with:
```bash
docker ps
```
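
`docker ps` only shows that the container is up. As a complementary check, the minimal Python sketch below tests whether Neo4j's ports accept TCP connections. It assumes the compose file publishes Neo4j's default ports (7474 for HTTP, 7687 for Bolt) on `localhost`. The helper name `is_port_open` is hypothetical, so adjust the ports if your setup differs.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: Neo4j's default ports are published on localhost.
    for name, port in [("HTTP", 7474), ("Bolt", 7687)]:
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"{name} port {port}: {state}")
```

An open port only means something is listening. Neo4j may still need a few seconds after startup before it accepts queries.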

<br>

### 1.4 Datashare

Download [Datashare](https://datashare.icij.org/) and install it.

If you struggle with the installation, have a look at Datashare's [documentation](https://icij.gitbook.io/datashare) and follow the installation instructions for your OS.

<br>

## 2. Add documents to Datashare and extract entities
<br>

### 2.1 Download the corpus

Download the [Enron archive](https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz) from the [Enron Email Dataset web page](https://www.cs.cmu.edu/~enron/), then place it at the repo root and decompress it:

```bash
tar xzvf enron_mail_20150507.tar.gz
```

<br>

### 2.2 Start Datashare

Start Datashare and set the Enron corpus directory as Datashare's data home:
```bash
DS_DOCKER_NEO4J_HOST=localhost datashare -m EMBEDDED -d $(pwd)/maildir
```

The `DS_DOCKER_NEO4J_HOST=localhost` variable tells Datashare where Neo4j is running, and the `-d $(pwd)/maildir` option sets Datashare's data directory.

You can now navigate to [http://localhost:8080](http://localhost:8080) and use Datashare.

**Make sure to [install the Neo4j plugin](https://icij.gitbook.io/datashare/local-mode/create-the-neo4j-graph/run-datashare-with-the-neo4j-plugin)** from the settings page, and restart Datashare after that.


<br>

### 2.3 Add documents

**Note: this step can take hours; we'll try to provide dumps in the future.**

Follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/analyze-documents) to add new documents to Datashare.

You can now see how:
- we can preview documents
- Datashare extracted text content from all kinds of document types, including images
- document content is now searchable using [Datashare's search](https://icij.gitbook.io/datashare/usage/search-documents)


<br>

### 2.4 Extract named entities from documents

**Note: this step can take hours; we'll try to provide dumps in the future.**

Follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/analyze-documents#extract-names-of-people-organizations-and-locations) to detect the named entities found inside documents.

Make sure to detect:
- people, organizations and locations
- as well as email addresses

You can now see:
- that the `Entity` tab of documents is full of entities
- that it's possible to search documents by named entities

<br>

## 3. Create the Neo4j graph
**Note: this step can take hours; we'll try to provide dumps in the future.**

Follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/create-the-neo4j-graph/run-datashare-with-the-neo4j-plugin) to install the Neo4j plugin.

Restart Datashare:
```bash
DS_DOCKER_NEO4J_HOST=localhost datashare -m EMBEDDED -d $(pwd)/maildir
```

and follow Datashare's doc [instructions](https://icij.gitbook.io/datashare/local-mode/create-the-neo4j-graph/create-and-update-the-graph) to create the Neo4j graph.

<br>

## 4. Explore the graph with Bloom and Neo4j Graph Data Science


Follow the [tutorial video](https://www.youtube.com/watch?v=GOQSGpjBMS0) and replicate its steps.

When needed, you can copy and paste the Cypher queries below to search or update the graph.


### 4.1 Explore the email domains involved in the conversations

```cypher
MATCH (e:EMAIL)-[r:APPEARS_IN]->(d:Document) // email addresses
RETURN e.emailDomain as emailDomain, count(r.mentionCount) AS numEmailsPerDomain
ORDER BY numEmailsPerDomain DESC
```
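
To make the aggregation concrete, here is a minimal Python sketch of the same group-and-count logic on a made-up list of `(address, document)` pairs, where each pair stands in for one `APPEARS_IN` relationship. The helper `email_domain` is hypothetical: it takes the text after the last `@`, which may not match exactly how the plugin populates `emailDomain`.

```python
from collections import Counter


def email_domain(address: str) -> str:
    """Hypothetical helper: lower-cased text after the last '@'."""
    return address.rsplit("@", 1)[-1].lower()


def count_per_domain(appearances):
    """Count (address, document) pairs per email domain, most frequent first."""
    counts = Counter(email_domain(address) for address, _doc in appearances)
    return counts.most_common()


# Made-up sample standing in for (e:EMAIL)-[:APPEARS_IN]->(d:Document) rows.
sample = [
    ("alice@enron.com", "doc1"),
    ("alice@enron.com", "doc2"),
    ("bob@enron.com", "doc1"),
    ("carol@example.org", "doc3"),
]
print(count_per_domain(sample))
```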

### 4.2 List all the email domains of auditors at Arthur Andersen

```cypher
MATCH (emailAddress:EMAIL)
WHERE emailAddress.emailDomain CONTAINS "andersen"
RETURN distinct(emailAddress.emailDomain) AS auditorsDomains
```

### 4.3 Use the auditors' domains found previously to isolate exchanges between Enron employees and their auditors

The previous query shows that all auditor email addresses end with `andersen.com`, so we can easily isolate a subgraph:
```cypher
MATCH p = (sender:EMAIL)-[:SENT]-(:Document)-[:RECEIVED]-(recipient:EMAIL)
WHERE (sender.emailDomain = "enron.com" AND recipient.emailDomain ENDS WITH "andersen.com") OR (sender.emailDomain ENDS WITH "andersen.com" AND recipient.emailDomain = "enron.com")
RETURN p
```
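
The `WHERE` clause is symmetric: it keeps a path when one side is exactly `enron.com` and the other side ends with `andersen.com`, whichever of the two is the sender. The Python sketch below mirrors that predicate so the logic can be checked in isolation. The function name and the sample domains are made up for illustration.

```python
def is_enron_auditor_exchange(sender_domain: str, recipient_domain: str) -> bool:
    """Mirror the WHERE clause: one side is enron.com, the other an auditor domain."""

    def is_enron(domain: str) -> bool:
        return domain == "enron.com"

    def is_auditor(domain: str) -> bool:
        # Same case-sensitive suffix test as Cypher's ENDS WITH.
        return domain.endswith("andersen.com")

    return (is_enron(sender_domain) and is_auditor(recipient_domain)) or (
        is_auditor(sender_domain) and is_enron(recipient_domain)
    )


# Made-up (sender domain, recipient domain) pairs.
pairs = [
    ("enron.com", "us.andersen.com"),
    ("us.andersen.com", "enron.com"),
    ("enron.com", "enron.com"),
    ("example.org", "us.andersen.com"),
]
print([pair for pair in pairs if is_enron_auditor_exchange(*pair)])
```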

### 4.4 Use the [PageRank](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/) algorithm to scale email address nodes based on their centrality (detect central players in Enron internal communication)

```cypher
CALL gds.graph.drop('EmailGraph', false);

// Collect sent email paths
MATCH (sender:EMAIL)-[s:SENT]->(d:Document)<-[r:RECEIVED]-(recipient:EMAIL)

// Count number of email per sender / recipient pair
WITH sender, recipient, count(*) as nEmails

// Project the graph and create a (sender)-[:EMAILED {nEmails: 4}]->(recipient) relationship
WITH gds.graph.project('EmailGraph', sender, recipient, {relationshipType: 'EMAILED', relationshipProperties: { nEmails: nEmails}}) AS g

// Compute page rank
CALL gds.pageRank.write('EmailGraph', {writeProperty: '_pageRank', relationshipWeightProperty: 'nEmails', scaler: 'StdScore', maxIterations: 1000}) YIELD computeMillis
RETURN computeMillis
```
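
To build intuition for how the `nEmails` weight influences the ranking, here is a small pure-Python sketch of textbook weighted PageRank on a made-up three-person graph. It is an illustration under stated assumptions, not GDS's implementation: the normalization differs, and the `StdScore` scaler used in the query is not applied, so absolute values will not match those written to `_pageRank`. The function name and the sample data are hypothetical.

```python
def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Textbook weighted PageRank by power iteration.

    edges maps (sender, recipient) -> number of emails (the edge weight).
    """
    nodes = sorted({node for pair in edges for node in pair})
    n = len(nodes)

    # Total outgoing weight per node, used to split a node's rank among recipients.
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Made-up sample: (sender, recipient) -> number of emails.
edges = {
    ("alice", "bob"): 10,
    ("carol", "bob"): 5,
    ("bob", "alice"): 1,
    ("carol", "alice"): 1,
}
ranks = weighted_pagerank(edges)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

In this sample, `bob` receives most of the weighted traffic and ends up ranked highest, while `carol`, who receives no email, ranks lowest.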


### 4.5 Use the [Louvain](https://neo4j.com/docs/graph-data-science/current/algorithms/louvain/) algorithm to detect communities of email addresses in Enron communication

```cypher
//5: Community Detection: Louvain
CALL gds.graph.drop('EmailGraph', false);

// Collect sent email paths
MATCH (sender:EMAIL)-[s:SENT]->(d:Document)<-[r:RECEIVED]-(recipient:EMAIL)

// Count number of email per sender / recipient pair
WITH sender, recipient, count(*) as nEmails

// Project the graph and create a (sender)-[:EMAILED {nEmails: 4}]->(recipient) relationship
WITH gds.graph.project('EmailGraph', sender, recipient, {relationshipType: 'EMAILED', relationshipProperties: { nEmails: nEmails}}) AS g

// Detect communities with Louvain
CALL gds.louvain.write('EmailGraph', {writeProperty: '_louvainId', relationshipWeightProperty: 'nEmails'}) YIELD computeMillis
RETURN computeMillis