# Elasticsearch demo using Enron email dataset
## Prerequisite

Download dataset.tgz from here into the same folder where you cloned this repository.

## Preparation

The dataset.tgz file contains an archive of all Enron emails, de-duped, and parsed into JSON files. Each JSON file in the archive represents one email message.

The size of this compressed dataset is 252MB. Uncompressed into individual JSON files, the size becomes 1.3GB.

1. Install Node.js, MySQL, and Elasticsearch. Make sure MySQL and Elasticsearch are running.

2. Uncompress the archive.

   ```sh
   tar xvf dataset.tgz
   ```

3. Load the emails into Elasticsearch.

   ```sh
   npm install   # if you haven't run this already
   ./load_into_elasticsearch.sh
   ```

4. Load the emails into MySQL.

   ```sh
   ./load_into_mysql.sh
   ```
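Under the hood, a loader like load_into_elasticsearch.js typically sends the parsed JSON documents to Elasticsearch's Bulk API, which expects an NDJSON body of alternating action and source lines. A minimal sketch of assembling such a payload; the field names `messageId`, `subject`, and `body` are illustrative assumptions, not the repository's actual schema:

```javascript
// Build an NDJSON body for the Elasticsearch Bulk API from parsed email objects.
// Field names (messageId, subject, body) are assumptions for illustration; the
// real schema is whatever parse_email_files.js emits.
function buildBulkBody(emails, index) {
  const lines = [];
  for (const email of emails) {
    // Action line: index one document, reusing the message ID as _id if present.
    lines.push(JSON.stringify({ index: { _index: index, _id: email.messageId } }));
    // Source line: the document itself.
    lines.push(JSON.stringify(email));
  }
  // The Bulk API requires a trailing newline after the last line.
  return lines.join('\n') + '\n';
}

const body = buildBulkBody(
  [{ messageId: '1', subject: 'Quarterly report', body: 'See attached.' }],
  'enron'
);
console.log(body);
```

A real loader would POST this body to `/_bulk` with the `application/x-ndjson` content type, batching a few thousand documents per request rather than one.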

## Appendix

The original Enron email dataset was taken from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz. This is an archive of all Enron emails in EML format, where each file represents one email message. Some of these messages are duplicated in multiple files.

The parse_email_files.js script will parse the original Enron email dataset into JSON files, after de-duplicating them.

The included dataset.tgz file is an archive of exactly these JSON files.
