Enron Corporation was an American energy, commodities, and services company based in Houston, Texas. It was founded by Kenneth Lay in 1985 as a merger between Lay's Houston Natural Gas and InterNorth, both relatively small regional companies. It filed for bankruptcy on December 2, 2001.
The Enron Corpus is a database of over 500,000 real emails generated by about 150 Enron employees, mostly senior management. It was obtained by the Federal Energy Regulatory Commission during its investigation of Enron's collapse and was later made public.
The dataset does not include attachments, and some messages have been deleted.
Given the size of the available data, it can be overwhelming to explore it and identify potentially useful pieces of evidence or clues. This project demonstrates one way of applying Natural Language Processing (NLP) and programmatic data extraction to a large-scale fraud investigation, using real data.
Along the way, the project also uses some NLP and other methods that have general application, for example:
- Comparing content of text files through hashing
- Identifying unique and recurring words
- Text summarization using deep learning models
- Creating word clouds :D
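
As a rough illustration of the first two items in the list, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the helper names (`content_hash`, `group_duplicates`, `word_counts`), the whitespace and case normalization, and the sample messages are all assumptions made up for this example.

```python
import hashlib
import re
from collections import Counter, defaultdict


def content_hash(text):
    """Return a SHA-256 digest of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(docs):
    """Group document names whose normalized content hashes are identical."""
    by_hash = defaultdict(list)
    for name, text in docs.items():
        by_hash[content_hash(text)].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


def word_counts(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Made-up sample messages, not taken from the corpus.
samples = {
    "a.txt": "Please review the budget report.",
    "b.txt": "please   review the budget report.",
    "c.txt": "The meeting moved to Friday; the report is due then.",
}

# Files with the same normalized content end up in the same group.
duplicates = group_duplicates(samples)

# Words seen more than once are "recurring"; words seen exactly once are "unique".
counts = word_counts(" ".join(samples.values()))
recurring = {word for word, n in counts.items() if n > 1}
unique = {word for word, n in counts.items() if n == 1}

print(duplicates)
print(sorted(recurring))
print(sorted(unique))
```

Hashing the normalized text means two files can be compared by their short digests rather than by their full contents, which scales better across hundreds of thousands of emails. Whether to normalize whitespace and case before hashing is a design choice; hashing the raw bytes instead would only flag exact copies.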
This project is for demonstration purposes only and is not intended to draw any conclusions; the detailed content of the emails will not be displayed, even though it is publicly available elsewhere.