# Dataset structure notebook

This notebook collects information about the dataset and its structure. This is not a deep exploration of the dataset, just a preliminary step in order to have a clear idea of how to process the emails.

## Folder structure

We work inside the [GitHub repository](https://github.com/qlambotte/enron_emails_clusturing) of the project. The data is stored in `./data/` but is **not** pushed to GitHub.

The data is stored in `./data/maildir`, which is extracted from the archive published on the website [https://www.cs.cmu.edu/~enron/](https://www.cs.cmu.edu/~enron/). The archive can be downloaded using
```bash
wget https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz
```
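
The `wget` command only downloads the archive, which still has to be unpacked to obtain `maildir`. Here is a minimal sketch of that step with the standard `tarfile` module. The helper name `extract_archive` and the destination folder are my own choices for illustration, not something fixed by the project.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives that come from a source you trust.
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Hypothetical usage, assuming the archive sits in the current directory:
# extract_archive("enron_mail_20150507.tar.gz", "./data")
```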

Inside `./data/maildir`, there are 150 folders, each named after a person of interest in the Enron scandal. This number can be retrieved with the command (run inside `./data/maildir`)
```bash
ls | wc -l
```
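
The same count can be obtained from Python, which is handy inside the notebook. This is only a sketch: the function name `count_subdirectories` is made up, and unlike `ls | wc -l` it counts directories only, ignoring any stray files.

```python
from pathlib import Path


def count_subdirectories(root):
    """Count the directories located directly under root (files are ignored)."""
    return sum(1 for entry in Path(root).iterdir() if entry.is_dir())


# Hypothetical usage, assuming the notebook runs from ./data:
# count_subdirectories("./maildir")
```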

Each first-level folder corresponds to one person's mailbox. Inside each of these folders are several subfolders with names like `inbox`, `sent`, etc. Let's have a closer look at the structure.

The command
```bash
cd ./data/maildir
tree . -L 2
```
gives the tree structure of the folder `maildir` with depth 2. The result is printed to stdout, but we can store it in a JSON file:
```bash
tree . -J -L 2 > ../depth_2_tree.json
```
(the path `../depth_2_tree.json` means that the result is stored in `./data` and not in `maildir`).
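
If `tree` is not installed, a similar nested structure can be built directly in Python. The sketch below is an approximation of what `tree -J -L 2` writes, not a faithful clone: it produces no report entry, and the function name `build_tree` is hypothetical.

```python
import os


def build_tree(path, depth=2):
    """Describe path as nested dicts, descending at most depth levels."""
    node = {"type": "directory", "name": os.path.basename(os.path.abspath(path))}
    if depth == 0:
        # Deepest level reached: keep the name only, without "contents".
        return node
    contents = []
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                contents.append(build_tree(entry.path, depth - 1))
            else:
                contents.append({"type": "file", "name": entry.name})
    node["contents"] = contents
    return node


# Hypothetical usage, assuming the notebook runs from ./data:
# level_1 = build_tree("./maildir", depth=2)["contents"]
```

Directories at the last level carry no `"contents"` key, which matches the shape of the sample shown further down.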


In [5]:
import json

In [7]:
with open("depth_2_tree.json", "r") as file:
    data = json.load(file)

The `data` object is a list that contains two dicts. The second element is a report of the structure.

In [12]:
data[1]

{'type': 'report', 'directories': 2949, 'files': 7}

The first element is the tree structure of depth 2. Its key `"contents"` is of interest to us: it is the list of level 1 subdirectories.

In [16]:
len(data[0]["contents"])

150

I want to extract the names of the directories of level 2, to have a rough idea of what is available. I will use the set data structure in order to quickly deal with duplicates.

In [29]:
level_2_names = set()
level_2_files = set()

In [30]:
level_1 = data[0]["contents"]

Here is a sample of a level 1 directory:

In [31]:
level_1[0]

{'type': 'directory',
 'name': 'allen-p',
 'contents': [{'type': 'directory', 'name': 'all_documents'},
  {'type': 'directory', 'name': 'contacts'},
  {'type': 'directory', 'name': 'deleted_items'},
  {'type': 'directory', 'name': 'discussion_threads'},
  {'type': 'directory', 'name': 'inbox'},
  {'type': 'directory', 'name': 'notes_inbox'},
  {'type': 'directory', 'name': 'sent'},
  {'type': 'directory', 'name': 'sent_items'},
  {'type': 'directory', 'name': '_sent_mail'},
  {'type': 'directory', 'name': 'straw'}]}

In [32]:
for folder in level_1:
    sub = folder["contents"]
    for subsub in sub:
        if subsub["type"] == "directory":
            level_2_names.add(subsub["name"])
        else:
            level_2_files.add((folder["name"],subsub["name"]))


The files in `level_2_files` are mails that are not classified, i.e. they sit directly in a mailbox folder rather than in one of its subfolders.

In [34]:
level_2_files

{('baughman-d', '1.'),
 ('corman-s', '1.'),
 ('lokay-m', '1.'),
 ('richey-c', '1.'),
 ('scholtes-d', '1.'),
 ('shively-h', '1.'),
 ('shively-h', '2.')}

I went through these emails. Some of them look like spam, but others are relevant emails (from Enron to Enron employees), so we should keep them. My guess is that these emails were not processed in time.

The set `level_2_names` contains the names of the level 2 folders:

In [36]:
len(level_2_names)

1425
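
A set only tells which names exist, not how often they occur. To get a feel for which names dominate, one could count in how many mailboxes each name appears. The sketch below works on any list shaped like `level_1`; the function name `count_level_2_names` and the toy data are made up for illustration.

```python
from collections import Counter


def count_level_2_names(level_1):
    """Count in how many mailboxes each level 2 directory name appears."""
    counts = Counter()
    for folder in level_1:
        # .get guards against entries that lack a "contents" key.
        for entry in folder.get("contents", []):
            if entry["type"] == "directory":
                counts[entry["name"]] += 1
    return counts


# Made-up data with the same shape as `level_1`.
toy_level_1 = [
    {"type": "directory", "name": "a", "contents": [
        {"type": "directory", "name": "inbox"},
        {"type": "directory", "name": "sent"}]},
    {"type": "directory", "name": "b", "contents": [
        {"type": "directory", "name": "inbox"},
        {"type": "file", "name": "1."}]},
]
print(count_level_2_names(toy_level_1).most_common(1))  # [('inbox', 2)]
```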

It is quite a long set, and I feel that it may be too large to handle by hand. Do we need to include it in the data? Or is this info already present in the email files?

Yes, it is in the header, under the field `X-Folder`.

NOTE: when you open a mail in some apps, you may find the character `^M` at the end of lines. This is how some editors display the carriage return (`\r`) that is part of Windows-style line endings (`\r\n`).
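
A small illustration of where `^M` comes from, on made-up bytes rather than a real mail. Note that Python's default text mode already translates `\r\n` to `\n` when reading a file, so the issue mostly shows up when working with raw bytes.

```python
# Made-up message bytes with Windows-style (CRLF) line endings.
raw = b"Subject: test\r\n\r\nFirst line\r\nSecond line\r\n"

# Decoding keeps the carriage returns; some editors render each "\r" as ^M.
text = raw.decode("ascii")
print(repr(text))

# Normalising CRLF to LF removes them.
normalised = text.replace("\r\n", "\n")
print(repr(normalised))
```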

Emails are formatted according to the [MIME standard](https://en.wikipedia.org/wiki/MIME). This will allow a uniform treatment of the mails.

Python provides a library, [`email`](https://docs.python.org/3/library/email.html#module-email), to deal with emails in compliance with the MIME standard.

In [48]:
import email
with open("./maildir/baughman-d/ubsw_energy/enron_to_ubs_trans/4.", "r") as file:
    sample = email.message_from_file(file)

`sample` is a `Message` object. A `Message` object gives dict-like access to the headers, and its `walk()` method is a generator over the message and its subparts. We can therefore iterate over it.
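
The dict-like side of the interface can be tried on a small hand-written message. The addresses and subject below are invented; only the calls on the message object matter.

```python
import email

# Hand-written message used only for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)
msg = email.message_from_string(raw)

print(msg.keys())             # ['From', 'To', 'Subject']
print(msg["subject"])         # Hello (header lookup is case-insensitive)
print(msg["X-Folder"])        # None (a missing header gives None, not KeyError)
print(len(list(msg.walk())))  # 1 (a non-multipart message yields only itself)
```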

In [80]:
for part in sample.walk():
    print(part)

Message-ID: <13886924.1075840348227.JavaMail.evans@thyme>
Date: Tue, 29 Jan 2002 06:55:31 -0800 (PST)
From: debra.bailey@enron.com
To: debra.bailey@enron.com, l..nicolay@enron.com, d..steffes@enron.com,
	bill.rust@enron.com, lloyd.will@enron.com, don.baughman@enron.com,
	corry.bentley@enron.com, w..donovan@enron.com, bill.abler@enron.com,
	reagan.rorschach@enron.com
Subject: Room ECS: 06106,   RE: Transmission Agreements Meeting Tues. Jan.
 29 10:00 a.m.
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Bailey, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=DBAILEY2>
X-To: Bailey, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Dbailey2>, Nicolay, Christi L. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Cnicola>, Steffes, James D. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Jsteffe>, Rust, Bill </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Brust>, Will, Lloyd </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lwill>, Baughman Jr., Don </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Dbaughm>, Bentley, Corry </O=ENRON/OU=N

The header field `X-Folder` contains the path to the mail. It is, however, in an awkward format and should be cleaned.
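
One possible cleaning step is sketched below. It assumes that `X-Folder` values are backslash-separated paths, possibly with stray quotes; this assumption should be checked against the real headers. Both the function name `split_x_folder` and the example value are made up.

```python
def split_x_folder(value):
    """Split an X-Folder value into its path components.

    Assumes backslash-separated components; empty pieces and
    surrounding single quotes are dropped.
    """
    parts = [part.strip().strip("'") for part in value.split("\\")]
    return [part for part in parts if part]


# Made-up value imitating the assumed layout.
example = "\\Some_Mailbox_Jan2002\\Lastname, Firstname\\'Sent Mail"
print(split_x_folder(example))
# ['Some_Mailbox_Jan2002', 'Lastname, Firstname', 'Sent Mail']
```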

In [74]:
email.iterators._structure(sample)

text/plain


The content of a mail is stored in the private attribute `_payload`; the public method `get_payload()` gives access to it.
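
Going through the public method is probably safer than reading the private attribute. Here is a sketch on an invented message that also covers the multipart case, where `get_payload()` returns a list of sub-messages instead of a string.

```python
import email

# Invented message used only for illustration.
raw = (
    "Subject: Hello\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Room is reserved\n"
)
msg = email.message_from_string(raw)

if msg.is_multipart():
    # Keep only the plain-text parts and join their payloads.
    body = "".join(
        part.get_payload()
        for part in msg.walk()
        if part.get_content_type() == "text/plain"
    )
else:
    # Non-multipart message: the payload is the body string itself.
    body = msg.get_payload()

print(body)
```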

In [59]:
dir(sample)

['__bytes__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_charset',
 '_default_type',
 '_get_params_preserve',
 '_headers',
 '_payload',
 '_unixfrom',
 'add_header',
 'as_bytes',
 'as_string',
 'attach',
 'defects',
 'del_param',
 'epilogue',
 'get',
 'get_all',
 'get_boundary',
 'get_charset',
 'get_charsets',
 'get_content_charset',
 'get_content_disposition',
 'get_content_maintype',
 'get_content_subtype',
 'get_content_type',
 'get_default_type',
 'get_filename',
 'get_param',
 'get_params',
 'get_payload',
 'get_unixfrom',
 'is_multipart',
 'items',
 'keys',
 'policy',

In [89]:
print(sample._payload)

Room 6106 at 10:00 is reserved 

 -----Original Message-----
From: 	Bailey, Debra  
Sent:	Tuesday, January 29, 2002 8:42 AM
To:	Nicolay, Christi L.; Steffes, James D.
Cc:	Rust, Bill; Will, Lloyd; Baughman Jr., Don; Bentley, Corry; Donovan, Terry W.; Abler, Bill; Rorschach, Reagan
Subject:	RE: Transmission Agreements Meeting Tues. Jan. 29 10:00 a.m.

Jim, Christi,

Are you available to attend this meeting today to discuss the going forward plan for transmission agreements in UBS.  

Debra

Debra Bailey
East Power Trading

 -----Original Message-----
From: 	Baughman Jr., Don  
Sent:	Monday, January 28, 2002 4:39 PM
To:	Rust, Bill; Bentley, Corry; Bailey, Debra; Donovan, Terry W.; Abler, Bill; Rorschach, Reagan
Cc:	Will, Lloyd
Subject:	Transmission Agreements Meeting Tues. Jan. 29 10:00 a.m.

Hey,

After speaking with Reagan in origination, we thought it would be a good idea to evaluate the status of New Transmission Agreements project.

If it pleases all, let's all get together from 10:0

Likewise, we can isolate the headers by reading the `_headers` attribute (the public method `items()` returns the same name/value pairs).

In [93]:
sample._headers

[('Message-ID', '<13886924.1075840348227.JavaMail.evans@thyme>'),
 ('Date', 'Tue, 29 Jan 2002 06:55:31 -0800 (PST)'),
 ('From', 'debra.bailey@enron.com'),
 ('To',
  'debra.bailey@enron.com, l..nicolay@enron.com, d..steffes@enron.com, \n\tbill.rust@enron.com, lloyd.will@enron.com, don.baughman@enron.com, \n\tcorry.bentley@enron.com, w..donovan@enron.com, bill.abler@enron.com, \n\treagan.rorschach@enron.com'),
 ('Subject',
  'Room ECS: 06106,   RE: Transmission Agreements Meeting Tues. Jan.\n 29 10:00 a.m.'),
 ('Mime-Version', '1.0'),
 ('Content-Type', 'text/plain; charset=us-ascii'),
 ('Content-Transfer-Encoding', '7bit'),
 ('X-From', 'Bailey, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=DBAILEY2>'),
 ('X-To',
  'Bailey, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Dbailey2>, Nicolay, Christi L. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Cnicola>, Steffes, James D. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Jsteffe>, Rust, Bill </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Brust>, Will, Lloyd </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lwill

Therefore, it would be relatively easy to aggregate the data in a dataframe (or a dataframe generator?). I would include the path to the file in the dataframe instead of using `X-Folder`. In any case, this info will be used by the webapp.
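
A rough sketch of the generator idea: walk the mail folder and yield one dict per file, with the relative path, the headers and the body. The function name `iter_mail_records`, the choice of `utf-8` with `errors="replace"`, and the whitespace normalisation of header values are all assumptions of mine, to be adapted once the real data has been inspected.

```python
import email
import os


def iter_mail_records(root):
    """Yield one dict per mail file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            # Assumption: replacing undecodable bytes is acceptable here.
            with open(path, "r", encoding="utf-8", errors="replace") as file:
                msg = email.message_from_file(file)
            record = {"path": os.path.relpath(path, root)}
            for name, value in msg.items():
                # Collapse whitespace runs, which also unfolds wrapped headers.
                # A repeated header name overwrites the earlier value.
                record[name] = " ".join(str(value).split())
            # Multipart bodies are left out of this sketch.
            record["body"] = None if msg.is_multipart() else msg.get_payload()
            yield record
```

If pandas is the dataframe library we end up using, `pandas.DataFrame(list(iter_mail_records("./maildir")))` should give one row per mail, with the `path` column playing the role of `X-Folder`.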