# Preprocessing

## Preprocessing reference code

The first cell below is the basic preprocessing as we call it for final preprocessing.  There is also a py script to perform the same.  This portion below is used for partial processing and as reference code.

## Preprocessing experimental data

The preprocessing code below is to generate data for preprocessing experiments. Standard preprocessing will remain in place for all emails:
1. Remove standard email addresses
1. Remove formatting including
    1. Visual formatting (e.g. =02, =09, \n\n)
    1. HTML tags
1. Remove stopwords
1. Expand contractions
1. Lemmatize words (i.e. change them into their root word)
1. Removal of single letters and possible double letter words.

We are interested in the following experiments in preprocessing:

1. The effect of leaving common names in an email.
1. Special formats of email addresses that does not conform to external email address notation.
1. Detecting single concept words and tying them together in the dictionary.
1. Specialised tokenization.
1. Filtering out "Forwaded" information from email bodies.

The code that follows cover incrementally adding the filtering as described for the experiments.  The experiments are performed in relevant notebooks to the specific model under investigation.

### Build a file list
The first step is to build a file list in memory before commencing processing.

In [30]:
import os
import sys
import re
from tqdm import tqdm   #To display progress bars

# Import own defined functions and classes
print("Initialising the program")
modules_path = os.path.join('..','..','src','modules','')
sys.path.append(modules_path)
import eflp
import importlib
importlib.reload(eflp)

dir_list = ["allen-p","arnold-j","arora-h",
            "badeer-r","bailey-s","bass-e",
            "baughman-d","beck-s","benson-r",
            "blair-l","brawner-s","buy-r",
            "campbell-l","lay-k","skilling-j"]

src_data_root = os.path.join("..","..","data","raw")
mail_src = os.path.join(src_data_root,"maildir","allen-p")

# Load our custom efpl library developed for this project
email = eflp.Email_Forensic_Processor()

# Define a helper function to construct the file list
def build_file_list(src, type = ""):
    # Initialise the list
    src_dst = []
    with tqdm() as pbar:
        for dir_path, dirs, files in os.walk(src):
            src_path = dir_path
            for file in files:
                #print(file)
                if not re.search(r'^\.',file):      # Ignore hidden files in Unix
                    file_src_path = os.path.join(src_path,file)
                    file_dst_path = file_src_path.replace("/raw/","/processed/")
                    if type == "experimental":
                        file_dst_path = file_dst_path.replace("/maildir/","/experimental_data/")
                    file_dst_path = file_dst_path + "json"
                    src_dst.append((file_src_path,file_dst_path))
                    pbar.update(1)
                        #print("   ",file_src_path,file_dst_path)
    return src_dst


# Build a list of the multex.com files
def build_file_list_multex(filename, type = ""):
    # Initialise the list
    src_dst = []
    file_list = open(src_data_root + "/" + filename,"r")
    file_number = 0
    
    with tqdm() as pbar:
        for file in file_list:
            file_number = file_number + 1
            src_dst.append(("../../data/raw/" + file.strip(),"../../data/processed/Multex/Multex_" + str(file_number) + ".json"))
    pbar.update(1)
    file_list.close()
    return src_dst




src_dst = build_file_list(mail_src, type = "experimental")

Multex_list = build_file_list_multex("Multex_files.txt")

Initialising the program
Importing Spacy
Loading encore web
Initialisation of eflp complete.


3034it [00:00, 19693.88it/s]
0it [00:00, ?it/s]


## Standard full pre-process

Run a standard pre-process on the files.

In [21]:
# Preprocess the emails and store them

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        if os.path.exists(file_pair[1]):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0])
            email.saveMail(file_pair[1])
        pbar.update(1)

3034it [06:08,  8.23it/s]                                                       


## Very basic preprocess

The very basic pre-process. This is achieved by a special call to the class to not invoke pre-processing, and then a manual call to the class to invoke a basic pre-process, followed by a call to the class to finalise the basic pre-process. The output filename is modified with a pre-prend string:

very_basic_xxx.json


In [22]:
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "very_Basic_"

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="very_basic")
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [07:28,  6.76it/s]                                                       


## Basic preprocess

The basic pre-process.  This is achieved by a special call to the class to not invoke pre-processing, and then a manual call to the class to invoke a basic pre-process, followed by a call to the class to finalise the basic pre-process.  The output filename is modified with a pre-prend string:
- basic_xxx.json

In [24]:
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Basic_"

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="basic")
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [01:34, 32.11it/s]                                                       


## Filtering special email addresses

The internal email address representation is not standard.  A standard email is of the form:
- name@domain.parts

However, the javamailer seem to have an internal representation of the form:
- name/domain/structure@company

The below code performs basic pre-processing, and then additionally filters this special email address format using a specially crafted regular expression.  The destination filename is pre-pended with:
- Mailer_

In [25]:
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Mailer_"

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="mailer")
            #email.remove_patterns(pattern_list = [eflp.EMAIL_ENRON,eflp.TWO_LETTERS])
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [02:12, 22.93it/s]                                                       


## Filtering names

The name of the mailbox owner may dominate, en therefore should potentially be filtered out.  This is becuase all mails will either address the email box owner, or signed by the email box owner at the end.  Filtering of names may or may not be required, dependend on the final model.

The destination filename is pre-pended with:

- Name_

In [26]:
import re
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Name_"

#Define a regular expression to filer the name.
NAME_REGEX = "[P,p]hillip|[A,a]llen"
NAME = re.compile(NAME_REGEX,flags=re.IGNORECASE)

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="name")
            #email.remove_patterns(pattern_list = [eflp.EMAIL_ENRON,eflp.TWO_LETTERS])
            #email.remove_patterns(pattern_list = [NAME,eflp.TWO_LETTERS])
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [03:18, 15.30it/s]                                                       


## Final full pre-process

The final code builds a full pre-process. Refer to the dissertation, or the code in efpl.py in the src folder for the full details.

The destination filename is pre-pended with:

- Full_

In [32]:
import re
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Full_"

#Define a regular expression to filer the name.
#NAME_REGEX = "[P,p]hillip|[A,a]llen"
#NAME = re.compile(NAME_REGEX,flags=re.IGNORECASE)

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="full")
            email.saveMail(output_file)
        pbar.update(1)

3034it [12:57,  3.90it/s]                                                       


## Final full pre-process for POS

For experiment 2, we also compare the Part of Speech (POS) tagging as a way to extract features. The below cell pre-processes the mails, extracting nouns, proper nouns and verbs as features. 

The destination filename is pre-pended with:

- POS_

In [33]:
import re
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "POS_"

#Define a regular expression to filer the name.
#NAME_REGEX = "[P,p]hillip|[A,a]llen"
#NAME = re.compile(NAME_REGEX,flags=re.IGNORECASE)

with tqdm(total=(len(src_dst) - 1)) as pbar:
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="full_pos")
            email.saveMail(output_file)
        pbar.update(1)

3034it [24:13,  2.09it/s]                                                       


## Datasets for additional tests

The code that follows creates subset datasets for additional testing of LDA concepts. The main theory is that by controlling the dictionary with a careful seelction of subsets of data, the LDA algorithm will be better "tuned" on specific topics. The full pre-process is run on the subsets in preparation for the LDA algorithm to be applied.

### Newsletter
The Multex newsletter was identified to be of specific topics. The filenames were extracted by a simple unix command search and saved in Multex_filenames.txt. 

In [28]:
with tqdm(total=(len(Multex_list) - 1)) as pbar:
    for file_pair in Multex_list:
        # Construct a new output file name.
        output_file = file_pair[1]
        if os.path.exists(output_file):
            #print("file exists")
            #print(output_file)
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="full")
            email.saveMail(output_file)
        pbar.update(1)

149it [00:00, 6289.32it/s]                                                      





_________________________________________________________
# End Notebook
________________
