# Preprocessing

## Preprocessing reference code

The first cell below contains the basic preprocessing call as it is used for the final preprocessing. A Python script that performs the same task also exists. The rest of this section is used for partial processing and as reference code.

## Preprocessing experimental data

The preprocessing code below generates data for the preprocessing experiments. The standard preprocessing steps remain in place for all emails:
1. Remove standard email addresses
1. Remove formatting including
    1. Visual formatting (e.g. =02, =09, \n\n)
    1. HTML tags
1. Remove stopwords
1. Expand contractions
1. Lemmatize words (i.e. change them into their root word)
1. Remove single-letter and, possibly, two-letter words.
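
The regex-based steps in this list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the patterns, the tiny stopword set and the contraction table are hypothetical stand-ins, not the ones used by the `eflp` module, and lemmatisation is left out because it needs a language model.

```python
import re

# Illustrative patterns only; the eflp module's own patterns are not shown here.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SOFT_BREAK = re.compile(r"=\n")            # quoted-printable soft line break
QP_ESCAPE = re.compile(r"=[0-9A-F]{2}")    # quoted-printable escapes such as =09
HTML_TAG = re.compile(r"<[^>]+>")
SHORT_WORD = re.compile(r"\b\w{1,2}\b")

# Tiny stand-ins for a real stopword list and contraction table.
STOPWORDS = {"the", "a", "an", "is", "to", "at", "of", "and"}
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}


def basic_clean(text):
    """Apply the regex-based steps in order and return a list of tokens."""
    text = SOFT_BREAK.sub("", text)
    text = QP_ESCAPE.sub(" ", text)
    text = EMAIL.sub(" ", text)
    text = HTML_TAG.sub(" ", text)
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = SHORT_WORD.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "<b>Don't</b> mail bob@example.com at 5=09the report is ready"
print(basic_clean(sample))
```

Note the side effect of the short-word filter: "don't" expands to "do not", and "do" is then removed as a two-letter word, so the order of the steps matters.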

We are interested in the following experiments in preprocessing:

1. The effect of leaving common names in an email.
1. Special email address formats that do not conform to the standard external email address notation.
1. Detecting single concept words and tying them together in the dictionary.
1. Specialised tokenization.
1. Filtering out "Forwarded" information from email bodies.

The code that follows incrementally adds the filtering described for these experiments. The experiments themselves are performed in the notebooks for the specific model under investigation.
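
Every experiment in this notebook repeats the same pattern: build a prefixed output name, skip files that already exist, otherwise process and save. A minimal sketch of that pattern as a reusable helper is given here. The names `run_experiment` and `copy_upper` are hypothetical, and `copy_upper` merely stands in for a real pre-processing pipeline.

```python
import os
import tempfile


def run_experiment(src_dst, prefix, process):
    """Call process(src, out) for every pair whose prefixed output is missing.

    Returns the number of files that were actually written.
    """
    written = 0
    for src, dst in src_dst:
        out = os.path.join(os.path.dirname(dst), prefix + os.path.basename(dst))
        if os.path.exists(out):
            continue                      # already processed in an earlier run
        os.makedirs(os.path.dirname(out), exist_ok=True)
        process(src, out)
        written += 1
    return written


def copy_upper(src, dst):
    # Hypothetical stand-in for a real pre-processing pipeline.
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.read().upper())


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "1.txt")
    with open(src, "w") as f:
        f.write("hello")
    pairs = [(src, os.path.join(tmp, "out", "1.json"))]
    first = run_experiment(pairs, "Basic_", copy_upper)
    second = run_experiment(pairs, "Basic_", copy_upper)
    print(first, second)
```

The second call writes nothing because the output already exists, which is what makes an interrupted run resumable.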

### Build a file list
The first step is to build a file list in memory before commencing processing.

In [2]:
import os
import sys
import re
from tqdm import tqdm   #To display progress bars

# Import own defined functions and classes
print("Initialising the program")
modules_path = os.path.join('..','..','src','modules','')
sys.path.append(modules_path)
import eflp

dir_list = ["allen-p","arnold-j","arora-h",
            "badeer-r","bailey-s","bass-e",
            "baughman-d","beck-s","benson-r",
            "blair-l","brawner-s","buy-r",
            "campbell-l","lay-k","skilling-j"]

src_data_root = os.path.join("..","..","data","raw")
#mail_src = os.path.join(src_data_root,"maildir","allen-p","_sent_mail")
#mail_src = os.path.join(src_data_root,"maildir","allen-p","all_documents")
mail_src = os.path.join(src_data_root,"maildir","allen-p")


email = eflp.Email_Forensic_Processor()

# Define a helper function to construct the file list
def build_file_list(src, type = ""):
    # Initialise the list
    src_dst = []
    with tqdm() as pbar:
        for dir_path, dirs, files in os.walk(src):
            src_path = dir_path
            for file in files:
                #print(file)
                if not re.search(r'^\.',file):      # Ignore hidden files in Unix
                    file_src_path = os.path.join(src_path,file)
                    file_dst_path = file_src_path.replace("/raw/","/processed/")
                    if type == "experimental":
                        file_dst_path = file_dst_path.replace("/maildir/","/experimental_data/")
                    # Raw Enron file names end in a dot (e.g. "1."), so appending "json" yields "1.json".
                    file_dst_path = file_dst_path + "json"
                    src_dst.append((file_src_path,file_dst_path))
                    pbar.update(1)
                        #print("   ",file_src_path,file_dst_path)
    return src_dst

    
src_dst = build_file_list(mail_src, type = "experimental")



Initialising the program
Importing Spacy
Loading encore web
Initialisation of eflp complete.


3034it [00:00, 18512.86it/s]
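
The destination paths above are derived with `str.replace` on `"/raw/"` and `"/maildir/"`, which relies on forward slashes as the path separator. A possible alternative, sketched here under the assumption that raw files are named like `1.` (as the `520.json` path used later suggests), swaps whole path components instead. `map_destination` is a hypothetical name; `PurePosixPath` is used so the example behaves the same on every platform, and `pathlib.Path` would be the choice for a real run.

```python
from pathlib import PurePosixPath


def map_destination(src, experimental=False):
    """Swap the 'raw' (and optionally 'maildir') component and append 'json'."""
    swaps = {"raw": "processed"}
    if experimental:
        swaps["maildir"] = "experimental_data"
    parts = [swaps.get(part, part) for part in PurePosixPath(src).parts]
    # File names end in a dot ("1."), so appending "json" gives "1.json".
    return str(PurePosixPath(*parts)) + "json"


print(map_destination("../../data/raw/maildir/allen-p/inbox/1.", experimental=True))
```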


## Standard full pre-process

Run a standard pre-process on the files.

In [5]:
# Preprocess the emails and store them

with tqdm(total=len(src_dst)) as pbar:   # one update per file, so the total is the list length
    for file_pair in src_dst:
        if os.path.exists(file_pair[1]):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0])
            email.saveMail(file_pair[1])
        pbar.update(1)

3034it [02:51, 17.69it/s]                                                                                                                                                                                 


## Basic preprocess

The basic pre-process is achieved by a special call to the class that does not invoke pre-processing, followed by a manual call that invokes a basic pre-process and a final call that finalises it. The output file name is prepended with a prefix string:
- Basic_xxx.json
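
The prefix is what lets an experiment pick out the matching set of pre-processed files later. A minimal sketch of such a lookup follows. `find_prefixed` is a hypothetical helper, and the file names are created in a temporary directory purely for illustration.

```python
import os
import tempfile


def find_prefixed(root, prefix):
    """Return sorted paths under root whose file name starts with prefix."""
    matches = []
    for dir_path, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith(prefix) and name.endswith(".json"):
                matches.append(os.path.join(dir_path, name))
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("Basic_1.json", "Mailer_1.json", "Basic_2.json"):
        open(os.path.join(tmp, name), "w").close()
    basic_files = [os.path.basename(p) for p in find_prefixed(tmp, "Basic_")]
    print(basic_files)
```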

In [6]:
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Basic_"

with tqdm(total=len(src_dst)) as pbar:   # one update per file, so the total is the list length
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="basic")
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [05:24,  9.34it/s]                                                                                                                                                                                 


## Filtering special email addresses

The internal email address representation is not standard. A standard email address has the form:
- name@domain.parts

However, the javamailer seems to use an internal representation of the form:
- name/domain/structure@company

The code below performs basic pre-processing and then additionally filters this special email address format using a specially crafted regular expression. The file name is prepended with:
- Mailer_
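
To make the internal address format concrete, the sketch below strips it from a recipient string built from addresses that appear in the mailbox. The pattern is a simplified illustration and an assumption on my part, not the `eflp.EMAIL_ENRON` pattern: it matches one or more `/part` segments followed by `@company`, or a bare `@ENRON` suffix.

```python
import re

# Simplified, illustrative pattern; not the pattern used by eflp.
INTERNAL_ADDRESS = re.compile(r"(?:/\w+)+@\w+|@ENRON\b")

recipients = "Alan Comnes/PDX/ECT@ECT, Edward Sacks/Corp/Enron@ENRON, Susan J Mara@ENRON"
print(INTERNAL_ADDRESS.sub("", recipients))
```

The pattern is deliberately case-sensitive: with `re.IGNORECASE` the `@ENRON` alternative would also cut into standard addresses such as `name@enron.com`, so standard addresses should be removed first or handled separately.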

In [7]:
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Mailer_"

with tqdm(total=len(src_dst)) as pbar:   # one update per file, so the total is the list length
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="basic")
            email.remove_patterns(pattern_list = [eflp.EMAIL_ENRON,eflp.TWO_LETTERS])
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [04:37, 10.93it/s]                                                                                                                                                                                 


## Filtering names

The name of the mailbox owner may dominate and should therefore potentially be filtered out. This is because every mail is either addressed to the mailbox owner or signed by the mailbox owner at the end. Whether names need to be filtered depends on the final model.

The file name is prepended with:

- Name_
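
A name filter can be built from a list of names rather than written by hand. The sketch below is one possible approach, with `build_name_pattern` as a hypothetical helper. It adds word boundaries, which is a design choice of this sketch, so that a name is not cut out of a longer word. Note that a character class such as `[P,p]` matches a literal comma as well as the two letters, so `re.IGNORECASE` is the simpler way to handle capitalisation.

```python
import re


def build_name_pattern(names):
    """Compile a case-insensitive, whole-word pattern for the given names."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)


NAME = build_name_pattern(["Phillip", "Allen"])
print(NAME.sub("", "Thanks, phillip.  Call Allen or Allenby."))
```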

In [8]:
import re
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Name_"

# Define a regular expression to filter the name.
# re.IGNORECASE handles capitalisation, so no character classes are needed.
NAME_REGEX = r"phillip|allen"
NAME = re.compile(NAME_REGEX,flags=re.IGNORECASE)

with tqdm(total=len(src_dst)) as pbar:   # one update per file, so the total is the list length
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="basic")
            email.remove_patterns(pattern_list = [eflp.EMAIL_ENRON,eflp.TWO_LETTERS])
            email.remove_patterns(pattern_list = [NAME,eflp.TWO_LETTERS])
            email.finalise_preprocess()
            email.saveMail(output_file)
        pbar.update(1)

3034it [04:59, 10.14it/s]                                                                                                                                                                                 


## Final full pre-process

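Run the full pre-process on every file. Pre-processing is skipped when the mail is loaded and is then invoked explicitly with `type="full"`.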

The file is pre-pended with:

- Full_

In [3]:
import re
# Define a pre-pend to add to the file name so that experiments can extract the correct pre-processed files. 
pre_pend = "Full_"

#Define a regular expression to filter the name.
#NAME_REGEX = "[P,p]hillip|[A,a]llen"
#NAME = re.compile(NAME_REGEX,flags=re.IGNORECASE)

with tqdm(total=len(src_dst)) as pbar:   # one update per file, so the total is the list length
    for file_pair in src_dst:
        # Construct a new output file name.
        output_file = os.path.join(os.path.dirname(file_pair[1]),pre_pend + os.path.basename(file_pair[1]))
        if os.path.exists(output_file):
            #print("file exists")
            pass
        else:
            #print(file_pair[0])
            email.initMail(file_pair[0], preProcess = False)
            email.preProcess(type="full")
            email.saveMail(output_file)
        pbar.update(1)

3034it [02:10, 23.33it/s]                                                                                                                                                                                 





_________________________________________________________
# End Notebook
________________


## Special experimentation section

In the section below we perform general experiments with code.  This will be deleted in the final notebook.

In [7]:
import re
import json


################ Function definitions ###############
# A helper function which loads the json email and returns it as a dictionary
def loadMail(filename):
    with open (filename, "r") as inputFile:
        return json.load(inputFile)
################ End Function definitions ###############

email = loadMail("../../data/processed/maildir/allen-p/_sent_mail/520.json")
#print(email)

FORMATTING_REGEX = r"=\n|=\d+"

EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

ENRON_EMAIL_REGEX = r"/\w+/\w+/\w+@\w+|/\w+/\w+@\w+|/\w+@\w+|/\w+/Enron Communications@Enron Communication|/HOU/ECT|/NA/Enron|@ENRON"


NAME_REGEX_1 = "[P,p]hillip"
NAME_REGEX_2 = "[A,a]llen k"
NAME_REGEX_3 = "[A,a]llen."

HTML_REGEX = r"<.*?>"
CHARACTERS_REGEX = r"[,_]"
BELONG_REGEX = r"'s"
TWO_LETTERS_REGEX = r"\b[\w]{1,2}\b"

#print(FORMATTING_REGEX)
FORMATTING = re.compile(FORMATTING_REGEX,flags=re.IGNORECASE)
EMAIL = re.compile(EMAIL_REGEX,flags=re.IGNORECASE)
EMAIL_ENRON = re.compile(ENRON_EMAIL_REGEX,flags=re.IGNORECASE)
HTML = re.compile(HTML_REGEX,flags = re.IGNORECASE)
CHARACTERS = re.compile(CHARACTERS_REGEX,flags = re.IGNORECASE)
TWO_LETTERS = re.compile(TWO_LETTERS_REGEX,flags = re.IGNORECASE)
URL = re.compile(URL_REGEX, flags = re.IGNORECASE)


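def remove_patterns(text, pattern_list = None, flags = None):
    """Remove every match of each compiled pattern in pattern_list from text.

    The flags argument is currently unused.
    """
    if pattern_list is None:
        return text
    for pattern in pattern_list:
        text = pattern.sub('', text)
    return text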
#print(email.keys())
print(email['body'])
pattern_list = [FORMATTING,EMAIL,URL,EMAIL_ENRON]

#sample = "This is a example of an sentence with two and one letter s s s"
#sample_url = "http://www.up.za"
#print()
#print(TWO_LETTERS.sub('',sample))
processed_body = remove_patterns(email['body'],pattern_list = pattern_list)
#processed_body = remove_patterns(email.body,pattern_list = pattern_list)
print("\n\n")
print(processed_body)

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 02/07/2001 
07:14 AM ---------------------------


Susan J Mara@ENRON
02/06/2001 04:12 PM
To: Alan Comnes/PDX/ECT@ECT, Angela Schwarz/HOU/EES@EES, Beverly 
Aden/HOU/EES@EES, Bill Votaw/HOU/EES@EES, Brenda Barreda/HOU/EES@EES, Carol 
Moffett/HOU/EES@EES, Cathy Corbin/HOU/EES@EES, Chris H Foster/HOU/ECT@ECT, 
Christina Liscano/HOU/EES@EES, Christopher F Calger/PDX/ECT@ECT, Craig H 
Sutter/HOU/EES@EES, Dan Leff/HOU/EES@EES, Debora Whitehead/HOU/EES@EES, 
Dennis Benevides/HOU/EES@EES, Don Black/HOU/EES@EES, Dorothy 
Youngblood/HOU/ECT@ECT, Douglas Huth/HOU/EES@EES, Edward 
Sacks/Corp/Enron@ENRON, Eric Melvin/HOU/EES@EES, Erika Dupre/HOU/EES@EES, 
Evan Hughes/HOU/EES@EES, Fran Deltoro/HOU/EES@EES, Frank W 
Vickers/HOU/ECT@ECT, Gayle W Muench/HOU/EES@EES, Ginger 
Dernehl/NA/Enron@ENRON, Gordon Savage/HOU/EES@EES, Harold G 
Buchanan/HOU/EES@EES, Harry Kingerski/NA/Enron@ENRON, Iris Waser/HOU/EES@EES, 
James D Steffes/NA/Enron@ENRO

# URL improvement

In [8]:
import re
import json


################ Function definitions ###############
# A helper function which loads the json email and returns it as a dictionary
def loadMail(filename):
    with open (filename, "r") as inputFile:
        return json.load(inputFile)
################ End Function definitions ###############

email = loadMail("../../data/processed/maildir/allen-p/_sent_mail/520.json")
#email = loadMail("../../data/processed/maildir/allen-p/_sent_mail/2.json")



FORMATTING_REGEX = r"=\n|=\d+"

EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

ENRON_EMAIL_REGEX = r"/\w+/\w+/\w+@\w+|/\w+/\w+@\w+|/\w+@\w+|/\w+/Enron Communications@Enron Communication|/HOU/\w+|/NA/Enron|@ENRON"


NAME_REGEX_1 = "[P,p]hillip"
NAME_REGEX_2 = "[A,a]llen k"
NAME_REGEX_3 = "[A,a]llen."

NAME_REGEX = "phillip|allen"   # re.IGNORECASE below handles capitalisation
NAME = re.compile(NAME_REGEX, flags = re.IGNORECASE)

HTML_REGEX = r"<.*?>"
CHARACTERS_REGEX = r"[,_]"
BELONG_REGEX = r"'s"
TWO_LETTERS_REGEX = r"\b[\w]{1,2}\b"

#print(FORMATTING_REGEX)
FORMATTING = re.compile(FORMATTING_REGEX,flags=re.IGNORECASE)
EMAIL = re.compile(EMAIL_REGEX,flags=re.IGNORECASE)
EMAIL_ENRON = re.compile(ENRON_EMAIL_REGEX,flags=re.IGNORECASE)
HTML = re.compile(HTML_REGEX,flags = re.IGNORECASE)
CHARACTERS = re.compile(CHARACTERS_REGEX,flags = re.IGNORECASE)
TWO_LETTERS = re.compile(TWO_LETTERS_REGEX,flags = re.IGNORECASE)
URL = re.compile(URL_REGEX, flags = re.IGNORECASE)


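def remove_patterns(text, pattern_list = None, flags = None):
    """Remove every match of each compiled pattern in pattern_list from text.

    The flags argument is currently unused.
    """
    if pattern_list is None:
        return text
    for pattern in pattern_list:
        text = pattern.sub('', text)
    return text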


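def remove_justify(text):
    """Join lines that were hard-wrapped at a fixed width.

    Line lengths include the trailing newline. The pattern below skips blank
    lines and any final line that lacks a trailing newline.
    """
    new_text = ""
    lines = re.findall(r".+\n", text)
    for line in lines:
        if len(line) in (79, 75):
            # Wrapped line: drop the newline so it joins the next line.
            line = line.replace("\n", "")
        elif len(line) == 77:
            # Soft line break: drop the trailing "=" together with the newline.
            line = line.replace("=\n", "")
        new_text += line
    return new_text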

def remove_forward(text):
    """Remove a forwarded-message header, from the banner up to "Subject:"."""
    forward_header = r"(-+ Forwarded by .+Subject:)"
    if re.search(forward_header, text, flags=re.DOTALL) is not None:
        text = re.sub(forward_header, "", text, flags=re.DOTALL)
        # Drop the first remaining line (the rest of the subject line when the
        # header opens the text).
        text = re.sub(r"^.+\n", "", text)
    return text

 
#print(email['body'])
cleaned = remove_justify(email['body'])
#print(re.search(r"-+ Forwarded by .+",cleaned))
#print(cleaned)
#print(re.search(r"(-+ Forwarded by)",cleaned))
cleaned = remove_forward(cleaned)
print(cleaned)

#print(cleared)

                     
#print(email['body'])
#pattern_list = [FORMATTING,EMAIL,URL,EMAIL_ENRON]
#pattern_list = [URL]

#url = "http://www.governor.ca.gov/state/govsite/gov_htmldisplay.jsp?BV_SessionID=@@@@1673762879.0981503886@@@@&BV_EngineID=falkdgkgfmhbemfcfkmchcng.0&sCatTitle=Press+Release&sFilePath=/govsite/press_release/2001_02/20010206_PR01049_longtermcontracts.html&sTitle=GOVERNOR+DAVIS+ANNOUNCES+LONG+TERM+POWER+SUPPLY&iOID=13250"
#print(url)
#print(remove_patterns(url, pattern_list = pattern_list))
#sample = "This is a example of an sentence with two and one letter s s s"
#sample_url = "http://www.up.za"
#print()
#print(TWO_LETTERS.sub('',sample))
#processed_body = remove_patterns(email['body'],pattern_list = pattern_list)
#processed_body = remove_patterns(email.body,pattern_list = pattern_list)
#print("\n\n")
#print(processed_body)

Here is a link to the governor's press release.  He is billing it as 5,000 MW of contracts, but then he says that there is only 500 available immediately.  WIth the remainder available from 3 to 10 years.
http://www.governor.ca.gov/state/govsite/gov_htmldisplay.jsp?BV_SessionID=@@@@1673762879.0981503886@@@@&BV_EngineID=falkdgkgfmhbemfcfkmchcng.0&sCatTitle=Press+Release&sFilePath=/govsite/press_release/2001_02/20010206_PR01049_longtermcontracts.html&sTitle=GOVERNOR+DAVIS+ANNOUNCES+LONG+TERM+POWER+SUPPLY&iOID=13250
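
The `remove_forward` helper above uses a greedy `.+` with `re.DOTALL`, so the match runs to the last `Subject:` in the text, which can swallow body text if that word appears again later. A possible variant, sketched here on a synthetic sample modelled on the header shown earlier (the subject text is invented), uses a non-greedy match and removes the rest of the subject line in the same pattern. `FORWARD_HEADER` and `strip_forward_header` are hypothetical names.

```python
import re

# Non-greedy match from the "Forwarded by" banner to the end of the first
# following "Subject:" line.
FORWARD_HEADER = re.compile(r"-+ Forwarded by .+?Subject:[^\n]*\n", flags=re.DOTALL)


def strip_forward_header(text):
    """Remove each forwarded-message header block, keeping the message text."""
    return FORWARD_HEADER.sub("", text)


sample = (
    "---------------------- Forwarded by Phillip K Allen/HOU/ECT on 02/07/2001\n"
    "07:14 AM ---------------------------\n"
    "Susan J Mara@ENRON\n"
    "To: Alan Comnes/PDX/ECT@ECT\n"
    "Subject: Press release\n"
    "Here is a link.\n"
    "Subject: this later line must survive\n"
)
print(strip_forward_header(sample))
```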



In [9]:
token_specification = [
    ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
    ('ASSIGN',   r':='),           # Assignment operator
    ('END',      r';'),            # Statement terminator
    ('ID',       r'[A-Za-z]+'),    # Identifiers
    ('OP',       r'[+\-*/]'),      # Arithmetic operators
    ('NEWLINE',  r'\n'),           # Line endings
    ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
    ('MISMATCH', r'.'),            # Any other character
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)


In [10]:
print(token_specification)
print(tok_regex)

[('NUMBER', '\\d+(\\.\\d*)?'), ('ASSIGN', ':='), ('END', ';'), ('ID', '[A-Za-z]+'), ('OP', '[+\\-*/]'), ('NEWLINE', '\\n'), ('SKIP', '[ \\t]+'), ('MISMATCH', '.')]
(?P<NUMBER>\d+(\.\d*)?)|(?P<ASSIGN>:=)|(?P<END>;)|(?P<ID>[A-Za-z]+)|(?P<OP>[+\-*/])|(?P<NEWLINE>\n)|(?P<SKIP>[ \t]+)|(?P<MISMATCH>.)
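
The combined `tok_regex` above is built but not yet applied. As a sketch of how the same named-group technique might serve the specialised tokenization experiment, the example below swaps in token classes suited to email text and walks the input with `re.finditer`. The token classes and their patterns are assumptions for illustration, not a tested specification.

```python
import re

# Hypothetical token classes for email text; order matters because the first
# matching alternative wins.
token_specification = [
    ("EMAIL",    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("URL",      r"https?://\S+"),
    ("NUMBER",   r"\d+(?:\.\d+)?"),
    ("WORD",     r"[A-Za-z]+(?:'[A-Za-z]+)?"),
    ("SKIP",     r"\s+"),
    ("MISMATCH", r"."),
]
tok_regex = "|".join("(?P<%s>%s)" % pair for pair in token_specification)


def tokenize(text):
    """Yield (kind, value) pairs, dropping whitespace and unmatched characters."""
    for match in re.finditer(tok_regex, text):
        kind = match.lastgroup
        if kind not in ("SKIP", "MISMATCH"):
            yield kind, match.group()


tokens = list(tokenize("Mail bob@example.com: 500 MW, see http://www.up.za"))
print(tokens)
```

Because each token carries its class, later steps can drop or keep whole classes (for example all `EMAIL` and `URL` tokens) without further regular expressions.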
