In [1]:
import os
import pandas as pd
pd.options.display.max_colwidth = 160

import preprocessing as util
from raw_utils import save_to_csv

In [2]:
# Path
cwd = os.getcwd()
csv_path = os.path.join(cwd, 'data/csv/')

data_files = ['balanced.csv', 'imbalanced.csv']

In [3]:
balanced = pd.read_csv(os.path.join(csv_path, data_files[0]), index_col=0, dtype={'body': 'object', 'class': 'bool', 'id': 'int16'})
balanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3600 entries, 0 to 3599
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3600 non-null   int16 
 1   body    3600 non-null   object
 2   class   3600 non-null   bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 66.8+ KB


In [4]:
imbalanced = pd.read_csv(os.path.join(csv_path, data_files[1]), index_col=0, dtype={'body': 'object', 'class': 'bool', 'id': 'int16'})
imbalanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18792 entries, 0 to 18791
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      18792 non-null  int16 
 1   body    18792 non-null  object
 2   class   18792 non-null  bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 348.7+ KB


### Initial Data

This is the initial state of the data, along with some representative examples of phishing and legitimate emails.

In [5]:
balanced.head(20)

Unnamed: 0,id,body,class
0,0,"Dear jose,\nThis is a general notice from monkey.org support team.\nWe have upgraded your email security due to new virus attacks.\nKindly verify your emai...",True
1,1,Just wanted to let everyone know that I will be on vacation the last week of \nDecember as well.\n\nLeslie\n\n\n\n\tBrenda Whitehead\n\t11/30/2000 02:38 PM\...,False
2,2,"and I thought Ellie was fat!!!\n\n:-)\n\n\n -----Original Message-----\nFrom: \tCantrell, Rebecca \nSent:\tFriday, May 04, 2001 10:36 AM\nTo:\tJulie A Gome...",False
3,3,"monkey.org Online Webmail App\nDear jose, Your email jose@monkey.org\nhas recently been suspended from the monkey.org\nonline data-base, please verify your ...",True
4,4,"Vince,\n?\nJust to let you know, the books will be shipped to both you and Rice \nUniversity tomorrow by express mail, which means the books should arrive ...",False
5,5,Mail Quota: (98% Full)\nAttention: jose@monkey.org\nYour email quota has reached 98% and will soon exceed its limit.\nFollow the URL below to upgrade your q...,True
6,6,"\r\n\r\n\r\n[image]\r\n\r\n\r\nDispatch Confirmation\r\n\r\n ID # LKG03-35352272\r\n\r\nHello jose@monkey.org<mailto:jose@monkey.org>,\r\n\r\n\r\nYour rece...",True
7,7,"\n Hello,\n \n Your Package has Arrived, Click on the link to Track status and pick up your Package with your tracking number below \nTRACK YO...",True
8,8,\n\t[IMAGE]\t\n\tPennWell Electric & Natural Gas Transmission Maps Click Here To Download Order Form And Additional Information \t\n\t[IMAGE]\t\n\tThe mo...,False
9,9,"We are migrating all email accounts to new Outlook Web App 2019, all active Account Holder are to upgrade to take effect automatically. This is done to impr...",True


#### Legitimate emails:

In [6]:
print(balanced['body'].iloc[1])

Just wanted to let everyone know that I will be on vacation the last week of 
December as well.

Leslie



	Brenda Whitehead
	11/30/2000 02:38 PM
		 
		 To: Elizabeth Sager/HOU/ECT@ECT, David Portz/HOU/ECT@ECT, Leslie 
Hansen/HOU/ECT@ECT, Janet H Moore/HOU/ECT@ECT, Shari Stack/HOU/ECT@ECT, 
Christian Yoder/HOU/ECT@ECT, Genia FitzGerald/HOU/ECT@ECT, Janice R 
Moore/HOU/ECT@ECT
		 cc: Sheri L Cromwell/HOU/ECT@ECT, Linda J Simmons/HOU/ECT@ECT, Becky 
Spencer/HOU/ECT@ECT, Kaye Ellis/HOU/ECT@ECT
		 Subject: December Calendar




Attached is the revised December Calendar for the Power Group.  Thanks.








In [7]:
print(balanced['body'].iloc[16])

jose,
Your package has arrived at our local post-office,
Please find
the link below to  download/print DHL-AWB
(Receipt) 
 as
proof to be given to courier agent upon arrival and your data
has been transmitted successfully.
<http://soc.or.id/wp-includes/ItfFG298dncsZQX/index.php?email=jose@monkey.org>
Inv&Pl 21x40HQ.pdf
<http://soc.or.id/wp-includes/ItfFG298dncsZQX/index.php?email=jose@monkey.org>
View
<http://soc.or.id/wp-includes/ItfFG298dncsZQX/index.php?email=jose@monkey.org>
|
Download
<http://soc.or.id/wp-includes/ItfFG298dncsZQX/index.php?email=jose@monkey.org>


In [8]:
print(balanced['body'].iloc[18])

Let's talk.


---------------------- Forwarded by Gerald Nemec/HOU/ECT on 01/17/2000 03:57 
PM ---------------------------


David Juist <DRJ@topsoe.com> on 01/12/2000 05:18:35 PM
To: Gerald Nemec/HOU/ECT@ECT
cc: Linda Payne <LJP@topsoe.com>, Niels Udengaard <NRU@topsoe.com> 
Subject: Agreement to Upgrade and Operate Natural Gas Pipeline Facilities



Dear Gerald:

In accordance with our conversation of yesterday regarding the Agreement
to Upgrade and Operate Natural Gas Pipeline Facilities (the "Upgrade
Agreement") I have reviewed the documents and investigated the location
of the metering station; the results are as follows:

* The existing sales contract provides quality and pressure
obligations.
* The existing sales contract provides for measurement of the gas.
* The existing sales contract provides for noncompliance with
quality obligations.
* As you indicated, the existing sales contract provides you have
no obligation to odorize the gas.
* The existing sales contract provides fo

#### Phishing Emails:

In [9]:
print(balanced['body'].iloc[6])




[image]


Dispatch Confirmation

  ID # LKG03-35352272

Hello jose@monkey.org<mailto:jose@monkey.org>,


Your recent order ( MacBook Pro) with us has been placed successfully!

The estimated delivery date is based on the handling time and the warehouse processing time. In certain cases, the estimated delivery date will vary.

You can check all the details for your order below. Thank you again for ordering from amazon.

Order Helpdesk : (800) 655-6099


Delivery by:

Thursday, June 10



Your package was shipped to:

Justin K
12321 W Doris Dr, Odessa,

TX 79764, USA

Your item(s) is (are) being sent by Priority Delivery Services.

Order summary

Item Subtotal:

$ 1545.90

Shipping & Handling:

Rs.0.00

Shipment Total:

$1545.90

P.S. If you haven't placed this order, Reach Account Support  Immediately on  (800) 655-6099

We hope to see you again soon!


This email was sent from a notification-only address that cannot accept inco

In [10]:
print(balanced['body'].iloc[11])

Nouvelle page 1
Sender's message: A message was not provided View Attachment
Please note:
Move email to inbox to
View 
Attachment
.


In [11]:
print(balanced['body'].iloc[17])





Notice of Policy Updates


Dear Customer,

Some information on your account appears to be missing or incorrect.
Please update your information promptly so that you can continue to enjoy all the benefits of your PayPal account.
If you don't update your information within 3 days, we'll limit what you can do with your PayPal account.

Click Here

If you need help logging in, go to our Help Center by clicking the Help link located in the upper right-hand corner of any PayPal page.

Sincerely,


Copyright © 2015 PayPal Inc. All rights reserved. PayPal is located at 2211 N. First St., San Jose, CA 95131.

Notice of Policy 
                  Updates
Dear Customer,
Some information 
      on your account appears to be missing or incorrect.
Please update your information promptly so that 
      you can continue to enjoy all the benefits of your PayPal account.
If you don't update your information within 3 
      days, we'll limit what you can do with your PayPal 
      account.
Click Here<h

In [12]:
print(balanced['body'].iloc[3240])

I have the Manual Entry process ready for testing again (testing as in testing by the analysts).  I would prefer to have it used by the analysts at least for a few days to iron out any more potential "bugs" before releasing it to users on the floor.  If the consensus is to have people on the floor be the testers, that is fine as well, as long as they know that the following URL points to a development server.

Without further ado, the link to the Manual Entry Page:

http://fundamentals.dev.corp.enron.com/admin/manualentry/

As stated in the above URL, this is located on the DEVELOPMENT web server.  Users should honestly never be directed to this URL, unless authorized by one of the site developers, since anything on it is subject to be currently in development (read: potentially broken).

Anyway, here are some things that have been added to the Manual Entry piece in this release:

"Previous" data now reflects information in the DB for the latest effective date for the selected cycle (n

Some observations:
- There is a lot of extra whitespace that can be sanitized later.
- Some emails still contain duplicated text, mostly because of the way the links from \<a\> tags were extracted.
- Email addresses and URLs can give away the class of the message (domain enron.com vs. domain monkey.org), so removing them should make the model more general.

# Preprocessing

We need to convert the text data into a format more suitable for use with machine learning algorithms.<br>
Since we aim for two different feature sets, the process will be split.

## Basic Preprocessing

These steps are applied to all the feature sets.

### Replacing addresses

As the examples show, many of the emails contain **web addresses** (URLs) or **email addresses**. These need to be removed so that the frequency of certain domains does not influence the results.<br>
So that this information is not lost completely, however, the addresses are replaced by the strings `'<urladdress>'` and `'<emailaddress>'` respectively. These strings are chosen because they do not normally occur in the emails.
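The implementation of `util.replace_email` and `util.replace_url` is not shown in this notebook. As a rough sketch of the idea, assuming simple regular expressions are good enough, the replacement could look like the following. The patterns and the sample URL are illustrative assumptions, not the project's actual code.

```python
import re

# Hypothetical patterns: deliberately simple, not full RFC-compliant matchers.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
URL_RE = re.compile(r'(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)


def replace_email(text: str) -> str:
    """Replace every email-like substring with the '<emailaddress>' placeholder."""
    return EMAIL_RE.sub('<emailaddress>', text)


def replace_url(text: str) -> str:
    """Replace every URL-like substring with the '<urladdress>' placeholder."""
    return URL_RE.sub('<urladdress>', text)


sample = ('Hello jose@monkey.org<mailto:jose@monkey.org>, '
          'follow http://example.com/upgrade to upgrade')
# Emails are replaced first, then URLs, in the same order as the notebook cells.
cleaned = replace_url(replace_email(sample))
```

Note that the surrounding markup (such as `<mailto:...>`) is left in place by this kind of replacement, so nested angle brackets can remain in the text.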

In [13]:
balanced['body'] = balanced['body'].apply(util.replace_email)
balanced['body'] = balanced['body'].apply(util.replace_url)

In [14]:
imbalanced['body'] = imbalanced['body'].apply(util.replace_email)
imbalanced['body'] = imbalanced['body'].apply(util.replace_url)

In [15]:
print(balanced['body'].iloc[6])




[image]


Dispatch Confirmation

  ID # LKG03-35352272

Hello <emailaddress><mailto:<emailaddress>>,


Your recent order ( MacBook Pro) with us has been placed successfully!

The estimated delivery date is based on the handling time and the warehouse processing time. In certain cases, the estimated delivery date will vary.

You can check all the details for your order below. Thank you again for ordering from amazon.

Order Helpdesk : (800) 655-6099


Delivery by:

Thursday, June 10



Your package was shipped to:

Justin K
12321 W Doris Dr, Odessa,

TX 79764, USA

Your item(s) is (are) being sent by Priority Delivery Services.

Order summary

Item Subtotal:

$ 1545.90

Shipping & Handling:

Rs.0.00

Shipment Total:

$1545.90

P.S. If you haven't placed this order, Reach Account Support  Immediately on  (800) 655-6099

We hope to see you again soon!


This email was sent from a notification-only address that cannot accept incomi

In [16]:
print(balanced['body'].iloc[17])





Notice of Policy Updates


Dear Customer,

Some information on your account appears to be missing or incorrect.
Please update your information promptly so that you can continue to enjoy all the benefits of your PayPal account.
If you don't update your information within 3 days, we'll limit what you can do with your PayPal account.

Click Here

If you need help logging in, go to our Help Center by clicking the Help link located in the upper right-hand corner of any PayPal page.

Sincerely,


Copyright © 2015 PayPal Inc. All rights reserved. PayPal is located at 2211 N. First St., San Jose, CA 95131.

Notice of Policy 
                  Updates
Dear Customer,
Some information 
      on your account appears to be missing or incorrect.
Please update your information promptly so that 
      you can continue to enjoy all the benefits of your PayPal account.
If you don't update your information within 3 
      days, we'll limit what you can do with your PayPal 
      account.
Click Here<<

The examples show that the URLs and email addresses have indeed been anonymized now.

## Preprocessing for content features

This preprocessing converts the text strings to lists of words, which will later be vectorized so that they can be used by machine learning algorithms.

In [17]:
balanced_tokens = balanced.copy()
imbalanced_tokens = imbalanced.copy()

### Tokenization and stopword removal

Tokenization is the process of splitting text into individual words. This is useful because, generally speaking, much of the meaning of a text can be recovered from the words it contains.<br>
As part of this process, letters are converted to lowercase and punctuation and other special characters are removed.<br>
Some words (called **stopwords**) contribute very little meaning (for example pronouns or simple verbs), so they can be removed to reduce the noise.
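The code of `util.tokenize` and `util.remove_stopwords` is not shown here. A minimal sketch of the two steps, assuming a regex tokenizer and a tiny hand-written stopword list (both are illustrative assumptions; a real pipeline would typically use a full stopword list from an NLP library), could be:

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {'a', 'an', 'the', 'is', 'are', 'has', 'have', 'been', 'with',
             'your', 'you', 'we', 'to', 'of', 'and', 'in', 'on', 'for'}

# A token is a run of letters/digits, optionally joined by single hyphens,
# so 'notification-only' stays one token.
TOKEN_RE = re.compile(r'[a-z0-9]+(?:-[a-z0-9]+)*')


def tokenize(text):
    """Lowercase the text and return its word-like tokens, dropping punctuation."""
    return TOKEN_RE.findall(text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


tokens = tokenize('Your recent order ( MacBook Pro) with us has been placed successfully!')
content_tokens = remove_stopwords(tokens)
```

Because the angle brackets are treated as punctuation, a placeholder such as `<emailaddress>` survives as the plain token `emailaddress`.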

In [18]:
balanced_tokens['body'] = balanced_tokens['body'].apply(util.tokenize)
balanced_tokens['body'] = balanced_tokens['body'].apply(util.remove_stopwords)

In [19]:
imbalanced_tokens['body'] = imbalanced_tokens['body'].apply(util.tokenize)
imbalanced_tokens['body'] = imbalanced_tokens['body'].apply(util.remove_stopwords)

In [20]:
print(balanced_tokens['body'].iloc[6])

['image', 'dispatch', 'confirmation', 'id', 'lkg03-35352272', 'hello', 'emailaddress', 'mailto', 'emailaddress', 'recent', 'order', 'macbook', 'pro', 'us', 'placed', 'successfully', 'estimated', 'delivery', 'date', 'based', 'handling', 'time', 'warehouse', 'processing', 'time', 'certain', 'cases', 'estimated', 'delivery', 'date', 'vary', 'check', 'details', 'order', 'thank', 'ordering', 'amazon', 'order', 'helpdesk', '800', '655-6099', 'delivery', 'thursday', 'june', '10', 'package', 'shipped', 'justin', 'k', '12321', 'w', 'doris', 'dr', 'odessa', 'tx', '79764', 'usa', 'item', 'sent', 'priority', 'delivery', 'services', 'order', 'summary', 'item', 'subtotal', 'shipping', 'handling', 'shipment', 'total', 'placed', 'order', 'reach', 'account', 'support', 'immediately', '800', '655-6099', 'hope', 'see', 'soon', 'email', 'sent', 'notification-only', 'address', 'accept', 'incoming', 'email', 'please', 'reply', 'message', 'dispatch', 'confirmation', 'id', 'lkg03-35352272', 'hello', 'emailadd

In [21]:
print(balanced_tokens['body'].iloc[17])

['notice', 'policy', 'updates', 'dear', 'customer', 'information', 'account', 'appears', 'missing', 'incorrect', 'please', 'update', 'information', 'promptly', 'continue', 'enjoy', 'benefits', 'paypal', 'account', 'update', 'information', 'within', '3', 'days', 'limit', 'paypal', 'account', 'click', 'need', 'help', 'logging', 'go', 'help', 'center', 'clicking', 'help', 'link', 'located', 'upper', 'right-hand', 'corner', 'paypal', 'page', 'sincerely', 'copyright', '2015', 'paypal', 'rights', 'reserved', 'paypal', 'located', '2211', 'first', 'san', 'jose', 'ca', 'notice', 'policy', 'updates', 'dear', 'customer', 'information', 'account', 'appears', 'missing', 'incorrect', 'please', 'update', 'information', 'promptly', 'continue', 'enjoy', 'benefits', 'paypal', 'account', 'update', 'information', 'within', '3', 'days', 'limit', 'paypal', 'account', 'click', 'urladdress', 'need', 'help', 'logging', 'go', 'help', 'center', 'clicking', 'help', 'link', 'located', 'upper', 'right-hand', 'corne

The example shows how a fairly large chunk of text was reduced to a smaller list that contains the more meaningful words. The addresses still exist as tokens ('urladdress').<br>
Also, emails with duplicated text will obviously have duplicate tokens; this, however, is not a big issue with most vectorizers.

### Lemmatization with POS tagging

Lemmatization reduces the inflected forms of a word to its base form (lemma). This is useful because all the inflections of a word are mapped to a single token, so the vocabulary becomes smaller and the dimensionality is reduced with little loss of information.<br>
To improve the lemmatization, **part-of-speech tagging** is used. The POS of a word (which indicates whether it is a noun, a verb, an adjective, or an adverb) is passed to the lemmatizer so that it can pick the correct base form.
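To illustrate why the part of speech matters, here is a toy sketch in which a small hand-written lookup table stands in for a real lexical database, and the POS tags are supplied by hand instead of by a tagger. The table, the tag names and the function name `lemmatize_tagged` are all assumptions made for this example; they are not the implementation of `util.lemmatize`.

```python
# Toy lexicon keyed by (word, part of speech). It covers only the words
# used below and stands in for a real lexical database.
LEMMAS = {
    ('placed', 'verb'): 'place',
    ('shipped', 'verb'): 'ship',
    ('cases', 'noun'): 'case',
    ('handling', 'verb'): 'handle',
    # 'handling' as a noun and 'incoming' as an adjective are already base
    # forms, so they have no entry and are returned unchanged.
}


def lemmatize_tagged(tagged_tokens):
    """Map each (word, pos) pair to its lemma, falling back to the word itself."""
    return [LEMMAS.get((word, pos), word) for word, pos in tagged_tokens]


tagged = [('placed', 'verb'), ('handling', 'noun'), ('handling', 'verb'),
          ('incoming', 'adjective'), ('cases', 'noun')]
lemmas = lemmatize_tagged(tagged)
```

The same surface form ('handling') ends up as two different lemmas depending on its tag, which is the behaviour POS tagging is meant to enable.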

In [22]:
balanced_tokens['body'] = balanced_tokens['body'].apply(util.lemmatize)

In [23]:
imbalanced_tokens['body'] = imbalanced_tokens['body'].apply(util.lemmatize)

In [24]:
print(balanced_tokens['body'].iloc[6])

['image', 'dispatch', 'confirmation', 'id', 'lkg03-35352272', 'hello', 'emailaddress', 'mailto', 'emailaddress', 'recent', 'order', 'macbook', 'pro', 'u', 'place', 'successfully', 'estimate', 'delivery', 'date', 'base', 'handling', 'time', 'warehouse', 'process', 'time', 'certain', 'case', 'estimate', 'delivery', 'date', 'vary', 'check', 'detail', 'order', 'thank', 'order', 'amazon', 'order', 'helpdesk', '800', '655-6099', 'delivery', 'thursday', 'june', '10', 'package', 'ship', 'justin', 'k', '12321', 'w', 'doris', 'dr', 'odessa', 'tx', '79764', 'usa', 'item', 'send', 'priority', 'delivery', 'service', 'order', 'summary', 'item', 'subtotal', 'shipping', 'handle', 'shipment', 'total', 'place', 'order', 'reach', 'account', 'support', 'immediately', '800', '655-6099', 'hope', 'see', 'soon', 'email', 'send', 'notification-only', 'address', 'accept', 'incoming', 'email', 'please', 'reply', 'message', 'dispatch', 'confirmation', 'id', 'lkg03-35352272', 'hello', 'emailaddress', 'mailto', 'em

The example shows how the lemmatization process has worked: words like 'placed' and 'shipped' have been converted to their root forms 'place' and 'ship'.<br>
It also shows the effect of the POS tagging: the word 'incoming' is left unchanged because it is used as an adjective and not as a verb.

## Preprocessing for style features

This preprocessing is necessary in order to sanitize the raw email text and remove parsing artifacts, so that the stylometric features work better.

In [25]:
balanced_text = balanced.copy()
imbalanced_text = imbalanced.copy()

### Whitespace Sanitization

The first task is to strip away any leading and trailing whitespace. In addition, lines that contain only a dot are most likely artifacts from the parsing of HTML, so the dot is moved to the end of the previous line and the extra line is removed.
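A minimal sketch of these two rules, assuming a single regular expression is enough, is given below. It is a hypothetical stand-in and not the code of `util.sanitize_whitespace`, which is not shown in this notebook.

```python
import re

# A line break followed by a line that holds nothing but a single dot.
LONE_DOT_RE = re.compile(r'\s*\n[ \t]*\.[ \t]*(?=\n|$)')


def sanitize_whitespace(text: str) -> str:
    """Attach lone dots to the previous line, then strip the outer whitespace."""
    text = LONE_DOT_RE.sub('.', text)
    return text.strip()


raw = '  \n\nView \nAttachment\n.\n\n'
clean = sanitize_whitespace(raw)
```

Lines such as `...` are left alone, since the pattern only matches a line with exactly one dot.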

In [26]:
balanced_text['body'] = balanced_text['body'].apply(util.sanitize_whitespace)

In [27]:
imbalanced_text['body'] = imbalanced_text['body'].apply(util.sanitize_whitespace)

### Address Sanitization

The URL/email anonymization also leaves some artifacts behind. These are harmless in the tokenized texts, but they become a problem when the number of special characters in the text matters.
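The implementation of `util.sanitize_addresses` is not shown, so the following is only a guess at the kind of cleanup involved. It assumes the goal is to leave a single placeholder per address, and its output may differ from what the notebook's own function produces.

```python
import re


def sanitize_addresses(text: str) -> str:
    """Hypothetical cleanup of leftover markup around address placeholders."""
    # '<mailto:<emailaddress>>' only repeats the address that precedes it.
    text = re.sub(r'<mailto:<emailaddress>>', '', text)
    # Collapse doubled brackets: '<<urladdress>>' becomes '<urladdress>'.
    text = re.sub(r'<+(<(?:emailaddress|urladdress)>)>+', r'\1', text)
    return text


greeting = sanitize_addresses('Hello <emailaddress><mailto:<emailaddress>>,')
link = sanitize_addresses('Click Here<<urladdress>>')
```

Removing these extra brackets matters for stylometric features, because each stray `<` or `>` would otherwise be counted as a special character written by the sender.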

In [28]:
balanced_text['body'] = balanced_text['body'].apply(util.sanitize_addresses)

In [29]:
imbalanced_text['body'] = imbalanced_text['body'].apply(util.sanitize_addresses)

One of the previous examples looks better now.

In [30]:
print(balanced_text['body'].iloc[6])

[image]


Dispatch Confirmation

  ID # LKG03-35352272

Hello <emailaddress><mailto:<emailaddress>,


Your recent order ( MacBook Pro) with us has been placed successfully!

The estimated delivery date is based on the handling time and the warehouse processing time. In certain cases, the estimated delivery date will vary.

You can check all the details for your order below. Thank you again for ordering from amazon.

Order Helpdesk : (800) 655-6099


Delivery by:

Thursday, June 10



Your package was shipped to:

Justin K
12321 W Doris Dr, Odessa,

TX 79764, USA

Your item(s) is (are) being sent by Priority Delivery Services.

Order summary

Item Subtotal:

$ 1545.90

Shipping & Handling:

Rs.0.00

Shipment Total:

$1545.90

P.S. If you haven't placed this order, Reach Account Support  Immediately on  (800) 655-6099

We hope to see you again soon!


This email was sent from a notification-only address that cannot accept incoming emai

## Deleting Empty Rows

After all the preprocessing, it is possible that some of the emails are now empty (because they did not contain any useful words from the beginning).<br>
So, these have to be removed to keep the data clean.
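The filtering below relies on Python truthiness: for an object column, `astype(bool)` turns empty lists and empty strings into `False`. A small illustration on made-up data (the frame is purely hypothetical):

```python
import pandas as pd

# Made-up frame: row 1 has an empty token list, row 2 an empty string.
df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'body': [['hello', 'order'], [], '', 'some text'],
})

# For an object column, astype(bool) applies Python truthiness element-wise:
# empty lists and empty strings become False, non-empty ones become True.
mask = df['body'].astype(bool)
non_empty = df[mask]
```

One caveat is that missing values (`NaN`) are truthy, so this idiom only removes empty values, not missing ones. The `info()` outputs above report no nulls, so that is not a concern here.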

In [31]:
balanced_tokens = balanced_tokens[balanced_tokens['body'].astype(bool)]
balanced_tokens.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3589 entries, 0 to 3599
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3589 non-null   int16 
 1   body    3589 non-null   object
 2   class   3589 non-null   bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 66.6+ KB


In [32]:
imbalanced_tokens = imbalanced_tokens[imbalanced_tokens['body'].astype(bool)]
imbalanced_tokens.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18724 entries, 0 to 18791
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      18724 non-null  int16 
 1   body    18724 non-null  object
 2   class   18724 non-null  bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 347.4+ KB


In [33]:
balanced_text = balanced_text[balanced_text['body'].astype(bool)]
balanced_text.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3599 entries, 0 to 3599
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3599 non-null   int16 
 1   body    3599 non-null   object
 2   class   3599 non-null   bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 66.8+ KB


In [34]:
imbalanced_text = imbalanced_text[imbalanced_text['body'].astype(bool)]
imbalanced_text.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18791 entries, 0 to 18791
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      18791 non-null  int16 
 1   body    18791 non-null  object
 2   class   18791 non-null  bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 348.7+ KB


In order to have the same emails in both feature sets, the text dataset will be filtered according to the tokenized one.
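As a small illustration of this filtering step on made-up data (the frames and values are hypothetical), `isin` keeps only the rows whose `id` also appears in the other frame:

```python
import pandas as pd

# Token frame after empty rows were dropped (id 1 is gone).
tokens = pd.DataFrame({'id': [0, 2, 3], 'body': [['a'], ['b'], ['c']]},
                      index=[0, 2, 3])
# Text frame that still contains id 1.
text = pd.DataFrame({'id': [0, 1, 2, 3], 'body': ['A', '?', 'B', 'C']})

# Keep only the text rows whose id survived in the token dataset.
text = text[text['id'].isin(tokens['id'])]

# Both frames should now list the same ids in the same order.
same_ids = text['id'].tolist() == tokens['id'].tolist()
```

Filtering by `id` instead of by position keeps the two feature sets aligned even though different rows were dropped from each of them.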

In [35]:
balanced_text = balanced_text[balanced_text['id'].isin(balanced_tokens['id'])]
balanced_text.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3589 entries, 0 to 3599
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3589 non-null   int16 
 1   body    3589 non-null   object
 2   class   3589 non-null   bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 66.6+ KB


In [36]:
imbalanced_text = imbalanced_text[imbalanced_text['id'].isin(imbalanced_tokens['id'])]
imbalanced_text.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18724 entries, 0 to 18791
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      18724 non-null  int16 
 1   body    18724 non-null  object
 2   class   18724 non-null  bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 347.4+ KB


Check for any discrepancies:

In [37]:
# Combine with `or` so that a mismatch in either dataset is reported
(balanced_text['id'] != balanced_tokens['id']).any() or (imbalanced_text['id'] != imbalanced_tokens['id']).any()

False

## Train-Test Split

In order to evaluate the classifiers, only 80% of the data will be used to train the models. The remaining 20% is held out during training and is used to test how the classifiers perform on unseen data.
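The code of `util.dataset_split` is not shown. The sketch below illustrates one way such a split could work, assuming a shuffle with a fixed seed; the function name `split_rows` and the `seed` parameter are assumptions made for this example. With a fixed seed, two datasets of the same length are shuffled identically, which is what keeps the token and text splits aligned.

```python
import random


def split_rows(rows, percent=20, seed=42):
    """Shuffle rows reproducibly and hold out `percent` % of them for testing."""
    shuffled = list(rows)
    # A dedicated, seeded generator makes the permutation repeatable.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * percent / 100)
    return shuffled[n_test:], shuffled[:n_test]


# Two aligned datasets (same ids, same order), like the token and text frames.
token_rows = [(i, 'tokens') for i in range(10)]
text_rows = [(i, 'text') for i in range(10)]

train_tokens, test_tokens = split_rows(token_rows)
train_text, test_text = split_rows(text_rows)
```

If the seed were not fixed, each call would produce a different permutation and the two feature sets would no longer contain the same emails in their train and test parts.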

### Tokens

In [38]:
train_balanced_tokens, test_balanced_tokens = util.dataset_split(balanced_tokens, percent=20)

In [39]:
train_imbalanced_tokens, test_imbalanced_tokens = util.dataset_split(imbalanced_tokens, percent=20)

In [40]:
train_balanced_tokens[train_balanced_tokens['id'] == 6]

Unnamed: 0,id,body,class
2228,6,"[image, dispatch, confirmation, id, lkg03-35352272, hello, emailaddress, mailto, emailaddress, recent, order, macbook, pro, u, place, successfully, estimate...",True


In [41]:
test_balanced_tokens[test_balanced_tokens['id'] == 17]

Unnamed: 0,id,body,class
459,17,"[notice, policy, update, dear, customer, information, account, appear, miss, incorrect, please, update, information, promptly, continue, enjoy, benefit, pay...",True


One of the examples is in the training set, while the other is in the test set.

### Text

In [42]:
train_balanced_text, test_balanced_text = util.dataset_split(balanced_text, percent=20)

In [43]:
train_imbalanced_text, test_imbalanced_text = util.dataset_split(imbalanced_text, percent=20)

Confirm that the text and token splits contain the same emails:

In [44]:
# Combine with `or` so that a mismatch in any of the four splits is reported
(train_balanced_text['id'] != train_balanced_tokens['id']).any() or (train_imbalanced_text['id'] != train_imbalanced_tokens['id']).any() or (test_balanced_text['id'] != test_balanced_tokens['id']).any() or (test_imbalanced_text['id'] != test_imbalanced_tokens['id']).any()

False

### Saving the Results

#### Tokens

In [45]:
save_to_csv(train_balanced_tokens, csv_path, 'train_balanced_tokens.csv')
save_to_csv(test_balanced_tokens, csv_path, 'test_balanced_tokens.csv')

Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/train_balanced_tokens.csv
Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/test_balanced_tokens.csv


In [46]:
save_to_csv(train_imbalanced_tokens, csv_path, 'train_imbalanced_tokens.csv')
save_to_csv(test_imbalanced_tokens, csv_path, 'test_imbalanced_tokens.csv')

Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/train_imbalanced_tokens.csv
Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/test_imbalanced_tokens.csv


#### Text

In [47]:
save_to_csv(train_balanced_text, csv_path, 'train_balanced_text.csv')
save_to_csv(test_balanced_text, csv_path, 'test_balanced_text.csv')

Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/train_balanced_text.csv
Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/test_balanced_text.csv


In [48]:
save_to_csv(train_imbalanced_text, csv_path, 'train_imbalanced_text.csv')
save_to_csv(test_imbalanced_text, csv_path, 'test_imbalanced_text.csv')

Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/train_imbalanced_text.csv
Saving to C:\Users\13636\OneDrive\01WorkingDirectory\02PycharmProjects\FraudulentEmailAttack\data/csv/test_imbalanced_text.csv
