## Data Processing

### Description

This notebook cleans the feature column (`X`) of the `emails.csv` file.

### Steps

1. Import the data generated from `data-process.ipynb`

2. Randomly shuffle the rows and reset the row index so that ham and spam emails are mixed

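3. Pre-process the text:
  - remove punctuation
  - remove stop words, including a small custom list (e.g. `enron`, `subject`)
  - remove tokens that are not purely alphabetic, such as numbers

4. Drop the rows left empty by the cleaning, plus the rows with non-English characters at the end of the sorted data

5. Save the result as a new CSV file.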

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('stopwords')   # may require a VPN or proxy, depending on your network
nltk.download('punkt')

print("SUCCESS! All modules have been imported.")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


SUCCESS! All modules have been imported.


In [2]:
df = pd.read_csv("../data/emails.csv")

In [3]:
# Randomly shuffle rows to mix ham emails with spam ones
df = df.sample(frac=1).reset_index(drop=True)
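
The shuffle above gives a different row order on every run. If a repeatable order is wanted, one option is to pass a seed to `sample`. The sketch below uses a toy frame and an arbitrary seed of 42; both are assumptions for illustration and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for emails.csv (hypothetical data)
toy = pd.DataFrame({"X": ["a", "b", "c", "d", "e"], "y": [0, 1, 0, 1, 0]})

# frac=1 returns every row in random order; random_state fixes that order
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same seed, same order; the index is renumbered 0..n-1 after the shuffle
print(shuffled.equals(again))
print(list(shuffled.index))
```

Without `reset_index(drop=True)` the shuffled frame would keep the original row labels, which is why the notebook resets the index right after sampling.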

In [4]:
display(df.shape)
display(df.head(10))
display(df.tail(10))

(829172, 2)

| | X | y |
|---|---|---|
| 0 | medicinal drugs with costs that are as more as... | 1 |
| 1 | * when to exit based on signals | 0 |
| 2 | cleaning the conference rooms after meetings .... | 0 |
| 3 | anyone can help , please help . | 0 |
| 4 | fda - approved labs , contain a sophisticated ... | 1 |
| 5 | > > fiction : gore claimed that his knowledge ... | 0 |
| 6 | slob - b - bering ove - e - r | 1 |
| 7 | methanol is a swing deal , by which the gas sa... | 0 |
| 8 | black , don in in in | 0 |
| 9 | walk north past the instructional center to th... | 0 |


| | X | y |
|---|---|---|
| 829162 | ph : ( 403 ) 974 - 1737 | 0 |
| 829163 | http : / / www . islurp . biz : 8070 / us . php | 1 |
| 829164 | pjm has stated that in order to meet its deadl... | 0 |
| 829165 | this sf . net email is sponsored by : thinkgeek | 1 |
| 829166 | medication on the net . no perscription , easy | 1 |
| 829167 | quality training de mxico | 1 |
| 829168 | copyright 2001 . the associated press . all ri... | 0 |
| 829169 | millions of profiles . many are local to your ... | 1 |
| 829170 | / tr | 1 |
| 829171 | let me know your thoughts . | 0 |


In [6]:
# Create a stop list
self_define = ['enron','subject','ect','hou','e','http'] # Manual list of words to be removed
stoplist = stopwords.words('english') + list(string.punctuation) + self_define
stoplist = set(stoplist)

In [7]:
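def trim_word(text):
    '''Remove stop words, punctuation, and non-alphabetic tokens from an email.

    Param: text: email content as a string
    Returns: the remaining tokens joined by single spaces
    '''
    # isalpha() already rejects digit-only tokens, so no separate isdigit() check is needed
    tokens = [word for word in word_tokenize(text)
              if word.lower() not in stoplist and word.isalpha()]
    return " ".join(tokens)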
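
To illustrate the filter in `trim_word`, here is a minimal sketch that swaps NLTK's `word_tokenize` for a plain `str.split()` and uses a tiny, made-up stop list. Both are assumptions made to keep the example self-contained; the names `STOPLIST` and `trim_word_sketch` are hypothetical and the results will differ from the real tokenizer on less tidy input.

```python
import string

# Tiny illustrative stop list (hypothetical); the notebook uses NLTK's English list
STOPLIST = {"the", "is", "at", "to"} | set(string.punctuation) | {"enron", "subject"}

def trim_word_sketch(text, stoplist=STOPLIST):
    """Keep only alphabetic tokens that are not in the stop list."""
    tokens = text.split()  # stand-in for word_tokenize; assumes space-separated tokens
    kept = [t for t in tokens if t.lower() not in stoplist and t.isalpha()]
    return " ".join(kept)

print(trim_word_sketch("Subject : the meeting is at 10 , call 555 - 1234 to confirm"))
# -> meeting call confirm

print(repr(trim_word_sketch("( 403 ) 974 - 1737")))
# -> ''
```

The second call shows why some rows end up as empty strings after cleaning: a line made only of digits and symbols has no tokens left to keep.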

In [8]:
df['X'] = df['X'].astype(str)

In [9]:
# Remove punctuation, stop words, and self-defined words in X
df['X'] = df['X'].apply(trim_word)

In [10]:
df2 = df.copy()

In [11]:
# Count rows whose text became empty after cleaning
nan_value = float('NaN')
df2.replace('', nan_value, inplace=True)
# Select the column by name; positional [0] on a labelled Series is deprecated in recent pandas
n_missing_values = df2['X'].isnull().sum()
print('Total missing values (NaN) in the feature column:', n_missing_values)
print('\nTotal missing values (NaN) takes up {:.2%} of our data.'.format(n_missing_values/len(df.index)))

Total missing values (NaN) in the feature column: 43413

Total missing values (NaN) takes up 5.24% of our data.


In [12]:
# Remove rows containing missing values
df2.dropna(subset=['X'], inplace=True)
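
The replace-then-drop sequence above can also be written without `inplace` operations by building a boolean mask of empty strings. This is a sketch on a toy frame (hypothetical data), offered as an alternative formulation rather than the notebook's own approach.

```python
import pandas as pd

# Toy frame where cleaning has emptied two rows (hypothetical data)
toy = pd.DataFrame({"X": ["free offer", "", "meeting notes", ""], "y": [1, 0, 0, 1]})

# True where the cleaned text is empty or whitespace only
empty_mask = toy["X"].str.strip().eq("")
n_missing = int(empty_mask.sum())

# Keep only the non-empty rows and renumber the index
cleaned = toy.loc[~empty_mask].reset_index(drop=True)

print("Empty rows: {} ({:.2%} of the data)".format(n_missing, n_missing / len(toy)))
print(cleaned)
```

Working from a mask leaves the original frame untouched, so the share of removed rows can be computed from the same object without keeping a second copy.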

In [13]:
df2 = df2.sort_values(by='X')

In [14]:
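# Drop the last 93 rows of the sorted frame, which contain non-English characters
# (93 is a hard-coded count; rows with accented characters still remain, as the tail below shows)
df2.drop(df2.tail(93).index, inplace=True)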
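
Dropping a fixed number of tail rows depends on the sort order and on the data not changing. A content-based rule is one possible alternative: the sketch below keeps only rows whose text is pure ASCII, using Python's built-in `str.isascii()` (Python 3.7+). The helper name `is_ascii_text` and the sample rows are hypothetical, and this rule is stricter than the notebook's, so it would not reproduce the row counts shown here.

```python
def is_ascii_text(text: str) -> bool:
    """Return True when every character in text is plain ASCII."""
    return text.isascii()

# (text, label) pairs standing in for rows of the cleaned frame (hypothetical data)
rows = [
    ("aa exec lead congrats", 0),
    ("á é", 1),
    ("five million pound cash", 1),
    ("â b åý", 1),
]

kept = [(text, label) for text, label in rows if is_ascii_text(text)]
print(kept)
# -> [('aa exec lead congrats', 0), ('five million pound cash', 1)]
```

On the DataFrame the same test could be applied as `df2[df2['X'].map(str.isascii)]`. Note that it also discards English text containing accented letters (for example "café"), and that the dropped rows here are mostly labelled spam, so removing them changes the class balance slightly.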

In [15]:
display(df2.shape)
display(df2.head(10))
display(df2.tail(10))

(785666, 2)

| | X | y |
|---|---|---|
| 235643 | aa | 0 |
| 233861 | aa | 0 |
| 667278 | aa | 0 |
| 279614 | aa | 0 |
| 565440 | aa | 0 |
| 320658 | aa | 0 |
| 318437 | aa exec lead congrats | 0 |
| 282950 | aa houston office interestingly enough wes col... | 0 |
| 688615 | aa indicated proposal regard transactions done | 0 |
| 566819 | aa informed hedges new power company warrants | 0 |


| | X | y |
|---|---|---|
| 142442 | àßíãìâ ãèñý ì | 1 |
| 522845 | àãõâÞ üùé õèêÞã Ýøã ô | 1 |
| 659970 | àçå õèµô µèí õèêð ç | 1 |
| 406937 | á áþãõ printer hp deskjet x dpi áùå è ß | 1 |
| 133249 | á é | 1 |
| 479653 | á þãõ vcd ã óÝøã ô sme ßÞíôÞàµíãìàÞçµ | 1 |
| 148130 | á þãõ vcd êíÞ ã óÝøã ô sme ßÞàÞçµ | 1 |
| 50101 | á þãõ vcd êíÞ ã óÝøã ô smes ßÞíôÞàµíãìàÞ | 1 |
| 490907 | â b åý Ýùø | 1 |
| 505299 | â five million pound cash credited file | 1 |


In [16]:
# Save the cleaned DataFrame
df2.to_csv('../data/emails_cleaned.csv', index=False)

In [17]:
print('The model-ready dataset contains {} rows.'.format(df2.shape[0]))

The model-ready dataset contains 785666 rows.
