# Data Preprocessing

For this assignment, I used the [Spam Mails Dataset](https://www.kaggle.com/venky73/spam-mails-dataset) from Kaggle. The data consists of e-mails labelled as either legitimate (ham) or spam. To achieve better results, each e-mail was passed through several preprocessing steps: lowercasing, punctuation removal, stopword removal, noise removal, and stemming.

In [1]:
import nltk, re, json
import pandas as pd

from IPython.display import display, HTML
from gensim.parsing.preprocessing import remove_stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer

nltk.download('wordnet')
snowball = SnowballStemmer(language='english')
lemmatizer = WordNetLemmatizer()



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ritar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### The dataframe used:

In [2]:
df = pd.read_csv("./data/spam_ham.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


### An example of a legit email:

In [3]:
ham = df[df['label_num'] == 0]
display(HTML(ham.iloc[0]['text']))

### An example of spam:

In [4]:
spam = df[df['label_num'] == 1]
display(HTML(spam.iloc[1]['text']))

## Preprocessing
To be able to compare the processed data with the original, we work on a copy of the dataframe.

In [5]:
cleaned_df = df.copy()

### Lowercasing

All text is converted to lowercase, so that our models treat differently cased occurrences of the same word (e.g. `Free` and `free`) as identical.

In [6]:
def lowercase_texts(texts):
    # Lowercase every email in the collection
    return [email.lower() for email in texts]

cleaned_df['text'] = lowercase_texts(cleaned_df['text'])
cleaned_df['text'][0]

"subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes ."

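To illustrate why lowercasing helps, here is a minimal sketch using only the standard library. The sample texts and the `vocabulary` helper are made up for illustration and are not part of the dataset or the notebook.

```python
from collections import Counter

# Made-up sample texts (not taken from the dataset)
emails = ["FREE offer", "Free Offer", "free offer"]

def vocabulary(texts):
    """Count whitespace-separated tokens across all texts."""
    return Counter(token for text in texts for token in text.split())

raw_vocab = vocabulary(emails)
lower_vocab = vocabulary([text.lower() for text in emails])

# Case variants are counted as separate tokens until the text is lowercased
print(len(raw_vocab), len(lower_vocab))
```

Without lowercasing, each casing of a word becomes its own feature, which spreads the counts of one word over several vocabulary entries.
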
### Removing punctuation

Punctuation marks are not useful as features here, so we strip them out.

In [7]:
def remove_punct(email_list):
    # Build the translation table once instead of once per email
    table = str.maketrans('', '', punctuation)
    return [s.translate(table) for s in email_list]

cleaned_df['text'] = remove_punct(cleaned_df['text'])
cleaned_df['text'][0]

'subject enron methanol  meter   988291\r\nthis is a follow up to the note i gave you on monday  4  3  00  preliminary\r\nflow data provided by daren  \r\nplease override pop  s daily volume  presently zero  to reflect daily\r\nactivity you can obtain from gas control \r\nthis change is needed asap for economics purposes '
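
As a small standalone sketch of what this step does and does not remove: `string.punctuation` only covers ASCII punctuation, and deleting characters leaves runs of spaces behind, as in the output above. The names `PUNCT_TABLE`, `strip_punct` and the sample string are illustrative assumptions.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', punctuation)

def strip_punct(text):
    """Delete ASCII punctuation; other characters are left untouched."""
    return text.translate(PUNCT_TABLE)

sample = "meter # : 988291 { preliminary }"
stripped = strip_punct(sample)

# Deleted characters leave their surrounding spaces behind;
# split/join collapses them into single spaces
collapsed = " ".join(stripped.split())

# Non-ASCII punctuation such as curly quotes is not in string.punctuation
curly = strip_punct("\u201chi\u201d!")

print(repr(stripped))
print(repr(collapsed))
print(repr(curly))
```

In the notebook, the leftover whitespace is harmless because the stopword step that follows rebuilds each text from its tokens, and the remaining non-ASCII characters are dropped by the noise-removal step.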

### Removing Stopwords
Very common words are filtered out, since they appear in both spam and legitimate emails and therefore do not help to tell the two apart.

In [8]:
def stopword_removal(email_list):
    return [remove_stopwords(text) for text in email_list]

cleaned_df['text'] = stopword_removal(cleaned_df['text'])
cleaned_df['text'][0]

'subject enron methanol meter 988291 follow note gave monday 4 3 00 preliminary flow data provided daren override pop s daily volume presently zero reflect daily activity obtain gas control change needed asap economics purposes'
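
The idea behind this step can be sketched without gensim. The snippet below is only an approximation under stated assumptions: `TOY_STOPWORDS` is a tiny hand-picked set (gensim's built-in list is much larger), and `drop_stopwords` is a hypothetical helper, not the library function.

```python
# Tiny hand-picked stopword set, for illustration only
TOY_STOPWORDS = {"this", "is", "a", "up", "to", "the", "i", "you", "on"}

def drop_stopwords(text, stopwords=TOY_STOPWORDS):
    """Split on whitespace, drop stopwords, and re-join with single spaces."""
    return " ".join(word for word in text.split() if word not in stopwords)

sample = "this is a follow up to the note i gave you on monday"
filtered = drop_stopwords(sample)
print(filtered)
```

Because the text is rebuilt from its tokens, any repeated spaces or line breaks in the input also disappear, which matches the notebook output above.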

### Removing noise
Digits, non-ASCII characters, line breaks, and HTML markup are also removed, leaving only ASCII letters and spaces.

In [9]:
def text_cleaning(email_list):
    new_list = []
    for email in email_list:
        # Remove HTML markup. Note: '<' and '>' were already stripped by
        # remove_punct, so this pattern no longer matches at this point.
        email = re.sub(r"<.*?>", "", email)
        # Replace line breaks with spaces
        email = re.sub(r"\r\n|\r|\n", " ", email)
        # Keep only ASCII letters and spaces (drops digits and non-ASCII)
        email = re.sub(r"[^A-Za-z ]", "", email)
        new_list.append(email)

    return new_list

cleaned_df['text'] = text_cleaning(cleaned_df['text'])
cleaned_df['text'][0]

'subject enron methanol meter  follow note gave monday    preliminary flow data provided daren override pop s daily volume presently zero reflect daily activity obtain gas control change needed asap economics purposes'
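
One thing to keep in mind is that the order of the steps matters for HTML removal: a tag pattern such as `<.*?>` can only match while the angle brackets are still present. The sketch below compares both orders on a made-up string; `TAG_RE`, `PUNCT_TABLE` and the sample are illustrative assumptions, not code from the notebook.

```python
import re
from string import punctuation

PUNCT_TABLE = str.maketrans('', '', punctuation)
TAG_RE = re.compile(r"<.*?>")

# Made-up HTML fragment (not taken from the dataset)
raw = "<b>cheap</b> meds <a href='x'>click</a>"

# Punctuation first: the brackets are gone, so the tag names leak into the text
punct_first = TAG_RE.sub("", raw.translate(PUNCT_TABLE))

# Tags first: whole tags are removed before punctuation is stripped
tags_first = TAG_RE.sub("", raw).translate(PUNCT_TABLE)

print(repr(punct_first))
print(repr(tags_first))
```

If HTML tag names turn out to pollute the vocabulary, running the markup removal before the punctuation step would be one way to address it.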

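### Stemming
Finally, we reduce each word to its stem with the Snowball stemmer, so that the inflected forms of a word can be analysed as a single item. Note that this is stemming rather than lemmatization: a stem is not necessarily a dictionary word (for example, `preliminary` becomes `preliminari`).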

In [10]:
def stem_texts(email_list):
    new_list = []
    for email in email_list:
        # Stem each space-separated token with the Snowball stemmer
        email = " ".join(snowball.stem(word) for word in email.split(" "))
        new_list.append(email)

    return new_list

cleaned_df['text'] = stem_texts(cleaned_df['text'])
cleaned_df['text'][0]

'subject enron methanol meter  follow note gave monday    preliminari flow data provid daren overrid pop s daili volum present zero reflect daili activ obtain gas control chang need asap econom purpos'
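
To see how stemming groups related words, here is a short sketch that stems a few words appearing in this dataset. It assumes `nltk` is installed; the word list is chosen for illustration.

```python
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

# Words picked from the emails shown in this notebook
words = ["savings", "save", "certificate", "preliminary"]
stems = {word: snowball.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Words such as `savings` and `save` end up sharing one stem, which shrinks the vocabulary, at the cost of stems that are sometimes not real words.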

### Example of text after preprocessing:

Below we can see an example of a legit email before and after the filters. The processed text is much simpler!

In [11]:
df['text'][6]

"Subject: spring savings certificate - take 30 % off\r\nsave 30 % when you use our customer appreciation spring savings\r\ncertificate at foot locker , lady foot locker , kids foot locker and at\r\nour online stores !\r\nwelcome to our customer appreciation spring savings certificate !\r\nuse the special certificate below and receive 30 % off your purchases either in our stores or online . hurry ! this 4 - day sale begins thursday , march 22 and ends sunday , march 25 .\r\nshare the savings today and e - mail this offer to your friends . many items already are reduced and the 30 % discount is taken off the lowest sale price .\r\nclick below to print your customer appreciation spring savings certificate . you must present this coupon at any foot locker , lady foot locker or kids foot locker store in the u . s . foot locker canada is not participating in this program .\r\nready , set , save !\r\nour spring savings discount will automatically appear when you use the links below or type ca

In [12]:
cleaned_df['text'][6]

'subject spring save certif  save  use custom appreci spring save certif foot locker ladi foot locker kid foot locker onlin store welcom custom appreci spring save certif use special certif receiv  purchas store onlin hurri  day sale begin thursday march  end sunday march  share save today e mail offer friend item reduc  discount taken lowest sale price click print custom appreci spring save certif present coupon foot locker ladi foot locker kid foot locker store u s foot locker canada particip program readi set save spring save discount automat appear use link type camlem  promot code box checkout footlock com certif code camlem  ladyfootlock com certif code camlem  kidsfootlock com certif code camlem  rememb return hassl free simpli bring item store nationwid mail t left regist today learn new product promot event special simpli click term condit exclus appli manag complet detail certif present time purchas conjunct discount offer associ benefit redeem cash applic tax paid bearer app

## Saving the preprocessed dataframe
The data will be saved to a file, *preprocessed_spam_ham.csv*, so that all our models can use it!

In [13]:
cleaned_df.to_csv("./data/preprocessed_spam_ham.csv")
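
The `Unnamed: 0` style columns in the dataframe shown earlier are probably the result of a CSV being saved together with its index and then read back. The sketch below reproduces that effect on a small made-up frame, using in-memory buffers instead of files; passing `index=False` to `to_csv` is one possible way to avoid the extra column, if downstream code does not rely on it.

```python
import io
import pandas as pd

# Small made-up frame standing in for the preprocessed data
frame = pd.DataFrame({"text": ["subject enron", "subject spring"],
                      "label_num": [0, 1]})

# Default behaviour: the row index is written as an extra, unnamed column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the actual columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```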