# Data Preprocessing

For this assignment, I used the [Spam Mails Dataset](https://www.kaggle.com/venky73/spam-mails-dataset) from Kaggle. The data consists of e-mails labelled as either legitimate (ham) or spam. To achieve better results, it was passed through several preprocessing steps: lowercasing, expanding contractions, punctuation, stopword and noise removal, and stemming.

In [1]:
import nltk, re, json
import pandas as pd

from IPython.display import display, HTML
from gensim.parsing.preprocessing import remove_stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

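nltk.download('punkt')  # tokenizer models needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')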
snowball = SnowballStemmer(language='english')
wnl = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ritar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ritar\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### The dataframe used:

In [2]:
df = pd.read_csv("./data/spam_ham.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


### An example of a legitimate email:

In [3]:
ham = df[df['label_num'] == 0]
display(HTML(ham.iloc[2]['text']))

### An example of spam:

In [4]:
spam = df[df['label_num'] == 1]
display(HTML(spam.iloc[1]['text']))

## Preprocessing
To be able to compare the processed data with the original later on, we first make a copy.

In [5]:
cleaned_df = df.copy()

### Lowercasing

All text is converted to lowercase, so that our models do not treat the same word as different tokens just because of capitalization.

In [6]:
cleaned_df['text'] = cleaned_df['text'].apply(lambda x:x.lower())
cleaned_df['text'][2]

"subject: neon retreat\r\nho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !\r\ni know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute .\r\non the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .\r\ni think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up

### Expanding contractions
To get cleaner text, we expand contractions (which are common in both legitimate emails and spam) into full words. <br>
*Example:* we'll -> we will

In [7]:
from contractions import contractionsdict

# Escape each key so that any regex metacharacters in it are matched literally
contractions_re = re.compile('(%s)' % '|'.join(re.escape(key) for key in contractionsdict))

def expand_contractions(text, contractionsdict=contractionsdict):
    # Replace every matched contraction with its expanded form
    def replace(match):
        return contractionsdict[match.group(0)]
    return contractions_re.sub(replace, text)

cleaned_df['text'] = cleaned_df['text'].apply(expand_contractions)
cleaned_df['text'][2]

"subject: neon retreat\r\nho ho ho , we are around to that most wonderful time of the year - - - neon leaders retreat time !\r\ni know that this time of year is extremely hectic , and that it has / it is tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that has / that is what i ' d like you to think about for a minute .\r\non the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we are going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .\r\ni think we all agree that it has / it is important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids ,
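The regex-based expansion can be tried in isolation with a tiny stand-in dictionary. The sketch below assumes a hypothetical three-entry `demo_contractions` mapping (the real `contractionsdict` comes from the local `contractions` module and is not reproduced here). Keys are written with spaces around the apostrophe, as they appear in this dataset.

```python
import re

# Hypothetical stand-in for the project's contractionsdict (illustration only).
demo_contractions = {
    "we ' re": "we are",
    "we ' ll": "we will",
    "can ' t": "cannot",
}

# Longest keys first so a longer contraction wins over a shorter one that
# starts the same way; re.escape guards against regex metacharacters in keys.
demo_re = re.compile(
    "|".join(re.escape(k) for k in sorted(demo_contractions, key=len, reverse=True))
)

def expand_demo(text):
    """Replace every known contraction with its expanded form."""
    return demo_re.sub(lambda m: demo_contractions[m.group(0)], text)

expanded = expand_demo("we ' re late , we ' ll call , we can ' t stay")
print(expanded)  # we are late , we will call , we cannot stay
```

Anything not listed in the dictionary is left untouched, which is why fragments such as `i ' d` can survive this step.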

### Removing punctuation

Punctuation carries little information for our models, so we strip it.

In [8]:
cleaned_df['text'] = cleaned_df['text'].apply(lambda x:x.translate(str.maketrans('', '', punctuation)))
cleaned_df['text'][2]

'subject neon retreat\r\nho ho ho  we are around to that most wonderful time of the year    neon leaders retreat time \r\ni know that this time of year is extremely hectic  and that it has  it is tough to think about anything past the holidays  but life does go on past the week of december 25 through january 1  and that has  that is what i  d like you to think about for a minute \r\non the calender that i handed out at the beginning of the fall semester  the retreat was scheduled for the weekend of january 5  6  but because of a youth ministers conference that brad and dustin are connected with that week  we are going to change the date to the following weekend  january 12  13  now comes the part you need to think about \r\ni think we all agree that it has  it is important for us to get together and have some time to recharge our batteries before we get to far into the spring semester  but it can be a lot of trouble and difficult for us to get away without kids  etc  so  brad came up w
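As a small illustration of what `str.translate` does here, the sketch below builds the deletion table once and applies it to a made-up sentence (the `sample` string is an assumption, not taken from the dataset). Each punctuation character is deleted, not replaced, so the spaces around it remain and show up as double spaces, as in the output above.

```python
from string import punctuation

# Build the translation table once; it maps every punctuation character to "deleted".
strip_table = str.maketrans("", "", punctuation)

sample = "hello , world ! it's 5 - 6"
stripped = sample.translate(strip_table)
print(repr(stripped))  # 'hello  world  its 5  6'
```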

### Removing Stopwords
Very common words are filtered out, since they do not help distinguish spam from legitimate emails.

In [9]:
cleaned_df['text'] = cleaned_df['text'].apply(lambda x:remove_stopwords(x))
cleaned_df['text'][2]

'subject neon retreat ho ho ho wonderful time year neon leaders retreat time know time year extremely hectic tough think past holidays life past week december 25 january 1 d like think minute calender handed beginning fall semester retreat scheduled weekend january 5 6 youth ministers conference brad dustin connected week going change date following weekend january 12 13 comes need think think agree important time recharge batteries far spring semester lot trouble difficult away kids brad came potential alternative weekend let know prefer option retreat similar past years year heartland country inn www com outside brenham nice place 13 bedroom 5 bedroom house country real relaxing close brenham hour 15 minutes golf shop antique craft stores brenham eat dinner ranch spend time meet saturday return sunday morning like past second option stay houston dinner nice restaurant dessert time visiting recharging homes saturday evening easier trade time ll let decide email preference course avail
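The idea behind stopword filtering can be sketched without gensim. The snippet below uses a hypothetical, very small `DEMO_STOPWORDS` set (gensim's own list is much longer) and a plain whitespace split. Rejoining the tokens with a single space also collapses line breaks and repeated spaces, which matches what the output above shows.

```python
# Hypothetical, tiny stopword set for illustration only.
DEMO_STOPWORDS = frozenset({"the", "of", "to", "that", "is", "and", "we", "are"})

def remove_demo_stopwords(text):
    """Drop stopwords; split() also collapses runs of whitespace and line breaks."""
    return " ".join(word for word in text.split() if word not in DEMO_STOPWORDS)

filtered = remove_demo_stopwords("we are around to that most wonderful time\r\nof the year")
print(filtered)  # around most wonderful time year
```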

### Removing noise
Digits, non-ASCII characters, line breaks and HTML markup are also removed.

In [10]:
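def text_cleaning(email):
    # remove HTML markup (note: the punctuation step above has already stripped '<' and '>',
    # so this pattern only matches if it runs before that step)
    email = re.sub(r"<.*?>", "", email)
    # replace line breaks with spaces
    email = re.sub(r"\r\n|\r|\n", " ", email)
    # remove digits, non-ASCII characters and anything else that is not a letter or a space
    email = re.sub(r"[^A-Za-z ]", "", email)
    # collapse repeated spaces
    email = re.sub(r" +", " ", email)
    return email

cleaned_df['text'] = cleaned_df['text'].apply(text_cleaning)
cleaned_df['text'][2]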

'subject neon retreat ho ho ho wonderful time year neon leaders retreat time know time year extremely hectic tough think past holidays life past week december january d like think minute calender handed beginning fall semester retreat scheduled weekend january youth ministers conference brad dustin connected week going change date following weekend january comes need think think agree important time recharge batteries far spring semester lot trouble difficult away kids brad came potential alternative weekend let know prefer option retreat similar past years year heartland country inn www com outside brenham nice place bedroom bedroom house country real relaxing close brenham hour minutes golf shop antique craft stores brenham eat dinner ranch spend time meet saturday return sunday morning like past second option stay houston dinner nice restaurant dessert time visiting recharging homes saturday evening easier trade time ll let decide email preference course available weekend democratic
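The order of the cleaning steps matters for HTML markup. In this notebook punctuation is stripped before the tag regex runs, so by then the angle brackets are gone and the tag names stay glued to the text. The sketch below compares both orders on a made-up snippet; the helper names `strip_html` and `strip_punct` are hypothetical and only mirror the two steps.

```python
import re
from string import punctuation

def strip_html(text):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<.*?>", "", text)

def strip_punct(text):
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", punctuation))

raw = "<p>cheap <b>software</b></p>"

punct_first = strip_html(strip_punct(raw))  # brackets already gone, tag names remain
html_first = strip_punct(strip_html(raw))   # tags removed cleanly
print(punct_first)  # pcheap bsoftwarebp
print(html_first)   # cheap software
```

If HTML-heavy spam matters for the models, running the markup removal before the punctuation step would be one way to avoid the leftover tag names.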

### Stemming
To improve results, we reduce each word to its stem by stripping suffixes, so that the inflected forms of a word are analysed as a single item. The stem is not necessarily a dictionary word (for example, *extremely* becomes *extrem*).

In [11]:
def stem_texts(email):
    # stem every token and join the stems back into a single string
    return " ".join(snowball.stem(word) for word in word_tokenize(email))

cleaned_df['text'] = cleaned_df['text'].apply(stem_texts)
cleaned_df['text'][2]

'subject neon retreat ho ho ho wonder time year neon leader retreat time know time year extrem hectic tough think past holiday life past week decemb januari d like think minut calend hand begin fall semest retreat schedul weekend januari youth minist confer brad dustin connect week go chang date follow weekend januari come need think think agre import time recharg batteri far spring semest lot troubl difficult away kid brad came potenti altern weekend let know prefer option retreat similar past year year heartland countri inn www com outsid brenham nice place bedroom bedroom hous countri real relax close brenham hour minut golf shop antiqu craft store brenham eat dinner ranch spend time meet saturday return sunday morn like past second option stay houston dinner nice restaur dessert time visit recharg home saturday even easier trade time ll let decid email prefer cours avail weekend democrat process prevail major vote rule let hear soon possibl prefer end weekend vote way complain allo

### Lemmatizing
As an alternative to stemming, we can lemmatize: the inflected forms of a word are grouped together and identified by the word's lemma, or dictionary form. The call is left commented out below, because the pipeline uses stemming.

In [12]:
def penn2morphy(penntag):
    # penntag is a Penn Treebank tag string such as 'VBD'; map it to a WordNet POS letter
    if penntag[:2] == 'VB':
        return 'v'
    elif penntag[:2] == 'JJ':
        return 'a'
    elif penntag[:2] == 'RB':
        return 'r'
    else:
        return 'n'


def lemmatize_sent(email_list):
    # lemmatize every token of every email, using its POS tag to pick the right lemma
    new_list = []
    for email in email_list:
        tagged = pos_tag(word_tokenize(email))
        new_text = " ".join(wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) for word, tag in tagged)
        new_list.append(new_text)
    return new_list
    
    
# cleaned_df['text'] = lemmatize_sent(cleaned_df['text'])
# cleaned_df['text'][2]
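The tag-mapping step can be checked without running the tagger. `pos_tag` yields `(word, tag)` pairs where the tag is a plain string such as `'VBD'`, so the mapping only needs to look at the first two characters of that string. The sketch below uses a hypothetical helper, `penn_to_wordnet`, and hand-written pairs in the same shape; these are not real tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag string to a WordNet POS letter."""
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("RB"):
        return "r"
    return "n"  # default: treat everything else as a noun

# Hand-written (word, tag) pairs in the shape pos_tag returns (illustration only).
tagged = [("leaders", "NNS"), ("handed", "VBD"), ("wonderful", "JJ"), ("extremely", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
# [('leaders', 'n'), ('handed', 'v'), ('wonderful', 'a'), ('extremely', 'r')]
```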

### Example of text after preprocessing:

Below is a legitimate email before and after the filters. The processed text is much simpler.

In [13]:
df['text'][2]

"Subject: neon retreat\r\nho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !\r\ni know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute .\r\non the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .\r\ni think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up

In [14]:
cleaned_df['text'][2]

'subject neon retreat ho ho ho wonder time year neon leader retreat time know time year extrem hectic tough think past holiday life past week decemb januari d like think minut calend hand begin fall semest retreat schedul weekend januari youth minist confer brad dustin connect week go chang date follow weekend januari come need think think agre import time recharg batteri far spring semest lot troubl difficult away kid brad came potenti altern weekend let know prefer option retreat similar past year year heartland countri inn www com outsid brenham nice place bedroom bedroom hous countri real relax close brenham hour minut golf shop antiqu craft store brenham eat dinner ranch spend time meet saturday return sunday morn like past second option stay houston dinner nice restaur dessert time visit recharg home saturday even easier trade time ll let decid email prefer cours avail weekend democrat process prevail major vote rule let hear soon possibl prefer end weekend vote way complain allo

## Saving the preprocessed dataframe
The data is saved to a file, *preprocessed_spam_ham.csv*, so that all our models can use it.

In [15]:
cleaned_df.to_csv("./data/preprocessed_spam_ham.csv")
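One detail worth knowing when saving: by default `to_csv` also writes the dataframe index, which comes back as an extra `Unnamed: 0` column on the next `read_csv`. The sketch below shows the difference on a small made-up dataframe, using in-memory buffers in place of files. Whether downstream notebooks expect that extra column is not known here, so passing `index=False` is a suggestion, not a required change.

```python
import io
import pandas as pd

# Made-up two-row dataframe, for illustration only.
demo = pd.DataFrame({"label": ["ham", "spam"], "text": ["neon retreat", "cheap softwar"]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'label', 'text']
print(list(reloaded_clean.columns))       # ['label', 'text']
```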