## Preprocessing Data

### Instructions

It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data:

Download the data from the following URL:
- URL: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/

### Import Packages

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn import svm, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn import ensemble

Download the files automatically if they have not been downloaded yet.

In [4]:
import os
import requests

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/'
data_dir = '../data/uci_spambase'
files = ["spambase.data", "spambase.names"]

# make sure the target directory exists before writing into it
os.makedirs(data_dir, exist_ok=True)

for f in files:
    response = requests.get(url + f, allow_redirects=True)
    # stop here if the download failed instead of saving an error page
    response.raise_for_status()
    # write the file straight into the data directory and close it afterwards
    with open(os.path.join(data_dir, f), 'wb') as out_file:
        out_file.write(response.content)
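
The download only needs to happen once. Below is a minimal sketch of a skip-if-present check. The helper name `download_if_missing`, its `fetch` parameter, and the default fetcher are hypothetical and not part of the original notebook. The fetcher is passed in as an argument so the logic can be exercised without network access.

```python
import os


def _fetch_with_requests(file_url):
    """Default fetcher (assumed): download the bytes at file_url with requests."""
    import requests  # imported lazily so the helper can be tested offline

    response = requests.get(file_url, allow_redirects=True)
    response.raise_for_status()
    return response.content


def download_if_missing(base_url, filename, data_dir, fetch=_fetch_with_requests):
    """Download base_url + filename into data_dir unless it is already there.

    Returns True if a download happened, False if the file already existed.
    """
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, filename)
    if os.path.exists(target):
        return False
    content = fetch(base_url + filename)
    with open(target, "wb") as out_file:
        out_file.write(content)
    return True
```

With the notebook's own values, this could be called as `download_if_missing(url, "spambase.data", '../data/uci_spambase')` and likewise for `"spambase.names"`. A second run would then leave the existing files untouched.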

In [None]:
If you have already downloaded the files, start running from here.

### Data Preprocessing
Using the link mentioned above, we download two files: "spambase.data" and "spambase.names". The spambase.data file contains the feature values for each e-mail as well as its classification (1 = spam, 0 = not spam). The spambase.names file contains the descriptions of each of the features. We will extract the feature names from this file and apply them to the dataset as column names before building models.

In [5]:
# extract the column names from the spambase.names file

# create an empty columns list
columns = []

# open the file; the with-block closes it when we are done
with open('../data/uci_spambase/spambase.names') as columns_names_file:
    for line in columns_names_file:
        # skip comment lines, which start with '|'
        if not re.match(r'\|', line):
            # remove the trailing '\n'
            line = line.rstrip()
            # keep only the word_freq / char_freq / capital_run_length feature lines
            if re.search(r'(word_freq_|char_freq_|capital_run_length_).+', line):
                words = line.split()
                first_word = words[0]
                # drop the trailing ':' from the feature name
                columns.append(first_word[:-1])

# The list of columns in the names file doesn't include the column holding the
# classification of spam or ham (not spam), so let's add one.
columns.append('class')
print('All columns: \n', columns)

All columns: 
 ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology', 'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_f
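
To make the parsing step easier to follow, here is a small self-contained sketch that applies the same filter to a few sample lines. The sample lines are illustrative and written in the style of spambase.names; they are an assumption, not read from the real file. The function name `extract_columns` is hypothetical.

```python
import re

# Illustrative lines in the style of spambase.names (assumed, not read from disk)
sample_lines = [
    "| This line is a comment and is skipped\n",
    "1, 0.    | spam, non-spam classes\n",
    "\n",
    "word_freq_make:         continuous.\n",
    "char_freq_;:            continuous.\n",
    "capital_run_length_total:   continuous.\n",
]

feature_pattern = re.compile(r'(word_freq_|char_freq_|capital_run_length_).+')


def extract_columns(lines):
    """Return the feature names found in lines, plus a final 'class' column."""
    names = []
    for line in lines:
        if line.startswith('|'):
            continue  # comment line
        line = line.rstrip()
        if feature_pattern.search(line):
            # the first token looks like 'word_freq_make:'; drop the trailing ':'
            names.append(line.split()[0][:-1])
    names.append('class')
    return names


print(extract_columns(sample_lines))
# ['word_freq_make', 'char_freq_;', 'capital_run_length_total', 'class']
```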

Let's now load the data from the data file and apply our 'columns' list as the column header.

In [6]:
# Create a function which removes extra leading
# and trailing whitespace from the data.
# Pass the dataframe as a parameter; it is modified in place.
def whitespace_remover(dataframe):
    
    # iterating over the columns
    for i in dataframe.columns:
          
        # checking datatype of each columns
        if dataframe[i].dtype == 'object':
              
            # applying strip function on column
            dataframe[i] = dataframe[i].map(str.strip)
        else:
              
            # if condn. is False then it will do nothing.
            pass
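
One caveat with `map(str.strip)` is that it raises a `TypeError` when a text column contains a missing value. A possible alternative, sketched here on a toy dataframe, uses the pandas `.str.strip()` accessor, which leaves missing values as missing. The function name `strip_text_columns` and the toy data are hypothetical; unlike the function above, this version returns a copy rather than modifying its input.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def strip_text_columns(dataframe):
    """Return a copy with leading/trailing whitespace stripped from text columns."""
    result = dataframe.copy()
    for name in result.columns:
        column = result[name]
        # only touch columns that hold text; numeric columns are left alone
        if is_object_dtype(column) or is_string_dtype(column):
            # .str.strip() keeps missing values missing instead of raising
            result[name] = column.str.strip()
    return result


toy = pd.DataFrame({'text': ['  spam ', 'ham  ', None], 'count': [1, 2, 3]})
cleaned = strip_text_columns(toy)
print(cleaned)
```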
  

Load the data, apply the column names, and remove any leading or trailing spaces

In [8]:
data=pd.read_csv('../data/uci_spambase/spambase.data', header=None)
data.columns = columns
whitespace_remover(data)
data = data.apply(pd.to_numeric)
# Check data types
data.dtypes


word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
word_freq_ge
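
As a possible shortcut, `pd.read_csv` can assign the column names and skip spaces after each delimiter in one call. The sketch below uses a tiny in-memory stand-in for spambase.data; the toy rows and the three-column subset are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy stand-in for spambase.data: no header row, comma-separated, with stray spaces
raw = io.StringIO("0.0, 0.64, 1\n0.21, 0.28, 0\n")
toy_columns = ['word_freq_make', 'word_freq_address', 'class']  # hypothetical subset

# names= sets the header, skipinitialspace=True drops spaces after each comma
toy = pd.read_csv(raw, header=None, names=toy_columns, skipinitialspace=True)
print(toy.dtypes)
```

Because the values are parsed as numbers on load, a separate `pd.to_numeric` pass would not be needed in this toy case.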

Convert to csv file

In [10]:
# convert to csv file (without writing the row index)
data.to_csv(r'../processed_data/spambase.csv', index=False)
# read the processed csv file back into a dataframe
data = pd.read_csv('../processed_data/spambase.csv')

In [11]:
data.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [12]:
print(len(data.columns))
print(len(data))

58
4601


#### Spam, Non-Spam Breakouts

Let's take a look at our dataset to see how much spam and how much ham (non-spam) we have.

In [14]:
# Count spam and non-spam
count_spam = len(data[data['class'] == 1])
count_nonspam = len(data[data['class'] == 0])

print("Spam: %d" %count_spam)
print("Non-spam: %d" %count_nonspam)

Spam: 1813
Non-spam: 2788
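
With 1813 spam and 2788 non-spam rows out of 4601, roughly 39.4% of the e-mails are spam. The counts and shares can also be computed in one step with `value_counts`. The sketch below assumes a dataframe with a `class` column and runs on toy data; the function name `class_breakdown` is hypothetical.

```python
import pandas as pd


def class_breakdown(dataframe, label_col='class'):
    """Return per-class counts and shares as a small dataframe."""
    counts = dataframe[label_col].value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({'count': counts, 'share': shares})


# toy labels for illustration: 1 = spam, 0 = non-spam
toy = pd.DataFrame({'class': [1, 0, 0, 1, 0]})
breakdown = class_breakdown(toy)
print(breakdown)
```

On the notebook's dataframe, this could be called as `class_breakdown(data)`.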
