## SPAM Filter

#Varvara Yakovleva

* Procedure
    * Divide data in train and test sets
    * Keep test data in a safe!
    * Transform train data (normalize, discretize, etc.)
    * Train model
    * Transform test data with the parameters found in step 3
    * Test model with test data
    * Evaluate results
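
The key point of the procedure is that the transformation parameters are learned from the training data only and then reused on the test data. A minimal sketch of that idea, assuming synthetic stand-in data and a fixed `random_state` (both are illustrative choices, not part of the original exercise):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the spam data set)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, then leave the test set untouched
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

# Learn the transformation parameters from the training data only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)

# Reuse the same parameters on the test data
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # zero up to rounding, by construction
print(X_te_scaled.mean(axis=0))  # near zero, but generally not exactly zero
```

Fitting the scaler on all the data before splitting would let information from the test set leak into training.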

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from scipy.stats import norm
from sklearn import preprocessing
from random import random

Perhaps the easiest way to read in data is with Pandas, a library for manipulating data. It is similar to R's data frames and very useful, although in some cases it is confusing to combine with other libraries:

In [None]:
df = pd.read_csv("spambase.data.txt",header=None)

In [None]:
# This data does not have headers so each attribute or field is simply enumerated
df.describe()

There are a few ways to split data into train and test sets. The first uses Sklearn, a machine learning library for Python, which has a function for splitting data into train and test sets:

In [None]:
# Here df.columns is a list of all the columns and df.columns[0:-1] is all columns minus the last which is y. 
# If the data had headers you could use column names: df[['column1','column2','etc']]
X_train, X_test, Y_train, Y_test = train_test_split(df[df.columns[0:-1]],df[df.columns[-1]], train_size=0.75)

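Something important to note: Sklearn accepts pandas dataframes as input. `train_test_split` returns the same type it was given (dataframes and series here), but many other Sklearn tools, such as `StandardScaler.transform`, return NumPy arrays by default.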
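
The return types can be checked directly. A small sketch, assuming a made-up toy frame (`toy` and its values are hypothetical) with the label in the last column, as in the spam data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two features plus a label in the last column
toy = pd.DataFrame({0: [0.1, 0.4, 0.3, 0.9, 0.5, 0.7, 0.2, 0.8],
                    1: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                    2: [0, 1, 0, 1, 0, 1, 0, 1]})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[toy.columns[0:-1]], toy[toy.columns[-1]],
    train_size=0.75, random_state=0)

# Types of the split outputs
print(type(X_tr).__name__, type(y_tr).__name__)

# Type of a scaler's output (default configuration)
scaled = StandardScaler().fit_transform(X_tr)
print(type(scaled).__name__)
```

If arrays are needed right after the split, wrapping the inputs in `np.array(...)` or calling `.to_numpy()` on the outputs is one option.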

The other way to split data that is useful to know is:

In [5]:
# Random 0/1 index for selecting rows; 0.75 is the probability that a row goes to training
index=np.array([1 if random() < 0.75 else 0 for i in range(len(df))])

In [6]:
# Separate both train and test as well as the response variable
X_train=np.array(df[df.columns[0:-1]])[index==1]
X_test=np.array(df[df.columns[0:-1]])[index==0]
Y_train=np.array(df[df.columns[-1]])[index==1]
Y_test=np.array(df[df.columns[-1]])[index==0]

The above method for splitting data can also be used to select a subset of an array using the values of an equally sized array, which is useful for the current exercise. For example, all instances of spam in the training data can be extracted with `X_train[Y_train==1]`.
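
A minimal sketch of this kind of selection with a boolean mask, using small made-up arrays (`X` and `Y` here are hypothetical, not the spam data):

```python
import numpy as np

# Four rows with two features each, and one label per row (1 = spam)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
Y = np.array([1, 0, 1, 0])

# Y == 1 is a boolean array; indexing with it keeps the rows where it is True
spam_rows = X[Y == 1]
not_spam_rows = X[Y == 0]

print(spam_rows)      # rows 0 and 2
print(not_spam_rows)  # rows 1 and 3
```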

In [7]:
# Normalizing does not help much, but the result matches sklearn's. It makes the heights of the pdf mean the same thing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)

In [8]:
# Find means and standard deviation
spam_mean = np.mean(X_train[Y_train==1], axis=0)
spam_std = np.std(X_train[Y_train==1], axis=0)

not_spam_mean = np.mean(X_train[Y_train==0], axis=0)
not_spam_std = np.std(X_train[Y_train==0], axis=0)

In [9]:
# Probability that message is spam or not 
pr_spam = float(sum(Y_train))/len(Y_train)
print(pr_spam)

pr_not_spam = 1 - pr_spam
print(pr_not_spam)

0.394325419803
0.605674580197


In [12]:
def bayes(X):
    # Each feature casts one vote; the label with more votes wins
    spam = 0
    not_spam = 0
    for i in range(len(X)):
        # y_spam = log p(x_i|C); np.ma.log masks the result where the density is 0
        y_spam = np.ma.log(norm(spam_mean[i], spam_std[i]).pdf(X[i]))
        # p_spam = log p(C) + log p(x_i|C): unnormalized log-posterior of spam for feature i alone
        p_spam = np.log(pr_spam) + y_spam.sum()

        # y_not_spam = log p(x_i|¬C)
        y_not_spam = np.ma.log(norm(not_spam_mean[i], not_spam_std[i]).pdf(X[i]))
        p_not_spam = np.log(pr_not_spam) + y_not_spam.sum()

        if p_spam >= p_not_spam:
            spam += 1
        else:
            not_spam += 1

    if spam > not_spam:
        return "spam"
    else:
        return "not spam"
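
The function above scores each feature separately and lets the features vote. For comparison, the usual Gaussian naive Bayes rule adds the log-likelihoods of all features to the log-prior once and compares the two totals. A minimal sketch of that variant, assuming toy parameters (`gaussian_nb_log_score` and all values below are hypothetical and not part of the original notebook):

```python
import numpy as np
from scipy.stats import norm

def gaussian_nb_log_score(x, means, stds, prior):
    """Return log(prior) + sum_i log N(x_i | means_i, stds_i)."""
    # norm.logpdf works elementwise on arrays, so no explicit loop is needed
    return np.log(prior) + norm.logpdf(x, loc=means, scale=stds).sum()

# Toy per-class parameters for two features (made up for illustration)
spam_mu, spam_sd = np.array([1.0, 2.0]), np.array([1.0, 1.0])
ham_mu, ham_sd = np.array([-1.0, 0.0]), np.array([1.0, 1.0])

x = np.array([0.8, 1.5])
s = gaussian_nb_log_score(x, spam_mu, spam_sd, prior=0.4)
h = gaussian_nb_log_score(x, ham_mu, ham_sd, prior=0.6)

label = "spam" if s >= h else "not spam"
print(label)
```

Working with `logpdf` directly avoids taking the logarithm of a density that has underflowed to zero. Features with zero standard deviation in a class would still need separate handling.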
        

In [13]:
spams = 0
not_spams = 0
for i in range(0, len(X_train)):
    result = bayes(X_train[i])
    if result == "spam":
        spams += 1
    else:
        not_spams +=1
        
print(spams)
print(not_spams)

1203
2251
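
The loop above counts predictions on the training rows. The last two steps of the procedure, testing on held-out data and evaluating, could be sketched with `confusion_matrix` as follows. The label arrays are made up for illustration. In the notebook they would be `Y_test` and the predictions for `X_test` mapped to 0/1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = spam, 0 = not spam)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Accuracy is the share of predictions on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```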
