# Spam Filter

· Procedure
    
    · Divide data into train and test sets
    · Keep test data in a safe!
    · Transform train data (normalize, discretize, etc)
    · Train model
    · Transform test data with the parameters fitted on the train data
    · Test model with test data
    · Evaluate results
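
The procedure above can be sketched end to end on a small synthetic dataset. This is only an illustration, not the notebook's own code: the random data, the seed, and the use of `train_test_split` with `stratify` are assumptions made for the example, and the step numbers in the comments simply follow the order of the list.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 200 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Steps 1-2: split, then leave the test set untouched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 3: fit the transformation on the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)

# Step 4: train the model.
model = GaussianNB().fit(X_train_s, y_train)

# Step 5: apply the training-set parameters to the test data.
X_test_s = scaler.transform(X_test)

# Steps 6-7: predict on the test set and evaluate.
accuracy = model.score(X_test_s, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Fitting the scaler before touching the test set is what keeps the test data "in a safe": no statistic of the test rows leaks into the model.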

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from scipy.stats import norm
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from random import random

In [2]:
df = pd.read_csv("spambase/spambase.data",header=None)
df.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | ... | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.038575 | 0.13903 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 52.172789 | 283.289285 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 194.89131 | 606.347851 | 0.488698 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.588 | 6.0 | 35.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 2.276 | 15.0 | 95.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.42 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 | ... | 0.0 | 0.188 | 0.0 | 0.315 | 0.052 | 0.0 | 3.706 | 43.0 | 266.0 | 1.0 |
| max | 4.54 | 14.28 | 5.1 | 42.81 | 10.0 | 5.88 | 7.27 | 11.11 | 5.26 | 18.18 | ... | 4.385 | 9.752 | 4.081 | 32.478 | 6.003 | 19.829 | 1102.5 | 9989.0 | 15841.0 | 1.0 |


## Divide data into train and test sets

In [3]:
# Create a random mask to split the data (1 = train, about 75% of the rows)
index=np.array([1 if random() < 0.75 else 0 for i in range(len(df))])

# Build the train and test feature sets (label column excluded)
X_train=np.array(df[df.columns[0:-1]])[index==1]
X_test=np.array(df[df.columns[0:-1]])[index==0]

# Keep the labels that say whether each email is spam (1) or not (0)
Y_train=np.array(df[df.columns[-1]])[index==1]
Y_test=np.array(df[df.columns[-1]])[index==0]

# Scale the data; the scaler is fitted on the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)
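
Because the mask above is drawn with an unseeded `random()`, each run produces a different split and therefore different confusion matrices. A minimal sketch of a reproducible version of the same idea follows; the seed value and the names `rng` and `is_train` are assumptions introduced for the example, and 4601 is the row count reported by `df.describe()`.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily
n_rows = 4601                    # number of rows in the dataset

# True marks a training row; each row is drawn independently with p = 0.75.
is_train = rng.random(n_rows) < 0.75

print(is_train.sum(), (~is_train).sum())
```

The boolean mask can be used in place of `index==1` and `index==0` when slicing the feature and label arrays.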

## Train Model

In [4]:
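# Class priors estimated from the training labels
p_s = sum(Y_train)/float(len(Y_train))
p_ns = 1 - p_s
# Work with log-priors so per-feature log-likelihoods can be added
ln_ps= np.log(p_s)
ln_pns = np.log(p_ns)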

In [5]:
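# Per-feature mean and standard deviation for the spam class
mediaS = np.mean(X_train[Y_train==1], axis=0)
dsS = np.std(X_train[Y_train==1], axis=0)
# Per-feature mean and standard deviation for the non-spam class
mediaNS = np.mean(X_train[Y_train==0], axis=0)
dsNS = np.std(X_train[Y_train==0], axis=0)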

## Function to detect class

In [6]:
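def spamClass(obs):
    """Return 1 if obs is classified as spam, 0 otherwise."""
    # Start each score from the log-prior of its class
    spam = ln_ps
    noSpam = ln_pns
    for i in range(len(obs)):
        # Add the Gaussian log-likelihood of feature i under each class;
        # a feature is skipped for a class when its mean is exactly zero
        # or its standard deviation is zero
        if (mediaS[i] != 0.0 and dsS[i] > 0.0):
            spam = spam + np.log(norm(mediaS[i], dsS[i]).pdf(obs[i]))
        if (mediaNS[i] != 0.0 and dsNS[i] > 0.0):
            noSpam = noSpam + np.log(norm(mediaNS[i], dsNS[i]).pdf(obs[i]))
    # Pick the class with the larger log-posterior score (ties go to non-spam)
    if spam > noSpam:
        return 1
    else:
        return 0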
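
The same decision rule (log-prior plus a sum of per-feature Gaussian log-likelihoods) can also be written in vectorized form. The sketch below is a variant, not the notebook's function: `gaussian_nb_predict` and the toy arrays are hypothetical, it uses `norm.logpdf` rather than taking the log of `pdf` to avoid underflow to `-inf`, and it drops a feature for all classes when any class has zero standard deviation, so every score sums the same terms.

```python
import numpy as np
from scipy.stats import norm

def gaussian_nb_predict(X, means, stds, log_priors):
    """Return the index of the highest-scoring class for each row of X.

    means and stds have shape (n_classes, n_features); log_priors has
    shape (n_classes,).
    """
    # Keep only features with a positive standard deviation in every class.
    usable = np.all(stds > 0.0, axis=0)
    Xu = X[:, usable][:, None, :]        # (n_samples, 1, n_usable)
    mu = means[:, usable][None, :, :]    # (1, n_classes, n_usable)
    sd = stds[:, usable][None, :, :]     # (1, n_classes, n_usable)
    # Broadcasting yields one log-likelihood per sample, class and feature.
    log_lik = norm.logpdf(Xu, loc=mu, scale=sd)
    scores = log_priors[None, :] + log_lik.sum(axis=2)
    # argmax returns the first maximum, so ties go to class 0.
    return np.argmax(scores, axis=1)

# Toy example (assumed values): class 0 centred at 0, class 1 centred at 3.
means = np.array([[0.0, 0.0], [3.0, 3.0]])
stds = np.ones((2, 2))
log_priors = np.log(np.array([0.5, 0.5]))
X_toy = np.array([[0.1, -0.2], [2.9, 3.3]])

pred_toy = gaussian_nb_predict(X_toy, means, stds, log_priors)
print(pred_toy)  # → [0 1]
```

With the notebook's variables, the inputs would be stacked as `np.vstack([mediaNS, mediaS])`, `np.vstack([dsNS, dsS])` and `np.array([ln_pns, ln_ps])`, so that class index 1 means spam. Its predictions are not guaranteed to match `spamClass` exactly, because the skipping rule differs.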

## Test model with test data

In [7]:
pred = [spamClass(X_test[i]) for i in range(len(X_test))]

In [8]:
confusion_matrix(Y_test, pred)

array([[450, 235],
       [ 13, 434]])

In [9]:
# With scikit-learn
skl = GaussianNB()
skl.fit(X_train, Y_train) 
pred_skl = skl.predict(X_test) 
confusion_matrix(Y_test, pred_skl)

array([[479, 206],
       [ 18, 429]])

## Results

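The confusion matrices show that the hand-written filter performs similarly to scikit-learn's GaussianNB: it classifies about 78% of the test emails correctly (884 of 1132), versus about 80% (908 of 1132) for scikit-learn. Most emails fall on the diagonal (spam-spam and nospam-nospam) and nearly all spam is caught, so the filter works reasonably well, although both models flag roughly a third of the legitimate mail as spam.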
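
The comparison can be made concrete by deriving a few rates from the two confusion matrices printed above. This is a small sketch: the helper `summarize` is hypothetical, and it assumes scikit-learn's layout, where rows are true labels and columns are predictions, with class 0 (not spam) first.

```python
import numpy as np

def summarize(cm):
    """Accuracy, spam recall and false-positive rate from a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tn + tp) / cm.sum(),
        "spam_recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

# Values copied from the outputs of cells 8 and 9.
custom = np.array([[450, 235], [13, 434]])
sklearn_nb = np.array([[479, 206], [18, 429]])

for name, cm in [("custom", custom), ("GaussianNB", sklearn_nb)]:
    stats = summarize(cm)
    print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

Both models catch almost all spam (recall around 0.96 to 0.97) while misclassifying roughly 30 to 34% of the non-spam emails, which is where most of the errors come from.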