# SVM Spam Filter
#### Miloslav Homer, Marek Zpěváček

### Mathematics

#### Objective function

Our goal is to minimize the objective function:
$$
    J(\alpha) = \frac{1}{m}\sum_{i=1}^m\left[1-y^{(i)}K^{(i)\top}\alpha\right]_+ + \frac{\lambda}{2}\alpha^\top K\alpha,
$$
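where $\alpha$ is the vector we solve for, $m$ is the number of messages, $\lambda$ is a parameter chosen in advance, and $K$ is the Gaussian kernel, i.e.:
$$
    K(x,z)=\exp\left(-\frac{1}{2\tau^2}\|x-z\|_2^2\right).
$$
The notation $[t]_+$ means $\max(t,0)$. Here $y^{(i)}\in\{-1,+1\}$ is the label of the $i$-th message and $K^{(i)}$ is the $i$-th column of the kernel matrix with entries $K_{ij}=K(x^{(i)},x^{(j)})$.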
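
As a rough illustration of the formulas above, the following sketch evaluates the kernel and $J(\alpha)$ directly with NumPy. The names `gaussian_kernel`, `objective`, the toy data, and the value `lam=1/64` are assumptions made for this example; they are not part of the notebook's own code.

```python
import numpy as np

def gaussian_kernel(x, z, tau):
    """K(x, z) = exp(-||x - z||^2 / (2 tau^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * tau**2))

def objective(alpha, K, y, lam):
    """J(alpha): mean hinge loss plus (lam / 2) * alpha^T K alpha."""
    margins = y * (K @ alpha)                # y^(i) * K^(i)^T alpha for every i
    hinge = np.maximum(1.0 - margins, 0.0)   # [1 - margin]_+
    return hinge.mean() + 0.5 * lam * (alpha @ K @ alpha)

# Hypothetical toy data: three binary feature vectors with labels in {-1, +1}.
X_toy = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])
y_toy = np.array([1.0, -1.0, 1.0])
K_toy = np.array([[gaussian_kernel(a, b, tau=8.0) for b in X_toy] for a in X_toy])

J_zero = objective(np.zeros(3), K_toy, y_toy, lam=1.0 / 64)
print(J_zero)  # 1.0: with alpha = 0 every margin is 0, so every hinge term is 1
```

Starting from $\alpha = 0$ the objective equals 1, which gives a simple sanity check for any training run: the averaged iterate should reach a lower value.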

### Parsing the data

#### Input format

The input is a file containing, in this order:
the number of emails, the dictionary size (i.e. the number of distinct words occurring in these emails), the dictionary (space-separated), and the list of emails. The first number of each email is always 0 or 1 and indicates whether that email is spam. It is followed by a list of numbers terminated by -1; position $i$ holds the number $j$, meaning that the $i$-th word of the email is the $j$-th word of the dictionary.

In [1]:
import numpy as np
#np.set_printoptions(threshold=np.nan)

def readData(path):
    reader=open(path)
    
    # ignore first line
    reader.readline()

    # second line contains number of emails and dictionary size
    array = reader.readline().split(' ')
    num_of_emails = int(array[0])
    dict_size = int(array[1])
    
    # ignore third line
    reader.readline()
    
    # np.int was removed in NumPy 1.24; the built-in int is equivalent
    x= np.zeros((num_of_emails,dict_size), dtype=int)
    y= np.zeros(num_of_emails, dtype=int)
    
    # x[i,j]: number of occurrences of the j-th word in the i-th email
    # y[i]: spam flag of the i-th email (0 or 1)
    for i in range(num_of_emails):
        array=reader.readline().split(' ')
        int_array=[int(e) for e in array]
        y[i]=int_array[0]
        
        # after the label, the line is read as (offset, count) pairs: each offset
        # is added to the running word index - check encoding.txt file
        index=0
        for j in range(1,len(array)//2):
            index=index+int_array[2*j-1]
            x[i,index]=int_array[2*j]
    reader.close()
    return (x,y)
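
To make the indexing in `readData` easier to follow, here is a minimal sketch that decodes a single email line the same way the loop above does. It assumes the layout implied by that loop (label, then pairs of index offset and word count, then a trailing -1); the helper `decode_email_line` and the sample line are hypothetical.

```python
import numpy as np

def decode_email_line(line, dict_size):
    """Decode one email line into (label, word-count vector)."""
    values = [int(tok) for tok in line.split()]
    label = values[0]
    counts = np.zeros(dict_size, dtype=int)
    index = 0
    # Walk the (offset, count) pairs; the trailing -1 terminator is never read.
    for k in range(1, len(values) - 1, 2):
        index += values[k]             # offsets accumulate into a dictionary index
        counts[index] = values[k + 1]  # number of occurrences of that word
    return label, counts

# Hypothetical line: spam (1), word 2 occurs 3 times, word 2 + 4 = 6 occurs once.
label, counts = decode_email_line("1 2 3 4 1 -1", dict_size=8)
print(label, counts.tolist())  # 1 [0, 0, 3, 0, 0, 0, 1, 0]
```

Storing offsets instead of absolute indices keeps the numbers small when an email uses only a few words from a large dictionary.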

### Learning phase

In [2]:
import random
import math

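def learnSVM(x,y):
    m=len(y)
    # binarize the features (word present / absent) and map labels {0,1} -> {-1,+1}
    x=1*(x>0)
    y=2*y-1
    alpha=np.zeros(m)
    avg_alpha = np.zeros(m)
    num_outer_loops = 40
    
    # Gaussian kernel matrix of the training set; tau is a global defined in a later cell (In [3])
    Ker=np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            Ker[i,j]=-(np.linalg.norm(x[i]-x[j],2))**2/(2*tau*tau)
    Ker=np.exp(Ker)
    
    # stochastic subgradient descent: one random example per step, step size 1/sqrt(i+1)
    for i in range(num_outer_loops * m):
        index = random.randint(0,m-1)
        margin=y[index]*np.dot(Ker[index],alpha)
        # the hinge part is active only when margin < 1; the regularization part
        # np.dot(Ker[index],alpha)/64 is a scalar broadcast over all coordinates
        g=np.dot(Ker[index],alpha)/64-(margin<1)*y[index]*Ker[index]
        alpha=alpha-g/math.sqrt(i+1)
        avg_alpha+=alpha
        
    # return the average of all iterates rather than the last one
    avg_alpha=avg_alpha/(num_outer_loops*m)
    
    return avg_alpha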
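
The double loop that fills `Ker` runs in pure Python and becomes slow for the larger training sets. A possible vectorized alternative, sketched here under the assumption that the expansion $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2a^\top b$ is acceptable numerically, computes the same matrix with a few NumPy calls. The function name `gaussian_kernel_matrix` and the random test data are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, tau):
    """Matrix with entries exp(-||A[i] - B[j]||^2 / (2 tau^2))."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    # rounding can leave tiny negative values; clip them to zero
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-sq_dists / (2.0 * tau**2))

# Compare against the explicit double loop on small random binary data.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5, 7))
B = rng.integers(0, 2, size=(4, 7))
K_fast = gaussian_kernel_matrix(A, B, tau=8.0)
K_loop = np.array([[np.exp(-np.linalg.norm(a - b) ** 2 / (2 * 8.0**2)) for b in B]
                   for a in A])
print(np.allclose(K_fast, K_loop))  # True
```

Because the function takes two separate matrices, the same helper could serve both the training kernel (`A` and `B` both the training set) and the test kernel (`A` the test set, `B` the training set).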

### Testing phase

In [4]:
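def testSVM(x_test,y_test,avg_alpha,x_train):
    # same preprocessing as in learnSVM: binary features, labels in {-1,+1}
    x_test = 1*(x_test>0)
    x_train = 1*(x_train>0)
    y_test=2*y_test-1
    # kernel between every test email and every training email
    Ker=np.zeros((len(y_test), len(avg_alpha)))
    for i in range(len(y_test)):
        for j in range(len(avg_alpha)):
            Ker[i,j]=-(np.linalg.norm(x_test[i]-x_train[j],2))**2/(2*tau*tau)
    Ker=np.exp(Ker)
    preds = np.dot(Ker,avg_alpha)
    # an email counts as an error when score and label disagree in sign, or the score is 0
    test_err=np.sum((np.multiply(preds,y_test))<=0)/len(y_test)
    return test_err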
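
The error rate returned by `testSVM` can be checked by hand on a few made-up scores. This small sketch uses hypothetical values for `preds_toy` and `y_toy` and applies the same rule as the function: an example is an error when score times label is not strictly positive.

```python
import numpy as np

# Hypothetical kernel scores and true labels in {-1, +1}.
preds_toy = np.array([0.7, -0.2, 0.0, -1.3])
y_toy = np.array([1, 1, -1, -1])

# Error when the score has the wrong sign or is exactly zero.
errors = preds_toy * y_toy <= 0
test_err_toy = errors.sum() / len(y_toy)
print(errors.tolist(), test_err_toy)  # [False, True, True, False] 0.5
```

Note that a score of exactly 0 is counted as a mistake regardless of the label, so the reported error is slightly pessimistic for undecided examples.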

### Tests and results

In [3]:
tau = 8
num_of_tests = 1
testSizes = ['50', '100', '200', '400', '800', '1400']

In [None]:
(m_test, category_test) = readData('spam_data/MATRIX.TEST')
for size in testSizes:
    err = 0
    for i in range(num_of_tests):
        (m_train, y_train) = readData('spam_data/MATRIX.TRAIN.' + size)
        avg_alpha = learnSVM(m_train, y_train)
        err += testSVM(m_test, category_test, avg_alpha, m_train)
    err = err / num_of_tests
    print('Train size:', size, 'Error:', err)

Train size: 50 Error: 0.015
Train size: 100 Error: 0.01375
Train size: 200 Error: 0.00375
Train size: 400 Error: 0.0025
Train size: 800 Error: 0.00875
