# Helper functions

Some of the helper functions rely on Python's pseudo-random number generator (the `random` module).

- **dot_product** takes two vectors of equal length and returns their dot product.
- **random_sample** takes three integers (lower limit, upper limit, sample size) and returns a list of distinct random integers drawn from the closed interval between the two limits.
- **shuffle_list** takes a list and returns a shuffled copy of it.

In [1]:
import random
def dot_product(vector_a,vector_b):
    #the dot product is only defined for vectors of equal length
    if len(vector_a)!=len(vector_b):
        raise ValueError("vectors have different lengths")
    return sum(a*b for a,b in zip(vector_a,vector_b))
 
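#note that both the lower and the upper limit are included
def random_sample(lower_limit,upper_limit,sample_size):
    #the closed interval holds upper_limit-lower_limit+1 distinct integers;
    #asking for more would make the loop below run forever
    if sample_size > (upper_limit-lower_limit+1):
        raise ValueError("The sample size is larger than the number of integers in the interval")
    random_set=set([])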
    random_list=[]
    while(len(random_set)<sample_size):
        a=random.randint(lower_limit,upper_limit)
        if a not in random_set:
            random_set.add(a)
            random_list.append(a)     
    return(random_list)

def shuffle_list(a_list):
    list_indices=list(range(len(a_list)))
    list_of_final_indices=[]
    for i in range(len(a_list)):
        current_choice=random.choice(list_indices)
        list_of_final_indices.append(current_choice)
        list_indices.remove(current_choice)
    return [a_list[i] for i in list_of_final_indices]
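
For comparison, Python's `random` module already provides `random.sample` and `random.shuffle`, which cover the same ground as `random_sample` and `shuffle_list`. The sketch below is an alternative, not the code used in this notebook; the names `random_sample_stdlib` and `shuffle_list_stdlib` are made up for illustration.

```python
import random

def random_sample_stdlib(lower_limit, upper_limit, sample_size):
    # range() excludes its end, so add 1 to keep the upper limit inclusive;
    # random.sample raises ValueError if the sample is larger than the range
    return random.sample(range(lower_limit, upper_limit + 1), sample_size)

def shuffle_list_stdlib(a_list):
    # copy first so the caller's list is left untouched, as in shuffle_list
    shuffled = list(a_list)
    random.shuffle(shuffled)
    return shuffled

random.seed(0)  # fixed seed only to make the demonstration repeatable
indices = random_sample_stdlib(0, 9, 5)
print(indices)
print(shuffle_list_stdlib(["a", "b", "c", "d"]))
```

Both variants keep the calling convention of the hand-written helpers, so they could be swapped in without touching the rest of the notebook.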
    

# Loading the data

**load_data_with_ones** takes a file with two comma-separated numeric columns (coordinates in the real plane) and a second file with one numeric column (the class of each point).

The function returns a list of rows with four columns each:
- column 1 is a constant 1, so that the bias weight is handled by the same dot product in the unit step prediction
- column 2 is the first column from the data file, $x_1$
- column 3 is the second column from the data file, $x_2$
- column 4 is the class label


In [2]:
def load_data_with_ones(training_data,training_class):

    #read both files; the with blocks close them automatically
    with open(training_data) as train_data_file:
        train_data_file_lines=train_data_file.readlines()
    with open(training_class) as train_class_file:
        train_class_file_lines=train_class_file.readlines()
    data_list=[]
    for i in range(len(train_data_file_lines)):
        #the leading 1 is the constant input for the bias weight
        temp_list=[1.0]
        for value in train_data_file_lines[i].strip().split(","):
            temp_list.append(float(value))
        #the class label comes from the same line of the second file
        temp_list.append(float(train_class_file_lines[i].strip()))
        data_list.append(temp_list)
    return data_list
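
To make the expected file layout concrete, here is a minimal sketch that writes two tiny files and loads them into the four-column row format described above. The file contents are made up, and `load_rows` is a hypothetical variant built on the `csv` module, not the function used in this notebook.

```python
import csv
import os
import tempfile

def load_rows(data_path, class_path):
    # hypothetical variant of load_data_with_ones built on the csv module
    rows = []
    with open(data_path, newline="") as data_file, open(class_path) as class_file:
        # pair each coordinate line with the class on the same line number
        for coordinates, label in zip(csv.reader(data_file), class_file):
            rows.append([1.0] + [float(v) for v in coordinates] + [float(label)])
    return rows

# made-up sample files, only to show the expected layout
with tempfile.TemporaryDirectory() as folder:
    data_path = os.path.join(folder, "data.csv")
    class_path = os.path.join(folder, "class.csv")
    with open(data_path, "w") as f:
        f.write("0.5,1.5\n-2.0,0.25\n")
    with open(class_path, "w") as f:
        f.write("1\n-1\n")
    rows = load_rows(data_path, class_path)

print(rows)  # [[1.0, 0.5, 1.5, 1.0], [1.0, -2.0, 0.25, -1.0]]
```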


# Implementation of Perceptron algorithm with Stochastic Gradient Descent

**unit_step_prediction** takes a single data point with the leading constant, $(1,x_1,x_2)$, and a list of weights, and returns the predicted class, $1$ or $-1$.

**learn_weigths** takes a training dataset, the number of epochs, and the learning rate $\eta$ $(0<\eta<1)$. It returns the learned weights.

**evaluation_metrics** takes a test dataset and the learned weights and returns the percentage of correctly classified points.
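
Written out, the prediction and the weight update performed by the code in this section look as follows. The symbols $\hat{y}$ (prediction), $y$ (class label), $w_i$ (weights) and $x_0 = 1$ (the constant first column) are notation introduced here for illustration.

$$
z = w_0 x_0 + w_1 x_1 + w_2 x_2, \qquad
\hat{y} =
\begin{cases}
1 & \text{if } z \ge 0 \\
-1 & \text{if } z < 0
\end{cases}
$$

$$
w_i \leftarrow w_i - \eta\,(\hat{y} - y)\,x_i, \qquad i = 0, 1, 2
$$

If the labels are $1$ and $-1$, a correct prediction gives $\hat{y} - y = 0$ and leaves the weights unchanged, while a wrong one gives $\hat{y} - y = \pm 2$ and moves the weights toward the correct side.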


In [45]:
def unit_step_prediction(data_row, weigths):
    z=dot_product(data_row,weigths)
    if z>=0:
        return 1
    else:
        return -1

def learn_weigths(data, epoch_num, learning_rate):
    weigths=[random.random()*random.choice([-1,1]) for i in range(len(data[0])-1)]
    for epoch in range(epoch_num):
        data=shuffle_list(data)
        number_of_missclassified=0
        for data_row in data:
            difference = unit_step_prediction(data_row[0:3],weigths)-data_row[3]
            if difference!=0:
                number_of_missclassified+=1
            for i in range(len(weigths)):
                weigths[i]=weigths[i]-difference*data_row[i]*learning_rate
        print("Epoch: %d" % (epoch+1), "Learning rate: %.2f" % learning_rate, "Success rate: %.2f %%" % ((1-number_of_missclassified/len(data))*100))
    return weigths

def evaluation_metrics(test_data,calculated_weigths):
    num_all=len(test_data)
    num_true=0
    for test_row in test_data:
        if test_row[3]==unit_step_prediction(test_row[0:3],calculated_weigths):
            num_true+=1
    return (num_true)/num_all*100
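
A single update step can be traced by hand. The sketch below repeats the prediction and update logic on one made-up row with all-zero starting weights (an assumption chosen for easy arithmetic; the notebook starts from random weights).

```python
def dot_product(vector_a, vector_b):
    return sum(a * b for a, b in zip(vector_a, vector_b))

def unit_step_prediction(data_row, weights):
    return 1 if dot_product(data_row, weights) >= 0 else -1

# made-up example: one row [1, x1, x2, class] and all-zero starting weights
data_row = [1.0, 2.0, 1.0, -1.0]
weights = [0.0, 0.0, 0.0]
learning_rate = 0.1

before = unit_step_prediction(data_row[0:3], weights)  # z = 0, so the prediction is 1
difference = before - data_row[3]                      # 1 - (-1) = 2
# zip stops after the three weights, so the class column is not used here
weights = [w - difference * x * learning_rate for w, x in zip(weights, data_row)]
after = unit_step_prediction(data_row[0:3], weights)   # z is now negative

print(before, difference, weights, after)
```

The row is misclassified before the update and correctly classified after it, which is the behaviour the perceptron rule relies on.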

# Main program

In [61]:
#load the data
data_with_ones=load_data_with_ones("linsep-traindata.csv","linsep-trainclass.csv")
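#select the random sample to learn weights
train_sample_indices=random_sample(0,99, 90)
shuffled_train_sample=[data_with_ones[i] for i in train_sample_indices]
#the remaining rows form the test sample (compare indices, not rows)
test_sample=[data_with_ones[i] for i in range(0,100) if i not in train_sample_indices]
#learn weights; this call trains on the full dataset, so the test rows are also seen
#during training (pass shuffled_train_sample instead to keep them unseen)
new_weigths=learn_weigths(data_with_ones,10,0.1)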
#evaluate the classifier
print("Evaluated success of the classifier is %.2f %%" % (evaluation_metrics(test_sample,new_weigths)))

Epoch: 1 Learning rate: 0.10 Success rate: 98.00 %
Epoch: 2 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 3 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 4 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 5 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 6 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 7 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 8 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 9 Learning rate: 0.10 Success rate: 100.00 %
Epoch: 10 Learning rate: 0.10 Success rate: 100.00 %
Evaluated success of the classifier is 100.00 %
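
One way to keep the train and test samples disjoint is to shuffle the row indices once and cut the shuffled list in two. The sketch below assumes this approach; `split_train_test` is a hypothetical helper and the rows are made up to stand in for `data_with_ones`.

```python
import random

def split_train_test(rows, train_size, seed=None):
    # hypothetical helper: shuffle the row indices, then cut them in two
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    train_indices = indices[:train_size]
    test_indices = indices[train_size:]
    return [rows[i] for i in train_indices], [rows[i] for i in test_indices]

# made-up rows standing in for data_with_ones
rows = [[1.0, float(i), float(-i), 1.0 if i % 2 else -1.0] for i in range(100)]
train_rows, test_rows = split_train_test(rows, 90, seed=1)
print(len(train_rows), len(test_rows))  # 90 10
```

Training only on `train_rows` and evaluating only on `test_rows` would then measure how the weights perform on points not seen during learning.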


# Comments and questions


- Should the provided sample be split into a train sample and a test sample, or should the test sample be a subset of the train sample?
- Should the sample be shuffled in each epoch?
- Does learning with a subset of the whole sample destroy the balance of classes for training?
- Is it enough to evaluate the classifier with the simple ratio of successes to all predictions?
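
The last two questions can be explored with simple counts. The sketch below is one possible approach, not part of the original program: it counts rows per class and tallies (true, predicted) pairs, from which precision and recall follow. The helper names and the labels are made up for illustration.

```python
from collections import Counter

def class_balance(rows):
    # number of rows per class label (the label is the last column)
    return Counter(row[-1] for row in rows)

def confusion_counts(true_labels, predicted_labels):
    # counts of (true, predicted) pairs for labels in {1, -1}
    return Counter(zip(true_labels, predicted_labels))

# made-up labels and predictions, for illustration only
true_labels = [1, 1, 1, -1, -1, 1]
predicted = [1, 1, -1, -1, 1, 1]

counts = confusion_counts(true_labels, predicted)
true_positive = counts[(1, 1)]
false_negative = counts[(1, -1)]
false_positive = counts[(-1, 1)]

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
print(precision, recall)  # 0.75 0.75

# class balance of made-up rows in the [1, x1, x2, class] layout
print(class_balance([[1, 0.0, 0.0, 1], [1, 1.0, 1.0, 1], [1, 2.0, 2.0, -1]]))
```

Comparing `class_balance` of the full sample with that of the train sample shows whether the random subset shifted the class proportions.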


# References


1. https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#the-unit-step-function