# **PERCEPTRON**

In this notebook I will perform the following steps:
* Load the car data set into a pandas data frame
* Clean the data and build features for the perceptron algorithm
* Implement the averaged perceptron algorithm
* Implement cross validation that outputs the accuracy



## Download car data

The car data set (`auto-mpg.tsv`) is read from my drive; I could not find a reasonable URL from which I could read it.

In [396]:
import pandas as pd
import os
import numpy as np
import csv


def load_car_data(path):
    '''
    @path - path to the data set
    returns a pandas data frame with the read data set
    '''
    data = []
    with open(path) as f:
        for row in csv.DictReader(f, delimiter='\t'):
            data.append(row)
    return pd.DataFrame(data)

df = load_car_data('auto-mpg.tsv')
print('{} rows loaded'.format(df.shape[0]))


392 rows loaded
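
As an alternative to the `csv.DictReader` loop, a tab-separated file can probably be read with `pd.read_csv` directly. This is a minimal sketch, not the loader used above: the helper name `load_car_data_pandas` and the two-row inline sample are made up for illustration, and `dtype=str` is an assumption chosen so that every column stays a string, as it does with `csv.DictReader`.

```python
import io
import pandas as pd


def load_car_data_pandas(source):
    """Read a tab-separated file (path or file-like object) into a data frame.

    dtype=str keeps every value as a string, mirroring csv.DictReader.
    """
    return pd.read_csv(source, sep='\t', dtype=str)


# Tiny made-up sample in a tab-separated layout (not the real file).
sample = io.StringIO(
    "mpg\tcylinders\tweight\tcar_name\n"
    "-1\t8\t4732\thi 1200d\n"
    "1\t4\t2694\taudi 100ls\n"
)
sample_df = load_car_data_pandas(sample)
print('{} rows loaded'.format(sample_df.shape[0]))
```

Because the values stay strings, comparisons such as `df['mpg'] == '-1'` behave the same way as with the original loader.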


In [389]:
print('Vehicles with HIGH fuel consumption')
df[df['mpg']=='-1'].head()

Vehicles with HIGH fuel consumption


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,-1,8,304,193,4732,18.5,70,1,hi 1200d
1,-1,8,307,200,4376,15.0,70,1,chevy c20
2,-1,8,360,215,4615,14.0,70,1,ford f250
3,-1,8,318,210,4382,13.5,70,1,dodge d200
4,-1,8,350,180,3664,11.0,73,1,oldsmobile omega


In [397]:
print('Vehicles with LOW fuel consumption')
df[df['mpg']=='1'].head()

Vehicles with LOW fuel consumption


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
196,1,4,115,950,2694,15.0,75,2,audi 100ls
197,1,4,120,880,2957,17.0,75,2,peugeot 504
198,1,4,120,970,2506,14.5,72,3,toyota corona mark ii (sw)
199,1,4,122,860,2220,14.0,71,1,mercury capri 2000
200,1,4,140,780,2592,18.5,75,1,pontiac astro


## Clean the data and build features for the perceptron algorithm
* Convert displacement, horsepower, weight, acceleration and model_year into numeric (float) values
* Standardize the numeric features
* Perform one-hot encoding for cylinders and origin
* Drop the remaining features - so we do not use car name, as it seems irrelevant to fuel consumption
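
Standardizing a feature means subtracting its mean and dividing by its standard deviation, so that the result has mean 0 and standard deviation 1. A minimal sketch on made-up numbers (the `weights` values are illustrative, not taken from the data set), assuming the population standard deviation that `np.std` computes by default:

```python
import numpy as np

# Made-up values, for illustration only.
weights = np.array([2000.0, 3000.0, 4000.0])

# z-score: subtract the mean, divide by the (population) standard deviation.
z = (weights - np.mean(weights)) / np.std(weights)

print(z)
print(np.mean(z), np.std(z))
```

Putting all numeric features on the same scale keeps a feature with large raw values, such as weight, from dominating the perceptron updates.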

In [391]:
######### Make sure we have numeric data in appropriate fields #########
for feature in ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']:
    df[feature] = df[feature].astype('float')
    
    
######### Make sure that mpg takes on only two values: #########
if set(df['mpg'].unique()) != {-1.0, 1.0}:
    print('There is a problem with mpg field')

    
######### Create a function that builds one-hot encoded features #########
def one_hot(x):
    '''
    @param x : pandas series object
    returns a data frame with the original series and the one hot encoded fields
    '''
    new_frame = pd.DataFrame(x)
    for val in x.unique():
        _name = str(int(val))
        new_frame[x.name + _name] = new_frame.apply(lambda col: 1 if col[x.name] == val else 0, axis=1)
    return new_frame
        
        
######### Create a function that standardizes #########
def standarize(x):
    '''
    @param x : pandas series holding the numeric field to be standardized
    returns the series shifted to mean 0 and scaled to standard deviation 1
    '''
    return (x - np.mean(x))/np.std(x)
    

######### proceed with standardization and one-hot encoding #########    
for feature in ['displacement', 'horsepower', 'weight', 'acceleration', 'model_year']:
    df[feature] = standarize(df[feature])


for feature in ['cylinders', 'origin']:
    one_hot_frame = one_hot(df[feature])
    one_hot_frame.drop(labels = feature, inplace=True, axis=1)
    df = pd.concat([df, one_hot_frame], axis=1)
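
For comparison, pandas ships a built-in for the one-hot step. The sketch below applies `pd.get_dummies` to a made-up `origin` column; `prefix_sep=''` is chosen here only so that the column names follow the `origin1`, `origin2` pattern produced by the `one_hot` helper. It is an alternative under those assumptions, not the code used in this notebook.

```python
import pandas as pd

# Made-up column, for illustration only.
toy = pd.DataFrame({'origin': ['1', '2', '3', '1']})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy['origin'], prefix='origin', prefix_sep='', dtype=int)
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```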

In [392]:
######### come up with the final set of features #########
final_features = ['displacement', 'horsepower', 'weight', 'acceleration', 'cylinders3',
                  'cylinders4', 'cylinders5', 'cylinders6', 'cylinders8', 'origin1', 'origin2', 'origin3'
                 ]    


# change to numpy as input for the learning algorithm
data = df[final_features].to_numpy().T
labels = df['mpg'].to_numpy().reshape((df.shape[0],1)).T
print('We have {} features and {} observations in our data set'.format(data.shape[0], data.shape[1]))

We have 12 features and 392 observations in our data set


## Implementation of the perceptron

In [393]:
def averaged_perceptron(data, labels, params=(300,True)):
    '''
    @data - numpy array with observations in columns, features in rows
    @labels - numpy array with one row and as many columns as observations in data
    @params - tuple holding the max number of iterations and a flag: return the averaged theta (True) or the last theta (False)
    
    returns theta, a numpy array with one column and as many rows as features plus one (the last row is the intercept)
    '''
    max_iter, averaged = params
    data = np.vstack((data, np.ones((1,data.shape[1]))))  # add ones for the intercept
    m,n = data.shape # m - number of features n - number of data points
    theta = np.zeros((m,1)) # initialize theta with zeros
    theta_sum = np.zeros((m,1))
    counter = 0
    all_good = False

    while counter < max_iter and not all_good:
        counter += 1
        all_good = True
        for i in range(n):
            if sum(labels[0,[i]]*np.matmul(theta.T, data[:,[i]]))<=0:
                theta = theta + labels[0,[i]]*data[:,[i]]
                theta_sum = theta_sum + theta
                all_good = False
                
    if not all_good:
        print("Max iterations reached without finding a perfect separator!")
    if averaged:
        return theta_sum / counter
    else:
        return theta
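
For reference, the averaged perceptron is often described with the running sum updated after every example, not only after mistakes, and with the final sum divided by the total number of steps. The sketch below follows that description. It is a hedged variant under those assumptions, not a replacement for `averaged_perceptron`; the name `averaged_perceptron_every_step` and the toy data are made up for illustration. It keeps the same layout: features in rows, observations in columns.

```python
import numpy as np


def averaged_perceptron_every_step(data, labels, max_iter=300):
    """Averaged perceptron that accumulates theta after every example.

    data   - array of shape (features, observations)
    labels - array of shape (1, observations) with values -1 or +1
    Returns the averaged theta with shape (features + 1, 1); the last
    entry is the intercept.
    """
    data = np.vstack((data, np.ones((1, data.shape[1]))))  # intercept row
    m, n = data.shape
    theta = np.zeros((m, 1))
    theta_sum = np.zeros((m, 1))
    steps = 0

    for _ in range(max_iter):
        mistakes = 0
        for i in range(n):
            x = data[:, [i]]
            y = labels[0, i]
            if y * (theta.T @ x).item() <= 0:  # misclassified or on the boundary
                theta = theta + y * x
                mistakes += 1
            theta_sum += theta  # accumulate after every example
            steps += 1
        if mistakes == 0:  # a full pass without mistakes: converged
            break

    return theta_sum / steps


# Made-up, linearly separable toy problem: the label is the sign of the feature.
toy_data = np.array([[-2.0, -1.0, 1.0, 2.0]])
toy_labels = np.array([[-1.0, -1.0, 1.0, 1.0]])
toy_theta = averaged_perceptron_every_step(toy_data, toy_labels)
toy_pred = np.sign(toy_theta.T @ np.vstack((toy_data, np.ones((1, 4)))))
print(toy_pred)
```

Only the direction of theta matters for classification, so the two averaging schemes can give similar separators, but they are not guaranteed to agree.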
        

Here is the theta vector, representing the linear separator, as output by the perceptron algorithm:

In [394]:
theta = averaged_perceptron(data, labels)
print(theta)

Max iterations reached without finding a perfect separator!
[[-20.19171811]
 [ 11.92208309]
 [-97.90831313]
 [ 10.85756865]
 [-49.40333333]
 [  4.78333333]
 [ 13.34      ]
 [-18.55333333]
 [ 37.47666667]
 [-12.15      ]
 [ 16.53      ]
 [-16.73666667]
 [-12.35666667]]
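
To make the role of theta concrete: its last entry is the intercept, and an observation is classified by the sign of the dot product between theta and the feature vector with a 1 appended. A minimal sketch, where `predict` and the toy values are illustrative names and numbers, not part of the notebook:

```python
import numpy as np


def predict(data, theta):
    """Return +1/-1 predictions with shape (1, observations).

    data  - array of shape (features, observations)
    theta - array of shape (features + 1, 1); the last entry is the intercept
    """
    augmented = np.vstack((data, np.ones((1, data.shape[1]))))
    scores = theta.T @ augmented
    # Scores of exactly 0 are treated as -1 here; this tie rule is an assumption.
    return np.where(scores > 0, 1.0, -1.0)


# Made-up separator and points: two features plus an intercept.
toy_theta = np.array([[1.0], [-2.0], [0.5]])
toy_points = np.array([[3.0, 0.0],
                       [1.0, 1.0]])
print(predict(toy_points, toy_theta))
```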


## Cross validation of the results

Here we perform a 10-fold cross validation on the data set to obtain a better grasp of the accuracy of the perceptron algorithm.
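
The idea behind k-fold cross validation can be sketched on indices alone: shuffle the observation indices, cut them into k nearly equal folds, and let each fold serve once as the test set while the remaining folds are used for training. The helper name `k_fold_indices` and the fixed seed are assumptions for illustration.

```python
import numpy as np


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


# 392 observations, as in the data set above.
splits = list(k_fold_indices(392, k=10))
print([len(test) for _, test in splits])
```

Since 392 is not divisible by 10, `np.array_split` makes the first two folds one observation larger than the others.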

In [395]:

def simple_evaluation(data, labels, theta):
    '''
        @data - numpy array with observations in columns, features in rows
        @labels - numpy array with one row and as many columns as observations in data
        @theta - linear separator, output of the learning algorithm
        
        returns a float showing the ratio of correctly classified observations
    '''
    # add ones:
    data = np.vstack((data, np.ones((1,data.shape[1]))))
    m,n = data.shape
    success = 0.0
    for i in range(n):
        if sum(labels[0,[i]]*np.matmul(theta.T, data[:,[i]])) > 0:
            success+=1.0
    return success/n

def cross_validation(learner, data, labels, learner_params=None, k=10):
    '''
        @learner - a classifying function
        @data - numpy array with observations in columns, features in rows
        @labels - numpy array with one row and as many columns as observations in data
        @learner_params - in case we want to pass some parameters to the learner
        @k - integer telling into how many subsets we want to split the data for cross validation
        
        returns the averaged cross-validated accuracy
    '''
    
    # first we randomly shuffle the data (by columns!):
    new_data = np.vstack((data, labels))
    new_data = new_data.T
    np.random.shuffle(new_data)
    new_data = new_data.T
    data = new_data[:-1,:]
    labels = new_data[[-1],:]
    
    # get lists of split data arrays
    split_data = np.array_split(data, k, axis=1)
    split_labels = np.array_split(labels, k, axis=1)

    total_accuracy = 0
    for i in range(k):
        # train on all folds except the i-th, test on the i-th
        data_train = np.concatenate(split_data[:i] + split_data[i+1:], axis=1)
        labels_train = np.concatenate(split_labels[:i] + split_labels[i+1:], axis=1)
        data_test = split_data[i]
        labels_test = split_labels[i]

        if learner_params is not None:
            theta = learner(data_train, labels_train, learner_params)
        else:
            theta = learner(data_train, labels_train)

        accuracy = simple_evaluation(data_test, labels_test, theta)
        print('During the {}-th validation accuracy was {}'.format(i + 1, accuracy))
        total_accuracy += accuracy
    return total_accuracy / k
    
print('') 
print('The cross-validated accuracy is {}'.format(cross_validation(averaged_perceptron, data, labels)))


Max iterations reached without finding a perfect separator!
During the 1-th validation accuracy was 0.85
Max iterations reached without finding a perfect separator!
During the 2-th validation accuracy was 0.875
Max iterations reached without finding a perfect separator!
During the 3-th validation accuracy was 0.9230769230769231
Max iterations reached without finding a perfect separator!
During the 4-th validation accuracy was 0.8461538461538461
Max iterations reached without finding a perfect separator!
During the 5-th validation accuracy was 0.8717948717948718
Max iterations reached without finding a perfect separator!
During the 6-th validation accuracy was 0.8205128205128205
Max iterations reached without finding a perfect separator!
During the 7-th validation accuracy was 0.9230769230769231
Max iterations reached without finding a perfect separator!
During the 8-th validation accuracy was 0.9487179487179487
Max iterations reached without finding a perfect separator!
During the 9-t