# Machine Learning Homework 3

Student: Dagmawi Abraham Seifu

Professor: Pasquale Caianiello

# Homework problem
Implement the Support Vector Machine (SVM) algorithm for generalized learning, i.e. predicting any specified feature of a dataset.

I used the SVC class from sklearn to run the SVM algorithm on the encoded dataset. In addition, the Pipeline class from sklearn is used to combine multiple processing steps into a single estimator.

# Dataset
For this homework I used the Letter Recognition dataset, which contains 20,000 samples, each with a letter label and 16 features.

In [0]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import random, copy
import csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score
import joblib  # sklearn.externals.joblib was removed from scikit-learn; joblib is a standalone package
import os

We first create a CharacteristicVector class with two methods for encoding and decoding the columns (features) of a given dataset. The number of unique values in a column determines how many bits are used to encode that feature. If, for instance, a column contains 10 unique values, those values are treated as the classes of that feature and each is represented with 10 bits (1000000000, 0100000000, ...). If a column holds floating-point values, they are rounded to the nearest integer before encoding.
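
As a quick illustration of the scheme described above, here is a minimal sketch of the same one-bit-per-class encoding on a toy column. The helper name `one_hot` is hypothetical and is not part of this notebook; it relies on `np.unique(..., return_inverse=True)` instead of the loop-based class defined in the next cell.

```python
import numpy as np

def one_hot(values):
    """Return (classes, matrix): one row per value, one column per unique class."""
    values = np.asarray(values).astype(str)
    # classes is sorted; inverse[i] is the position of values[i] within classes
    classes, inverse = np.unique(values, return_inverse=True)
    # Row k of the identity matrix is the bit pattern for class k
    matrix = np.eye(len(classes), dtype=int)[inverse]
    return classes, matrix

classes, matrix = one_hot(["B", "A", "C", "A"])
print(classes)
print(matrix)
```

Each row of `matrix` contains exactly one 1, at the position of that value's class in `classes`.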

In [0]:
class CharacteristicVector:
  def __init__(self):
    self.columnMap = np.array([])

  def encode(self, column):
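    # Round float columns to the nearest integer, then treat every value as a string class
    column = column.astype(str) if not str(column.dtypes).startswith("float") else column.apply(np.round).astype(int).astype(str)

    # Series.as_matrix() was removed from pandas; to_numpy() replaces it
    uniqueColumn = np.unique(column.to_numpy())
    for i in range(0, len(uniqueColumn)):
      self.columnMap = np.append(self.columnMap, uniqueColumn[i])

    characteristicVector = []

    for i in range(0, len(column)):
      value = column.iloc[i]  # positional access, independent of the index labels
      binary = np.zeros(len(self.columnMap), dtype=int)  # the np.int alias was removed from NumPy
      binary[self.columnMap.tolist().index(value)] = 1
      characteristicVector += [binary]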

    return np.array(characteristicVector)

  def decode(self, column):
    vector = []

    for i in range(0, len(column)):
      binary = column[i]
      index = binary.tolist().index(1)
      value = self.columnMap[index]
      vector += [value] 

    return np.array(vector)


In [0]:
ltr = 'Data/Letter recognition/letter-recognition.data'
dataSet = pd.read_csv(ltr, header=None)

We split the dataset randomly into training and test set, with 80/20 ratio.

In [0]:
indexes = np.arange(len(dataSet)).tolist()
indexesTrainingSet = random.sample(indexes, int(0.8*len(dataSet)))
trainingSet = dataSet.iloc[indexesTrainingSet]
testSet = dataSet.iloc[list(set(indexes) - set(indexesTrainingSet))]
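
The manual sampling above gives a different split on every run. As a hedged alternative, the sketch below uses `train_test_split` (already imported at the top of the notebook) with a fixed `random_state` so that the 80/20 split is reproducible. The DataFrame `demo` is a hypothetical stand-in for the real dataset, and the seed values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: a label column and two feature columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({0: rng.choice(list("ABC"), size=100),
                     1: rng.integers(0, 16, size=100),
                     2: rng.integers(0, 16, size=100)})

# test_size=0.2 gives the 80/20 ratio; random_state makes the split repeatable
train_part, test_part = train_test_split(demo, test_size=0.2, random_state=42)
print(len(train_part), len(test_part))
```

The row indices of `train_part` and `test_part` are disjoint and can be reused later to select the same rows from an encoded copy of the data.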

Write the training set and the test set to separate files, which makes encoding and decoding easier.

In [0]:
# Context managers make sure both files are flushed and closed after writing
with open('trainingSet.csv', "w", newline='') as training_file:
  writer = csv.writer(training_file, delimiter=',')
  writer.writerows(trainingSet.values)

with open('testSet.csv', "w", newline='') as test_file:
  writer = csv.writer(test_file, delimiter=',')
  writer.writerows(testSet.values)
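
For reference, pandas can write the same kind of headerless CSV directly. The sketch below is an assumption about an equivalent approach, not the code used for the results in this notebook; the frame `demo` and the file name `demoSet.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical three-row sample in the same layout: label first, then features
demo = pd.DataFrame([["T", 2, 8], ["I", 5, 12], ["D", 4, 11]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demoSet.csv")
    # header=False and index=False keep only the raw values, like csv.writer does
    demo.to_csv(path, header=False, index=False)
    restored = pd.read_csv(path, header=None)

print(restored)
```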

Encode the dataset using the CharacteristicVector class.

In [74]:
cv = CharacteristicVector()
characteristicVectorLabel = cv.encode(dataSet[0])  # encode the label
lengthCharacteristicVectorLabel = len(cv.columnMap)
cvFeatures = []
characteristicVectorFeatures = []
startPosition = lengthCharacteristicVectorLabel

for i in range(1, len(dataSet.columns)) :
  cvFeature = CharacteristicVector()
  columnDataSet = dataSet[i]
  characteristicVector = cvFeature.encode(columnDataSet)
  lengthCharacteristicVector = len(cvFeature.columnMap)

  cvFeatures += [[cvFeature, lengthCharacteristicVector, startPosition]]

  startPosition += lengthCharacteristicVector

  characteristicVectorFeatures += [characteristicVector]
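
scikit-learn also ships a ready-made encoder for this kind of per-column encoding. The sketch below shows `OneHotEncoder` on a hypothetical toy feature matrix; it is offered as a comparison point and is not the encoder used for the results in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature matrix: two integer-valued feature columns
features = np.array([[2, 8], [5, 12], [2, 11], [5, 8]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(features).toarray()
# inverse_transform maps the bit patterns back to the original values
decoded = encoder.inverse_transform(encoded)

print(encoder.categories_)
print(encoded.shape)
```

Here `categories_` plays a role similar to `columnMap`: it holds the unique values found in each column, and the encoded width is the sum of their counts.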

  


Save the fitted label encoder and the list of feature encoders to separate files for easy access.

In [75]:
joblib.dump(cv, 'cvLabel.pkl') 
joblib.dump(cvFeatures, 'cvFeatures.pkl') 

['cvFeatures.pkl']
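
A saved object can later be restored with `joblib.load`. The following is a minimal round-trip sketch; the dictionary `column_map` and the file name `demoEncoder.pkl` are hypothetical stand-ins for the encoder objects saved above.

```python
import os
import tempfile
import joblib
import numpy as np

# Hypothetical stand-in for an encoder object
column_map = {"classes": np.array(["A", "B", "C"])}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demoEncoder.pkl")
    written = joblib.dump(column_map, path)  # returns the list of files written
    restored = joblib.load(path)

print(restored["classes"])
```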

Write the encoded dataset to another file.

In [0]:
# A with-block guarantees the file is flushed and closed before it is read back
with open('dataSetEncoded.csv', "w") as file:
  for i in range(0, len(dataSet)):
    # Each row is the label bits followed by the bits of every feature
    label = characteristicVectorLabel[i]
    line = label
    for j in range(0, len(characteristicVectorFeatures)):
      column = characteristicVectorFeatures[j]
      line = np.concatenate((line, column[i]))
    file.write(','.join(map(str, line.tolist()))+'\n')
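
The row-by-row concatenation above can also be expressed with array operations. The sketch below is one possible vectorized form under the same layout (label bits first, then each feature's bits); the small arrays are hypothetical, and an in-memory buffer is used in place of a real file.

```python
import io
import numpy as np

# Hypothetical encoded blocks: 3 samples, a 2-bit label and two features
label_bits = np.array([[1, 0], [0, 1], [1, 0]])
feature_bits = [np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]]),
                np.array([[0, 1], [1, 0], [1, 0]])]

# Stack the label block and all feature blocks side by side
encoded_rows = np.hstack([label_bits] + feature_bits)

# np.savetxt writes one comma-separated line per row
buffer = io.StringIO()
np.savetxt(buffer, encoded_rows, fmt="%d", delimiter=",")
print(buffer.getvalue())
```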

Open the encoded dataset and split it into a training set and a test set, reusing the same 80/20 row indices as before.

In [0]:
dataSetEncoded = pd.read_csv("dataSetEncoded.csv", header=None)
trainingSetEncoded = dataSetEncoded.iloc[indexesTrainingSet]
testSetEncoded = dataSetEncoded.iloc[list(set(indexes) - set(indexesTrainingSet))]


In [78]:
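# Integer division keeps the column labels integral (in Python 3, / returns a float)
half = len(trainingSetEncoded.columns) // 2
# Rows are laid out as label bits followed by feature bits, so this split at the
# column midpoint need not coincide with the label/feature boundary
# (lengthCharacteristicVectorLabel)
X = trainingSetEncoded.drop(np.arange(half, len(trainingSetEncoded.columns)), axis=1)
y = trainingSetEncoded[np.arange(half, len(trainingSetEncoded.columns))]

# Note: this SVC instance is not used; the pipeline below builds its own SVC() with default settings
svm = SVC(kernel = 'poly', degree = 8)
steps = [('scaler', StandardScaler()), ('svm', SVC())]

pipeline = Pipeline(steps)
# The target is y.iloc[:,0], i.e. the single encoded column at position `half`
pipeline.fit(X,y.iloc[:,0])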

  
joblib.dump(pipeline, 'pipeline.pkl') 


['pipeline.pkl']
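
To show how a scaler and an SVM behave as a single estimator, here is a self-contained sketch on synthetic data. The dataset from `make_classification`, the RBF kernel, and all seeds are assumptions chosen for illustration; `cross_val_score` (imported at the top of the notebook) is used to get a second accuracy estimate alongside the held-out test score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem with 16 features (hypothetical stand-in data)
X_demo, y_demo = make_classification(n_samples=300, n_features=16, n_informative=8,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fitted on the training data only, then applied before the SVM
demo_pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])
demo_pipeline.fit(X_tr, y_tr)
test_accuracy = demo_pipeline.score(X_te, y_te)

# 5-fold cross-validation refits the whole pipeline on each fold
cv_scores = cross_val_score(demo_pipeline, X_tr, y_tr, cv=5)
print(test_accuracy, cv_scores.mean())
```

Because scaling happens inside the pipeline, each cross-validation fold is scaled using statistics from its own training portion only.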

In [81]:
half = len(testSetEncoded.columns) // 2  # integer division keeps the column labels integral
X_test = testSetEncoded.drop(np.arange(half, len(testSetEncoded.columns)), axis=1)
y_test = testSetEncoded[np.arange(half, len(testSetEncoded.columns))]

print("Accuracy for Encoded Test: " + str(pipeline.score(X_test,y_test.iloc[:,0])*100) + "%" )

Accuracy for Encoded Test: 99.4%
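
The encoded rows store the label bits first, followed by the feature bits. If the goal is to predict the letter itself from the encoded features, one possible way to separate the two blocks is to slice at the label width and turn the label bits back into a class index. In the sketch below, `n_label_bits` and `label_classes` are hypothetical stand-ins for `lengthCharacteristicVectorLabel` and `cv.columnMap`; this is not the split used for the accuracy reported above.

```python
import numpy as np

n_label_bits = 3                           # hypothetical label width
label_classes = np.array(["A", "B", "C"])  # hypothetical class list

# Hypothetical encoded rows: 3 label bits followed by 4 feature bits
encoded_rows = np.array([[0, 1, 0, 1, 0, 0, 1],
                         [1, 0, 0, 0, 1, 1, 0],
                         [0, 0, 1, 1, 0, 0, 1]])

y_bits = encoded_rows[:, :n_label_bits]   # label block
X_bits = encoded_rows[:, n_label_bits:]   # feature block

# argmax finds the position of the single 1 in each label row
y_index = y_bits.argmax(axis=1)
y_letters = label_classes[y_index]
print(y_letters)
```

An array such as `y_letters` (or `y_index`) can be passed as the target of a classifier, with `X_bits` as the input.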


Now let us run the support vector machine on the original (unencoded) dataset to compare the prediction accuracy.

In [82]:
X_train = trainingSet.iloc[:,1:17]  # the 16 feature columns
y_train = trainingSet.iloc[:,0]     # the letter label
X_test = testSet.iloc[:,1:17]
y_test = testSet.iloc[:,0]
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scale = scaler.transform(X_train)
X_test_scale = scaler.transform(X_test) 
svm = SVC(kernel = 'poly', degree=8)
# Note: the model is fit and evaluated on the unscaled features;
# X_train_scale and X_test_scale are computed but not used
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print("Accuracy:",accuracy_score(y_test, y_pred)*100)



Accuracy: 94.45


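As we can see, the SVM reports a much higher, almost perfect accuracy on the encoded dataset (99.4%) than on the original one (94.45%). The two numbers are not directly comparable, however: the encoded run uses the pipeline's default SVC and predicts a single binary column of the encoded data, while the second run predicts the letter label with a degree-8 polynomial kernel on unscaled features.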

# References
https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html