## Feature Extraction - Single Pixel

Our input data uses the characters `'#'` and `'+'` as pixel values, and we assign a weight of 1 to each of them. Assigning different weights to the two characters gave the same result, so we transform the input data into binary values: 1 for a pixel character and 0 for anything else.

In [7]:
pixelChars = ['#', '+']

def countPixels(self, line):
    count = 0

    # Count every character of the line that represents a pixel
    for char in line:
        if char in self.pixelChars:
            count += 1

    return count
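
The binary transformation described above can be sketched as a standalone function. This is a minimal illustration, not the project's code: `line_to_binary` and `PIXEL_CHARS` are hypothetical names, and the sample row is made up.

```python
PIXEL_CHARS = ('#', '+')  # characters treated as "on" pixels, as in the text above


def line_to_binary(line, pixel_chars=PIXEL_CHARS):
    """Map each character of one image row to 1 (pixel) or 0 (background)."""
    return [1 if char in pixel_chars else 0 for char in line.rstrip("\n")]


row = "  +##+  "
binary_row = line_to_binary(row)
print(binary_row)       # [0, 0, 1, 1, 1, 1, 0, 0]
print(sum(binary_row))  # 4
```

Summing the binary row gives the same count that `countPixels` returns for that line.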

## Feature Extraction - Grid of Pixels

We use successive grids of size n\*n as our features, which describe the input to our classifier better than single pixels. This improved accuracy by more than 10% for every classifier. For each grid of size n\*n, we sum the pixels valued 1 and store that total as the feature.
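
A minimal sketch of the grid feature, assuming the grids overlap and slide one pixel at a time over an image that has already been converted to 0/1 values. The function name `grid_features` and the 3x3 sample image are illustrative only.

```python
def grid_features(image, n):
    """Sum the binary pixels inside every n x n window of a 2-D list."""
    height, width = len(image), len(image[0])
    features = []
    for top in range(height - n + 1):
        for left in range(width - n + 1):
            total = sum(image[top + dr][left + dc]
                        for dr in range(n) for dc in range(n))
            features.append(total)
    return features


image = [
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print(grid_features(image, 2))  # [3, 3, 2, 2]
```

Under this assumption an image of height H and width W yields (H - n + 1) \* (W - n + 1) features, each with a value between 0 and n\*n.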

We observed the following error rate for varied grid sizes for our classifiers.

In [8]:
#TODO: classifier per grid size results (graph maybe)

## Dataset

We have **FACE** and **DIGIT** datasets. We have combined the Training and Validation Datasets for our Training. The Testing dataset is purely used for Testing. 

# Classifier Algorithms 

## 1. Perceptron 

*Perceptron* is a supervised learning algorithm where we train our algorithm from a *labeled* training dataset.

The weight matrix has the SHAPE shown below. Every LABEL has a weight array of length FEATURES + 1, where the + 1 is for the bias weight. The snippet below initializes all weights to zero.

In [9]:
def __init__(self, FEATURES, LABELS):
    self.SHAPE = (LABELS, FEATURES + 1)  # The +1 is for our w0 weight.
    self.weightMatrix = np.zeros(self.SHAPE)

### Training :

For training, we calculate the dot product between each label's weights and the feature values. We do this for all the labels, which gives one score per label.

In [10]:
def runModel(self, isTrain, featureValueList, actualLabel):
    featureScoreList = []  # one score per label
    for labelWeights in self.weightMatrix:
        featureScoreList.append(np.sum(np.dot(labelWeights, featureValueList)))

The label with the maximum score is our predicted label.

In [11]:
predictedLabel = np.argmax(featureScoreList)


### Weight update :
If the predicted and actual labels do not match, we update the weights: we add the feature values to the weights of the actual label and subtract them from the weights of the predicted label.

In [None]:
if predictedLabel != actualLabel:
    if isTrain:
        self.updateWeights(predictedLabel, actualLabel, featureValueList)
    else: return 1
else: return 0
    
def updateWeights(self, predictedLabel, actualLabel, featureValueList):
    self.weightMatrix[actualLabel] = self.weightMatrix[actualLabel] + featureValueList
    self.weightMatrix[predictedLabel] = self.weightMatrix[predictedLabel] - featureValueList
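
The scoring and update steps can be put together in a small self-contained sketch. It uses plain Python lists instead of NumPy so that it runs without dependencies, and it assumes the bias is handled by prepending a constant 1 to the features. `TinyPerceptron` and the two toy samples are hypothetical and not part of the project code.

```python
class TinyPerceptron:
    """Hypothetical multi-class perceptron with one weight row per label."""

    def __init__(self, num_features, num_labels):
        # The extra column holds the bias weight w0.
        self.weights = [[0.0] * (num_features + 1) for _ in range(num_labels)]

    def scores(self, features):
        x = [1.0] + list(features)  # constant 1 is the bias input (assumption)
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]

    def predict(self, features):
        s = self.scores(features)
        return s.index(max(s))  # first label with the highest score

    def train_step(self, features, actual_label):
        predicted = self.predict(features)
        if predicted != actual_label:
            x = [1.0] + list(features)
            for i, v in enumerate(x):
                self.weights[actual_label][i] += v  # pull towards the true label
                self.weights[predicted][i] -= v     # push away from the wrong one
        return predicted


model = TinyPerceptron(num_features=2, num_labels=2)
samples = [([0, 1], 0), ([1, 0], 1)]
for _ in range(5):  # a few passes over the toy data
    for features, label in samples:
        model.train_step(features, label)
print([model.predict(f) for f, _ in samples])  # [0, 1]
```

The weights change only on a misclassification, so once every training sample is predicted correctly further passes leave the model untouched.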

### Observations

In [None]:
# TODO: Command to run this classifier with arguments, and show their graphs

### **Advantages**
- Perceptrons learn the relationships and patterns in the dataset directly from labeled examples.
- The input can be of any type, as long as it is encoded as numeric features.
- Training a single-layer perceptron is very quick.
- It works well for image processing and character recognition.

### **Disadvantages**
- A single-layer perceptron cannot learn a problem whose solution is a non-linear function, that is, one whose classes are not linearly separable.
- A multi-layer perceptron takes more time to train.
- Optimization is difficult when there are many local minima/maxima.

## 2. Naive Bayes Classifier
*Naive Bayes Classifier* belongs to the family of probabilistic classifiers based on Bayes' Theorem. The formula for Bayes' theorem is $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. The algorithm makes two assumptions:
- The features are independent of each other, given the label.
- Every feature is equally important.
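
Under the independence assumption, the posterior for a label factorizes into a product of per-feature terms. The symbols $y$ (a label) and $f_1, \dots, f_n$ (the feature values of one image) are notation introduced here for illustration:

$$P(y \mid f_1, \dots, f_n) \propto P(y) \prod_{i=1}^{n} P(f_i \mid y)$$

Taking logarithms turns the product into a sum, which avoids numerical underflow when many small probabilities are multiplied:

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} \log P(f_i \mid y) \right]$$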

### Initialization
We use 2 maps to store the frequency counts, which are later converted into probabilities.

In [None]:
# Initialization of FMAP - FEATURES X LABELS X POSSIBLE_VALUES
for featureIndex in range(self.FEATURES):
    for labelIndex in range(self.LABELS):
        for possibleValueIndex in POSSIBLE_VALUES:
            self.FeatureMap[featureIndex][labelIndex][possibleValueIndex] = 0

# Initialization of LMAP - LABELS
for labelIndex in range(self.LABELS):
    self.LabelMap[labelIndex] = 0

### Training
We fill the 2 initialized maps with the feature frequencies from our input data and then calculate LOG(probability) from each frequency.

In [None]:
def constructLabelsProbability(self, trainingLabels):
    # Assumption: the total is the number of labels passed in
    totalDataset = len(trainingLabels)

    # Storing Frequency
    for label in trainingLabels: self.LabelMap[label] += 1

    # Calculating probability -> frequency/total -> LOG
    for key in self.LabelMap:
        probability = self.LabelMap[key] / totalDataset
        self.LabelMap[key] = probability

def constructFeaturesProbability(self, featureValueListForAllTrainingImages, actualLabelForTrainingList, POSSIBLE_VALUES):

    # TRAINING
    for label, featureValuesPerImage in zip(actualLabelForTrainingList, featureValueListForAllTrainingImages):
        for feature in range(0, self.FEATURES):
            self.FeatureMap[feature][label][featureValuesPerImage[feature]] += 1

    # Then converting frequencies to probabilities and then to their LOG

### Testing
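
A minimal sketch of how a test image could be scored, assuming both maps already hold log-probabilities and are indexed in the same order as `FeatureMap` (feature, then label, then value). The function `predict_label` and the toy model are hypothetical and only illustrate the idea.

```python
import math


def predict_label(feature_values, log_prior, log_likelihood):
    """Return the label with the highest log-posterior score.

    log_prior[label]                      -> log P(label)
    log_likelihood[feature][label][value] -> log P(feature = value | label)
    """
    best_label, best_score = None, -math.inf
    for label, prior in log_prior.items():
        score = prior
        for feature, value in enumerate(feature_values):
            score += log_likelihood[feature][label][value]
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy model: one binary feature, two labels.
log_prior = {0: math.log(0.5), 1: math.log(0.5)}
log_likelihood = [
    {0: {0: math.log(0.9), 1: math.log(0.1)},
     1: {0: math.log(0.2), 1: math.log(0.8)}},
]
print(predict_label([1], log_prior, log_likelihood))  # 1
print(predict_label([0], log_prior, log_likelihood))  # 0
```

Comparing the returned label with the actual label gives the 0/1 error value that the main loop accumulates per test image.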

### Laplace Smoothing
In real life, we do not want the probability of any term to be 0. However, if the algorithm never sees $A$ and $B$ occur together during training, it would estimate $P(A|B) = 0$. This is not acceptable for real-world predictions, so we use *Laplace Smoothing* to avoid zero probabilities.
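
A minimal sketch of the usual add-k form of Laplace smoothing, assuming it is applied to the per-label frequency counts. The function and parameter names are illustrative.

```python
def smoothed_probability(count, total, k, num_values):
    """Laplace (add-k) estimate of P(feature = value | label).

    count:      training images with this label and this feature value
    total:      training images with this label
    k:          smoothing constant
    num_values: number of distinct values the feature can take
    """
    return (count + k) / (total + k * num_values)


# A value never seen with this label still gets a small non-zero probability.
print(smoothed_probability(count=0, total=8, k=1, num_values=2))  # 0.1
print(smoothed_probability(count=8, total=8, k=1, num_values=2))  # 0.9
```

Larger values of k pull all estimates towards the uniform distribution, which is why the choice of k is worth tuning.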

We experimented with the following smoothing values and recorded the corresponding results.

In [None]:
kgrid = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 20, 50]

# print graphs with the above results

### Odds Ratio
We use the *Odds Ratio* to see which features the algorithm has learned to rely on. For every feature and every pair of classes, it measures how strongly that feature increases the belief in one class over the other.
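
One common way to compute this, shown as a sketch under the assumption that each feature is treated as on/off and that the probabilities are already smoothed (so none is zero), is the ratio $P(F_i = 1 \mid c_1) / P(F_i = 1 \mid c_2)$ for a pair of classes $c_1$ and $c_2$. The function name and sample values are illustrative.

```python
def odds_ratios(prob_given_c1, prob_given_c2):
    """Per-feature ratio P(F_i = 1 | c1) / P(F_i = 1 | c2).

    Both arguments are lists of smoothed (non-zero) probabilities,
    one entry per feature.
    """
    return [p1 / p2 for p1, p2 in zip(prob_given_c1, prob_given_c2)]


p_c1 = [0.8, 0.5, 0.1]
p_c2 = [0.2, 0.5, 0.4]
ratios = odds_ratios(p_c1, p_c2)
print(ratios)  # [4.0, 1.0, 0.25]
```

A ratio above 1 means the feature favors the first class, a ratio below 1 favors the second, and a ratio near 1 means the feature does not help to separate the pair.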

### Observations

In [None]:
print # TODO: Add all the observation graph here

**Advantages**
- It is easy to implement.
- It requires only a small amount of training data to establish the beliefs.
- It is less sensitive to missing data.
- Training and testing are both fast.

**Disadvantages**
- We need prior probabilities.
- It assumes that all the features are independent, which rarely holds in practice.
- If a test sample contains a feature value that was not seen during training, its probability would be 0 unless smoothing is applied.

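## 3. K-Nearest Neighbors (KNN)

The *KNN* algorithm classifies a test image by finding the `num_neighbors` training images closest to it, measured by Euclidean distance, and predicting the most frequent label among them.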

**Advantages**
- No training period. All the computation happens when classifying the test data.
- Can accept new data flexibly.
- Easy to implement.

**Disadvantages**
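- Prediction is slow with large datasets, since every test sample is compared against every stored training sample.
- We need to standardize the inputs so that all features are on a comparable scale for appropriate predictions.
- If some data is missing or the dataset contains errors, it can give wrong predictions.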

In [None]:
from utility import euclidean_distance, most_frequent


class KNN:
    def __init__(self, num_neighbors):
        self.num_neighbors = num_neighbors
        self.trainX = []
        self.trainY = []

    def test(self, test_row, actualLabel):
        neighbors = self.get_neighbors(test_row)
        y_pred = most_frequent(neighbors)

        if y_pred == actualLabel:
            return 0
        else:
            return 1

    def storeTrainingSet(self, x, y):
        self.trainX += x
        self.trainY += y

    # Locate the most similar neighbors
    def get_neighbors(self, test_row):
        distances = list()
        zippedTrainingData = zip(self.trainX, self.trainY)

        for train_row in zippedTrainingData:
            features = train_row[0]
            label = train_row[1]
            dist = euclidean_distance(test_row, features)
            distances.append((features, label, dist))

        distances.sort(key=lambda tup: tup[2]) # Sort according to the dist
        neighbors = list()
        for i in range(self.num_neighbors):
            neighbors.append(distances[i][1])
        return neighbors
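
The `utility` module imported by the class above is not shown here. The following is a sketch of what its two helpers might look like, assuming feature rows are equal-length lists of numbers; the actual implementations may differ.

```python
import math
from collections import Counter


def euclidean_distance(row1, row2):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))


def most_frequent(labels):
    """Most common label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(most_frequent([7, 1, 7, 3]))         # 7
```

With these helpers, `get_neighbors` returns the labels of the closest training rows and `most_frequent` turns them into a single prediction by majority vote.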



We have created a separate class to open the files and iterate through their data. The data from the files is stored as a list of lists, which is then fed to different functions according to their needs.

In [None]:
class Samples:

    def __init__(self, DATA_DIR):
        self.TestFileObject = None
        self.TestLabelFileObject = None
        self.TrainFileObject = None
        self.TrainLabelFileObject = None
        self.ValidationFileObject = None
        self.ValidationLabelFileObject = None

        self.test_lines_itr = None
        self.test_labelsLines_itr = None
        self.train_lines_itr = None
        self.train_labelsLines_itr = None
        self.validate_lines_itr = None
        self.validate_labelsLines_itr = None


        self.trainingFileName = DATA_DIR + "/trainingimages"
        self.trainingLabelFileName = DATA_DIR + "/traininglabels"
        self.testFileName = DATA_DIR + "/testimages"
        self.testLabelFileName = DATA_DIR + "/testlabels"
        self.validationFileName = DATA_DIR + "/validationimages"
        self.validationLabelFileName = DATA_DIR + "/validationlabels"

        TEST = "TEST"
        TRAIN = "TRAIN"
        VALIDATION = "VALIDATION"

    # def open_many_files(self,file_name):
    #     with open(self.name, 'r') as f:
    #         c = 0
    #         for l in f:
    #             c+=1
    #         l=c+1-self.k_value
    #         for i in range(0,l):
    #             lines = [line for line in f][:self.k_value]
    #             object=lines
    #         return object


    def closeFiles(self):
        self.TestFileObject.close()
        self.TestLabelFileObject.close()
        self.TrainFileObject.close()
        self.TrainLabelFileObject.close()
        self.ValidationFileObject.close()
        self.ValidationLabelFileObject.close()

    def initTestIters(self):
        self.TestFileObject.close()
        self.TestLabelFileObject.close()
        self.TestFileObject = open(self.testFileName)
        self.TestLabelFileObject = open(self.testLabelFileName)
        self.test_lines_itr = iter(self.TestFileObject.readlines())
        self.test_labelsLines_itr = iter(self.TestLabelFileObject.readlines())

    def initValidateIters(self):
        self.validate_lines_itr = iter(self.ValidationFileObject.readlines())
        self.validate_labelsLines_itr = iter(self.ValidationLabelFileObject.readlines())

    def readFiles(self):
        self.TrainFileObject = open(self.trainingFileName)
        self.TrainLabelFileObject = open(self.trainingLabelFileName)
        self.TestFileObject = open(self.testFileName)
        self.TestLabelFileObject = open(self.testLabelFileName)
        self.ValidationFileObject = open(self.validationFileName)
        self.ValidationLabelFileObject = open(self.validationLabelFileName)

        self.train_lines_itr = iter(self.TrainFileObject.readlines())
        self.train_labelsLines_itr = iter(self.TrainLabelFileObject.readlines())

        self.test_lines_itr = iter(self.TestFileObject.readlines())
        self.test_labelsLines_itr = iter(self.TestLabelFileObject.readlines())

        self.initValidateIters()

This class is used to plot the error data in the form of graphs. Its `graphplot` method takes four arguments:
- `dataset`: the training-set sizes that we plot on the $x$ axis.
- `errorRateList`: the list of error rates for all three algorithms.
- `type`: the names of the algorithms, used as curve labels so that we can tell which curve belongs to which algorithm.
- `method`: the type of the dataset ("FACE" or "DIGIT"), used as the title of the graph.

In [None]:
import matplotlib.pyplot as plt
from matplotlib import pyplot


class Error:

    def graphplot(self, dataset, errorRateList, type, method):
        for i in range(len(errorRateList)):
            plt.plot(dataset, errorRateList[i], label=type[i])
            plt.xlim(0, dataset[-1] + dataset[-1]/10)
            plt.ylim(0, 100)

        for i in range(len(errorRateList)):
            for data, errorRate in zip(dataset, errorRateList[i]):
                pyplot.text(data, errorRate, str(int(errorRate)))

        plt.title(method)
        plt.legend()
        plt.show()

This is the main function of the code. All other classes are imported in this file and are accessed here as required.

In [None]:
import math
import time
import statistics
from knn import KNN
import matplotlib.pyplot as plt
from naivebyes import NaiveBayesClassifier
from perceptron import PerceptronClassifier
from error_plot import Error
from samples import Samples


def mean_standard_deviation(errorRate, name):
    if len(errorRate) > 1:
        mean = statistics.mean(errorRate)
        standard_deviation = statistics.stdev(errorRate)
        print(name, " mean = ", mean, " and Standard Deviation = ", standard_deviation)
        return mean


class DataClassifier:
    def __init__(self, imgHeight, imgWidth, LABELS, pixelChars, pixelGrid):
        if pixelChars is None:
            pixelChars = ['#', '+']
        self.pixelGrid = pixelGrid
        self.imgHeight = imgHeight
        self.imgWidth = imgWidth
        self.FEATURES = math.ceil((imgHeight - self.pixelGrid + 1) * (imgWidth - self.pixelGrid + 1))
        self.LABELS = LABELS
        self.pixelChars = pixelChars
        self.FileObject = None
        self.LabelFileObject = None

    def countPixels(self, line):
        count = 0
        if not isinstance(line, list):
            line = list(line)

        for char in line:
            if char in self.pixelChars:
                count += 1

        return count

    def extractFeaturesPerLine(self, line, row):
        gridList = []
        featureValueList = []

        for startIndexOfGrid in range(0, len(line), 1):
            gridList.append(line[startIndexOfGrid:startIndexOfGrid + 1])

        # col = 0
        for grid in gridList:
            # Count the number of chars in this grid and add the count to respective index of FEATURE
            # indexOfFeature = row + col
            featureValueList.append(self.countPixels(grid))

        return featureValueList

    def splitImageLineFeaturesIntoGridFeatures(self, imageLinesList, gridSize):
        height_rows = self.imgHeight + 1 - gridSize
        width_rows = self.imgWidth + 1 - gridSize
        height_new_list = []

        for rowIndex in range(0, self.imgHeight):
            line = imageLinesList[rowIndex]
            width_new_list = []
            for gridStartIndex in range(0, width_rows):
                width_new_list.append(sum(line[gridStartIndex: gridStartIndex + gridSize]))
            height_new_list.append(width_new_list)

        featureListForImage = []
        for rowIndex in range(0, height_rows):
            for column in range(0, width_rows):
                sum1 = 0
                for rows in range(0, gridSize):
                    sum1 += height_new_list[rowIndex + rows][column]
                featureListForImage.append(sum1)

        return featureListForImage

    def extractFeatures(self, lines_itr, labelsLines_itr):
        imageLine = lines_itr.__next__()

        totalImages = 0
        featureValueListForAllTestingImages = []
        actualLabelList = []

        try:
            while imageLine:
                # Skipping the blank lines
                while imageLine and self.countPixels(imageLine) == 0:
                    imageLine = lines_itr.__next__()

                imageLinesList = []
                # Scanning image pixels
                for i in range(0, self.imgHeight):
                    imageLinesList.append(self.extractFeaturesPerLine(imageLine, i))
                    # print(featureValueList)
                    imageLine = lines_itr.__next__()

                # Use the grid size stored on the instance instead of relying on a global
                featureValueListPerImage = self.splitImageLineFeaturesIntoGridFeatures(imageLinesList, self.pixelGrid)

                totalImages += 1
                actualLabel = labelsLines_itr.__next__()

                featureValueListForAllTestingImages.append(featureValueListPerImage)
                actualLabelList.append(int(actualLabel))

        except StopIteration:
            # print("End of File")
            pass

        return featureValueListForAllTestingImages, actualLabelList


def error(errorPrediction, total):
    errorRate = (errorPrediction * 100) / total
    print("Error is", errorPrediction, "out of Total of ", total, " : ", errorRate)
    return errorRate


FACE = "FACE"
DIGIT = "DIGIT"
DIR = "DIR"
HEIGHT = "HEIGHT"
WIDTH = "WIDTH"
LABEL = "LABEL"
PIXELS = "PIXELS"

if __name__ == '__main__':
    print("TRAINING OUR MODEL FIRST")
    PERCENT_INCREMENT = 10

    perceptron_y = []
    bayes_y = []
    knn_y = []
    dataSetIncrements = []
    perceptron_time = []
    bayes_time = []
    knn_time = []
    perceptron_msd=[]
    bayes_msd=[]
    knn_msd=[]

    inp = input("Type FACE or DIGIT")
    gridSize = int(input("Value of Grid"))
    POSSIBLE_VALUES = [x for x in range(0, gridSize * gridSize + 1)]

    map = {
        FACE: {
            DIR: 'data/facedata', HEIGHT: 68, WIDTH: 61, LABEL: 2, PIXELS: None
        },
        DIGIT: {
            DIR: 'data/digitdata', HEIGHT: 20, WIDTH: 29, LABEL: 10, PIXELS: None
        }
    }

    dataType = map.get(inp)
    samples = Samples(dataType.get(DIR))

    dataClassifier = DataClassifier(dataType.get(HEIGHT), dataType.get(WIDTH), dataType.get(LABEL),
                                    dataType.get(PIXELS), gridSize)
    perceptronClassifier = PerceptronClassifier(dataClassifier.FEATURES, dataClassifier.LABELS)

    samples.readFiles()

    # Extracting Features from the Training Data
    dataset = 0
    featureValueListForAllTrainingImages, actualLabelForTrainingList = \
        dataClassifier.extractFeatures(samples.train_lines_itr, samples.train_labelsLines_itr)

    TOTALDATASET = len(actualLabelForTrainingList)
    INCREMENTS = int(TOTALDATASET * PERCENT_INCREMENT / 100)
    PERCEPTRON_TIME = {}

    # Initialization of Classifiers
    perceptronClassifier = PerceptronClassifier(dataClassifier.FEATURES, dataClassifier.LABELS)
    naiveBayesClassifier = NaiveBayesClassifier(dataClassifier.FEATURES, dataClassifier.LABELS, POSSIBLE_VALUES)
    KNNClassifier = KNN(num_neighbors=20)

    featureValueListForAllTestingImages = actualTestingLabelList = []
    while dataset < TOTALDATASET:

        featureValueList_currentTrainingImages = featureValueListForAllTrainingImages[dataset:dataset + INCREMENTS]
        actualLabel_currentTrainingImages = actualLabelForTrainingList[dataset:dataset + INCREMENTS]

        print("\n\n\n\n\n Training ON {0} to {1} data".format(dataset, dataset + INCREMENTS))
        ImageFeatureLabelZipList = zip(featureValueList_currentTrainingImages, actualLabel_currentTrainingImages)

        startTimer = time.time()
        ''' ####################  TRAINING PHASE FOR PERCEPTRON ############# '''
        for featureValueListPerImage, actualLabel in ImageFeatureLabelZipList:
            perceptronClassifier.runModel(True, featureValueListPerImage, actualLabel)
        endTimer = time.time()

        perceptron_time.append(endTimer - startTimer)

        startTimer = time.time()
        ''' ####################  TRAINING PHASE FOR NAIVE BYES ############# '''
        naiveBayesClassifier.constructLabelsProbability(actualLabel_currentTrainingImages)
        naiveBayesClassifier.constructFeaturesProbability(featureValueList_currentTrainingImages,
                                                          actualLabel_currentTrainingImages,
                                                          POSSIBLE_VALUES)
        endTimer = time.time()

        bayes_time.append(endTimer - startTimer)

        ''' ################## NO TRAINING PHASE FOR KNN #################  '''
        KNNClassifier.storeTrainingSet(featureValueList_currentTrainingImages, actualLabel_currentTrainingImages)
        ''' SIMPLY STORING FOR KNN '''

        ''' ####################  TESTING PHASE ############# '''
        samples.initTestIters()

        print("TESTING our model that is TRAINED ON {0} to {1} data".format(0, dataset + INCREMENTS))

        perceptron_errorPrediction = naiveByes_errorPrediction = knn_errorPrediction = total = 0
        featureValueListForAllTestingImages, actualTestingLabelList = \
            dataClassifier.extractFeatures(samples.test_lines_itr, samples.test_labelsLines_itr)

        for featureValueListPerImage, actualLabel in zip(featureValueListForAllTestingImages, actualTestingLabelList):
            perceptron_errorPrediction += perceptronClassifier.runModel(False, featureValueListPerImage, actualLabel)
            naiveByes_errorPrediction += naiveBayesClassifier.testModel(featureValueListPerImage, actualLabel)

            ''' ####################  TESTING PHASE FOR KNN ############# '''
            startTimer = time.time()

            knn_errorPrediction += KNNClassifier.test(featureValueListPerImage, actualLabel)

            endTimer = time.time()
            knn_time.append(endTimer - startTimer)
            ''' ####################  TESTING PHASE OVER FOR KNN ############# '''

            total += 1

        perceptron_error = error(perceptron_errorPrediction, total)
        bayes_error = error(naiveByes_errorPrediction, total)
        knn_error = error(knn_errorPrediction, total)
        perceptron_msd.append(perceptron_error)
        bayes_msd.append(bayes_error)
        knn_msd.append(knn_error)

        perceptron_msd_graph = mean_standard_deviation(perceptron_msd,"Perceptron")
        bayes_msd_graph = mean_standard_deviation(bayes_msd,"Bayes")
        knn_msd_graph = mean_standard_deviation(knn_msd,"KNN")

        dataset += INCREMENTS

        dataSetIncrements.append(dataset)
        perceptron_y.append(perceptron_error)
        bayes_y.append(bayes_error)
        knn_y.append(knn_error)

    final_array = {
        1: [perceptron_y, bayes_y, knn_y],
        2: ["Perceptron", "Bayes", "KNN"]
    }

    final_array2 = {
        1: [perceptron_time, bayes_time, knn_time],
        2: ["Perceptron", "Bayes", "KNN"]
    }

    final_array3 = {
        1: [perceptron_msd_graph, bayes_msd_graph, knn_msd_graph],
        2: ["Perceptron", "Bayes", "KNN"]
    }

    error = Error()
    error.graphplot(dataSetIncrements, final_array.get(1), final_array.get(2), inp) #For error plotting
    # error.graphplot(dataSetIncrements, final_array2.get(1), final_array2.get(2), inp) #For time
#     error.graphplot(dataSetIncrements, final_array3.get(1), final_array3.get(2), inp) #For mean
    # error.graphplot(dataSetIncrements, final_array3.get(1)[1], final_array3.get(2), inp) #For Standard Deviation

    samples.closeFiles()

