## Perceptron 

The *perceptron* is a supervised learning algorithm for binary classification. It is supervised because it learns from a labeled training dataset and is then evaluated on separate test data. Its decision boundary is a linear function of the input features, and a perceptron is simply a single-layer neural network. Stacking layers of perceptrons, each responsible for a different part of the computation, yields a multi-layer perceptron. In the early days, many problems could not be expressed as a linear function, so perceptrons were not considered a serious approach to artificial intelligence. When people experimented with adding more layers, they found that problems that are not linearly separable could be solved. One example is the **XOR** problem: it cannot be solved by a single-layer perceptron because its outputs are not linearly separable, but a multi-layer perceptron produces the required result. This helped bring perceptron-based networks back into artificial intelligence research.

In this algorithm, we first define a feature function that extracts features from each image. *In our implementation*, the user chooses the grid size used to compute the features. In general, accuracy is higher when the grid is at least 3\*3. The user also chooses the dataset to run the algorithms on by entering **FACE** or **DIGIT** for the face or digit dataset, respectively.
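
The text does not spell out what the feature function computes, so the following is only a sketch of one plausible grid-based feature function. It assumes the image is a 2D array and that each feature is the number of non-zero pixels in one grid cell; the name `grid_features` is hypothetical and is not taken from the project.

```python
import numpy as np


def grid_features(image, grid_size):
    """Count the non-zero pixels in each cell of a grid_size x grid_size grid.

    Assumption: this counting rule is illustrative, not the project's own.
    """
    image = np.asarray(image)
    features = []
    # array_split tolerates image sizes that are not a multiple of grid_size.
    for row_block in np.array_split(image, grid_size, axis=0):
        for cell in np.array_split(row_block, grid_size, axis=1):
            features.append(int(np.count_nonzero(cell)))
    return features


image = np.zeros((6, 6), dtype=int)
image[0:2, 0:2] = 1   # fill the top-left cell
image[4, 5] = 1       # one pixel in the bottom-right cell
features = grid_features(image, 3)
print(features)
```

A 3\*3 grid yields 9 features per image; a larger grid gives finer spatial detail at the cost of more weights to learn.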

Initially, we set the weights for the features of each label (the implementation below starts from zeros) and compute the score of the feature vector for each class (for digits, 6 or 9; for faces, face or no face). Mathematically, $score(f,y) = \sum_i f_i w_i^y$. That is, the score of feature vector *f* for a particular class *y* is the sum of each feature multiplied by the corresponding weight of that class.
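
As a small illustration of the score formula, the sketch below computes $score(f,y)$ for every label at once with a matrix-vector product. The weights and feature values are made up for the example.

```python
import numpy as np

# Hypothetical weights: one row per label, columns are [w0 (bias), w1, w2, w3].
weights = np.array([
    [0.5, 1.0, -1.0, 0.0],   # label 0
    [0.0, -1.0, 2.0, 1.0],   # label 1
])
features = [2, 1, 3]

# Prepend a constant 1 so the bias weight w0 is included in the sum.
f = np.array([1] + features)

# score(f, y) = sum_i f_i * w_i^y, computed for every label at once.
scores = weights @ f
predicted_label = int(np.argmax(scores))
print(scores, predicted_label)
```

Here label 0 scores 1.5 and label 1 scores 3.0, so label 1 would be predicted.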

To predict, we take the label with the highest score: $y^{*} = \arg\max_{y} score(f,y)$. We then compare the prediction with the actual label and update the weights only when the prediction is wrong. For example, if the actual label is *6* and we predict *9*, we decrease the weights for label 9 by $w^9 = w^9 - f$ and increase the weights for label 6 by $w^6 = w^6 + f$, where *f* is the feature vector of the misclassified example. In this way, the perceptron tunes its weights toward the actual labels during training so that it predicts accurately on the test data.
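
The update rule can be traced on a tiny example. This is a minimal sketch with a hypothetical helper `perceptron_step` and made-up feature values; it starts from zero weights and shows the same example being misclassified once and then classified correctly.

```python
import numpy as np


def perceptron_step(weights, features, actual_label):
    """Run one training step and return the predicted label.

    `weights` has one row per label and is updated in place on a mistake.
    """
    f = np.array([1] + list(features))        # leading 1 pairs with the bias weight
    predicted_label = int(np.argmax(weights @ f))
    if predicted_label != actual_label:
        weights[actual_label] += f            # pull the correct label's score up
        weights[predicted_label] -= f         # push the wrong label's score down
    return predicted_label


weights = np.zeros((2, 3))                    # 2 labels, 2 features + bias
first = perceptron_step(weights, [1, 2], actual_label=1)
second = perceptron_step(weights, [1, 2], actual_label=1)
print(first, second)
print(weights)
```

With all-zero weights every score ties, so `np.argmax` picks label 0 and the first step triggers an update; after it, the same example scores higher for label 1.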

**Advantages**
- Perceptrons can learn the complex relationships and patterns in a dataset by themselves.
- The inputs are not restricted to one datatype, as long as they can be encoded as numeric features.
- With a single layer of perceptrons, training is very quick.
- They can be accurate for image processing and character recognition.

**Disadvantages**
- A single-layer perceptron cannot learn a problem whose solution is a non-linear function.
- A multi-layer perceptron takes more time to train.
- Optimization is difficult when there are many local minima/maxima.
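
The first disadvantage can be checked on the XOR problem. The sketch below is an illustration only: it uses a single binary perceptron with a threshold at 0 (a simpler variant than the multi-label classifier in this notebook) and compares it on AND, which is linearly separable, and XOR, which is not. The function name and the epoch count are assumptions.

```python
import numpy as np


def train_and_score(inputs, targets, epochs=50):
    """Train a single binary perceptron and return its training accuracy."""
    weights = np.zeros(inputs.shape[1] + 1)                  # bias + one weight per input
    rows = np.hstack([np.ones((len(inputs), 1)), inputs])    # prepend the bias input
    for _ in range(epochs):
        for x, target in zip(rows, targets):
            prediction = int(weights @ x > 0)
            weights += (target - prediction) * x             # no change when correct
    predictions = (rows @ weights > 0).astype(int)
    return float(np.mean(predictions == targets))


inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_accuracy = train_and_score(inputs, np.array([0, 0, 0, 1]))
xor_accuracy = train_and_score(inputs, np.array([0, 1, 1, 0]))
print(and_accuracy, xor_accuracy)
```

The perceptron fits AND perfectly, but no single line separates the XOR outputs, so its accuracy on XOR stays below 1 however long it trains.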

In [8]:
import numpy as np


class PerceptronClassifier:
    def __init__(self, FEATURES, LABELS):
        self.SHAPE = (LABELS, FEATURES + 1)  # The +1 is for our w0 weight.
        self.weightMatrix = np.zeros(self.SHAPE)

    def updateWeights(self, predictedLabel, actualLabel, featureValueList):
        # Move the actual label's weights toward the features and the
        # wrongly predicted label's weights away from them.
        self.weightMatrix[actualLabel] = self.weightMatrix[actualLabel] + featureValueList
        self.weightMatrix[predictedLabel] = self.weightMatrix[predictedLabel] - featureValueList

    def runModel(self, isTrain, featureValueList, actualLabel):
        # Prepend a constant 1 so that w0 acts as the bias weight.
        featureValueList = [1] + list(featureValueList)
        # One score per label: the dot product of that label's weights with the features.
        featureScoreList = [np.dot(labelWeights, featureValueList) for labelWeights in self.weightMatrix]
        predictedLabel = np.argmax(featureScoreList)

        if predictedLabel == actualLabel:
            return 0
        # Misclassified: adjust the weights only while training.
        if isTrain:
            self.updateWeights(predictedLabel, actualLabel, featureValueList)
        # Return 1 for a misclassification in both modes (training used to return None).
        return 1

    def initWeightMatrix(self):
        self.weightMatrix = np.zeros(self.SHAPE)  # Reset all weights to zero



## Naive Bayes Classifier
The *Naive Bayes classifier* belongs to the family of probabilistic classifiers based on Bayes' theorem, $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. The algorithm makes two assumptions:
- The features are independent of one another given the label.
- Every feature is equally important.

For a label *X*, naive Bayes models the joint distribution of the label and its features $(f_1, f_2, ..., f_n)$ using the formula above. The joint probability is $P(f_1, f_2, ..., f_n, X) = P(X) \prod_i P(f_i|X)$. As in the perceptron algorithm, we then take the argmax, this time over the probabilities of the labels, and predict the label for the given inputs.
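
The joint-probability formula and the argmax can be sketched as follows. All probabilities and label names here are made up for illustration; logs are summed instead of multiplying probabilities, which avoids numerical underflow when there are many features.

```python
import math

# Hypothetical probabilities for two labels and two binary features.
prior = {"face": 0.5, "not_face": 0.5}
likelihood = {
    "face":     [{0: 0.2, 1: 0.8}, {0: 0.4, 1: 0.6}],   # P(f_i = value | face)
    "not_face": [{0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5}],   # P(f_i = value | not_face)
}


def log_joint(features, label):
    """log P(f_1, ..., f_n, X) = log P(X) + sum_i log P(f_i | X)."""
    total = math.log(prior[label])
    for index, value in enumerate(features):
        total += math.log(likelihood[label][index][value])
    return total


features = [1, 1]
scores = {label: log_joint(features, label) for label in prior}
predicted = max(scores, key=scores.get)
print(predicted)
```

For these numbers the joint probability is 0.5 × 0.8 × 0.6 = 0.24 for `face` and 0.5 × 0.3 × 0.5 = 0.075 for `not_face`, so `face` is predicted.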

Smoothing is another part of prediction. In practice, we do not want any probability to be exactly 0. However, if the algorithm never sees $A$ together with $B$ during training, the formula gives $P(A|B) = 0$, which wipes out the whole product. This is not acceptable for real-world predictions, so we use *Laplace smoothing* to avoid zero probabilities. We use a smoothing value of 0.001 so that the results change very little.
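
The effect of smoothing can be seen on a single feature. This sketch uses a hypothetical helper `smoothed_probabilities` and made-up counts; it adds the smoothing value $k$ to every count before normalizing.

```python
def smoothed_probabilities(counts, k):
    """Add k to every count before normalizing, so no probability is 0 when k > 0."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (count + k) / total for value, count in counts.items()}


# Hypothetical counts of one feature's values among the images of one label.
counts = {0: 10, 1: 0}
unsmoothed = smoothed_probabilities(counts, k=0)
smoothed = smoothed_probabilities(counts, k=0.001)
print(unsmoothed[1], smoothed[1])
```

With $k = 0$ the unseen value gets probability 0; with $k = 0.001$ it gets a small positive probability while the other probabilities barely change.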

To understand what the algorithm has learned, we use the *odds ratio*. For every feature and every pair of classes, the ratio of the feature's probabilities under the two classes tells us whether that feature increases the belief in one class or the other.
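
As a rough sketch of the odds ratio for one pair of classes, assuming binary features and made-up conditional probabilities (the notebook's classifier does not expose this computation):

```python
# Hypothetical values of P(f_i = 1 | label) for three features.
p_given_six = [0.9, 0.5, 0.1]
p_given_nine = [0.3, 0.5, 0.4]

# odds(f_i) = P(f_i = 1 | 6) / P(f_i = 1 | 9)
odds = [six / nine for six, nine in zip(p_given_six, p_given_nine)]

# Rank the features from most "6-like" to most "9-like".
ranking = sorted(range(len(odds)), key=lambda i: odds[i], reverse=True)
print(ranking)
```

A ratio above 1 increases the belief in the first class, a ratio below 1 increases the belief in the second, and a ratio of exactly 1 means the feature does not help to tell the two apart.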

**Advantages**
- It is easy to implement.
- It requires only a small amount of data to establish its beliefs.
- It is less sensitive to missing data.
- Training and testing are both fast.

**Disadvantages**
- It needs prior probabilities.
- It assumes that all features are independent, which often does not hold for real data.
- Without smoothing, a feature value in the test data that was never seen during training gets probability 0.


In [9]:
import math
import numpy as np


class NaiveBayesClassifier:

    P_A_GIVEN_B = 'P(A|B)'
    P_B_GIVEN_A = 'P(B|A)'
    P_A = 'P(A)'
    P_B = 'P(B)'

    # Candidate smoothing values
    kgrid = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 20, 50]

    def __init__(self, FEATURES, LABELS, POSSIBLE_VALUES, k_value):
        self.LabelMap = {}
        self.FeatureMap = {}
        self.FEATURES = FEATURES
        self.LABELS = LABELS
        self.K = k_value

        # Initialization of FMAP - FEATURES X LABELS X POSSIBLE_VALUES
        for featureIndex in range(self.FEATURES):
            self.FeatureMap[featureIndex] = {}
            for labelIndex in range(self.LABELS):
                self.FeatureMap[featureIndex][labelIndex] = {}
                for possibleValueIndex in POSSIBLE_VALUES:
                    self.FeatureMap[featureIndex][labelIndex][possibleValueIndex] = 0

        # Initialization
        for labelIndex in range(0, self.LABELS):
            self.LabelMap[labelIndex] = 0

    def P_A_given_B(self, probabilities):
        # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
        result = (probabilities.get(NaiveBayesClassifier.P_B_GIVEN_A) * probabilities.get(NaiveBayesClassifier.P_A)) \
                 / probabilities.get(NaiveBayesClassifier.P_B)
        return result

    # Constructing Labels probability
    # PRIOR DISTRIBUTION OVER LABELS #
    def constructLabelsProbability(self, trainingLabels):
        totalDataset = len(trainingLabels)

        # Storing Frequency
        for label in trainingLabels:
            self.LabelMap[label] += 1

        # Calculating probability -> frequency/total (the log is taken at prediction time)
        for key in self.LabelMap:
            probability = self.LabelMap[key] / totalDataset
            self.LabelMap[key] = probability

    def constructFeaturesProbability(self, featureValueListForAllTrainingImages, actualLabelForTrainingList, POSSIBLE_VALUES):

        # TRAINING
        for label, featureValuesPerImage in zip(actualLabelForTrainingList, featureValueListForAllTrainingImages):
            for feature in range(0, self.FEATURES):
                self.FeatureMap[feature][label][featureValuesPerImage[feature]] += 1

        # Converting frequencies to smoothed probabilities (the log is taken at prediction time)
        for featureIndex in range(self.FEATURES):
            for labelIndex in range(self.LABELS):
                valueCounts = self.FeatureMap[featureIndex][labelIndex]
                # Add K to every count so that no probability is exactly 0.
                total = 0
                for possibleValueIndex in POSSIBLE_VALUES:
                    total += valueCounts[possibleValueIndex] + self.K
                for possibleValueIndex in POSSIBLE_VALUES:
                    valueCounts[possibleValueIndex] = (valueCounts[possibleValueIndex] + self.K) / total

    def predictLabel_GivenFeatures(self, featuresListOfImage):
        logProbabilityPerLabel = []
        for label in self.LabelMap:
            # log P(Y=label) + sum_i log P(f_i | Y=label),
            # which equals log P(Y=label | features) up to a constant.
            P_Y = self.LabelMap.get(label)
            logP_features_given_Y = 0
            for feature in range(0, self.FEATURES):
                logP_features_given_Y += math.log(self.FeatureMap[feature][label][featuresListOfImage[feature]])
            # Use the same (natural) log base for the prior as for the likelihoods.
            logProbability = math.log(P_Y) + logP_features_given_Y
            logProbabilityPerLabel.append(logProbability)

        predictedLabel = np.argmax(logProbabilityPerLabel)
        return predictedLabel

    def testModel(self, featuresListOfImage, actualLabel):
        predictedLabel = self.predictLabel_GivenFeatures(featuresListOfImage)
        if predictedLabel != actualLabel:
            return 1
        else:
            return 0


## K-Nearest Neighbors (KNN)

The *KNN* algorithm predicts on the basis that similar items lie close together. It classifies a new data point by identifying the class that its nearest training points belong to. It is also called lazy learning because there is no real training step: all the work happens at prediction time. During training, the algorithm just stores the feature values and the labels of the data.

### Training: Actually just Storing the Training dataset

In [1]:
def storeTrainingSet(self, featuresPerImage, labelForImage):
    self.features += featuresPerImage
    self.labels += labelForImage

Currently, we pass the number of neighbors $k=20$, but the best value of $k$ depends on the size of the dataset.

### Testing:

We use the **Euclidean distance** to measure how far the new data point is from every point in the training set. The algorithm then:
- Sorts the training points by distance.
- Takes the nearest $k$ points and predicts the most frequent label among them.

In [4]:
# Locate the most similar neighbors
# Note: euclidean_distance is assumed to be defined elsewhere in the project.
def get_neighbors(self, featureForTestImage, labelForTestImage):
    distances = []
    for featureTrainingImage, labelTrainingImage in zip(self.features, self.labels):
        # Compare the test image's features with each training image's features.
        dist = euclidean_distance(featureForTestImage, featureTrainingImage)
        distances.append((featureTrainingImage, labelTrainingImage, dist))

    distances.sort(key=lambda tup: tup[2])  # Sort according to the dist
    # Keep the labels of the num_neighbors closest training images.
    neighbors = [label for _, label, _ in distances[:self.num_neighbors]]
    return neighbors
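
Putting the testing steps together, a self-contained sketch might look like the following. The helpers `euclidean_distance` and `knn_predict` and the toy data are written for this example only and are not the project's own code.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_features, train_labels, test_features, k):
    """Return the most frequent label among the k nearest training points."""
    by_distance = sorted(
        zip(train_features, train_labels),
        key=lambda pair: euclidean_distance(pair[0], test_features),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_features = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(train_features, train_labels, [4, 5], k=3)
print(prediction)
```

The three training points nearest to `[4, 5]` all carry label 1, so the majority vote returns 1.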

**Advantages**
- There is no training period; the work is done at prediction time.
- It can accept new data flexibly.
- It is easy to implement.

**Disadvantages**
- It scales poorly to large datasets, because every prediction compares the new point against all stored points.
- The inputs need to be standardized so that all features are on a comparable scale.
- Missing values or errors in the dataset can lead to wrong predictions.

In [3]:
%run -i dataclassifier.py --input=FACE --gridSize=3 --smoothingValue=0.001 --classifier=PERCEPTRON  --percentIncrement=10
