# Optical Recognition of Handwritten Digits Dataset (UCI repository)

This is the code for problem P1. 

In [1]:
import numpy as np
import math
from numpy.linalg import inv

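Here, the probability density function of a multivariate Gaussian random variable is defined in the function `gaussian`, which takes the point $x$, the mean vector $\mu$ and the covariance matrix $\Sigma$. For a $d$-dimensional $x$ it evaluates
$$N(x \mid \mu, \Sigma) = \frac{\exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)}{\sqrt{(2\pi)^{d}\,|\Sigma|}}$$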

In [2]:
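def gaussian(x,mu,sigma):
    # x and mu are 1-D arrays of length d; sigma is the d x d covariance matrix
    d=x.shape[0]
    x_bar=np.subtract(x,mu)
    x_bar_vec=np.array([x_bar])  # row vector of shape (1, d)
    sigma_inv=np.linalg.inv(sigma)
    # Quadratic form (x-mu)^T Sigma^{-1} (x-mu); .item() extracts the scalar
    # from the 1x1 array so that math.exp receives a plain float
    index=np.matmul(x_bar_vec,np.matmul(sigma_inv,x_bar_vec.T)).item()
    num=math.exp(-0.5*index)
    den=(((2*np.pi)**d)*np.linalg.det(sigma))**0.5
    return num/den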
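
For high-dimensional inputs, the exponential and the determinant in a density like the one above can underflow or overflow in floating point. A common workaround is to evaluate the logarithm of the density instead. The following is a minimal sketch of that idea, not part of the original solution; the name `log_gaussian` is hypothetical.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log of the multivariate normal density at x (hypothetical helper)."""
    d = x.shape[0]
    diff = x - mu
    # slogdet returns (sign, log|det|), avoiding overflow/underflow of det itself
    sign, logdet = np.linalg.slogdet(sigma)
    # Solve sigma @ z = diff rather than forming the explicit inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# For the 2-D standard normal at the origin the result equals -log(2*pi)
print(log_gaussian(np.zeros(2), np.zeros(2), np.eye(2)))
```

Since the logarithm is monotonic, comparing $\log \pi_{5} + \log N(x \mid \mu_{5},\Sigma_{5})$ with $\log \pi_{6} + \log N(x \mid \mu_{6},\Sigma_{6})$ would give the same decision as comparing the products directly.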

The training dataset is loaded from the file "P1_data_train.csv", which is present in the directory "P1_data/P1_data/". The corresponding labels are loaded from "P1_labels_train.csv" in the same directory.

In [3]:
data=np.genfromtxt('P1_data/P1_data/P1_data_train.csv',delimiter=',')
labels=np.genfromtxt('P1_data/P1_data/P1_labels_train.csv',delimiter=',')
size=data.shape

In [4]:
count_five=0
mu_five=np.zeros(size[1])
sigma_five=np.zeros((size[1],size[1]))
count_six=0
mu_six=np.zeros(size[1])
sigma_six=np.zeros((size[1],size[1]))

Here, the training data corresponding to the handwritten digits 5 and 6 are modelled by multivariate Gaussian distributions with means $\mu_{5}$, $\mu_{6}$ and covariance matrices $\Sigma_{5}$, $\Sigma_{6}$. These parameters are first initialized to zero.<br>
From Maximum Likelihood Estimation, the estimates of the parameters $\mu_{5}$ and $\mu_{6}$ are given by<br>
$$\mu_{5} = \frac{1}{n_{5}} \sum_{i:\,L(x_{i})=5} x_{i}$$
$$\mu_{6} = \frac{1}{n_{6}} \sum_{i:\,L(x_{i})=6} x_{i}$$
where $x_{i}$ is the $i$th example, $L(x_{i})$ is its label, and $n_{5}$, $n_{6}$ are the numbers of data points in class 5 and class 6.<br>
The estimates of the parameters $\Sigma_{5}$ and $\Sigma_{6}$ are given by<br>
$$\Sigma_{5} = \frac{1}{n_{5}-1} \sum_{i:\,L(x_{i})=5} (x_{i} - \mu_{5})(x_{i} - \mu_{5})^T$$
$$\Sigma_{6} = \frac{1}{n_{6}-1} \sum_{i:\,L(x_{i})=6} (x_{i} - \mu_{6})(x_{i} - \mu_{6})^T$$
Dividing by $n-1$ gives the unbiased sample covariance; the strict maximum likelihood estimate divides by $n$ instead.<br>

In [5]:
for i in range(size[0]):
    if labels[i]==5:
        mu_five=np.add(mu_five,data[i])
        count_five+=1
    else:
        mu_six=np.add(mu_six,data[i])
        count_six+=1
mu_five=mu_five/count_five
mu_six=mu_six/count_six

In [6]:
for i in range(size[0]):
    if labels[i]==5:
        x=np.subtract(data[i],mu_five)
        x_vec=np.array([x])
        pd=np.matmul(x_vec.T,x_vec)
        sigma_five=np.add(sigma_five,pd)
    else:
        x=np.subtract(data[i],mu_six)
        x_vec=np.array([x])
        pd=np.matmul(x_vec.T,x_vec)
        sigma_six=np.add(sigma_six,pd)
sigma_five=sigma_five/(count_five-1)
sigma_six=sigma_six/(count_six-1)
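
The two loops above accumulate the class means and the outer products one example at a time. As a hedged alternative, the same estimates can be sketched with boolean masks, `mean` and `np.cov`; the helper name `class_mean_and_cov` and the toy arrays are made up for illustration. Note that this sketch selects rows by exact label, whereas the loops above assign every example not labelled 5 to class 6.

```python
import numpy as np

def class_mean_and_cov(data, labels, digit):
    """Mean, covariance and count of the rows of `data` whose label is `digit`."""
    rows = data[labels == digit]        # boolean mask selects one class
    mu = rows.mean(axis=0)              # per-feature mean
    # rowvar=False: each row is an observation; the default normalisation is n-1
    sigma = np.cov(rows, rowvar=False)
    return mu, sigma, rows.shape[0]

# Toy data (not the digits dataset): three examples of class 5, one of class 6
toy_data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0], [0.0, 0.0]])
toy_labels = np.array([5, 5, 5, 6])
mu, sigma, n = class_mean_and_cov(toy_data, toy_labels, 5)
print(mu)
print(sigma)
```

With the notebook's variables, a call such as `class_mean_and_cov(data, labels, 5)` would be expected to reproduce `mu_five`, `sigma_five` and `count_five`.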

The prior (a priori) probabilities $\pi_{5}$ and $\pi_{6}$ are estimated as follows:<br>
$\pi_{5}$ = (number of examples in class 5) / (total number of examples)<br>
$\pi_{6}$ = (number of examples in class 6) / (total number of examples)<br>

In [7]:
prob_C5=count_five/(count_five+count_six)
prob_C6=1-prob_C5

The test dataset is loaded from the file "P1_data_test.csv", which is present in the directory "P1_data/P1_data/". The corresponding labels are loaded from "P1_labels_test.csv" in the same directory.

In [8]:
test_data=np.genfromtxt('P1_data/P1_data/P1_data_test.csv',delimiter=',')
test_labels=np.genfromtxt('P1_data/P1_data/P1_labels_test.csv',delimiter=',')
test_size=test_data.shape

Using the estimated parameters of the two Normal distributions for class 5 and class 6, the test data is classified with the Bayesian classification criterion:<br>
If $\pi_{5}\,N(x \mid \mu_{5},\Sigma_{5}) > \pi_{6}\,N(x \mid \mu_{6},\Sigma_{6})$, choose class 5; otherwise choose class 6.<br>
In the first case, we take the parameters $\Sigma_{5}$, $\Sigma_{6}$ to be the empirical covariance matrices calculated from the training data.<br>
From the predicted labels, we also calculate the confusion matrix. Class 5 is treated as the positive class, and the matrix is laid out as [[TP, FP], [FN, TN]], with predicted labels along the rows and true labels along the columns.

In [9]:
### Empirical Case ###
tp=0
tn=0
fp=0
fn=0
predicted_labels=np.zeros(test_size[0])
for i in range(test_size[0]):
    if prob_C5*gaussian(test_data[i],mu_five,sigma_five)>prob_C6*gaussian(test_data[i],mu_six,sigma_six):
        predicted_labels[i]=5
        if(predicted_labels[i]==test_labels[i]):
            tp+=1
        else:
            fp+=1
    else:
        predicted_labels[i]=6
        if(predicted_labels[i]==test_labels[i]):
            tn+=1
        else:
            fn+=1

conf_mat=[[tp ,fp],[fn,tn]]
print("Confusion Matrix: ")
print(np.array(conf_mat))
false_pos_rate=fp/(fp+tn)
false_neg_rate=fn/(tp+fn)
print("False Negative Rate: ",end="")
print(false_neg_rate)
print("False Positive Rate: ",end="")
print(false_pos_rate)

Confusion Matrix: 
[[106  27]
 [ 49 151]]
False Negative Rate: 0.3161290322580645
False Positive Rate: 0.15168539325842698
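
The counting loop above can also be written with boolean masks. The following is a minimal sketch under the same conventions as the cell above (class 5 is the positive class, matrix layout [[TP, FP], [FN, TN]]); the helper names `confusion_counts` and `error_rates` are hypothetical.

```python
import numpy as np

def confusion_counts(predicted, actual, positive=5):
    """Return [[TP, FP], [FN, TN]] with `positive` as the positive class."""
    pred_pos = np.asarray(predicted) == positive
    act_pos = np.asarray(actual) == positive
    tp = int(np.sum(pred_pos & act_pos))     # predicted 5, truly 5
    fp = int(np.sum(pred_pos & ~act_pos))    # predicted 5, truly not 5
    fn = int(np.sum(~pred_pos & act_pos))    # predicted not 5, truly 5
    tn = int(np.sum(~pred_pos & ~act_pos))   # predicted not 5, truly not 5
    return np.array([[tp, fp], [fn, tn]])

def error_rates(conf):
    """Return (false positive rate, false negative rate) from [[TP, FP], [FN, TN]]."""
    (tp, fp), (fn, tn) = conf
    return fp / (fp + tn), fn / (tp + fn)

# Rates recomputed from the confusion matrix printed above
fpr, fnr = error_rates(np.array([[106, 27], [49, 151]]))
print(fpr, fnr)
```

Here the false positive rate is the fraction of true class 6 examples predicted as 5, and the false negative rate is the fraction of true class 5 examples predicted as 6.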


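For this case, **15.169 %** of the class 6 test examples are misclassified as 5 (the false positive rate) and **31.613 %** of the class 5 test examples are misclassified as 6 (the false negative rate).<br>
Next, we take the case where the covariance matrices for the two classes are the same, i.e. $\Sigma_{5}=\Sigma_{6}=\Sigma$, where $\Sigma$ is calculated as the covariance matrix of the entire training data (both classes together, centred on the overall mean).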

In [10]:
### Equal Sigma Case ####
tp=0
tn=0
fp=0
fn=0

count=0
common_mu=np.zeros(size[1])
common_sigma=np.zeros((size[1],size[1]))
for i in range(size[0]):
    common_mu=np.add(common_mu,data[i])
    count+=1
common_mu=common_mu/count
for i in range(size[0]):
    X=np.subtract(data[i],common_mu)
    X_vec=np.array([X])
    pd=np.matmul(X_vec.T,X_vec)
    common_sigma=np.add(common_sigma,pd)
common_sigma=common_sigma/(count-1)

predicted_labels=np.zeros(test_size[0])
for i in range(test_size[0]):
    if prob_C5*gaussian(test_data[i],mu_five,common_sigma)>prob_C6*gaussian(test_data[i],mu_six,common_sigma):
        predicted_labels[i]=5
        if(predicted_labels[i]==test_labels[i]):
            tp+=1
        else:
            fp+=1
    else:
        predicted_labels[i]=6
        if(predicted_labels[i]==test_labels[i]):
            tn+=1
        else:
            fn+=1

conf_mat=[[tp ,fp],[fn,tn]]
print("Confusion Matrix:")
print(np.array(conf_mat))
false_pos_rate=fp/(fp+tn)
false_neg_rate=fn/(tp+fn)
print("False Negative Rate:",end="")
print(false_neg_rate)
print("False Positive Rate:",end="")
print(false_pos_rate)

Confusion Matrix:
[[136  28]
 [ 19 150]]
False Negative Rate:0.12258064516129032
False Positive Rate:0.15730337078651685
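
When both classes share one covariance matrix $\Sigma$, the quadratic terms $x^T \Sigma^{-1} x$ cancel in the comparison of the two log-posteriors, so the rule reduces to a linear test $w^T x + b > 0$ with $w = \Sigma^{-1}(\mu_{5}-\mu_{6})$ and $b = -\frac{1}{2}(\mu_{5}+\mu_{6})^T w + \log(\pi_{5}/\pi_{6})$. The sketch below illustrates this on made-up numbers; `linear_rule` is a hypothetical helper, not part of the original solution.

```python
import numpy as np

def linear_rule(mu_a, mu_b, sigma, prior_a, prior_b):
    """Return (w, b) such that class a is chosen when w @ x + b > 0.

    Valid only when both classes share the covariance matrix `sigma`.
    """
    # w = Sigma^{-1} (mu_a - mu_b), computed without forming the inverse
    w = np.linalg.solve(sigma, mu_a - mu_b)
    # b collects the terms that do not depend on x
    b = -0.5 * (mu_a + mu_b) @ w + np.log(prior_a / prior_b)
    return w, b

# Toy check on made-up numbers (not the digits data)
mu_a, mu_b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
w, b = linear_rule(mu_a, mu_b, np.eye(2), 0.5, 0.5)
x = np.array([0.3, 2.0])
print("choose a" if w @ x + b > 0 else "choose b")
```

With the notebook's variables, `linear_rule(mu_five, mu_six, common_sigma, prob_C5, prob_C6)` would give the weights of the corresponding linear decision boundary, assuming `common_sigma` is invertible.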


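In this case, **15.730 %** of the class 6 test examples are misclassified as 5 (the false positive rate) and **12.258 %** of the class 5 test examples are misclassified as 6 (the false negative rate).<br>
Next, we take another case where the two classes share the same diagonal covariance matrix, obtained by setting the off-diagonal entries of $\Sigma$ to zero.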

In [11]:
### Diagonal Sigma Case ####
tp=0
tn=0
fp=0
fn=0

# Keep only the diagonal of the shared covariance matrix (off-diagonal entries set to zero)
common_sigma=np.diag(np.diag(common_sigma))
predicted_labels=np.zeros(test_size[0])
for i in range(test_size[0]):
    if prob_C5*gaussian(test_data[i],mu_five,common_sigma)>prob_C6*gaussian(test_data[i],mu_six,common_sigma):
        predicted_labels[i]=5
        if(predicted_labels[i]==test_labels[i]):
            tp+=1
        else:
            fp+=1
    else:
        predicted_labels[i]=6
        if(predicted_labels[i]==test_labels[i]):
            tn+=1
        else:
            fn+=1

conf_mat=[[tp ,fp],[fn,tn]]
print("Confusion Matrix:")
print(np.array(conf_mat))
false_pos_rate=fp/(fp+tn)
false_neg_rate=fn/(tp+fn)
print("False Negative Rate:",end="")
print(false_neg_rate)
print("False Positive Rate:",end="")
print(false_pos_rate)

Confusion Matrix:
[[134  40]
 [ 21 138]]
False Negative Rate:0.13548387096774195
False Positive Rate:0.2247191011235955
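
With a diagonal covariance matrix the features are treated as independent, so the log-density splits into a sum of one-dimensional Gaussian terms and no matrix inverse or determinant is needed. The following is a minimal sketch of that simplification; `diag_log_gaussian` is a hypothetical helper, and it assumes every variance is strictly positive.

```python
import numpy as np

def diag_log_gaussian(x, mu, variances):
    """Log N(x | mu, diag(variances)) as a sum of per-feature terms.

    Assumes all entries of `variances` are strictly positive.
    """
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + (x - mu) ** 2 / variances)

# Toy check on made-up numbers (not the digits data)
x = np.array([1.0, 2.0])
mu = np.zeros(2)
variances = np.array([1.0, 4.0])
value = diag_log_gaussian(x, mu, variances)
print(value)
```

With the notebook's variables, the variances would be `np.diag(common_sigma)`, and the class with the larger value of log-prior plus log-density would be chosen.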


For this case, **22.472 %** of the class 6 test examples are misclassified as 5 (the false positive rate) and **13.548 %** of the class 5 test examples are misclassified as 6 (the false negative rate).