# Assignment 1
This Jupyter notebook is meant to be used in conjunction with the full questions in the assignment PDF.

## Instructions
- Write your code and analyses in the indicated cells.
- Ensure that this notebook runs without errors when the cells are run in sequence.
- Do not attempt to change the contents of the other cells.

## Submission
- Ensure that this notebook runs without errors when the cells are run in sequence.
- Rename the notebook to `<roll_number>.ipynb` and submit ONLY the notebook file on Moodle.

### Environment setup

The following code reads the train and test data (provided along with this template) and outputs the data and labels as numpy arrays. Use these variables in your code.

---
#### Note on conventions
In mathematical notation, the convention is that data matrices are column-indexed, which means that input data $x$ has shape $[d, n]$, where $d$ is the number of dimensions and $n$ is the number of data points.

Programming languages have a slightly different convention: data matrices are of shape $[n, d]$. This has the benefit of being able to access the $i$-th data point as a simple `data[i]`.

What this means is that you need to be careful about your handling of matrix dimensions. For example, the covariance matrix (of shape $[d,d]$) for input data $x$ with mean $u$ is calculated, up to a normalizing factor, as $(x-u)(x-u)^T$ in mathematical notation, whereas in code you would compute $(x-u)^T(x-u)$ to get the correct output shape.
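
As a quick sanity check of the shape convention above, the sketch below builds a small random $[n, d]$ array (hypothetical data, not the assignment files) and computes the covariance with the transpose on the left. The $1/(n-1)$ scaling is an assumption, chosen only so that the result can be compared against the default of `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4                      # hypothetical sizes, for illustration only
x = rng.normal(size=(n, d))       # rows are data points: shape [n, d]

u = x.mean(axis=0)                # per-dimension mean, shape [d]
centered = x - u                  # broadcasting subtracts u from every row

# With [n, d] data the transpose goes on the left to get a [d, d] result.
cov = centered.T @ centered / (n - 1)

print(cov.shape)
print(np.allclose(cov, np.cov(x, rowvar=False)))
```

Note that `np.cov` treats rows as variables by default, so `rowvar=False` is needed for data stored as $[n, d]$.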

In [9]:
from __future__ import print_function

import numpy as np
import matplotlib.pyplot as plt

def read_data(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
    
    num_points = len(lines)
    dim_points = 28 * 28
    data = np.empty((num_points, dim_points))
    labels = np.empty(num_points)
    
    for ind, line in enumerate(lines):
        num = line.split(',')
        labels[ind] = int(num[0])
        data[ind] = [ int(x) for x in num[1:] ]
        
    return (data, labels)

train_data, train_labels = read_data("sample_train.csv")
test_data, test_labels = read_data("sample_test.csv")
print(train_data.shape, test_data.shape)
print(train_labels.shape, test_labels.shape)

(6000, 784) (1000, 784)
(6000,) (1000,)


# Questions
---
## 1.3.1 Representation
The next code cells, when run, should plot the eigenvalue spectrum of the covariance matrices corresponding to the mentioned samples. Normalize the eigenvalue spectrum and show only the first 100 values.

In [27]:
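# Samples corresponding to the last digit of your roll number (plot a)
digit = 9  # last digit of the roll number
samples = train_data[train_labels == digit]

# Covariance of the selected samples; rows are data points, hence rowvar=False
cov = np.cov(samples, rowvar=False)

# The covariance matrix is symmetric, so eigvalsh returns real eigenvalues
eigval = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh sorts ascending; reverse it
# Normalize so the spectrum sums to 1 (one possible choice of normalization)
eigval = eigval / eigval.sum()

plt.plot(eigval[:100])
plt.xlabel("Eigenvalue index")
plt.ylabel("Normalized eigenvalue")
plt.title("Eigenvalue spectrum (plot a)")
plt.show()
print(eigval.shape)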



        
        


(784,)


In [13]:
# Samples corresponding to the last digit of (your roll number + 1) % 10 (plot b)

In [14]:
# All training data (plot c)

In [15]:
# Randomly selected 50% of the training data (plot d)


### 1.3.1 Question 1
- Are plots a and b different? Why?
- Are plots b and c different? Why?
- What are the approximate ranks of each plot?

---
Your answers here (double click to edit)

---

### 1.3.1 Question 2
- How many possible images could there be?
- What percentage is accessible to us as MNIST data?
- If we had access to all the data, how would the eigenvalue spectrum of the covariance matrix look?

---
Your answers here (double click to edit)

---

## 1.3.2 Linear Transformation
---
### 1.3.2 Question 1
How does the eigen spectrum change if the original data was multiplied by an orthonormal matrix? Answer analytically and then also validate experimentally.

---
Analytical answer here (double click to edit)

---

In [16]:
# Experimental validation here.
# Multiply your data (train_data) with an orthonormal matrix and plot the
# eigenvalue spectrum of the new covariance matrix.

# code goes here
n = 784
H = np.random.rand(n, n)
u, s, v = np.linalg.svd(H, full_matrices=False)
mat = u @ v
print(mat @ mat.T)

[[ 1.00000000e+00 -3.64291930e-16  2.10335221e-16 ...  3.71230824e-16
   2.82759927e-16  3.46944695e-18]
 [-3.64291930e-16  1.00000000e+00 -1.56179323e-16 ...  3.98986399e-17
  -9.19403442e-17  9.02056208e-17]
 [ 2.10335221e-16 -1.56179323e-16  1.00000000e+00 ... -1.47451495e-16
  -3.74700271e-16 -3.51281504e-17]
 ...
 [ 3.71230824e-16  3.98986399e-17 -1.47451495e-16 ...  1.00000000e+00
  -2.22044605e-16  2.94902991e-16]
 [ 2.82759927e-16 -9.19403442e-17 -3.74700271e-16 ... -2.22044605e-16
   1.00000000e+00  8.53605923e-17]
 [ 3.46944695e-18  9.02056208e-17 -3.51281504e-17 ...  2.94902991e-16
   8.53605923e-17  1.00000000e+00]]
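
One way to carry the check above through to the spectrum itself is sketched below. It is an illustration on small synthetic data rather than `train_data`, and the QR-based construction of the orthonormal matrix is an assumption (the SVD-based construction in the cell above would serve equally well).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic [n, d] data with a different spread along each dimension.
x = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

# Random orthonormal matrix from the QR decomposition of a random matrix.
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))

def spectrum(data):
    """Eigenvalues of the covariance matrix, sorted in descending order."""
    return np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

before = spectrum(x)
after = spectrum(x @ q)           # every row (data point) is transformed by q

print(np.allclose(q.T @ q, np.eye(5)))   # checks that q is orthonormal
print(np.allclose(before, after))        # compares the two spectra
```

Because the data are stored as $[n, d]$, the transformation is applied by right-multiplication, `x @ q`.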


### 1.3.2 Question 2
If the samples were multiplied by a 784 × 784 matrix of rank 1 or 2 (a rank-deficient matrix), what would the eigen spectrum look like?

---
Your answer here (double click to edit)

---
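
A small experiment can help when reasoning about this question. The sketch below uses synthetic data with hypothetical sizes (`d = 6`, rank `r = 2`) instead of the 784-dimensional samples, and counts eigenvalues above a relative tolerance of `1e-10`, which is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 6, 2               # hypothetical sizes, for illustration only
x = rng.normal(size=(n, d))

# Rank-r matrix built as the product of a [d, r] and an [r, d] factor.
a = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

# Spectrum of the covariance of the transformed data, in descending order.
eigval = np.linalg.eigvalsh(np.cov(x @ a, rowvar=False))[::-1]

# Count eigenvalues that are not numerically zero relative to the largest one.
num_nonzero = int(np.sum(eigval > 1e-10 * eigval[0]))

print(np.linalg.matrix_rank(a))
print(num_nonzero)
```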

### 1.3.2 Question 3
Project the original data onto the first and second eigenvectors and plot the result in 2D.

In [18]:
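# Plotting code here
print(train_data.shape)
# np.linalg.eig needs a square matrix, so decompose the [d, d] covariance instead
eigval, eigvec = np.linalg.eigh(np.cov(train_data, rowvar=False))
proj = (train_data - train_data.mean(axis=0)) @ eigvec[:, [-1, -2]]  # two largest eigenvalues
plt.scatter(proj[:, 0], proj[:, 1], c=train_labels, s=2)

(6000, 784)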

## 1.3.3 Probabilistic View
---
In this section you will classify the test set by fitting multivariate Gaussians on the train set, with different choices for decision boundaries. When run, your code should print the accuracy on the test set.
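
As a starting point, a per-class Gaussian classifier might be sketched as follows. This is a minimal illustration on synthetic 2D data, not a reference solution: the helper names, the regularization term `reg` (added because pixel covariances are often singular), and the use of empirical class frequencies as the prior are all assumptions.

```python
import numpy as np

def fit_gaussians(data, labels, reg=1e-3):
    """Per-class mean, regularized covariance, and empirical prior."""
    params = {}
    for c in np.unique(labels):
        xc = data[labels == c]
        cov = np.cov(xc, rowvar=False) + reg * np.eye(data.shape[1])
        params[c] = (xc.mean(axis=0), cov, len(xc) / len(data))
    return params

def log_likelihood(x, mean, cov):
    """Gaussian log density, dropping the constant -d/2 * log(2*pi)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    return -0.5 * (logdet + maha)

def predict(x, params, use_prior=False):
    """Pick the class with the highest likelihood (or posterior, if use_prior)."""
    classes = np.array(list(params))
    scores = np.stack([
        log_likelihood(x, m, c) + (np.log(p) if use_prior else 0.0)
        for m, c, p in params.values()
    ], axis=1)
    return classes[np.argmax(scores, axis=1)]

# Tiny synthetic check with two well-separated classes.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-3.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
labels = np.array([0.0] * 100 + [1.0] * 100)

params = fit_gaussians(data, labels)
acc_mle = np.mean(predict(data, params) == labels)
acc_map = np.mean(predict(data, params, use_prior=True) == labels)
print(acc_mle, acc_map)
```

With equal class frequencies, as in this toy example, the prior term shifts every score by the same amount, so the two decision rules coincide.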

In [8]:
# Print accuracy on the test set using MLE

In [9]:
# Print accuracy on the test set using MAP
# (assume a reasonable prior and mention it in the comments)

In [10]:
# Print accuracy using Bayesian pairwise majority voting method

In [11]:
# Print accuracy using Simple Perpendicular Bisector majority voting method

### 1.3.3 Question 4
Compare the performance of the four methods and note any salient observations.

---
Your analysis here (double click to edit)

---

## 1.3.4 Nearest Neighbour based Tasks and Design
---
### 1.3.4 Question 1 : NN Classification with various K
Implement a KNN classifier and print accuracies on the test set with K=1,3,7
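
A brute-force version of such a classifier could look like the sketch below. It assumes Euclidean distance and non-negative integer labels, and it is run here on synthetic 2D data; `knn_predict` is a hypothetical helper name, not part of the template.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2,
    # giving a matrix of shape [n_test, n_train].
    d2 = (
        np.sum(test_x ** 2, axis=1)[:, None]
        - 2.0 * test_x @ train_x.T
        + np.sum(train_x ** 2, axis=1)[None, :]
    )
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_y[nearest].astype(int)
    # bincount(...).argmax() resolves ties in favour of the smaller label.
    return np.array([np.bincount(row).argmax() for row in votes])

# Synthetic data: two well-separated clusters with labels 0 and 1.
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
train_y = np.array([0] * 50 + [1] * 50)
test_x = np.array([[-3.0, -3.0], [3.0, 3.0]])

for k in (1, 3, 7):
    print(k, knn_predict(train_x, train_y, test_x, k))
```

Odd values of K avoid ties in two-class problems, but ties remain possible with ten classes, so the tie-breaking rule is worth stating explicitly.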

In [12]:
# Your code here
# Print accuracies with K = 1, 3, 7

### 1.3.4 Question 1 continued
- Are the accuracies the same? Why or why not?
- How do we identify the best K? Suggest a computational procedure with a logical explanation.

---
Your analysis here (double click to edit)

---

### 1.3.4 Question 2 :  Reverse NN based outlier detection
A sample can be thought of as an outlier if it is NOT in the nearest neighbour set of any other sample. Expand this idea into an algorithm.
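
One possible reading of this idea is sketched below: compute every point's k nearest neighbours, count how often each point appears in the neighbour sets of the others, and flag points that never appear. The function name, the choice `k=3`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def reverse_nn_outliers(data, k=3):
    """Indices of points that appear in no other point's k-nearest-neighbour set."""
    # Pairwise Euclidean distances, shape [n, n]; fine for small n.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    knn = np.argsort(dist, axis=1)[:, :k]     # k nearest neighbours of every point
    # How many times each point occurs in somebody else's neighbour set.
    counts = np.bincount(knn.ravel(), minlength=len(data))
    return np.flatnonzero(counts == 0)

# Synthetic data: one tight cluster plus a single far-away point (index 30).
rng = np.random.default_rng(0)
cluster = rng.normal(size=(30, 2))
far_point = np.array([[50.0, 50.0]])
points = np.vstack([cluster, far_point])

outliers = reverse_nn_outliers(points, k=3)
print(outliers)
```

Points on the edge of a cluster may also be flagged for small `k`, so the threshold on the reverse-neighbour count and the value of `k` are both design choices.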

In [13]:
# This cell reads mixed data containing both MNIST digits and English characters.
# The labels for this mixed data are random and are hence ignored.
mixed_data, _ = read_data("outliers.csv")
print(mixed_data.shape)

(20, 784)


### 1.3.4 Question 3 : NN for regression
Assume that each classID in the train set corresponds to a neatness score as:
$$ neatness = \frac{classID}{10} $$

---
Assume we had to predict the neatness score for each test sample using NN based techniques on the train set. Describe the algorithm.

---
Your algorithm here (double click to edit)

---

### 1.3.4 Question 3 continued
Validate your algorithm on the test set. This code should print mean absolute error on the test set, using the train set for NN based regression.
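
The evaluation could be structured along the lines of the sketch below, which averages the neatness scores of the k nearest training points and reports the mean absolute error. It uses synthetic data with two hypothetical class IDs (2 and 7); the helper name `knn_regress` and the choice `k=3` are assumptions.

```python
import numpy as np

def knn_regress(train_x, train_y, test_x, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Squared Euclidean distances, shape [n_test, n_train].
    d2 = (
        np.sum(test_x ** 2, axis=1)[:, None]
        - 2.0 * test_x @ train_x.T
        + np.sum(train_x ** 2, axis=1)[None, :]
    )
    nearest = np.argsort(d2, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)

# Synthetic data: two clusters standing in for class IDs 2 and 7.
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
class_id = np.array([2] * 50 + [7] * 50)
neatness = class_id / 10.0                    # neatness = classID / 10

test_x = np.array([[-3.0, -3.0], [3.0, 3.0]])
test_neatness = np.array([0.2, 0.7])

pred = knn_regress(train_x, neatness, test_x, k=3)
mae = np.mean(np.abs(pred - test_neatness))   # mean absolute error
print(mae)
```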

In [14]:
# Your code here

---
# FOLLOW THE SUBMISSION INSTRUCTIONS
---