# BT3017 Tutorial 4

- There is an online copy<sup>+</sup> of this tutorial on GitHub, available [here](https://github.com/KohSiXing/Feature-Engineering-for-Machine-Learning/blob/master/BT3017%20Tutorial%204.ipynb)
- Dataset modified from Machine Learning Mastery: [How to Code a Neural Network with Backpropagation In Python](https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/)

<sup>+</sup> Online copy will only be published after Wednesday 1000 of that week to prevent plagiarism.

### Preprocessing

In [1]:
import pandas as pd
import numpy as np

wheat_seed = pd.read_csv("seeds_with_headers.csv")
wheat_seed

|  | feat1 | feat2 | feat3 | feat4 | feat5 | feat6 | feat7 | feat8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.8710 | 5.763 | 3.312 | 2.221 | 5.220 | 1 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1 |
| 2 | 14.29 | 14.09 | 0.9050 | 5.291 | 3.337 | 2.699 | 4.825 | 1 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 12.19 | 13.20 | 0.8783 | 5.137 | 2.981 | 3.631 | 4.870 | 3 |
| 206 | 11.23 | 12.88 | 0.8511 | 5.140 | 2.795 | 4.325 | 5.003 | 3 |
| 207 | 13.20 | 13.66 | 0.8883 | 5.236 | 3.232 | 8.315 | 5.056 | 3 |
| 208 | 11.84 | 13.21 | 0.8521 | 5.175 | 2.836 | 3.598 | 5.044 | 3 |


### 1

- Use `np.mean` with `axis=0` to compute the mean of each of the 8 features (one mean per column)

In [2]:
mu = np.mean(wheat_seed, axis=0)
mu

feat1    14.847524
feat2    14.559286
feat3     0.870999
feat4     5.628533
feat5     3.258605
feat6     3.700201
feat7     5.408071
feat8     2.000000
dtype: float64

- Use `np.cov`<sup>[1]</sup> to compute the covariance matrix of the data in the file
    - if `rowvar=True` (the default): each row represents a variable, with observations in the columns
    - if `rowvar=False`: each column represents a variable, while the rows contain observations
    - here each column is a feature and each row is an observation, so `rowvar=False` is used

In [3]:
cov_ws = np.cov(wheat_seed - mu, rowvar=False)
pd.DataFrame(cov_ws)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.466351 | 3.778443 | 0.041823 | 1.224704 | 1.066911 | -1.004356 | 1.235133 | -0.824115 |
| 1 | 3.778443 | 1.705528 | 0.016332 | 0.562666 | 0.466065 | -0.426766 | 0.571753 | -0.350478 |
| 2 | 0.041823 | 0.016332 | 0.000558 | 0.003852 | 0.006798 | -0.011777 | 0.002634 | -0.010269 |
| 3 | 1.224704 | 0.562666 | 0.003852 | 0.196305 | 0.143992 | -0.11429 | 0.203125 | -0.093292 |
| 4 | 1.066911 | 0.466065 | 0.006798 | 0.143992 | 0.142668 | -0.146543 | 0.139068 | -0.130909 |
| 5 | -1.004356 | -0.426766 | -0.011777 | -0.11429 | -0.146543 | 2.260684 | -0.008187 | 0.710382 |
| 6 | 1.235133 | 0.571753 | 0.002634 | 0.203125 | 0.139068 | -0.008187 | 0.241553 | 0.009775 |
| 7 | -0.824115 | -0.350478 | -0.010269 | -0.093292 | -0.130909 | 0.710382 | 0.009775 | 0.669856 |
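
As a cross-check on what `np.cov` computes, the sketch below compares it with the textbook sample covariance formula on a small made-up array (an assumption for illustration, not the wheat seed data). It also suggests that centring the data beforehand, as in the cell above, should not change the result, because `np.cov` subtracts the means itself.

```python
import numpy as np

# Toy data, assumed for illustration: 5 observations (rows) of 3 variables (columns).
X = np.array([[2.0, 1.0, 0.5],
              [3.0, 1.5, 0.1],
              [4.0, 3.0, 0.9],
              [5.0, 2.5, 0.4],
              [6.0, 4.0, 0.7]])

n = X.shape[0]
Xc = X - X.mean(axis=0)             # centre each column
cov_manual = Xc.T @ Xc / (n - 1)    # unbiased sample covariance

cov_np = np.cov(X, rowvar=False)            # columns are variables
cov_np_centred = np.cov(Xc, rowvar=False)   # centring first makes no difference

print(np.allclose(cov_manual, cov_np))
print(np.allclose(cov_np, cov_np_centred))
print(np.cov(X, rowvar=True).shape)         # rows treated as variables instead
```

With `rowvar=True` the same array is read as 5 variables with 3 observations each, which is why the last shape differs from the 3 x 3 matrix obtained with `rowvar=False`.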


- Use `np.linalg.eig`<sup>[2]</sup> to do eigen decomposition of the covariance matrix *cov_ws*

In [4]:
# w contains the eigenvalues and v contains the eigenvectors
w,v = np.linalg.eig(cov_ws)
w

array([1.08883260e+01, 2.33107101e+00, 3.96953708e-01, 5.45019865e-02,
       8.49185405e-03, 2.65452400e-03, 1.47991736e-03, 2.53695934e-05])

- Use `np.argsort` on the negated eigenvalues to get the indices that sort them in **descending** order
- The result is already in sequence: the first value (i.e. idx 0) is the largest eigenvalue and the last value (i.e. idx 7) is the smallest
- This means that the first eigenvector (column 0 of `v`) corresponds to the largest eigenvalue and the last eigenvector (column 7) corresponds to the smallest

In [5]:
np.argsort(-1 * w)

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int64)
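
Here the eigenvalues happen to come back already sorted, but `np.linalg.eig` does not promise any particular order. A minimal sketch of reordering the eigenvalues and eigenvectors together is given below, using a small symmetric matrix that is assumed purely for illustration. It also compares against `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

# Toy symmetric matrix, assumed for illustration (stands in for cov_ws).
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 3.0]])

w, v = np.linalg.eig(C)

order = np.argsort(w)[::-1]   # indices of the eigenvalues, largest first
w_sorted = w[order]
v_sorted = v[:, order]        # reorder the columns so they still match w_sorted

# eigh is meant for symmetric matrices; its eigenvalues come back in ascending order
w_h, v_h = np.linalg.eigh(C)

print(w_sorted)
print(np.allclose(w_sorted, w_h[::-1]))
```

Indexing the columns of `v` with the same `order` array keeps each eigenvector paired with its eigenvalue.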

- Use `np.dot` to show that the eigenvectors are orthogonal to each other.
- Though the numbers are not exactly zero (0), they are very close to zero; the small differences are the result of floating-point rounding errors<sup>[3]</sup>

In [6]:
for i in range(v.shape[1] - 1):
    for j in range(1, v.shape[1]):
        if i == j:
            continue
        else:
            print(np.dot(v[:,i].T, v[:,j]))

1.3877787807814457e-17
-9.71445146547012e-17
-2.220446049250313e-16
-1.942890293094024e-16
-1.1102230246251565e-16
1.942890293094024e-16
2.42861286636753e-17
-1.214306433183765e-16
-1.5959455978986625e-16
-4.163336342344337e-17
-1.1796119636642288e-16
1.6306400674181987e-16
-7.632783294297951e-16
-1.214306433183765e-16
2.7755575615628914e-16
-8.326672684688674e-17
0.0
1.734723475976807e-16
2.983724378680108e-16
-1.5959455978986625e-16
2.7755575615628914e-16
4.440892098500626e-16
-2.942091015256665e-15
-3.608224830031759e-16
3.969047313034935e-15
-4.163336342344337e-17
-8.326672684688674e-17
4.440892098500626e-16
-1.9512169657787126e-14
1.27675647831893e-15
1.0692835505921039e-14
-1.1796119636642288e-16
0.0
-2.942091015256665e-15
-1.9512169657787126e-14
-1.3698937817441248e-13
3.0319496913122634e-14
1.6306400674181987e-16
1.734723475976807e-16
-3.608224830031759e-16
1.27675647831893e-15
-1.3698937817441248e-13
-1.3282430710859217e-13
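
The nested loop above prints most pairs twice, once as (i, j) and once as (j, i). As a more compact alternative, the sketch below checks every pair at once: if the eigenvectors are orthonormal, the product of the transposed eigenvector matrix with itself should be close to the identity matrix. The matrix `C` is a toy example assumed for illustration.

```python
import numpy as np

# Toy symmetric matrix, assumed for illustration (stands in for cov_ws).
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])

w, v = np.linalg.eigh(C)   # eigenvectors are the columns of v

gram = v.T @ v             # entry (i, j) is the dot product of eigenvectors i and j
off_diag = gram - np.diag(np.diag(gram))

print(np.max(np.abs(off_diag)))        # largest deviation from orthogonality
print(np.allclose(gram, np.eye(3)))    # diagonal ~1 (unit length), off-diagonal ~0
```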


- Check if $C \cdot e_1 = \lambda_1 \cdot e_1$ <sup>[4]</sup>
    - in this case, only the first eigenvector is used to confirm the above statement

In [7]:
np.allclose(np.dot(cov_ws,v[:,0]),np.dot(w[0],v[:,0]))

True
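
The same identity can be checked for every eigenvector in one step, since $C V = V \,\mathrm{diag}(\lambda)$ when the eigenvectors are the columns of $V$. The sketch below assumes a small 2 x 2 matrix for illustration.

```python
import numpy as np

# Toy symmetric matrix, assumed for illustration (stands in for cov_ws).
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

w, v = np.linalg.eig(C)

lhs = C @ v    # column j is C applied to eigenvector j
rhs = v * w    # broadcasting scales column j by eigenvalue w[j]

print(np.allclose(lhs, rhs))
```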

### 2

- Project each data point onto all the eigenvectors and store the results of projection into a matrix

In [8]:
wheat_seed = wheat_seed.to_numpy()

In [9]:
projected = np.zeros(shape=(len(wheat_seed),8))

for i in range(len(wheat_seed)):
    for j in range(8):
        projected[i][j] = np.dot(wheat_seed[i], v[:,j])/np.linalg.norm(v[:,j])
        
projected

array([[ 20.63339663,  -6.04413934,  -1.95610142, ...,   1.97685081,
         -4.4221903 ,  -1.3158124 ],
       [ 20.29570966,  -4.82492469,  -2.28373839, ...,   2.00418691,
         -4.4243389 ,  -1.31362472],
       [ 19.31404356,  -6.26074278,  -1.65272335, ...,   2.01391938,
         -4.38566277,  -1.3176479 ],
       ...,
       [ 17.25504733, -11.98139471,  -1.60703803, ...,   1.98949335,
         -4.47402223,  -1.31966865],
       [ 16.45809523,  -7.38228691,  -3.20930424, ...,   2.01347185,
         -4.37781303,  -1.31586237],
       [ 16.66813841,  -9.35245721,  -2.52153975, ...,   1.95770899,
         -4.41463441,  -1.32004096]])

- Reconstruct the data points using matrix of projected values and the eigenvectors
    - The values are the same as the original dataframe at the preprocessing stage

In [10]:
reconstruction = np.zeros(shape=(len(wheat_seed),8))

for i in range(len(wheat_seed)):
    for j in range(8):    
        reconstruction[i] += projected[i][j] * v[:,j]
        
pd.DataFrame(reconstruction)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.8710 | 5.763 | 3.312 | 2.221 | 5.220 | 1.0 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1.0 |
| 2 | 14.29 | 14.09 | 0.9050 | 5.291 | 3.337 | 2.699 | 4.825 | 1.0 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1.0 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 12.19 | 13.20 | 0.8783 | 5.137 | 2.981 | 3.631 | 4.870 | 3.0 |
| 206 | 11.23 | 12.88 | 0.8511 | 5.140 | 2.795 | 4.325 | 5.003 | 3.0 |
| 207 | 13.20 | 13.66 | 0.8883 | 5.236 | 3.232 | 8.315 | 5.056 | 3.0 |
| 208 | 11.84 | 13.21 | 0.8521 | 5.175 | 2.836 | 3.598 | 5.044 | 3.0 |
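
Because the eigenvectors returned by NumPy have unit length, the projection and reconstruction loops can also be written as two matrix products. The sketch below uses random toy data (an assumption, not the wheat seed data) and compares the matrix form with an explicit double loop of the same shape as the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # toy data, assumed: 6 points, 4 features

_, v = np.linalg.eigh(np.cov(X, rowvar=False))

projected = X @ v                  # row i, column j: projection of point i onto eigenvector j
reconstruction = projected @ v.T   # sum over j of projected[i, j] * v[:, j]

# Same computation as an explicit double loop
loop = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(v.shape[1]):
        loop[i] += np.dot(X[i], v[:, j]) * v[:, j]

print(np.allclose(reconstruction, loop))
print(np.allclose(reconstruction, X))   # all eigenvectors used, so X is recovered
```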


- Calculate the squared reconstruction errors

$(P_i - \hat{P}_i)^T (P_i - \hat{P}_i)$

where $(P_i - \hat{P}_i)$ is of dimension 8 x 1

In [11]:
errors = np.zeros(shape=(len(wheat_seed),1))

for i in range(len(wheat_seed)):
    errors[i] = np.matmul((wheat_seed[i] - reconstruction[i]).T, (wheat_seed[i] - reconstruction[i]))

- Print the squared reconstruction error for each data point
    - the values are very close to zero, most of them around 7 x 10<sup>-25</sup>; they are not exactly zero because of floating-point rounding errors

In [12]:
pd.DataFrame(errors, columns=["Errors"])

|  | Errors |
|---|---|
| 0 | 7.468792e-25 |
| 1 | 7.507165e-25 |
| 2 | 7.395402e-25 |
| 3 | 7.768858e-25 |
| 4 | 7.493439e-25 |
| ... | ... |
| 205 | 7.375779e-25 |
| 206 | 7.422442e-25 |
| 207 | 7.719598e-25 |
| 208 | 7.386708e-25 |


- Breakdown of errors when all eigenvectors are used

In [13]:
pd.DataFrame(errors, columns = ["errors"]).describe()

|  | errors |
|---|---|
| count | 210.0 |
| mean | 7.461628e-25 |
| std | 1.300749e-26 |
| min | 7.092437e-25 |
| 25% | 7.37691e-25 |
| 50% | 7.463877e-25 |
| 75% | 7.545227e-25 |
| max | 7.802667e-25 |
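
The squared error $(P_i - \hat{P}_i)^T (P_i - \hat{P}_i)$ is the sum of squared differences along each row, so it can be computed for all points without a loop. A minimal sketch, with two made-up points and reconstructions assumed for illustration:

```python
import numpy as np

# Toy original points and reconstructions, assumed for illustration.
P = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
P_hat = np.array([[1.0, 2.5, 3.0],
                  [3.0, 5.0, 8.0]])

diff = P - P_hat
errors = np.sum(diff ** 2, axis=1)   # one squared error per row

print(errors)   # squared errors of 0.25 and 5.0
```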


### 3

- Repeat the steps for Q2 with the eigenvectors corresponding to the 4 **biggest** eigenvalues (i.e. 0 $\le$ indices $\le$ 3)
- Project each data point onto these 4 eigenvectors and store the results of projection into a matrix (the remaining 4 columns stay 0)

In [14]:
projected = np.zeros(shape=(len(wheat_seed),8))

for i in range(len(wheat_seed)):
    for j in range(4):
        projected[i][j] = np.dot(wheat_seed[i], v[:,j])/np.linalg.norm(v[:,j])
        
projected

array([[ 20.63339663,  -6.04413934,  -1.95610142, ...,   0.        ,
          0.        ,   0.        ],
       [ 20.29570966,  -4.82492469,  -2.28373839, ...,   0.        ,
          0.        ,   0.        ],
       [ 19.31404356,  -6.26074278,  -1.65272335, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [ 17.25504733, -11.98139471,  -1.60703803, ...,   0.        ,
          0.        ,   0.        ],
       [ 16.45809523,  -7.38228691,  -3.20930424, ...,   0.        ,
          0.        ,   0.        ],
       [ 16.66813841,  -9.35245721,  -2.52153975, ...,   0.        ,
          0.        ,   0.        ]])

- Reconstruct the data points using the matrix of projected values and the eigenvectors corresponding to the 4 **biggest** eigenvalues

In [15]:
reconstruction = np.zeros(shape=(len(wheat_seed),8))

for i in range(len(wheat_seed)):
    for j in range(4):    
        reconstruction[i] += projected[i][j] * v[:,j]
        
pd.DataFrame(reconstruction)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 17.252304 | 10.612756 | -0.290964 | 5.565724 | 0.366515 | 2.245378 | 7.007033 | 0.561672 |
| 1 | 16.880856 | 10.311689 | -0.275824 | 5.370138 | 0.394121 | 1.042364 | 6.758463 | 0.558925 |
| 2 | 16.258840 | 9.928360 | -0.265451 | 5.169466 | 0.392076 | 2.723141 | 6.508532 | 0.579633 |
| 3 | 15.834876 | 9.772698 | -0.271421 | 5.142573 | 0.316531 | 2.283957 | 6.486855 | 0.576600 |
| 4 | 18.122199 | 10.811394 | -0.263874 | 5.494739 | 0.581995 | 1.379449 | 6.886821 | 0.573773 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 14.158817 | 9.043273 | -0.294653 | 4.944521 | 0.037050 | 3.655249 | 6.602961 | 2.571522 |
| 206 | 13.187497 | 8.771533 | -0.318893 | 4.971937 | -0.176134 | 4.349295 | 6.672035 | 2.581996 |
| 207 | 15.189247 | 9.486932 | -0.288063 | 5.078417 | 0.206317 | 8.339703 | 6.739219 | 2.577269 |
| 208 | 13.811047 | 9.033258 | -0.314165 | 5.045444 | -0.093963 | 3.622089 | 6.751045 | 2.575980 |


- Calculate the squared reconstruction errors

In [16]:
errors = np.zeros(shape=(len(wheat_seed),1))

for i in range(len(wheat_seed)):
    errors[i] = np.matmul((wheat_seed[i] - reconstruction[i]).T, (wheat_seed[i] - reconstruction[i]))

- Print the squared reconstruction error for each data point

In [17]:
pd.DataFrame(errors, columns=["Errors"])

|  | Errors |
|---|---|
| 0 | 35.290037 |
| 1 | 35.589936 |
| 2 | 34.264450 |
| 3 | 35.127795 |
| 4 | 34.772069 |
| ... | ... |
| 205 | 34.421667 |
| 206 | 33.897060 |
| 207 | 34.947555 |
| 208 | 34.386221 |


- Breakdown of errors when the eigenvectors corresponding to the 4 **biggest** eigenvalues are used

In [18]:
pd.DataFrame(errors, columns = ["errors"]).describe()

|  | errors |
|---|---|
| count | 210.0 |
| mean | 34.357913 |
| std | 0.685971 |
| min | 31.839909 |
| 25% | 33.898034 |
| 50% | 34.364601 |
| 75% | 34.799078 |
| max | 36.861207 |
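
The cells above project the raw rows without subtracting the mean. A common PCA variant, which is an assumption here and not the method used in this tutorial, centres the data first and adds the mean back after reconstruction. Under that variant the mean squared reconstruction error equals $(n-1)/n$ times the sum of the discarded eigenvalues. The sketch below checks this on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data, assumed: 200 points, 5 features with unequal spread and a non-zero mean.
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1]) + 10.0

mu = X.mean(axis=0)
Xc = X - mu                                   # centred data
w, v = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(w)[::-1]                   # largest eigenvalue first
w, v = w[order], v[:, order]

k = 2                                         # number of eigenvectors kept
top = v[:, :k]
X_hat = (Xc @ top) @ top.T + mu               # reconstruct, then add the mean back
errors = np.sum((X - X_hat) ** 2, axis=1)

n = X.shape[0]
expected_mean = (n - 1) / n * w[k:].sum()     # from the discarded eigenvalues
print(errors.mean(), expected_mean)
```

The factor $(n-1)/n$ appears because `np.cov` divides by $n-1$ while the mean error divides by $n$.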


### 4

- Repeat the steps for Q2 with the eigenvectors corresponding to the 4 **smallest** eigenvalues (i.e. 4 $\le$ indices $\le$ 7)
- Project each data point onto these 4 eigenvectors and store the results of projection into a matrix (the remaining 4 columns stay 0)

In [19]:
projected = np.zeros(shape=(len(wheat_seed),8))

for i in range(len(wheat_seed)):
    for j in range(4,8):
        projected[i][j] = np.dot(wheat_seed[i], v[:,j])/np.linalg.norm(v[:,j])
        
projected

array([[ 0.        ,  0.        ,  0.        , ...,  1.97685081,
        -4.4221903 , -1.3158124 ],
       [ 0.        ,  0.        ,  0.        , ...,  2.00418691,
        -4.4243389 , -1.31362472],
       [ 0.        ,  0.        ,  0.        , ...,  2.01391938,
        -4.38566277, -1.3176479 ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  1.98949335,
        -4.47402223, -1.31966865],
       [ 0.        ,  0.        ,  0.        , ...,  2.01347185,
        -4.37781303, -1.31586237],
       [ 0.        ,  0.        ,  0.        , ...,  1.95770899,
        -4.41463441, -1.32004096]])

- Reconstruct the data points using the matrix of projected values and the eigenvectors corresponding to the 4 **smallest** eigenvalues

In [20]:
reconstruction = np.zeros(shape=(len(wheat_seed),8))

for i in range(len(wheat_seed)):
    for j in range(4,8):    
        reconstruction[i] += projected[i][j] * v[:,j]
        
pd.DataFrame(reconstruction)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.992304 | 4.227244 | 1.161964 | 0.197276 | 2.945485 | -0.024378 | -1.787033 | 0.438328 |
| 1 | -2.000856 | 4.258311 | 1.156924 | 0.183862 | 2.938879 | -0.024364 | -1.802463 | 0.441075 |
| 2 | -1.968840 | 4.161640 | 1.170451 | 0.121534 | 2.944924 | -0.024141 | -1.683532 | 0.420367 |
| 3 | -1.994876 | 4.167302 | 1.166921 | 0.181427 | 3.062469 | -0.024957 | -1.681855 | 0.423400 |
| 4 | -1.982199 | 4.178606 | 1.167274 | 0.163261 | 2.980005 | -0.024449 | -1.711821 | 0.426227 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | -1.968817 | 4.156727 | 1.172953 | 0.192479 | 2.943950 | -0.024249 | -1.732961 | 0.428478 |
| 206 | -1.957497 | 4.108467 | 1.169993 | 0.168063 | 2.971134 | -0.024295 | -1.669035 | 0.418004 |
| 207 | -1.989247 | 4.173068 | 1.176363 | 0.157583 | 3.025683 | -0.024703 | -1.683219 | 0.422731 |
| 208 | -1.971047 | 4.176742 | 1.166265 | 0.129556 | 2.929963 | -0.024089 | -1.707045 | 0.424020 |


- Calculate the squared reconstruction errors

In [21]:
errors = np.zeros(shape=(len(wheat_seed),1))

for i in range(len(wheat_seed)):
    errors[i] = np.matmul((wheat_seed[i] - reconstruction[i]).T, (wheat_seed[i] - reconstruction[i]))

- Print the squared reconstruction error for each data point

In [22]:
pd.DataFrame(errors, columns=["Errors"])

|  | Errors |
|---|---|
| 0 | 495.924558 |
| 1 | 467.439766 |
| 2 | 439.982251 |
| 3 | 420.497049 |
| 4 | 525.561221 |
| ... | ... |
| 205 | 370.362035 |
| 206 | 345.801870 |
| 207 | 468.241003 |
| 208 | 363.240614 |


- Breakdown of errors when the eigenvectors corresponding to the 4 **smallest** eigenvalues are used

In [23]:
pd.DataFrame(errors, columns = ["errors"]).describe()

|  | errors |
|---|---|
| count | 210.0 |
| mean | 501.678451 |
| std | 137.546708 |
| min | 315.156162 |
| 25% | 384.774898 |
| 50% | 457.723848 |
| 75% | 618.316532 |
| max | 847.798383 |
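
Q2, Q3 and Q4 repeat the same three steps with different sets of eigenvectors, so they could be wrapped in one helper. The function `mean_sq_error` below is a hypothetical name introduced for this sketch, and the data are synthetic. It follows the tutorial's approach of projecting the rows without centring them.

```python
import numpy as np

def mean_sq_error(X, v, idx):
    """Mean squared reconstruction error using only the eigenvectors in idx."""
    basis = v[:, idx]                  # selected eigenvectors as columns
    X_hat = (X @ basis) @ basis.T      # project, then reconstruct
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(2)
# Toy data, assumed: 100 points, 4 features with very different spreads.
X = rng.normal(size=(100, 4)) * np.array([4.0, 2.0, 0.5, 0.1])

w, v = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(w)[::-1]            # largest eigenvalue first
v = v[:, order]

err_all = mean_sq_error(X, v, [0, 1, 2, 3])   # every eigenvector
err_top = mean_sq_error(X, v, [0, 1])         # two largest eigenvalues
err_bottom = mean_sq_error(X, v, [2, 3])      # two smallest eigenvalues
print(err_all, err_top, err_bottom)
```

On this toy data the ordering matches the pattern seen in the tutorial: essentially zero error with all eigenvectors, a small error with the largest ones, and a much bigger error with the smallest ones.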


### 5

- Compare and comment on the squared reconstruction errors in `Q2`, `Q3`, and `Q4`.

In Q2, where all of the eigenvectors are used, the squared reconstruction error of each data point is extremely close to 0. The errors have a mean of 7.461628 x 10<sup>-25</sup> and a standard deviation of 1.300749 x 10<sup>-26</sup>, both essentially 0. In theory the information loss is exactly 0, because the 8 eigenvectors form an orthonormal basis of the 8-dimensional feature space. The small non-zero values come from floating-point rounding, so the errors are extremely close to 0 but not exactly 0.

In Q3, where the eigenvectors corresponding to the 4 **biggest** eigenvalues are used, the squared reconstruction error is 34.357913 on average, with a standard deviation of 0.685971 and a maximum of only 36.861207. In Q4, where the eigenvectors corresponding to the 4 **smallest** eigenvalues are used, the squared reconstruction error is 501.678451 on average, with a standard deviation of 137.546708; the errors are at least 315.156162 and reach up to 847.798383. Projecting onto the eigenvectors with the smallest eigenvalues discards the directions along which the data vary the most, so more information is lost than when the eigenvectors with the largest eigenvalues are kept. Note that the data points were projected without subtracting the mean first, so the errors in Q3 and Q4 also include the component of the mean along the discarded eigenvectors. Should we decide to do dimensionality reduction, we should project onto the eigenvectors with the largest eigenvalues to preserve as much information as possible.
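
One common way to quantify "as much information as possible" is the share of total variance carried by each eigenvector, i.e. each eigenvalue divided by the sum of all eigenvalues. The sketch below applies this to the eigenvalues printed in cell `In [4]`. It is a heuristic about variance around the mean, so it is not expected to reproduce the uncentred error values reported in Q3 and Q4.

```python
import numpy as np

# Eigenvalues copied from the output of cell In [4]
w = np.array([1.08883260e+01, 2.33107101e+00, 3.96953708e-01, 5.45019865e-02,
              8.49185405e-03, 2.65452400e-03, 1.47991736e-03, 2.53695934e-05])

ratio = w / w.sum()             # share of total variance per eigenvector
cumulative = np.cumsum(ratio)   # share retained by the first k eigenvectors

for k, c in enumerate(cumulative, start=1):
    print(k, round(float(c), 6))
```

The first eigenvector alone accounts for a little under 80% of the total variance, and the first four together account for more than 99.9%.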

### References:

<sup>1</sup> NumPy. (n.d.). Numpy.cov. numpy.cov - NumPy v1.22 Manual. Retrieved February 16, 2022, from https://numpy.org/doc/stable/reference/generated/numpy.cov.html 

<sup>2</sup> NumPy. (n.d.). Numpy.linalg.eig. numpy.linalg.eig - NumPy v1.22 Manual. Retrieved February 16, 2022, from https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html 

<sup>3</sup> Walls, P. (n.d.). Eigenvalues and eigenvectors. Eigenvalues and Eigenvectors - Mathematical Python. Retrieved February 16, 2022, from https://personal.math.ubc.ca/~pwalls/math-python/linear-algebra/eigenvalues-eigenvectors/ 

<sup>4</sup> Gabil, D. (2019, May 21). Eigenvalues and eigenvectors in Python/numpy. ScriptVerse. Retrieved February 16, 2022, from https://scriptverse.academy/tutorials/python-eigenvalues-eigenvectors.html 

<sup>5</sup> CK-12 Foundation. (n.d.). Scalar and vector projections. CK-12 College Precalculus. Retrieved February 16, 2022, from https://flexbooks.ck12.org/cbook/ck-12-college-precalculus/section/9.6/primary/lesson/scalar-and-vector-projections-c-precalc/