# Session 05

## [NumPy](https://numpy.org/doc/)

### Basics


In [6]:
# TODO: import numpy w/ the common alias np
import numpy as np


In [7]:
years = [1977, 1980, 1983, 1999, 2005, 2015, 2017, 2019]

# TODO: create a numpy ndarray from years list
years = np.array(years, dtype=np.int32)  # this overrides the years list
# but we don't really need it
# NOTE: best practice is to avoid overriding variables

# TODO: create a numpy array of 6 entries all of value 5, and type int, call it
# months
# NOTE: you may use numpy's built-in API or use the vectorised operations
months = np.full((6,), 5)
alt_months = 5 * np.ones((6,), dtype=int)  # np.ones defaults to float

# TODO: get the lowest entry in the years array
low_year = years.min()  # also max, mean, std `standard deviation`
# OR:
low_year = np.min(years)  # also max, mean, std

# TODO: insert 2002 in 5th **location** (not index) of the years array
years = np.insert(years, 4, 2002)

# TODO: insert 12 three times at the end of the months array
months = np.append(months, [12] * 3)

# TODO: create a new ndarray, that has both arrays in
#   rows
dates = np.vstack([months, years])
# OR:
# dates = np.stack([months, years])
# NOTE: np.r_[months, years] would concatenate both into a single 1-D array
#   columns, and let's keep that one as dates
dates = np.hstack([months.reshape((-1, 1)), years.reshape((-1, 1))])
# OR:
# dates = np.c_[months, years]

# TODO: print the number of dimensions of the dates matrix, and the dimensions
# themselves
print(dates.ndim, dates.shape)

# TODO: create an ascending array starting at 1 and ending at the number of rows
# in the dates matrix, using only numpy, named idx
n, _ = dates.shape
idx = np.arange(1, n + 1).reshape(n, 1)

# TODO: add idx to the left of the dates matrix
dates = np.hstack([idx, dates])

# TODO in the dates matrix
#   slice for the 2nd column
print(dates[:, 1])
#   slice for the 3rd row
print(dates[2])  # OR: dates[2,:], but we can omit latter dimensions
#   slice for values in 5th to 8th row, 2nd to 3rd column
print(dates[4:8, 1:3])

# TODO: save the dates array to dates.npy file
np.save("dates.npy", dates)

# TODO: get the transpose of the dates array, and reshape it into 3D
print(np.transpose(dates), dates.T, dates.transpose())
dates_3d = dates.T.reshape((3, 3, 3))  # 27 entries -> 3 x 3 x 3, not printed

# TODO: make a copy of the dates matrix, named dup, and replace values between
# 7th row to last row with random floats
dup = np.copy(dates)
dup = dup.astype(float)

randfloats = np.random.random((3, 3))
dup[6:] = randfloats

# TODO: get the main diagonal of dates array
print(np.diag(dates))

# TODO: get the 1st diagonal above the main from dates
print(np.diag(dates, 1))
# TODO: get the 2nd diagonal below the main from dates
print(np.diag(dates, -2))

# TODO: get the unique values from dates
print(np.unique(dates))

# TODO: filter the dates for rows where months are strictly greater than 5
gt_5 = months > 5
print(dates[gt_5])

# TODO: create an array of random integers of size 8, called randi
randi = np.random.randint(0, 10, (8,))

# TODO: sort randi, out-of-place & in-place
out_of_place = np.sort(randi)
print(randi, out_of_place, sep="\n")
randi.sort()

# TODO: create another random integers array, called marti
marti = np.random.randint(0, 10, (8,))

# TODO: find intersection, union & difference amongst randi & marti
print(np.intersect1d(randi, marti))
print(np.union1d(randi, marti))
print(np.setdiff1d(randi, marti))

# TODO: combine both randi & marti as columns into rick, and sort on both axes
rick = np.c_[randi, marti]
np.sort(rick, axis=0), np.sort(rick, axis=1)


2 (9, 2)
[ 5  5  5  5  5  5 12 12 12]
[   3    5 1983]
[[   5 2002]
 [   5 2005]
 [  12 2015]
 [  12 2017]]
[[   1    2    3    4    5    6    7    8    9]
 [   5    5    5    5    5    5   12   12   12]
 [1977 1980 1983 1999 2002 2005 2015 2017 2019]] [[   1    2    3    4    5    6    7    8    9]
 [   5    5    5    5    5    5   12   12   12]
 [1977 1980 1983 1999 2002 2005 2015 2017 2019]] [[   1    2    3    4    5    6    7    8    9]
 [   5    5    5    5    5    5   12   12   12]
 [1977 1980 1983 1999 2002 2005 2015 2017 2019]]
[   1    5 1983]
[   5 1980]
[   3    5 2002]
[   1    2    3    4    5    6    7    8    9   12 1977 1980 1983 1999
 2002 2005 2015 2017 2019]
[[   7   12 2015]
 [   8   12 2017]
 [   9   12 2019]]
[9 9 1 8 5 0 3 9]
[0 1 3 5 8 9 9 9]
[0 1 3 5 9]
[0 1 3 4 5 7 8 9]
[8]


(array([[0, 0],
        [1, 1],
        [3, 3],
        [5, 4],
        [8, 4],
        [9, 5],
        [9, 7],
        [9, 9]]),
 array([[0, 7],
        [1, 4],
        [3, 4],
        [1, 5],
        [5, 8],
        [0, 9],
        [9, 9],
        [3, 9]]))

In [8]:
arr = np.random.random((4, 4))

# TODO: find the mean, standard deviation, & median of arr
flat_mean = np.mean(arr)
flat_std = np.std(arr)
flat_median = np.median(arr)

row_mean = np.mean(arr, axis=1)
row_std = np.std(arr, axis=1)
row_median = np.median(arr, axis=1)

col_mean = np.mean(arr, axis=0)
col_std = np.std(arr, axis=0)
col_median = np.median(arr, axis=0)


## [Broadcasting](https://numpy.org/devdocs/user/basics.broadcasting.html)

Simply put, broadcasting is the ability to (automatically) match the shapes of arrays in element-wise operations.

---

In linear algebra, an operation like $a * \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$ is completely valid & results in $\begin{bmatrix} ab_{11} & ab_{12} \\ ab_{21} & ab_{22}\end{bmatrix}$.

On the other hand, an operation like $a + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$ is invalid, because addition requires either 2 scalars, or 2 vectors (or matrices) of identical dimensions, like $\begin{bmatrix}a_{11} \\ a_{21}\end{bmatrix}+\begin{bmatrix}b_{11} \\ b_{21}\end{bmatrix}\to\begin{bmatrix}a_{11}+b_{11}\\a_{21}+b_{21}\end{bmatrix}$.

Yet if you try scalar $a$ + matrix $B$ in `numpy`, it works. As you might have guessed already, it works because internally `numpy` found a way to match the shapes and so satisfy the rule of linear algebra. That's precisely the idea of broadcasting.
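A minimal sketch of what that shape matching amounts to, assuming a made-up scalar `a` and a made-up 2×2 matrix `B` (neither comes from the exercises). `np.broadcast_to` makes the stretched scalar explicit, and adding it should give the same result as the plain `a + B`:

```python
import numpy as np

a = 3.0  # illustrative scalar
B = np.array([[1.0, 2.0], [3.0, 4.0]])  # illustrative matrix

# What numpy does implicitly: treat the scalar as a matrix of B's shape
a_stretched = np.broadcast_to(a, B.shape)

implicit = a + B            # broadcasting handles the shape mismatch
explicit = a_stretched + B  # same-shape addition, valid in linear algebra

print(np.array_equal(implicit, explicit))
```

No copy of `a` is actually materialised in the implicit version; the stretched array here is only for illustration.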

On checking the documentation, you are provided w/ the simplest rule: start out from the rightmost dimension and work your way to the left. If each pair of dimensions either matches or contains a $1$, then the arrays are broadcast-able.

If one of the arrays has fewer dimensions, the missing ones on the left can be assumed to be $1$ (which is mathematically correct).
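The rule can be sketched in plain Python. The helper name `broadcast_shape` is hypothetical (it is not part of `numpy`), and this is only one possible reading of the rule, not how `numpy` implements it internally:

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    result = []
    # Walk both shapes from the rightmost dimension; missing dimensions count as 1
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            return None  # neither equal nor 1: not broadcast-able
    return tuple(reversed(result))


print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
print(broadcast_shape((4, 3), (3, 1)))           # None
```

Working on shape tuples rather than arrays keeps the check cheap: no array has to be allocated to find out whether an operation would fail.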


In [9]:
# TODO: optional, try to write a function to check for broadcasting using your
# own understanding

# We will leverage only the ones function just to demonstrate the broadcasting
# TODO: without running the cell, comment out what you believe would violate
# broadcasting rules
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a + b

a = np.array([1.0, 2.0, 3.0])
b = 2.0
a + b

a = np.ones((256, 256, 3))
b = np.ones((3,))
a * b

a = np.ones((8, 1, 6, 1))
b = np.ones((7, 1, 5))
a + b

a = np.ones((5, 4))
b = 1.0
a + b

b = np.ones((4,))
a + b

a = np.ones((15, 3, 5))
b = np.ones((15, 1, 5))
a + b

b = np.ones((3, 5))
a + b

b = np.ones((3, 1))
a + b

a = np.ones((4,))
b = np.ones((3,))
# a + b

a = np.ones((2, 1))
b = np.ones((8, 4, 3))
# a + b

a = np.ones((4, 3))
b = np.ones((3))
a + b

b = np.ones((3, 1))
# a + b

b = np.ones((4,))
# a + b

a = np.ones((4, 1))
b = np.ones((1, 3))
a + b

a = np.ones((10, 3))
b = np.ones((5, 1, 3))
a + b


array([[[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]]])

For practice & review, we will try to implement the `PCA` (Principal Component Analysis) algorithm.

## PCA

> For quick reference, check [this video](https://www.youtube.com/watch?v=dsOyN46exG0), and [this video](https://www.youtube.com/watch?v=xB7-b6FSANA) from [Udacity's YouTube](https://www.youtube.com/c/Udacity) channel

![pca_gif](http://www.billconnelly.net/wp-content/uploads/2021/05/PCA1-smaller-smaller.gif)

### Steps for PCA

1. **Standardise** the feature set
2. Calculate **covariance matrix** for _standardised_ set
3. Calculate **eigenvectors** & **eigenvalues** for _covariance matrix_
4. Sort _eigenvalues_ & _eigenvectors_
5. **Pick top $k$** from sorted _eigenvalues_, and form a matrix from corresponding _eigenvectors_
6. **Transform** _standardised_ set using the _k-eigenvectors_ matrix

> First try each step on its own, then combine them into a subroutine/function


In [10]:
# TODO: we need the os module for path creation
import os

# NOTE: define globals/constants
DATA_PATH = "../data/raw/toy"
RAW_DATA_FILE = "raw.npy"
STD_DATA_FILE = "standard.npy"
COV_MAT_FILE = "covariance.npy"
EIG_VECT_FILE = "e_vect.npy"
EIG_VAL_FILE = "e_val.npy"


In [11]:
# TODO: read in the toy data set 'raw.npy' located at DATA_PATH, save it into
# variable called raw
raw = np.load(os.path.join(DATA_PATH, RAW_DATA_FILE))
raw


array([[1, 6, 9, 1],
       [3, 7, 4, 9],
       [3, 5, 3, 7],
       [5, 9, 7, 2],
       [4, 9, 2, 9]])

Standardisation is the process of ensuring the data set at hand has a mean of $0$ and a standard deviation of $1$

The formula is

$$
X_{std} = \frac{X-\bar{X}}{\sigma_X}
$$

> NOTE: this operation is done on feature level (i.e. done on column level, you'd use the axis argument w/ `numpy`)


In [12]:
# TODO: standardise the raw set data, then read & compare w/ standardised data
# TODO: find the mean
mean = np.mean(raw, axis=0)

# TODO: find the standard deviation
std = np.std(raw, axis=0)

# TODO: apply the formula
standard = (raw - mean) / std

# TODO: read the standard data set
std_set = np.load(os.path.join(DATA_PATH, STD_DATA_FILE))

# TODO: compare the values
# HINT: use numpy.all, or numpy.allclose to tolerate floating-point rounding
np.allclose(std_set, standard)


True

Now that the standardised set is calculated, it's time to find the covariance.

Covariance is a bivariate moment that measures how two variables vary together. Its raw magnitude depends on the scale of the variables, but on a standardised set it can be read roughly like a correlation:

- a covariance around $1$ (e.g. the covariance of a variable w/ itself) means highly positively related,
- a covariance around $0$ means no linear relation, and
- a covariance around $-1$ means highly negatively related.


The formula is

$$
\sigma_{x,y}=\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
$$

> NOTE: we will not implement this formula, as `numpy` already has a ready-made function for it, use the documentation (or search engine) to find it
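Purely as a cross-check (the exercise itself sticks to the ready-made function), the formula can be written out in a few lines. This is a sketch that rebuilds the toy set from the values displayed earlier and compares the hand-written result against `np.cov`; the names `centred` and `manual_cov` are illustrative:

```python
import numpy as np

# Toy set as displayed earlier in the notebook
raw = np.array([[1, 6, 9, 1],
                [3, 7, 4, 9],
                [3, 5, 3, 7],
                [5, 9, 7, 2],
                [4, 9, 2, 9]])
standard = (raw - raw.mean(axis=0)) / raw.std(axis=0)

n = standard.shape[0]
centred = standard - standard.mean(axis=0)  # mean is already ~0, kept for generality

# Entry (x, y) sums centred[:, x] * centred[:, y] over the rows, divided by n - 1
manual_cov = centred.T @ centred / (n - 1)

print(np.allclose(manual_cov, np.cov(standard, rowvar=False)))
print(np.diag(manual_cov))
```

The diagonal comes out as $n/(n-1)$ rather than exactly $1$, because `np.std` divides by $n$ by default while the covariance formula divides by $n-1$.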


In [13]:
# TODO: search the documentation (or use search engine) to find which numpy
# function can be used to find covariance, and use to find covariance of the
# standard set
cov = np.cov(standard, rowvar=False)
cov

# TODO: read in the covariance matrix and compare w/ your output
covariance = np.load(os.path.join(DATA_PATH, COV_MAT_FILE))
cov == covariance


array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

Well done, you've found the covariance (well, the function for it), but you get the idea.

Now it is time to find eigenvalues & eigenvectors. There is a lot of math behind those 2 words, but we don't need to worry about that just yet.

Again, use a search engine or the `numpy` documentation to find which module, and which function, gives you back the eigenvectors & eigenvalues.

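> NOTE: the `numpy` utility does not guarantee any particular order for the eigenvalues it returns, so the sorting step is still yours to check: if the o/p is not already sorted descendingly w/ respect to eigenvalues, sort it and reorder the eigenvector columns to match.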
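Since `numpy.linalg.eig` documents its eigenvalues as not necessarily ordered, an explicit sort is a cheap safeguard. A minimal sketch, using a small made-up symmetric matrix `mat` rather than the exercise's covariance matrix:

```python
import numpy as np

# Illustrative symmetric matrix (not from the exercise data)
mat = np.array([[2.0, 1.0],
                [1.0, 3.0]])

vals, vects = np.linalg.eig(mat)

# Indices that order the eigenvalues from largest to smallest
order = np.argsort(vals)[::-1]
sorted_vals = vals[order]
sorted_vects = vects[:, order]  # eigenvectors are columns, so reorder the columns

print(sorted_vals)
```

Reordering the columns w/ the same index array keeps every eigenvector paired w/ its own eigenvalue.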


In [14]:
# TODO: search the numpy documentation, or use search engine to find how to use
# numpy to find eigenvalues & eigenvectors
eig_vals, eig_vects = np.linalg.eig(cov)
eig_vals, eig_vects
# TODO: read in eigenvalues & eigenvectors and check against your results
e_vals = np.load(os.path.join(DATA_PATH, EIG_VAL_FILE))
e_vect = np.load(os.path.join(DATA_PATH, EIG_VECT_FILE))

np.all(eig_vals == e_vals) and np.all(e_vect == eig_vects)


True

For the final steps, select a number of components $k$ in the range $[1,n]$, where $n$ is the number of dimensions (columns) you have. Any $k<n$ reduces the dimensionality, while $k=n$ keeps every component.

Now slice the eigenvectors matrix for the first $k$ columns


In [15]:
# TODO: select k
k = 4

# TODO: slice up your eigenvectors matrix
transformer = eig_vects[:, :k]


Now using matrix multiplication (we'll search for that as well), transform the standardised set
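As a quick shape check, here is a sketch w/ placeholder arrays (`data` and `components` are illustrative stand-ins, filled w/ random numbers): a $(5, 4)$ matrix times a $(4, k)$ matrix gives a $(5, k)$ result, and for 2-D arrays `@`, `np.dot` & `np.matmul` should agree.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
data = rng.random((5, 4))        # stand-in for the standardised set: 5 rows, 4 features
components = rng.random((4, 2))  # stand-in for k = 2 eigenvector columns

projected = data @ components    # (5, 4) @ (4, 2) -> (5, 2)

print(projected.shape)
print(np.allclose(projected, np.dot(data, components)))
print(np.allclose(projected, np.matmul(data, components)))
```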


In [16]:
# TODO: search how to do matrix multiplication (dot product) using numpy and
# transform the standardised set
pca_prod = np.dot(standard, transformer)


In [None]:
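def do_pca(data: np.ndarray, k: int = -1) -> np.ndarray:
    """A function to manually apply PCA on a given dataset

    Parameters
    ----------
    data: np.ndarray
        the data set to apply PCA on, dimensions/features are on columns
    k: int, default = -1
        the number of components to keep at the end, when -1 (default) use
        all features

    Returns
    -------
    np.ndarray
        the standardised data projected onto the top-k eigenvectors
    """
    if k == -1:
        _, k = data.shape

    # Step 1: standardise each feature (column)
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    data = (data - mean) / std

    # Step 2: covariance matrix of the standardised set
    cov = np.cov(data, rowvar=False)

    # Step 3: eigenvalues & eigenvectors (eigenvectors are the columns of v)
    w, v = np.linalg.eig(cov)

    # Step 4: sort descendingly by eigenvalue, as eig does not guarantee an order
    order = np.argsort(w)[::-1]
    v = v[:, order]

    # Steps 5 & 6: keep the top-k eigenvectors and project the data onto them
    return data @ v[:, :k]  # alternative for np.dot(data, v[:, :k])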


And, we're done.

Hope you enjoyed the exercise!

The next cell is outside the content of the nanodegree; it is only there to validate the results we obtained.

> NOTE: the next cell applies PCA w/ $k=n$, i.e. on the full feature set. To validate against it, either repeat your last step w/ `k` set to the number of features (columns), or pass `n_components=k` when instantiating `PCA`. Eigenvectors are only defined up to sign, so individual columns may come out negated.


In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
std_data = scaler.fit_transform(raw)
decomposer = PCA()  # you may set `n_components` as an argument here
result = decomposer.fit_transform(std_data)
result


array([[ 2.69976873, -0.0957599 , -0.38522941, -0.07924949],
       [-0.63847181, -0.7477542 , -0.29299551,  0.33035357],
       [-0.10795215, -1.43832606,  0.76020549, -0.09370268],
       [-0.05389867,  2.1457436 ,  0.42226445,  0.06273794],
       [-1.8994461 ,  0.13609655, -0.50424503, -0.22013935]])
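When comparing a manual result against the output above, an element-wise equality check can fail even when both are correct, because each eigenvector may be negated. A sketch of a sign-insensitive comparison follows; the helper `same_up_to_sign` is hypothetical and assumes both results list their components in the same order:

```python
import numpy as np


def same_up_to_sign(a, b, tol=1e-8):
    """Check whether each column of `a` equals the matching column of `b` or its negation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    same = np.all(np.abs(a - b) <= tol, axis=0)     # column matches as-is
    flipped = np.all(np.abs(a + b) <= tol, axis=0)  # column matches after a sign flip
    return bool(np.all(same | flipped))


reference = np.array([[1.0, -2.0], [3.0, 4.0]])
candidate = reference * np.array([1.0, -1.0])  # flip the sign of the second column

print(same_up_to_sign(reference, candidate))
```

The tolerance `tol` is an arbitrary choice for this sketch; loosen it if the two results come from different numerical routines.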

## Thank you

## Good Luck

## Have Fun
