In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create an Anaconda environment
<br>
```bash
conda create -n ml python=3.7.4 jupyter
```
Install the fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

# Special types of matrices

The identity matrix $I_{n} \in \mathbb{R}^{n \times n}$ is a matrix that leaves any other matrix unchanged under multiplication. It has ones on the main diagonal and zeros everywhere else:
$$\begin{align} I_{n} &= \begin{pmatrix}
           1 & 0 & \dots & 0 \\
           0 & 1 & \dots & 0 \\
           \vdots & \vdots & \ddots & \vdots \\
           0 & 0 & \dots & 1 \\
         \end{pmatrix}
  \end{align}$$
<br>
Equivalently, it can be defined by the property that $aI_{n} = a$ holds for every $a \in \mathbb{R}^{1 \times n}$ and $I_{n}a = a$ holds for every $a \in \mathbb{R}^{n \times 1}$.

In [2]:
import numpy as np

In [3]:
I = np.identity(4)

In [4]:
I

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [5]:
A = np.random.random(size=(4, 5))
B = np.random.random(size=(6, 4))

In [6]:
A

array([[0.06408966, 0.63815018, 0.98566118, 0.60040464, 0.26895881],
       [0.56413004, 0.80495305, 0.42470747, 0.76482401, 0.84452987],
       [0.55984012, 0.69449168, 0.88445787, 0.02666495, 0.24918927],
       [0.10106618, 0.59648755, 0.70033257, 0.77702948, 0.96643059]])

In [7]:
B

array([[0.76903034, 0.96546203, 0.97486059, 0.7271405 ],
       [0.37869849, 0.5219698 , 0.2023284 , 0.1501247 ],
       [0.78487406, 0.28200862, 0.05506212, 0.67700607],
       [0.9554109 , 0.80970355, 0.70969572, 0.73534259],
       [0.69673117, 0.37909922, 0.60046858, 0.08526293],
       [0.74982353, 0.89713497, 0.35076501, 0.08630888]])

In [8]:
I @ A

array([[0.06408966, 0.63815018, 0.98566118, 0.60040464, 0.26895881],
       [0.56413004, 0.80495305, 0.42470747, 0.76482401, 0.84452987],
       [0.55984012, 0.69449168, 0.88445787, 0.02666495, 0.24918927],
       [0.10106618, 0.59648755, 0.70033257, 0.77702948, 0.96643059]])

In [9]:
B @ I

array([[0.76903034, 0.96546203, 0.97486059, 0.7271405 ],
       [0.37869849, 0.5219698 , 0.2023284 , 0.1501247 ],
       [0.78487406, 0.28200862, 0.05506212, 0.67700607],
       [0.9554109 , 0.80970355, 0.70969572, 0.73534259],
       [0.69673117, 0.37909922, 0.60046858, 0.08526293],
       [0.74982353, 0.89713497, 0.35076501, 0.08630888]])

The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$, when it exists, is the matrix $A^{-1} \in \mathbb{R}^{n \times n}$ for which $A^{-1}A = AA^{-1} = I_{n}$

In [10]:
A = np.random.random(size=(8, 8))
A

array([[0.6125751 , 0.15888746, 0.02223882, 0.76618076, 0.73241619,
        0.6289349 , 0.226645  , 0.65588465],
       [0.94065533, 0.58736191, 0.26012329, 0.7874319 , 0.60034295,
        0.06654736, 0.08035717, 0.04901165],
       [0.89697854, 0.29096958, 0.80725843, 0.64895212, 0.3339324 ,
        0.78196143, 0.32624645, 0.8055663 ],
       [0.79326052, 0.23051901, 0.92566027, 0.43530543, 0.34541688,
        0.89705842, 0.26389673, 0.13180854],
       [0.33051925, 0.91756976, 0.72004206, 0.38663996, 0.61841383,
        0.82386967, 0.14516633, 0.61626046],
       [0.71271661, 0.66551882, 0.28096133, 0.34088832, 0.65400256,
        0.48549735, 0.68552684, 0.69690502],
       [0.81669031, 0.79964656, 0.3127977 , 0.66943166, 0.84117798,
        0.02634866, 0.32596186, 0.32676447],
       [0.12712444, 0.55476692, 0.81661922, 0.19491048, 0.19584648,
        0.60718235, 0.46769667, 0.82171419]])

In [11]:
invA = np.linalg.inv(A)
invA

array([[-1.2335457 , -0.65363739,  2.11598374, -0.38724683,  0.53059156,
         0.96910526,  0.25370425, -2.30941815],
       [-0.61513068,  2.52377707, -0.76162407, -0.75500925,  0.84358705,
         1.0521448 , -2.22354802,  0.56744522],
       [-0.38682318, -1.91169979,  0.60232431,  0.75926954, -0.30995406,
        -1.49729754,  2.49048166,  0.22246454],
       [ 1.60950774,  3.76061659, -2.28501426,  0.05283267, -1.35486093,
        -0.75005324, -2.3954056 ,  3.32743158],
       [ 0.45913076, -4.14241846,  0.42817648,  0.87833232,  0.39960581,
        -1.14464426,  4.20210662, -1.67997384],
       [ 0.51578528,  1.38777084, -0.93476681,  0.412884  ,  0.62467068,
         0.93994577, -2.44229189,  0.06124253],
       [ 0.74598108,  1.79319836, -2.48813005,  0.92901548, -1.90996197,
         1.07365698, -1.37184952,  2.65518742],
       [-0.30636255, -1.6537287 ,  2.13509933, -1.24065116,  0.50805762,
        -0.22716735,  1.13904757, -0.97531139]])

In [12]:
Ia = invA @ A
print(Ia)

[[ 1.00000000e+00 -9.89908496e-17 -3.79891541e-17  6.39680135e-16
   2.87299269e-16  3.42368960e-16  3.58438270e-17 -2.03454327e-16]
 [ 9.42725243e-17  1.00000000e+00 -4.20154056e-16 -6.42350847e-17
   9.38383636e-17 -5.12828326e-16 -1.76038883e-16 -1.69825984e-16]
 [-8.15130108e-17  1.92389592e-16  1.00000000e+00  2.28034486e-16
  -2.97130533e-17  2.16630240e-16  1.45256877e-16  1.99948539e-16]
 [-3.35591407e-16  4.89921728e-16  4.89333616e-16  1.00000000e+00
   1.39395748e-16 -1.19454826e-16 -9.31906842e-17  6.38728856e-18]
 [-9.64345199e-16 -3.00190048e-16 -3.61757625e-17 -1.09390395e-15
   1.00000000e+00 -8.63274299e-17  2.06019002e-16  4.17368185e-16]
 [-1.50753297e-16 -8.99956809e-17 -2.22058098e-17 -2.21182031e-16
  -9.45495772e-17  1.00000000e+00 -1.46093413e-16 -9.55137907e-17]
 [ 8.88178420e-16  8.88178420e-16  0.00000000e+00 -4.44089210e-16
   0.00000000e+00 -4.44089210e-16  1.00000000e+00  4.44089210e-16]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.44089210e-16
   

In [13]:
np.round(Ia)

array([[ 1., -0., -0.,  0.,  0.,  0.,  0., -0.],
       [ 0.,  1., -0., -0.,  0., -0., -0., -0.],
       [-0.,  0.,  1.,  0., -0.,  0.,  0.,  0.],
       [-0.,  0.,  0.,  1.,  0., -0., -0.,  0.],
       [-0., -0., -0., -0.,  1., -0.,  0.,  0.],
       [-0., -0., -0., -0., -0.,  1., -0., -0.],
       [ 0.,  0.,  0., -0.,  0., -0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

## Vector space 

For all $v, u \in \mathbb{R}^{n}$ and every $\alpha \in \mathbb{R}^{1}$ we have $u + v \in \mathbb{R}^{n}$ and $\alpha u \in \mathbb{R}^{n}$
<br>
So we have addition and multiplication by a scalar, with the following properties:
- for every $u, v, w \in \mathbb{R}^{n}$: $(u + v) + w = u + (v + w)$
- for every $u, v \in \mathbb{R}^{n}$: $u + v = v + u$
- there exists $0 \in \mathbb{R}^{n}$ such that for every $u \in \mathbb{R}^{n}$: $0 + u = u + 0 = u$
- for every $u \in \mathbb{R}^{n}$ there exists $-u \in \mathbb{R}^{n}$ such that: $u + (-u) = (-u) + u = 0$
- for every $\alpha, \beta \in \mathbb{R}^{1}$ and every $u \in \mathbb{R}^{n}$: $\alpha(\beta u) = (\alpha \beta) u$
- for every $u \in \mathbb{R}^{n}$: $1u = u$
- for every $u, v \in \mathbb{R}^{n}$ and $\alpha \in \mathbb{R}^{1}$: $\alpha (u + v) = \alpha u + \alpha v$
- for every $\alpha, \beta \in \mathbb{R}^{1}$ and every $u \in \mathbb{R}^{n}$: $(\alpha + \beta)u = \alpha u + \beta u$
<br>
Subtraction and division by a nonzero scalar can therefore be defined as well: $u - v = u + (-v)$ and $u / \alpha = \frac{1}{\alpha}u$ for $\alpha \neq 0$
<br>
A structure that satisfies these properties is called a vector space
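
The vector space properties can be spot-checked numerically for $\mathbb{R}^{5}$. The sketch below is only an illustration on random vectors, not a proof; the seed and the names `checks` and `zero` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
u, v, w = rng.random(5), rng.random(5), rng.random(5)
alpha, beta = rng.random(), rng.random()
zero = np.zeros(5)

# One entry per axiom; np.allclose absorbs floating-point rounding.
checks = {
    "associativity": np.allclose((u + v) + w, u + (v + w)),
    "commutativity": np.allclose(u + v, v + u),
    "zero vector": np.allclose(zero + u, u),
    "additive inverse": np.allclose(u + (-u), zero),
    "scalar associativity": np.allclose(alpha * (beta * u), (alpha * beta) * u),
    "unit scalar": np.allclose(1 * u, u),
    "distributivity over vectors": np.allclose(alpha * (u + v), alpha * u + alpha * v),
    "distributivity over scalars": np.allclose((alpha + beta) * u, alpha * u + beta * u),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```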

In [14]:
import random

In [15]:
u = np.random.random(5)
v = np.random.random(5)
x = random.random()
y = random.random()
w = u * x + v * y
u, v, x, y, w


(array([0.55092915, 0.66542492, 0.11372875, 0.73760953, 0.41543302]),
 array([0.57892338, 0.55029794, 0.23984465, 0.84579504, 0.16675286]),
 0.005529069115488339,
 0.9436957463169117,
 array([0.54937365, 0.522993  , 0.22696919, 0.80225147, 0.15966093]))

## Linear combination

Let $v_1, v_2, \dots, v_n \in \mathbb{R}^{n}$ and $\alpha_{1}, \alpha_{2}, \dots, \alpha_{n} \in \mathbb{R}^{1}$. The vector $w = \sum_{i=1}^{n}\alpha_{i} v_{i}$ is called a linear combination of these vectors.

Vectors $v_1, v_2, \dots, v_n \in \mathbb{R}^{n}$ are called linearly independent if none of them can be written as a linear combination of the others, that is, for each $i \in \{1, \dots, n\}$ there are no $\alpha_{1}, \alpha_{2}, \dots, \alpha_{i-1}, \alpha_{i+1}, \dots, \alpha_{n} \in \mathbb{R}^{1}$ such that $v_i = \sum_{k = 1, k \neq i}^{n}\alpha_{k}v_k$

The maximum number of linearly independent vectors in a vector space is called the dimension of the space.
<br>
Every vector space has a basis: linearly independent vectors $e_1, e_2, \dots, e_n$ such that every vector of the space can be obtained as a linear combination of them, $u = \sum_{i=1}^n\alpha_{i}e_i$. The scalars $(\alpha_{1}, \alpha_{2}, \dots, \alpha_{n})$ are called the coordinates of the vector $u$ in this basis.
<br>
For $\mathbb{R}^{n}$ the standard basis is $e_1 = (1, 0, \dots, 0), e_2 = (0, 1, \dots, 0), \dots, e_i = (0, 0, \dots, 1, \dots, 0), \dots, e_n = (0, 0, \dots, 1)$, where $e_i$ has its single 1 in position $i$.
<br>
For example, for $\mathbb{R}^{2}$ the basis is $e_1=(1, 0), e_2=(0, 1)$ and for $\mathbb{R}^{3}$ it is $e_1=(1, 0, 0), e_2 = (0, 1, 0), e_3=(0, 0, 1)$
<br>
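
Linear independence and coordinates in a basis can both be checked with NumPy. The following is a minimal sketch with hypothetical example vectors: the rank test and `np.linalg.solve` are one possible way to do this, not the only one.

```python
import numpy as np

# Three vectors in R^3 stored as the columns of V (hypothetical example values).
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# The columns are linearly independent exactly when the rank equals their count.
independent = np.linalg.matrix_rank(V) == V.shape[1]

# Replace the third column by the sum of the first two: the set becomes dependent.
W = V.copy()
W[:, 2] = W[:, 0] + W[:, 1]
dependent = np.linalg.matrix_rank(W) < W.shape[1]

# Coordinates of u in the basis formed by the columns of V: solve V @ alpha = u.
u = np.array([2.0, 3.0, 4.0])
alpha = np.linalg.solve(V, u)

print(independent, dependent)
print(alpha)  # coordinates such that u = sum_i alpha[i] * V[:, i]
```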

## Linear maps

A map $f:\mathbb{U} \to \mathbb{V}$ between vector spaces $\mathbb{U}$ and $\mathbb{V}$ is called linear (or a linear transformation) if for every $u, v \in \mathbb{U}$ and every scalar $\alpha \in \mathbb{R}^{1}$ we have:
- $f(u + v) = f(u) + f(v)$
- $f(\alpha u) = \alpha f(u)$

Let $e_1, e_2, \dots, e_n$ be a basis of the linear space $\mathbb{U}$ and $l_1, l_2, \dots, l_m$ a basis of $\mathbb{V}$.
Then for each $i$ we have $f(e_i) = a_{1i}l_1 + a_{2i}l_2 + \dots + a_{mi}l_m$ for some $a_{1i}, \dots, a_{mi} \in \mathbb{R}^{1}$.
Collecting these coefficients column by column gives the following matrix:
$$\begin{align} T &= \begin{pmatrix}
           a_{11} & a_{12} & \dots & a_{1n} \\
           a_{21} & a_{22} & \dots & a_{2n} \\
           \vdots & \vdots & \ddots & \vdots \\
           a_{m1} & a_{m2} & \dots & a_{mn} \\
         \end{pmatrix}
  \end{align}$$
<br>
which is called the transformation matrix.
<br>
For each $u \in \mathbb{U}$ there exist $b_1, b_2, \dots, b_n \in \mathbb{R}^{1}$ such that $u = b_1e_1 + b_2e_2 + \dots + b_ne_n$, and by linearity we have $$f(u) = b_1f(e_1) + b_2f(e_2) + \dots + b_nf(e_n)$$
Substituting $f(e_i) = a_{1i}l_1 + a_{2i}l_2 + \dots + a_{mi}l_m$, the coordinates of $f(u)$ in the basis $l_1, \dots, l_m$ are given by:
$$f(u) = Tb$$
where $b = (b_1, b_2, \dots, b_n)^{T}$.
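
As an illustration, the transformation matrix can be built column by column from the images of the basis vectors. The map `f` below is a hypothetical linear map $\mathbb{R}^{2} \to \mathbb{R}^{3}$ chosen for this sketch, and the standard bases are assumed on both sides.

```python
import numpy as np

def f(u):
    """A hypothetical linear map R^2 -> R^3, used only for illustration."""
    x, y = u
    return np.array([x + 2 * y, 3 * x, x - y])

# Column i of T holds the coordinates of f(e_i) in the standard basis of R^3.
E = np.eye(2)
T = np.column_stack([f(E[:, i]) for i in range(2)])

b = np.array([0.5, -1.5])  # coordinates of some u in the standard basis of R^2

print(T)
print(np.allclose(f(b), T @ b))  # applying f agrees with multiplying by T
```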

## Eigenvectors and Eigenvalues

If for a linear transformation $T:\mathbb{U} \to \mathbb{U}$ with transformation matrix $M$ there exist a nonzero vector $u \in \mathbb{U}$ and a scalar $\lambda \in \mathbb{R}^{1}$ such that $T(u) = Mu = \lambda u$, then $u$ is called an eigenvector of the transformation $T$ and $\lambda$ is called the corresponding eigenvalue

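If $Mu = \lambda u$ then $Mu - \lambda u = 0$, and so
$$(M - I\lambda)u = 0$$

This equation has a nonzero solution $u$ exactly when $|M - I\lambda|=0$. Solving this for $\lambda$ gives the eigenvalues, and substituting each eigenvalue back into $(M - I\lambda)u = 0$ gives the corresponding eigenvectors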

Suppose the $n \times n$ transformation matrix $A$ has $n$ linearly independent eigenvectors $v_1, v_2, \dots, v_n$ with eigenvalues $\lambda_{1}, \lambda_{2}, \dots, \lambda_{n}$, and let $Q = (v_1, v_2, \dots, v_n)$ be the matrix whose columns are these eigenvectors. Then:
$$AQ = (\lambda_{1}v_1, \lambda_{2}v_2, \dots, \lambda_{n}v_n)$$
Define:
$$\begin{align} \Lambda &= \begin{pmatrix}
           \lambda_{1} & 0 & \dots & 0 \\
           0 & \lambda_{2} & \dots & 0 \\
           \vdots & \vdots & \ddots & \vdots \\
           0 & 0 & \dots & \lambda_{n} \\
         \end{pmatrix}
  \end{align}$$
<br>
$$AQ = Q\Lambda$$
<br>
$$AQQ^{-1}=Q\Lambda Q^{-1}$$
<br>
$$AI=Q\Lambda Q^{-1}$$
<br>
$$A=Q\Lambda Q^{-1}$$
<br>
This is called the eigendecomposition of the matrix $A$
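
The decomposition can be reproduced with `np.linalg.eig`. This is a small sketch on a hypothetical symmetric $2 \times 2$ matrix; a symmetric matrix is assumed here only so that the eigenvalues are guaranteed to be real.

```python
import numpy as np

# Hypothetical symmetric example matrix (real eigenvalues guaranteed).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, Q = np.linalg.eig(A)  # the columns of Q are the eigenvectors
Lam = np.diag(eigvals)

# Each pair satisfies A v = lambda v, and each lambda solves |A - I lambda| = 0.
pairs_ok = all(np.allclose(A @ v, lam * v) for lam, v in zip(eigvals, Q.T))
char_poly_zero = all(
    np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0) for lam in eigvals
)

# Rebuild A from its eigendecomposition A = Q Lam Q^{-1}.
A_rebuilt = Q @ Lam @ np.linalg.inv(Q)

print(eigvals)
print(pairs_ok, char_poly_zero, np.allclose(A, A_rebuilt))
```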

## Visualisation of Eigenvalues and Eigenvectors

Consider the transformation matrix A again and take many vectors from the vector space on which it acts. Among them we can find vectors that stay on the same line through the origin after the transformation, although they may be scaled by some factor.

Each such vector is called an eigenvector of A: the transformation does not rotate it off its line, it only scales it. The scaling factor is the eigenvalue of that eigenvector.

Each eigenvector has its own eigenvalue, and a matrix can have several such pairs, one for each direction that the transformation preserves.

### Eigenvalues and Eigenvectors example

![SegmentLocal](images/la2/eigenvalues_and_vectors.gif)

### Non-Eigenvalues and Eigenvectors example

![SegmentLocal](images/la2/non_eigenvalues_and_vectors.gif)

## Vector (Matrix) norm

$$||x||_{p} = \sqrt[p]{\sum_{i=1}^{n}|x_{i}|^{p}}$$
or equivalently
$$||x||_{p} = (\sum_{i=1}^{n}|x_{i}|^{p})^{1/p}$$
<br>
$L_2$ norm
$$||x||_{2} = (\sum_{i=1}^{n}x_{i}^{2})^{1/2}$$
$L_1$ norm
$$||x||_{1} = \sum_{i=1}^{n}|x_{i}|$$
<br>
$L_\infty$ (max) norm
$$||x||_{\infty} = \max_{i}|x_{i}|$$

For matrices, the Frobenius norm:
$$||A||_{F} = (\sum_{i=1}^{n}\sum_{j=1}^{m}a_{ij}^{2})^{1/2}$$
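
These norms can be computed by hand and compared with `np.linalg.norm`. The vector `x` and matrix `M` below are hypothetical example values.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# Norms written out from their definitions.
l1 = np.sum(np.abs(x))
l2 = np.sqrt(np.sum(x ** 2))
linf = np.max(np.abs(x))

# The same norms through np.linalg.norm and its `ord` argument.
print(l1, np.linalg.norm(x, ord=1))
print(l2, np.linalg.norm(x, ord=2))
print(linf, np.linalg.norm(x, ord=np.inf))

# Frobenius norm of a matrix: square root of the sum of squared entries.
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.sqrt(np.sum(M ** 2))
print(fro, np.linalg.norm(M, ord="fro"))
```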

## Determinant as a scaling factor

Here is a link to the [video](https://www.youtube.com/watch?v=Ip3X9LOh2dk) about determinants

Take two vectors x, y in the plane and a transformation matrix A.
Multiplying these vectors by the transformation matrix gives the transformed vectors Ax and Ay.
The area of the parallelogram spanned by the transformed vectors equals the area spanned by the original vectors multiplied by the absolute value of the determinant of A.

![SegmentLocal](images/la2/determinant_as_scaling_factor.gif)
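
The scaling property can be checked on a small 2D case. In this sketch the matrix `A` and the helper `parallelogram_area` are hypothetical names introduced for illustration.

```python
import numpy as np

def parallelogram_area(x, y):
    """Area of the parallelogram spanned by the 2D vectors x and y."""
    return abs(x[0] * y[1] - x[1] * y[0])

# Hypothetical transformation matrix.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

area_before = parallelogram_area(x, y)
area_after = parallelogram_area(A @ x, A @ y)

# The ratio of the two areas matches |det(A)|.
print(area_after / area_before, abs(np.linalg.det(A)))
```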

## SVD

For any $A \in \mathbb{R}^{n \times m} (\mathbb{C}^{n \times m})$ there exists a decomposition:
$$A = U \Sigma V^{T}$$ 
where $U \in \mathbb{R}^{n \times n}(\mathbb{C}^{n \times n})$ is a square orthogonal matrix, $\Sigma \in \mathbb{R}^{n \times m} (\mathbb{C}^{n \times m})$ is a rectangular diagonal matrix with the non-negative singular values on its diagonal, and $V \in \mathbb{R}^{m \times m} (\mathbb{C}^{m \times m})$ is also a square orthogonal matrix
#### Note: We only discuss real valued vectors, matrices and tensors in this course

![title](images/la2/svg1.png)

A short [tutorial](http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm) on SVD from MIT

A very interesting Medium [blog](https://medium.com/@jonathan_hui/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491) post on SVD & PCA

![title](images/la2/svd_steps.jpeg)

In [16]:
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
A, A.shape

(array([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]), (4, 3))

In [17]:
U, S, V_T = np.linalg.svd(A)
U, S, V_T, U.shape, S.shape, V_T.shape

(array([[-0.14087668,  0.82471435,  0.53358462,  0.12364244],
        [-0.34394629,  0.42626394, -0.8036038 ,  0.2328539 ],
        [-0.54701591,  0.02781353,  0.00645373, -0.83663514],
        [-0.75008553, -0.37063688,  0.26356544,  0.48013879]]),
 array([2.54624074e+01, 1.29066168e+00, 1.38648772e-15]),
 array([[-0.50453315, -0.5745157 , -0.64449826],
        [-0.76077568, -0.05714052,  0.64649464],
        [-0.40824829,  0.81649658, -0.40824829]]),
 (4, 4),
 (3,),
 (3, 3))

In [18]:
Sg = np.diag(S)
Sg, Sg.shape

(array([[2.54624074e+01, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 1.29066168e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 1.38648772e-15]]), (3, 3))

In [19]:
np.linalg.norm(A), np.round(U[:, 0] @ U[:, 1].T), np.linalg.norm(U[2, :]), np.linalg.norm(V_T)

(25.495097567963924, -0.0, 1.0, 1.7320508075688772)
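
Note that `np.diag(S)` is a $3 \times 3$ matrix, so it cannot be multiplied by the $4 \times 4$ matrix `U` directly. A minimal sketch of rebuilding `A` is shown below; it assumes the default `full_matrices=True` behaviour of `np.linalg.svd`, and the names `Sigma` and `A_rebuilt` as well as the rank threshold are choices made for this example.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
U, S, V_T = np.linalg.svd(A)  # U: (4, 4), S: (3,), V_T: (3, 3)

# Embed the singular values in a rectangular 4 x 3 "diagonal" matrix.
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)

A_rebuilt = U @ Sigma @ V_T
print(np.allclose(A, A_rebuilt))

# U and V are orthogonal: their transposes are their inverses.
print(np.allclose(U.T @ U, np.eye(4)), np.allclose(V_T @ V_T.T, np.eye(3)))

# The number of singular values above a small threshold is the rank of A.
rank = int(np.sum(S > 1e-10))
print(rank)
```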

# Matrix Decompositions

[fastai LA](https://nbviewer.jupyter.org/github/fastai/numerical-linear-algebra/blob/master/nbs/1.%20Why%20are%20we%20here.ipynb#Matrix-Decompositions)

[advanced matrix decompositions](https://sites.google.com/site/igorcarron2/matrixfactorizations)

[NMF tutorial](https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf)

[topic modeling](https://medium.com/@nixalo/comp-linalg-l2-topic-modeling-with-nmf-svd-78c94330d45f)

[background removal using svd](https://medium.com/@siavashmortezavi/fast-randomized-svd-singular-value-decomposition-using-pytorch-and-gpus-46b627511a6d)

## PCA

### Data Reduction
- PCA is most commonly used to condense the information contained in a large number of original variables into a smaller set of new composite dimensions, with a minimum loss of information.

[Example](https://www.projectrhea.org/rhea/index.php/PCA_Theory_Examples) of using PCA on image compression

#### Mapping of 2D points into 1D. 
PCA picks the 1D axis that preserves the most information (variance) in the data, so each point is stored with one number instead of two.

![SegmentLocal](images/la2/pca_1d.gif)

### Interpretation
- PCA can be used to discover important features of a large data set. It often reveals relationships that were previously unsuspected, thereby allowing interpretations that would not ordinarily result.
PCA is typically used as an intermediate step in data analysis when the number of input variables is otherwise too large for useful analysis.

[example](https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b) usage of pca (and t-SNE) for data visualization

[example](https://github.com/aviolante/sas-python-work/blob/master/tSneExampleBlogPost.ipynb) notebook for comparing PCA and t-SNE for visualizing MNIST data 

Let $X = (x^{1}, x^{2}, \dots, x^{m})$ be our data, where $x^{i} = (x_1^{i}, x_2^{i}, \dots, x_n^{i}) \in \mathbb{R}^{n}$ for each $i \in \{1, 2, \dots, m\}$
<br>
Center the data by subtracting the mean: $x^{i} := x^{i} - \frac{1}{m}\sum_{j=1}^{m}x^{j}$
<br>
Compute the covariance matrix:
$$A = \frac{1}{m} \sum_{i=1}^{m}(x^{i})(x^{i})^{T}$$
<br>
Take the SVD of $A$:
$$ A = U\Sigma V^{T}$$
and consider the first $k \leq n$ columns of $U \in \mathbb{R}^{n\times n}$: 
$$u_1, u_2, \dots, u_k \in \mathbb{R}^{n}$$
Now form the matrix $U_{r} = (u_1, u_2, \dots, u_k) \in \mathbb{R}^{n \times k}$ and project each sample:
$$z^{i} = U_{r}^{T}x^{i}$$
<br>
$U_{r}^{T} \in \mathbb{R}^{k \times n}$ and $x^{i} \in \mathbb{R}^{n \times 1}$, thus $z^{i} \in \mathbb{R}^{k \times 1}$
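
The projection steps can be sketched in NumPy as follows. The data here is synthetic and the sizes `m`, `n`, `k` and the mixing matrix `mix` are assumptions made only for this example; samples are stored as rows, so `Z[i]` corresponds to $z^{i}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 3, 2  # hypothetical sizes: samples, features, kept components

# Synthetic correlated data; row i of X is the sample x^i.
mix = np.array([[3.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 0.1]])
X = rng.normal(size=(m, n)) @ mix

# Center the data, then build the n x n covariance matrix.
X_centered = X - X.mean(axis=0)
A = X_centered.T @ X_centered / m

# SVD of the covariance matrix; keep the first k columns of U.
U, S, V_T = np.linalg.svd(A)
U_r = U[:, :k]

# Row i of Z is z^i = U_r^T x^i.
Z = X_centered @ U_r
print(Z.shape)
```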

### Reconstruction

We can approximately reconstruct the original value of $x^{i}$ as 
$$x_{a}^{i} = U_{r}z^{i}$$
<br>
$U_{r} \in \mathbb{R}^{n \times k}$ and $z^{i} \in \mathbb{R}^{k \times 1}$, thus $x_{a}^{i} \in \mathbb{R}^{n \times 1}$

To check the method we compare the original values with their approximations:
$$\frac{
\frac{1}{m}\sum_{i=1}^{m}||x^{i} - x_{a}^{i}||^{2}
}{
\frac{1}{m}\sum_{i=1}^{m}||x^{i}||^{2}
} \leq \epsilon$$
<br>
$\epsilon$ is a tolerance that we choose, e.g. $\epsilon = 0.01$

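In terms of the diagonal entries $S_{ii}$ of $\Sigma$, this ratio can be written as:
$$\frac{
\frac{1}{m}\sum_{i=1}^{m}||x^{i} - x_{a}^{i}||^{2}
}{
\frac{1}{m}\sum_{i=1}^{m}||x^{i}||^{2}
} = 1 -
\frac{
\sum_{i=1}^{k}S_{ii}
}{
\sum_{j=1}^{n}S_{jj}
}$$
<br>
So the condition above is equivalent to:
$$\frac{
\sum_{i=1}^{k}S_{ii}
}{
\sum_{j=1}^{n}S_{jj}
} \geq 1 - \epsilon$$
<br>
A single decomposition is enough: $\Sigma$ is computed once and the condition can then be checked for each candidate $k$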
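
One way to pick $k$ from a single SVD is sketched below: the smallest $k$ whose retained share of the singular values reaches $1 - \epsilon$ is selected, and the relative reconstruction error is then compared with the singular value formula. The synthetic data, the `scales` vector and the value of `eps` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 500, 5  # hypothetical sizes

# Synthetic data whose features have very different spreads; rows are samples.
scales = np.array([5.0, 2.0, 1.0, 0.1, 0.05])
X = rng.normal(size=(m, n)) * scales
X = X - X.mean(axis=0)  # center the data

A = X.T @ X / m
U, S, _ = np.linalg.svd(A)  # computed once, reused for every candidate k

eps = 0.01
retained = np.cumsum(S) / np.sum(S)  # retained[k-1] = sum_{i<=k} S_ii / sum_j S_jj
k = int(np.argmax(retained >= 1 - eps)) + 1  # smallest k meeting the condition

# Project onto the first k components and reconstruct.
U_r = U[:, :k]
X_a = (X @ U_r) @ U_r.T

rel_error = np.sum((X - X_a) ** 2) / np.sum(X ** 2)
print(k, rel_error, 1 - retained[k - 1])  # the last two values agree
```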

[pca](https://www.coursera.org/learn/machine-learning/lecture/GBFTt/principal-component-analysis-problem-formulation)

# Additional Materials

[jupyter-notebook tips&tricks&shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)