# 2.11 The Determinant

* Deep Learning Seminar: Theory [1]
* 김무성

#### References
* [2] Determinant (Wikipedia) - https://en.wikipedia.org/wiki/Determinant

The determinant of a square matrix, 
* denoted $\det(A)$,
* is a function mapping matrices to real scalars. 
* It is equal to the product of all the eigenvalues of the matrix.
* The absolute value of the determinant
    - can be thought of as a measure of 
        - how much 
            - multiplication 
                - by the matrix expands 
            - or contracts space. 
    - If the determinant is 0, 
        - then space is contracted completely 
            - along at least one dimension, 
            - causing it to lose all of its volume. 
    - If the determinant is 1, 
        - then the transformation 
            - preserves volume
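
The two properties above (product of eigenvalues, volume scaling) can be checked numerically. The following is a minimal sketch using NumPy; the matrices `A` and `S` are hypothetical examples chosen for illustration, not taken from the book.

```python
import numpy as np

# Hypothetical 2x2 example matrix (an assumption for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

det_A = np.linalg.det(A)
eigenvalues = np.linalg.eigvals(A)

# The determinant equals the product of the eigenvalues.
prod_eig = np.prod(eigenvalues).real

# The unit square has area 1. Its image under A is the parallelogram
# spanned by the columns of A, whose area is |det(A)|.
u, v = A[:, 0], A[:, 1]
parallelogram_area = abs(u[0] * v[1] - u[1] * v[0])

# A singular matrix (second row is twice the first) collapses the plane
# onto a line, so all area is lost and the determinant is 0.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
det_S = np.linalg.det(S)

print(det_A, prod_eig, parallelogram_area, det_S)
```

Here the determinant, the eigenvalue product, and the parallelogram area should all agree up to floating-point error.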

<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Area_parallellogram_as_determinant.svg" width=300 />

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b9/Determinant_parallelepiped.svg" width=300 />

# 2.12 Example: Principal Components Analysis 

#### References
* [3] A Quick Look at Dimensionality Reduction (PCA, SVD, NMF) - http://www.slideshare.net/madvirus/pca-svd
* [4] Principal Component Analysis - http://www.slideshare.net/rickwendell/principal-component-analysys
* [5] Probabilistic PCA, EM, and more - http://www.slideshare.net/hsharmasshare/probabilistic-pca-em-and-more
* [6] Murphy's Machine Learning: Latent Linear Model - http://www.slideshare.net/JungkyuLee1/s-latent-linear-model
* [7] Lecture 7 (prelude)  Some linear generative models and a coding perspective - https://www.cs.toronto.edu/~hinton/csc2515/notes/lec7pre.ppt 

<img src="http://www.nlpca.org/fig_pca_principal_component_analysis.png" width=700 />

<img src="http://images.slideplayer.com/15/4559668/slides/slide_8.jpg" width=400 />
             <img src="https://i.imgur.com/SOjew3N.png" width=400 />
             

<img src="figures/cap13.3.png" width=600 />

<img src="http://www.cs.ubc.ca/~murphyk/Bayes/Figures/gmka.gif" />

<img src="figures/cap_fa.png" width=600 />

<img src="figures/cap_ppca.png" width=600 />

One simple machine learning algorithm, principal components analysis (PCA), can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of $m$ points $\{x^{(1)}, \dots, x^{(m)}\}$ in $R^n$.

Suppose we would like to apply <font color="red">lossy compression</font> to these points.

#### code vector

One way we can encode these points is to represent a lower-dimensional version of them. 
* For each point $x^{(i)}∈ R^{n}$ 
    - we will find a corresponding <font color="red">code vector $c^{(i)} ∈ R^l$</font>.
        - If $l$ is smaller than $n$, 
            - it will take less memory to store the code points than the original data.

#### encoding function & decoding function

We will want to find 
* some <font color="red">encoding function</font> that 
    - produces the code for an input, 
        - $f(x)=c$, and 
* a <font color="red">decoding function</font> that 
    - produces the reconstructed input given its code, 
        - $x ≈ g(f(x))$

#### PCA

PCA is defined by our choice of the decoding function.
* Specifically, to make the <font color="red">decoder very simple</font>, 
    - we choose to <font color="blue">use matrix multiplication</font> 
        - to map the code back into $R^n$. 
* Let $g(c) = Dc$, 
    - where $D ∈ R^{n×l}$ 
        - is the <font color="red">matrix defining the decoding</font>.

#### orthogonal

Computing the optimal code for this decoder could be a difficult problem. 
* To keep the encoding problem easy, 
    - PCA <font color="red">constrains the columns of $D$ to be orthogonal to each other</font>.
    - (Note that $D$ is still not technically “an orthogonal matrix” unless $l = n$)

#### unit norm

With the problem as described so far, many solutions are possible, 
* because we can increase the scale of $D_{:,i}$ 
    - if we decrease $c_i$ proportionally for all points. 

To give the problem a unique solution, 
* we <font color="red">constrain all of the columns of $D$ to have unit norm</font>.
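
The orthogonality and unit-norm constraints, and the scale ambiguity they remove, can be illustrated numerically. This is a minimal sketch, assuming NumPy and hypothetical sizes `n = 5`, `l = 2`; building `D` by QR factorization is just one convenient way to obtain orthonormal columns and is not part of the PCA derivation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 2  # hypothetical sizes with l < n

# QR factorization of a random n x l matrix gives orthonormal columns.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))

gram = D.T @ D    # l x l: equals the identity
outer = D @ D.T   # n x n: not the identity when l < n

# Scale ambiguity that the unit-norm constraint removes:
# doubling a column of D and halving the matching code entry
# leaves the reconstruction D c unchanged.
c = rng.standard_normal(l)
D_scaled = D.copy()
D_scaled[:, 0] *= 2.0
c_scaled = c.copy()
c_scaled[0] /= 2.0
same_reconstruction = np.allclose(D @ c, D_scaled @ c_scaled)

print(np.allclose(gram, np.eye(l)))    # D^T D = I_l
print(np.allclose(outer, np.eye(n)))   # D D^T != I_n, so D is not "an orthogonal matrix"
print(same_reconstruction)
```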

#### how to generate the optimal code point $c^∗$

In order to turn this basic idea into an algorithm we can implement, 
* the first thing we need to do is figure out 
    - how to generate the optimal code point $c^∗$ for each input point $x$. 
* One way to do this is to 
    - <font color="red">minimize the distance</font> between 
        - the input point $x$ and 
        - its reconstruction, $g(c^∗)$.

####  $L^2$ norm

<img src="http://image.slidesharecdn.com/cs445linearalgebraandmatlabtutorial-150831010550-lva1-app6891/95/linear-algebra-and-matlab-tutorial-16-638.jpg?cb=1440983969" width=600 />

We can <font color="red">measure this distance using a norm</font>. 
* In the principal components algorithm, <font color="blue">we use the $L^2$ norm</font> :

<img src="figures/cap2.12.1.png" width=600 />

#### $L^2$ norm -> squared $L^2$ norm

#### References
* [8] Monotonic Function - http://encyclopedia2.thefreedictionary.com/Monotonically+increasing

<img src="http://img.tfd.com/ggse/2c/gsed_0001_0016_0_img4169.png" width=600 />

We can switch to the squared $L^2$ norm instead of the $L^2$ norm itself, 
* because both are minimized by the same value of $c$. 
* Both are minimized by the same value of $c$ 
    - because the $L^2$ norm is non-negative and 
        - the squaring operation is monotonically increasing 
            - for non-negative arguments.
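
The claim that the $L^2$ norm and the squared $L^2$ norm share the same minimizer can be checked with a brute-force search. The sketch below assumes NumPy and a hypothetical toy problem (a 2-D point `x`, a single direction `d`, and a scalar code `c` searched over a grid); it is an illustration, not part of the PCA algorithm.

```python
import numpy as np

# Hypothetical toy problem: reconstruct x as c * d for a scalar code c.
x = np.array([3.0, 1.0])
d = np.array([1.0, 0.0])

candidates = np.linspace(-5.0, 5.0, 1001)  # candidate codes c
errors = np.array([np.linalg.norm(x - c * d) for c in candidates])

# Squaring is monotonically increasing on non-negative values,
# so it cannot change which candidate attains the minimum.
best_norm = candidates[np.argmin(errors)]
best_squared = candidates[np.argmin(errors ** 2)]

print(best_norm, best_squared)
```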

<img src="figures/cap2.12.2.png" width=600 />

The function being minimized simplifies to

<img src="figures/cap2.12.3.png" width=600 />

(by the definition of the $L^2$ norm, equation 2.30)

<img src="figures/cap2.12.4.png" width=600 />

(by the distributive property)

<img src="figures/cap2.12.5.png" width=600 />

(because the scalar $g(c)^Tx$ is equal to the transpose of itself).

#### the first term

We can now change the function being minimized again, <font color="red">to omit the first term</font>, since this term does not depend on $c$:

<img src="figures/cap2.12.6.png" width=600 />

To make further progress, we must substitute in the definition of $g(c)$:

<img src="figures/cap2.12.7.png" width=600 />

(by the orthogonality and unit norm constraints on $D$)

<img src="figures/cap2.12.8.png" width=600 />

#### optimization

We can solve this optimization problem <font color="red">using vector calculus</font> (see section 4.3 if you do not know how to do this)

<img src="figures/cap2.12.9.png" width=600 />

#### encoder function

##### we can optimally encode $x$ just using a matrix-vector operation

This makes the algorithm efficient: 
* we can optimally encode $x$ just using a matrix-vector operation. 
* <font color="red">To encode a vector, we apply the encoder function</font> :

<img src="figures/cap2.12.10.png" width=600 />
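
As a sanity check on the encoder $f(x) = D^Tx$, the sketch below compares it with a generic least-squares solve of $\min_c \|x - Dc\|_2$. It assumes NumPy; the sizes, the random `D` with orthonormal columns, and the helper name `encode` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 6, 3  # hypothetical sizes

# Decoder matrix with orthonormal columns (built via QR for convenience).
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
x = rng.standard_normal(n)

def encode(x, D):
    """PCA encoder f(x) = D^T x (valid when D has orthonormal columns)."""
    return D.T @ x

c_closed_form = encode(x, D)

# Generic least-squares solution of min_c ||x - D c||_2 for comparison.
c_lstsq, *_ = np.linalg.lstsq(D, x, rcond=None)

# Gradient of -2 x^T D c + c^T c with respect to c, evaluated at the optimum.
gradient = -2.0 * D.T @ x + 2.0 * c_closed_form

print(np.allclose(c_closed_form, c_lstsq), np.allclose(gradient, 0.0))
```

The closed form needs only one matrix-vector product, whereas a general decoder would require solving a least-squares problem for every point.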

#### decoder function

Using a further matrix multiplication, we can also define the PCA reconstruction operation:

<img src="figures/cap2.12.11.png" width=600 />
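
The reconstruction $r(x) = g(f(x)) = DD^Tx$ is an orthogonal projection onto the column space of $D$. The following minimal sketch (NumPy, hypothetical sizes, hypothetical helper name `reconstruct`) checks two consequences: reconstructing twice changes nothing, and the residual is orthogonal to every column of $D$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 6, 2  # hypothetical sizes

D, _ = np.linalg.qr(rng.standard_normal((n, l)))  # orthonormal columns
x = rng.standard_normal(n)

def reconstruct(x, D):
    """PCA reconstruction r(x) = D D^T x."""
    return D @ (D.T @ x)

r = reconstruct(x, D)
residual = x - r

print(np.allclose(reconstruct(r, D), r))   # projecting twice is the same as once
print(np.allclose(D.T @ residual, 0.0))    # residual is orthogonal to the columns of D
```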

#### encoding matrix $D$

Next, we need to choose the encoding matrix $D$. 
* To do so, we revisit the idea of minimizing the $L^2$ distance between inputs and reconstructions. 
* Since we will use the same matrix $D$ to decode all of the points, 
    - we can <font color="red">no longer consider the points in isolation</font>. 
* Instead, we must minimize 
    - the Frobenius norm of the matrix of errors computed over all dimensions and all points:

<img src="figures/cap2.12.12.png" width=600 />

### Finding $D^*$

To derive the algorithm for finding $D^*$, we will start by considering the case where $l = 1$.

#### the case where $l = 1$

* In this case, $D$ is just a single vector, $d$. 
* Substituting equation 2.67 into equation 2.68 and simplifying $D$ into $d$, the problem reduces to

<img src="figures/cap2.12.13.png" width=600 />

The above formulation is the most direct way of performing the substitution, but is not the most <font color="red">stylistically pleasing way</font> to write the equation. 
* It places the scalar value $d^Tx^{(i)}$ on the right of the vector $d$. 
* It is more conventional to write scalar coefficients on the left of the vector they operate on. 
* We therefore usually write such a formula as

<img src="figures/cap2.12.14.png" width=600 />

or, exploiting the fact that a scalar is its own transpose, as

<img src="figures/cap2.12.15.png" width=600 />

The reader should aim to become familiar with such <font color="red">cosmetic rearrangements</font>.
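
These rearrangements are purely cosmetic, which a quick numerical check makes concrete. The sketch below assumes NumPy and hypothetical random vectors `d` and `x`; the three expressions correspond to $dd^Tx$, $(d^Tx)\,d$, and $(x^Td)\,d$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical unit vector d and data point x.
d = rng.standard_normal(4)
d /= np.linalg.norm(d)
x = rng.standard_normal(4)

form_a = np.outer(d, d) @ x   # d d^T x   (scalar implicitly on the right)
form_b = (d @ x) * d          # (d^T x) d (scalar moved to the left)
form_c = (x @ d) * d          # (x^T d) d (scalar replaced by its transpose)

print(np.allclose(form_a, form_b), np.allclose(form_b, form_c))
```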

#### design matrix

<img src="http://i.stack.imgur.com/VZtEr.jpg" width=600 />

At this point, it can be helpful to <font color="red">rewrite the problem in terms of a single design matrix of examples</font>, rather than as a sum over separate example vectors. This will allow us to use more compact notation.

Let $X ∈ R^{m×n}$ be the matrix defined by 
* stacking all of the vectors 
    - describing the points, such that
        - $X_{i,:} = x^{(i)^T}$.
            
We can now rewrite the problem as

<img src="figures/cap2.12.16.png" width=600 />
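
The design-matrix form and the per-point sum describe the same objective. A minimal sketch, assuming NumPy and a hypothetical random design matrix `X` with one example per row, compares $\sum_i \|x^{(i)} - dd^Tx^{(i)}\|_2^2$ with $\|X - Xdd^T\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3  # hypothetical number of points and dimensions

X = rng.standard_normal((m, n))  # row i is x^(i) transposed
d = rng.standard_normal(n)
d /= np.linalg.norm(d)           # unit-norm constraint

# Sum of squared reconstruction errors, one point at a time.
per_point = sum(np.sum((x_i - (d @ x_i) * d) ** 2) for x_i in X)

# Same quantity as a squared Frobenius norm over the whole design matrix.
frobenius = np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

print(per_point, frobenius)
```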

Disregarding the constraint for the moment, we can <font color="red">simplify the Frobenius norm portion</font> as follows:

<img src="figures/cap2.12.17.png" width=600 />

(by equation 2.49)

<img src="figures/cap2.12.18.png" width=600 />

(because terms not involving $d$ do not affect the arg min)

<img src="figures/cap2.12.19.png" width=600 />

(because we can cycle the order of the matrices inside a trace, equation 2.52)

<img src="figures/cap2.12.20.png" width=600 />

(using the same property again)

At this point, we re-introduce the constraint:

<img src="figures/cap2.12.21.png" width=600 />

(due to the constraint)

<img src="figures/cap2.12.22.png" width=600 />

<img src="figures/cap2.12.23.png" width=600 />
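
The net effect of the trace manipulations is that, for a unit vector $d$, the reconstruction error equals $\mathrm{Tr}(X^TX) - d^TX^TXd$, so minimizing the error is the same as maximizing $d^TX^TXd$. The sketch below checks this identity numerically, assuming NumPy and hypothetical random data.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 10, 4  # hypothetical sizes

X = rng.standard_normal((m, n))
d = rng.standard_normal(n)
d /= np.linalg.norm(d)  # the identity below relies on d^T d = 1

# Left-hand side: squared Frobenius norm of the reconstruction error.
lhs = np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

# Right-hand side: a constant term minus the quadratic form in d.
rhs = np.trace(X.T @ X) - d @ X.T @ X @ d

print(lhs, rhs)
```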

#### eigendecomposition

This optimization problem may be solved using eigendecomposition. 
* Specifically, the optimal $d$ is given by 
    - the eigenvector of $X^TX$ 
        - corresponding to the <font color="red">largest eigenvalue</font>.
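
A minimal sketch of this step, assuming NumPy: `np.linalg.eigh` returns the eigenvalues of the symmetric matrix $X^TX$ in ascending order, so the last eigenvector is the candidate for the optimal $d$. The random data and the comparison against random unit vectors are hypothetical illustrations, not a proof.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data with different spread along each axis.
X = rng.standard_normal((100, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
G = X.T @ X

eigvals, eigvecs = np.linalg.eigh(G)  # ascending eigenvalues
d_star = eigvecs[:, -1]               # eigenvector of the largest eigenvalue
best = d_star @ G @ d_star            # value of d^T X^T X d at d_star

# Evaluate the same objective on many random unit vectors.
trials = rng.standard_normal((1000, 4))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
values = np.einsum("ij,jk,ik->i", trials, G, trials)

print(best, values.max())
```

No random unit vector should exceed the value attained by `d_star`, which equals the largest eigenvalue.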

#### More generally

* This derivation is specific to the case of $l=1$ and recovers only the first principal component. 
* More generally, when we wish to recover a basis of principal components, 
    - the matrix $D$ is given by 
        - the <font color="red"> $l$ eigenvectors</font>
            - corresponding to the <font color="red">largest eigenvalues</font>.
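
The general case can be sketched as follows, assuming NumPy. The helper name `pca_decoder`, the data, and the choice `l = 2` are hypothetical. As in the derivation above, the data are not mean-centered here; many practical PCA implementations center the data first, which is an extra step not covered in this section.

```python
import numpy as np

def pca_decoder(X, l):
    """Return D (n x l): eigenvectors of X^T X for the l largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # ascending order
    order = np.argsort(eigvals)[::-1][:l]       # indices of the l largest
    return eigvecs[:, order]

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 5))  # hypothetical data
l = 2

D = pca_decoder(X, l)
codes = X @ D        # encode every row: c = D^T x
X_hat = codes @ D.T  # decode every row: r(x) = D c

# The squared reconstruction error equals the sum of the discarded eigenvalues.
error = np.linalg.norm(X - X_hat, "fro") ** 2
all_eigvals = np.linalg.eigvalsh(X.T @ X)  # ascending order
discarded = all_eigvals[:-l].sum()

# The same subspace is spanned by the top-l right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
svd_projector = Vt[:l].T @ Vt[:l]

print(error, discarded)
```

Comparing the projectors `D @ D.T` and `svd_projector`, rather than the vectors themselves, avoids the sign ambiguity of eigenvectors and singular vectors.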

#### Go !!

* Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. 
* Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.

# References 
* [1] 2 Linear Algebra (Deep Learning Book) - http://www.deeplearningbook.org/contents/linear_algebra.html
* [2] Determinant (Wikipedia) - https://en.wikipedia.org/wiki/Determinant
* [3] A Quick Look at Dimensionality Reduction (PCA, SVD, NMF) - http://www.slideshare.net/madvirus/pca-svd
* [4] Principal Component Analysis - http://www.slideshare.net/rickwendell/principal-component-analysys
* [5] Probabilistic PCA, EM, and more - http://www.slideshare.net/hsharmasshare/probabilistic-pca-em-and-more
* [6] Murphy's Machine Learning: Latent Linear Model - http://www.slideshare.net/JungkyuLee1/s-latent-linear-model
* [7] Lecture 7 (prelude)  Some linear generative models and a coding perspective - https://www.cs.toronto.edu/~hinton/csc2515/notes/lec7pre.ppt 
* [8] Monotonic Function - http://encyclopedia2.thefreedictionary.com/Monotonically+increasing