# Latent Semantic Analysis

* **Synonymy**: multiple words with the same meaning
    * buy and purchase
    * big and large
    * quick and speedy
* **Polysemy**: one word with multiple meanings 
    * man, as in a human being in general, or as in a male person (man vs. woman)
    * milk, as a verb or as a noun

## Latent Variables
* For example, computer, laptop, and PC are probably seen together very often, meaning they are highly correlated
* We can think of a latent (hidden) variable that represents all of them; below it is called $z$
* $z = 0.7 \cdot \text{computer} + 0.5 \cdot \text{PC} + 0.6 \cdot \text{laptop}$
* The job of LSA is to find these variables and transform the original data into these new variables
* Hopefully the dimensionality of these new variables is much smaller than the original 
* It is important to note that LSA helps to solve the synonymy problem by combining correlated variables
* However, there are conflicting viewpoints on how well LSA helps with polysemy
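* As a minimal sketch of the latent variable $z$ above: the weights come from the example, while the word counts for the three documents are made up purely for illustration

```python
import numpy as np

# Hypothetical counts of (computer, PC, laptop) in three documents.
counts = np.array([
    [3, 1, 2],
    [0, 0, 0],
    [1, 4, 0],
])

# Weights from the example: z = 0.7*computer + 0.5*PC + 0.6*laptop
weights = np.array([0.7, 0.5, 0.6])

# One latent value per document replaces three correlated columns.
z = counts @ weights
print(z)
```

* Each document is now described by a single number instead of three, which is the kind of dimensionality reduction LSA aims for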

---
# PCA and SVD - The Underlying Math behind LSA 
* LSA is really just the application of SVD (**singular value decomposition**) to a term document matrix 
* PCA can be seen as a simpler form of SVD, so we will look at PCA first to build intuition

## PCA (Principal Component Analysis)
* at its most basic, PCA does a transformation on our input vectors
### $$z = Qx$$
* Q is a matrix 
* this is a matrix multiplication, and we know that:
* when you multiply a vector by a scalar, you end up with another vector in the same direction
* but when you multiply a vector by a matrix, you could get a vector in a different direction
* So what PCA does is rotate our original input vectors
* Another way of thinking of it is that they are the same input vectors, but in a different coordinate system
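* A small sketch of the scalar vs. matrix point above. The 90-degree rotation matrix is chosen only for illustration; it is not the $Q$ that PCA would compute

```python
import numpy as np

x = np.array([1.0, 0.0])

# Scalar multiplication: the result points in the same direction as x.
scaled = 3.0 * x

# Matrix multiplication: an illustrative 90-degree rotation matrix.
theta = np.pi / 2
Q = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)],
])
z = Q @ x


def unit(v):
    """Scale v to length 1 so that directions can be compared."""
    return v / np.linalg.norm(v)


same_direction = np.allclose(unit(scaled), unit(x))   # scalar keeps the direction
rotated = not np.allclose(unit(z), unit(x))           # matrix changes the direction
print(same_direction, rotated)
```

* A rotation keeps the length of the vector and only changes the direction it points in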

## PCA does 3 things
1. Decorrelates all of our input data, i.e. in the new coordinate system there is 0 correlation between any two of the transformed features
2. It orders the transformed data by its information content
    * the first dimension carries the most information (explains the most variance in the data)
    * the second dimension carries the second most information
    * and so on 
3. It allows us to reduce the dimensionality of our data
    * if our original vocabulary was 1000 words, then once we combine words by how often they co-occur in each document, the number of distinct latent terms may be only 100
    * this is a direct consequence of number 2: cutting off the higher dimensions won't result in a big loss if, say, they contain less than 5% of the total information
* Note: by removing information, you are not always decreasing predictive ability 
* Note: One helpful aspect of PCA is that it does denoising (smoothing, improves generalization)
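* A rough sketch of the 5% cutoff from point 3. The variances below are hypothetical, and treating "information" as share of total variance is the assumption used throughout these notes

```python
import numpy as np

# Hypothetical variances of the transformed dimensions, sorted largest first.
variances = np.array([6.0, 2.5, 1.0, 0.3, 0.2])

# Share of the total variance carried by each dimension.
fraction = variances / variances.sum()

# Keep only the dimensions that carry at least 5% of the total.
keep = fraction >= 0.05
k = int(keep.sum())
retained = fraction[keep].sum()
print(k, retained)
```

* Other cutoff rules are possible, for example keeping enough dimensions to reach a target cumulative share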

## Covariance 
* the three ideas above should give us a big hint about what we need to do here! 
* the central item in all of these ideas is the covariance matrix
* The diagonal entries of a covariance matrix tell us the variance along each dimension
* The off-diagonal entries tell us how correlated two different dimensions are with each other

![covariance%20matrix%20.jpg](attachment:covariance%20matrix%20.jpg)

* Recall that in most classical statistical methods, we consider **more variance** to be synonymous with **more information**
    * Why? Think about it like this: if a random variable X has a low variance (or, for the sake of argument, 0 variance), how much information can it carry? If we know that X is 5 100% of the time, then observing it tells us nothing that will help us make a decision
* As a corollary, think of a variable that is completely deterministic as containing 0 information
* Why is this? If we can already predict this variable exactly, then measuring it won't tell us anything new, since we already knew the answer we would get
* How is the covariance matrix calculated?

![covariance%20matrix%20eq1.png](attachment:covariance%20matrix%20eq1.png)

* we can see that when i = j this is just the sample variance 
* A convenient form of this equation is the matrix notation

![covariance%20matrix%20eq2.png](attachment:covariance%20matrix%20eq2.png)

* where X is the N x D input matrix (N samples along the rows, D features along the columns)
* we can drop the means here if we center the data before doing any processing, in which case it is just:
### $$\Sigma_X = \frac{X^TX}{N}$$
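* A minimal sketch of this formula on synthetic data (the data and the sizes `N`, `D` are made up). `np.cov` with `bias=True` divides by N, which matches the formula above

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
# Make feature 1 correlate with feature 0 so an off-diagonal entry is clearly nonzero.
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# Center each column, then apply Sigma_X = X^T X / N.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / N

# Reference: rowvar=False treats columns as variables, bias=True divides by N.
reference = np.cov(X, rowvar=False, bias=True)
print(np.allclose(cov, reference))
```

* The diagonal of `cov` holds the per-feature variances, and the matrix is symmetric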

## Eigenvalues and Eigenvectors 
* So now that we have this D x D covariance matrix, what do we actually do with it?
---
#### Review
* Review on Eigenvectors: https://www.youtube.com/watch?v=PFDu9oVAE-g&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=14
* Review on determinants: https://www.youtube.com/watch?v=Ip3X9LOh2dk&index=7&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
* The main thing to remember: during a linear transformation, certain vectors may stay on their original span and only be stretched in the process. The vectors that stay on their span are called eigenvectors, and the amount by which they are stretched is the eigenvalue
* The video above is excellent at showing how this can squish space into a lower dimension 
* Great tutorial: https://deeplearning4j.org/eigenvector
* And another tutorial: https://lazyprogrammer.me/tutorial-principal-components-analysis-pca/
* and one more: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
---
* It turns out that a D x D covariance matrix has D eigenvectors and D eigenvalues
* We turn them both into matrices
* We place the eigenvalues into a diagonal matrix called $\Lambda$
* We place the eigenvectors into a stacked matrix $Q$
* The $\Lambda$ matrix, which contains the eigenvalues, is actually just the covariance of the transformed data
* Since its off-diagonals are 0, the transformed data is completely decorrelated
* Since we are free to sort the eigenvalues in any order we wish, we can ensure that the largest eigenvalues come first, keeping in mind that the eigenvalues are just the variances of the transformed data
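* A sketch of these steps on synthetic data. It assumes the eigenvectors are stacked as the rows of $Q$, so that $z = Qx$ holds as written earlier; NumPy returns them as columns, hence the transpose

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3
A = rng.normal(size=(D, D))            # arbitrary mixing matrix (illustrative)
X = rng.normal(size=(N, D)) @ A        # synthetic data with correlated features
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / N

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort so that the largest eigenvalue (largest variance) comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
Q = eigvecs[:, order].T                # rows of Q are the eigenvectors

# Apply z = Q x to every sample (samples are the rows of Xc).
Z = Xc @ Q.T
cov_Z = Z.T @ Z / N                    # covariance of the transformed data
print(np.round(cov_Z, 6))
```

* `cov_Z` comes out diagonal, with the sorted eigenvalues on its diagonal, which is the $\Lambda$ described above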

## Extending PCA
* PCA helps us combine input features (aka words/terms, if you put the words along the columns of your input matrix)
* But what if we wanted to combine and decorrelate by document? Just do PCA on the transpose!
* usually in NLP we create term document matrices
* You can think of each term as an input dimension 
* and each document as a sample
* note that this is reversed from what we usually work with, where the samples go along the rows, and the input features go along the columns 
* So what we just did with PCA was combine similar terms, but what if we wanted to combine similar documents?
* Well, then we could just do PCA after transposing the original matrix
* One surprising thing is that even though we get an N x N covariance matrix, it still has at most D nonzero eigenvalues (when N > D); the remaining ones are 0
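* A quick numerical check of the eigenvalue count, on a made-up matrix with N > D. For simplicity this sketch skips centering and the 1/N scaling and compares $X^TX$ with $XX^T$ directly

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 3
X = rng.normal(size=(N, D))

eig_DD = np.linalg.eigvalsh(X.T @ X)   # D x D matrix: D eigenvalues
eig_NN = np.linalg.eigvalsh(X @ X.T)   # N x N matrix: N eigenvalues

# Only D of the N eigenvalues are meaningfully different from 0.
nonzero = eig_NN[eig_NN > 1e-10]
print(len(eig_NN), len(nonzero))
```

* In this uncentered setting the nonzero eigenvalues of the N x N matrix are the same as the eigenvalues of the D x D matrix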

## SVD (singular value decomposition)
* So now you know that PCA finds correlations between input features
* and that PCA on the transposed data finds correlations between input samples
* SVD does both PCAs at the same time
    * We find the eigenvalues ($S^2$) and eigenvectors ($U$) of $XX^T$
    * We find the eigenvalues ($S^2$) and eigenvectors ($V$) of $X^TX$
* We then take the square roots of the eigenvalues (the singular values) and put them on the diagonal of a matrix $S$
* These are related by: 
### $$X = USV^T$$
* I.e. X is decomposed into 3 parts
* We can transform both terms AND documents
* We get the low-rank approximation of X by keeping the first k columns of U and V and the first k singular values in S
### $$X_k=U_kS_kV_k^T$$
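* A minimal sketch of the decomposition and the rank-k approximation using `np.linalg.svd`. The matrix and the choice `k = 2` are hypothetical; in LSA, X would be the term document matrix

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # made-up small matrix

# full_matrices=False gives the compact SVD: U is 6x4, s has 4 values, Vt is 4x4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The squared singular values match the eigenvalues of X^T X (sorted descending).
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

# Using all singular values reconstructs X.
X_full = U @ np.diag(s) @ Vt

# Rank-k approximation: first k columns of U and V, first k singular values.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

error = np.linalg.norm(X - X_k)        # Frobenius norm of what was discarded
print(error)
```

* The approximation error depends only on the discarded singular values, so dropping small ones loses little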