## Latent Semantic Analysis (LSA)

LSA utilises a common technique for dimension reduction, namely Singular Value Decomposition (SVD).
<br>
SVD decomposes a matrix into three matrices, one of which is diagonal.
* Applications - SVD can also be used for matrix inversion, and it has many real-world uses within data science, including behaviour-based recommendation engines that run alongside content-based NLP recommendation engines
* SVD purpose - It lets us truncate those matrices (ignore some rows/columns) before multiplying them back together, which reduces the number of dimensions we have to deal with in our vector space model
* Modified transformation - Multiplying the truncated matrices back together does not reproduce the original TF-IDF matrix exactly, but it can give a slightly better representation than the one we started with. The new representation contains the essence (latent semantics) of the documents and ignores the noise, making it useful for applications that require compression
* Summary - SVD applied to NLP is called LSA: it uncovers the meaning of words that is hidden (latent) in the data
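
The decompose, truncate and multiply-back steps above can be sketched with NumPy. This is a minimal illustration, not a full LSA pipeline: the matrix `A` is a made-up term-document count matrix and `k = 2` is an arbitrary choice.

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix (illustrative values only).
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted largest first.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers the original matrix.
A_rebuilt = U @ np.diag(s) @ Vt

# Truncate: keep only the k largest singular values and their vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm error of the rank-k approximation; it equals the
# root-sum-square of the singular values that were dropped.
error = np.linalg.norm(A - A_k)
print(U.shape, s.shape, Vt.shape, round(float(error), 3))
```

Note that with `full_matrices=False` the factors are not all square: here `U` is 4x3 while `Vt` is 3x3.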

***Technical explanation behind LSA:***
<br>
<br>
LSA is a mathematical technique for finding the 'best' way to linearly transform (rotate and stretch) any set of NLP vectors, e.g. BOW or TF-IDF vectors.
* Optimisation - For many applications, the best transformation lines up the axes (dimensions) of the new vectors with the directions of greatest variance in the word frequencies
* Filtering - We can then eliminate those dimensions in the new vector space that do not contribute much to the variance in the vectors from document to document
* Related concept - LSA on natural language documents is essentially **Principal Component Analysis** (PCA) applied to TF-IDF vectors (PCA conventionally mean-centres the data first, whereas LSA usually does not), which is useful for problems and areas involving *feature engineering*
* Computation - LSA uses SVD to find the combinations of words that are, together, responsible for the greatest variation in the data. As mentioned earlier, we rotate the TF-IDF vectors so that the new dimensions (basis vectors) of the rotated vectors all align with these maximum-variance directions. The basis vectors are the axes of our new vector space. Each dimension becomes a combination of word frequencies rather than a single word frequency.
* Interpretation - We can think of the output vectors as the weighted combinations of words that make up various 'topics' used throughout a given corpus
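
A small sketch of the rotation described above, using NumPy only. The weights in `X` are invented, and the word columns are mean-centred (the PCA convention) so that the SVD axes are exactly the maximum-variance directions; plain LSA often skips the centring step.

```python
import numpy as np

vocab = ["cat", "dog", "love", "nyc"]
# Hypothetical document-term weights (rows: documents, columns: words in vocab).
X = np.array([
    [0.9, 0.8, 0.5, 0.0],
    [0.7, 0.9, 0.6, 0.1],
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.4, 0.8],
    [0.8, 0.7, 0.1, 0.0],
])

# Assumption: centre each word column, as PCA does.
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each row of Vt is one new basis vector: a weighted combination of words.
topic_word = Vt

# Rotate the documents into the new vector space.
doc_topic = Xc @ Vt.T

# Variance of the documents along each new axis, largest first.
variances = doc_topic.var(axis=0, ddof=1)
print(np.round(variances, 3))
```

The first axis carries the most variance and each later axis carries less, which is what makes dropping the trailing dimensions cheap in terms of lost information.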

The machine/programme doesn't know what the combinations of words mean; it just identifies that they go together.
* Words together - If words like 'dog', 'cat' and 'love' frequently appear together, the programme will group those terms under one topic
* Topic identification - The programme doesn't automatically know that such a topic is likely about 'pets'. If the words occur together frequently in the same documents, LSA will give them high scores for the same topics
* Human interaction - The programme depends on humans/developers to look at which words have a high weight in each topic and give the topic a name
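
The naming step above might look like the following sketch. The weights in `topic_word` and the helper `top_words` are hypothetical, made up for illustration; the labels in `topic_names` are a human's choice, not something LSA produces.

```python
import numpy as np

vocab = np.array(["cat", "dog", "love", "nyc", "apple"])
# Hypothetical topic-word weights (rows: topics, columns: words in vocab).
topic_word = np.array([
    [0.6, 0.6, 0.5, -0.1, 0.0],
    [-0.1, 0.0, 0.2, 0.8, 0.6],
])

def top_words(weights, vocab, n=3):
    """Return the n words with the largest absolute weight in one topic."""
    order = np.argsort(-np.abs(weights), kind="stable")[:n]
    return [(str(vocab[i]), float(weights[i])) for i in order]

top_per_topic = [top_words(row, vocab) for row in topic_word]
for i, pairs in enumerate(top_per_topic):
    print(f"topic {i}:", pairs)

# A person reads the lists above and decides on labels.
topic_names = {0: "pets", 1: "city"}
```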

We can think of LSA as similar to the 'IDF' portion of TF-IDF: it tells us which dimensions in our vectors are important to the semantics of our documents.
* Discarding - We discard those dimensions (topics) that have the least amount of variance between documents, given that low-variance topics are distractions (noise) for any ML algorithm
* Interpretation - If every document has around the same amount of some topic and that topic doesn't help one tell the documents apart, then we can get rid of it
* Generalization - Discarding these dimensions helps generalize the vector representation, so a pipeline performs better when it has to make predictions on unseen documents, even documents from a different context
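
One common way to decide how many dimensions to keep is a cumulative-variance threshold. The sketch below is one possible rule, not the only one: the helper `choose_k`, the 90% threshold and the singular values are all assumptions for illustration.

```python
import numpy as np

def choose_k(singular_values, keep=0.9):
    """Smallest number of leading dimensions whose share of the total
    variance (squared singular values) reaches `keep`."""
    var = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(var) / var.sum()
    k = int(np.searchsorted(cumulative, keep)) + 1
    return min(k, len(var))  # guard against floating-point round-off

# Hypothetical singular values, sorted largest first as SVD returns them.
s = np.array([5.0, 3.0, 1.0, 0.5])
k = choose_k(s, keep=0.9)
print(k)  # 2: the first two topics hold about 96% of the variance
```

The remaining low-variance topics would then be dropped, for example by slicing a document-topic matrix as `doc_topic[:, :k]`.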

***Summary of LSA:***
<br>
<br>
The generalization and compression that LSA performs accomplishes what we attempt when we ignore stop words during text normalization.
<br>
LSA's dimension reduction is better, however, because it is optimal: it retains as much information as possible and doesn't discard any words, only dimensions (topics). LSA compresses more meaning into fewer dimensions.
<br>
We only have to retain the high-variance dimensions: the major topics that the corpus talks about in a variety of ways (with high variance).
<br>
Each of these dimensions becomes one of our 'topics', with some weighted combination of all the words captured in it.

### LSA thought experiment

We can use an algorithm to compute some topics like 'animalness', 'petness' and 'cityness' from our thought experiment - bearing in mind we can't tell the algorithm immediately what we want the topics to be about.
<br>
For a small corpus of short documents like tweets, chat messages, lines of poetry etc. it takes only a few dimensions (topics) to capture the semantics of those documents.  

```python
from nlpia.book.examples.ch04_catdog_lsa_3x6x16 import word_topic_vectors
word_topic_vectors.T.round(1)
```

| | cat | dog | apple | lion | nyc | love |
|---|---|---|---|---|---|---|
| top0 | -0.6 | -0.4 | 0.5 | -0.3 | 0.4 | -0.1 |
| top1 | -0.1 | -0.3 | -0.4 | -0.1 | 0.1 | 0.8 |
| top2 | -0.3 | 0.8 | -0.1 | -0.5 | 0.0 | 0.1 |


Each column in this (transposed) topic matrix is the 'word topic vector', or just 'topic vector', for one word.
* **Semantic vectors** - These are vectors one can use to represent the meaning of a word in any ML pipeline
* Computation - Topic vectors for each word can be added up to compute a topic vector for a document
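
Adding up word topic vectors can be sketched as follows, using the rounded weights from the table above. The helper `doc_topic_vector`, the whitespace tokenisation and the example sentence are assumptions for illustration; words outside the six-word vocabulary are simply skipped.

```python
import numpy as np

words = ["cat", "dog", "apple", "lion", "nyc", "love"]
# Rounded weights copied from the table above (rows: top0, top1, top2).
topic_word = np.array([
    [-0.6, -0.4, 0.5, -0.3, 0.4, -0.1],
    [-0.1, -0.3, -0.4, -0.1, 0.1, 0.8],
    [-0.3, 0.8, -0.1, -0.5, 0.0, 0.1],
])
index = {w: i for i, w in enumerate(words)}

def doc_topic_vector(tokens):
    """Sum the word topic vectors (columns) of every known token."""
    vec = np.zeros(topic_word.shape[0])
    for tok in tokens:
        if tok in index:  # out-of-vocabulary words contribute nothing
            vec += topic_word[:, index[tok]]
    return vec

doc = "i love my dog and my cat".split()
vec = doc_topic_vector(doc)
print(np.round(vec, 1))  # approximately [-1.1, 0.4, 0.6] for top0, top1, top2
```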