## Finding meaning in word counts (semantic analysis)

**Semantic analysis** is concerned with using machines and programs to capture the 'meaning' of words.
<br>
As a reminder, TF-IDF vectors score words and n-grams by their importance in a document, which is useful for searching text only if the exact words/n-grams to be searched for are known.
<br>
Past NLP work produced an algorithm for revealing the meaning of word combinations and computing vectors to represent this meaning - ***latent semantic analysis (LSA)***. This tool not only represents the meaning of words as vectors, but can also be used to represent the meaning of entire documents.

To build ***semantic/topic vectors***, we can use the weighted frequency scores from TF-IDF vectors to compute the topic 'scores' that make up the dimensions of our topic vectors. The idea is to use the correlation of normalized term frequencies with each other to group words together into topics, and these topics define the dimensions of our new topic vectors.
<br>
Such methods enable interesting applications, e.g. searching for documents based on their meaning - ***semantic search***. At times, semantic search returns results that are much better than keyword search (TF-IDF search). It can return documents that are exactly what the user is searching for, even when the user can't think of the right words to put in the query.

Semantic vectors help us identify the words or n-grams that best represent the subject (topic) of a statement, document or corpus (collection of documents). Given this vector representation of words along with their *relative* importance, you can provide someone with the most meaningful words for a document - a set of keywords that summarizes its meaning.
<br>
Semantic vectors make it possible to compare any two statements/documents and tell how 'close' they are in *meaning* to each other.

The linear combinations of words that make up the dimensions of our topic vectors are powerful representations of meaning.
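As a rough illustration of such a linear combination, the sketch below projects TF-IDF scores onto two topics. The topic names, the word weights and the TF-IDF values are all hypothetical, hand-picked numbers; LSA would derive the weights from the corpus instead of having them set by hand.

```python
# Hypothetical hand-picked weights: how strongly each word signals each topic.
topic_weights = {
    "pets":    {"cat": 0.5, "dog": 0.5, "leash": 0.3, "stock": 0.0, "market": 0.0},
    "finance": {"cat": 0.0, "dog": 0.0, "leash": 0.0, "stock": 0.6, "market": 0.4},
}


def topic_vector(tfidf, weights=topic_weights):
    """Score a {term: tfidf} vector on each topic as a weighted sum of its terms."""
    return {
        topic: sum(w * tfidf.get(term, 0.0) for term, w in word_weights.items())
        for topic, word_weights in weights.items()
    }


# Made-up TF-IDF scores for three short documents.
doc_a = {"cat": 0.7, "leash": 0.2}
doc_b = {"dog": 0.8}
doc_c = {"stock": 0.5, "market": 0.5}

a_topics = topic_vector(doc_a)
b_topics = topic_vector(doc_b)
c_topics = topic_vector(doc_c)
print(a_topics, b_topics, c_topics)
```

Here `doc_a` and `doc_b` share no words, yet both score only on the `pets` dimension, while `doc_c` scores only on `finance`.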

### Word Counts to topic vectors 

Here, we want to score the meanings and topics the words are used for.

#### TF-IDF vectors and lemmatization

Any word vector representation such as TF-IDF counts the exact spellings of terms in a document.
<br>
As a reminder, texts that restate the same meaning will have completely different TF-IDF vector representations if they spell things differently or use different words. Such cases can confuse search engines and also document similarity comparisons relying on token counts.
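The effect can be sketched with the standard library alone. The TF-IDF formula below is one common smoothed variant, the three sentences are made up for illustration, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed TF-IDF variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


docs = [
    "doctors recommend daily exercise",
    "physicians advise regular workouts",  # same meaning, different words
    "doctors recommend daily vitamins",    # different meaning, shared words
]
vec_a, vec_b, vec_c = tfidf_vectors(docs)

paraphrase_similarity = cosine(vec_a, vec_b)
shared_word_similarity = cosine(vec_a, vec_c)
print(paraphrase_similarity, shared_word_similarity)
```

The paraphrase scores zero because the two sentences share no tokens, while the sentence that reuses three of the four words scores much higher despite making a different recommendation.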

The lemmatization approach kept similarly *spelled* words together within an analysis, but not necessarily words with similar meanings - failing to pair up most synonyms.
<br>
This is challenging because synonyms usually differ in more ways than just the word endings that lemmatization and stemming deal with.
<br>
Sometimes, lemmatization and stemming can even mistakenly group antonyms together.

Consequently, two chunks of text that talk about the same thing but use different words will not be 'close' to each other in our lemmatized TF-IDF vector space model.
<br> 
For example, the TF-IDF vector for a chapter in a book about NLP may not be close at all to similar-meaning passages in university textbooks about latent semantic indexing.
<br>
The NLP book might use more modern jargon/terms, whereas university researchers tend to use more consistent and rigorous language in their textbooks/lectures.

#### Topic vectors 

We need a better method to extract additional information and meaning from word statistics, i.e. a better estimate of what the words in a document 'signify'.
<br>
We also want to understand what the combination of words *means* in a particular document. We would like to represent that meaning with a vector that is like a TF-IDF vector, yet more compact and meaningful.
<br>


We use two terms to describe these vectors:
<br>
1. **word-topic vectors** - Compact meaning vectors 
<br>
2. **document-topic vectors** - Document meaning vectors 
<br>
<br>
Either of these can be called **topic vectors**.
<br>
Topic vectors can be as compact or as expansive (high-dimensional) as desired. LSA topic vectors can have as few as one dimension or as many as thousands.
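A minimal sketch of how both kinds of topic vector can fall out of a truncated singular value decomposition, which is the factorization LSA is built on. It assumes NumPy is available; the tiny document-term matrix is made up, raw counts stand in for TF-IDF weights to keep it short, and `lsa_topic_vectors` and `n_topics` are hypothetical names chosen for this example.

```python
import numpy as np

terms = ["cat", "dog", "pet", "stock", "market", "trade"]
# Rows are documents, columns are terms (counts used in place of TF-IDF).
doc_term = np.array([
    [1, 0, 1, 0, 0, 0],  # "cat pet"
    [0, 1, 1, 0, 0, 0],  # "dog pet"
    [0, 0, 0, 1, 1, 0],  # "stock market"
    [0, 0, 0, 0, 1, 1],  # "market trade"
], dtype=float)


def lsa_topic_vectors(matrix, n_topics):
    """Return (document-topic, word-topic) matrices from a truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    doc_topic = u[:, :n_topics] * s[:n_topics]  # one row per document
    word_topic = vt[:n_topics].T                # one row per term
    return doc_topic, word_topic


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# n_topics sets how compact the topic vectors are.
doc_topic, word_topic = lsa_topic_vectors(doc_term, n_topics=2)

raw_similarity = cosine(doc_term[0], doc_term[1])
topic_similarity = cosine(doc_topic[0], doc_topic[1])
cross_topic_similarity = cosine(doc_topic[0], doc_topic[2])
print(raw_similarity, topic_similarity, cross_topic_similarity)
```

In the two-dimensional topic space the two pet documents end up closer together than they are in the raw term space, while the pet and finance documents stay apart. Raising or lowering `n_topics` trades compactness against detail.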
