## Finding meaning in word counts (semantic analysis)

**Semantic analysis** is the use of machines/programs to understand the 'meaning' of words.
<br>
As a reminder, TF-IDF vectors score words and n-grams, which is useful for searching text when the exact words/n-grams to search for are already known.
<br>
NLP researchers have found an algorithm for revealing the meaning of word combinations and computing vectors to represent this meaning - ***latent semantic analysis (LSA)***. This tool not only represents the meaning of words as vectors, but can also represent the meaning of entire documents.

To build ***semantic/topic vectors***, we can use the weighted frequency scores from TF-IDF vectors to compute the topic 'scores' that make up the dimensions of our topic vectors. The idea is to use the correlation of normalized term frequencies with each other to group words together into topics, and those topics define the dimensions of our new topic vectors.
<br>
Such methods enable interesting applications, e.g. searching for documents based on their meaning - ***semantic search***. Semantic search often returns much better results than keyword search (TF-IDF search). It can return documents that are exactly what the user is searching for, even when the user can't think of the right words to put in the query.

Semantic vectors help us identify the words or n-grams that best represent the subject (topic) of a statement, document or corpus (collection of documents). Given this vector representation of words along with their *relative* importance, you can provide someone with the most meaningful words for a document - a set of keywords that summarizes its meaning.
<br>
Semantic vectors also make it possible to compare any two statements/documents and tell how 'close' they are in *meaning* to each other.
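One common way to measure how 'close' two vectors are is cosine similarity. The sketch below is only an illustration: it assumes three made-up topic dimensions, and the document vectors and the `cosine_similarity` helper are hypothetical rather than the output of a real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_u * norm_v)

# Hypothetical topic vectors for three documents (values are made up)
doc_1 = [0.8, 0.5, 0.1]
doc_2 = [0.6, 0.7, 0.0]
doc_3 = [0.0, 0.1, 0.9]

sim_1_2 = cosine_similarity(doc_1, doc_2)  # similar topic mix
sim_1_3 = cosine_similarity(doc_1, doc_3)  # very different topic mix
print(sim_1_2, sim_1_3)
```

With these values, `doc_1` comes out much closer to `doc_2` than to `doc_3`, because the first two put most of their weight on the same dimensions.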

Linear combinations of words that make up the dimensions of our topic vectors are powerful representations of meaning.

### Word counts to topic vectors

Here, we want to score the meanings and topics the words are used for.

#### TF-IDF vectors and lemmatization

Any word vector representation such as TF-IDF counts the exact spellings of terms in a document.
<br>
As a reminder, texts that restate the same meaning will have completely different TF-IDF vector representations if they spell things differently or use different words. Such cases can confuse search engines and also document similarity comparisons relying on token counts.
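The effect can be seen with a small sketch. It uses raw token counts rather than full TF-IDF weights (an assumption made to keep it short), which is enough here: IDF only rescales the non-zero entries, so it cannot create overlap between two texts that share no tokens. The sentences and the `bow_cosine` helper are invented for illustration.

```python
from collections import Counter
import math

def bow_cosine(text_a, text_b):
    """Cosine similarity between the token-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two made-up sentences with roughly the same meaning but no shared words
same_meaning = bow_cosine("the physician treated the ill child",
                          "a doctor cared for a sick kid")
print(same_meaning)  # 0.0, because the two token sets do not overlap
```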

The lemmatization approach kept similarly *spelled* words together within an analysis, but not necessarily words with similar meanings - failing to pair up most synonyms.
<br>
Synonyms are hard to pair up because they usually differ in more than the word endings that lemmatization and stemming deal with.
<br>
Sometimes, lemmatization and stemming can even mistakenly group antonyms together.

Consequently, two chunks of text that talk about the same thing but use different words will not be 'close' to each other in our lemmatized TF-IDF vector space model.
<br> 
For instance, the TF-IDF vector for a chapter in a book about NLP may not be at all close to similar-meaning passages in university textbooks about latent semantic indexing.
<br>
The NLP book might use more modern jargon/terms, whereas university researchers tend to use more consistent and rigorous language in their textbooks/lectures.

#### Topic vectors 

We need a better way to extract additional information and meaning from word statistics, i.e. a better estimate of what the words in a document 'signify'.
<br>
We also want to understand what the *combination* of words means in a particular document, and to represent that meaning with a vector that's like a TF-IDF vector, yet more compact and meaningful.
<br>


We use two terms for these vectors:
<br>
1. **word-topic vectors** - Compact meaning vectors for words
<br>
2. **document-topic vectors** - Meaning vectors for documents
<br>
<br>
Either of these can be called a **topic vector**.
<br>
Topic vectors can be as compact or as expansive (high-dimensional) as desired. LSA topic vectors can have as few as one dimension or as many as thousands.

The mathematical operations between topic vectors (addition and subtraction) mean more than they did with TF-IDF vectors.
<br>
The distances between topic vectors are useful for things like clustering documents or semantic search, whereas with TF-IDF vectors we could only cluster and search using keywords.
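As a rough sketch of how distances between topic vectors can drive a semantic search, the snippet below ranks documents by Euclidean distance to a query. The two-dimensional document-topic vectors, the query vector, and the `euclidean` helper are all made up for illustration.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical document-topic vectors (values are made up)
doc_topics = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.2, 0.8],
    "doc_c": [0.6, 0.4],
}

# Hypothetical topic vector for the user's query
query = [0.8, 0.2]

# Smallest distance first: the closest documents in 'meaning' rank highest
ranked = sorted(doc_topics, key=lambda name: euclidean(doc_topics[name], query))
print(ranked)
```

The same distances could be passed to a clustering routine instead of a sort. Cosine distance is a common alternative when only the direction of the vectors matters.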

After these computations are completed, we'll have one document-topic vector for each document in the corpus. Usually this also means we don't have to reprocess the entire corpus to compute a topic vector for a new document or phrase.
<br>
We'll also have a topic vector for each word in our lexicon (vocabulary), so we can compute the topic vector for any new document by adding up all its word-topic vectors.
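A minimal sketch of that last step, assuming we already have a word-topic vector for every word in a tiny lexicon. The three-dimensional vectors and the `document_topic_vector` helper are hypothetical, and skipping out-of-vocabulary words is just one possible design choice.

```python
# Hypothetical word-topic vectors for a tiny lexicon (values are made up)
word_topics = {
    "cat": [0.3, 0.1, 0.0],
    "dog": [0.3, 0.1, -0.1],
    "nyc": [-0.2, 0.1, 0.5],
}

def document_topic_vector(tokens, word_topics, n_topics=3):
    """Add up the word-topic vectors of all known tokens in a document."""
    total = [0.0] * n_topics
    for token in tokens:
        vector = word_topics.get(token)
        if vector is None:
            continue  # words outside the lexicon contribute nothing
        total = [t + v for t, v in zip(total, vector)]
    return total

# A new document that was not part of the original corpus
new_doc = "cat dog nyc cat pigeon".split()
doc_vec = document_topic_vector(new_doc, word_topics)
print(doc_vec)
```

Note that 'cat' appears twice, so its word-topic vector is added twice, while 'pigeon' is not in the lexicon and is ignored.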

Numerical representations of the semantics of words/sentences can be tricky.
<br>
This is especially true for languages like English, which have multiple dialects and many different interpretations of the same words. The concept of words with multiple meanings is known as ***polysemy***:
* ***Polysemy*** - The existence of words and phrases with more than one meaning 

Polysemy can affect the semantics of a word or statement in several ways. LSA handles these situations for us:
* ***Homonyms*** - Words with the same spelling and pronunciation but different meanings e.g. Bat (baseball bat and animal)
* ***Zeugma*** - Use of two meanings of a word simultaneously in the same sentence e.g. I held **her hand** and **my tongue**

LSA also deals with some of the challenges of polysemy in a voice interface (chatbot) that one can talk to, like Alexa or Siri:
* ***Homographs*** - Words spelled the same, but with different pronunciations and meanings e.g. Bass (a type of fish or a low, deep voice)
* ***Homophones*** - Words with the same pronunciation, but different spellings and meanings (an NLP challenge with voice interfaces) e.g. Blew and Blue



#### Thought experiments

Assume we have a TF-IDF vector for a certain document and we want to convert it to a topic vector. Think about how much each word contributes to each named topic.

> Let’s say we're processing some sentences about pets in Central Park in New York City (NYC). We can create three topics: one about pets, one about animals, and another about cities. 
<br>
Call these topics 'petness', 'animalness', and 'cityness'. So your 'petness' topic about pets will score words like 'cat' and 'dog' significantly, but probably ignore words like 'NYC' and 'apple'. The 'cityness' topic will ignore words like 'cat' and 'dog', but might give a little weight to 'apple', just because of the 'Big Apple' association.

Suppose we 'train' the topic model as described, using only logic/common sense rather than a computer. We might come up with some weights like this.

In [3]:
import numpy as np 

In [4]:
# Example TF-IDF dict with random scores, standing in for a real document
topic = {}
tfidf = dict(zip('cat dog apple lion NYC love'.split(), np.random.rand(6)))

In [9]:
topic['petness'] = (0.3 * tfidf['cat'] +
                    0.3 * tfidf['dog'] +
                    0 * tfidf['apple'] + 0 * tfidf['lion'] - 0.2 * tfidf['NYC'] +
                    0.2 * tfidf['love'])

In [17]:
topic['animalness'] = (0.1 * tfidf['cat'] +
                       0.1 * tfidf['dog'] +
                       0.1 * tfidf['apple'] + 0.5 * tfidf['lion'] + 0.1 * tfidf['NYC'] -
                       0.1 * tfidf['love'])

In [18]:
topic['cityness'] = (0 * tfidf['cat'] -
                     0.1 * tfidf['dog'] +
                     0.2 * tfidf['apple'] - 0.1 * tfidf['lion'] + 0.5 * tfidf['NYC'] +
                     0.1 * tfidf['love'])

The manual weights, e.g. (0.3, 0.3, 0, 0, -0.2, 0.2) for the 'petness' topic, are multiplied by the random TF-IDF values and summed to create the topic vector for this imaginary document.
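The same computation can be written once for all topics by storing the weights in a table. The sketch below reuses the hand-picked weights from the cells above; the `topic_vector` helper and the fixed `example_tfidf` values are assumptions added for illustration (fixed rather than random so the result is reproducible).

```python
# Word order used by every weight list below
words = 'cat dog apple lion NYC love'.split()

# Hand-picked weights from the cells above, one list per topic
topic_weights = {
    'petness':    [0.3, 0.3, 0.0, 0.0, -0.2, 0.2],
    'animalness': [0.1, 0.1, 0.1, 0.5, 0.1, -0.1],
    'cityness':   [0.0, -0.1, 0.2, -0.1, 0.5, 0.1],
}

def topic_vector(tfidf, topic_weights, words):
    """Weighted sum of TF-IDF values for each topic."""
    return {
        name: sum(w * tfidf[word] for w, word in zip(weights, words))
        for name, weights in topic_weights.items()
    }

# Made-up TF-IDF values for a document that is mostly about cats and dogs
example_tfidf = {'cat': 0.5, 'dog': 0.5, 'apple': 0.0,
                 'lion': 0.0, 'NYC': 0.1, 'love': 0.2}

example_topics = topic_vector(example_tfidf, topic_weights, words)
print(example_topics)
```

For this document 'petness' gets the largest score, while the negative weight on 'NYC' pulls it down slightly. Each topic is one row of a weight matrix, so the whole step amounts to a matrix-vector product.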

In this example, we added up the word frequencies that might be indicators of each topic.
<br>
We weighted the word frequencies (TF-IDF values) by how likely each word is to be associated with the topic. Similarly, words that talk about something that is in some sense the opposite of our topic get negative weights, which lower the topic score.
<br>
Running through this process, we get an illustration 