## Topic vectors 

Topic vectors add more tools for assessing textual data (search queries), such as comparing the meaning of words, documents, statements and corpora.
<br>
This stems from retrieving 'clusters' of similar documents and statements. In this way, there's no longer a need to compare the distance (cosine similarity) between documents based merely on their word usage.
<br>
In this sense, users are no longer limited to keyword search and relevance ranking based entirely on word choice or vocabulary. We can find documents that are relevant to our query, not just documents whose word statistics happen to match it.

Hence, this paradigm is known as **semantic search**, which is what strong search engines do when they return documents that contain few of the words in our query but are exactly what we are looking for.

### Semantic search

Semantic search gives us a tool for finding and generating meaningful text.
<br>
When we search for a document based on a word or partial phrase it contains, this is normally known as ***full text search***.
<br>
This is the typical purpose of a search engine, which breaks a document into chunks (words) that can be indexed with an *inverted index*, much like the index found at the back of a non-fiction book. It can take a lot of auditing and guesswork to deal with spelling errors and typos, but it generally works well for simple referencing.
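
To make the idea of an inverted index concrete, here is a minimal sketch using only the Python standard library. The helper names (`build_inverted_index`, `full_text_search`) and the toy documents are illustrative assumptions, and the tokenizer is a bare whitespace split rather than anything a production search engine would use.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def full_text_search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # index.get avoids creating empty entries for unseen tokens.
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
index = build_inverted_index(docs)
hits = full_text_search(index, "the cat")
print(sorted(hits))  # -> [0, 1]
```

Note that the query `"cat"` never reaches the third document, which only mentions `"cats"`. Exact token matching of this kind is what semantic search tries to move beyond.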

***Semantic search*** is full text search that takes into account the meaning of the words in the query and in the documents being searched.
<br>
As covered in prior notebooks within this repo, there are two ways to compute topic vectors that capture the semantics (meaning) of words and documents in a vector: LSA (PCA) and Latent Dirichlet allocation (LDiA).
<br>
One of the reasons that latent semantic analysis (LSA) was initially known as latent semantic ***indexing*** (LSI) is that it promised to power semantic search with an index of numerical values, like BOW and TF-IDF tables. At the time, semantic search was the next big thing in information retrieval.

However, unlike BOW and TF-IDF tables, tables of semantic vectors cannot be easily discretized and indexed using traditional inverted index techniques.
<br>
<br>
Traditional indexing approaches work with binary word occurrence vectors, discrete vectors (BOW vectors), sparse continuous vectors (TF-IDF vectors), and low-dimensional continuous vectors. High-dimensional continuous vectors, such as topic vectors from LSA/LDiA, are a challenge.
<br>
Inverted indexes work for discrete vectors or binary vectors, like tables of binary/integer word-document vectors, because the index only needs to maintain an entry for each nonzero discrete dimension. Because the representations mentioned above, such as TF-IDF vectors, are sparse (mostly zero), we don't need an entry in our index for most dimensions for most documents.

LSA/LDiA produce topic vectors that are high-dimensional, continuous and dense (i.e. zero values are rare). Also, the semantic analysis algorithm doesn't produce an efficient index for scalable search.
<br>
One solution to the challenge of high-dimensional vectors is to index them with a ***locality sensitive hash (LSH)***. LSH is like a postal code that designates a region of hyperspace so that it can easily be found again later. Also, like a regular hash, it's discrete and depends only on the values in the vector.
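
The postal-code analogy can be sketched with random-hyperplane hashing, one common LSH family for cosine similarity. This is an illustrative assumption, not necessarily the scheme behind the figure below. Each random hyperplane contributes one bit (which side of the plane the vector falls on), and the bits together form a discrete bucket id. The names `hyperplanes` and `lsh_bucket` and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_dims, n_bits = 100, 8  # assumed sizes, chosen only for illustration
# One random hyperplane (through the origin) per hash bit.
hyperplanes = rng.normal(size=(n_bits, n_dims))


def lsh_bucket(vector):
    """Return a discrete bucket id built from the signs of the projections."""
    bits = (hyperplanes @ vector) > 0
    return int("".join("1" if b else "0" for b in bits), 2)


doc_vector = rng.normal(size=n_dims)
same_direction = 2.5 * doc_vector  # same direction, different length
opposite = -doc_vector             # points the opposite way

print(lsh_bucket(doc_vector) == lsh_bucket(same_direction))  # -> True
print(lsh_bucket(doc_vector) == lsh_bucket(opposite))        # -> False
```

The hash depends only on the values in the vector, so the same vector always lands in the same bucket, and vectors pointing in similar directions tend to share most of their bits.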

<img src="img-vects/semantic-search-accuracy.png" alt="Semantic search accuracy deteriorates at around 12-D" width="400" height='250'/>

Figure 1: NLPIA, Lane, Howard and Hapke (2019), ch. 4.8.1, p. 372 (Apple iBooks).

As seen in Figure 1, this process doesn't work perfectly once we exceed about 12 dimensions.
In the first column of Figure 1, each row gives a topic vector size (dimensionality), starting with 2 dimensions and working up to 16 dimensions.

The table shows how good one's search results would be if LSH were used to index a large number of semantic vectors.
A noticeable heuristic is that once a vector has more than 16 dimensions, we'd have a hard time returning two search results that were any good.

> How can we do semantic search on 100-D vectors without an index?

To find precise semantic matches, we need to find all the closest document topic vectors to a particular query (search) topic vector.
<br>
If we have *n* documents, we need to do *n* comparisons with our query topic vector, which is a lot of dot products.

We can vectorize the operation in numpy via matrix multiplication, yet this doesn't reduce the number (order) of operations; it only makes the computation run faster, by roughly a factor of 100.
<br>
Precise semantic search still requires O(n) (linear complexity) multiplications and additions for each query. Hence, it scales linearly with the size of our corpus (*n* documents).
<br>
Arguably, this wouldn't work for a large corpus, such as Google Search or even Wikipedia semantic search.
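
A minimal sketch of this exact, index-free search is shown below. The topic vectors are random stand-ins (an assumption, since no real corpus is loaded here) and `exact_search` is a hypothetical helper name. The single matrix-vector product still performs one dot product per document, so the cost grows linearly with the corpus.

```python
import numpy as np

rng = np.random.default_rng(42)

n_docs, n_topics = 10_000, 100  # assumed corpus size and topic dimensionality
# Random stand-ins for LSA/LDiA topic vectors, normalized to unit length
# so that a dot product equals the cosine similarity.
doc_vectors = rng.normal(size=(n_docs, n_topics))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)


def exact_search(query, top_k=5):
    """Compare the query against every document: n_docs dot products."""
    query = query / np.linalg.norm(query)
    similarities = doc_vectors @ query  # vectorized, but still O(n_docs)
    top = np.argsort(similarities)[::-1][:top_k]
    return top, similarities[top]


# A query that is a slightly perturbed copy of document 123.
query = doc_vectors[123] + 0.01 * rng.normal(size=n_topics)
ids, scores = exact_search(query)
print(ids[0])  # -> 123
```

Doubling `n_docs` doubles the work per query, which is why this approach becomes impractical for very large corpora.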


The key is to settle for 'good enough' rather than striving for a perfect index or LSH algorithm for our high-dimensional vectors.
<br>
Several open source implementations of efficient and accurate ***approximate nearest neighbors*** algorithms have been available for some time; they use LSH to implement semantic search efficiently.
<br>
A couple of the easiest to use and install are:
* `Spotify` - `Annoy` package 
* `Gensim` - `gensim.models.KeyedVectors` class

Note that these indexing or hashing solutions cannot guarantee that we'll find all the best matches for our semantic search query.
<br>
However, they can get us back a good list of close matches almost as fast as with a conventional inverted index on a TF-IDF/BOW vector, provided we're willing to give up a little precision.
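
The trade-off can be illustrated with a toy bucket index in numpy. This is a sketch under stated assumptions and not how `Annoy` or `Gensim` work internally: documents are grouped by a random-hyperplane hash, and a query is compared only against documents in its own bucket. The vectors are random stand-ins, and `bucket_of` and `approximate_search` are hypothetical names.

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(7)

n_docs, n_dims, n_bits = 5_000, 100, 6  # assumed sizes for illustration
doc_vectors = rng.normal(size=(n_docs, n_dims))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)
hyperplanes = rng.normal(size=(n_bits, n_dims))


def bucket_of(vector):
    """Use the signs of the hyperplane projections as a hashable bucket key."""
    return tuple((hyperplanes @ vector > 0).tolist())


# Build the index once: bucket key -> list of document ids.
buckets = defaultdict(list)
for doc_id, vector in enumerate(doc_vectors):
    buckets[bucket_of(vector)].append(doc_id)


def approximate_search(query, top_k=3):
    """Rank only the documents that share the query's bucket."""
    candidates = np.array(buckets.get(bucket_of(query), []), dtype=int)
    if candidates.size == 0:
        return candidates, 0
    similarities = doc_vectors[candidates] @ (query / np.linalg.norm(query))
    order = np.argsort(similarities)[::-1][:top_k]
    return candidates[order], int(candidates.size)


ids, n_compared = approximate_search(doc_vectors[42])
print(ids[0], n_compared < n_docs)  # -> 42 True
```

Only the documents in one bucket are compared instead of all `n_docs`, which is where the speed comes from. The cost is that a close neighbor sitting just across a hyperplane lands in a different bucket and is missed, which is the loss of precision described above.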