# Building a Thesaurus with Data Science

We're going to build an intelligent thesaurus using a publicly available word embedding dataset.

We'll use Approximate Nearest Neighbors (ANN) search to index the pre-trained word embeddings, so that looking up a word's closest neighbors is fast enough for real-time use.
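To make the idea concrete, here is a minimal sketch of the exact, brute-force search that an approximate index trades away for speed. The toy 2-d vectors and the helper names (`euclidean`, `exact_neighbors`) are made up for illustration; they are not part of Annoy or FastText.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_neighbors(query: str, vectors: Dict[str, List[float]], n: int = 2) -> List[str]:
    """Rank every other word by distance to the query (one full scan per lookup)."""
    target = vectors[query]
    others = [word for word in vectors if word != query]
    return sorted(others, key=lambda word: euclidean(target, vectors[word]))[:n]


# Made-up 2-d vectors, purely for illustration (the real embeddings have 300 dimensions).
toy_vectors = {
    "dog": [1.0, 1.0],
    "puppy": [1.1, 0.9],
    "cat": [0.8, 1.3],
    "car": [5.0, -2.0],
}

print(exact_neighbors("dog", toy_vectors))  # ['puppy', 'cat']
```

Scanning every vector like this is fine for four words, but it gets slow with a million 300-dimensional embeddings, which is why we build an index instead.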

In [None]:
from typing import List, Dict, IO

from annoy import AnnoyIndex
from tqdm import tqdm

## Training New Embeddings

First we have to get our word vectors from [FastText](https://fasttext.cc/docs/en/english-vectors.html). They have a few pretrained sets of vectors available.

If you have a few million lines of unstructured text sitting around, you can [train your own as well](https://fasttext.cc/docs/en/unsupervised-tutorial.html).

In [None]:
! wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
! unzip wiki-news-300d-1M.vec.zip

#### Let's take a peek at the file just so we know what we're working with

In [None]:
! head -n 2 wiki-news-300d-1M.vec

This is a standard format for embeddings. The first line contains two pieces of metadata:  
* The number of embeddings, or the length
* The length of each embedding, or the dimensionality

Each subsequent line contains a word, followed by the space-separated floats that make up its corresponding vector.
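As a rough sketch of how that layout can be read, the snippet below parses a tiny made-up sample held in memory. The `read_vec` helper and the sample values are hypothetical, and a real file would be opened from disk instead of `io.StringIO`.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_vec(handle: TextIO) -> Tuple[int, int, Dict[str, List[float]]]:
    """Read the header line, then one word and its vector per line."""
    count, dims = map(int, handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        tokens = line.rstrip().split(" ")
        if len(tokens) != dims + 1:
            continue  # skip blank or malformed lines
        vectors[tokens[0]] = [float(token) for token in tokens[1:]]
    return count, dims, vectors


# Made-up sample in the same layout: 2 words, 3 dimensions each.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")
count, dims, vectors = read_vec(sample)
print(count, dims, vectors["world"])  # 2 3 [-0.4, 0.5, 0.6]
```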

## Indexing Embeddings

Now that we've "trained" our vectors, let's index them into neighborhoods using [`Annoy`](https://github.com/spotify/annoy).

For a minimal demo, we'll need the embedding file, the model dimensionality, and a mapping of integer IDs to words (plus the reverse mapping, so we can look up a word's ID).

In [None]:
embedding_file_name: str = "wiki-news-300d-1M.vec"
index_map: Dict[int,str] = {}
word_map: Dict[str, int] = {}

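# Read as UTF-8 explicitly so non-ASCII words load regardless of the platform default.
with open(embedding_file_name, "r", encoding="utf-8") as embedding_file: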
    embedding_length, embedding_dimensions = map(
        int, 
        embedding_file.readline().strip().split()
    )
    incomplete_neighborhood = AnnoyIndex(embedding_dimensions, "euclidean")
    for num, line in tqdm(enumerate(embedding_file), total=embedding_length):
        # Split on single spaces only, so words containing other whitespace stay intact.
        tokens = line.rstrip().split(" ")
        word = tokens[0]
        vector = [float(el) for el in tokens[1:]]
        index_map[num] = word
        word_map[word] = num
        incomplete_neighborhood.add_item(num, vector)
    incomplete_neighborhood.build(10)
    incomplete_neighborhood.save("wiki_news_neighbors.ann")

## Searching Embeddings

Now that we have a neighborhood of embeddings, we can search for similar embeddings by word.

In [None]:
new_neighborhood = AnnoyIndex(embedding_dimensions, "euclidean")
new_neighborhood.load("wiki_news_neighbors.ann")

In [None]:
def search_neighborhood(
    query: str, 
    n_neighbors: int = 7, 
    neighborhood=new_neighborhood, 
    idx_map: Dict[int, str] = index_map, 
    wrd_map: Dict[str, int] = word_map,
    verbose: bool = False
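) -> List[str]:
    """Return the words whose embeddings are closest to the query word.

    Raises KeyError if the query is not in wrd_map. The query word itself
    is usually among the results, since it is its own nearest neighbor.
    """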
    if verbose:
        print(f"query string: {query}")
    query_idx = wrd_map[query]
    if verbose:
        print(f"query index: {query_idx}")
    neighbor_ids = neighborhood.get_nns_by_item(query_idx, n_neighbors)
    if verbose:
        print(f"neighbor ids: {neighbor_ids}")
    neighbors = [idx_map.get(n_id, "NOT FOUND!") for n_id in neighbor_ids]
    if verbose:
        print(f"neighbors: {neighbors}")
    return neighbors
    

In [None]:
search_neighborhood("dog", verbose=True)
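A thesaurus shouldn't crash on a word it doesn't know, or suggest the query as its own synonym. Below is one possible sketch of a wrapper that handles both cases. The name `safe_search` and the toy stand-in lookup are hypothetical; the stand-in only exists so the snippet runs without Annoy.

```python
from typing import Callable, Dict, List


def safe_search(
    query: str,
    n_neighbors: int,
    nns_by_item: Callable[[int, int], List[int]],
    idx_map: Dict[int, str],
    wrd_map: Dict[str, int],
) -> List[str]:
    """Like search_neighborhood, but tolerant of unknown words."""
    query_idx = wrd_map.get(query)
    if query_idx is None:
        return []  # word is not in the vocabulary
    # Ask for one extra result, then drop the query itself if it comes back.
    neighbor_ids = nns_by_item(query_idx, n_neighbors + 1)
    return [idx_map[i] for i in neighbor_ids if i != query_idx][:n_neighbors]


# Hypothetical stand-in data and lookup, so the sketch runs without an index.
toy_words = ["dog", "dogs", "puppy", "cat"]
toy_idx_map = dict(enumerate(toy_words))
toy_wrd_map = {word: idx for idx, word in toy_idx_map.items()}


def fake_nns_by_item(item: int, n: int) -> List[int]:
    # Pretend every lookup returns ids in vocabulary order, query included.
    return list(range(len(toy_words)))[:n]


print(safe_search("dog", 2, fake_nns_by_item, toy_idx_map, toy_wrd_map))    # ['dogs', 'puppy']
print(safe_search("zebra", 2, fake_nns_by_item, toy_idx_map, toy_wrd_map))  # []
```

With the real index, the lookup argument would be `new_neighborhood.get_nns_by_item`, along with `index_map` and `word_map` from the cells above.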

# 🎆🎊🎆

There you have it: a digital thesaurus!

### Links:

* [Spotify/Annoy](https://github.com/spotify/annoy)
* [FastText](https://fasttext.cc/)