This notebook demonstrates how to use ThirdAI's document search engine. It assumes you are running inside the ThirdAI demo Docker container. Once in the container, you can download this notebook by running:
```
wget https://raw.githubusercontent.com/ThirdAILabs/Demos/main/DocSearch.ipynb
```

First, we import the document search package, the embedding model we will use to embed the documents, and a few helpers we will need along the way.

In [None]:
import time

import warnings
warnings.filterwarnings('ignore')

import numpy as np

import thirdai
from thirdai.search import DocRetrieval

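from embeddings import DocSearchModel

# Instantiate the embedding model used by the cells below.
# Assumption: DocSearchModel can be constructed with no arguments.
embedding_model = DocSearchModel()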


Now you need a dataset, which is simply a collection of (document_id, document_text) pairs. Our engine works best with documents between 20 and 200 words long. Feel free to split big documents into many small passages!
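
If your documents are longer than that, one possible way to split them is sketched below. This is only an illustration: the helper name `split_into_passages`, the `"<doc_id>-<n>"` passage ID scheme, and splitting on whitespace are all assumptions, not part of the ThirdAI package.

```python
def split_into_passages(doc_id, text, max_words=200, min_words=20):
    """Split one document into (passage_id, passage_text) pairs.

    Hypothetical helper: passages hold at most max_words words, except
    that a trailing chunk shorter than min_words is merged into the
    previous passage, which may then slightly exceed max_words.
    """
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]

    # Avoid a tiny final passage by folding it into the one before it.
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1].extend(tail)

    return [(f"{doc_id}-{n}", " ".join(chunk)) for n, chunk in enumerate(chunks)]


long_text = " ".join(["word"] * 450)
passages = split_into_passages("doc0", long_text)
print([len(p.split()) for _, p in passages])  # [200, 200, 50]
```

The resulting pairs can be appended to `dataset` just like any other (document_id, document_text) pair.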

In [None]:
# TODO for you: Populate dataset (choose whatever you want!)
dataset = []

Now let's create an initially empty index! Here is a quick description of the constructor's inputs:
* dense_input_dimension is the dimension of the output of our embedding model (128 with our preloaded model)

* num_tables and hashes_per_table are hyperparameters of the index
  * increasing num_tables increases the accuracy of the model at the cost of speed and memory (a good value is the prepopulated 16)
  
  * hashes_per_table has a sweet spot in accuracy at around log_2(average document size)
  
* centroids is a NumPy array of precomputed centroids for the embedding space; we've already calculated and stored these for you! We've precomputed 2^18 of them, but if you're only adding a few documents, a simple heuristic is to use a random subsample (roughly one centroid per 10 documents), which we do below.
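
The log_2 heuristic for hashes_per_table can be estimated directly from your data. The sketch below assumes "document size" means the number of whitespace-separated words, matching how document length is measured earlier in this notebook; the function name `suggest_hashes_per_table` is hypothetical and not part of the ThirdAI package.

```python
import math


def suggest_hashes_per_table(dataset):
    """Return round(log2(average words per document)), with a floor of 1.

    dataset is a list of (document_id, document_text) pairs.
    """
    if not dataset:
        raise ValueError("dataset is empty")
    avg_words = sum(len(text.split()) for _, text in dataset) / len(dataset)
    # Guard against log2(0) if every document is empty.
    return max(1, round(math.log2(max(avg_words, 1))))


example = [(0, " ".join(["word"] * 64)), (1, " ".join(["word"] * 64))]
print(suggest_hashes_per_table(example))  # 6
```

Treat the result as a starting point only; the sweet spot is approximate, so it is worth trying neighboring values as well.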

In [None]:
all_centroids = embedding_model.centroids()
np.random.shuffle(all_centroids)
# Keep roughly one centroid per 10 documents (integer division), but at least one.
reduced_centroids = all_centroids[:max(1, len(dataset) // 10)]

our_index = DocRetrieval(
  dense_input_dimension=128, 
  num_tables=16, 
  hashes_per_table=6, 
  centroids=reduced_centroids)
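
As an alternative to shuffling the centroid array in place, the subsample can be drawn by index, which leaves the original array untouched and makes the selection reproducible. This is a minimal sketch: `subsample_centroids`, `docs_per_centroid`, and `seed` are hypothetical names, and the only assumption about the centroids is that they form a 2-D NumPy array with one centroid per row.

```python
import numpy as np


def subsample_centroids(centroids, num_docs, docs_per_centroid=10, seed=0):
    """Pick a random subset of rows without modifying `centroids`."""
    rng = np.random.default_rng(seed)
    # At least one centroid, and never more than are available.
    n = min(len(centroids), max(1, num_docs // docs_per_centroid))
    rows = rng.choice(len(centroids), size=n, replace=False)
    return centroids[rows]


# Stand-in data: 100 fake centroids of dimension 4.
fake_centroids = np.arange(400, dtype=np.float32).reshape(100, 4)
print(subsample_centroids(fake_centroids, num_docs=250).shape)  # (25, 4)
```

In this notebook, the call would look like `subsample_centroids(all_centroids, len(dataset))`.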

Now let's populate the index!

In [None]:
for doc_id, doc_text in dataset:
  embedding = embedding_model.encodeDocs([doc_text])[0]
  our_index.add_doc(doc_id=doc_id, doc_text=doc_text, doc_embeddings=embedding)

Finally, we can query it! The following cell starts an interactive demo: type any query you want, and the index will return the document most semantically likely to answer it. Type q to quit.

In [None]:
while True:
  query_text = input("> ")
  if query_text == "q":
    break

  # Time the embedding and the index lookup together.
  start = time.perf_counter()
  embedding = embedding_model.encodeQuery(query_text)
  result = our_index.query(embedding, top_k=8192)
  total_time = time.perf_counter() - start

  # Show only the top-ranked result.
  print(result[0])
  print(f"Took {total_time:.3f} seconds")