# Neural search with doc2vec document embeddings

This tutorial shows an alternative approach to neural search. We continue to use `gensim` and the *Lee Background Corpus*, but this time we train document embeddings directly on the corpus.

**Note:** This tutorial is an adapted version of the [Gensim Doc2Vec tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html).

### Install gensim

First, we need to make sure that `gensim` is installed. The installation may take a while.

If you are running Python from the command line rather than in a notebook, install it the usual way: `pip install gensim`.

In [None]:
import sys

!{sys.executable} -m pip install gensim

### Load the corpus

Let us load the corpus and display the beginnings of the first few news stories:

In [None]:
import os
import gensim

lee_bg_file = os.path.join(gensim.__path__[0], 'test', 'test_data', 'lee_background.cor')
# read all news stories (one per line); the context manager closes the file for us
with open(lee_bg_file, 'r') as lee_bg_corpus:
    texts = lee_bg_corpus.readlines()
for i in range(20):
    print(f"({i:>2})  {texts[i][:100]}...")
print()

### Convert the corpus to a list of TaggedDocuments

To train document embeddings, the training data needs to be in a particular data structure. We therefore preprocess each news story (remove stopwords, lowercase and tokenize) and save it as a `TaggedDocument`. The tag/label of the story is simply its numeric id (the number in parentheses above).

In [None]:
docs = []
for i, line in enumerate(texts):
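    # drop stopwords, then lowercase and tokenize the remaining text
    text_without_stopwords = gensim.parsing.preprocessing.remove_stopwords(line)
    tokens = gensim.utils.simple_preprocess(text_without_stopwords)
    # tag each story with its position in the corpus
    doc = gensim.models.doc2vec.TaggedDocument(tokens, [i])
    docs.append(doc)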
print(docs[0])

### Train the document vectors

This time, we create the document vectors/embeddings from scratch using the `doc2vec` algorithm. With `gensim`, this is as simple as the three lines of code below:

In [None]:
# create 50-dimensional document vectors, ignore words that occur fewer than 2 times, and train for 100 epochs
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
print("Done!")
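
The `min_count=2` argument tells the model to ignore words whose total frequency in the corpus is below 2. To make the effect concrete, here is a minimal pure-Python sketch of that kind of frequency filter. It is only an illustration, not how `gensim` implements it internally; the toy corpus and the `prune_vocab` helper are made up for this example.

```python
from collections import Counter

# Hypothetical toy corpus: three already-tokenized "documents"
toy_docs = [
    ["fire", "forest", "rain"],
    ["fire", "flood", "rain"],
    ["market", "fire"],
]


def prune_vocab(tokenized_docs, min_count):
    """Return the set of words whose total corpus frequency is >= min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, freq in counts.items() if freq >= min_count}


# Only "fire" (3 occurrences) and "rain" (2 occurrences) survive min_count=2
kept = prune_vocab(toy_docs, min_count=2)
print(sorted(kept))
```

On a small corpus like this one, a low threshold keeps most of the vocabulary while still discarding words that occur only once and therefore carry little signal for training.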

### Query the document collection

Now we can start writing queries and searching for the most relevant documents. For each query, we:

- Process it in the same way as the document collection
- Infer a vector for the query using the model trained above
- Find the 5 most similar documents to the query
- Display the similarity score, the document id and the beginning of the document text
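
The similarity score used for the ranking is the cosine similarity between the query vector and each document vector. The following pure-Python sketch shows the idea on made-up two-dimensional vectors. The helpers `cosine_similarity` and `rank_by_similarity` and the toy vectors are assumptions for illustration only; in the tutorial itself, `gensim` does this work for us.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, doc_vectors, topn=5):
    """Return the topn (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical document vectors keyed by document id
toy_doc_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
toy_query = [2.0, 0.0]

for doc_id, score in rank_by_similarity(toy_query, toy_doc_vectors, topn=2):
    print(f"- {score:.4f}  ({doc_id})")
```

Because cosine similarity depends only on direction, the query `[2.0, 0.0]` matches document 0 perfectly even though the two vectors differ in length.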


In [None]:
queries = [
    "Israel",
    "investors capital profits financial services",
    "militant terrorist killed",
    "earthquake flood rain tsunami forest fire"
]

for q in queries:
    print("Query:", q)
    q_without_stopwords = gensim.parsing.preprocessing.remove_stopwords(q)
    q_tokens = gensim.utils.simple_preprocess(q_without_stopwords)
    query_vector = model.infer_vector(q_tokens)
    most_similar = model.dv.most_similar([query_vector], topn=5)
    for doc_position, doc_score in most_similar:
        print(f"- {doc_score:.4f}  ({doc_position:>3})  {texts[doc_position][:100]}...")
    print()

Are these results better or worse than the ones with the averaged word vectors? Why do you think this is the case?

What happens if you increase/decrease the vector size or the number of epochs?
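
One way to make this comparison less subjective is to measure how much the top-5 result lists of two models overlap for the same query. The sketch below uses the Jaccard overlap of the returned document ids. The `top_k_overlap` helper is hypothetical, and the id lists are made-up placeholders rather than actual results from this corpus. Keep in mind that training involves randomness, so some variation between runs is expected even with identical settings.

```python
def top_k_overlap(ids_a, ids_b):
    """Jaccard overlap between two lists of document ids (1.0 = identical sets)."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder ids standing in for the top-5 results of two different models
baseline_ids = [12, 48, 7, 105, 33]
variant_ids = [48, 12, 9, 105, 71]

print(f"Overlap: {top_k_overlap(baseline_ids, variant_ids):.2f}")
```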