# Neural search with averaged word vectors

In this tutorial, we use the `gensim` toolkit to demonstrate a simple approach to neural search. In a nutshell, we create a vector for each document by averaging the *GloVe* vectors of the words that occur in it. We encode the query in the same way and rank the documents by their cosine similarity to the query. We work with the *Lee Background Corpus*, which contains 300 short news items and is distributed together with `gensim`.
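
Before we bring in `gensim`, here is a minimal sketch of the idea in plain Python. The three-dimensional `toy_vectors` and the helper names `average_vector` and `cosine_similarity` are made up for illustration; real *GloVe* vectors have many more dimensions and come from the embedding file.

```python
import math

# Hypothetical 3-dimensional "word vectors", invented for this example.
toy_vectors = {
    "rain":   [0.9, 0.1, 0.0],
    "flood":  [0.8, 0.2, 0.1],
    "profit": [0.0, 0.9, 0.3],
    "bank":   [0.1, 0.8, 0.4],
}

def average_vector(tokens, vectors):
    """Average the vectors of all tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens to average")
    return [sum(component) / len(known) for component in zip(*known)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

weather_doc = average_vector(["rain", "flood"], toy_vectors)
finance_doc = average_vector(["profit", "bank"], toy_vectors)
query = average_vector(["flood"], toy_vectors)

print(cosine_similarity(query, weather_doc))  # close to 1: same topic
print(cosine_similarity(query, finance_doc))  # clearly lower: different topic
```

The query vector points in almost the same direction as the weather document, so that document would be ranked first.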

### Install gensim

First, we need to install `gensim`. This may take a while.

If you run Python from the command line rather than in a notebook, install it the usual way: `pip install gensim`.

In [None]:
import sys

!{sys.executable} -m pip install gensim

### Load the corpus

Let us load the corpus and display the first 100 characters of each of the first 20 news stories:

In [None]:
import os
import gensim

lee_bg_file = os.path.join(gensim.__path__[0], 'test', 'test_data', 'lee_background.cor')
# the corpus holds one news story per line; the with-block closes the file for us
with open(lee_bg_file, 'r') as lee_bg_corpus:
    texts = lee_bg_corpus.readlines()
for i in range(20):
    print(f"({i:>2})  {texts[i][:100]}...")
print()

### Load the word embeddings

Now, we need to load the word embeddings. The original *word2vec* embedding file is too large for the CSC notebooks, so we use a *GloVe* embedding file instead. The *GloVe* embeddings are distributed together with this notebook, but you can look at the [gensim-data](https://github.com/RaRe-Technologies/gensim-data) repository for other options.

**Attention:** The embedding file uses about 2 GB of RAM, and loading it takes a while!

In [None]:
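# expand "~" explicitly so the path always resolves to the home directory
glove_file = os.path.expanduser('~/shared/glove-wiki-gigaword-200.gz')
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(glove_file)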
print("Done!")

### Compute document vectors

Next, we need to compute a document vector for each of the 300 news stories. We take one story at a time and do the following:

- Remove the stopwords
- Further preprocess the data (remove punctuation, lowercase, tokenize)
- Compute the average vector of all tokens appearing in the story
- Store the average vector in `doc_vectors`
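
To make the preprocessing steps more concrete, here is a rough stand-in written in plain Python. It is not what `gensim` does internally: `TOY_STOPWORDS` is a tiny made-up list (`gensim` ships a much longer one), `toy_preprocess` is a hypothetical helper, and the token length limits of 2 and 15 are assumed to mirror the defaults of `simple_preprocess`.

```python
import re

# A tiny, hypothetical stopword list for illustration only.
TOY_STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}

def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens only, drop stopwords and very short or long tokens."""
    # lowercasing plus the regular expression removes punctuation and digits
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and min_len <= len(t) <= max_len]

print(toy_preprocess("The flood in Sydney was the worst of 2001!"))
# ['flood', 'sydney', 'worst']
```

Only the content words survive, and these are the tokens whose vectors get averaged.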


In [None]:
# create a vector collection with as many dimensions as the word vectors
# and one pre-allocated slot per text
doc_vectors = gensim.models.keyedvectors.KeyedVectors(word_vectors.vector_size, count=len(texts))

for i, line in enumerate(texts):
    # gensim provides procedures for preprocessing and stopword removal;
    # lowercase first, because the stopword list is lowercase and matched case-sensitively
    text_without_stopwords = gensim.parsing.preprocessing.remove_stopwords(line.lower())
    tokens = gensim.utils.simple_preprocess(text_without_stopwords)
    # get_mean_vector averages the vectors of all tokens that have an embedding
    dv = word_vectors.get_mean_vector(tokens)
    # use the line number as the document key
    doc_vectors.add_vector(i, dv)
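
It helps to know what the averaging step does with words of different vector lengths and with unknown words. The sketch below assumes the gensim 4.x defaults of `get_mean_vector` (`pre_normalize=True`, `ignore_missing=True`): every word vector is scaled to unit length before averaging, and tokens without an embedding are skipped. The helper `mean_of_unit_vectors` and the two-dimensional `toy` vectors are hypothetical, and raising an error when no token is known is our own choice for this sketch, not necessarily what `gensim` does.

```python
import math

def mean_of_unit_vectors(tokens, vectors):
    """Scale each known token's vector to unit length, then average; skip unknown tokens."""
    total = None
    count = 0
    for token in tokens:
        if token not in vectors:
            continue  # out-of-vocabulary tokens contribute nothing
        vec = vectors[token]
        norm = math.sqrt(sum(x * x for x in vec))
        unit = [x / norm for x in vec]
        total = unit if total is None else [a + b for a, b in zip(total, unit)]
        count += 1
    if count == 0:
        raise ValueError("no token has a vector")
    return [x / count for x in total]

# "long" has length 5 and "short" has length 1, yet both count equally after scaling
toy = {"long": [3.0, 4.0], "short": [0.0, 1.0]}
print(mean_of_unit_vectors(["long", "short", "unseen"], toy))  # approximately [0.3, 0.9]
```

Because of the scaling, a word with a large vector norm does not dominate the document vector.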

### Query the document collection

Now we can start writing queries and searching for the most relevant documents.

Here are four queries in a list:

In [None]:
queries = [
    "Israel",
    "investors capital profits financial services",
    "militant terrorist killed",
    "earthquake flood rain tsunami forest fire"
]

Let's take one query at a time and do the following:

- Process it in the same way as the document collection
- Compute the average vector for all tokens in the query
- Find the 5 most similar documents to the query
- Display the similarity score, the document id and the beginning of the document text


In [None]:
for q in queries:
    print("Query:", q)
    # preprocess the query; lowercase first so that capitalized stopwords are removed too
    q_without_stopwords = gensim.parsing.preprocessing.remove_stopwords(q.lower())
    q_tokens = gensim.utils.simple_preprocess(q_without_stopwords)
    qv = word_vectors.get_mean_vector(q_tokens)
    # rank the document vectors by cosine similarity to the query vector
    most_similar = doc_vectors.most_similar([qv], topn=5)
    for doc_position, doc_score in most_similar:
        print(f"- {doc_score:.4f}  ({doc_position:>3})  {texts[doc_position][:100]}...")
    print()
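
Conceptually, the `most_similar` call scores every document vector against the query vector and keeps the best `topn`. The following plain-Python sketch shows that ranking step on made-up two-dimensional vectors. The names `cosine`, `top_n` and `toy_docs` are hypothetical, and the real implementation works on whole arrays rather than looping in Python.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(query_vec, doc_vecs, n=5):
    """Return (doc_index, score) pairs for the n highest cosine similarities, best first."""
    scored = ((i, cosine(query_vec, dv)) for i, dv in enumerate(doc_vecs))
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# four made-up document vectors; the query points mostly along the first axis
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
for doc_index, score in top_n([1.0, 0.1], toy_docs, n=2):
    print(doc_index, round(score, 4))
```

Document 0 comes first and document 1 second; the document pointing in the opposite direction gets a negative score and would only show up with a large `n`.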

That's it... What do you think of the results? Feel free to modify the queries and the `topn` value!