# FAISS With Haystack

Unfortunately, FAISS is **not** currently supported on Windows, so Windows users will need to stick with Elasticsearch. If you have access to Linux or macOS, read on.

We'll be using Haystack again, so fortunately setup is very straightforward. We first import and initialize a FAISS document store using very similar logic to what we used before, but this time we will be storing the FAISS index locally.

Storing the index locally means that we will need two files: a SQLite database and the FAISS index. We create the FAISS index later, but the SQLite database is created on initialization.

We will store both in the `models` directory, but adjust this to your own needs.

In [1]:
import os

path = '../../models/faiss'

# create the directory, including any missing parents, if it does not exist yet
os.makedirs(path, exist_ok=True)

And now we include this path within a SQLite database location string in the following document store initialization.

In [2]:
from haystack.document_store.faiss import FAISSDocumentStore

# initialize FAISS
document_store = FAISSDocumentStore(
    faiss_index_factory_str='Flat',
    sql_url=f'sqlite:///{path}/squad_dev.db',
    return_embedding=True
)

03/27/2021 10:40:51 - INFO - faiss -   Loading faiss with AVX2 support.
03/27/2021 10:40:51 - INFO - faiss -   Loading faiss.
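
To make the `'Flat'` setting above a little more concrete: a flat FAISS index stores the vectors as they are and compares a query against every one of them. The sketch below imitates that exhaustive search in plain Python, assuming inner-product scoring (the similarity DPR is trained with). The toy vectors and the `flat_search` helper are purely illustrative and are not part of FAISS or Haystack.

```python
# Toy illustration of exhaustive ("flat") inner-product search.
# The vectors below are made up; real DPR embeddings have many more dimensions.

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(query, index, top_k=2):
    """Score the query against every stored vector and return the
    top_k (position, score) pairs, highest score first."""
    scores = [(i, inner_product(query, vec)) for i, vec in enumerate(index)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


index = [
    [1.0, 0.0, 0.0],  # passage 0
    [0.0, 1.0, 0.0],  # passage 1
    [0.5, 0.5, 0.0],  # passage 2
]
query = [0.9, 0.1, 0.0]

results = flat_search(query, index)
print([position for position, _ in results])
```

Here passage 0 ranks first and passage 2 second. Because every vector is scored, the result is exact, at the cost of search time growing linearly with the number of stored passages.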


Next, we load our validation data from file, which we will be adding to the FAISS index.

In [3]:
import json

with open('../../data/squad/dev.json', 'r') as f:
    squad = json.load(f)

## Adding Data

As we saw with Elasticsearch, our current FAISS index has been initialized but contains nothing. Now we need to populate the index with our *dev.json* data. 

This time, we'll be making use of the Haystack `Document` object, which we import with:

In [1]:
from haystack import Document

This object prepares our data into the correct object format for our document stores - which in this case is FAISS.

Previously we used a dictionary with two keys, `'text'` and `'meta'`. The *Document* object provides two corresponding arguments, `text` and `meta`. So rather than the format we used before, which looked like:

```python
{
    'text': '<document text here>',
    'meta': {
        'other': '<other info here>'
    }
}
```

We will be using this *Document* object format instead:

```python
Document(
    text='<document text here>',
    meta={
        'other': '<other info here>'
    }
)
```
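
Because the dictionary keys match the argument names, the old format can be unpacked straight into keyword arguments. The sketch below shows the idea with a hypothetical `StandInDocument` dataclass, so that it runs without Haystack installed. With the real `Document` the same `**` unpacking should work, assuming its constructor accepts `text` and `meta` as keyword arguments as shown above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StandInDocument:
    """Hypothetical stand-in that mirrors only the `text` and `meta` arguments."""
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


# the dictionary format we used previously
old_format = {
    'text': '<document text here>',
    'meta': {'other': '<other info here>'},
}

# keys line up with the keyword arguments, so ** unpacking does the conversion
doc = StandInDocument(**old_format)
print(doc.text)
print(doc.meta['other'])
```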

Just like before, we will be feeding these *Document* objects into a list, which we will then feed into our FAISS `write_documents` method. Remember, our dataset contains duplicate contexts, so we must remove them first using `list(set(...))`.

In [5]:
# create list of contexts
contexts = [sample['context'] for sample in squad]

# remove duplicates
contexts = list(set(contexts))

# create list of Document objects
squad_docs = [Document(text=sample) for sample in contexts]
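
One thing to be aware of: `list(set(...))` does not guarantee any particular order, and for strings the order can change from one Python process to the next. If a stable document order matters (for example, to compare results between runs), an order-preserving alternative is `dict.fromkeys`. The samples below are made up and only assume the `'context'` key used above.

```python
# hypothetical samples shaped like the entries we iterate over above
samples = [
    {'context': 'Passage A'},
    {'context': 'Passage B'},
    {'context': 'Passage A'},  # duplicate
    {'context': 'Passage C'},
]

contexts = [sample['context'] for sample in samples]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_contexts = list(dict.fromkeys(contexts))
print(unique_contexts)  # ['Passage A', 'Passage B', 'Passage C']
```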

Now, because our document store is persisted on disk, running this notebook more than once could leave documents from an earlier run in the store. So we first delete any documents that already exist.

In [6]:
document_store.delete_all_documents()

Then we add the data to the index just like before:

In [7]:
document_store.write_documents(squad_docs)

The way that our documents are indexed will depend on the embedding model being used by our retriever. So, we need to initialize our DPR model (the retriever), and then `update_embeddings` using this retriever.

In [8]:
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

document_store.update_embeddings(retriever=retriever)

03/27/2021 10:41:05 - INFO - haystack.document_store.faiss -   Updating embeddings for 1204 docs...
  0%|          | 0/1204 [00:00<?, ?it/s]
Creating Embeddings:   0%|          | 0/76 [00:00<?, ? Batches/s]
Creating Embeddings:   1%|▏         | 1/76 [00:00<00:27,  2.73 Batches/s]
Creating Embeddings:   5%|▌         | 4/76 [00:00<00:19,  3.70 Batches/s]
Creating Embeddings:   9%|▉         | 7/76 [00:00<00:13,  4.95 Batches/s]
Creating Embeddings:  13%|█▎        | 10/76 [00:00<00:10,  6.48 Batches/s]
Creating Embeddings:  17%|█▋        | 13/76 [00:00<00:07,  8.29 Batches/s]
Creating Embeddings:  21%|██        | 16/76 [00:01<00:05, 10.30 Batches/s]
Creating Embeddings:  25%|██▌       | 19/76 [00:01<00:04, 12.41 Batches/s]
Creating Embeddings:  29%|██▉       | 22/76 [00:01<00:03, 14.50 Batches/s]
Creating Embeddings:  33%|███▎      | 25/76 [00:01<00:03, 16.40 Batches/s]
Creating Embeddings:  37%|███▋      | 28/76 [00:01<00:02, 17.98 Batches/s]
...
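
Note that the progress bar counts batches rather than documents. As a quick sanity check, we can work out which batch size is consistent with the two numbers in the output above (1204 documents, 76 batches). This is only arithmetic on the logged values, not a call into Haystack.

```python
num_docs = 1204    # from the "Updating embeddings for 1204 docs" log line
num_batches = 76   # from the "Creating Embeddings" progress bar

# keep every batch size that splits 1204 documents into exactly 76 batches;
# (num_docs + b - 1) // b is integer ceiling division
candidates = [
    b for b in range(1, num_docs + 1)
    if (num_docs + b - 1) // b == num_batches
]
print(candidates)  # [16]
```

So the passages here appear to be embedded 16 at a time, with the final batch holding the remaining 4 documents.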

Now that we've fully prepared our document store, we can save it. We will save to the same location as our SQLite database, but this time using the *.faiss* file extension.

In [9]:
document_store.save(f'{path}/squad_dev.faiss')

Our FAISS index is now saved to file! We'll go ahead and delete the `document_store` and `retriever`, and try reinitializing both using the data we've saved to file.

In [10]:
del document_store, retriever

All we need to do now is call the `load` method directly on `FAISSDocumentStore`, passing both the FAISS index location and the SQLite database location:

In [11]:
document_store = FAISSDocumentStore.load(f'{path}/squad_dev.faiss', f'sqlite:///{path}/squad_dev.db')
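
Since loading depends on both files being present, it can help to check for them before calling `load`. The sketch below is one way to do that with the standard library. The helper names `sqlite_url_to_path` and `missing_files` are hypothetical, and the code assumes the `sqlite:///` prefix is followed directly by the file path, as in the URL we used above.

```python
import os


def sqlite_url_to_path(sql_url):
    """Strip the 'sqlite:///' prefix to recover the filesystem path."""
    prefix = 'sqlite:///'
    if not sql_url.startswith(prefix):
        raise ValueError(f'not a SQLite URL: {sql_url}')
    return sql_url[len(prefix):]


def missing_files(index_path, sql_url):
    """Return the required files that do not exist on disk."""
    required = [index_path, sqlite_url_to_path(sql_url)]
    return [p for p in required if not os.path.isfile(p)]


path = '../../models/faiss'
missing = missing_files(f'{path}/squad_dev.faiss', f'sqlite:///{path}/squad_dev.db')
if missing:
    print('Cannot load, missing:', missing)
```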

And now we can re-initialize our retriever, using the same arguments as before.

In [12]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

Finally, we can begin retrieving contexts relevant to our questions using `retriever.retrieve`, which requires a single argument, `query`.

In [13]:
retriever.retrieve('What subject is most abstract?')

Creating Embeddings: 100%|██████████| 1/1 [00:00<00:00, 98.28 Batches/s]


[{'text': "A Turing machine is a mathematical model of a general computing machine. It is a theoretical device that manipulates symbols contained on a strip of tape. Turing machines are not intended as a practical computing technology, but rather as a thought experiment representing a computing machine—anything from an advanced supercomputer to a mathematician with a pencil and paper. It is believed that if a problem can be solved by an algorithm, there exists a Turing machine that solves the problem. Indeed, this is the statement of the Church–Turing thesis. Furthermore, it is known that everything that can be computed on other models of computation known to us today, such as a RAM machine, Conway's Game of Life, cellular automata or any programming language can be computed on a Turing machine. Since Turing machines are easy to analyze mathematically, and are believed to be as powerful as any other model of computation, the Turing machine is the most commonly used model in complexity 

And now we've extracted a few contexts stored within FAISS that our DPR model believes answer our query.
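
Since the retrieved passages are long, a shortened preview can be easier to scan. The sketch below assumes each result exposes its passage through a `.text` attribute (suggested by the `'text'` key in the output above). The `preview` helper and the `SimpleNamespace` stand-ins are hypothetical, used here so the example runs on its own.

```python
import textwrap
from types import SimpleNamespace


def preview(results, width=80):
    """Return one shortened, numbered line per retrieved document."""
    return [
        f'{rank}. ' + textwrap.shorten(doc.text, width=width, placeholder='...')
        for rank, doc in enumerate(results, start=1)
    ]


# hypothetical stand-ins for the objects returned by the retriever
fake_results = [
    SimpleNamespace(text='A Turing machine is a mathematical model of a general computing machine.'),
    SimpleNamespace(text='Short passage.'),
]

lines = preview(fake_results, width=40)
for line in lines:
    print(line)
```

With the real retriever, the same helper could be called as `preview(retriever.retrieve('What subject is most abstract?'))`, provided the `.text` assumption holds.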