# Basic QA Pipeline with WeaviateDocumentStore

[Haystack](https://haystack.deepset.ai/overview/intro), a framework for NLP applications, provides a range of [Document Store](https://haystack.deepset.ai/components/document-store) integrations, and Weaviate is one of them. Weaviate is a vector database that enables fast and scalable vector storage, search, and retrieval. In this tutorial, we walk through how to use the `WeaviateDocumentStore` within a Haystack Question Answering pipeline. We use the Harry Potter wiki data in CSV format to answer some questions about the wizarding world ⚡️

This demo is built by Tuana Çelik (Deepset) and Laura Ham (Weaviate).

### First: enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [None]:
# Make sure you have a GPU running
!nvidia-smi

### Install Haystack

In [None]:
# Install the latest Haystack from GitHub's master
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,weaviate]

## 1. Write the data into a list of objects of Document format

Haystack expects text to come in [Document](https://haystack.deepset.ai/reference/primitives) format, so we first have to prepare the data before writing it to the WeaviateDocumentStore. We read each row of the CSV and turn it into a `dict`. Each `dict` has a "content" field and a "meta" field to conform with the Document type.

In [1]:
from haystack.utils import clean_wiki_text
import pandas as pd

harry = pd.read_csv("https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/harry_potter_wiki.csv")

dicts = []

for _, row in harry.iterrows():
    # Each entry follows the Document layout: the cleaned text plus its metadata
    dic = {
        'content': clean_wiki_text(row.text),
        'meta': {
            'name': row['name'],
            'url': row.url
        }
    }
    dicts.append(dic)

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
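
Before writing thousands of entries to a document store, it can help to sanity-check that every `dict` really has the "content" and "meta" fields described above. The following is a minimal sketch, not part of Haystack: `validate_document_dicts` is a hypothetical helper, and the `sample` entries are made-up illustrations rather than rows from the Harry Potter CSV.

```python
def validate_document_dicts(dicts):
    """Return the indices of entries that lack a non-empty 'content' string or a 'meta' dict."""
    bad = []
    for i, d in enumerate(dicts):
        content = d.get("content")
        meta = d.get("meta")
        if not isinstance(content, str) or not content.strip() or not isinstance(meta, dict):
            bad.append(i)
    return bad


# Hypothetical sample data, only for illustration
sample = [
    {"content": "Some wiki text.", "meta": {"name": "example", "url": "https://example.org"}},
    {"content": "", "meta": {"name": "empty"}},   # empty content
    {"content": "No meta field here"},            # missing meta
]

print(validate_document_dicts(sample))  # [1, 2]
```

In the notebook, calling `validate_document_dicts(dicts)` should return an empty list if every row was converted as expected.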


## 2. Launch a Weaviate cluster

There are two options to get a Weaviate instance running:
1. Run a Weaviate cluster on the Weaviate Cluster Service (WCS), which is hosted by SeMI Technologies (for free). Once you have created an account at https://console.semi.technology, you can create a free Weaviate instance there, or follow the steps in the code blocks below to create one from this notebook.
2. Run a Weaviate instance on your local machine. In this case, you also need to run this notebook locally (not on Google Colab, which runs on Google servers). You can use Haystack's `launch_weaviate()` function to get Weaviate running on `localhost:8080`, and initialize `WeaviateDocumentStore()` with this host (see below).

### Option 1: Create a Weaviate instance on WCS

In [None]:
# Only run this cell if you chose option 1: Create a Weaviate instance on WCS

!pip install weaviate-client==3.3.3 

In [None]:
# Only run this cell if you chose option 1: Create a Weaviate instance on WCS

from getpass import getpass # hide password
import weaviate # to communicate to the Weaviate instance

from weaviate.wcs import WCS 

In [None]:
# Only run this cell if you chose option 1: Create a Weaviate instance on WCS

my_credentials = weaviate.auth.AuthClientPassword(username=input("User name: "), password=getpass('Password: '))
my_wcs = WCS(my_credentials)

Now that we are connected to WCS, we can `create` a Weaviate instance.


In [None]:
# Only run this cell if you chose option 1: Create a Weaviate instance on WCS

with_auth = False  # set to True if you don't want your instance to be publicly available
# Create a WCS cluster with no vectorization module (ML model) attached;
# vectorization is handled by Haystack's EmbeddingRetriever (see below)
weaviate_url = my_wcs.create(with_auth=with_auth, wait_for_completion=True)
weaviate_url

### Option 2: Run Weaviate locally

Run the cell below only if you're using this notebook locally. It starts a Weaviate Docker container on your machine.


In [3]:
# Only run this cell if you chose option 2: Run Weaviate (and this notebook) locally. Uncomment the lines below

# from haystack.utils import launch_weaviate
# launch_weaviate()

bdc5d137926c1880ed4271a8ae10f7ba68f9907d1f509377e4e10cdbc6439ada
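
After starting the container, it can take a moment before Weaviate accepts requests. The sketch below polls a readiness endpoint using only the standard library. It assumes Weaviate exposes `/v1/.well-known/ready` on the configured host and port. `weaviate_ready_url` and `is_weaviate_ready` are hypothetical helper names, not Haystack or Weaviate client functions.

```python
import urllib.request


def weaviate_ready_url(host="http://localhost", port=8080):
    """Build the readiness URL (assumed endpoint: /v1/.well-known/ready)."""
    return f"{host}:{port}/v1/.well-known/ready"


def is_weaviate_ready(host="http://localhost", port=8080, timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(weaviate_ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Covers connection errors, timeouts, HTTP errors, and malformed URLs
        return False


print(weaviate_ready_url())
```

If `is_weaviate_ready()` returns `False` right after launching, waiting a few seconds and trying again is usually a reasonable first step.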


## 3. Write the documents to a WeaviateDocumentStore

Now that you have Weaviate ready, you can initialize a WeaviateDocumentStore for your Haystack Question Answering pipeline. Depending on which option you picked to start a Weaviate instance above, uncomment and use the relevant lines of code below.

In [4]:
from haystack.document_stores import WeaviateDocumentStore

# If you chose option 1 (run Weaviate with WCS)
document_store = WeaviateDocumentStore(host=weaviate_url, port=443)

# If you chose option 2 (run Weaviate locally), uncomment the line below and comment out the line above
# document_store = WeaviateDocumentStore() # assumes Weaviate is running on http://localhost:8080

document_store.write_documents(documents=dicts, batch_size=100)

13700it [02:05, 109.30it/s]                           
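
The `batch_size=100` argument means the documents are sent to Weaviate in groups rather than one at a time. To make the idea concrete, here is a small standalone sketch of batching. It is not Haystack's internal implementation; `batched` is a hypothetical helper and the `docs` list is placeholder data.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder documents, only for illustration
docs = [{"content": f"doc {i}", "meta": {}} for i in range(250)]
sizes = [len(batch) for batch in batched(docs, 100)]
print(sizes)  # [100, 100, 50]
```

Larger batches mean fewer round trips to the database, at the cost of bigger individual requests.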


## 4. The Retriever

Next, we define our [Retriever](https://haystack.deepset.ai/components/retriever). This component acts as a filter that retrieves only the relevant documents from your DocumentStore, based on your query. You have a few options to choose from, depending on your use case and desired accuracy:

`EmbeddingRetriever`

`DensePassageRetriever`

In this example, we're using the `EmbeddingRetriever`, a dense retriever that works with dense vectors (embeddings). This means we also have to call `update_embeddings()`, which creates the embeddings for our documents in the WeaviateDocumentStore. Computing these embeddings takes additional time.

In [5]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    model_format="sentence_transformers",
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
document_store.update_embeddings(retriever)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1
INFO - haystack.document_stores.weaviate -  Updating embeddings for all 13670 docs ...
Batches: 100%|██████████| 4/4 [01:18<00:00, 19.68s/it]
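
To give a rough intuition for what a dense retriever does with these embeddings, the sketch below ranks a few toy vectors by their dot product with a query vector. This is only an illustration: the three-dimensional vectors and document names are made up, the real embeddings come from the sentence-transformers model and have far more dimensions, and the similarity search itself runs inside Weaviate rather than in Python.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_vec, doc_vecs, top_k=2):
    """Return the top_k (name, score) pairs, highest dot product first."""
    scored = [(name, dot(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings, only for illustration
toy_docs = {"doc_a": [1, 0, 1], "doc_b": [0, 1, 0], "doc_c": [1, 0, 0]}
query = [1, 0, 1]

print(rank_by_dot_product(query, toy_docs, top_k=2))  # [('doc_a', 2), ('doc_c', 1)]
```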


## 5. The Reader

[The Reader](https://haystack.deepset.ai/components/reader) is the component that performs the actual Question Answering task. It lets you use the latest transformer models from the Hugging Face Model Hub by passing the model name as `model_name_or_path`. Here, we use the [deepset/tinyroberta-squad2](https://huggingface.co/deepset/tinyroberta-squad2) model.

 

In [6]:
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="deepset/tinyroberta-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/tinyroberta-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded deepset/tinyroberta-squad2


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0



INFO - haystack.modeling.infer -  Got ya 11 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0     0     0     0     0     0     0     0     0     0  
INFO - haystack.modeling.infer -  /w\   /w\   /w\   /w\   /w\   /w\   /w\   /|\  /w\   /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \   /'\   /'\   / \   / \   /'\   /'\   /'\   /'\   /'\ 
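
The `huggingface/tokenizers` warning in the output above suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal way to do that from Python is sketched below. It assumes the cell runs before the tokenizers library is first used (for example, at the top of the notebook), and choosing `"false"` here is just one of the two values the warning mentions.

```python
import os

# Set this before the tokenizers library is first used in the process
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```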


## 6. Initialize an ExtractiveQAPipeline

Now that we have our Retriever and Reader ready, the only thing left to do is initialize a Question Answering pipeline that uses them!

In [7]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)

## 7. Ask a question!

Now try asking a question to your pipeline 🥳
Change the `query` parameter below to ask something new, like "What does McGonagall teach?"

In [8]:
from haystack.utils import print_answers

# Retrieve the 10 most relevant documents, then return the Reader's 5 best answers
prediction = pipe.run(query="What does McGonagall teach?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

print_answers(prediction)

Batches: 100%|██████████| 1/1 [00:00<00:00,  8.73it/s]
INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/


Query: What does McGonagall teach?
Answers:
[   <Answer {'answer': 'Transfiguration', 'type': 'extractive', 'score': 0.9635432064533234, 'context': 'James Potter and R. J. H. King.\nMcGonagall may have been related to Transfiguration teacher Minerva McGonagall. It is possible that M. G. McGonagall i', 'offsets_in_document': [{'start': 327, 'end': 342}], 'offsets_in_context': [{'start': 68, 'end': 83}], 'document_id': '006a7918-9a88-33bf-b685-b7666919e0fc', 'meta': {'name': 'M._G._McGonagall', 'url': 'https://harrypotter.fandom.com/wiki/M._G._McGonagall'}}>,
    <Answer {'answer': 'Defence Against the Dark Arts', 'type': 'extractive', 'score': 0.0015463014133274555, 'context': 'ld take moving pictures.\nHe got his photo with the celebrity Defence Against the Dark Arts teacher Gilderoy Lockhart. However, the photographic Harry ', 'offsets_in_document': [{'start': 1756, 'end': 1785}], 'offsets_in_context': [{'start': 61, 'end': 90}], 'document_id': '0144511c-4e9b-5ae2-0a2d-83aa573e1479',
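
The scores in the output above differ by orders of magnitude (about 0.96 versus 0.0015), so it can be useful to keep only the confident answers. The sketch below shows one way to do that. To stay self-contained it uses a stand-in `ToyAnswer` type with `answer` and `score` fields mirroring the printed output; the assumption is that the real answer objects in `prediction["answers"]` expose the same two attributes. The `0.5` threshold and the `confident_answers` helper are illustrative choices, not Haystack defaults.

```python
from collections import namedtuple

# Stand-in for Haystack's answer objects, only for illustration
ToyAnswer = namedtuple("ToyAnswer", ["answer", "score"])


def confident_answers(answers, min_score=0.5):
    """Keep answers whose score is at least min_score (threshold is an arbitrary example)."""
    return [a for a in answers if a.score is not None and a.score >= min_score]


# Answers and scores copied from the printed output above
toy_prediction = {
    "answers": [
        ToyAnswer("Transfiguration", 0.9635432064533234),
        ToyAnswer("Defence Against the Dark Arts", 0.0015463014133274555),
    ]
}

kept = confident_answers(toy_prediction["answers"])
print([a.answer for a in kept])  # ['Transfiguration']
```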


