<a href="https://colab.research.google.com/github/BNkosi/Zeus/blob/master/Zeus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zeus.py

## Contents
1. First installation
2. Imports
3. Data
4. Data cleaning and Preprocessing
5. Retriever
6. Reader
7. Finder
8. Prediction

In [1]:
# First installation
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-mdfmf9d9
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-mdfmf9d9
Building wheels for collected packages: farm-haystack
  Building wheel for farm-haystack (setup.py) ... done
  Created wheel for farm-haystack: filename=farm_haystack-0.3.0-cp36-none-any.whl size=103022 sha256=9306ccc663bc6561085bb770c5937842de87ceee5e3fed1e006956b01fa9655d
  Stored in directory: /tmp/pip-ephem-wheel-cache-562li8l_/wheels/ab/41/a4/4fbf362de283352078ecb6705c08b6525347aaea2eead2a60c
Successfully built farm-haystack
Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [2]:
# Check whether a GPU is available (this run had none, so the models below are created with use_gpu=False)
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## Imports

In [6]:
# Minimum imports
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
from haystack.database.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

## Load Data

In [8]:
import os

def fetch_data_from_repo(doc_dir="data5/website_data/",
                         s3_url="https://github.com/Thabo-5/Chatbot-scraper/raw/master/txt_files.zip",
                         doc_store=None):
    """
    Download the zipped text files and index them in a document store.
    Parameters
    ----------
        doc_dir (str): path to destination folder
        s3_url (str): URL of the zipped data (S3 bucket or GitHub)
        doc_store (object): Haystack document store; a new FAISSDocumentStore is created if None
    Returns
    -------
        document_store (object): Haystack document store object
    """
    # Create the store at call time. A default of FAISSDocumentStore() in the signature
    # would be built once, when the function is defined, and shared by every call.
    document_store = doc_store if doc_store is not None else FAISSDocumentStore()
    fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
    # Rewrite every file as valid UTF-8, replacing bytes that cannot be decoded.
    # The with-statement closes each file, so no explicit close() is needed.
    for filename in os.listdir(path=doc_dir):
        file_path = os.path.join(doc_dir, filename)
        with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
            text = file.read()
        with open(file_path, 'w', encoding='utf-8', errors='replace') as file:
            file.write(text)
    # Convert files to dicts
    dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

    # Now, let's write the dicts containing documents to our DB.
    document_store.write_documents(dicts)
    return document_store
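
The read/write loop in `fetch_data_from_repo` relies on `errors='replace'` to turn undecodable bytes into the Unicode replacement character. A minimal sketch of that behaviour, using a temporary file and a made-up Latin-1 string rather than the scraped data (the file name `page.txt` is only for illustration):

```python
import os
import tempfile

# "café" encoded as Latin-1 ends in the byte 0xE9, which is not valid UTF-8 on its own
raw = "café".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.txt")  # hypothetical file name
    with open(path, "wb") as file:
        file.write(raw)

    # Reading with errors='replace' substitutes U+FFFD for the bad byte instead of raising
    with open(path, "r", encoding="utf-8", errors="replace") as file:
        text = file.read()

    # Writing the text back makes the file valid UTF-8 for later steps
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)

    with open(path, "r", encoding="utf-8") as file:
        cleaned = file.read()

print(repr(cleaned))
```

The original character is lost in the process, so this trades fidelity for files that can always be read as UTF-8.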

In [9]:
document_store = fetch_data_from_repo()

08/28/2020 18:09:34 - INFO - haystack.indexing.utils -   Found data stored in `data5/website_data/`. Delete this first if you really want to fetch new data.


## Initialize Retriever, Reader and Finder

In [16]:
def initFinder():
    """
    Function to initiate retriever, reader and finder
    Parameters
    ----------
    Returns
    -------
        finder (object): Haystack finder
    """
    retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=False,
                                  embed_title=True,
                                  max_seq_len=256,
                                  batch_size=16,
                                  remove_sep_tok_from_untitled_passages=True)
    # Important:
    # Now that the DPR retriever is initialized, call update_embeddings() to iterate over all
    # previously indexed documents and update their embedding representations.
    # This can be time-consuming (depending on corpus size), but it only needs to be done once.
    # At query time, only the query is embedded and compared with the existing document embeddings, which is very fast.
    document_store.update_embeddings(retriever)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
    return Finder(reader, retriever)
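
The comments in `initFinder` describe the dense-retrieval pattern: embed every document once, then embed only the query at search time and compare it with the stored vectors. A minimal sketch of that idea follows. It does not use Haystack or DPR; `toy_embed`, `ToyDenseIndex`, the vocabulary and the two documents are hypothetical stand-ins, and the word-count vectors only imitate what a learned encoder would produce.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; a real DPR encoder produces learned dense vectors instead
VOCAB = ["course", "data", "engineering", "apply", "website", "science"]

def toy_embed(text: str) -> List[float]:
    """Stand-in for a neural encoder: count how often each vocabulary word occurs."""
    tokens = text.lower().replace("?", " ").replace(".", " ").split()
    return [float(tokens.count(word)) for word in VOCAB]

def dot(a: List[float], b: List[float]) -> float:
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

class ToyDenseIndex:
    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = docs
        self.embeddings: Dict[str, List[float]] = {}

    def update_embeddings(self) -> None:
        """One-off, corpus-sized step: embed and store every document."""
        self.embeddings = {name: toy_embed(text) for name, text in self.docs.items()}

    def query(self, question: str, top_k: int = 1) -> List[Tuple[str, float]]:
        """Per-query step: embed the question only, then rank the stored vectors."""
        q = toy_embed(question)
        scored = sorted(((dot(q, emb), name) for name, emb in self.embeddings.items()), reverse=True)
        return [(name, score) for score, name in scored[:top_k]]

docs = {
    "data-engineering.txt": "The data engineering course covers data pipelines.",
    "faq.txt": "All applications happen via our website.",
}
index = ToyDenseIndex(docs)
index.update_embeddings()
results = index.query("When is the next data engineering course?", top_k=1)
print(results)
```

The split between `update_embeddings` and `query` is the point of the sketch: the expensive pass over the corpus happens once, while each question costs one embedding plus a comparison per stored vector.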

In [17]:
finder = initFinder()

08/28/2020 18:18:42 - INFO - haystack.database.faiss -   Updating embeddings for 38 docs ...
08/28/2020 18:19:15 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/28/2020 18:19:15 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/28/2020 18:19:29 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/28/2020 18:19:29 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/28/2020 18:19:29 - INFO - farm.infer -    0 
08/28/2020 18:19:29 - INFO - farm.infer -   /w\
08/28/2020 18:19:29 - INFO - farm.infer -   /'\
08/28/2020 18:19:29 - INFO - farm.infer -   




In [23]:
# top_k_retriever sets how many documents the retriever passes to the reader; top_k_reader sets how many answers are returned.
# A higher top_k_retriever tends to give better answers, but the reader has more text to process and gets slower.
prediction = finder.get_answers(question="When is the next data engineering course?", top_k_retriever=10, top_k_reader=5)

08/28/2020 19:09:30 - INFO - haystack.finder -   Reader is looking for detailed answer in 296427 chars ...
Inferencing Samples: 100%|██████████| 2/2 [00:49<00:00, 25.00s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:06<00:00, 33.10s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:05<00:00, 32.90s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:16<00:00, 38.08s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:16<00:00, 16.59s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:15<00:00, 15.71s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:12<00:00, 12.31s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:05<00:00, 32.91s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:07<00:00,  7.48s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.78s/ Batches]
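
The log above shows ten "Inferencing Samples" passes for `top_k_retriever=10`, which suggests the reader runs once per retrieved document. A minimal sketch of that two-stage flow, assuming a word-overlap score in place of both models (`retrieve_then_read` and the sample documents are hypothetical and not part of Haystack):

```python
from typing import List, Tuple

def retrieve_then_read(question: str, docs: List[str],
                       top_k_retriever: int, top_k_reader: int) -> Tuple[List[Tuple[str, int]], int]:
    """Toy two-stage pipeline: a cheap ranking step followed by a per-document reading step."""
    words = set(question.lower().split())

    # Stage 1 (retriever): rank all documents by word overlap and keep the best top_k_retriever
    ranked = sorted(docs, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)
    candidates = ranked[:top_k_retriever]

    # Stage 2 (reader): examine every candidate; this is the slow part with a real model
    answers = []
    reader_passes = 0
    for doc in candidates:
        reader_passes += 1
        answers.append((doc, len(words & set(doc.lower().split()))))

    # Keep only the top_k_reader highest-scoring answers
    answers.sort(key=lambda pair: pair[1], reverse=True)
    return answers[:top_k_reader], reader_passes

docs = [
    "data engineering course dates",
    "data science course dates",
    "how to apply",
    "campus locations",
    "fees and funding",
]
answers, passes = retrieve_then_read("next data engineering course", docs,
                                     top_k_retriever=3, top_k_reader=1)
print(answers, passes)
```

Raising `top_k_retriever` gives the reader more candidates to choose from and adds one reader pass per extra document, which is the quality versus speed trade-off described in the cell comment.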


In [24]:
print_answers(prediction, details="all")

{   'answers': [   {   'answer': 'Jul 05, 2021',
                       'context': 'ascience.net/long-courses/data-engineering\n'
                                  'Paragraph: Jan 13, 2021 - Jul 05, 2021\n'
                                  'Paragraph: Applications will open again in '
                                  'the future\n'
                                  'H3: Explore th',
                       'document_id': '3625cb68-e605-428f-bfd2-2d3477d516c8',
                       'meta': {   'name': 'data-engineering.txt',
                                   'vector_id': '23'},
                       'offset_end': 81,
                       'offset_end_in_doc': 24992,
                       'offset_start': 69,
                       'offset_start_in_doc': 24980,
                       'probability': 0.7017886766075356,
                       'score': 6.846639633178711},
                   {   'answer': '2021',
                       'context': 'e before the start of the course\n'
  

In [27]:
prediction['answers'][0]

{'answer': 'Jul 05, 2021',
 'context': 'ascience.net/long-courses/data-engineering\nParagraph: Jan 13, 2021 - Jul 05, 2021\nParagraph: Applications will open again in the future\nH3: Explore th',
 'document_id': '3625cb68-e605-428f-bfd2-2d3477d516c8',
 'meta': {'name': 'data-engineering.txt', 'vector_id': '23'},
 'offset_end': 81,
 'offset_end_in_doc': 24992,
 'offset_start': 69,
 'offset_start_in_doc': 24980,
 'probability': 0.7017886766075356,
 'score': 6.846639633178711}
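
In this result, `offset_start` and `offset_end` appear to be character positions of the answer inside `context`. A short check using the values copied from the output above (only the relevant keys are reproduced; the full document text is not available here, so the `*_in_doc` offsets are compared by span length only):

```python
# Fields copied from the first answer printed above
answer = {
    "answer": "Jul 05, 2021",
    "context": "ascience.net/long-courses/data-engineering\nParagraph: Jan 13, 2021 - Jul 05, 2021\nParagraph: Applications will open again in the future\nH3: Explore th",
    "offset_start": 69,
    "offset_end": 81,
    "offset_start_in_doc": 24980,
    "offset_end_in_doc": 24992,
}

# Slicing the context with the offsets recovers the answer string
span = answer["context"][answer["offset_start"]:answer["offset_end"]]
print(span)

# Both offset pairs describe a span of the same length
context_len = answer["offset_end"] - answer["offset_start"]
doc_len = answer["offset_end_in_doc"] - answer["offset_start_in_doc"]
print(context_len, doc_len)
```

This makes it possible to highlight the answer within its context window when displaying results.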

In [40]:
def getAnswers(retrieve=3, read=5, num_answers=1):
    """
    Simple console chat loop: answer questions until the user types "bye".
    """
    while True:
        query = input("You: ")
        if query == "bye":
            print("Goodbye!")
            break
        prediction = finder.get_answers(question=query, top_k_retriever=retrieve, top_k_reader=read)
        # Slice instead of indexing so that fewer than num_answers results cannot raise an IndexError
        for answer in prediction['answers'][:num_answers]:
            print(f"\nAnswer\t: {answer['answer']}")
            print(f"Context\t: {answer['context']}")
            print(f"Document name\t: {answer['meta']['name']}")
            print(f"Probability\t: {answer['probability']}\n\n")

In [39]:
getAnswers()

You: When  is the next data science course?


08/28/2020 19:58:20 - INFO - haystack.finder -   Reader is looking for detailed answer in 102356 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:18<00:00, 18.06s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:10<00:00, 35.19s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:49<00:00, 24.98s/ Batches]


Answer	: Jul 06, 2020

Context	: ip yourself for the future. Learn the skills that matter.
Paragraph: Jul 06, 2020 - Sep 30, 2020
Paragraph: Applications will open again in the future

Document name	: datascience-for-highschool.txt

Probability	: 0.6954488987506536


You: When is the next data engineering course?


08/28/2020 20:01:59 - INFO - haystack.finder -   Reader is looking for detailed answer in 140677 chars ...
Inferencing Samples: 100%|██████████| 2/2 [00:48<00:00, 24.44s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:07<00:00, 33.51s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:05<00:00, 32.74s/ Batches]


Answer	: Jul 05, 2021

Context	: ascience.net/long-courses/data-engineering
Paragraph: Jan 13, 2021 - Jul 05, 2021
Paragraph: Applications will open again in the future
H3: Explore th

Document name	: data-engineering.txt

Probability	: 0.7017886766075356


You: how do i apply?


08/28/2020 20:05:19 - INFO - haystack.finder -   Reader is looking for detailed answer in 12646 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:10<00:00, 10.18s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.62s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.44s/ Batches]


Answer	: via our website

Context	:  long). Competition for these places is 
A: All applications happen via our website. Select the course and location (on campus or online) which works 

Document name	: faq.txt

Probability	: 0.7067321198226408


You: bye
Goodbye!
