<a href="https://colab.research.google.com/github/BNkosi/Zeus/blob/master/Zeus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zeus.py

## Contents
1. First installation
2. Imports
3. Data
4. Data cleaning and Preprocessing
5. Retriever
6. Reader
7. Finder
8. Prediction

In [None]:
# First installation
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [3]:
# Make sure you have a GPU running
!nvidia-smi

Mon Aug 31 08:48:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Imports

In [4]:
# Minimum imports
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
from haystack.database.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

08/31/2020 08:48:40 - INFO - faiss -   Loading faiss with AVX2 support.
08/31/2020 08:48:40 - INFO - faiss -   Loading faiss.


## Load Data

In [5]:
def fetch_data_from_repo(doc_dir="data5/website_data/",
                         s3_url="https://github.com/Thabo-5/Chatbot-scraper/raw/master/txt_files.zip",
                         doc_store=None):
    """
    Download zipped text files from an S3 bucket or GitHub and index them
    Parameters
    ----------
        doc_dir (str): path to destination folder
        s3_url (str): URL of the zipped data
        doc_store (object): Haystack document store; a new FAISSDocumentStore is created if None
    Returns
    -------
        document_store (object): Haystack document store object
    """
    import os
    # Create the store at call time: a default of FAISSDocumentStore() in the
    # signature would be built once at definition time and shared by every call.
    document_store = FAISSDocumentStore() if doc_store is None else doc_store
    fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
    # Re-save each file as valid UTF-8, replacing any undecodable bytes.
    for filename in os.listdir(path=doc_dir):
        path = os.path.join(doc_dir, filename)
        with open(path, 'r', encoding='utf-8', errors='replace') as file:
            text = file.read()
        with open(path, 'w', encoding='utf-8') as file:
            file.write(text)
    # Convert files to dicts
    dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

    # Write the dicts containing the documents to the document store.
    document_store.write_documents(dicts)
    return document_store
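
The read-then-rewrite loop in the function above exists so that later steps only see valid UTF-8. A minimal standalone sketch of that one step, using only the standard library, is shown here. The helper name `normalize_to_utf8` and the sample file are illustrative assumptions and are not part of the notebook.

```python
import os
import tempfile


def normalize_to_utf8(doc_dir):
    """Rewrite every regular file in doc_dir as valid UTF-8 (hypothetical helper).

    Undecodable bytes are replaced with U+FFFD. Returns the rewritten file names.
    """
    rewritten = []
    for filename in sorted(os.listdir(doc_dir)):
        path = os.path.join(doc_dir, filename)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        rewritten.append(filename)
    return rewritten


# Usage on a throwaway directory with one badly encoded file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.txt"), "wb") as f:
        f.write(b"caf\xe9 menu")  # a lone 0xE9 byte is not valid UTF-8
    names = normalize_to_utf8(tmp)
    with open(os.path.join(tmp, "sample.txt"), encoding="utf-8") as f:
        cleaned = f.read()

print(names, repr(cleaned))
```

The `with` statement closes each file on exit, so no explicit `close()` call is needed.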

In [6]:
document_store = fetch_data_from_repo()

08/31/2020 08:48:49 - INFO - haystack.indexing.utils -   Fetching from https://github.com/Thabo-5/Chatbot-scraper/raw/master/txt_files.zip to `data5/website_data/`
100%|██████████| 102378/102378 [00:00<00:00, 3047150.55B/s]


## Initialize Retriever, Reader and Finder

In [7]:
def initFinder():
    """
    Initialize the retriever, reader and finder
    Parameters
    ----------
        None (uses the module-level document_store created above)
    Returns
    -------
        finder (object): Haystack finder
    """
    retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=False,
                                  embed_title=True,
                                  max_seq_len=256,
                                  batch_size=16,
                                  remove_sep_tok_from_untitled_passages=True)
    # Important:
    # Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
    # previously indexed documents and update their embedding representation.
    # This can be time-consuming (depending on corpus size), but it only needs to be done once.
    # At query time, we only embed the query and compare it to the existing doc embeddings, which is very fast.
    document_store.update_embeddings(retriever)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
    return Finder(reader, retriever)
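
The comment above says that, at query time, only the query is embedded and compared with the stored document embeddings. A toy sketch of that comparison step is shown here, ranking documents by dot product. The three-dimensional vectors are made-up stand-ins for real DPR embeddings, and the helper names `dot` and `top_k` are illustrative; in the notebook the actual search is delegated to the FAISS document store.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents whose vectors score highest against the query."""
    scored = [(dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Precomputed once (the role update_embeddings() plays in the notebook).
# These values are invented for illustration only.
doc_vecs = {
    "faq.txt": [0.9, 0.1, 0.0],
    "data-science.txt": [0.1, 0.8, 0.3],
    "data-engineering.txt": [0.0, 0.3, 0.9],
}

# Computed per query (the role of the question encoder); also invented.
query_vec = [0.2, 0.9, 0.1]

ranked = top_k(query_vec, doc_vecs, k=2)
print(ranked)
```

Because the document vectors are computed ahead of time, the per-query cost is one query embedding plus the similarity search.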

In [8]:
finder = initFinder()

08/31/2020 08:50:37 - INFO - haystack.database.faiss -   Updating embeddings for 38 docs ...
	nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
	nonzero(Tensor input, *, bool as_tuple)
08/31/2020 08:51:10 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/31/2020 08:51:10 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/31/2020 08:51:48 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/31/2020 08:51:48 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/31/2020 08:51:48 - INFO - farm.infer -    0 
08/31/2020 08:51:48 - INFO - farm.infer -   /w\
08/31/2020 08:51:48 - INFO - farm.infer -   /'\
08/31/2020 08:51:48 - INFO - farm.infer -   


In [9]:
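def getAnswers(retrieve=3, read=5, num_answers=1):
    """
    Interactive question loop; type "bye" to exit
    Parameters
    ----------
        retrieve (int): number of documents the retriever passes to the reader
        read (int): number of answers the reader returns
        num_answers (int): number of answers to print per question
    """
    while True:
        query = input("You: ")
        if query == "bye":
            print("Goodbye!")
            break
        prediction = finder.get_answers(question=query, top_k_retriever=retrieve, top_k_reader=read)
        # Slice instead of indexing so that fewer than num_answers results cannot raise an IndexError.
        for answer in prediction['answers'][:num_answers]:
            print(f"\nAnswer\t: {answer['answer']}")
            print(f"Context\t: {answer['context']}")
            print(f"Document name\t: {answer['meta']['name']}")
            print(f"Probability\t: {answer['probability']}\n\n")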
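
The loop above reads four fields from each answer: `answer`, `context`, `meta['name']` and `probability`. The formatting can be tried without loading a model by feeding it a hand-built dictionary of the same shape. This is a sketch under that assumption: `format_answers` is a hypothetical helper, and the sample values are copied from the `faq.txt` answer in the transcript that follows.

```python
def format_answers(prediction, num_answers=1):
    """Return one formatted block per answer, at most num_answers of them."""
    formatted = []
    # Slicing tolerates a prediction with fewer answers than requested.
    for ans in prediction.get("answers", [])[:num_answers]:
        meta = ans.get("meta") or {}
        formatted.append(
            f"Answer\t: {ans['answer']}\n"
            f"Context\t: {ans['context']}\n"
            f"Document name\t: {meta.get('name', 'unknown')}\n"
            f"Probability\t: {ans['probability']:.3f}"
        )
    return formatted


# Hand-built stand-in for the dictionary returned by finder.get_answers().
sample = {
    "answers": [
        {
            "answer": "via our website",
            "context": "A: All applications happen via our website.",
            "meta": {"name": "faq.txt"},
            "probability": 0.7067321198226408,
        }
    ]
}

blocks = format_answers(sample, num_answers=3)  # asks for 3, only 1 is available
print("\n\n".join(blocks))
```

Returning strings instead of printing keeps the formatting separate from the input loop, which makes it easier to check.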

In [None]:
getAnswers()

You: When  is the next data science course?


08/28/2020 19:58:20 - INFO - haystack.finder -   Reader is looking for detailed answer in 102356 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:18<00:00, 18.06s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:10<00:00, 35.19s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:49<00:00, 24.98s/ Batches]


Answer	: Jul 06, 2020

Context	: ip yourself for the future. Learn the skills that matter.
Paragraph: Jul 06, 2020 - Sep 30, 2020
Paragraph: Applications will open again in the future

Document name	: datascience-for-highschool.txt

Probability	: 0.6954488987506536


You: When is the next data engineering course?


08/28/2020 20:01:59 - INFO - haystack.finder -   Reader is looking for detailed answer in 140677 chars ...
Inferencing Samples: 100%|██████████| 2/2 [00:48<00:00, 24.44s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:07<00:00, 33.51s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:05<00:00, 32.74s/ Batches]


Answer	: Jul 05, 2021

Context	: ascience.net/long-courses/data-engineering
Paragraph: Jan 13, 2021 - Jul 05, 2021
Paragraph: Applications will open again in the future
H3: Explore th

Document name	: data-engineering.txt

Probability	: 0.7017886766075356


You: how do i apply?


08/28/2020 20:05:19 - INFO - haystack.finder -   Reader is looking for detailed answer in 12646 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:10<00:00, 10.18s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.62s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.44s/ Batches]


Answer	: via our website

Context	:  long). Competition for these places is 
A: All applications happen via our website. Select the course and location (on campus or online) which works 

Document name	: faq.txt

Probability	: 0.7067321198226408


You: bye
Goodbye!


In [None]:
getAnswers(5,3,1)

You: When is the next data science course?


08/31/2020 08:53:14 - INFO - haystack.finder -   Reader is looking for detailed answer in 213380 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:19<00:00, 19.33s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:07<00:00, 33.55s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:48<00:00, 24.16s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:17<00:00, 38.79s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [01:07<00:00, 33.55s/ Batches]



Answer	: Jul 06, 2020 - Jun 30, 2021
Context	: roblems using the latest advances in Data Science.
Paragraph: Jul 06, 2020 - Jun 30, 2021
Paragraph: Applications will open again in the future
H3: Jo
Document name	: data-science.txt
Probability	: 0.7106661747683017


