<a href="https://colab.research.google.com/github/BNkosi/Zeus/blob/master/Zeus_starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zeus - a closed domain question answering chatbot

<img src ="https://infobush.com/wp-content/uploads/2020/01/Facts-About-Zeus.jpg" align="left">

**Team Members**: B. Nkosi, K. Galela, K. Mnguni, N. Msibi, N. Magudulela, O. Mkhuhlane, T. Muthego, V. Mthemba

* [Model Repo](https://github.com/BNkosi/Zeus)
* [Data Repo]
* [Model Trello Board](https://trello.com/b/pNaYe3pe/zeus)
* [Data Trello Board]

## Table of Contents
---
1. [Introduction](#intro)
  * Background
  * Problem Statement
---
2. [Imports](#imports)
  * Libraries
  * Data
---
3. [Modelling](#model)
  * Basic QA Pipeline
  * Model selection
  * Retriever selection
  * Fine-tuning
---
4. [Evaluation](#evaluation)
---
5. [Model Analysis](#analysis)
  * Results
---
6. [Conclusion](#conclusion)
---
7. [References](#ref)
  


<a id="intro"></a>
## 1. Introduction
### Background

The EDSA QA chatbot has two parts: the Onboarding Chatbot and Zeus.

### Onboarding Chatbot

Your co-pilot when you join the company or step into a new role.

We all have access to the internet and the majority of us have become very good at using search engines to find the kind of information we need. The world is also slowly moving toward interactive search functionality such as chatbots.

Many corporates have built chatbots with Q&A functionality to improve customer experience and rapidly return the kinds of basic answers that customers need to make their product purchasing decisions. However, this technology appears to be vastly underutilized within the internal staff structures of companies. Most corporate staff work within a team and division, and their intimate knowledge of company policies, procedures, best practices and tools falls within the domain of that team or division. If they need to work on something that is usually foreign to their role, they have to ask a member of another team or division.

This either takes the form of an email (cumbersome and often ignored) or of pestering someone in person (distracting and annoying at times). This tool would be trained on the company's internal policies, regulations, best practices, corporate communications, tool documentation, etc. to learn how best to answer questions that staff may have about things they are not intimately familiar with but need near real-time feedback on to continue with their own projects.

### Zeus

Answers domain-specific questions to aid the user when they need it.

Covid-19 brings new challenges to the education sector in that learning can no longer take place face-to-face. A key tool in education is learner engagement in the form of live questions and answers, with the instructor able to gauge whether they’ve lost their audience and re-explain a key concept in different ways. A key skill of an instructor is to be able to support their answers with reference to principles and provide relatable examples.

Company X would like to build a virtual instructor that can answer complex questions accompanying their online media content. This tool would answer a content viewer’s questions even though they are not viewing live, and would eventually stand alone as a student resource. The tool would be trained on student handbooks, past exams, suggested solutions, video content, etc. in order to present an answer to the student’s question with extracts from key texts and examples.

### Problem Statement

1. Build a closed-domain question answering pipeline to answer textual questions.

2. Train the pipeline on:
  * Company internal policies, regulations, best practices, corporate communications, etc. - Onboarding
  * EDSA problem statement, preprocessing, trains and other external resources.

<a id="imports"></a>
## Imports

### Libraries

In [3]:
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-_br7dgo_
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-_br7dgo_
Collecting farm==0.4.6
  Downloading https://files.pythonhosted.org/packages/e2/93/1beb613753a9845b689eee4571ba4a7f3210b60b4bd90f024fc324c96785/farm-0.4.6-py3-none-any.whl (184kB)
Collecting fastapi
  Downloading https://files.pythonhosted.org/packages/82/cb/96cb7cc6a807af493f0083e7d854fdd568ae5335f8f93b96c966fabd8d2f/fastapi-0.61.0-py3-none-any.whl (48kB)
Collecting uvicorn
  Downloading https://files.pythonhosted.org/packages/32/9a/5f619c02f36e751071c2b7eaa37a7c4b767feb41e4c2de48e8fbe4e7b451/uvicorn-0.11.8-py3-none-any.whl (43kB)
Collecting

In [4]:
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## Document Store

Haystack finds the answers to queries within the documents stored in a DocumentStore. The current implementations include:
* ElasticsearchDocumentStore;
* SQLDocumentStore; and
* InMemoryDocumentStore

It is recommended to use Elasticsearch as it comes with additional features. We may try the other stores in later implementations as the SQLDocumentStore may be most effective for deploying the solution to an RDS instance.

In [5]:
# Initializing Elasticsearch on a local machine
# ! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

In [6]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

In [7]:
# Connect to Elasticsearch

from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

08/22/2020 17:39:51 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.473s]
08/22/2020 17:39:51 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.249s]


## Cleaning & indexing documents

Haystack provides a customizable cleaning and indexing pipeline for ingesting documents in Document Stores.

In [17]:
# path to documents
doc_dir = '/content/data/documents'

# Convert files to dicts
# Optional cleaning function here - input(str), output(str).
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=False)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "text": "<the-actual-text>"}
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Finder)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing the documents to our DB.
document_store.write_documents(dicts)

[{'text': '\ufeffScoring at EXPLORE - an explanation\nA number of you have had questions about your marks for each sprint and what they mean. This \ncommunication is meant to address those questions (and hopefully answer one or two you didn’t \nHow do I get marks and what do they mean?\nAt EXPLORE the Data Science qualification is aligned with SETA, meaning that when you have \nsuccessfully finished with the course, you will have an NQF 5 accredited qualification. Since the \ncourse is SETA accredited, you don’t "pass" or "fail". You are found either "competent" or "not \nyet competent" in a given area, or “unit standard”. The trains, tests and predicts are built to align \nIn each Sprint, there are a set number of belt points available. These points are divided among \nTrains, Tests and Predicts and align with the various unit standards. Your overall sprint score is \nan indication of the proportion of available belt points you have gained for that sprint (i.e. a score \nof 70% means 

08/22/2020 17:42:55 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.239s]
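
If the texts come from somewhere other than a directory of files, the dictionaries can be built by hand. The following is a minimal sketch, assuming the `{"name": ..., "text": ...}` format described in the cell comments above. The helper `to_haystack_dicts` and the sample `records` are hypothetical names used only for illustration and are not part of Haystack.

```python
def to_haystack_dicts(records):
    """Turn (name, text) pairs into the dict format described in the comments above."""
    dicts = []
    for name, text in records:
        # Drop a byte-order mark (as seen in the output above) and outer whitespace.
        cleaned = text.replace("\ufeff", "").strip()
        if cleaned:  # skip documents that are empty after cleaning
            dicts.append({"name": name, "text": cleaned})
    return dicts


# Hypothetical rows, e.g. fetched from a database.
records = [
    ("scoring.txt", "\ufeffScoring at EXPLORE - an explanation"),
    ("empty.txt", "   "),
]
dicts = to_haystack_dicts(records)
print(dicts)
```

A list built this way could then be passed to `document_store.write_documents(...)` in the same way as the output of `convert_files_to_dicts`.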


## Initialize Retriever, Reader, & Finder

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

**Here**: We will use Elasticsearch's default BM25 algorithm

**Alternatives**:
* Customize the ElasticsearchRetriever with custom queries (e.g. boosting) and filters.
* Use TfidfRetriever in combination with a SQL or InMemory Document store for simple prototyping and debugging.
* Use EmbeddingRetriever to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT).
* Use DensePassageRetriever to use different embedding models for passage and query.
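
To give a feel for what BM25 rewards, here is a small pure-Python sketch of the Okapi BM25 formula. It is an illustration only and not Elasticsearch's implementation, which also applies its own text analysis. The whitespace tokenizer, the toy documents and the function names are assumptions made for this example; `k1=1.2` and `b=0.75` are the commonly cited default parameters.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer, assumed for this sketch only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # only exact keyword matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalised through b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "belt points are cumulative across sprints",
    "the course is aligned with seta",
    "each sprint has a set number of belt points",
]
scores = bm25_scores("belt points", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(scores, best)
```

The second document shares no words with the query and scores zero, and of the two matching documents the shorter one ranks higher because of the length normalisation.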

In [18]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [19]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick-prototypes with SQLite document store.

# from haystack.retriever.sparse import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by retrievers in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers. With both you can either [load a local model](https://colab.research.google.com/drive/18Xvjo49WIOB2MhHre66OfLkdRXioVJvP#scrollTo=HDPEEBeJGnuz) or one from [Hugging Face's model hub](https://huggingface.co/models).

**Here**: a medium sized [RoBERTa QA](https://huggingface.co/deepset/roberta-base-squad2) model using a Reader based on FARM

**Alternatives (Reader)**: TransformersReader

**Alternatives (Models)**:
* distilbert-base-uncased-distilled-squad - FAST
* deepset/bert-large-uncased-whole-word-masking-squad2 - ACCURATE

The model's tendency to return "no answer possible" can be adjusted with the no_ans_boost parameter. Higher values mean the model prefers "no answer possible".
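
The effect of such a boost can be pictured with a toy decision rule. This is a sketch of the idea only, not FARM's actual implementation: the function `pick_answer`, the candidate scores and the comparison rule are all assumptions made for illustration.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span text, or None when "no answer" wins.

    span_candidates: list of (text, score) pairs (hypothetical values).
    The boost is simply added to the no-answer score before comparing.
    """
    best_text, best_score = max(span_candidates, key=lambda c: c[1])
    if no_answer_score + no_ans_boost > best_score:
        return None
    return best_text


candidates = [("Not at all!", 1.59), ("Not at all", 1.20)]
without_boost = pick_answer(candidates, no_answer_score=0.5)
with_boost = pick_answer(candidates, no_answer_score=0.5, no_ans_boost=2.0)
print(without_boost, with_boost)
```

With no boost the best span is returned. With a boost of 2.0 the "no answer" option outscores every span, so nothing is returned.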

#### FARMReader

In [20]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

08/22/2020 17:43:04 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/22/2020 17:43:04 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/22/2020 17:43:25 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/22/2020 17:43:25 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/22/2020 17:43:25 - INFO - farm.infer -    0 
08/22/2020 17:43:25 - INFO - farm.infer -   /w\
08/22/2020 17:43:25 - INFO - farm.infer -   /'\
08/22/2020 17:43:25 - INFO - farm.infer -   


#### TransformersReader

In [21]:
# Alternative:
# reader = TransformersReader(model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Finder

The Finder sticks together reader and retriever in a pipeline to answer our actual questions.

In [22]:
finder = Finder(reader, retriever)

## Now we can finally ask a question!

The number of candidates the retriever and reader return can be configured with top_k_retriever and top_k_reader.

The higher top_k_retriever, the better (but slower) your answers.
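
The trade-off can be seen in a toy retriever-reader pipeline. The sketch below is not Haystack code: `retrieve`, `read` and `get_answers` are hypothetical stand-ins that use simple word overlap in place of BM25 and a neural reader. It only shows that the reader never sees documents the retriever did not pass on.

```python
def retrieve(query, docs, top_k):
    """Toy retriever: rank documents by the number of query words they share."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_k]


def read(query, candidates, top_k):
    """Toy reader: score every sentence of the candidate documents."""
    q = set(query.lower().split())
    scored = []
    for doc in candidates:
        for sentence in doc.split(". "):
            overlap = len(q & set(sentence.lower().split()))
            scored.append((overlap, sentence))
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


def get_answers(query, docs, top_k_retriever=1, top_k_reader=3):
    candidates = retrieve(query, docs, top_k_retriever)
    return read(query, candidates, top_k_reader)


docs = [
    "Belt points are cumulative. You will not be kicked out if you do badly in a sprint",
    "The course is aligned with SETA. You are found competent or not yet competent",
]
query = "kicked out if I do badly"
narrow = get_answers(query, docs, top_k_retriever=1, top_k_reader=1)
wide = get_answers(query, docs, top_k_retriever=2, top_k_reader=3)
print(narrow)
print(wide)
```

A larger top_k_retriever gives the reader more text to scan, which costs time but lowers the risk that the passage holding the answer was filtered out.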

In [23]:
query = 'Will I be kicked out if I do badly?'

In [24]:
prediction = finder.get_answers(question=query, top_k_retriever=1, top_k_reader=3)

08/22/2020 17:43:25 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.048s]
08/22/2020 17:43:25 - INFO - haystack.retriever.sparse -   Got 1 candidates from retriever
08/22/2020 17:43:25 - INFO - haystack.finder -   Reader is looking for detailed answer in 4593 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:05<00:00,  5.62s/ Batches]


In [25]:
print_answers(prediction, details = 'all')

{   'answers': [   {   'answer': 'Not at all!',
                       'context': 'I fail the course? Will I \n'
                                  'be kicked out if I do badly in any sprint?\n'
                                  'Not at all! Belt points are cumulative, so '
                                  'if you do not have enough belt points',
                       'document_id': '97067a7f-4261-425c-bb82-c619b00b9f09',
                       'meta': {   'name': 'Scoring at EXPLORE an '
                                           'explanation.txt'},
                       'offset_end': 81,
                       'offset_end_in_doc': 1581,
                       'offset_start': 70,
                       'offset_start_in_doc': 1570,
                       'probability': 0.5493792648547936,
                       'score': 1.5853039026260376},
                   {   'answer': 'Not at all',
                       'context': 'I fail the course? Will I \n'
                                 

## Fine-tuning

Feedback can be gathered by production systems using Haystack's [REST API interface](https://github.com/deepset-ai/haystack#rest-api). This includes a customizable user feedback API for providing feedback on the answers returned. The API provides a feedback export endpoint to obtain the feedback data for further fine-tuning.

Once training data has been collected, base models can be fine-tuned. We initialize a reader with a base model and fine-tune it on our own custom SQuAD-like dataset.
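
For reference, a SQuAD-style file nests articles, paragraphs, questions and answers, and each answer carries a character offset into its paragraph. The sketch below builds one such record and checks that the offsets line up. The context is a shortened toy version of the scoring document, the id and title are hypothetical, and `check_offsets` is a helper written for this example. Whether `answers.json` follows this layout is not shown in this notebook.

```python
import json

# Shortened toy context; the real document text is longer.
context = ("Will I be kicked out if I do badly in any sprint? "
           "Not at all! Belt points are cumulative.")
answer_text = "Not at all!"

squad_like = {
    "version": "v2.0",
    "data": [{
        "title": "Scoring at EXPLORE",  # hypothetical title
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q1",  # hypothetical identifier
                "question": "Will I be kicked out if I do badly?",
                "is_impossible": False,
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer inside the context.
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }],
}


def check_offsets(dataset):
    """Return the ids of questions whose answer_start does not match the context."""
    bad = []
    for article in dataset["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    end = start + len(ans["text"])
                    if para["context"][start:end] != ans["text"]:
                        bad.append(qa["id"])
    return bad


mismatches = check_offsets(squad_like)
serialized = json.dumps(squad_like, indent=2)
print(mismatches)
```

Running a check like this over a labelled file before training is a cheap way to catch misaligned answers early.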

In [46]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
train_data = '/content/data/training'
reader.train(data_dir=train_data, train_filename="answers.json", use_gpu=True, n_epochs=1, save_dir="my_model")

08/22/2020 18:04:35 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/22/2020 18:04:35 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/22/2020 18:04:56 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/22/2020 18:04:56 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/22/2020 18:04:56 - INFO - farm.infer -    0 
08/22/2020 18:04:56 - INFO - farm.infer -   /w\
08/22/2020 18:04:56 - INFO - farm.infer -   /'\
08/22/2020 18:04:56 - INFO - farm.infer -   
08/22/2020 18:04:56 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
Preprocessing Dataset /content/data/training/answers.json:   0%|    

TypeError: ignored

In [None]:
# Saving the model happens automatically at the end of training into the `save_dir` you specified
# However, you could also save a reader manually again via:
reader.save(directory="my_model")

In [None]:
# If you want to load it at a later point, just do:
new_reader = FARMReader(model_name_or_path="my_model")

## Dense Passage Retrieval
Better retrievers are covered in the Haystack tutorial [here](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).


### Types of Retrievers

#### Sparse
This family of algorithms is based on counting the occurrences of words (bag-of-words), resulting in very sparse vectors with length = vocab size.

Examples:
* BM25
* TF-IDF

Pros: Simple, fast, well explainable

Cons: Relies on exact keyword matches between query and text
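
Both points can be seen in a tiny bag-of-words sketch. The two toy documents and the helper names are assumptions made for illustration; real retrievers such as BM25 or TF-IDF add term weighting on top of these raw counts.

```python
from collections import Counter


def bag_of_words(text, vocab):
    """Count vector with one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def overlap(query, doc_vec, vocab):
    """Dot product between the query's and the document's count vectors."""
    q_vec = bag_of_words(query, vocab)
    return sum(q * d for q, d in zip(q_vec, doc_vec))


docs = ["belt points are cumulative", "sprint score and belt points"]
vocab = sorted({word for doc in docs for word in doc.lower().split()})
doc_vecs = [bag_of_words(doc, vocab) for doc in docs]

exact = [overlap("cumulative", vec, vocab) for vec in doc_vecs]
synonym = [overlap("accumulated", vec, vocab) for vec in doc_vecs]
print(vocab)
print(doc_vecs)
print(exact, synonym)
```

Each vector has one entry per vocabulary word, so with a realistic vocabulary most entries are zero. The query "accumulated" matches nothing even though it is close in meaning to "cumulative".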

#### Dense
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two approaches:
* Single encoder: Use a single model to embed both query and passage.
* Dual-encoder: Use **two models**, one to embed the query and one to embed the passage.

Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax).

Examples:
* REALM
* DPR
* Sentence-Transformers

Pros: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics)

Cons: Computationally more heavy, initial training
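
The dual-encoder idea can be sketched with hand-written vectors standing in for the two trained models. Everything in this example is hypothetical: the lookup tables `QUERY_ENCODER` and `PASSAGE_ENCODER`, the three-dimensional embeddings and the use of cosine similarity are assumptions for illustration, and a real system would compute the embeddings with neural networks.

```python
import math

# Hypothetical, hand-written embeddings standing in for two trained encoders.
QUERY_ENCODER = {
    "will i fail?": [0.9, 0.1, 0.0],
}
PASSAGE_ENCODER = {
    "You are found competent or not yet competent.": [0.8, 0.2, 0.1],
    "Elasticsearch listens on port 9200.": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query):
    """Embed the query with one encoder and compare it to every passage embedding."""
    q_vec = QUERY_ENCODER[query]
    scored = [(cosine(q_vec, p_vec), passage)
              for passage, p_vec in PASSAGE_ENCODER.items()]
    return sorted(scored, reverse=True)


ranking = rank_passages("will i fail?")
print(ranking)
```

The query and the top passage share no keywords, yet their vectors point in a similar direction. That is the behaviour a sparse retriever cannot provide.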

## Evaluation

Evaluation is still to be added; see [here](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).

## Conclusion