<a href="https://colab.research.google.com/github/BNkosi/Zeus/blob/master/Zeus_starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Mounting gdrive and data*

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [2]:
path2docs = "/content/gdrive/My Drive/EDSA/Zeus/documents"
path2train = "gdrive/My Drive/EDSA/Zeus/EQuAD"

# Zeus - a closed domain question answering chatbot

<img src ="https://infobush.com/wp-content/uploads/2020/01/Facts-About-Zeus.jpg" align="left">

**Contributors**:
* B. Nkosi
* C. Pialat
* L. Loubser
* T. Ntsheke

* [Model Repo](https://github.com/BNkosi/Zeus)
* [Data Repo](https://github.com/Thabo-5/Chatbot-scraper)
* [Model Trello Board](https://trello.com/b/pNaYe3pe/zeus)
* [Data Trello Board](https://trello.com/b/xDWDHbWH/corpus-build)

## Table of Contents
---
1. [Introduction](#intro)
  * Background
  * Problem Statement
  * Value
---
2. [Imports](#imports)
  * Haystack
---
3. [Document Store](#data)
  * Data Cleaning and preprocessing
  * Document Store parameters
---
4. [Modelling](#model)
  * Basic QA Pipeline
  * Model selection
  * Retriever selection
  * Fine-tuning
---
5. [Evaluation](#evaluation)
---
6. [Model Analysis](#analysis)
  * Results
---
7. [Conclusion](#conclusion)
---
8. [References](#ref)
  


<a id="intro"></a>
## 1. Introduction
### Background

#### What is Zeus

Zeus is a user's co-pilot through your company's offering. It is a closed-domain chatbot that answers users' questions about your company, products or policies.

There are four identified types of people who engage with a company's information:
1. *Auditors* - generally want to perform due diligence procedures that may involve querying of data. An audit or a review may be required by Regulations 28 and 29 of the Companies Act or they may simply be conducted by customers or suppliers deciding whether to engage with your company.
2. *Investors* - may query data before making an investment decision.
3. *Customers* - may want to understand a product or service better. This could range from administrative queries to ongoing customer support.
4. *Employees* - who may require access to information to perform their duties.

Zeus will be trained on data for each of these people in order to give them increasing levels of access to information.

1. *Auditors and Investors* - [Explore websites](https://explore-datascience.net/) and other information required to be made public.
2. *Customers* - Product information from the EDSA website, preprocessing, trains and W3Schools.
3. *Employees* - Internal policies, regulations and best practices.

### Value
Zeus aims to reduce the need for staff engagement with customers while ensuring that customers' queries are addressed.

### Problem Statement

1. Build a closed-domain question answering pipeline to answer textual questions.

2. Train the pipeline on:
  * The company's internal policies, regulations, best practices, corporate communications, etc. - Onboarding
  * EDSA problem statement, preprocessing, trains and other external resources.

## First installation

<a id="imports"></a>
## Imports

### Libraries

In [None]:
# First installation
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [4]:
# Minimum imports
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## Document Store

Haystack finds the answers to queries within the documents stored in a DocumentStore. The current implementations include:
* ElasticsearchDocumentStore;
* SQLDocumentStore; and
* InMemoryDocumentStore

Elasticsearch is the recommended option because it comes with additional features such as [full text query](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html). More on these later.

However, for prototyping and debugging we will also enable a SQL document store to work efficiently on single documents. This will enable us to write custom cleaning functions for text data in order to make fine adjustments to our documents. Follow this [tutorial]() to set up a SQL/InMemory document store.

### Initialize Elasticsearch server

In [7]:
# Initializing Elasticsearch on a local machine
# ! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

In [5]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
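
A fixed `sleep 30` can be too short on a slow machine and wasteful on a fast one. A minimal sketch of an alternative, assuming Elasticsearch listens on `localhost:9200` as in the Docker command above, is to poll the port until it accepts connections. The helper name `wait_for_port` and its default values are illustrative and are not part of Haystack or Elasticsearch.

```python
import socket
import time


def wait_for_port(host="localhost", port=9200, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only shows the port is open,
            # not that the cluster has finished starting up.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` after starting the server would return as soon as the port opens, or `False` after the timeout.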

In [None]:
# Connect to Elasticsearch

from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

### Cleaning & indexing documents

Haystack provides a customizable cleaning and indexing pipeline for ingesting documents in Document Stores.

We need to write custom functions to clean the text files. They must take a str as an input and return a str.
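
As an illustration of that contract, here is a minimal sketch of a custom cleaner. The function name `clean_plain_text` and its rules (dropping the byte-order mark, trimming each line, removing empty lines) are assumptions chosen for the example, not the cleaning used elsewhere in this notebook.

```python
def clean_plain_text(text: str) -> str:
    """Take a str and return a str, as required for `clean_func`."""
    # Drop the byte-order mark that some editors write at the start of a file.
    text = text.replace("\ufeff", "")
    # Trim whitespace around each line and skip lines that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


print(clean_plain_text("\ufeffScoring at EXPLORE \n\n  an explanation  "))
```

A function like this could then be passed as `clean_func=clean_plain_text` to `convert_files_to_dicts`.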

In [14]:
# path to documents
doc_dir = path2docs

# Convert files to dicts
# Optional cleaning function here - input(str), output(str).
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=False)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "text": "<the-actual-text>"}
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Finder)

# Let's have a look at the first entry:
print(dicts[:1])

# Now, let's write the dicts containing the documents to our DB.
document_store.write_documents(dicts)

08/25/2020 06:05:24 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.172s]


[{'text': '\ufeffScoring at EXPLORE - an explanation\nA number of you have had questions about your marks for each sprint and what they mean. This \ncommunication is meant to address those questions (and hopefully answer one or two you didn’t \nHow do I get marks and what do they mean?\nAt EXPLORE the Data Science qualification is aligned with SETA, meaning that when you have \nsuccessfully finished with the course, you will have an NQF 5 accredited qualification. Since the \ncourse is SETA accredited, you don’t "pass" or "fail". You are found either "competent" or "not \nyet competent" in a given area, or “unit standard”. The trains, tests and predicts are built to align \nIn each Sprint, there are a set number of belt points available. These points are divided among \nTrains, Tests and Predicts and align with the various unit standards. Your overall sprint score is \nan indication of the proportion of available belt points you have gained for that sprint (i.e. a score \nof 70% means 

In [None]:
dicts[1]

In [10]:
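import re

def course_cleaner(course):
    # Print a file-name-style title for each section that precedes an "Overview" marker.
    # `course` is a list of text lines; course[0] and course[1] hold the page headings.
    for i in range(1, len(course)):
        if "Overview" in course[i]:
            title_ = re.sub(" ", "_", course[0] + '-' + course[1] + '-' + course[i-1]).lower()
            print(title_)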

NameError: ignored

In [121]:
for course in courses:
    title_ = re.sub(" ", "_", course[0] + '-' + course[1]).lower()
    # The context manager closes the file even if a write fails
    with open(f"/content/gdrive/My Drive/EDSA/Zeus/documents/website/{title_}.txt", "w+") as file:
        for line in course:
            file.write(line + '\n')

In [123]:
print(open('/content/gdrive/My Drive/EDSA/Zeus/documents/website/curriculum_detail-long_courses.txt').read())

Curriculum Detail
Long Courses
Data Science
---------------------------------------------------------------------------------------------Overview
Explore the ways to make an invaluable contribution to your
business.
Data Scientists bring a diverse set of skills to your business that help you make
data-driven decisions.
------------------------------------------------------------------------------------------Curriculum
1
Start thinking like a data scientist
This course will provide students with the knowledge, skills and experience to get
a job as a data scientist - which requires a mix of programming, statistical
understanding and the ability to apply both skills in new and challenging
domains. The course will teach students to gather data, visualise data, apply
statistical analysis to answer questions with that data and make their insights and
information as actionable as possible.
Fundamentals
Students will gain an overview of Data Science and the fundamental skills
required to be a 

In [98]:


short_courses

['Curriculum Detail',
 'Short Courses',
 'Machine Learning for Actuaries',
 '15',
 'Participants will master the model building process and the various machine',
 'learning algorithms to use when predicting or classifying insurance claims. For',
 'those not born programmers, there will be pre-course material to cover the',
 'basics of programming in python. To cover the whole machine learning spectrum',
 'in a few days is not possible, and therefore there will be post-course material to',
 'dive deeper into the more complex machine learning techniques - deep learning.',
 'Programming (Pre-course)',
 'programming simpler.',
 'Machine Learning Algorithms',
 'On day 1 participants will cover the whole model building process using python’s',
 'Sci-kit Learn library. They will start by pre-processing the data to be ready for',
 'modeling and then go through the machine learning algorithms for classifying',
 'whether someone will claim or not - algorithms will include (amongst others)',
 'lo

In [69]:
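# Write a cleaning function to clean text
import re

def clean_website(text):
    # Lower-case first, so every pattern below must be written in lower case
    text = text.lower()
    # Remove bullet symbols and free-standing dashes
    text = re.sub(r"★ ", "", text)
    text = re.sub(r"● ", "", text)
    text = re.sub(r" - ", " ", text)
    # Expand contractions and abbreviations
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\bfaq\b", "frequently asked questions", text)
    # Strip the remaining punctuation
    text = re.sub(r"[-()\"#/@;:<>{}`+=~\.!,]", "", text)
    return text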

In [70]:
print(clean_website(text1))

curriculum detail
long courses
data science

overview
explore the ways to make an invaluable contribution to your
business
data scientists bring a diverse set of skills to your business that help you make
datadriven decisions

curriculum
start thinking like a data scientist
this course will provide students with the knowledge skills and experience to get
a job as a data scientist which requires a mix of programming statistical
understanding and the ability to apply both skills in new and challenging
domains the course will teach students to gather data visualise data apply
statistical analysis to answer questions with that data and make their insights and
information as actionable as possible

fundamentals
students will gain an overview of data science and the fundamental skills
required to be a data scientist they will learn how to clean analyse and visualise
data as well as how to effectively communicate the findings to drive actionable
interventions the tools to be used will be pyth

In [47]:

# Convert files to dicts
# Optional cleaning function here - input(str), output(str).
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=None, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "text": "<the-actual-text>"}
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Finder)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing the documents to our DB.
document_store.write_documents(dicts)

[{'text': '\ufeffScoring at EXPLORE - an explanation\nDear EXPLORERs,\nA number of you have had questions about your marks for each sprint and what they mean. This \ncommunication is meant to address those questions (and hopefully answer one or two you didn’t \nknow you had).\n \nHow do I get marks and what do they mean?\nAt EXPLORE the Data Science qualification is aligned with SETA, meaning that when you have \nsuccessfully finished with the course, you will have an NQF 5 accredited qualification. Since the \ncourse is SETA accredited, you don’t "pass" or "fail". You are found either "competent" or "not \nyet competent" in a given area, or “unit standard”. The trains, tests and predicts are built to align \nwith these unit standards.\nIn each Sprint, there are a set number of belt points available. These points are divided among \nTrains, Tests and Predicts and align with the various unit standards. Your overall sprint score is \nan indication of the proportion of available belt poin

08/24/2020 23:32:43 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.347s]


## Initialize Retriever, Reader, & Finder

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

**Here**: We will use Elasticsearch's default BM25 algorithm.

**Alternatives**:
* Customize the ElasticsearchRetriever with custom queries (e.g. boosting) and filters.
* Use TfidfRetriever in combination with a SQL or InMemory document store for simple prototyping and debugging.
* Use EmbeddingRetriever to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT).
* Use DensePassageRetriever to use different embedding models for passage and query.
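
To make the BM25 idea concrete, the following is a simplified pure-Python sketch of BM25 scoring. It is not Elasticsearch's implementation: tokenization is a plain whitespace split, the example documents are invented, and the parameters `k1=1.2` and `b=0.75` are commonly used defaults assumed here for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is length-normalized with b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


docs = [
    "belt points are cumulative across sprints",
    "the data science course covers machine learning",
    "sprint scores show the proportion of belt points gained",
]
scores = bm25_scores("belt points", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no term with the query score zero, which is the exact-keyword limitation that the dense alternatives above try to address.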

In [9]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [10]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick-prototypes with SQLite document store.

# from haystack.retriever.sparse import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by retrievers in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers. With both you can either [load a local model](https://colab.research.google.com/drive/18Xvjo49WIOB2MhHre66OfLkdRXioVJvP#scrollTo=HDPEEBeJGnuz) or one from [Hugging Face's model hub](https://huggingface.co/models).

**Here**: a medium-sized [RoBERTa QA](https://huggingface.co/deepset/roberta-base-squad2) model using a Reader based on FARM

**Alternatives (Reader)**: TransformersReader

**Alternatives (Models)**:

We will test the top 5 most downloaded large models from Hugging Face that have been trained on a SQuAD dataset, on the assumption that the most downloaded models are the better ones.

* distilbert-base-cased-distilled-squad - FAST
* deepset/roberta-base-squad2
* distilbert-base-uncased-distilled-squad
* bert-large-uncased-whole-word-masking-finetuned-squad
* deepset/bert-large-uncased-whole-word-masking-squad2 - ACCURATE

The model can be adjusted to return "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
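
The effect of such a boost can be pictured with a toy decision rule. This is only a sketch of the idea, not FARM's internal code: the function name `pick_answer` and the scores are made up, and the assumption is simply that the boost is added to the no-answer score before it is compared with the best span score.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span text, or None when "no answer" wins after boosting.

    span_candidates: list of (answer_text, score) pairs with hypothetical scores.
    """
    best_text, best_score = max(span_candidates, key=lambda pair: pair[1])
    # A larger boost makes the "no answer" option harder to beat.
    if no_answer_score + no_ans_boost >= best_score:
        return None
    return best_text


candidates = [("Not at all!", 1.6), ("Not at all", 1.2)]
print(pick_answer(candidates, no_answer_score=0.5))                    # the span wins
print(pick_answer(candidates, no_answer_score=0.5, no_ans_boost=2.0))  # "no answer" wins
```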

### FARMReader

In [11]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

model_names = [
    "distilbert-base-cased-distilled-squad",
    "deepset/roberta-base-squad2",
    "distilbert-base-uncased-distilled-squad",
    "bert-large-uncased-whole-word-masking-finetuned-squad",
    # "bert-large-cased-whole-word-masking-finetuned-squad",
    # "deepset/bert-base-cased-squad2",
    # "deepset/xlm-roberta-large-squad2",
    # "ktrapeznikov/albert-xlarge-v2-squad-v2",
    "deepset/bert-large-uncased-whole-word-masking-squad2",
]

# One FARMReader per model name, all running on CPU
readers = {name: FARMReader(model_name_or_path=name, use_gpu=False) for name in model_names}


08/24/2020 19:12:59 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/24/2020 19:12:59 - INFO - farm.infer -   Could not find `distilbert-base-cased-distilled-squad` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/24/2020 19:13:13 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/24/2020 19:13:13 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/24/2020 19:13:13 - INFO - farm.infer -    0 
08/24/2020 19:13:13 - INFO - farm.infer -   /w\
08/24/2020 19:13:13 - INFO - farm.infer -   /'\
08/24/2020 19:13:13 - INFO - farm.infer -   
08/24/2020 19:13:13 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/24/2020 19:13:13 - INFO - farm.infer -   Could not find `d

### TransformersReader

In [None]:
# Alternative:
# reader = TransformersReader(model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Finder

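The Finder ties the reader and retriever together in a pipeline to answer our actual questions.

In [None]:
# Pick one of the readers loaded above; here the RoBERTa model named in the Reader section
reader = readers["deepset/roberta-base-squad2"]
finder = Finder(reader, retriever)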

## Now we can finally ask a question!

The number of candidates that the retriever and reader return can be configured with `top_k_retriever` and `top_k_reader`.

The higher `top_k_retriever`, the better (but slower) your answers.
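
The funnel that these two parameters control can be sketched without Haystack. Everything below is a toy stand-in: `toy_retrieve` ranks documents by word overlap instead of BM25, and `toy_read` scores sentences by word overlap instead of running a neural reader. Only the parameter names `top_k_retriever` and `top_k_reader` mirror the call used in this notebook; the documents are invented.

```python
def overlap(question, text):
    """Count distinct lower-cased words shared by the question and the text."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def toy_retrieve(question, docs, top_k_retriever):
    """Keep the top_k_retriever documents with the highest word overlap."""
    ranked = sorted(docs, key=lambda doc: overlap(question, doc), reverse=True)
    return ranked[:top_k_retriever]


def toy_read(question, docs, top_k_reader):
    """Score every sentence of the retrieved documents and keep the best ones."""
    sentences = [s.strip() for doc in docs for s in doc.split(".") if s.strip()]
    ranked = sorted(sentences, key=lambda s: overlap(question, s), reverse=True)
    return ranked[:top_k_reader]


docs = [
    "Belt points are cumulative. You will not be kicked out for one bad sprint.",
    "The course covers Python. It also covers SQL.",
]
question = "Will I be kicked out"
candidates = toy_retrieve(question, docs, top_k_retriever=1)
answers = toy_read(question, candidates, top_k_reader=1)
print(answers)
```

Raising `top_k_retriever` hands the reader more text to search, which is why answers can improve while run time grows.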

In [None]:
query = 'Will I be kicked out if I do badly?'

In [None]:
prediction = finder.get_answers(question=query, top_k_retriever=1, top_k_reader=3)

08/24/2020 09:34:12 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.019s]
08/24/2020 09:34:12 - INFO - haystack.retriever.sparse -   Got 1 candidates from retriever
08/24/2020 09:34:12 - INFO - haystack.finder -   Reader is looking for detailed answer in 4593 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:05<00:00,  5.40s/ Batches]


In [None]:
print_answers(prediction, details = 'all')

{   'answers': [   {   'answer': 'Not at all!',
                       'context': 'I fail the course? Will I \n'
                                  'be kicked out if I do badly in any sprint?\n'
                                  'Not at all! Belt points are cumulative, so '
                                  'if you do not have enough belt points',
                       'document_id': '32ceeb24-8279-43ef-9726-c236821b1531',
                       'meta': {   'name': 'Scoring at EXPLORE an '
                                           'explanation.txt'},
                       'offset_end': 81,
                       'offset_end_in_doc': 1581,
                       'offset_start': 70,
                       'offset_start_in_doc': 1570,
                       'probability': 0.5493792648547936,
                       'score': 1.5853039026260376},
                   {   'answer': 'Not at all',
                       'context': 'I fail the course? Will I \n'
                                 

## Fine-tuning

Production systems can gather feedback through Haystack's [REST API interface](https://github.com/deepset-ai/haystack#rest-api). This includes a customizable user feedback API for providing feedback on the answer returned. The API also provides a feedback export endpoint to obtain the feedback data for further fine-tuning.

Once training data has been collected, base models can be tuned. We initialize a base reader as a base model and fine-tune it on our own custom SQuAD-like dataset.
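
For orientation, a SQuAD-like training file has roughly the shape sketched below. The title, context and question are invented for illustration, and the helper `make_squad_example` is hypothetical; the key point is that `answer_start` is the character offset of the answer text inside `context`.

```python
import json


def make_squad_example(title, context, question, answer_text, qa_id):
    """Build one SQuAD-style record and compute the answer's character offset."""
    answer_start = context.find(answer_text)
    if answer_start == -1:
        raise ValueError("answer_text must appear verbatim in context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [{"text": answer_text, "answer_start": answer_start}],
                        "is_impossible": False,
                    }
                ],
            }
        ],
    }


example = make_squad_example(
    title="Scoring",
    context="Will I be kicked out if I do badly in any sprint? Not at all!",
    question="Will I be kicked out if I do badly?",
    answer_text="Not at all!",
    qa_id="q1",
)
dataset = {"version": "v2.0", "data": [example]}
print(json.dumps(dataset, indent=2)[:200])
```

Checking that every `answer_start` really points at its answer text is a cheap sanity test to run on a custom dataset before training.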

In [None]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
train_data = path2train
reader.train(data_dir=train_data, train_filename="answers.json", use_gpu=True, n_epochs=1, save_dir="my_model")

08/24/2020 09:35:18 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/24/2020 09:35:18 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/24/2020 09:35:33 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/24/2020 09:35:33 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/24/2020 09:35:33 - INFO - farm.infer -    0 
08/24/2020 09:35:33 - INFO - farm.infer -   /w\
08/24/2020 09:35:33 - INFO - farm.infer -   /'\
08/24/2020 09:35:33 - INFO - farm.infer -   
08/24/2020 09:35:33 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
Preprocessing Dataset gdrive/My Drive/EDSA/Zeus/EQuAD/answers.json: 

TypeError: ignored

In [None]:
# Saving the model happens automatically at the end of training into the `save_dir` you specified
# However, you could also save a reader manually again via:
reader.save(directory="my_model")

In [None]:
# If you want to load it at a later point, just do:
new_reader = FARMReader(model_name_or_path="my_model")

## Dense Passage Retrieval
Better retrievers are covered [here](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).


### Types of Retrievers

#### Sparse
This family of algorithms is based on counting the occurrences of words (bag-of-words), resulting in very sparse vectors with length = vocab size.

Examples:
* BM25
* TF-IDF

Pros: Simple, fast, well explainable

Cons: Relies on exact keyword matches between query and text
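
A small sketch of the bag-of-words idea, using plain Python and a whitespace tokenizer (both simplifications assumed for the example): every text becomes a count vector with one entry per vocabulary word, and most entries are zero once the vocabulary grows.

```python
def bag_of_words(texts):
    """Return the sorted vocabulary and one count vector per text."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocab)}
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocab)  # one slot per vocabulary word
        for token in tokens:
            vector[index[token]] += 1
        vectors.append(vector)
    return vocab, vectors


vocab, vectors = bag_of_words(["belt points are cumulative", "sprint scores use belt points"])
for vector in vectors:
    zeros = vector.count(0)
    print(f"{zeros} of {len(vocab)} entries are zero")
```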

#### Dense
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two approaches:
* Single encoder: Use a single model to embed both query and passage.
* Dual-encoder: Use **two models**, one to embed the query and one to embed the passage.

Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax).

Examples:
* REALM
* DPR
* Sentence-Transformers

Pros: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics)

Cons: Computationally heavier, requires initial training
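
The ranking step of a dense retriever can be sketched as follows. The three-dimensional vectors are hand-written placeholders, not the output of any real encoder; in a dual-encoder setup the query vector would come from the query model and the passage vectors from the passage model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Made-up embeddings for illustration only.
passage_vectors = {
    "passage about belt points": [0.9, 0.1, 0.0],
    "passage about course fees": [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]

# Rank passages by how closely their vector points in the query's direction.
ranked = sorted(
    passage_vectors,
    key=lambda name: cosine_similarity(query_vector, passage_vectors[name]),
    reverse=True,
)
print(ranked[0])
```

Because the comparison happens between vectors rather than words, a passage can rank highly without sharing any exact keyword with the query.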

## Evaluation

add evaluation [here](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb)

## Conclusion