# Question-Answering Chatbot via Transformer


<div class="alert alert-block alert-danger">
<b>Google Colab:</b> This notebook tutorial will not work through Google Colab, or most similar hosted notebook services, since it requires you to expose a port on your machine.
    To follow along with this tutorial, please download the notebook and run it in your local Jupyter environment.
</div>

In this tutorial, you will build your own chatbot that can answer questions about COVID-19 through a web interface.


<div class="alert alert-block alert-info">
<b>See Also:</b> In this tutorial you will recreate Jina's chatbot example: https://docs.jina.ai/get-started/hello-world/covid-19-chatbot/.
The code here has some minor changes compared to the original source, which you can find [here](https://github.com/jina-ai/jina/tree/master/jina/helloworld/chatbot).</div>

At the end of this tutorial, you will have your own chatbot. You will use text as input and get a text result as
output. For this example, you will use a [covid dataset](https://www.kaggle.com/xhlulu/covidqa). You will understand how
every part of this example works and how you can create new apps with different datasets on your own.

## Define data and work directories

You can start by creating an empty folder.
Here, that folder is simply named 'tutorial', but you can name it whatever you want.

The chatbot will display its answers in a browser, so download the static folder from
[here](https://github.com/jina-ai/jina/tree/master/jina/helloworld/chatbot/static), or simply run the next cell.
The folder only contains the HTML, CSS, and JavaScript files needed to render the results.

In [None]:
! wget --directory-prefix=./static https://raw.githubusercontent.com/jina-ai/jina/master/jina/helloworld/chatbot/static/index.html https://raw.githubusercontent.com/jina-ai/jina/master/jina/helloworld/chatbot/static/script.js https://raw.githubusercontent.com/jina-ai/jina/master/jina/helloworld/chatbot/static/style.css https://raw.githubusercontent.com/jina-ai/jina/master/jina/helloworld/chatbot/static/license.txt

The bot uses a dataset in `.csv` format. In this tutorial you will use
the [COVID](https://www.kaggle.com/xhlulu/covidqa) dataset from Kaggle.

Download it under your `tutorial` directory:

In [None]:
! wget https://static.jina.ai/chatbot/dataset.csv

## Create Documents from a csv file

In the most simple case, a `Document` can be created like this:

In [None]:
from docarray import Document

doc = Document(content='hello, world!')

In the case of your chatbot, the content of the Documents needs to be the dataset you want to use.
Additionally, if the dataset at hand is large compared to the available system memory, it makes sense to pass the Documents
as a *generator*:

In [None]:
from docarray import Document, DocumentArray
from docarray.document.generators import from_csv

docs = from_csv('dataset.csv', field_resolver={'question': 'text'})

So what happened there? You created a generator of Documents, `docs`, using
[from_csv](https://docarray.jina.ai/api/docarray.document.generators/?highlight=generators#module-docarray.document.generators) to
load the dataset. The `field_resolver` argument maps the `question` column of the dataset to the `text` attribute of each Document.
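
To make the idea of a lazy generator with field mapping more concrete, here is a minimal standard-library sketch. It is not how `from_csv` is implemented: the helper `rows_as_docs`, the plain dictionaries standing in for Documents, and the two sample rows (including the `answer` column name) are assumptions made purely for illustration.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Hypothetical two-row sample standing in for dataset.csv.
SAMPLE_CSV = (
    "question,answer\n"
    "What is COVID-19?,A disease caused by the SARS-CoV-2 virus.\n"
    "How does it spread?,Mainly through respiratory droplets.\n"
)


def rows_as_docs(lines: Iterable[str], field_resolver: Dict[str, str]) -> Iterator[dict]:
    """Lazily yield one dict per CSV row, renaming columns per field_resolver."""
    for row in csv.DictReader(lines):
        yield {field_resolver.get(col, col): value for col, value in row.items()}


docs = rows_as_docs(io.StringIO(SAMPLE_CSV), field_resolver={'question': 'text'})
first = next(docs)  # only the first data row has been parsed at this point
print(first['text'])
```

Because rows are produced one at a time, the whole file never has to sit in memory at once, which is the reason the tutorial passes a generator rather than a list.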

## Create Flow

Now you need to create a simple `Flow` that processes the Documents.

For now, your Flow will be little more than a placeholder pipeline.
You will add actual functionality later in this tutorial.

First, you should import everything you need:

In [None]:
import os
import webbrowser
from pathlib import Path
from jina import Flow, Executor, requests
from jina.logging.predefined import default_logger
from docarray.document.generators import from_csv

Then you can create a `tutorial` function that defines two dummy Executors and a Flow that uses them.

In [None]:
def tutorial(port_expose):
    class MyTransformer(Executor):
        @requests(on='/foo')
        def foo(self, **kwargs):
            print(f'foo is doing cool stuff: {kwargs}')

    class MyIndexer(Executor):
        @requests(on='/bar')
        def bar(self, **kwargs):
            print(f'bar is doing cool stuff: {kwargs}')
    
    flow = (
        Flow()
            .add(name='MyTransformer', uses=MyTransformer)
            .add(name='MyIndexer', uses=MyIndexer)
    )
    with flow:
        flow.index(from_csv('dataset.csv', field_resolver={'question': 'text'}))

tutorial(8080)

If you run this, it should finish without errors. You won't see much yet because nothing is displayed after
indexing.

To actually see something you need to specify how the outputs of the Flow will be displayed.
For our tutorial, that will happen through a web browser.
After indexing, the program will open a web browser to serve the static html files.

You also need to serve the Flow on a specific port over HTTP so that the web page can send requests to it.
To do that, pass the parameter `port_expose` to the Flow and set its protocol to HTTP.
Modify the function `tutorial` like so:


In [None]:
def tutorial(port_expose):
    class MyTransformer(Executor):
        @requests(on='/foo')
        def foo(self, **kwargs):
            print(f'foo is doing cool stuff: {kwargs}')
    
    class MyIndexer(Executor):
        @requests(on='/bar')
        def bar(self, **kwargs):
            print(f'bar is doing cool stuff: {kwargs}')
    
    flow = (
        Flow(cors=True, protocol='http', port_expose = port_expose)
            .add(name='MyTransformer', uses=MyTransformer)
            .add(name='MyIndexer', uses=MyIndexer)
    )
    with flow:
        flow.index(from_csv('dataset.csv', field_resolver={'question': 'text'}))
        # The static folder was downloaded into the current working directory
        url_html_path = 'file://' + os.path.abspath(
            os.path.join(os.path.abspath(''), 'static/index.html')
        )
        try:
            webbrowser.open(url_html_path, new=2)
        except Exception:
            pass  # intentional pass, browser support isn't cross-platform
        finally:
            default_logger.success(
                f'You should see a demo page opened in your browser, '
                f'if not, you may open {url_html_path} manually'
            )
        flow.block()

<div class="alert alert-block alert-info">
<b>See Also:</b> For more information on what the Flow is doing, and how to serve the Flow with 'flow.block()' and configure the protocol,
check the Flow fundamentals section in the Jina Documentation: https://docs.jina.ai/fundamentals/flow/</div>


<div class="alert alert-block alert-warning">
<b>Important:</b> Since you want to call your Flow from the browser, it's important to enable 
Cross-Origin Resource Sharing (https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) with 'Flow(cors=True)'.
</div>
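
The `file://` URL that the function opens in the browser can also be built with `pathlib`. The following is a small sketch rather than a required change, and it assumes that the `static` folder sits in the current working directory, which is where the download cell above places it.

```python
from pathlib import Path

# Resolve the page relative to the current working directory.
index_page = Path('static') / 'index.html'

# as_uri() needs an absolute path, so resolve() is called first.
url_html_path = index_page.resolve().as_uri()
print(url_html_path)
```

`Path.as_uri()` also takes care of platform differences such as drive letters on Windows, which plain string concatenation does not.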

You have plenty of work done already. If you run this, you will see a new tab open in your browser
with a text box ready for you to input some text. However, if you try to enter anything you won't get
any results. This is because you are using dummy Executors: `MyTransformer` and `MyIndexer` aren't actually doing
anything. So far they only print a line when they are called. So you need real Executors.

## Create Executors

It is usually good practice to put your Executors in a separate file (like `my_executors.py`).
Here, you will just put everything in the same notebook.

### Sentence Transformer

First, let's import the following:

In [None]:
from typing import Dict

from docarray import DocumentArray
from jina import Executor, requests
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

Now, let's implement `MyTransformer`:

In [None]:
class MyTransformer(Executor):
    """Transformer executor class """

    def __init__(
        self,
        pretrained_model_name_or_path: str = 'sentence-transformers/paraphrase-mpnet-base-v2',
        pooling_strategy: str = 'mean',
        layer_index: int = -1,
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.pretrained_model_name_or_path = pretrained_model_name_or_path
        self.pooling_strategy = pooling_strategy
        self.layer_index = layer_index
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.pretrained_model_name_or_path
        )
        self.model = AutoModel.from_pretrained(
            self.pretrained_model_name_or_path, output_hidden_states=True
        )
        self.model.to(torch.device('cpu'))

    def _compute_embedding(self, hidden_states: 'torch.Tensor', input_tokens: Dict):

        fill_vals = {'cls': 0.0, 'mean': 0.0, 'max': -np.inf, 'min': np.inf}
        fill_val = torch.tensor(
            fill_vals[self.pooling_strategy], device=torch.device('cpu')
        )

        layer = hidden_states[self.layer_index]
        attn_mask = input_tokens['attention_mask'].unsqueeze(-1).expand_as(layer)
        layer = torch.where(attn_mask.bool(), layer, fill_val)

        embeddings = layer.sum(dim=1) / attn_mask.sum(dim=1)
        return embeddings.cpu().numpy()

    @requests
    def encode(self, docs: 'DocumentArray', **kwargs):
        with torch.inference_mode():
            if not self.tokenizer.pad_token:
                self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
                self.model.resize_token_embeddings(len(self.tokenizer.vocab))

            input_tokens = self.tokenizer(
                docs[:, 'content'],
                padding='longest',
                truncation=True,
                return_tensors='pt',
            )
            input_tokens = {
                k: v.to(torch.device('cpu')) for k, v in input_tokens.items()
            }

            outputs = self.model(**input_tokens)
            hidden_states = outputs.hidden_states

            docs.embeddings = self._compute_embedding(hidden_states, input_tokens)

`MyTransformer` exposes only one endpoint: `encode`. Because `@requests` is used without an `on` argument, it is called
whenever you make a request to the Flow, either on query or index. The endpoint creates embeddings for the indexed or
query Documents, which in turn are used to find the indexed question closest to the question you ask.
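
The difference between a handler bound to a specific endpoint and one registered without `on` can be pictured with a toy router. This is only a sketch of the idea, not Jina's implementation: `ToyRouter`, its `dispatch` method, and the `DEFAULT` key are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Optional

DEFAULT = '__default__'  # hypothetical key for handlers registered without `on`


class ToyRouter:
    """Toy stand-in for endpoint binding; not Jina's actual implementation."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[list], str]] = {}

    def requests(self, on: Optional[str] = None):
        """Register the decorated function for `on`, or as the fallback handler."""
        def decorator(func):
            self._handlers[on or DEFAULT] = func
            return func
        return decorator

    def dispatch(self, endpoint: str, docs: list) -> Optional[str]:
        """Call the handler bound to `endpoint`, else the fallback, else nothing."""
        handler = self._handlers.get(endpoint) or self._handlers.get(DEFAULT)
        return handler(docs) if handler else None


dummy = ToyRouter()


@dummy.requests(on='/foo')
def foo(docs):
    return 'foo called'


encoder = ToyRouter()


@encoder.requests()
def encode(docs):
    return f'encoded {len(docs)} docs'


print(dummy.dispatch('/index', ['q1']))     # nothing is bound to /index
print(encoder.dispatch('/index', ['q1']))   # handled by the fallback
print(encoder.dispatch('/search', ['q1']))  # handled by the fallback as well
```

In this picture, the dummy Executors from earlier stay silent on `/index` because they are only bound to `/foo` and `/bar`, while `encode` handles both indexing and search requests.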

<div class="alert alert-block alert-info">
<b>Note:</b> Encoding is a fundamental concept in neural search. It means representing the data in a vectorial form (embeddings). </div>

Encoding is performed through a sentence-transformers model (`paraphrase-mpnet-base-v2` by default). The Executor tokenizes
the text of all Documents in a batch, averages the hidden states of the selected layer over the non-padding tokens, and
assigns the resulting vectors to the `embeddings` attribute of the `DocumentArray`.
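
The pooling step in `_compute_embedding` boils down to a masked mean: padding positions are zeroed out and the sum is divided by the number of real tokens. Below is a dependency-free sketch of that arithmetic for a single input. The function name `masked_mean_pool` and the tiny 2-dimensional vectors are made up for illustration; the Executor itself does the same thing on batched `torch` tensors.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
    n_real = sum(attention_mask)  # number of non-padding tokens
    return [value / n_real for value in total]


# Hypothetical hidden states for a 3-token input; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average away from the real tokens, which is why the attention mask is applied before summing.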

### Simple Indexer

Now, let's implement your indexer (`MyIndexer`):

<div class="alert alert-block alert-info">
<b>See Also:</b> In order to make this tutorial truly end-to-end, here you implement an Indexer yourself.
If you want the same functionality (and slightly more) out of the box, you can also use the
SimpleIndexer from Jina Hub: https://hub.jina.ai/executor/zb38xlt4. </div>

In [None]:
import os  # imports from above are also needed

class MyIndexer(Executor):
    """Simple indexer class """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.table_name = 'qabot_docs'
        self._docs = DocumentArray(storage='sqlite',
                                   config={'connection': os.path.join(self.workspace, 'indexer'),
                                           'table_name': self.table_name})

    @requests(on='/index')
    def index(self, docs: 'DocumentArray', **kwargs):
        self._docs.extend(docs)

    @requests(on='/search')
    def search(self, docs: 'DocumentArray', **kwargs):
        """Append best matches to each document in docs

        :param docs: query Documents; the best match is attached to each of them
        :param kwargs: other keyword arguments
        """
        docs.match(
            self._docs,
            metric='cosine',
            normalization=(1, 0),
            limit=1,
        )

`MyIndexer` exposes two endpoints: `index` and `search`. To perform indexing, you use docarray's
[SQLite store](https://docarray.jina.ai/advanced/document-store/sqlite/).
Indexing is as simple as adding the Documents to the `DocumentArray` with SQLite store.

<div class="alert alert-block alert-info">
<b>See Also:</b> Learn more about Document Stores: https://docarray.jina.ai/advanced/document-store/ </div>
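
To get a feel for what persisting Documents in SQLite involves, here is a rough sketch using only the standard library. The table layout, the JSON encoding of embeddings, and the `extend` helper are assumptions for illustration and do not reflect how docarray stores its data internally.

```python
import json
import sqlite3

# In-memory database for illustration; the tutorial's indexer writes to a file
# inside the Executor workspace instead.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE qabot_docs (doc_id TEXT PRIMARY KEY, text TEXT, embedding TEXT)'
)


def extend(rows):
    """Append (doc_id, text, embedding) rows, storing each embedding as JSON."""
    conn.executemany(
        'INSERT INTO qabot_docs VALUES (?, ?, ?)',
        [(doc_id, text, json.dumps(embedding)) for doc_id, text, embedding in rows],
    )
    conn.commit()


extend([('a', 'first question', [0.1, 0.2]), ('b', 'second question', [0.3, 0.4])])

count = conn.execute('SELECT COUNT(*) FROM qabot_docs').fetchone()[0]
row = conn.execute("SELECT embedding FROM qabot_docs WHERE doc_id = 'b'").fetchone()
stored = json.loads(row[0])
print(count, stored)
```

The point of a disk-backed store is that the indexed Documents do not all have to be held in memory while the Flow is running.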

To perform the search operation, you use the method `match`, which attaches the top match (`limit=1`) to each query
Document using the cosine metric.

<div class="alert alert-block alert-info">
<b>See Also:</b> '.match()' is a method of DocumentArray. Learn more about it in the DocArray documentation: https://docarray.jina.ai/fundamentals/documentarray/matching/ </div>
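
Conceptually, finding the top match means comparing the query embedding against every indexed embedding and keeping the closest one. The sketch below shows that brute-force idea in plain Python. The helpers `cosine_distance` and `top_match` and the 2-dimensional embeddings are hypothetical and only meant to illustrate the computation, not to mirror docarray's code.

```python
import math
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_match(query: List[float], index: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the (doc_id, distance) pair with the smallest cosine distance."""
    return min(
        ((doc_id, cosine_distance(query, emb)) for doc_id, emb in index.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical 2-dimensional embeddings keyed by document id.
index = {'q1': [1.0, 0.0], 'q2': [0.0, 1.0], 'q3': [1.0, 1.0]}
best_id, best_distance = top_match([0.9, 0.1], index)
print(best_id)
```

Cosine distance depends only on the angle between two vectors, so embeddings that point in a similar direction count as close regardless of their length.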

Now you can modify your app to use the real Executors:

In [None]:
import os
import webbrowser
from jina import Flow
from jina.logging.predefined import default_logger
from docarray.document.generators import from_csv


def tutorial(port_expose):
    flow = (
        Flow(cors=True, protocol='http', port_expose=port_expose)
            .add(name='MyTransformer', uses=MyTransformer)
            .add(name='MyIndexer', uses=MyIndexer, uses_metas={'workspace': os.path.abspath('')})
    )
    with flow:
        flow.index(from_csv('dataset.csv', field_resolver={'question': 'text'}))
        
        url_html_path = 'file://' + os.path.abspath(
            os.path.join(
                os.path.abspath(''), 'static/index.html'
            )
        )
        print(url_html_path)
        try:
            webbrowser.open(url_html_path, new=2)
        except Exception:
            pass  # intentional pass, browser support isn't cross-platform
        finally:
            default_logger.success(
                f'You should see a demo page opened in your browser, '
                f'if not, you may open {url_html_path} manually'
            )
        flow.block()

And finally, run it:

In [None]:
tutorial(8080)
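
Since the Flow is served over HTTP, it should also be possible to query it without the web page, for example from a separate Python session while the cell above is still running. The sketch below only builds the request. It assumes that the gateway listens on `localhost:8080` and accepts a JSON body of the form `{"data": [{"text": ...}]}` on `/search`; check the Jina documentation for your version before relying on this. `build_search_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_search_request(question: str, port: int = 8080) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Flow's /search endpoint."""
    # Assumed payload shape: a list of Documents under the "data" key.
    payload = json.dumps({'data': [{'text': question}]}).encode('utf-8')
    return urllib.request.Request(
        url=f'http://localhost:{port}/search',
        data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


request = build_search_request('What is COVID-19?')
print(request.get_method(), request.full_url)

# With the Flow running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Sending the request is left commented out because it only succeeds while the Flow is being served.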