# End-To-End NLP: Question Answering 

---

## Overview  

The End-to-End NLP material is designed from a real-world perspective and follows the data processing, development, and deployment pipeline paradigm. The material consists of three labs that walk you through a single flow: `data preprocessing` of raw text into the SQuAD dataset format for Question Answering, training on that dataset with an `NVIDIA TAO` transfer-learning BERT model, and deployment using `RIVA`. A challenge notebook is also included to test your understanding of the material and solidify your experience in the Question Answering (QA) domain.
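
As a rough illustration of the SQuAD dataset format mentioned above, the sketch below builds one record in the SQuAD v1.1-style JSON layout (`data`, then `paragraphs`, then `qas`, then `answers`) and checks that each `answer_start` offset points at the answer text inside its context. The sample context and question, the `id` value, and the helper names `make_answer` and `check_answer_offsets` are illustrative assumptions, not content taken from the labs.

```python
import json


def make_answer(context, text):
    """Build a SQuAD-style answer dict; answer_start is a character offset."""
    start = context.find(text)
    if start < 0:
        raise ValueError(f"answer {text!r} not found in context")
    return {"text": text, "answer_start": start}


# Hypothetical sample context and question, used only for illustration.
context = "NVIDIA TAO is used to fine-tune a BERT model for question answering."

squad_sample = {
    "version": "v1.1",
    "data": [
        {
            "title": "Sample article",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "sample-0001",
                            "question": "Which model is fine-tuned?",
                            "answers": [make_answer(context, "BERT")],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_answer_offsets(dataset):
    """Return the ids of questions whose answer offsets do not match the context."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if ctx[start:start + len(ans["text"])] != ans["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


print(json.dumps(squad_sample, indent=2))
print("Mismatched ids:", check_answer_offsets(squad_sample))
```

Checking offsets like this is a cheap sanity test before training, since extractive QA models rely on `answer_start` to locate the answer span.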

### Why End-to-End NLP?

Solving real-world problems in the AI domain requires a set of tools (software stacks and frameworks), and the solution process typically follows the `data processing`, `development`, and `deployment` pattern. This material aims to:
- help AI hackathon participants learn and apply this knowledge to solve their tasks using NVIDIA software stacks and frameworks
- enable bootcamp attendees to solve real-world problems using an end-to-end approach (data processing --> development --> deployment)


### Implementation Architecture
The implementation architecture includes the following components:
- Data preprocessing phase
- TAO training using QA model
- RIVA deployment


<img src="jupyter_notebook/images/end-to-end-arch.jpg" width="700px" height="700px"/>

### Application Flow
The application flow is as follows:
- Prepare your question as speech/audio.
- The question audio is the input to the Speech-To-Text (STT) model, an Automatic Speech Recognition (ASR) model that transcribes the speech into text.
- The text output from the STT model is then passed to the NLP model, which infers the answer in text format. The text answer serves as input to the Text-To-Speech (TTS) model.
- Finally, the TTS model synthesizes the answer text into audio speech.


<img src="jupyter_notebook/images/application-flow.jpg" width="700px" height="700px"/>
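
The application flow above can be sketched as three chained stages. This is a minimal sketch under stated assumptions: `QAPipeline` and the three stand-in functions are hypothetical placeholders and not RIVA API calls, and passing a `context` passage to the QA stage is an assumption about how the answer is inferred.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QAPipeline:
    """Chains the three stages: speech -> text -> answer text -> speech."""
    transcribe: Callable[[bytes], str]           # STT / ASR stage
    answer_question: Callable[[str, str], str]   # NLP QA stage: (question, context)
    synthesize: Callable[[str], bytes]           # TTS stage

    def run(self, question_audio: bytes, context: str) -> bytes:
        question_text = self.transcribe(question_audio)
        answer_text = self.answer_question(question_text, context)
        return self.synthesize(answer_text)


# Stand-in stages (hypothetical): a real deployment would call deployed models here.
def fake_transcribe(audio: bytes) -> str:
    # Pretend the audio bytes are already the spoken words.
    return audio.decode("utf-8")


def fake_answer(question: str, context: str) -> str:
    # Ignores the question and returns the first sentence of the context.
    return context.split(".")[0]


def fake_synthesize(text: str) -> bytes:
    # Pretend the encoded text is the synthesized audio.
    return text.encode("utf-8")


pipeline = QAPipeline(fake_transcribe, fake_answer, fake_synthesize)
answer_audio = pipeline.run(
    b"What is TAO used for?",
    "TAO is used for transfer learning. RIVA serves the trained model.",
)
print(answer_audio)
```

Keeping each stage behind a plain callable makes it straightforward to swap a stand-in for a real model client without changing the flow itself.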

The table of contents below walks you through the QA data processing phase of the `End-to-End approach to NLP`, and the included exercise tests your understanding of the concepts.

### Table of Contents

The following topics are covered:
1. Data preprocessing
    1. [Overview of QA Dataset](jupyter_notebook/Overview.ipynb)
        1. [Introduction to QA](jupyter_notebook/Overview.ipynb#Introduction-to-NLP-Question-Answering-System)
        1. [Brief on QA Dataset](jupyter_notebook/Overview.ipynb#Brief-on-QA-Dataset)
    1. [Common Preprocessing Techniques for Raw Text Data](jupyter_notebook/General_preprocessing.ipynb)
    1. [QA Text Data preprocessing](jupyter_notebook/QandA_data_processing.ipynb)
        1. [SQuAD Dataset Structural Format](jupyter_notebook/QandA_data_processing.ipynb#SQUAD)
        1. [Text Data Source](jupyter_notebook/QandA_data_processing.ipynb#Text-Data-Source)    
        1. [Manual QA Extraction](jupyter_notebook/QandA_data_processing.ipynb#Mannual-QA-Extraction)
        1. [Automatic QA Generation with T5 model](jupyter_notebook/QandA_data_processing.ipynb#Automatic-QA-Generation-with-T5-base-model)
    1. [Exercise](jupyter_notebook/Exercise.ipynb)
    1. [Summary](jupyter_notebook/Summary.ipynb)
1. Development
    1. [Transfer learning with NVIDIA TAO](jupyter_notebook/question-answering-training.ipynb)
1. Deployment
    1. [RIVA Deployment](jupyter_notebook/qa-riva-deployment.ipynb)
1. [Challenge](jupyter_notebook/challenge.ipynb)


### Check your GPU

Run the cell below to display information about the CUDA driver and the GPUs on the server using the `nvidia-smi` command. To do this, give the cell focus (click on it with your mouse) and press `Ctrl-Enter`, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

    !nvidia-smi
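
If you prefer to check for GPUs from Python code rather than a shell escape, the sketch below wraps the same `nvidia-smi` command. The helper name `list_gpus` is an assumption introduced for illustration; it returns an empty list when `nvidia-smi` is missing or fails, so it can run safely on a machine without a GPU.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one 'name, total memory' line per GPU, or [] if none can be queried."""
    # nvidia-smi is only on PATH when the NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```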

### Tutorial Duration

The material is presented as two lab sessions totaling 8 hours:
- Data preprocessing lab: `3 hrs 30 mins`
- Development and deployment lab: `4 hrs 30 mins`

### Content Level
Beginner to Advanced

### Target Audience and Prerequisites
The target audience for these labs is researchers, graduate students, and developers who are interested in an end-to-end approach to solving NLP tasks using GPUs. Attendees are expected to have a background in Python programming and to possess an NVIDIA NGC key.


---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.