<img src="./assets/ga-logo.png" style="float: left; margin: 20px; height: 55px">

# Lab: Build a Question Answer (QA) Pipeline for Financial Documents

---

# Objective
The goal of this lab is to build a question answering (QA) pipeline for financial documents. Financial analysts can use such a pipeline to inform their research by quickly processing large volumes of text data and extracting key insights.

#### Steps:
1. Build an initial pipeline using a single text document `financial_context`.
2. Test a few pretrained models and select one to use for this task.
3. Evaluate the performance of your model.
4. Make recommendations for when it should and should not be used.
5. _BONUS_: Extend the pipeline to a larger set of documents from the FinQA data.

# Data
In this lab you will use data from the FinQA dataset.

In the `data` folder you have a copy of the FinQA datasets. [Original FinQA Source](https://github.com/czyssrs/FinQA)

For Question 1, you will use one sample document, labeled `financial_context`.

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
from transformers import pipeline
from IPython.display import display, HTML

In [None]:
# Sample financial context from FinQA dataset
financial_context = """
ITEM 7: MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS

Net Sales
Fiscal 2018 net sales increased 7% to $8,716 million compared to $8,140 million in fiscal 2017. The increase in net sales 
primarily reflects higher sales in our Convenience Stores and Retail segments, which increased 8% and 7%, respectively. 
The higher Convenience Stores sales were driven by contributions from acquisitions and a 1.9% increase in same-store 
merchandise sales. Retail sales benefited from a 4.3% increase in same-store sales, including 38% growth in digital 
sales, and sales from new and relocated stores, partly offset by the impact of closed stores. Wholesale segment sales 
increased 4% from fiscal 2017, which reflected the benefit of sales to new customers and higher sales to existing customers, 
partly offset by customer attrition.

Gross Profit
Gross profit increased 7% to $2,164 million in fiscal 2018 versus $2,019 million in fiscal 2017, reflecting higher sales 
and improved Wholesale segment gross profit rates. Consolidated gross profit rate for fiscal 2018 was 24.8%, equal to 
the fiscal 2017 rate, as the benefit of the improved Wholesale segment gross profit rate was offset by a lower Retail 
segment rate. The Wholesale segment gross profit rate improved 16 basis points to 14.5%, due primarily to a shift 
in business mix, including the benefit of acquisitions, and lower inventory shrink. The Retail segment rate decreased 
30 basis points to 27.0%, primarily reflecting a higher LIFO expense and increased promotional activity.
"""


# Question 1: Build a QA pipeline 

1. Build a 'question-answering' pipeline using the Hugging Face `pipeline` function and the [distilbert-base-cased-distilled-squad](https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad) model.
2. Ask a few related questions about the sample document `financial_context`, which should be passed into your pipeline as the context.



**Questions**: 
```python 
questions = [
    "What was the net sales amount in fiscal 2018?",
    "By what percentage did fiscal 2018 net sales increase compared to fiscal 2017?",
    "What was the gross profit in fiscal 2018?",
    "What was the Wholesale segment gross profit rate?",
]
```
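
The two steps above can be sketched roughly as follows. This is one possible approach, not the lab's reference solution: the helper names `build_qa_pipeline` and `ask_all` are hypothetical, and the model is loaded inside a function so that nothing is downloaded until you call it.

```python
from typing import Callable, Dict, List


def build_qa_pipeline(model_name: str = "distilbert-base-cased-distilled-squad"):
    """Create a Hugging Face question-answering pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported here so ask_all works without transformers
    return pipeline("question-answering", model=model_name)


def ask_all(qa: Callable[..., Dict], questions: List[str], context: str) -> List[Dict]:
    """Run every question against the same context and keep the answer and its score."""
    results = []
    for question in questions:
        output = qa(question=question, context=context)
        results.append({
            "question": question,
            "answer": output["answer"],
            "score": round(float(output["score"]), 4),
        })
    return results


# Typical use in the notebook (not executed here):
# qa = build_qa_pipeline()
# for row in ask_all(qa, questions, financial_context):
#     print(f"{row['score']:.2f}  {row['question']} -> {row['answer']}")
```

Passing the pipeline in as an argument keeps the loop independent of any one model, which may be convenient when you swap models in Question 2.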


In [None]:
# create a QA pipeline using the pretrained model `distilbert-base-cased-distilled-squad`


In [None]:
# ask the first question about the financial context
question = "What was the increase in net sales for fiscal 2018?"

In [None]:
# call the QA pipeline


In [None]:
# display the result (be sure to print the score as well as the result)


#### How did the model do on this question? 

In [None]:
# try additional questions
questions = [
    "What was the net sales amount in fiscal 2018?",
    "By what percentage did fiscal 2018 net sales increase compared to fiscal 2017?",
    "What was the gross profit in fiscal 2018?",
    "What was the Wholesale segment gross profit rate?",
]

In [None]:
# write a for loop to iterate through the questions and call the QA pipeline


#### How did the model do on these questions?
Note: your results may differ from the solutions. Evaluate your results.
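
One way to make this evaluation concrete is to score each prediction against the answer you can read directly from `financial_context`. The sketch below uses SQuAD-style exact match and token-level F1. The helper names and the `expected` mapping are illustrative assumptions rather than part of the lab materials; the expected values are copied from the sample passage.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    """True when the normalized prediction equals the normalized expected answer."""
    return normalize(prediction) == normalize(truth)


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Expected answers read from the sample passage (assumed gold labels for this sketch)
expected = {
    "What was the net sales amount in fiscal 2018?": "$8,716 million",
    "By what percentage did fiscal 2018 net sales increase compared to fiscal 2017?": "7%",
    "What was the gross profit in fiscal 2018?": "$2,164 million",
    "What was the Wholesale segment gross profit rate?": "14.5%",
}
```

Note that this normalization strips punctuation, so `14.5%` and `145` would compare as equal; for financial figures you may prefer a stricter comparison.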


# Question 2:  Test a few pretrained models and select one to use for this task

-  Explore Hugging Face and select a few additional models to test for this task 
-  Feeling stuck? Try the following:
   -  [bert-large-uncased-whole-word-masking-finetuned-squad](https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad)
   -  [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2)
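
A side-by-side comparison could be organized as in the sketch below. It assumes each model is wrapped in a callable that accepts `question` and `context` and returns a dict with `answer` and `score`, which is what the Hugging Face question-answering pipeline returns for a single question. The function name `compare_models` is hypothetical.

```python
from typing import Callable, Dict, List


def compare_models(
    models: Dict[str, Callable[..., Dict]],
    questions: List[str],
    context: str,
) -> List[Dict]:
    """Ask every model every question and return one row per (model, question) pair."""
    rows = []
    for name, qa in models.items():
        for question in questions:
            output = qa(question=question, context=context)
            rows.append({
                "model": name,
                "question": question,
                "answer": output["answer"],
                "score": float(output["score"]),
            })
    return rows


# Typical use in the notebook (not executed here; each model is downloaded on first use):
# model_names = [
#     "distilbert-base-cased-distilled-squad",
#     "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad",
#     "deepset/roberta-base-squad2",
# ]
# models = {name: pipeline("question-answering", model=name) for name in model_names}
# results = pd.DataFrame(compare_models(models, questions, financial_context))
# results.pivot(index="question", columns="model", values="answer")
```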

#### Model 2 

In [None]:
# call the pipeline on each question and evaluate the results


#### How did the model do with the questions?
**Reminder: questions**
1. Net sales amount in fiscal 2018
2. Percentage increase for fiscal 2018
3. Gross profit in fiscal 2018
4. Wholesale segment gross profit rate

#### Model 3 

#### How did the model do with the questions?
**Reminder: questions**
1. Net sales amount in fiscal 2018
2. Percentage increase for fiscal 2018
3. Gross profit in fiscal 2018
4. Wholesale segment gross profit rate

# Question 3

#### a) Which model performed the best?

#### b) Besides accuracy, what other factors are important to consider when evaluating the performance of a QA model?
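
Response time is one factor that is straightforward to measure. The sketch below times repeated calls to any QA callable; the function name `time_qa` and the `repeats` parameter are assumptions made for illustration, and the numbers it produces depend heavily on your hardware.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def time_qa(qa: Callable[..., Dict], questions: List[str], context: str, repeats: int = 3) -> Dict:
    """Measure wall-clock latency per call over several passes through the questions."""
    timings = []
    for _ in range(repeats):
        for question in questions:
            start = time.perf_counter()
            qa(question=question, context=context)
            timings.append(time.perf_counter() - start)
    return {
        "calls": len(timings),
        "mean_seconds": mean(timings),
        "max_seconds": max(timings),
    }


# Typical use in the notebook (not executed here):
# print(time_qa(qa, questions, financial_context))
```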

### BONUS Question:  Extend the pipeline 
-  Extend the pipeline to a larger set of documents from the FinQA data.

Steps:
-  Load the dataset using the helper code below.
-  Iterate through different contexts with a `for` loop.

### Load the dataset 
In the `data` folder you have a copy of the FinQA datasets. The code below loads the training split into a pandas DataFrame.

[Original FinQA Source](https://github.com/czyssrs/FinQA)

In [None]:
# path to the dataset folder (no trailing slash; the file name is appended below)
path = "data/FinQA"

# open and load the training data
with open(f"{path}/train.json", "r") as f:
    train_data = json.load(f)

# convert to a pandas DataFrame
train_dataset = pd.DataFrame(train_data)

# print dataset info
print(f"Training dataset loaded: {len(train_dataset)} examples")

In [None]:
# explore the dataset


In [None]:
# function to prepare context from a sample


# test the QA pipeline on a few examples from the dataset
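
A context-preparation helper might look like the sketch below. It assumes each FinQA record carries `pre_text` and `post_text` as lists of sentences and `table` as a list of rows, with the question stored under `qa["question"]`; check these field names against your copy of `train.json` before relying on them. The function name `prepare_context` is hypothetical.

```python
from typing import Dict


def prepare_context(sample: Dict) -> str:
    """Join the text before the table, the table rows, and the text after into one string."""
    pre_text = " ".join(sample.get("pre_text", []))
    post_text = " ".join(sample.get("post_text", []))
    # Flatten each table row into a single pipe-separated line
    table_rows = [" | ".join(str(cell) for cell in row) for row in sample.get("table", [])]
    parts = [pre_text, "\n".join(table_rows), post_text]
    return "\n\n".join(part for part in parts if part)


# Typical use in the notebook (not executed here):
# for _, row in train_dataset.head(3).iterrows():
#     context = prepare_context(row.to_dict())
#     result = qa(question=row["qa"]["question"], context=context)
#     print(row["qa"]["question"], "->", result["answer"], f"({result['score']:.2f})")
```

Many FinQA questions require arithmetic over several figures, so an extractive QA model that can only copy a span from the context may not reproduce the reference answer even when it finds the right passage.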


#### Bonus: How could you extend this to a real-world model?