### Imports and prerequisites

In [180]:
from haystack.nodes import AnswerParser, PromptNode, PromptTemplate, TransformersReader, BM25Retriever
from haystack.schema import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.utils import convert_files_to_docs, print_answers
from haystack import Pipeline

import requests
import time
import os
import re
from pprint import pprint

from bs4 import BeautifulSoup
import mechanicalsoup

### Test document with basic config

Testing basic [Haystack](https://haystack.deepset.ai/) default config.

In [2]:
# Read test file content
with open("test_article.txt", "r", encoding="utf8") as test_file:
    test_content = test_file.read()
print(test_content)

Every six weeks the members of THE Bank of England’s Monetary Policy Committee commute – I’d imagine from homes in Surrey (train to Waterloo then Waterloo & City Line) or Hampshire (train to Waterloo then Waterloo & City Line) or Sussex (train to Victoria then District/Circle Line to Mansion House and a short walk) – to Bank, where the BoE is situated, right next to the station. The members meet – every six weeks – to vote. To vote on what rate will be set as the Bank Rate.

Required background knowledge – BoE, BR, INF:

The BoE was formed in 1694 to help finance the war vs. France. Its modern purpose is to be the bank of banks (and the government). The idea is to separate the monetary from the fiscal: a government with the ability to issue debt and then basically set the interest on that debt was too much power for said government to have. The BoE would “work together” with the gov. to “manage” the economy, with the ability to keep gov. raising of money via debt issuance in check by k

In [3]:
# Create test Document
test_doc = [Document(test_content)]

In [4]:
# Setting up test
prompt_node = PromptNode()
question_answering_per_doc = PromptTemplate("deepset/question-answering-per-document", output_parser=AnswerParser())

In [5]:
# Generating answer from prompt
result = prompt_node.prompt(
    query="How often does the Bank of England’s Monetary Policy Committee meet?",
    documents=test_doc,
    prompt_template=question_answering_per_doc
)

Token indices sequence length is longer than the specified maximum sequence length for this model (1246 > 512). Running this sequence through the model will result in indexing errors
The prompt has been truncated from 1246 tokens to 412 tokens so that the prompt length and answer length (100 tokens) fit within the max token limit (512 tokens). Shorten the prompt to prevent it from being cut off


In [6]:
# Printing result
print(result[0])

<Answer: answer='to separate the monetary and fiscal: a government with the ability to issue debt and then basically set the interest on that debt was too much power for said government to have', score=None, context=None>


That took a long time, and the answer does not even address the question.
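The warning in cell 5 shows the prompt being cut from 1246 tokens to 412 to fit the model's 512-token limit. One common workaround is to split a long text into smaller overlapping pieces and wrap each piece in its own `Document`. Below is a minimal sketch of that idea; `split_into_chunks`, `max_words` and `overlap` are hypothetical names, and counting whitespace-separated words is only a rough proxy for model tokens. Haystack's own `PreProcessor` node offers similar splitting and is probably the better choice in practice.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping chunks of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Synthetic 450-word text, only to show the chunk sizes
sample = " ".join(f"w{i}" for i in range(450))
chunks = split_into_chunks(sample, max_words=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed as `Document(chunk)` in place of the single large test document.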

### Adding custom model

Try with [tinyroberta-squad2](https://huggingface.co/deepset/tinyroberta-squad2) QA model to see what we get.

In [7]:
HF_MODEL_NAME = 'deepset/tinyroberta-squad2' 

reader = TransformersReader(
    model_name_or_path=HF_MODEL_NAME,
    tokenizer=HF_MODEL_NAME,
    use_gpu=-1
)

In [8]:
result = reader.predict(
    query="How often does the Bank of England’s Monetary Policy Committee meet?",
    documents=test_doc,
    top_k=10
)

In [9]:
# Printing result
print(result['answers'][0])

<Answer: answer='Every six weeks', score=0.7901942729949951, context='Every six weeks the members of THE Bank of England...'>


### Multiple documents

In [10]:
# Change to document store template
question_answering_with_scores = PromptTemplate("deepset/question-answering-with-document-scores")

In [158]:
# Document store from memory
document_store = InMemoryDocumentStore(use_bm25=True)

In [12]:
# Load docs into doc store
docs = convert_files_to_docs('ArticleStore')
document_store.write_documents(docs)

Updating BM25 representation...: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<?, ? docs/s]


In [13]:
# Test documents loaded (the reader expects a list of Documents, not the store itself)
result = reader.predict(
    query="When was the BoE formed?",
    documents=document_store.get_all_documents(),
    top_k=10
)
print(result['answers'][0])

<Answer: answer='1694', score=0.9720327854156494, context='The BoE was formed in 1694 to help finance the war...'>


### Simple pipeline

In [14]:
# Create retriever and prompt_node
retriever = BM25Retriever(document_store)
reader = TransformersReader(
    model_name_or_path=HF_MODEL_NAME,
    tokenizer=HF_MODEL_NAME,
    use_gpu=-1
)

In [15]:
# Create basic pipeline
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=reader, name="Reader", inputs=["Retriever"])

In [16]:
# Test the pipeline
result = p.run(query="When was the BoE formed?")

In [17]:
print(result['answers'][0])

<Answer: answer='1694', score=0.9720327854156494, context='The BoE was formed in 1694 to help finance the war...'>
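The Retriever step in this pipeline ranks documents with BM25 so that the Reader only has to look at the most promising ones. To make that ranking less of a black box, here is a small pure-Python sketch of the Okapi BM25 formula. It is an illustration only: `bm25_scores` is a hypothetical helper, the `k1` and `b` values are commonly used defaults rather than Haystack's settings, and the tokenisation is a plain lowercase split, so the numbers will not match what `BM25Retriever` computes.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: long documents need more hits to score the same
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, made up for illustration
toy_docs = [
    "the boe was formed in 1694 to help finance the war",
    "index funds keep fees low for long term investors",
]
scores = bm25_scores("when was the boe formed", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(best, scores)
```

The first toy document shares several terms with the query and the second shares none, so the first one is ranked on top.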


### Scraping blog posts

Use requests to navigate through the blog post pages, and extract the text from each page using Beautiful Soup.

In [155]:
# Create a class to scrape text from blog posts
class BlogArticleScraper:
    """
    A class to scrape article text from a blog and save to separate text files.

    Methods:
        scrape_post_urls: Run through post pages to collect post titles and urls
        save_blog_text: Extract text from a single blog post url and save in dir
        read_and_save_blogs: Read all blog posts and save the text
    """
    
    def __init__(self, blog_posts_url, dest_dir):
        self.blog_posts_url = blog_posts_url
        self.dest_dir = dest_dir

        self.headers = {  # Fake headers to get around bot detection
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    def scrape_post_urls(self) -> dict:
        """Run through post pages to collect post titles and urls"""
        blog_urls_dict = {}  # Empty dict for storing blog post urls
        page_num = 1  # Start at page number 1
        while True:
            # Get blog post links for that page
            page_url = f'{self.blog_posts_url}/page/{page_num}'
            response = requests.get(page_url, headers=self.headers)
            raw_posts_soup = BeautifulSoup(response.content, 'html.parser')
            post_links = raw_posts_soup.select('.entry-title')
            # Check if page contains blog posts
            if not post_links:
                break
            # Add title and link to blog url dict
            for link in post_links:
                blog_urls_dict[link.select_one('a')['title']] = link.select_one('a')['href']
            # Add one to page number to look for links
            page_num += 1
            # Add a delay before accessing the next URL
            time.sleep(2)
        # Return blog url dict
        return blog_urls_dict

    def save_blog_text(self, blog_title: str, blog_post_url: str, output_dir: str):
        """Extract text from a blog post and save in dir"""
        # Access blog post
        response = requests.get(blog_post_url, headers=self.headers)
        raw_blog_soup = BeautifulSoup(response.content, 'html.parser')
        # Edit blog title
        no_special_char_title = re.sub(r'[^\w\s]', '', blog_title)
        no_space_title = no_special_char_title.replace(" ", "_") + '.txt'
        # Save text output
        with open(os.path.join(output_dir, no_space_title), 'w', encoding='utf-8') as file:
            file.write(raw_blog_soup.select_one('.entry-content').text)

    def read_and_save_blogs(self):
        """Read all blog posts and save the text"""
        for key, value in self.scrape_post_urls().items():
            self.save_blog_text(key, value, self.dest_dir)

In [156]:
# Use the class to scrape blog posts from my blog
BlogArticleScraper(blog_posts_url='https://www.perpetualprudence.com/blog',
                   dest_dir='ArticleStore').read_and_save_blogs()
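The file name for each post comes from the two-step clean-up inside `save_blog_text`: strip everything that is not a word character or whitespace, then swap spaces for underscores. The sketch below isolates that step so it can be tried on its own; `title_to_filename` is a hypothetical helper name and the example title is made up.

```python
import re


def title_to_filename(blog_title: str) -> str:
    """Drop punctuation, turn spaces into underscores, and append .txt."""
    no_special_char_title = re.sub(r'[^\w\s]', '', blog_title)
    return no_special_char_title.replace(" ", "_") + '.txt'


# Hypothetical post title, for illustration only
print(title_to_filename("Should I buy a house? (Part 1)"))
# Should_I_buy_a_house_Part_1.txt
```

One thing to watch: titles that differ only in punctuation map to the same file name, so the later post would overwrite the earlier one.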

### Getting answers from all documents

Let's just re-run all commands up to here to make it easy.

In [None]:
# Use the class to scrape blog posts from my blog
BlogArticleScraper(blog_posts_url='https://www.perpetualprudence.com/blog',
                   dest_dir='ArticleStore').read_and_save_blogs()

In [159]:
# Create a document store and add documents
question_answering_with_scores = PromptTemplate("deepset/question-answering-with-document-scores")
document_store = InMemoryDocumentStore(use_bm25=True)
docs = convert_files_to_docs('ArticleStore')
document_store.write_documents(docs)

Updating BM25 representation...: 100%|████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 1501.21 docs/s]


In [160]:
# Define the model to use for QA
HF_MODEL_NAME = 'deepset/tinyroberta-squad2' 

In [161]:
# Define reader and retriever to be used in the pipeline
retriever = BM25Retriever(document_store)
reader = TransformersReader(
    model_name_or_path=HF_MODEL_NAME,
    tokenizer=HF_MODEL_NAME,
    use_gpu=-1
)

In [162]:
# Create pipeline
pp_qa_pipe = Pipeline()
pp_qa_pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
pp_qa_pipe.add_node(component=reader, name="Reader", inputs=["Retriever"])

In [191]:
# Test question one - easy
pp_qa_pipe.run(
    "When was the BoE founded?"
)['answers'][0].answer

'1694'

In [192]:
# Test question two - medium
pp_qa_pipe.run(
    "Is using active management a good idea?"
)['answers'][0].answer

'It’s a myth'

In [193]:
# Test question three - hard
pp_qa_pipe.run(
    "Should I buy a house?"
)['answers'][0].answer

'If you know you definitely want to buy a house as soon as possible'
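Looking only at `['answers'][0].answer` hides the scores and the runner-up answers, which makes it hard to judge how confident the reader was. A small helper along these lines could list the top few answers with their scores. It is a sketch only: `top_answers` is a hypothetical name, it relies on nothing beyond the `answer` and `score` attributes seen in the printed `Answer` objects earlier, and the stand-in result below uses made-up values rather than real pipeline output.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def top_answers(result: Dict[str, Any], n: int = 3) -> List[Tuple[str, float]]:
    """Return (answer, score) pairs for the n highest-scoring answers."""
    scored = [(a.answer, a.score) for a in result["answers"] if a.score is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Stand-in for a pipeline result; all answers and scores here are made up
fake_result = {
    "answers": [
        SimpleNamespace(answer="every six weeks", score=0.79),
        SimpleNamespace(answer="1694", score=0.97),
        SimpleNamespace(answer="no score available", score=None),
        SimpleNamespace(answer="a myth", score=0.12),
    ]
}

for answer, score in top_answers(fake_result, n=2):
    print(f"{score:.2f}  {answer}")
```

With a real run, the call would look like `top_answers(pp_qa_pipe.run("Should I buy a house?"))`.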

### Changing the model to generative

These responses are not very good: the extractive reader can only return short spans copied straight from the posts. What we really want is some context around them. For this we must introduce a generative component.

See the [Haystack RAG generator tutorial](https://haystack.deepset.ai/tutorials/07_rag_generator).
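The core idea behind a retrieval-augmented generative setup is simple: the retrieved passages and the question are combined into one prompt, and a generative model writes the answer from that. The sketch below shows only the prompt-assembly part in plain Python. `build_rag_prompt`, the prompt wording and the example passages are all assumptions for illustration, not what the tutorial or Haystack's templates actually use.

```python
from typing import List


def build_rag_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Combine the top retrieved passages and the question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {passage.strip()}"
        for i, passage in enumerate(passages[:max_passages], start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up passages standing in for retriever output
prompt = build_rag_prompt(
    "When was the BoE formed?",
    ["The BoE was formed in 1694 to help finance the war.", "A second passage."],
    max_passages=1,
)
print(prompt)
```

Capping the number of passages keeps the prompt inside the model's token limit, the same constraint that caused the truncation warning with the basic config.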