<a href="https://colab.research.google.com/github/RhiM1/hello-world/blob/main/Copy_of_Copy_of_Tutorial1_Basic_QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4


In [None]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

**Alternatives:** If you are unable to set up an Elasticsearch instance, follow [Tutorial 3](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb) to use the SQL or InMemory document stores instead.

**Hint**: This tutorial creates a new document store instance with Wikipedia articles on Great Expectations. However, you can configure Haystack to work with your existing document stores.

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [None]:
# Recommended: Start Elasticsearch using Docker
#! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

In [None]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
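
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As a rough alternative, the sketch below polls a URL until the server responds. The helper name `wait_for_http` and its parameters are hypothetical (they are not part of Haystack or Elasticsearch), and `http://localhost:9200` is assumed from the port used in the Docker command above.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout_s=60.0, interval_s=1.0):
    """Poll `url` until it sends any HTTP response or the timeout expires.

    Hypothetical helper for illustration. Returns True if a response arrived.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet; wait and try again.
            time.sleep(interval_s)
    return False


# Usage (assumes Elasticsearch listens on the default port 9200):
# if not wait_for_http("http://localhost:9200"):
#     raise RuntimeError("Elasticsearch did not start in time")
```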

In [None]:
# Connect to Elasticsearch

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

10/27/2020 07:58:14 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.678s]
10/27/2020 07:58:14 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.302s]


## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this tutorial, we fetch Wikipedia articles related to Great Expectations, split them into one document per top-level section, and index them in Elasticsearch.
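
To make the clean, split, and write steps concrete, here is a minimal plain-Python sketch that does not use Haystack's own preprocessing utilities. The function names `clean_text`, `split_paragraphs`, and `to_dicts` and the `paragraph` meta key are hypothetical; only the `{'text': ..., 'meta': {'name': ...}}` shape is taken from this tutorial.

```python
import re


def clean_text(text):
    """Collapse runs of spaces/tabs and squeeze extra blank lines (illustrative only)."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_paragraphs(text):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def to_dicts(name, raw_text):
    """Turn one raw text into a list of {'text', 'meta'} dicts, one per paragraph."""
    paragraphs = split_paragraphs(clean_text(raw_text))
    return [
        {"text": p, "meta": {"name": name, "paragraph": i}}
        for i, p in enumerate(paragraphs)
    ]


raw = "First   paragraph.\n\n\n\nSecond\tparagraph.\n"
docs = to_dicts("example.txt", raw)
print(docs)
```

A list like `docs` has the shape that `document_store.write_documents(...)` receives later in this notebook.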

In [None]:
!pip install wikipedia
!pip install wikipedia-api


In [None]:
# Fetch documents based on book title
book_title = 'Great Expectations'

# Use wikipedia and wikipedia-api packages to get wikipedia articles returned by searching for the book title
import wikipedia
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

# Here each document is an individual section from a Wikipedia page
dicts = []


# Split all sections into different documents
# def add_sections(title, sections, level=0):
#     for s in sections:
#         doc_dict = dict()
#         doc_dict['text'] = s.text
#         doc_dict['meta'] = dict()
#         doc_dict['meta']['name'] = title + '/' + s.title
#         dicts.append(doc_dict)
#         add_sections(title, s.sections, level+1)

# for counter, title in enumerate(wikipedia.search(book_title, results=50)):
#     # Exclude articles with certain strings in the title
#     if not any(x in title for x in ("film", "video game", "soundtrack", "album")):
#         try: 
#             page_py = wiki_wiki.page(title)
#             add_sections(title, page_py.sections)  
            
#         except:
#             pass

# Alternatively, only split the highest-level sections into individual documents

def add_sub_sections(title, sections, level=0):
    """Recursively concatenate the text of all nested subsections."""
    string = ''

    for s in sections:
        string += s.text
        string += add_sub_sections(title, s.sections, level+1)
        print(string)  # debug output: the text accumulated so far

    return string

def add_sections(title, sections, level=0):
    """Append one document dict per top-level section, with its subsections folded in."""
    for s in sections:
        section_text = add_sub_sections(title, s.sections, level+1)
        doc_dict = dict()
        doc_dict['text'] = s.text + section_text
        doc_dict['meta'] = dict()
        doc_dict['meta']['name'] = title + '/' + s.title
        dicts.append(doc_dict)

for title in wikipedia.search(book_title, results=2):
    # Exclude articles with certain strings in the title
    if not any(x in title for x in ("film", "video game", "soundtrack", "album")):
        page_py = wiki_wiki.page(title)
        add_sections(title, page_py.sections)

# The default format for the dicts here is:
# {
#    'text': "<DOCUMENT_TEXT_HERE>",
#    'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
#}
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Finder)


10/27/2020 07:58:24 - INFO - wikipediaapi -   Request URL: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Great Expectations&explaintext=1&exsectionformat=wiki


On Christmas Eve, around 1812, Pip, an orphan about seven years old, is visiting the graves of his parents and siblings in the village churchyard, where he unexpectedly encounters an escaped prisoner. The convict scares Pip into stealing food and tools from Pip's hot-tempered elder sister and her amiable husband, Joe Gargery, a blacksmith, who have taken the orphan in. On early Christmas morning, Pip returns with a file, a pie, and brandy, though he fears being punished. 
During Christmas Dinner that evening, at the moment Pip's theft is about to be discovered, soldiers arrive and ask Joe to mend some shackles. Joe and Pip accompany them as they recapture the convict, who is fighting with another escaped convict. The first convict confesses to stealing food from the smithy, clearing Pip of suspicion.

A few years pass. Miss Havisham, a wealthy and reclusive spinster who was jilted at the altar and still wears her old wedding dress, lives in the dilapidated Satis House. She asks Mr Pumb
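
The section-flattening logic can be tried without any network access. The sketch below uses a hypothetical `Section` dataclass as a stand-in for the objects returned by `wikipediaapi`; it only mirrors the `.title`, `.text`, and `.sections` attributes used in the cell above. Unlike that cell, it returns a new list rather than appending to a global.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Stand-in for a Wikipedia section (hypothetical, for illustration)."""
    title: str
    text: str
    sections: List["Section"] = field(default_factory=list)


def collect_text(sections):
    """Concatenate the text of all nested sections, depth-first."""
    parts = []
    for s in sections:
        parts.append(s.text)
        parts.append(collect_text(s.sections))
    return "".join(parts)


def top_level_docs(page_title, sections):
    """Build one document dict per top-level section, folding subsections into it."""
    return [
        {
            "text": s.text + collect_text(s.sections),
            "meta": {"name": page_title + "/" + s.title},
        }
        for s in sections
    ]


page = [
    Section("Plot", "Intro. ", [Section("Stage one", "Pip meets a convict. ")]),
    Section("Themes", "Class and ambition."),
]
docs = top_level_docs("Demo", page)
print([d["meta"]["name"] for d in docs])
```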

In [None]:
# Let's have a look at the first 3 entries:
print(dicts[:3])

[{'text': 'The book includes three "Stages" of Pip\'s Expectations.On Christmas Eve, around 1812, Pip, an orphan about seven years old, is visiting the graves of his parents and siblings in the village churchyard, where he unexpectedly encounters an escaped prisoner. The convict scares Pip into stealing food and tools from Pip\'s hot-tempered elder sister and her amiable husband, Joe Gargery, a blacksmith, who have taken the orphan in. On early Christmas morning, Pip returns with a file, a pie, and brandy, though he fears being punished. \nDuring Christmas Dinner that evening, at the moment Pip\'s theft is about to be discovered, soldiers arrive and ask Joe to mend some shackles. Joe and Pip accompany them as they recapture the convict, who is fighting with another escaped convict. The first convict confesses to stealing food from the smithy, clearing Pip of suspicion.\n\nA few years pass. Miss Havisham, a wealthy and reclusive spinster who was jilted at the altar and still wears her o

In [None]:
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

10/27/2020 07:58:26 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.048s]


## Initialize Retriever, Reader, & Finder

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm.

**Alternatives:**

- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
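
For intuition about what a sparse retriever computes, here is a simplified BM25 scorer in plain Python. It is a sketch only, not the Elasticsearch implementation: tokenization is a naive whitespace split, the function name `bm25_scores` is hypothetical, and `k1=1.2` and `b=0.75` are the commonly used default parameters.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a simplified BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for t in tokenized for term in set(t))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "pip steals food for the convict",
    "miss havisham lives in satis house",
    "the convict is recaptured by soldiers",
]
scores = bm25_scores("convict food", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document that matches both query terms ranks first, and the document that matches neither scores zero.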

In [None]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick prototypes with a SQLite document store.

# from haystack.retriever.sparse import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
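
One way to picture the effect of such a boost is the toy decision rule below. This is an illustrative assumption about the mechanism, not FARM's actual implementation: the function `pick_answer`, the additive boost, and the scores are all made up for the example.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span text, or None when "no answer" wins.

    `span_candidates` is a list of (text, score) pairs. The additive boost
    is an illustrative assumption, not FARM's actual implementation.
    """
    best_text, best_score = max(span_candidates, key=lambda c: c[1])
    if no_answer_score + no_ans_boost > best_score:
        return None
    return best_text


candidates = [("food", 4.2), ("a file", 3.1)]
without_boost = pick_answer(candidates, no_answer_score=2.0)
with_boost = pick_answer(candidates, no_answer_score=2.0, no_ans_boost=5.0)
print(without_boost, with_boost)
```

With no boost the highest-scoring span is returned; with a large boost the "no answer" option wins and `None` is returned.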

#### FARMReader

In [None]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

# reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

#### TransformersReader

In [None]:
# Alternative:
reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)


### Finder

The Finder combines the Reader and the Retriever in a pipeline to answer our actual questions.
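
The retrieve-then-read idea can be sketched with toy components. Everything below is a stand-in for illustration: `toy_retriever` ranks by word overlap instead of BM25, `toy_reader` returns the first sentence instead of running a QA model, and `get_answers` only mimics the shape of the real call.

```python
def toy_retriever(question, docs, top_k):
    """Rank documents by how many question words they share (stand-in for BM25)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def toy_reader(question, candidates, top_k):
    """Return the first sentence of each candidate as its 'answer' (stand-in for a QA model)."""
    answers = [
        {"answer": d["text"].split(".")[0], "meta": d["meta"]}
        for d in candidates
    ]
    return answers[:top_k]


def get_answers(question, docs, top_k_retriever=2, top_k_reader=1):
    """Retrieve first, then read only the retrieved candidates."""
    candidates = toy_retriever(question, docs, top_k_retriever)
    return {"question": question, "answers": toy_reader(question, candidates, top_k_reader)}


docs = [
    {"text": "Miss Havisham lives in Satis House. She was jilted.", "meta": {"name": "a"}},
    {"text": "Pip steals food for the convict. He fears punishment.", "meta": {"name": "b"}},
]
result = get_answers("What does Pip steal for the convict?", docs)
print(result["answers"][0]["answer"])
```

The design point is that the slow component only ever sees the few candidates the fast component selected.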

In [None]:
finder = Finder(reader, retriever)

## Voilà! Ask a question!

In [None]:
%%time
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers. 
prediction = finder.get_answers(question="What does Pip steal for the convict?", top_k_retriever=3, top_k_reader=1)

10/27/2020 07:58:42 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.123s]
10/27/2020 07:58:42 - INFO - haystack.retriever.sparse -   Got 3 candidates from retriever
10/27/2020 07:58:42 - INFO - haystack.finder -   Reader is looking for detailed answer in 35913 chars ...


CPU times: user 27.1 s, sys: 626 ms, total: 27.7 s
Wall time: 29.2 s


In [16]:
print_answers(prediction, details="all")

{   'answers': [   {   'answer': 'food',
                       'context': 'with another escaped convict. The first '
                                  'convict confesses to stealing food from the '
                                  'smithy, clearing Pip of suspicion.\n'
                                  '\n'
                                  'A few years pass. Miss H',
                       'document_id': '4562f094-5f7b-41ee-9939-72d106935790',
                       'meta': {'name': 'Great Expectations/Plot summary'},
                       'offset_end': 822,
                       'offset_start': 818,
                       'probability': 0.9562171697616577,
                       'score': None}],
    'question': 'What does Pip steal for the convict?'}
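
The returned prediction is a plain nested dict, so it can be post-processed with ordinary Python. The sketch below works on a trimmed copy of the output above; the helper `summarize` and its `min_probability` threshold are hypothetical.

```python
prediction = {  # same shape as the output above, trimmed
    "question": "What does Pip steal for the convict?",
    "answers": [
        {
            "answer": "food",
            "probability": 0.9562171697616577,
            "meta": {"name": "Great Expectations/Plot summary"},
            "offset_start": 818,
            "offset_end": 822,
        }
    ],
}


def summarize(prediction, min_probability=0.5):
    """Return (answer, source name, probability) tuples above a threshold."""
    return [
        (a["answer"], a["meta"]["name"], round(a["probability"], 3))
        for a in prediction["answers"]
        if a["answer"] is not None and a["probability"] >= min_probability
    ]


summary = summarize(prediction)
print(summary)
```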


In [38]:
from google.colab import drive
drive.mount('/colab_drive')

Mounted at /colab_drive


In [49]:
# Write the top answer to a file on the mounted Drive; the with-block closes the file
with open('/colab_drive/My Drive/Colab Notebooks/test.txt', 'w') as f:
    f.write(prediction['answers'][0]['answer'])

In [66]:
text = "Hello, World!"  # stand-in for prediction['answers'][0]['answer']
# open() raises FileNotFoundError here because the directory /Home/Documents/Colab does not exist
with open('/Home/Documents/Colab/test.txt', 'w') as f:
    f.write(text)

FileNotFoundError: ignored

In [52]:
print(prediction['answers'][0]['answer'])


food


In [53]:
type(prediction['answers'][0])

dict

In [56]:
# Write every field of the top answer as key:value lines; the with-block closes the file
with open('/colab_drive/My Drive/Colab Notebooks/test.txt', 'w') as f:
    for key, value in prediction['answers'][0].items():
        f.write('%s:%s\n' % (key, value))
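
Writing `key:value` lines flattens everything to text, so nested fields such as `meta` and `None` values are hard to read back. If the answer needs to be reloaded later, JSON is one option. The sketch below uses a temporary directory as a stand-in for the mounted Drive folder; the file name `answer.json` and the sample values are assumptions for illustration.

```python
import json
import os
import tempfile

answer = {"answer": "food", "probability": 0.956, "score": None}

# A temporary directory stands in for the mounted Drive folder.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "answer.json")

# The with-block closes the file even if json.dump raises.
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(answer, f, indent=2)  # None is written as JSON null

with open(out_path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == answer)
```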