# Sample Solution: Building QA Systems

Before we start, all required packages have to be installed...

In [1]:
# Install the latest release of Haystack in your own environment 
! pip install farm-haystack

# Install the latest master of Haystack
#!pip install git+https://github.com/deepset-ai/haystack.git
#!pip install urllib3==1.25.4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack
  Downloading farm_haystack-1.6.0-py3-none-any.whl (596 kB)
Collecting quantulum3
  Downloading quantulum3-0.7.10-py3-none-any.whl (10.7 MB)
Collecting azure-core<1.23
  Downloading azure_core-1.22.1-py3-none-any.whl (178 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting tika
  Downloading tika-1.24.tar.gz (28 kB)
Collecting transformers==4.20.1
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)

In [2]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
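
A fixed `sleep 30` either waits too long or not long enough, depending on the machine. As a hedged alternative, the sketch below polls until the server answers. The helper names `http_ready` and `wait_until_ready` are hypothetical, and the URL in the usage comment assumes Elasticsearch listens on its default HTTP port 9200 on localhost.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str) -> bool:
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (assumption: Elasticsearch on the default port 9200):
# ready = wait_until_ready(lambda: http_ready("http://localhost:9200"))
```

Passing the probe as a callable keeps the waiting logic independent of the network, so it can be tried out without a running server.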

In [3]:
# Import all required modules (print_answers was previously imported twice)
from haystack.nodes import ElasticsearchRetriever, FARMReader, PreProcessor, TextConverter, TransformersReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers

## Harry Potter QA

### Step 1: Create a DocumentStore with Elasticsearch
The index name can be chosen freely (e.g. ```hp_document_store```).

In [4]:
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore
hp_document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="hp_document_store")

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


### Step 2: Preprocess the text files

```convert_files_to_docs``` converts all text files in a folder into Haystack ```Document``` objects, which can later be written to the Elasticsearch DocumentStore...

In [5]:
all_hp_books = convert_files_to_docs(dir_path="/content/potter")



INFO - haystack.utils.preprocessing -  Converting /content/potter/hp5.txt
INFO - haystack.utils.preprocessing -  Converting /content/potter/hp6.txt
INFO - haystack.utils.preprocessing -  Converting /content/potter/hp7.txt
INFO - haystack.utils.preprocessing -  Converting /content/potter/hp2.txt
INFO - haystack.utils.preprocessing -  Converting /content/potter/hp1.txt
INFO - haystack.utils.preprocessing -  Converting /content/potter/hp3.txt
INFO - haystack.utils.preprocessing -  Converting /content/potter/hp4.txt


After this step there are seven documents, each containing the complete text of one book. To improve retriever-reader performance, it is advisable to split these large documents into smaller ones and to preprocess the raw text. Details on the individual parameters can be found on the [preprocessing page](https://haystack.deepset.ai/tutorials/preprocessing) and in the recommendations on [optimization](https://haystack.deepset.ai/guides/optimization).

In [6]:
hp_preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="sentence",  # split documents at sentence boundaries
    split_length=100,  # each resulting document contains up to 100 sentences
    split_overlap=2,  # consecutive documents overlap by 2 sentences
    split_respect_sentence_boundary=False
)
# Process each book separately, then flatten the nested list of documents
nested_docs = [hp_preprocessor.process(doc) for doc in all_hp_books]
hp_docs = [doc for x in nested_docs for doc in x]

print(hp_docs[0:10])

print(f"n_files_input: {len(all_hp_books)}\nn_docs_output: {len(hp_docs)}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


n_files_input: 7
n_docs_output: 828
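
To make the effect of `split_length` and `split_overlap` concrete, here is a minimal sketch of windowed splitting in plain Python. It is an illustration under simplifying assumptions, not Haystack's implementation: the function name `split_with_overlap` is hypothetical, and the input is assumed to be already tokenized into units (here: placeholder sentences).

```python
from typing import List


def split_with_overlap(units: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Cut `units` into windows of `split_length` items that share `split_overlap` items."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    step = split_length - split_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        if start + split_length >= len(units):
            break  # the last window already reaches the end of the input
    return chunks


# Ten placeholder sentences, windows of 4 with an overlap of 2
sentences = [f"s{i}" for i in range(1, 11)]
chunks = split_with_overlap(sentences, split_length=4, split_overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means that a passage falling on a chunk border still appears completely in at least one chunk, at the cost of storing some sentences twice.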


### Step 3: Configure the retriever and reader

Before the retriever and reader are configured, the generated documents still have to be written to the DocumentStore...

In [7]:
hp_document_store.write_documents(hp_docs)

After that, the retriever and reader instances can be created.

In [8]:
hp_retriever = ElasticsearchRetriever(document_store=hp_document_store)
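
`ElasticsearchRetriever` is a keyword-based retriever, and Elasticsearch ranks matches with BM25 by default. The sketch below shows the textbook BM25 formula in plain Python to give an intuition for that ranking. It is not Elasticsearch's or Haystack's code: the function name `bm25_scores`, the toy documents, and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `docs` against the tokenized `query`."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms missing from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "harry and ron play chess".split(),
    "dudley and piers watch television".split(),
    "harry reads a book".split(),
]
scores = bm25_scores("harry ron".split(), toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Rare query terms weigh more than frequent ones, and a document with no query term at all scores zero, which is why the retriever can only find passages that share words with the question.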



For the reader, you additionally specify the name of the pretrained language model on Hugging Face. The exact name can be copied from [the top of the Hugging Face model page](https://huggingface.co/deepset/roberta-base-squad2).

In [9]:
hp_reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2


Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


After initialization, the individual components, namely the reader and the retriever, are assembled into a pipeline:

In [10]:
hp_pipe = ExtractiveQAPipeline(hp_reader, hp_retriever)

### Step 4: Ask questions

When a question is put to the system...

In [16]:
mydic = {"Peter":["hp1.txt", "hp2.txt"], "Albert":["hp3.txt"], "josef":["hp4.txt", "hp5.txt", "hp6.txt", "hp7.txt"]}

In [17]:
def pipe(user_input: str, user: str):
    # Restrict the search to the files assigned to this user in mydic
    hp_prediction = hp_pipe.run(
        query=user_input,
        params={"Retriever": {"top_k": 10}, "filters": {"name": mydic[user]}, "Reader": {"top_k": 5}},
    )
    return hp_prediction
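
The per-user restriction can also be separated from the pipeline call, which makes it testable without Elasticsearch. The following sketch only builds the `params` dictionary in the same shape as used in `pipe`. The names `ACCESS` and `build_params` are hypothetical; the user-to-file mapping is copied from `mydic`.

```python
from typing import Any, Dict, List

# Same mapping as `mydic`: which book files each user may search
ACCESS: Dict[str, List[str]] = {
    "Peter": ["hp1.txt", "hp2.txt"],
    "Albert": ["hp3.txt"],
    "josef": ["hp4.txt", "hp5.txt", "hp6.txt", "hp7.txt"],
}


def build_params(user: str, access: Dict[str, List[str]],
                 retriever_top_k: int = 10, reader_top_k: int = 5) -> Dict[str, Any]:
    """Build the params dictionary for one user, failing clearly for unknown users."""
    if user not in access:
        raise KeyError(f"Unknown user {user!r}; known users: {sorted(access)}")
    return {
        "Retriever": {"top_k": retriever_top_k},
        "filters": {"name": list(access[user])},  # copy, so callers cannot mutate ACCESS
        "Reader": {"top_k": reader_top_k},
    }


params = build_params("Peter", ACCESS)
print(params)
```

The lookup is case-sensitive, so `"josef"` has to be spelled in lowercase exactly as in the mapping.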

In [22]:
test = pipe("Who is Harry Potter's best friend?", "Peter")

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.68 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.75 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.89 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.58 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.93 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.40 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.49 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.37 Batches/s]


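... it returns a ranked list of answer candidates. The expected answer is Ron and Hermione, who are Harry's best friends. In the output below, however, the two top-ranked candidates ("Piers Polkiss" and "Muggle-born") are not correct, so the results should be checked rather than taken at face value.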

In [23]:
print_answers(test, details="minimal")




Query: Who is Harry Potter's best friend?
Answers:
[   <Answer {'answer': 'Piers Polkiss', 'type': 'extractive', 'score': 0.9187772870063782, 'context': "unt Petunia frantically -- and a moment later, Dudley's best friend, Piers Polkiss, walked in with his mother. Piers was a scrawny boy with a face lik", 'offsets_in_document': [{'start': 4544, 'end': 4557}], 'offsets_in_context': [{'start': 69, 'end': 82}], 'document_id': '9a86fed060e22a8bea46b16ae52b109e', 'meta': {'_split_id': 4, 'name': 'hp1.txt'}}>,
    <Answer {'answer': 'Muggle-born', 'type': 'extractive', 'score': 0.8976055383682251, 'context': ' not going anywhere!" said Harry fiercely. "One of my best\nfriends is Muggle-born; she\'ll be first in line if the Chamber really has\nbeen opened -"\n"H', 'offsets_in_document': [{'start': 6356, 'end': 6367}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_id': '6d10be37a0920f60b600cbb9df34e9f7', 'meta': {'_split_id': 33, 'name': 'hp2.txt'}}>,
    <Answer {'answer': 'Jus
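
The fields in the printed answers are related: `offsets_in_context` gives the character positions of the answer inside `context`. The sketch below checks this on the first answer, using a plain dictionary with values copied from the output above as a stand-in for the real answer object. The helper names `span_in_context` and `highlight` are hypothetical.

```python
from typing import Any, Dict

# Stand-in for the first answer, with values copied from the printed output
first_answer: Dict[str, Any] = {
    "answer": "Piers Polkiss",
    "score": 0.9187772870063782,
    "context": "unt Petunia frantically -- and a moment later, Dudley's best friend, "
               "Piers Polkiss, walked in with his mother. Piers was a scrawny boy with a face lik",
    "offsets_in_context": [{"start": 69, "end": 82}],
    "meta": {"_split_id": 4, "name": "hp1.txt"},
}


def span_in_context(answer: Dict[str, Any]) -> str:
    """Return the slice of the context that the first offset pair points to."""
    span = answer["offsets_in_context"][0]
    return answer["context"][span["start"]:span["end"]]


def highlight(answer: Dict[str, Any], marker: str = "**") -> str:
    """Return the context with the answer span wrapped in `marker`."""
    span = answer["offsets_in_context"][0]
    ctx = answer["context"]
    return ctx[:span["start"]] + marker + ctx[span["start"]:span["end"]] + marker + ctx[span["end"]:]


print(span_in_context(first_answer))
print(highlight(first_answer))
```

Reading the highlighted context makes it easy to see why this candidate is wrong: the passage is about Dudley's best friend, not Harry's.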

## Faust QA

The same steps are followed for the Faust QA system.

In [None]:
faust_doc_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="faust_document_store")

If there is only one file, the ```TextConverter``` is sufficient for further processing. Haystack can take TXT, PDF, or DOCX files as input.

In [None]:
text_converter = TextConverter(valid_languages=["de"])
faust_doc = text_converter.convert(file_path="faust/faust.txt", meta=None)

In [None]:
faust_preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_overlap = 2,
    split_respect_sentence_boundary=True
)
faust_preprocessed = faust_preprocessor.process(faust_doc)
print(f"n_files_input: {len(faust_doc)}\nn_docs_output: {len(faust_preprocessed)}")

In [None]:
faust_doc_store.write_documents(faust_preprocessed)

In [None]:
faust_retriever = ElasticsearchRetriever(document_store=faust_doc_store)

In [None]:
faust_reader = FARMReader(model_name_or_path="deepset/gelectra-large-germanquad", use_gpu=True)

In [None]:
faust_pipe = ExtractiveQAPipeline(faust_reader, faust_retriever)
# No metadata filter here: the Faust document was converted with meta=None,
# so it has no "name" field that a filter could match.
faust_prediction = faust_pipe.run(query="Was hat Faust studiert?",
                                  params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})


Here, too, the answer is correct. Most people probably still remember the opening lines of Faust.

In [None]:
print_answers(faust_prediction, details="minimal")