# Scraping data from Twitter feed
If you don't want to scrape fresh data, you can skip this step and clone the repository instead.

In [None]:
# !pip install git+https://github.com/woluxwolu/twint.git@origin/master#egg=twint

In [None]:
# !twint -u "DB_Bahn" --year 2023 -o "db_bahn_tweets.json" --json

A copy of `db_bahn_tweets.json` can be found in the repository below

# Clone repository containing test data



In [None]:
%cd /content/
! git clone https://github.com/ToastyDom/DataChallengesSoSe22.git

/content
Cloning into 'DataChallengesSoSe22'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 30 (delta 5), reused 13 (delta 0), pack-reused 0
Unpacking objects: 100% (30/30), done.


In [None]:
import json
from pprint import pprint

# Use a context manager so the file is closed after reading
with open('DataChallengesSoSe22/data/db_bahn_tweets_2_years.json', 'r', encoding='utf-8') as file:
  lines = file.readlines()

extracted_tweets = []

for line in lines:
  tweet = json.loads(line)
  extracted_tweets.append([
    tweet["username"],
    tweet["created_at"],
    tweet["tweet"],
    tweet["id"],
    tweet["link"]
  ])
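
The loop above assumes that every line is valid JSON and contains all five keys. A slightly more defensive variant could look like the sketch below. The helper name `parse_tweets` and the decision to skip (rather than fail on) bad lines are assumptions for illustration and are not part of the original notebook.

```python
import json

def parse_tweets(lines):
    """Parse JSON Lines into [username, created_at, tweet, id, link] rows.

    Hypothetical helper: blank lines are ignored, and lines that are not
    valid JSON or lack one of the expected keys are counted and skipped.
    """
    rows, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
            rows.append([
                tweet["username"],
                tweet["created_at"],
                tweet["tweet"],
                tweet["id"],
                tweet["link"],
            ])
        except (json.JSONDecodeError, KeyError):
            skipped += 1
    return rows, skipped
```

Calling `parse_tweets(lines)` on the list read from the file would return the same row layout as `extracted_tweets`, plus a count of lines that could not be used.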

In [None]:
# Check that all tweets were loaded

print(len(extracted_tweets))

for tweet in extracted_tweets[0:10]:
  print(tweet)

71379
['db_bahn', '2020-12-31 21:00:37 UTC', 'Ihr Lieben, wir verabschieden uns nun für dieses Jahr von euch! 🎉 Rutscht oder rollt (ich rolle, weil vollgefuttert 🙈 😂) ins neue Jahr 2021! 🥳 Richtet den Blick voller Zuversicht nach vorne, fühlt euch von uns gedrückt und bleibt gesund. 🍀 🥂 /at', 1344750330848931846, 'https://twitter.com/DB_Bahn/status/1344750330848931846']
['db_bahn', '2020-12-31 20:46:37 UTC', '@atYildir Schon geschehen. /da', 1344746806383669268, 'https://twitter.com/DB_Bahn/status/1344746806383669268']
['db_bahn', '2020-12-31 20:40:59 UTC', '@atYildir Da die Strecke von der HLB betreiben wird, haben wir von den Ticketeinnahmen nichts. Dort verkehren aber auch Züge der Linie RB 41, die von DB Regio betrieben wird. Wann möchtest du von Frankfurt Hbf nach Marburg(Lahn) fahren? /da', 1344745390076260359, 'https://twitter.com/DB_Bahn/status/1344745390076260359']
['db_bahn', '2020-12-31 18:53:31 UTC', '@SebastianKueck Danke für die lieben Worte. Wir wünschen dir einen guten 

# Adding data to Haystack Pipeline
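Reader: Performs a close analysis of the documents and carries out the core task of question answering. Readers are based on transformer models.

Retriever: Acts as a fast filter in front of the Reader. It quickly identifies the documents most likely to be relevant, so the Reader only has to process a small subset.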
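
To make the division of labour concrete, here is a toy retrieve-then-read sketch in plain Python. It is not how Haystack implements either component: the word-overlap scoring and the "drop the query words" reader are deliberately simplistic stand-ins, and the function names `tokens`, `retrieve` and `read` are assumptions made for this illustration.

```python
def tokens(text):
    """Lowercase whitespace tokens with simple punctuation stripped."""
    return [t.strip("?.,!").lower() for t in text.split()]

def retrieve(query, docs, top_k=2):
    """Toy retriever: rank documents by the number of words shared with the query."""
    q = set(tokens(query))
    ranked = sorted(docs, key=lambda d: len(q & set(tokens(d))), reverse=True)
    return ranked[:top_k]

def read(query, candidates):
    """Toy reader: from the best candidate, keep the words not already in the query."""
    q = set(tokens(query))
    best = candidates[0]
    return " ".join(w for w in best.split() if w.strip("?.,!").lower() not in q)

docs = [
    "Das 9-Euro-Ticket gilt nur in der 2. Klasse",
    "Die My BahnCard ist ein Aboprodukt",
    "Schon geschehen.",
]
query = "Wo gilt das 9-Euro-Ticket?"
candidates = retrieve(query, docs)   # cheap filter over all documents
print(read(query, candidates))       # closer look at the few survivors
```

The retriever touches every document but does very little work per document; the reader does more work but only sees the `top_k` candidates.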

In [None]:
# Make sure you have a GPU running
!nvidia-smi

In [None]:
# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.1.2-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 5.1 MB/s 
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.1.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-21ypp5t0
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-21ypp5t0
  Resolved https://github.com/deepset-ai/haystack.git to commit a2905d05f798ea3335596247b98ec711eb6cd542
  Installing build dependencies ... done
  Getting requirements to build 

Haystack looks up documents in a component called the "Document Store".

First we need to start an Elasticsearch server on the local machine. Elasticsearch is the search engine that stores and indexes our documents. It is usually run via Docker, but we can also start it directly from the downloaded release archive.

In [None]:
# Download the Elasticsearch release archive and start the server from it
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
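
A fixed `sleep 30` can be too short on a slow machine and wastes time on a fast one. As an alternative, the sketch below polls the server until it answers. The helper `wait_for_http` and its parameters are hypothetical and not part of Haystack or Elasticsearch; the only outside fact it relies on is that Elasticsearch listens on port 9200 by default.

```python
import time
import urllib.error
import urllib.request

def wait_for_http(url, timeout_s=60.0, interval_s=1.0, probe=None):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    `probe` is an optional callable(url) -> bool, mainly useful for testing.
    Returns True if the server became reachable, False on timeout.
    """
    def default_probe(u):
        try:
            with urllib.request.urlopen(u, timeout=2) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

In the notebook, something like `wait_for_http("http://localhost:9200")` could replace the `sleep` call, with the return value checked before connecting the document store.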

In [None]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers
from haystack.nodes import FARMReader, TransformersReader

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


Now we have Elasticsearch running and a still-empty document_store. Next we need to fill the document store with our data!



In [None]:
# Write function to put json data into document store format
"""
dicts = [
    {
        'content': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]

[author_name, tweet_date, tweet, id, url]
"""

def to_dstore(json_data):
  dicts = []
  for element in json_data:
    current_dict = {'content': element[2],
                    'meta': {'author': element[0], 'date': element[1], 'url': element[4]}}
    dicts.append(current_dict)

  return dicts
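
As a quick illustration of the mapping, the sketch below applies the same function to one of the tweet rows printed earlier. The function is repeated here only so that the snippet runs on its own; the tweet `id` at index 3 is not carried over into the document.

```python
def to_dstore(json_data):
    """Convert [author, date, text, id, url] rows into Haystack-style dicts."""
    dicts = []
    for element in json_data:
        dicts.append({
            'content': element[2],
            'meta': {'author': element[0], 'date': element[1], 'url': element[4]},
        })
    return dicts

row = [
    'db_bahn',
    '2020-12-31 20:46:37 UTC',
    '@atYildir Schon geschehen. /da',
    1344746806383669268,
    'https://twitter.com/DB_Bahn/status/1344746806383669268',
]
doc = to_dstore([row])[0]
print(doc['content'])
print(doc['meta'])
```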

In [None]:
docs = to_dstore(extracted_tweets)
document_store.write_documents(docs)
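
The tweet count printed earlier (71379) is slightly higher than the number of documents that the embedding update reports further down (71351). One possible explanation, stated here as an assumption and not verified in this notebook, is that tweets with identical text end up as a single document in the store. The hypothetical helper below counts how many entries repeat an earlier `content` value, which would be one way to check that guess.

```python
from collections import Counter

def count_duplicate_contents(docs):
    """Return how many dicts repeat a 'content' value that was already seen."""
    counts = Counter(d["content"] for d in docs)
    return sum(n - 1 for n in counts.values() if n > 1)

# Tiny made-up sample: "Danke!" appears three times, so two entries are repeats.
sample_docs = [
    {"content": "Danke!", "meta": {}},
    {"content": "Danke!", "meta": {}},
    {"content": "Schon geschehen.", "meta": {}},
    {"content": "Danke!", "meta": {}},
]
print(count_duplicate_contents(sample_docs))
```

Running `count_duplicate_contents(docs)` on the real list would show whether repeated tweet texts account for the difference.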

In [None]:
# Important imports
from haystack import Pipeline
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, RAGenerator
from haystack.pipelines import DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline
from haystack.utils import print_documents

# NEW AND IMPROVED PIPELINE
Extractive Q&A Pipeline

In [None]:
from haystack.nodes import EmbeddingRetriever
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="clips/mfaq",
    model_format="sentence_transformers"
)
document_store.update_embeddings(retriever)

# Adding reader.
reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad", use_gpu=True)

p_extractive = Pipeline()
p_extractive.add_node(component=retriever, name="Retriever", inputs=["Query"])
p_extractive.add_node(component=reader, name="Reader", inputs=["Retriever"])

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model clips/mfaq


Downloading:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.74k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/778 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/117 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/294 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/464 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 71351 docs ...


Updating embeddings:   0%|          | 0/71351 [00:00<?, ? Docs/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/43 [00:00<?, ?it/s]

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/gelectra-base-germanquad locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/740 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/417M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Automatically detected language from language model name: german
INFO - haystack.modeling.model.language_model -  Loaded deepset/gelectra-base-germanquad


Downloading:   0%|          | 0.00/358 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/234k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


In [None]:
# Now we can run it
res = p_extractive.run(
    query="Wiviel kostet das 9€ Ticket?", params={"Retriever": {"top_k": 30}, "Reader": {"top_k": 3}}
)
print_answers(res, details="maximum")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.39 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.26 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.55 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 17.55 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.57 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.77 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.45 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.55 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 21.48 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 30.01 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00


Query: Wiviel kostet das 9€ Ticket?
Answers:
[   <Answer {'answer': '46 Euro', 'type': 'extractive', 'score': 0.797111451625824, 'context': 'en Verbundshop gegangen? Ich habe es gerade auf iOS getestet und kann dort ein 10er Ticket ab Korschenbroich Geltungsbereich B für 46 Euro buchen. /ne', 'offsets_in_document': [{'start': 193, 'end': 200}], 'offsets_in_context': [{'start': 131, 'end': 138}], 'document_id': 'b0d82054126a22c2bebcd194c9404fd9', 'meta': {'date': '2021-11-12 14:20:35 UTC', 'author': 'db_bahn', 'url': 'https://twitter.com/DB_Bahn/status/1459164221464272901'}}>,
    <Answer {'answer': '151,15 Euro', 'type': 'extractive', 'score': 0.5587798207998276, 'context': 'cht, wie du auf 194 Euro kommst, der Flexpreis mit BahnCard 25 kostet 151,15 Euro, Super Sparpreis Europa für morgen gibt es noch für 97,90 Euro. Hast', 'offsets_in_document': [{'start': 105, 'end': 116}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_id': '62b648964e38953fcc3a9b812e20a0ea', 'meta': 

# Extractive Q&A Pipeline



In [None]:
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import BM25Retriever


# Adding reader. It will select the k-best answers. We use a German ELECTRA model (GELECTRA) fine-tuned on GermanQuAD.
reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad-distilled", use_gpu=True)

retriever = BM25Retriever(document_store=document_store)  # Adding retriever to make scanning process faster!
pipe = ExtractiveQAPipeline(reader, retriever)


INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/gelectra-base-germanquad-distilled locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/778 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/417M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Automatically detected language from language model name: german
INFO - haystack.modeling.model.language_model -  Loaded deepset/gelectra-base-germanquad-distilled


Downloading:   0%|          | 0.00/333 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/234k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/468k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


In [None]:
question = "Ab wann gilt das 9€ Ticket?"


prediction = pipe.run(
    query=question, params={"Retriever": {"top_k": 40}, "Reader": {"top_k": 10}}
)


# Or just have the simple output:
print_answers(prediction, details="minimum")

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.96 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.01 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.44 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.94 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.06 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.20 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.07 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 17.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.49 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.04 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 21.86 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 23.03 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00


Query: Ab wann gilt das 9€ Ticket?
Answers:
[   {   'answer': 'Wenn ein Nahverkehrszug durch einen SEV ersetzt wird',
        'context': '@sarinamendes6 Wenn ein Nahverkehrszug durch einen SEV '
                   'ersetzt wird, dann gilt auch dort das 9-Euro-Ticket. /si'},
    {   'answer': 'ab Hamburg Altona',
        'context': '@tehabe @kkklawitter Ich würde auch sagen, dass das Ticket '
                   'ab Hamburg Altona gilt. /da'},
    {   'answer': 'bis Enschede',
        'context': 'mmen! Hier:  https://t.co/E2N18mjfee unter '
                   'Beförderungsbedingungen 9-Euro-Ticket ist jetzt alles '
                   'hinterlegt und das Ticket gilt auch bis Enschede. /an'},
    {   'answer': 'nur in der 2. Klasse',
        'context': '@ErkMerk Das 9- Euro-Ticket gilt nur in der 2. Klasse:  '
                   'https://t.co/vNMJwAsWBR /si'},
    {   'answer': 'Das 9-Euro-Ticket gilt nicht in den Fernverkehrszügen, bei '
                  'denen andere Nahverkehrsfahrkar

# Frontend Interface

In [None]:
import logging
# Disable info logs
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

def ask_chatbot(question):
  if question == "":
    return "", ""
  prediction = pipe.run(
    query=question,
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}}
  )
  if prediction["answers"]:
    return prediction["answers"][0].answer, prediction["answers"][0].meta["url"]
  else:
    return "", ""
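
`ask_chatbot` always returns the top answer, however low its score. A possible refinement is to return nothing when the reader is not confident. The sketch below shows the idea on stand-in objects: the helper `best_answer`, the `min_score` threshold of 0.5, and the use of `SimpleNamespace` to mimic answer objects with `answer`, `score` and `meta` attributes are all assumptions for illustration. A high score does not guarantee a correct answer, so this only filters out the weakest candidates.

```python
from types import SimpleNamespace

def best_answer(prediction, min_score=0.5):
    """Return (answer, url) of the highest-scoring answer, or ("", "") if too weak."""
    answers = prediction.get("answers") or []
    if not answers:
        return "", ""
    top = max(answers, key=lambda a: a.score)
    if top.score < min_score:
        return "", ""
    return top.answer, top.meta.get("url", "")

# Stand-in for a pipeline result; the values are made up.
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="nur in der 2. Klasse", score=0.82,
                        meta={"url": "https://example.invalid/tweet/1"}),
        SimpleNamespace(answer="ab Hamburg Altona", score=0.31,
                        meta={"url": "https://example.invalid/tweet/2"}),
    ]
}
print(best_answer(fake_prediction))
```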

In [None]:
from ipywidgets import HTML, Text, Button, VBox, HBox
from IPython.display import display


style = HTML(value="""<style>
.app { 
  background-color: #eba8a2;
  padding: 50px 200px;
  max-width: 1000px;
}
.logo {
  background: url(https://raw.githubusercontent.com/ToastyDom/DataChallengesSoSe22/main/logo.png);
  height: 80px;
  background-position: center;
  background-repeat: no-repeat;
  background-size: contain;
}
.messages {
  height: 300px;
  margin: 40px 0;
  overflow-y: scroll;
  flex-direction: column-reverse;
}
.message {
  padding: 10px;
  margin: 5px 0px;
  background-color: #f5e1df;
  border-radius: 10px;
  width: 70%;
}
a {
  color: black;
  text-decoration: none;
}
.message.right {
  margin-left: 25%;
  border-top-right-radius: 0px;
}
.message.left {
  border-top-left-radius: 0px;
}
.input {
  width: 75%;
}
.button {
  width: 25%;
}
</style>""")

input = Text(placeholder="Frage eingeben")
button = Button(description="Fragen")
messages = VBox([
  HTML(value='<div class="message left">Hi, wie kann ich dir helfen?</div>'),
])
messages.add_class('messages')
app = VBox([
  HTML(value='<div class="logo"></div>'),
  messages,
  HBox([input, button]),
])
app.add_class('app')
input.add_class('input')
button.add_class('button')

def on_send_message(b):
  question = input.value
  answer, url = ask_chatbot(question)
  messages.children = (
    HTML(value=f'<a href="{url}" target="_blank"><div class="message left">{answer}</div></a>'),
    HTML(value=f'<div class="message right">{question}</div>'),
    *messages.children,
  )

button.on_click(on_send_message)
input.on_submit(on_send_message)


display(style, app)

HTML(value='<style>\n.app { \n  background-color: #eba8a2;\n  padding: 50px 200px;\n  max-width: 1000px;\n}\n.…

VBox(children=(HTML(value='<div class="logo"></div>'), VBox(children=(HTML(value='<div class="message left">Hi…

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.62 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.95 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.87 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.44 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 34.25 Batches/s]
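
The `on_send_message` handler above interpolates the user's question and the model's answer directly into HTML. Since tweets and user input can contain characters such as `<` or `&`, it may be safer to escape them first. The following is a minimal sketch using the standard library's `html.escape`; the helper name `render_exchange` is hypothetical, and the markup simply mirrors the two message snippets used in the handler.

```python
import html

def render_exchange(question, answer, url):
    """Build the two message snippets with all dynamic values HTML-escaped."""
    q = html.escape(question)
    a = html.escape(answer)
    u = html.escape(url, quote=True)
    return (
        f'<a href="{u}" target="_blank"><div class="message left">{a}</div></a>',
        f'<div class="message right">{q}</div>',
    )

answer_html, question_html = render_exchange(
    "<b>Wann gilt das Ticket?</b>",
    "frühestens ab dem 1. Juni",
    "https://twitter.com/DB_Bahn",
)
print(question_html)
```

The two returned strings could then be wrapped in `HTML(value=...)` widgets in the same way the handler does.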


# Generative QA Pipeline - This one performs poorly

In [None]:
from haystack.pipelines import GenerativeQAPipeline
from haystack.nodes import RAGenerator
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader


# Initialize dense retriever
embedding_retriever = EmbeddingRetriever(
    document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)

# Needed for generator
document_store.update_embeddings(embedding_retriever, update_existing_embeddings=False)
document_store.return_embedding = True



# Initialize generator
rag_generator = RAGenerator()



In [None]:

# Generative QA
pipe = GenerativeQAPipeline(generator=rag_generator, retriever=embedding_retriever)
res = pipe.run(query="Wann gilt das 9€ Ticket?", params={"Retriever": {"top_k": 10}})



# Or just have the simple output (print the result of this pipeline, not the earlier `prediction`):
print_answers(res, details="minimum")

# FAQ Pipeline  - This one requires a different document-store
=> Reference: https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb

In [None]:
from haystack.pipelines import FAQPipeline


pipe = FAQPipeline(retriever=retriever)

In [None]:
from haystack.utils import print_answers

prediction = pipe.run(query="Wann gilt das 9€ Ticket?", params={"Retriever": {"top_k": 10}})
print_answers(prediction, details="medium")

# Document Search Pipeline - Useless

In [None]:
from haystack.pipelines import DocumentSearchPipeline
from haystack.utils import print_documents


bm25_retriever = BM25Retriever(document_store=document_store)  # Adding retriever to make scanning process faster!

p_retrieval = DocumentSearchPipeline(bm25_retriever)
res = p_retrieval.run(query="Wann gilt das 9€ Ticket?", params={"Retriever": {"top_k": 10}})
print_documents(res, max_text_len=200)

# Custom-Built Extractive QA Pipeline

In [None]:
from haystack.nodes import BM25Retriever
from haystack import Pipeline
# Custom built extractive QA pipeline
bm25_retriever = BM25Retriever(document_store=document_store)

# Adding reader. It will select the k-best answers. We use a German ELECTRA model (GELECTRA) fine-tuned on GermanQuAD.
reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad-distilled", use_gpu=True)

p_extractive = Pipeline()
p_extractive.add_node(component=bm25_retriever, name="Retriever", inputs=["Query"])
p_extractive.add_node(component=reader, name="Reader", inputs=["Retriever"])

Trying out new retrievers

In [None]:
# New retriever?
from haystack.nodes import EmbeddingRetriever
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers"
)
document_store.update_embeddings(retriever)

# Adding reader. It will select the k-best answers. We use a German ELECTRA model (GELECTRA) fine-tuned on GermanQuAD.
reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad", use_gpu=True)

p_extractive = Pipeline()
p_extractive.add_node(component=retriever, name="Retriever", inputs=["Query"])
p_extractive.add_node(component=reader, name="Reader", inputs=["Retriever"])



"""
Was ist die DB-Navigator App? - Reise-App
Wann gilt das 9€ Ticket? - nur in der 2. Klasse
Bis wann gilt das 9€ Ticket? - nur in der 2. Klasse

"""

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1
INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 71351 docs ...


Updating embeddings:   0%|          | 0/71351 [00:00<?, ? Docs/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/43 [00:00<?, ?it/s]

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/gelectra-base-germanquad locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Automatically detected language from language model name: german
INFO - haystack.modeling.model.language_model -  Loaded deepset/gelectra-base-germanquad
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


'\nWas ist die DB-Navigator App? - Reise-App\nWann gilt das 9€ Ticket? - nur in der 2. Klasse\nBis wann gilt das 9€ Ticket? - nur in der 2. Klasse\n\n'

In [None]:
# New retriever?
from haystack.nodes import EmbeddingRetriever
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="clips/mfaq",
    model_format="sentence_transformers"
)
document_store.update_embeddings(retriever)

# Adding reader. It will select the k-best answers. We use a German ELECTRA model (GELECTRA) fine-tuned on GermanQuAD.
#reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad-distilled", use_gpu=True)
reader = FARMReader(model_name_or_path="deepset/gelectra-base-germanquad", use_gpu=True)

p_extractive = Pipeline()
p_extractive.add_node(component=retriever, name="Retriever", inputs=["Query"])
p_extractive.add_node(component=reader, name="Reader", inputs=["Retriever"])



"""
Was ist die DB-Navigator App? - Reise-App
Wann gilt das 9€ Ticket? - frühestens ab dem 1. Juni
Bis wann gilt das 9€ Ticket? - frühestens ab dem 1. Juni'



New reader:

Was ist die DB-Navigator App? - 'eine App mit der du z.B. Zugverbindungen suchen und Tickets '
                  'kaufen kannst'
Wann gilt das 9€ Ticket? - frühestens ab dem 1. Juni
Bis wann gilt das 9€ Ticket? - frühestens ab dem 1. Juni'

"""

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model clips/mfaq
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
INFO - haystack.document_stores.elasticsearch -  Updating embeddings for all 71351 docs ...


Updating embeddings:   0%|          | 0/71351 [00:00<?, ? Docs/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/43 [00:00<?, ?it/s]

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/gelectra-base-germanquad locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/740 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/417M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Automatically detected language from language model name: german
INFO - haystack.modeling.model.language_model -  Loaded deepset/gelectra-base-germanquad


Downloading:   0%|          | 0.00/358 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/234k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


"\nWas ist die DB-Navigator App? - Reise-App\nWann gilt das 9€ Ticket? - frühestens ab dem 1. Juni\nBis wann gilt das 9€ Ticket? - frühestens ab dem 1. Juni'\n\n"

Trying out new reader

Ask the question

In [None]:


# Now we can run it
res = p_extractive.run(
    query="Was ist die MyBahnCard?", params={"Retriever": {"top_k": 30}, "Reader": {"top_k": 5}}
)
print_answers(res, details="minimum")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.51 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.01 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.13 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.72 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.38 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.54 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.76 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.79 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.29 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.46 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.67 Batches/s


Query: Was ist die MyBahnCard?
Answers:
[   {   'answer': 'Jugend BahnCard',
        'context': '@Gitarre3 @L1zha_ Nach der Jugend BahnCard gibt es die My '
                   'BahnCard.  https://t.co/WomrqmjZJP Liebe Grüße /ti'},
    {   'answer': 'ein Aboprodukt',
        'context': 'a_TheKPanda Moin, online geht das leider nicht. Die My '
                   'BahnCard ist ein Aboprodukt und darf nicht von '
                   'Jugendlichen unter 18 Jahren erworben werden, so'},
    {   'answer': 'eine ermäßigte BahnCard 50 2. Klasse',
        'context': '@karinpatzer Hallo. Hast du vielleicht die My BahnCard '
                   'oder eine ermäßigte BahnCard 50 2. Klasse? /jn'},
    {   'answer': 'die normale BahnCard',
        'context': '@ZainCh02 Hi, da die Aktion gestern ausgelaufen ist, wird '
                   'es nur noch die normale BahnCard geben. /lu'},
    {   'answer': 'BahnCard 100 1. Klasse',
        'context': 'ebiet 6000 fährst, kannst du die Stadtbahn Bielefeld mit

# Hugging Face Summarizer - This one is not good


In [None]:
from transformers import pipeline
summarizer = pipeline("summarization", model="ml6team/mt5-small-german-finetune-mlsum")
summarizer("Das 9 Euro-Ticket gibt es erst ab Juni. Hallo, das 9-Euro Ticket wird es voraussichtlich ab dem 23. Mai zu kaufen geben", min_length=5, max_length=50)