This notebook uses a BERT (Bidirectional Encoder Representations from Transformers) model to perform question answering (Q&A) over a Polish-language context. It relies on the transformers library and measures how much data is downloaded while the model is loaded. A helper function takes a question as input and returns the answer produced by the Q&A pipeline for the provided context. The goal is to extract a breakdown of payment information with BERT's Q&A capabilities instead of regular expressions.
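For contrast, the regular-expression approach that the Q&A pipeline is meant to replace might look like the sketch below. The pattern is hypothetical and written for a single layout of the payment line, so a reworded notice would no longer match.

```python
import re

# Hypothetical pattern, tailored to one specific wording of the payment line.
AMOUNT_PATTERN = re.compile(r"Kwota do zapłaty[^:]*:\s*(\d+(?:,\d+)?)\s*zł")

def extract_amount(text):
    """Return the amount as a string, or None when the pattern does not match."""
    match = AMOUNT_PATTERN.search(text)
    return match.group(1) if match else None

sample = "Kwota do zapłaty na konto : 119 zł."
print(extract_amount(sample))  # 119
```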

In [1]:
!pip install transformers
!pip install humanize
!pip install psutil

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0
Looking in indexes: https://pypi.org/simple, http

In [2]:
import psutil
initial_io_counters = psutil.net_io_counters()

In [3]:
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad2"
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/711M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [5]:
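import humanize

final_io_counters = psutil.net_io_counters()
# bytes_recv is a system-wide counter, so the difference also includes
# any unrelated traffic received while the model was downloading.
data_downloaded = final_io_counters.bytes_recv - initial_io_counters.bytes_recv
print(f'Data downloaded: {humanize.naturalsize(data_downloaded)}')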

Data downloaded: 717.0 MB
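
The before/after measurement above can be wrapped in a small context manager so the two counter reads cannot drift apart. This is a minimal sketch: `track_bytes_received` and its `read_counter` parameter are hypothetical names, and a stand-in counter is used so the block runs without a network. In the notebook, `read_counter` could be `lambda: psutil.net_io_counters().bytes_recv`.

```python
from contextlib import contextmanager

@contextmanager
def track_bytes_received(read_counter):
    """Yield a dict that is filled with the byte delta when the block exits."""
    result = {}
    start = read_counter()
    try:
        yield result
    finally:
        # Second read happens even if the wrapped code raises.
        result["bytes_recv"] = read_counter() - start

# Stand-in counter returning two fixed readings (before and after).
fake_readings = iter([1_000, 718_000])

with track_bytes_received(lambda: next(fake_readings)) as usage:
    pass  # the model download would happen here

print(usage["bytes_recv"])  # 717000
```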


In [6]:
context="""
Informacja odnośnie rozliczenia za wyżywienie
Ilość dni w IX / 2023 r. - 21 dni 
Ilość dni zgłoszonych nieobecności z poprzedniego miesiąca: 14 

Wyliczenia: 21-14= 7
7x17,00= 119 


Kwota do zapłaty na konto : 119 zł. 
"""
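
The calculation embedded in the context can be checked by hand. The sketch below assumes that 17,00 is a per-day rate in zł (the notice does not say so explicitly) and treats the comma as a decimal separator.

```python
from decimal import Decimal

days_in_month = 21              # "Ilość dni w IX / 2023 r. - 21 dni"
absence_days = 14               # reported absences from the previous month
daily_rate = Decimal("17.00")   # assumption: "17,00" is the rate per day in zł

billable_days = days_in_month - absence_days
amount_due = billable_days * daily_rate

print(billable_days)  # 7
print(amount_due)     # 119.00
```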

In [7]:
def ask(question):
    """Answer a single question against the global `context`."""
    return qa_pipeline({
        'context': context,
        'question': question})

In [14]:
ask("Ilość dni?")

{'score': 0.4355515241622925, 'start': 74, 'end': 76, 'answer': '21'}

In [12]:
ask("Ilość dni nie obecności?")

{'score': 0.00016279886767733842, 'start': 142, 'end': 144, 'answer': '14'}

In [16]:
ask("W jakim miesiącu i roku odbywa się rozliczenie za wyżywienie?")

{'score': 0.005733450409024954, 'start': 59, 'end': 68, 'answer': 'IX / 2023'}
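
The `start` and `end` values in these results appear to be character offsets into the context, with `end` exclusive. The sketch below checks that against two of the outputs shown in this notebook, using only the first lines of the context (including its leading newline); `context_head` is a name introduced here for illustration.

```python
# First lines of the context, including the newline right after the opening quotes.
context_head = (
    "\n"
    "Informacja odnośnie rozliczenia za wyżywienie\n"
    "Ilość dni w IX / 2023 r. - 21 dni \n"
)

# Offsets and answers copied from the pipeline outputs in this notebook.
results = [
    {"start": 59, "end": 68, "answer": "IX / 2023"},
    {"start": 74, "end": 76, "answer": "21"},
]

for result in results:
    span = context_head[result["start"]:result["end"]]
    print(repr(span), span == result["answer"])
```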

In [11]:
%time ask("Jaka kwota do zapłaty?")

{'score': 0.8604817986488342, 'start': 212, 'end': 218, 'answer': '119 zł'}

In [22]:
def create_qa_input(questions, context):
    """Pair every question with the same context, one dict per question."""
    return [{'question': question, 'context': context} for question in questions]

questions = ["ilość dni?", "Ilość dni nieobecności?", "W jakim miesiącu i roku odbywa się rozliczenie za wyżywienie?", "Jaka kwota do zapłaty?"]
qa_pipeline(create_qa_input(questions, context))

[{'score': 0.40143856406211853, 'start': 74, 'end': 76, 'answer': '21'},
 {'score': 0.2581802010536194, 'start': 142, 'end': 144, 'answer': '14'},
 {'score': 0.005733450409024954,
  'start': 59,
  'end': 68,
  'answer': 'IX / 2023'},
 {'score': 0.8604817986488342, 'start': 212, 'end': 218, 'answer': '119 zł'}]
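
Because the scores above vary widely, it may help to keep only answers above a confidence cutoff. The sketch below pairs each question with its result and drops low-scoring answers. The 0.3 threshold and the `confident_answers` helper are hypothetical choices, and the results are copied from the output above so the block runs without the model.

```python
questions = [
    "ilość dni?",
    "Ilość dni nieobecności?",
    "W jakim miesiącu i roku odbywa się rozliczenie za wyżywienie?",
    "Jaka kwota do zapłaty?",
]

# Scores and answers copied from the batch output above.
results = [
    {"score": 0.40143856406211853, "answer": "21"},
    {"score": 0.2581802010536194, "answer": "14"},
    {"score": 0.005733450409024954, "answer": "IX / 2023"},
    {"score": 0.8604817986488342, "answer": "119 zł"},
]

MIN_SCORE = 0.3  # hypothetical cutoff; tune it on real data

def confident_answers(questions, results, min_score=MIN_SCORE):
    """Map each question to its answer, or to None when the score is too low."""
    return {
        question: result["answer"] if result["score"] >= min_score else None
        for question, result in zip(questions, results)
    }

for question, answer in confident_answers(questions, results).items():
    print(f"{question} -> {answer}")
```

Note that with this cutoff the correctly extracted month and year would be discarded, which suggests a single global threshold may be too blunt for this model.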

In [13]:
# TODO: Next step is to fine-tune the BERT model further on domain-specific examples to improve performance
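
One possible first step toward the fine-tuning mentioned in the TODO is to collect labeled examples in a SQuAD-style layout (question, context, and answers with `text` and `answer_start`). The helper below is a hypothetical sketch, not part of this notebook. It locates the answer by its first occurrence in the context, which is an assumption that breaks down when the same string appears more than once.

```python
def make_squad_example(example_id, question, context, answer_text):
    """Build one SQuAD-style record; the answer must occur verbatim in the context."""
    start = context.find(answer_text)  # first occurrence only
    if start == -1:
        raise ValueError(f"Answer {answer_text!r} not found in context")
    return {
        "id": example_id,
        "question": question,
        "context": context,
        "answers": {"text": [answer_text], "answer_start": [start]},
    }

sample_context = "Kwota do zapłaty na konto : 119 zł."
example = make_squad_example("0", "Jaka kwota do zapłaty?", sample_context, "119 zł")
print(example["answers"])
```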