## **NLP Practical Project - Question Answering System**

### **Imports**

In [57]:
import os
import torch

import PyPDF2
from PyPDF2 import PdfReader
import textwrap

import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

from langchain.embeddings import HuggingFaceInstructEmbeddings
from InstructorEmbedding import INSTRUCTOR

### **Tokenizer and Model**

The first line of the code below loads a tokenizer with the AutoTokenizer.from_pretrained method. The tokenizer splits the input text into tokens that the model can process. Here, the tokenizer belonging to the "google/flan-t5-small" model is loaded.

The second line loads a pretrained sequence-to-sequence (Seq2Seq) model with the AutoModelForSeq2SeqLM.from_pretrained method, using the same "google/flan-t5-small" checkpoint so that tokenizer and model share one vocabulary.

In [58]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
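
To make the tokenization step more concrete, here is a toy sketch of what a tokenizer does conceptually: it maps text to integer IDs and back. This is a simplified whitespace tokenizer with a hypothetical five-entry vocabulary (`TOY_VOCAB`, `toy_encode`, and `toy_decode` are illustrative names), not the SentencePiece subword tokenizer that T5-style models use.

```python
# Toy illustration only: the vocabulary and both helpers are hypothetical.
TOY_VOCAB = {"<unk>": 0, "what": 1, "is": 2, "accounts": 3, "payable": 4}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def toy_encode(text):
    """Lowercase, split on whitespace, and map each word to its ID."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]


def toy_decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


ids = toy_encode("What is accounts payable")
print(ids)              # [1, 2, 3, 4]
print(toy_decode(ids))  # what is accounts payable
```

Words outside the vocabulary map to the `<unk>` ID in this sketch. Real subword tokenizers instead break unknown words into smaller known pieces.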

Create a pipeline for text-to-text generation

In [59]:
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    # Greedy decoding: temperature and top_p only take effect when sampling is enabled
    do_sample=False,
    repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

### **Preprocessing**

In [60]:
def pdf_to_txt(pdf_path, txt_path):
    """Extract the text of every page in a PDF and write it to a UTF-8 text file."""
    try:
        # The context manager closes the PDF file even if extraction fails
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            # Fall back to "" for pages that yield no text
            pdf_text = "".join(page.extract_text() or "" for page in pdf_reader.pages)

        with open(txt_path, 'w', encoding='utf-8') as txt_file:
            txt_file.write(pdf_text)

        print("PDF successfully converted to text and saved to {}.".format(txt_path))
    except Exception as e:
        print("Error while converting the PDF to text:", str(e))


pdf_path = '../PDF and TXT Folder/accounts-payable.pdf'
txt_path = '../PDF and TXT Folder/accounts-payable.txt'
pdf_to_txt(pdf_path, txt_path)


PDF successfully converted to text and saved to ../PDF and TXT Folder/accounts-payable.txt.


In [61]:
loader = TextLoader('../PDF and TXT Folder/accounts-payable.txt')
documents = loader.load()

Split the document into chunks:

In [89]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
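
As a rough illustration of what chunk_size and chunk_overlap mean, the sketch below splits a string with a fixed-size sliding window. It is a simplification under stated assumptions: RecursiveCharacterTextSplitter additionally tries to break at separators such as paragraph and line boundaries, which this hypothetical `split_with_overlap` helper does not do.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so content that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.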

### **Instructor Embeddings**

In [90]:
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

load INSTRUCTOR_Transformer
max_seq_length  512


In [91]:
vectordb = Chroma.from_documents(documents=texts, embedding=instructor_embeddings, persist_directory='../db')

Retriever: for each query, the 21 most similar chunks are returned (k=21)

In [96]:
retriever = vectordb.as_retriever(search_kwargs={"k": 21})
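
The idea behind the k parameter can be sketched with plain Python: score every stored chunk vector against the query vector and keep the k best. Cosine similarity is used here as an illustrative assumption; the metric Chroma actually applies depends on how the collection is configured. The `cosine_similarity` and `top_k` helpers and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_chunks, k=2))  # [0, 2]
```

If k exceeds the number of stored chunks, this sketch simply returns all of them.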

In [97]:
qa_chain = RetrievalQA.from_chain_type(llm=local_llm, 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)
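
The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea with a hypothetical template and a hypothetical `build_stuff_prompt` helper; it is not the exact prompt that LangChain uses.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy chunks for illustration only
toy_chunks = ["AP pays suppliers.", "Invoices are checked before payment."]
prompt = build_stuff_prompt(toy_chunks, "What is AP?")
print(prompt)

# Rough upper bound on the context size for the settings in this notebook:
# k=21 retrieved chunks of up to chunk_size=10000 characters each
max_context_chars = 21 * 10000
print(max_context_chars)  # 210000
```

With k=21 and chunk_size=10000, the stuffed context can approach 210,000 characters, which is likely far more than a small model such as flan-t5-small handles well. Lowering k or chunk_size is worth trying if answers come out truncated or off-topic.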

In [98]:
def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
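
Because every chunk comes from the same file, printing one line per retrieved chunk repeats the same path many times. One possible alternative, sketched with a hypothetical `summarize_sources` helper and toy file names, is to count how often each source occurs:

```python
from collections import Counter


def summarize_sources(source_paths):
    """Collapse repeated source paths into 'path (xN)' lines, in first-seen order."""
    counts = Counter(source_paths)
    return [f"{path} (x{n})" for path, n in counts.items()]


toy_sources = ["a.txt", "a.txt", "b.txt"]
for line in summarize_sources(toy_sources):
    print(line)
# a.txt (x2)
# b.txt (x1)
```

In process_llm_response, the input list could be built from `source.metadata['source']` for each entry in `llm_response["source_documents"]`.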

Here is the prompt

In [99]:
query = "What is the Accounts Payable process?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Accounts payable (AP) is the department in a company that’s responsible for paying


Sources:
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt


In [102]:
query2 = "What is the three way match meaning?"
llm_response = qa_chain(query2)
process_llm_response(llm_response)

Accounts payable teams use the three-way match process to ensure that they only pay invoices


Sources:
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payable.txt
../PDF and TXT Folder/accounts-payab