Imports and installation of the required libraries

In [33]:
!pip install langchain sentence-transformers faiss-cpu pypdf transformers torch langchain-community  # install the required libraries



In [76]:
# Third-party integrations live in langchain-community (installed above)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import PyPDFLoader
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate



I chose scientific articles on insomnia. They cover different material: some describe the consequences of insomnia, others focus on its causes, but the main topic is the same throughout. Below, a vector database is built with FAISS.

In [35]:
files = [
    "Insomniapt1.pdf",
    "Insomniapt2.pdf",
    "Insomniapt3.pdf",
    "Insomniapt4.pdf",
    "Insomniapt5.pdf",
]
docs = []
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
for file in files:
    loader = PyPDFLoader(file)
    docs.extend(text_splitter.split_documents(loader.load()))



model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

db = FAISS.from_documents(docs, embeddings)
db.save_local("faiss_index")
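
The effect of `chunk_size=300` and `chunk_overlap=100` can be illustrated with a simplified sliding window. This is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to cut at the listed separators, so its real chunks are usually shorter and end at paragraph, line, or word boundaries. The helper `sliding_chunks` and the dummy text are assumptions made for the illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 700 characters of dummy text (hypothetical stand-in for a PDF page)
sample = "".join(str(i % 10) for i in range(700))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [300, 300, 300]
print(chunks[0][-100:] == chunks[1][:100])  # True: neighbours share 100 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.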


Once the index has been saved, we can load the vector database back from disk.

In [90]:
db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

query = ""
while True:
    print("------------------------------------------")
    query = input(f'Please provide a question for the system. If you wish to stop with the questions type "!End"\n')

    if query == "!End":
        break

    similar_docs = db.similarity_search(query, k=2)  # top 2 results

    for doc in similar_docs:
        print(doc.page_content[:300] + "...\n---")


------------------------------------------
Please provide a question for the system. If you wish to stop with the questions type "!End"
what are the treatments for insomnia?
Current treatment options for insomnia include 
prescribed and over-the-counter medications, psy-
chological and behavioral therapies (also referred 
to as cognitive behavioral therapy for insomnia 
[CBT-I]), and complementary and alternative ther-...
---
Several challenges with regard to the management of insomnia remain. Although 
effective treatment options are available, and these are safer than therapies used in the 
past (for example, barbiturates), no single treatment modality is effective, tolerated or...
---
------------------------------------------
Please provide a question for the system. If you wish to stop with the questions type "!End"
what are the factors influencing insomnia?
Risk factors  
A wide range of sociodemographic correlates of insomnia have been identified, and 
include advanced age, fema
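
Conceptually, `similarity_search(query, k=2)` embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that idea with a brute-force search over Euclidean (L2) distance, which, as far as I know, is also the default metric of the LangChain FAISS wrapper. The three-dimensional vectors, the texts, and the helper names are made up for the illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical toy "embeddings", not produced by any real model
toy_index = [
    ("treatment options for insomnia", [0.9, 0.1, 0.0]),
    ("risk factors for insomnia", [0.1, 0.9, 0.0]),
    ("history of barbiturates", [0.6, 0.0, 0.7]),
]
query_vector = [1.0, 0.0, 0.0]
print(top_k(query_vector, toy_index, k=2))
# ['treatment options for insomnia', 'history of barbiturates']
```

FAISS performs the same kind of nearest-neighbour lookup, only with data structures that stay fast for large collections.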

**QA model setup**

In [87]:
import torch

model_name = "declare-lab/flan-alpaca-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # first GPU if present, otherwise CPU
    max_new_tokens=256,  # cap on the number of generated tokens
    repetition_penalty=1.2,
)

llm = HuggingFacePipeline(pipeline=pipe)

# This prompt is meant to keep the model from making things up when it does not know the answer
prompt = PromptTemplate(
    template="""
Answer the question using only the context below.
If the answer is not in the context, say "I don't know".
Include the source of information.

Context:
{context}

Question:
{question}

Answer with sources:
""",
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

Device set to use cpu
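
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the `{context}` slot of the prompt before the model is called once. The sketch below reproduces that step in plain Python so the final prompt can be inspected. The helper `build_stuff_prompt`, the placeholder chunks, and the `"\n\n"` separator are assumptions for the illustration, not LangChain internals.

```python
TEMPLATE = """
Answer the question using only the context below.
If the answer is not in the context, say "I don't know".
Include the source of information.

Context:
{context}

Question:
{question}

Answer with sources:
"""


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks and fill them into the prompt template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for the k=3 retrieved documents
retrieved = [
    "Placeholder chunk A about treatment options.",
    "Placeholder chunk B about remaining challenges.",
    "Placeholder chunk C about clinical trials.",
]
final_prompt = build_stuff_prompt(retrieved, "what are the treatments for insomnia?")
print(final_prompt)
```

Because everything is sent in a single prompt, `k` and `chunk_size` together determine how much text the model has to fit into its input window.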


A loop that lets the user ask the system questions

In [89]:
query = ""
while True:
    print("------------------------------------------")
    query = input(f'Please provide a question for the system. If you wish to stop with the questions type "!End"\n')
    if query == "!End":
        break

    result = qa_chain.invoke(query)
    print(f"Bot: {result['result']}\n")

    # Sources: collect each source file only once
    unique_sources = set()
    for document in result["source_documents"]:
        unique_sources.add(document.metadata.get("source", "unknown"))

    for i, source in enumerate(unique_sources):
        print(f"Source {i+1}: {source}")

------------------------------------------
Please provide a question for the system. If you wish to stop with the questions type "!End"
what are the treatments for insomnia?
Bot: Current treatment options for insomnia include prescribed and over-the-counter medications, psy- chological and behavioral therapies (also referred to as cognitive behavioral therapy for insomnia [CBT-I], and complementary and alternative ther- Several challenges with regard to the management of insomnia remain. Although effective treatment options are available, and these are safer than therapies used in the past (for example, barbiturates), no single treatment modality is effective, tolerated or past (for example, barbiturates), no single treatment modality is effective, tolerated or acceptable to all patients with insomnia. More clinical trials are needed to examine how to best combine psychological and pharmacological therapies to optimize treatment.

Source 1: Insomniapt4.pdf
Source 2: Insomniapt3.pdf
---
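
A `set` does not keep insertion order, so the numbering of the printed sources may differ between runs. If the sources should appear in retrieval order, one possible variant is to deduplicate with `dict.fromkeys`. The helper name `unique_in_order` and the sample list are assumptions for this sketch.

```python
def unique_in_order(sources):
    """Drop duplicates while keeping the first occurrence of each source."""
    return list(dict.fromkeys(sources))


# Hypothetical list of metadata["source"] values from three retrieved chunks
retrieved_sources = ["Insomniapt4.pdf", "Insomniapt3.pdf", "Insomniapt4.pdf"]
for i, source in enumerate(unique_in_order(retrieved_sources), start=1):
    print(f"Source {i}: {source}")
# Source 1: Insomniapt4.pdf
# Source 2: Insomniapt3.pdf
```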

**Summary**

The QA model used in this task is fairly mediocre, which was to be expected given that it is a rather small model. The model used to measure sentence similarity is also small, yet the resulting RAG system answers a substantial share of the questions correctly. In addition, the system displays its sources and, thanks to the prompt, avoids made-up answers.