In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Installing the required libraries

In [11]:
!pip install langchain sentence-transformers faiss-cpu pypdf transformers torch langchain-community



Loading the PDFs and splitting the text into chunks (by character count)

In [42]:
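from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_split(filename):
  """Load a PDF from the course Data folder and split it into overlapping chunks."""
  loader = PyPDFLoader(f'/content/drive/MyDrive/ML_kurs/Data/{filename}')
  pages = loader.load()  # one Document per PDF page

  # chunk_size and chunk_overlap are measured in characters (length_function=len);
  # separators are tried in order: paragraphs, lines, words, single characters
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=300,
      chunk_overlap=100,
      length_function=len,
      separators=["\n\n", "\n", " ", ""]
  )
  docs = text_splitter.split_documents(pages)
  return docs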

In [25]:
docs1 = load_split('1.pdf')
docs2 = load_split('2.pdf')
docs3 = load_split('3.pdf')

docs = docs1 + docs2 + docs3
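
To get a feel for what chunk_size=300 and chunk_overlap=100 mean, here is a simplified sketch of fixed-size chunking with overlap. This is not what RecursiveCharacterTextSplitter does internally (it prefers to cut at the listed separators, so real chunks are usually shorter than 300 characters); the helper name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=100):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 700 characters of dummy text, just to look at the window sizes
sample = "".join(str(i % 10) for i in range(700))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

With a step of 200 characters, neighbouring chunks share their last/first 100 characters, so a sentence cut at a chunk border still appears whole in one of the two chunks.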



Embedding the chunks with the BGE-M3 model and loading everything into a FAISS vector store

In [26]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")

db = FAISS.from_documents(docs, embeddings)
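
Conceptually, the vector store answers a query by finding the stored vectors closest to the query vector. The toy sketch below does this by brute force with Euclidean (L2) distance, which, as far as I know, is the default for LangChain's FAISS wrapper. The 2-D vectors and the `top_k` helper are invented for illustration; real BGE-M3 dense embeddings have 1024 dimensions.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, vectors, k=3):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]

# toy 2-D "embeddings" standing in for the chunk vectors
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
hits = top_k([0.9, 0.1], toy_vectors, k=2)
print(hits)
```

FAISS does the same comparison far more efficiently; the sketch only shows what "similar" means here.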

Checking that everything works by running a query against the vector store.

In [27]:
query = "what is supervised learning?"
similar_docs = db.similarity_search(query, k=3)

for doc in similar_docs:
  print(doc.page_content[:300] + '...\n---')

 Supervised learning - the various algorithms generate a function that maps 
inputs to desired outputs. One standard formulation of the supervised learning 
task is the classification problem: the learner is required to learn (to...
---
Supervised learning is the most common technique for training for neutral 
networks and decision trees. Both of these are depended on the information given by  
the pre-determinate classification.  
Also, this learning is used in applications where historical data predicts likely...
---
Figure 1: Supervised learning process [18] 
 
Supervised learning (Figure 1) is the most common technique in the classification 
problems, since the goal is often to get the machine to learn a classification system 
that we’ve created....
---


Default conversational setup built on the BART model, with "cyberpsychosis" taken into account, citations of the documents the answer was drawn from, and conversation-history memory.

In [44]:
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
from langchain.memory import ConversationBufferMemory

model_name = "facebook/bart-large-cnn"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

gen_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    min_length=50,
    do_sample=True
)

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

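def conversational_rag(query, db, memory, k=3):
    # buffer_as_str renders the history as plain "Human: ... / AI: ..." text;
    # memory.buffer would be a list of message objects here (return_messages=True)
    conversation_context = memory.buffer_as_str
    full_query = f"Previous conversation:\n{conversation_context}\n\nUser question: {query}"

    docs = db.similarity_search(full_query, k=k)
    if not docs:
        return "No answer found.", []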

    context = " ".join([d.page_content for d in docs])
    prompt = f"Answer the question based on the context.\n\nContext: {context}\n\nQuestion: {query}\nAnswer:"

    result = gen_pipeline(prompt)
    answer = result[0]['generated_text']

    sources = [f"{d.metadata.get('source', 'unknown')} (page {d.metadata.get('page', '?')})" for d in docs]

    memory.chat_memory.add_user_message(query)
    memory.chat_memory.add_ai_message(answer)

    return answer, sources

print("RAG assistant (BART). Type 'exit' to quit.")

while True:
    query = input("You: ")
    if query.lower() in ["exit", "quit"]:
        break

    answer, sources = conversational_rag(query, db, memory)
    print(f"Assistant: {answer}")
    if sources:
        print("Sources:", sources)


Device set to use cpu


RAG assistant (BART). Type 'exit' to quit.
You: what is supervised learning?
Assistant: Supervised learning is the most common technique in the classification problem. The goal is often to get the machine to learn a classification system that we’ve created. The various algorithms generate a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classificationproblem.
Sources: ['/content/drive/MyDrive/ML_kurs/Data/2.pdf (page 2)', '/content/drive/MyDrive/ML_kurs/Data/2.pdf (page 10)', '/content/drive/MyDrive/ML_kurs/Data/2.pdf (page 4)']
You: and what is the difference between this and unsupervised learning?
Assistant: Supervised learning (Figure 1) is the most common technique in the classification problem. The goal is often to get the machine to learn a classification system. The main task of unsupervised learning is to automatically develop labels. These algorithms are searching the similarity between pieces of information.
Sourc
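
The Sources list printed above repeats the full Drive path for every hit. A small formatting helper could shorten it. This is only a sketch: `format_sources` is a hypothetical name, and it assumes the `page` value in the metadata is a 0-based index (which, as far as I can tell, is how PyPDFLoader numbers pages), so it adds 1 for display.

```python
def format_sources(metadatas):
    """Turn metadata dicts into unique 'file (page N)' strings, keeping retrieval order."""
    seen = set()
    formatted = []
    for meta in metadatas:
        source = meta.get("source", "unknown")
        page = meta.get("page")
        # assumption: 'page' is a 0-based index, so add 1 for human-readable output
        page_label = page + 1 if isinstance(page, int) else "?"
        name = source.rsplit("/", 1)[-1]  # keep only the file name
        entry = f"{name} (page {page_label})"
        if entry not in seen:  # skip duplicates from chunks on the same page
            seen.add(entry)
            formatted.append(entry)
    return formatted

example = [
    {"source": "/content/drive/MyDrive/ML_kurs/Data/2.pdf", "page": 2},
    {"source": "/content/drive/MyDrive/ML_kurs/Data/2.pdf", "page": 10},
    {"source": "/content/drive/MyDrive/ML_kurs/Data/2.pdf", "page": 2},
    {},
]
print(format_sources(example))
```

In the notebook it could be called as `format_sources([d.metadata for d in docs])` inside `conversational_rag`.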

The answers may not be perfect, but at least we get some xD
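
One possible (unverified) reason for weak follow-up answers is that the whole conversation, including the long assistant replies, is embedded together with the question, which may dilute the retrieval query. A sketch of an alternative is to prepend only the last few user questions. The helper `build_retrieval_query` and the `(role, text)` history format are assumptions for illustration, not part of LangChain.

```python
def build_retrieval_query(question, history, max_turns=2):
    """Prefix the question with the last max_turns user messages.

    history is a list of (role, text) tuples, oldest first.
    """
    user_turns = [text for role, text in history if role == "user"]
    # guard against max_turns=0, because a [-0:] slice would return the whole list
    recent = user_turns[-max_turns:] if max_turns > 0 else []
    return " ".join(recent + [question])

history = [
    ("user", "what is supervised learning?"),
    ("assistant", "Supervised learning is the most common technique..."),
]
follow_up = "and what is the difference between this and unsupervised learning?"
print(build_retrieval_query(follow_up, history))
```

The resulting string would replace `full_query` in the similarity search; the full history can still go into the generation prompt if desired.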