Let's load the PDF file first. We can read it using the `PdfReader` class from `PyPDF2`.

In [None]:
from PyPDF2 import PdfReader

pdf_file_path = "docs/crypto.pdf"
loader = PdfReader(pdf_file_path)

Next we can collect the text from every page of that PDF. Pages with no extractable text are skipped.

In [3]:
raw_text = ""

for page in loader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

After that, we can split the collected text using `CharacterTextSplitter` from `LangChain`. We split it because we will store the data in a vector database as many small documents instead of one large one. Each document holds a different part of the information, so when we need to know about something, we only retrieve the documents that cover it. We don't need to process the whole text.

In [4]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator = "\n", 
                                      chunk_size = 1000, 
                                      chunk_overlap = 10, 
                                      length_function = len)
text = text_splitter.split_text(raw_text)
print(text)

['Inf2: Software Engineering and Professional\nPractice\nLecture 4: Requirements Engineering\nCristina Adriana Alexandru\nSchool of Informatics\nUniversity of EdinburghIn week 1, on the SE part of this course . . .\nIObjectives, motivation and structure of the Inf2C-SE course\nIWhy is Software Engineering still hard?\nISoftware engineering activities\nIBrief history of SE\nISoftware project vs software product engineering\nISoftware development processes: plan-driven and agile\n2 / 21This lecture\nIRequirements engineering\nIWhat is a requirement?\nIKinds of requirements\nIRequirements vs. design\nIThe concept of a stakeholder\nISub-activities of requirements engineering\n3 / 21What is a software requirement?\nA software requirement is \\a property that must be exhibited by\nsomething in order to solve some problem in the real world" (From\nSWEBOK V3, Ch1)\nRequirements re\rect the needs of di\x0berent people at various levels\nof the organisation.\nRequirements engineering is often us
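
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified pure-Python sketch of separator-based chunking. It is not LangChain's implementation: the function name `split_into_chunks` is hypothetical, and the overlap is taken as a raw character tail of the previous chunk purely for illustration.

```python
def split_into_chunks(text, separator="\n", chunk_size=1000, chunk_overlap=10):
    """Greedily pack separator-delimited pieces into chunks of about chunk_size characters.

    Simplified illustration only. A single piece longer than chunk_size is kept
    whole, and the overlap tail can push a chunk slightly past chunk_size.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or the chunk is empty), so keep packing.
            current = candidate
        else:
            # The chunk is full: store it and start the next one.
            chunks.append(current)
            # Carry the last chunk_overlap characters over as shared context.
            tail = current[-chunk_overlap:] if chunk_overlap > 0 else ""
            current = tail + separator + piece if tail else piece
    if current:
        chunks.append(current)
    return chunks


sample = "\n".join(f"line {i}" for i in range(10))
for chunk in split_into_chunks(sample, chunk_size=20, chunk_overlap=3):
    print(repr(chunk))
```

A larger `chunk_overlap` repeats more of the previous chunk at the start of the next one, which can help when a sentence is cut at a chunk boundary.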

Now we can turn the chunks into vectors using an embedding function and store them in a Chroma vector database.

In [5]:
import chromadb
import chromadb.api.client
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Clear Chroma's cached client state so the database content is refreshed
chromadb.api.client.SharedSystemClient.clear_system_cache()

vectordb = Chroma.from_texts(text, embedding_function, persist_directory="docs/chroma_db")
vectordb.persist()

Let's try to load the vector database back from disk now.

In [6]:
vectordb = Chroma(persist_directory = "docs/chroma_db", embedding_function = embedding_function)
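
Once the store is loaded, retrieval comes down to finding the stored vectors closest to the query vector. The sketch below illustrates that idea with toy three-dimensional vectors and cosine similarity. The helper names `cosine_similarity` and `top_k` and the toy vectors are assumptions made for illustration, and the distance metric Chroma actually uses depends on how the collection is configured. With the real store, the equivalent LangChain call would be something like `vectordb.similarity_search(query, k=2)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to query_vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384-dimensional ones.
store = [
    ("What is a requirement?", [0.9, 0.1, 0.0]),
    ("Kinds of requirements", [0.8, 0.3, 0.1]),
    ("Brief history of SE", [0.1, 0.2, 0.9]),
]

print(top_k([1.0, 0.2, 0.0], store, k=2))
```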


Since we already have the vector database, we can now set up a chain that helps us answer questions.

"What do you mean by _chain_?"

It is a pipeline that connects the vector database with your prompt. When you write a prompt, the chain retrieves the most relevant documents from the vector database and passes them, together with your question, to the LLM, which then writes the answer. We can set up this chain using `RetrievalQA` from `langchain`.

In [7]:
import os
from dotenv import load_dotenv
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# Load environment variables (if needed)
load_dotenv(dotenv_path='.env')

# Define Ollama LLM with DeepSeek-R1 (1.5B)
llm = Ollama(model="deepseek-r1:1.5b")

# Assuming vectordb is already initialized
retriever = vectordb.as_retriever()

# Create the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    verbose=False
)

# Test the QA system
question = "Explain the theory of relativity."
result = qa_chain.invoke({"query": question})
print(result)

{'query': 'Explain the theory of relativity.', 'result': "<think>\nOkay, I need to explain the theory of relativity. Hmm, I remember it's a big deal in physics. There are two parts: special and general relativity. \n\nSpecial relativity deals with objects moving at constant speeds, especially near the speed of light. Einstein saw that as an object moves closer to the speed of light, time slows down relative to someone else. That's called time dilation. And he proposed that mass increases as speed goes up.\n\nThen there was general relativity, which is about gravity and how it curves spacetime. Mass and energy create curvature in spacetime, and objects move along these paths. This means that objects can orbit each other without needing a direct connection—like how planets orbit the sun even though they're not connected by mass.\n\nSo putting it all together, Einstein's theory shows how time is relative for different observers when moving at high speeds and how gravity works through curv
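
To show what `chain_type="stuff"` means in practice, here is a minimal sketch of how retrieved chunks can be "stuffed" into a single prompt before it is sent to the LLM. The function name `build_stuffed_prompt` and the prompt wording are assumptions for illustration; LangChain's actual template differs.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate all retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, shortened from the lecture text above.
retrieved = [
    "A software requirement is a property that must be exhibited by something "
    "in order to solve some problem in the real world.",
    "Requirements reflect the needs of different people at various levels of the organisation.",
]

print(build_stuffed_prompt("What is a software requirement?", retrieved))
```

Because `return_source_documents=True` is set, the real result also carries a `source_documents` entry, which is one way to check which chunks the answer was based on.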

Now, let's start our chain. 

We will use this chain to do two things:
1. Make a summary
2. Act as a chatbot

First, we will summarise the documents.

In [9]:
chain_result = qa_chain.invoke({"query": "Can you give me a summary of the context I gave, be super clear and explicit?"})
# print(chain_result)
answer = chain_result["result"]
print(answer)

<think>
Okay, so I'm trying to understand this context about software requirements elicitation methods. Let me read through it carefully.

They mentioned several methods like scenarios, prototypes, facilitations, observation, etc., each with different uses and features. The user provided a detailed breakdown of what each method does, focusing on how they help in gathering requirements, making the process more engaging, capturing user stories, and dealing with ambiguities.

Hmm, I think the main idea is that these methods are various ways to ask stakeholders for their needs or plans before building the software product. They want to make sure everyone's on the same page so the product works as intended. 

They also touched on how each method can be useful in different situations—like using a scenario when you have a specific interaction in mind, including prototypes for more detailed features, facilitating discussions with others to refine requirements, and observation if changing an ex
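
The answer above starts with the model's reasoning inside a `<think>` block, which we probably do not want to show on a dashboard. A minimal sketch for stripping it is given below. It assumes the model closes the block with `</think>` (the output shown here is truncated before that point), and the helper name `strip_think` is hypothetical.

```python
import re


def strip_think(text):
    """Remove any <think>...</think> blocks and trim surrounding whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


raw = "<think>\nOkay, let me read the context first.\n</think>\n\nThe lecture covers elicitation methods."
print(strip_think(raw))
```

With the chain output, this could be applied as `answer = strip_think(chain_result["result"])` before saving.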

1. Make empty json `summary`
2. Collect all topics
3. Turn topics to keys
4. Make question prompt
5. Make question function
6. Build the loop
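
One possible reading of the six steps above is a loop that builds one summary per topic. The sketch below follows that reading. The topic list, the prompt wording, the function name `summarise_topics`, and the stand-in `fake_ask` are all assumptions for illustration, not part of the original notebook.

```python
import json


def summarise_topics(topics, ask):
    """Build a {key: summary} dict by asking one question per topic."""
    summary = {}                                      # 1. empty summary
    for topic in topics:                              # 6. loop over 2. the topics
        key = topic.lower().replace(" ", "_")         # 3. turn the topic into a key
        prompt = f"Summarise what the document says about {topic}."  # 4. question prompt
        summary[key] = ask(prompt)                    # 5. question function
    return summary


def fake_ask(prompt):
    """Stand-in for the QA chain so the sketch runs without an LLM."""
    return f"(answer to: {prompt})"


topics = ["Kinds of requirements", "Stakeholders"]
summary = summarise_topics(topics, fake_ask)
print(json.dumps(summary, indent=2))
```

To use the real chain instead of the stand-in, `ask` could be `lambda p: qa_chain.invoke({"query": p})["result"]`.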

We can now save the summary to `summaries.json`. We will show these summaries on our dashboard.

In [10]:
import json 


# Wrap the answer in a dict so it can be saved as JSON
all_summaries = {"summary": answer}

with open("docs/summaries.json", "w") as f:
    json.dump(all_summaries, f)
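
For the dashboard side, reading the file back could look like the sketch below. The helper name `load_summary` is hypothetical, and a temporary directory stands in for `docs/` so the example runs on its own.

```python
import json
import os
import tempfile


def load_summary(path):
    """Read the saved JSON file and return the text stored under 'summary'."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["summary"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summaries.json")
    # Write a placeholder summary in the same shape as the notebook does.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"summary": "placeholder summary text"}, f)
    loaded = load_summary(path)

print(loaded)
```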