Let's load the PDF file first. We can read it with the `PdfReader` class from `PyPDF2`.

In [1]:
from PyPDF2 import PdfReader

pdf_file_path = "docs/lecture4requirements.pdf"
loader = PdfReader(pdf_file_path)

Next we can collect all text from that PDF. 

In [2]:
raw_text = ""

for page in loader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

After that, we split the collected text using `CharacterTextSplitter` from `LangChain`. We split it because we will store the data in a vector database as many small documents (chunks) rather than one large one. Each chunk holds a different part of the content, so when we need information about a topic, we only retrieve the chunks that cover it instead of processing the whole text.

In [3]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator = "\n", 
                                      chunk_size = 1000, 
                                      chunk_overlap = 10, 
                                      length_function = len)
text = text_splitter.split_text(raw_text)
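
To illustrate roughly what the splitter does, here is a simplified pure-Python sketch. It is not LangChain's implementation: `split_on_separator` is a hypothetical helper that greedily packs separator-delimited pieces into chunks of at most `chunk_size` characters, and it ignores `chunk_overlap` entirely.

```python
def split_on_separator(text, separator="\n", chunk_size=1000):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplified illustration only: there is no chunk overlap, and a single piece
    longer than chunk_size is kept whole rather than cut.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the piece still fits (or is the first piece of a chunk)
        else:
            chunks.append(current)  # the chunk is full, so start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text with a small chunk_size so the effect is visible.
sample = "line one\nline two\nline three\nline four"
chunks = split_on_separator(sample, chunk_size=20)
print(chunks)
```

With a small `chunk_size`, the sample text ends up in two chunks of two lines each. A larger value such as the 1000 used above keeps more surrounding context in every chunk.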

Now we can convert the chunks into vectors with an embedding function and store them in a Chroma vector database.

In [4]:
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name = "all-MiniLM-L6-v2")

# vectordb = Chroma.from_texts(text, embedding_function, persist_directory = "docs/chroma_db")
# vectordb.persist()
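
To make the idea of a vector store more concrete, the sketch below ranks a few stored chunks by cosine similarity to a query vector. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented by hand (a model such as `all-MiniLM-L6-v2` produces 384-dimensional ones), and `top_k` is a hypothetical helper, not part of Chroma's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made "embeddings", keyed by the chunk text they represent.
store = {
    "chunk about functional requirements": [0.9, 0.1, 0.0],
    "chunk about course logistics": [0.0, 0.2, 0.9],
    "chunk about non-functional requirements": [0.7, 0.6, 0.1],
}


def top_k(query_vector, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


query = [1.0, 0.2, 0.0]  # stands in for the embedded user question
results = top_k(query, store)
print(results)
```

A real vector database does the same kind of ranking, but over far more vectors and usually with an approximate index so it does not have to compare against every stored chunk.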


Now let's load the persisted vector database from disk.

In [5]:
vectordb = Chroma(persist_directory = "docs/chroma_db", embedding_function = embedding_function)

⚠️ It looks like you upgraded from a version below 0.5.6 and could benefit from vacuuming your database. Run chromadb utils vacuum --help for more information.


Since we already have the vector database, we can now set up a chain that helps us answer questions.

"What do you mean by _chain_?"

A chain is a component that connects the vector database to your prompt. When you ask a question, it retrieves the relevant chunks from the vector database and passes them to the LLM, which writes the answer from them. We set up this chain using `RetrievalQA` from `langchain`.

In [11]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

load_dotenv(dotenv_path='.env')

# Read the key from the environment. Never print it: notebook output is often shared.
api_key = os.getenv("API_KEY")
if not api_key:
    raise RuntimeError("API_KEY is not set; add it to the .env file")

retriever = vectordb.as_retriever()

llm = ChatOpenAI(api_key=api_key, temperature = 0.9, model='deepseek-reasoner', base_url="https://api.deepseek.com/v1")

# "stuff" puts all retrieved chunks into a single prompt for the LLM.
qa_chain = RetrievalQA.from_chain_type(llm = llm,
                                       chain_type = "stuff",
                                       retriever = retriever,
                                       return_source_documents = True,
                                       verbose = False)


Now, let's start our chain. 

In [12]:
chain_result = qa_chain.invoke({"query": "What is requirement engineering?"})
# print(chain_result)
answer = chain_result["result"]
print(answer)

The provided context does not mention or discuss "requirement engineering." Therefore, based on the information given, I cannot provide an answer to your question. For general knowledge, requirement engineering typically refers to the process of defining, documenting, and maintaining requirements in software development or systems engineering, but this is outside the scope of the provided context.
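
The answer above depends entirely on which chunks the retriever returned. As a rough sketch of why, here is what a "stuff"-style chain does conceptually: it pastes the retrieved chunks into one prompt and sends that to the LLM. The prompt wording and the `build_stuff_prompt` helper are hypothetical, written for illustration; they are not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join the retrieved chunks into one context block and wrap it in a prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n"
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for whatever the retriever returned.
chunks = [
    "Requirements describe what a system should do.",
    "Stakeholders include users and clients.",
]
prompt = build_stuff_prompt("What is a requirement?", chunks)
print(prompt)
```

If none of the retrieved chunks mention the topic of the question, the model only sees unrelated context, which is one plausible reason for a "the provided context does not mention..." reply.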


We will use this chain to do two things:
1. Make summaries
2. Act as a chatbot

First, we will summarise the document. The steps are:

1. Create an empty dictionary `all_summaries`
2. Collect all topics
3. Use the topics as keys
4. Write the question prompt
5. Ask the chain the question for each topic
6. Build the loop

In [13]:
with open("topics.txt", "r", encoding = "utf-8") as r: 
    topics = r.read()
    
titles = topics.split("\n\n")
topics = [x.split("\n") for x in titles]

Now we have all the topics. We can then run a loop that builds one prompt per topic, where only the topic changes. Here's what our prompt looks like:

> _"Can you give me the summary of {topic} section given in the document"_
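
As a small sketch of how the prompts are built, the block below parses a topics file and fills in the template. It uses an inline sample instead of the real `topics.txt`. The sample topics are invented, and the file layout (groups separated by a blank line, one topic per line) is an assumption inferred from the `split` calls above.

```python
# Hypothetical stand-in for the contents of topics.txt.
sample_topics_file = (
    "Introduction\nStakeholders\n\n"
    "Functional requirements\nNon-functional requirements\n"
)

# Split into groups on blank lines, then into individual topics on newlines.
groups = [block.split("\n") for block in sample_topics_file.split("\n\n")]

# Flatten the groups and drop blank entries (for example, from a trailing newline).
flat_topics = [t.strip() for group in groups for t in group if t.strip()]

# One prompt per topic; only the topic name changes.
prompts = [
    f"Can you give me the summary of {t} section given in the document"
    for t in flat_topics
]

for p in prompts:
    print(p)
```

Filtering out blank entries matters here because an empty topic would still produce a prompt and cost an LLM call.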

In [None]:
# Flatten the grouped topics into one list, skipping blank lines.
all_topics = []
for each in topics:
    all_topics.extend(t for t in each if t.strip())


In [62]:
from tqdm.auto import tqdm

all_summaries = {}

# Wrapping the iterable lets tqdm advance and close the bar on its own.
for t in tqdm(all_topics):
    topic = f"Can you give me the summary of {t} section given in the document"
    chain_result = qa_chain.invoke({"query": topic})
    answer = chain_result["result"]
    all_summaries[t] = answer

We can now save the summaries to `docs/summaries.json`. We will show these summaries on our dashboard.

In [12]:
import json 

with open("docs/summaries.json", "w") as f:
    json.dump(all_summaries, f)
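
For the dashboard side, the file can be read back with `json.load`. The sketch below checks the round trip on invented data and writes to a temporary directory, so it does not touch the real `docs/summaries.json`. The `encoding`, `ensure_ascii` and `indent` arguments are optional additions, not part of the cell above.

```python
import json
import os
import tempfile

# Hypothetical summaries standing in for all_summaries.
example_summaries = {
    "Introduction": "A short summary.",
    "Stakeholders": "Another summary.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "summaries.json")

    # Write with an explicit encoding; indent makes the file easier to inspect.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example_summaries, f, ensure_ascii=False, indent=2)

    # Read it back the way a dashboard could.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == example_summaries)
```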