# Natural Language Processing

# Retrieval-Augmented Generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional, often private or real-time, data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date. To build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the model's knowledge with the specific information it needs.

<img src="./figures/RAG-process.png" >

Introducing `ChakyBot`, an innovative chatbot designed to assist Chaky (the instructor) and TA (Gun) in explaining the lesson of the NLP course to students. Leveraging LangChain technology, ChakyBot excels in retrieving information from documents, ensuring a seamless and efficient learning experience for students engaging with the NLP curriculum.

1. Prompt
2. Retrieval
3. Memory
4. Chain

In [1]:
# langchain library
!pip install langchain==0.0.350
!pip install langchain-community==0.0.4
# LLM
!pip install accelerate==0.25.0
!pip install transformers==4.36.2
!pip install bitsandbytes==0.45.3
# text Embedding
!pip install sentence-transformers==2.2.2
!pip install InstructorEmbedding==1.0.1
# vectorstore
!pip install pymupdf==1.23.8
!pip install faiss-cpu==1.7.4
# huggingface_hub
!pip install -U huggingface-hub==0.20.0


In [4]:
import os
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## 1. Prompt

A prompt is a set of instructions or input provided by a user to guide the model's response. It helps the model understand the context and generate relevant, coherent output, such as answering questions, completing sentences, or engaging in a conversation.

In [56]:
from langchain import PromptTemplate

prompt_template = """
    I know about Kaung Nyo Lwin. You can ask me about him.
    {context}
    Question: {question}
    Answer:
    """.strip()

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

PROMPT
#using str.format 
#The placeholder is defined using curly brackets: {} {}

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='I know about Kaung Nyo Lwin. You can ask me about him.\n    {context}\n    Question: {question}\n    Answer:')

In [57]:
PROMPT.format(
    context = "I am a DSAI student",
    question = "What do you do?"
)

'I know about Kaung Nyo Lwin. You can ask me about him.\n    I am a DSAI student\n    Question: What do you do?\n    Answer:'
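As a rough illustration of the `str.format` mechanism mentioned in the comments above, the sketch below uses only the standard library. `string.Formatter` lists the placeholders in a template (analogous to `input_variables`), and `str.format` fills them in. The helper name `find_placeholders` and the shortened template are hypothetical and not part of LangChain.

```python
from string import Formatter

# A shortened, hypothetical template using the same two placeholders as PROMPT.
template = "Context: {context}\nQuestion: {question}\nAnswer:"


def find_placeholders(template: str) -> list[str]:
    """Return the sorted, de-duplicated placeholder names in a str.format template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so those entries are skipped.
    names = {field for _, field, _, _ in Formatter().parse(template) if field}
    return sorted(names)


placeholders = find_placeholders(template)
filled = template.format(context="I am a DSAI student", question="What do you do?")

print(placeholders)
print(filled)
```

Calling `format` with a missing placeholder raises a `KeyError`, which is one reason a template object that knows its own input variables is convenient.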

## 2. Retrieval

1. `Document loaders` : Load documents from many different sources (HTML, PDF, code). 
2. `Document transformers` : One of the essential steps in document retrieval is breaking down a large document into smaller, relevant chunks to enhance the retrieval process.
3. `Text embedding models` : Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar.
4. `Vector stores` : Databases that support efficient storage and searching of these embeddings.
5. `Retrievers` : Once the data is in the database, you still need to retrieve it.

### 2.1 Document Loaders 
Use document loaders to load data from a source as `Document` objects. A `Document` is a piece of text and associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

I load three personal documents:
- Resume
- LinkedIn Profile
- Essay about me

In [43]:
from langchain.document_loaders import PyMuPDFLoader,DirectoryLoader

pdf_directory = './documents/'

# loader = PyMuPDFLoader(nlp_docs)
loader = DirectoryLoader(pdf_directory, glob="*.pdf", loader_cls=PyMuPDFLoader)
documents = loader.load()

In [44]:
documents

[Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': 'documents/Kaung_NyoLwin_Resume.pdf', 'file_path': 'documents/Kaung_NyoLwin_Resume.pdf', 'total_pages': 4, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Pathum Thani 13\n+66648328635\nkaungnyolwin@gmail.com\nKAUNG NYO LWIN\nResults-driven Senior Business Analyst at KEYfields with expertise in data analysis and\nproject management. Proven ability to enhance operational efficiency through SQL\nprogramming and business process mapping, leading to automated workflows.\nSuccessfully leveraged machine learning and AI technologies to achieve significant\nproject outcomes. Committed to further integrating AI innovations to optimize\nprocesses and drive productivity.\nPROFESSIONAL\nSUMMARY\nSKILLS\nMachine learning\n•\nData analysis\n•\nRequirement gathering\n•\nProject management\n•\nBusiness

In [45]:
len(documents)

7

In [46]:
documents[1]

Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': 'documents/Kaung_NyoLwin_Resume.pdf', 'file_path': 'documents/Kaung_NyoLwin_Resume.pdf', 'total_pages': 4, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 1}, page_content="SENIOR ASSET MANAGEMENT ANALYST, 05/2020 - 09/2020\nTelecom International Myanmar\nASSET MANAGEMENT ANALYST, 06/2017 - 04/2020\nKAUNGNYOLWIN, Monywa, Myanmar\ntechnical specifications.\nImplemented SQL Server Agent jobs to automate routine database backup tasks\nseamlessly.\n•\nStreamlined internal workflows through the creation of dynamic PowerShell scripts.\n•\ntook over the pivotal role of lead expert for the accounting system module\n•\nMigrated data from various sources into the designated accounting systems,\nensuring accuracy and completeness, and configured financial reports for newly\nacquired clients.\n•\nPerformed root cause

### 2.2 Document Transformers

`RecursiveCharacterTextSplitter` is the recommended text splitter for generic text. It is parameterized by a list of separator characters and tries to split on them in order until the chunks are small enough.

In [47]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)

doc = text_splitter.split_documents(documents)

In [48]:
doc[1]

Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': 'documents/Kaung_NyoLwin_Resume.pdf', 'file_path': 'documents/Kaung_NyoLwin_Resume.pdf', 'total_pages': 4, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='•\nBusiness process mapping\n•\nData mining\n•\nDatabase management\n•\nMicrosoft Excel mastery\n•\nSQL programming\n•\nPython Programming\n•\nFeature engineering\n•\nNeural networks\n•\nNoSQL databases\n•\nDeep learning\n•\nScikit-learn\n•\nPyTorch\n•\nData visualization\n•\nReinforcement learning\n•\nGit\n•\nData modeling\n•\nNatural language processing\n•\nComputer vision\n•\nLinux operating system\n•\nContainerization technologies\n•\nSENIOR BUSINESS ANALYST, 02/2023 - 07/2024\nKEYfields, Yangon, Myanmar\nEXPERIENCE\nAddressed customer queries and requests for helpdesk services with precision and\nefficiency.\n•\nIdentified and res

In [49]:
len(doc)

20
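To make the "split on separators in order" idea concrete, here is a simplified standard-library sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, the default separator list is an assumption, and `chunk_overlap` is deliberately left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small parts together
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg\n\nhhh"
chunks = recursive_split(sample, chunk_size=10)
print(chunks)
```

Paragraph breaks are tried first, so text is only cut at spaces (or mid-word) when a paragraph does not fit in a single chunk.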

### 2.3 Text Embedding Models
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.

*Note* Instructor Model : [Huggingface](https://huggingface.co/hkunlp/instructor-base) | [Paper](https://arxiv.org/abs/2212.09741)
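"Most similar in the vector space" is commonly measured with cosine similarity. The sketch below computes it with the standard library on hypothetical 3-dimensional vectors; real embedding vectors have hundreds of dimensions, and the values here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(query, similar))
print(cosine_similarity(query, unrelated))
```

A value near 1 means the vectors point in almost the same direction, while a value near 0 means they are unrelated.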

In [25]:
import torch
import InstructorEmbedding
import huggingface_hub
from langchain.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    # model_kwargs = {"device" : device}
)



load INSTRUCTOR_Transformer




max_seq_length  512


### 2.4 Vector Stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.
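The store-then-search idea can be sketched with a tiny in-memory class. This is a brute-force illustration under stated assumptions: `ToyVectorStore` is a hypothetical name, the vectors are hand-written rather than produced by an embedding model, and Euclidean distance is used as the similarity measure. A library such as FAISS does the same job far more efficiently.

```python
import math


class ToyVectorStore:
    """Hypothetical brute-force vector store: keeps vectors in a list and scans all of them."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    def add(self, vector, text):
        """Store one embedding vector together with the text it came from."""
        self._vectors.append(vector)
        self._texts.append(text)

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors are closest (Euclidean) to the query."""
        ranked = sorted(
            zip(self._vectors, self._texts),
            key=lambda item: math.dist(item[0], query_vector),
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "resume chunk")
store.add([0.0, 1.0], "essay chunk")
store.add([0.9, 0.1], "linkedin chunk")

results = store.search([1.0, 0.0], k=2)
print(results)
```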

In [26]:
#locate vectorstore
vector_path = './vector-store'
if not os.path.exists(vector_path):
    os.makedirs(vector_path)
    print('create path done')

In [50]:
#save vector locally
from langchain.vectorstores import FAISS

vectordb = FAISS.from_documents(
    documents = doc,
    embedding = embedding_model
)

db_file_name = 'a6'

vectordb.save_local(
    folder_path = os.path.join(vector_path, db_file_name),
    index_name = 'a6'  # custom index name (the default is 'index')
)

### 2.5 Retrievers
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
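To illustrate that a retriever only needs to return documents for a query, here is a sketch of one that uses no embeddings and no vector store at all. It ranks texts by plain word overlap; the function name `keyword_retriever` and the sample texts are hypothetical.

```python
def keyword_retriever(query, documents, k=1):
    """Rank documents by how many query words they share (no embeddings, no storage)."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().split()))

    # sorted() is stable, so documents with equal overlap keep their original order.
    return sorted(documents, key=overlap, reverse=True)[:k]


docs = [
    "My name is Kaung Nyo Lwin",
    "I study data science at AIT",
    "I enjoy self-supervised learning",
]
top = keyword_retriever("what is your name", docs)
print(top)
```

A vector-store retriever follows the same query-in, documents-out contract, but ranks by embedding similarity instead of shared words.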

In [51]:
#calling vector from local
vector_path = './vector-store'
db_file_name = 'a6'

from langchain.vectorstores import FAISS

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'a6',  # must match the index name used in save_local
    allow_dangerous_deserialization=True
)   

In [52]:
#ready to use
retriever = vectordb.as_retriever()

In [53]:
retriever.get_relevant_documents("What is your name?")

[Document(id='22d02c91-278c-4a50-ae04-2993a2d87322', metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-03-16T20:10:35+07:00', 'source': 'documents/Essay.pdf', 'file_path': 'documents/Essay.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': 'User', 'subject': '', 'keywords': '', 'moddate': '2025-03-16T20:10:35+07:00', 'trapped': '', 'modDate': "D:20250316201035+07'00'", 'creationDate': "D:20250316201035+07'00'", 'page': 0}, page_content="My Name is Kaung Nyo Lwin and I was born in 1994. I am studying for a master's degree in data science \nand artificial intelligence at the Asian Institute of Technology(AIT). My studies at AIT made me progress \nin almost different aspects except that I am still struggling with time management, which is my greatest \nchallenge. I am interested in self-supervised learning, especially for vision. Moreover, I want to be a \ncompetent data scientist after graduation. \nAs a person who has

In [54]:
retriever.get_relevant_documents("How old are you?")

[Document(id='22d02c91-278c-4a50-ae04-2993a2d87322', metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-03-16T20:10:35+07:00', 'source': 'documents/Essay.pdf', 'file_path': 'documents/Essay.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': 'User', 'subject': '', 'keywords': '', 'moddate': '2025-03-16T20:10:35+07:00', 'trapped': '', 'modDate': "D:20250316201035+07'00'", 'creationDate': "D:20250316201035+07'00'", 'page': 0}, page_content="My Name is Kaung Nyo Lwin and I was born in 1994. I am studying for a master's degree in data science \nand artificial intelligence at the Asian Institute of Technology(AIT). My studies at AIT made me progress \nin almost different aspects except that I am still struggling with time management, which is my greatest \nchallenge. I am interested in self-supervised learning, especially for vision. Moreover, I want to be a \ncompetent data scientist after graduation. \nAs a person who has

## 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the ChatMessageHistory class. This is a super lightweight wrapper that provides convenience methods for saving HumanMessages, AIMessages, and then fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.


In [33]:
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history

InMemoryChatMessageHistory(messages=[])

In [34]:
history.add_user_message('hi')
history.add_ai_message('Whats up?')
history.add_user_message('How are you')
history.add_ai_message('I\'m quite good. How about you?')

In [35]:
history

InMemoryChatMessageHistory(messages=[HumanMessage(content='hi', additional_kwargs={}, response_metadata={}), AIMessage(content='Whats up?', additional_kwargs={}, response_metadata={}), HumanMessage(content='How are you', additional_kwargs={}, response_metadata={}), AIMessage(content="I'm quite good. How about you?", additional_kwargs={}, response_metadata={})])

### 3.1 Memory types

There are many different types of memory. Each has its own parameters and return types, and each is useful in different scenarios. This notebook covers two:
- Conversation Buffer
- Conversation Buffer Window

What variables get returned from memory

Before going into the chain, various variables are read from memory. These have specific names which need to align with the variables the chain expects. You can see what these variables are by calling memory.load_memory_variables({}). Note that the empty dictionary that we pass in is just a placeholder for real variables. If the memory type you are using is dependent upon the input variables, you may need to pass some in.

In the examples in this section, `load_memory_variables` returns a single key, `history`. This means that your chain (and likely your prompt) should expect an input named `history`. You can usually control this variable name through parameters on the memory class. For example, to have the memory variables returned under the key `chat_history`, pass `memory_key="chat_history"` when constructing the memory.
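The effect of renaming the returned key can be sketched without LangChain. The class below is a hypothetical stand-in (`ToyBufferMemory` is not a LangChain class) that mimics the `save_context` / `load_memory_variables` pattern and takes a `memory_key` argument, the same parameter name this notebook passes to its memory object later on.

```python
class ToyBufferMemory:
    """Hypothetical stand-in for a buffer memory; not LangChain's class."""

    def __init__(self, memory_key="history"):
        self.memory_key = memory_key
        self.turns = []  # list of (human, ai) message pairs

    def save_context(self, inputs, outputs):
        """Record one exchange between the user and the model."""
        self.turns.append((inputs["input"], outputs["output"]))

    def load_memory_variables(self, _inputs):
        """Return the whole conversation as one string under memory_key."""
        lines = []
        for human, ai in self.turns:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        return {self.memory_key: "\n".join(lines)}


memory_sketch = ToyBufferMemory(memory_key="chat_history")
memory_sketch.save_context({"input": "hi"}, {"output": "What's up?"})
variables = memory_sketch.load_memory_variables({})
print(variables)
```

Whatever key the memory returns must match a placeholder in the prompt that consumes it, otherwise the chain cannot fill the template.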

#### Conversation Buffer
This memory allows for storing messages and then extracts the messages in a variable.

In [36]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})



{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

In [24]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi'),
  AIMessage(content="What's up?"),
  HumanMessage(content='How are you?'),
  AIMessage(content="I'm quite good. How about you?")]}

#### Conversation Buffer Window
- it keeps a list of the interactions of the conversation over time. 
- it only uses the last K interactions. 
- it can be useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.

In [25]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=1)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: How are you?\nAI: I'm quite good. How about you?"}
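The sliding-window behaviour can be reproduced with a bounded `collections.deque`. This is only a sketch of the idea, not how LangChain implements it; the variable names are hypothetical.

```python
from collections import deque

k = 1
# With maxlen=k, appending a new turn silently drops the oldest one.
window = deque(maxlen=k)
window.append(("hi", "What's up?"))
window.append(("How are you?", "I'm quite good. How about you?"))

history_text = "\n".join(f"Human: {human}\nAI: {ai}" for human, ai in window)
print(history_text)
```

Only the last `k` exchanges survive, which keeps the prompt length bounded no matter how long the conversation runs.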

## 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- It consists of a `PromptTemplate` and a language model (either an LLM or a chat model).
- It formats the prompt template using the input key values provided (and also memory key values, if available).
- It passes the formatted string to the LLM and returns the LLM output.
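The three steps above can be sketched as a plain function. Everything here is hypothetical: `fake_llm` stands in for a real model call and simply reports how much text it received, and `run_chain` is not a LangChain function.

```python
def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it just reports the prompt size."""
    return f"(model saw {len(prompt)} characters)"


def run_chain(template: str, llm, **inputs) -> str:
    """Format the template with the inputs, call the model, return its output."""
    prompt = template.format(**inputs)  # step 1: fill the template
    return llm(prompt)                  # steps 2 and 3: call the model, return the text


result = run_chain("Question: {question}\nAnswer:", fake_llm, question="What is RAG?")
print(result)
```

Because the model is passed in as an ordinary callable, the same chain logic works with any backend that maps a prompt string to a response string.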

Note: I use ChatGroq as the generator model because it gives better responses and the model cannot be loaded locally.

In [3]:
import getpass
import os

# Never hard-code API keys in a notebook; read the key from the environment
# or prompt for it at runtime.
if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")

In [5]:
# Requires the langchain-groq package, which is not in the install cell above.
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

### [Class ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/_modules/langchain/chains/conversational_retrieval/base.html#ConversationalRetrievalChain)

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator`: The chain used to generate a new question for the sake of retrieval. This chain will take in the current question (with variable question) and any chat history (with variable chat_history) and will produce a new standalone question to be used later on.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function to get a string of the chat history. If None is provided, will use a default.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.
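How these pieces fit together can be sketched in a few plain functions. This is an illustration of the flow, not LangChain's code: all function names are hypothetical, the model and retriever are fakes, and the prompt wording is shortened. It assumes the question is passed through unchanged when there is no chat history yet, and that retrieved chunks are joined with blank lines before being placed in the prompt.

```python
def condense_question(chat_history, question, llm):
    """question_generator: turn a follow-up into a standalone question."""
    if not chat_history:
        return question  # nothing to condense on the first turn
    return llm(
        f"Chat History:\n{chat_history}\n"
        f"Follow Up Input: {question}\nStandalone question:"
    )


def stuff_documents(docs, question):
    """combine_docs_chain ('stuff'): paste all retrieved chunks into one prompt."""
    context = "\n\n".join(docs)
    return f"{context}\nQuestion: {question}\nAnswer:"


def conversational_retrieval(question, chat_history, retrieve, llm):
    standalone = condense_question(chat_history, question, llm)
    docs = retrieve(standalone)                       # retriever
    answer = llm(stuff_documents(docs, standalone))   # answer from the stuffed prompt
    return {"question": question, "answer": answer, "source_documents": docs}


# Fakes so that the sketch runs without any model or vector store.
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"reply {len(calls)}"


def fake_retrieve(query):
    return ["chunk A", "chunk B"]


first = conversational_retrieval("Who are you?", "", fake_retrieve, fake_llm)
second = conversational_retrieval(
    "And your age?", "Human: Who are you?\nAI: reply 1", fake_retrieve, fake_llm
)
print(first["answer"], second["answer"], len(calls))
```

The first turn costs one model call and every later turn costs two, because the follow-up question is rewritten before retrieval.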


`question_generator`

In [7]:
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain

In [58]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], input_types={}, partial_variables={}, template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')

In [59]:
question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

In [60]:
query = 'Tell me the name and age'
chat_history = "Human:What is your name\nAI:\nHuman:How old are you\nAI:"

question_generator({'chat_history' : chat_history, "question" : query})



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
Human:What is your name
AI:
Human:How old are you
AI:
Follow Up Input: Tell me the name and age
Standalone question:

> Finished chain.


{'chat_history': 'Human:What is your name\nAI:\nHuman:How old are you\nAI:',
 'question': 'Tell me the name and age',
 'text': 'Tell me the name and age.'}

`combine_docs_chain`

In [61]:
doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)
doc_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='I know about Kaung Nyo Lwin. You can ask me about him.\n    {context}\n    Question: {question}\n    Answer:'), llm=ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7f34662b3d30>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7f34662e4d60>, model_name='llama-3.1-8b-instant', temperature=1e-08, model_kwargs={}, groq_api_key=SecretStr('**********')), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='context')

In [62]:
query = "What is your highest level of education"
input_document = retriever.get_relevant_documents(query)

doc_chain({'input_documents':input_document, 'question':query})



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I know about Kaung Nyo Lwin. You can ask me about him.
    My Name is Kaung Nyo Lwin and I was born in 1994. I am studying for a master's degree in data science 
and artificial intelligence at the Asian Institute of Technology(AIT). My studies at AIT made me progress 
in almost different aspects except that I am still struggling with time management, which is my greatest 
challenge. I am interested in self-supervised learning, especially for vision. Moreover, I want to be a 
competent data scientist after graduation. 
As a person who has a passion for technological advancements, especially AI, the question of cultural 
values vs advancements is always dazzled in my head. In my opinion, advancements in civilization in

Master of Engineering - MEng, Data Science and Artificial Intelligence
 · (August 2024 - May 2026)
Myanmar Aerospace Engineering Universi

{'input_documents': [Document(id='22d02c91-278c-4a50-ae04-2993a2d87322', metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-03-16T20:10:35+07:00', 'source': 'documents/Essay.pdf', 'file_path': 'documents/Essay.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': 'User', 'subject': '', 'keywords': '', 'moddate': '2025-03-16T20:10:35+07:00', 'trapped': '', 'modDate': "D:20250316201035+07'00'", 'creationDate': "D:20250316201035+07'00'", 'page': 0}, page_content="My Name is Kaung Nyo Lwin and I was born in 1994. I am studying for a master's degree in data science \nand artificial intelligence at the Asian Institute of Technology(AIT). My studies at AIT made me progress \nin almost different aspects except that I am still struggling with time management, which is my greatest \nchallenge. I am interested in self-supervised learning, especially for vision. Moreover, I want to be a \ncompetent data scientist after graduation. \

In [63]:
memory = ConversationBufferWindowMemory(
    k=3, 
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)
chain



ConversationalRetrievalChain(memory=ConversationBufferWindowMemory(chat_memory=InMemoryChatMessageHistory(messages=[]), output_key='answer', return_messages=True, memory_key='chat_history', k=3), verbose=True, combine_docs_chain=StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='I know about Kaung Nyo Lwin. You can ask me about him.\n    {context}\n    Question: {question}\n    Answer:'), llm=ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7f34662b3d30>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7f34662e4d60>, model_name='llama-3.1-8b-instant', temperature=1e-08, model_kwargs={}, groq_api_key=SecretStr('**********')), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_

## 5. Chatbot

In [64]:
prompt_question = "Who are you by the way?"
answer = chain({"question":prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I know about Kaung Nyo Lwin. You can ask me about him.
    My Name is Kaung Nyo Lwin and I was born in 1994. I am studying for a master's degree in data science 
and artificial intelligence at the Asian Institute of Technology(AIT). My studies at AIT made me progress 
in almost different aspects except that I am still struggling with time management, which is my greatest 
challenge. I am interested in self-supervised learning, especially for vision. Moreover, I want to be a 
competent data scientist after graduation. 
As a person who has a passion for technological advancements, especially AI, the question of cultural 
values vs advancements is always dazzled in my head. In my opinion, advancements in civilization in

Master of Engineering - MEng, Data Science and Artificial Intelligence
 · 

{'question': 'Who are you by the way?',
 'chat_history': [],
 'answer': "You are Kaung Nyo Lwin, a data science and artificial intelligence student at the Asian Institute of Technology (AIT), pursuing a Master's degree.",
 'source_documents': [Document(id='22d02c91-278c-4a50-ae04-2993a2d87322', metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-03-16T20:10:35+07:00', 'source': 'documents/Essay.pdf', 'file_path': 'documents/Essay.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': 'User', 'subject': '', 'keywords': '', 'moddate': '2025-03-16T20:10:35+07:00', 'trapped': '', 'modDate': "D:20250316201035+07'00'", 'creationDate': "D:20250316201035+07'00'", 'page': 0}, page_content="My Name is Kaung Nyo Lwin and I was born in 1994. I am studying for a master's degree in data science \nand artificial intelligence at the Asian Institute of Technology(AIT). My studies at AIT made me progress \nin almost different aspects except

In [66]:
answer['answer']

"You are Kaung Nyo Lwin, a data science and artificial intelligence student at the Asian Institute of Technology (AIT), pursuing a Master's degree."

In [77]:
questions = [
    "How old are you?",
    "What is your highest level of education?",
    "What major or field of study did you pursue during your education?",
    "How many years of work experience do you have?",
    "What type of work or industry have you been involved in?",
    "Can you describe your current role or job responsibilities?",
    "What are your core beliefs regarding the role of technology in shaping society?",
    "How do you think cultural values should influence technological advancements?",
    "As a master’s student, what is the most challenging aspect of your studies so far?",
    "What specific research interests or academic goals do you hope to achieve during your time as a master’s student?"
]

In [78]:
qa_pairs = []
for prompt_question in questions:
    pairs = {}
    answer = chain({"question":prompt_question})
    pairs["question"] = prompt_question
    pairs["answer"] = answer['answer']
    qa_pairs.append(pairs)



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='As a master’s student, what is the most challenging aspect of your studies so far?', additional_kwargs={}, response_metadata={}), AIMessage(content="According to the information provided, the most challenging aspect of your studies as a master's student so far is time management.", additional_kwargs={}, response_metadata={}), HumanMessage(content='What specific research interests or academic goals do you hope to achieve during your time as a master’s student?', additional_kwargs={}, response_metadata={}), AIMessage(content="Based on the information provided, it seems that you have a strong interest in self-supervised learning, particularly in the field of computer

In [79]:
qa_pairs

[{'question': 'How old are you?',
  'answer': 'You were born in 1994. To find your age, I need to know the current year. Since the knowledge cutoff is 2023, I will assume the current year is 2024. \n\nIf the current year is 2024, you would be 30 years old.'},
 {'question': 'What is your highest level of education?',
  'answer': 'Based on the information provided, your highest level of education is a Master of Engineering (MEng) in Data Science and Artificial Intelligence, which you are currently pursuing at the Asian Institute of Technology (AIT) and expected to complete in May 2026.'},
 {'question': 'What major or field of study did you pursue during your education?',
  'answer': "Based on the information provided, it seems that you have pursued the following fields of study:\n\n1. Avionics (Bachelor of Engineering) at Myanmar Aerospace Engineering University (2010-2015)\n2. Data Science and Artificial Intelligence (Master of Engineering) at the Asian Institute of Technology (AIT), ex

In [80]:
len(qa_pairs)

10

## 6. Discussion

For RAG, I use two models:
- instructor-base (hkunlp/instructor-base) from Hugging Face as the embedding model for the retriever
- llama-3.1-8b-instant, accessed through ChatGroq, as the generator model

Although the retriever uses only the base-size Instructor model, it performs well enough to retrieve the relevant information about me. When I change the question, the top retrieved documents change accordingly, and they contain the most likely answers.

For text generation, I use llama-3.1-8b-instant through ChatGroq. The model's performance is impressive, and it generates well-written answers. However, the answers are typically long, although descriptive. I also found that the bot adds a question directed at me in its answer. More direct answers might be better.