# Retrieval-Augmented Generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional, often private or real-time, data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data available up to their training cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the model's knowledge with the specific information it needs.

This notebook introduces `MookBot`, a personalized chatbot created to answer questions about Mook. It is built with LangChain and retrieves relevant content from a resume and two short text files. The notebook covers five parts:

1. Prompt
2. Retrieval
3. Memory
4. Chain
5. Chatbot

In [1]:
# #langchain library
# !pip install langchain==0.1.13
# !pip install langchain-community==0.0.38

In [2]:
# #LLM
# !pip install accelerate==0.26.0
# !pip install transformers==4.45.0
# !pip install bitsandbytes==0.41.3

In [3]:
# #Text Embedding
# !pip install sentence-transformers==2.2.2
# !pip install InstructorEmbedding==1.0.1

In [4]:
# #vectorstore
# !pip install pymupdf==1.23.8
# !pip install faiss-gpu
# !pip install faiss-cpu

In [5]:
# # Hugging Face Hub (Compatible with InstructorEmbedding)
# !pip install huggingface_hub==0.23.3

In [6]:
# # Other dependencies
# !pip install torch==2.2.0
# !pip install torchvision==0.17.0
# !pip install nltk 
# !pip install scikit-learn
# !pip install tiktoken

In [7]:
import os
import torch
# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## 1. Prompt

A prompt is a set of instructions or input provided by a user to guide the model's response. It helps the model understand the context and generate relevant, coherent output, such as answering questions, completing sentences, or engaging in a conversation.

In [8]:
from langchain.prompts import PromptTemplate

prompt_template = """
    I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. 
    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, 
    and I'll provide answers to help you learn more about me. 
    Just let me know what you're wondering about, and I'll do my best to share.
    {context}
    Question: {question}
    Answer:
    """.strip()

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

PROMPT
# The template is filled in str.format style.
# Placeholders are written with curly brackets: {context} and {question}

PromptTemplate(input_variables=['context', 'question'], template="I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. \n    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, \n    and I'll provide answers to help you learn more about me. \n    Just let me know what you're wondering about, and I'll do my best to share.\n    {context}\n    Question: {question}\n    Answer:")

In [9]:
PROMPT.format(
    context="I am currently pursuing a Master's degree in Data Science and AI. Before enrolling in the program, I gained 4 years of experience as a Data Analyst.",
    question="What is your highest level of education?"
)

"I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. \n    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, \n    and I'll provide answers to help you learn more about me. \n    Just let me know what you're wondering about, and I'll do my best to share.\n    I am currently pursuing a Master's degree in Data Science and AI. Before enrolling in the program, I gained 4 years of experience as a Data Analyst.\n    Question: What is your highest level of education?\n    Answer:"

In [10]:
with open("prompt_template.txt", "w") as file:
    file.write(prompt_template)
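
As the comments in the prompt cell note, the template is filled in `str.format` style. A minimal sketch of the same substitution with a plain Python string, assuming a shortened, hypothetical template (`toy_template`) rather than the full MookBot prompt:

```python
# Shortened, hypothetical template; the real one is `prompt_template` above.
toy_template = "{context}\nQuestion: {question}\nAnswer:"

# Each {placeholder} is replaced by the keyword argument with the same name.
filled = toy_template.format(
    context="I gained 4 years of experience as a Data Analyst.",
    question="What type of work have you done?",
)
print(filled)
```

Leaving out a placeholder (for example, omitting `question`) raises a `KeyError`, which is one reason to keep the variable names in the template and the chain aligned.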

Note: [How to improve prompting (Zero-shot, Few-shot, Chain-of-Thought, etc.)](https://github.com/chaklam-silpasuwanchai/Natural-Language-Processing/blob/main/Code/05%20-%20RAG/advance/cot-tot-prompting.ipynb)

## 2. Retrieval

1. `Document loaders`: Load documents from many different sources (HTML, PDF, code).
2. `Document transformers`: Break large documents into smaller, relevant chunks to improve retrieval.
3. `Text embedding models`: Create embeddings that capture the semantic meaning of text, so you can quickly and efficiently find other pieces of text that are similar.
4. `Vector stores`: Databases that support efficient storage and search of these embeddings.
5. `Retrievers`: Once the data is in the database, you still need to retrieve it.
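
The five components fit together roughly as follows. This is a toy, dependency-free sketch: the word-count `embed` function and every other name in it are illustrative assumptions, not LangChain APIs. A real pipeline would use the loaders, embedding model, and FAISS store shown in the sections below.

```python
import math
from collections import Counter

# 1. "Load" documents (hard-coded here instead of being read from files).
raw_docs = [
    "I studied Actuarial Science at Mahidol University.",
    "I worked as a Data Analyst in Bangkok.",
]

# 2. Transform: these documents are short, so each one is a single chunk.
chunks = raw_docs

# 3. Embed: a toy bag-of-words vector (a real system uses a neural model).
def embed(text):
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 4. Store: keep (vector, chunk) pairs in a list.
store = [(embed(c), c) for c in chunks]

# 5. Retrieve: return the chunk most similar to the query.
def retrieve(query):
    q = embed(query)
    return max(store, key=lambda pair: cosine(q, pair[0]))[1]

print(retrieve("What did you study at university?"))
```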

### 2.1 Document Loaders 
Use document loaders to load data from a source as `Document` objects. A `Document` is a piece of text and associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.
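
To make the "text plus metadata" idea concrete, here is a minimal sketch using only the standard library. `ToyDocument` and `load_text_file` are hypothetical stand-ins written for this illustration; they are not LangChain's `Document` class or any of its loaders.

```python
import os
import tempfile
from dataclasses import dataclass, field

@dataclass
class ToyDocument:
    """Simplified stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path):
    """Read one .txt file and wrap it in a ToyDocument that records its source."""
    with open(path, "r", encoding="utf-8") as f:
        return ToyDocument(page_content=f.read(), metadata={"source": os.path.basename(path)})

# Demo with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Technology has a big impact on society.")
    toy_doc = load_text_file(path)

print(toy_doc.metadata, toy_doc.page_content)
```

Keeping the source in `metadata` is what later lets a chain report which file an answer came from.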

[PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)

[Download Document](https://web.stanford.edu/~jurafsky/slp3/)

In [11]:
from langchain.document_loaders import PyMuPDFLoader

resume_path = "docs/pdf/Prapatsorn_Resume.pdf"
challenges_and_goals_path = "docs/txt/Challenges-and-Goals.txt"
tech_and_society_path = "docs/txt/Technology-and-Society.txt"

In [12]:
# Load documents
loader = PyMuPDFLoader(resume_path)
documents = loader.load()

with open(challenges_and_goals_path, "r") as file:
    challenges_and_goals = file.read()
with open(tech_and_society_path, "r") as file:
    tech_and_society = file.read()

In [13]:
# Combine all documents into a single list
documents.append({"page_content": challenges_and_goals, "metadata": {"source": "Challenges-and-Goals.txt"}})
documents.append({"page_content": tech_and_society, "metadata": {"source": "Technology-and-Society.txt"}})

In [14]:
# documents

In [15]:
len(documents)

3

In [16]:
print(type(documents))
print(type(documents[0]))
print(type(documents[1]))
print(type(documents[2]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>
<class 'dict'>
<class 'dict'>


In [17]:
# Convert dictionaries to Document objects
from langchain_core.documents import Document

for i, doc in enumerate(documents):
    if isinstance(doc, dict):
        documents[i] = Document(page_content=doc["page_content"], metadata=doc.get("metadata", {}))

In [18]:
print(type(documents))
print(type(documents[0]))
print(type(documents[1]))
print(type(documents[2]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>
<class 'langchain_core.documents.base.Document'>
<class 'langchain_core.documents.base.Document'>


In [19]:
documents[0]

Document(page_content="Prapatsorn Alongkornpradub \n \nCORE COMPETENCIES AND SKILLS \n \n● Technical Skills: \n○ Microsoft Office Suite (Excel, Word, PowerPoint), G Suite, Google Data Studio \n○ Beginner in Tableau, Power BI, Google Analytics, Microsoft Excel VBA \n● Language Skills: \n○ Thai (Native), English (IELTS: 6.5/9.0) \n● Key Strengths: \n○ Analytical, logical, and problem-solving skills \n○ Fast learner, open-minded, detail-oriented, and teamwork with a “can do” attitude \n○ Experience working under time pressure and cultural diversity \nEDUCATION \nAsian Institute of Technology (AIT), Pathum Thani, Thailand \n \nMaster of Science in Data Science and Artificial Intelligence \n \n \n• Expected Graduation: May 2025 \nMahidol University, Bangkok, Thailand \n \nBachelor of Science in Actuarial Science (International Program) \n \n \n• Graduation: December 2020 \nWORK EXPERIENCE \nData Analyst \n \n \n \n \n \n \n December 2023 – July 2024 \n \nThairath Group (Bangkok, Thailand) \

In [20]:
documents[1]

Document(page_content="The most challenging part of being a master’s student has been the transition from my undergraduate studies in Actuarial Science to my current master's program in Data Science and AI. Since my background isn't directly related to data science, I sometimes find it difficult to fully grasp some of the more technical concepts. I am working hard to bridge the gap in my knowledge, but it has been challenging to keep up with the advanced topics.\n\nAs for my research interests, I honestly haven't decided on a specific topic yet. However, my main goal during my master's studies is to build a strong understanding of both the theory and the practical skills required in data science and AI. I hope this will help me feel confident in applying my knowledge and eventually secure a job as a data scientist.", metadata={'source': 'Challenges-and-Goals.txt'})

In [21]:
documents[2]

Document(page_content='I believe technology has a big impact on society. It can improve things like education, healthcare, and communication, and it helps make life easier and more efficient. But technology also brings problems, such as unequal access, privacy issues, and ethical concerns. We need to use technology carefully to make sure it helps people without causing harm.\n\nCultural values should be important when creating new technologies. As technology becomes part of everyday life, it should reflect the beliefs and needs of different cultures. Technology should be fair, respect privacy, and help create a more equal society. By considering cultural values, we can make sure that new technologies benefit everyone and are used in a way that is good for the world.', metadata={'source': 'Technology-and-Society.txt'})

### 2.2 Document Transformers

The `RecursiveCharacterTextSplitter` is the recommended splitter for generic text. It is parameterized by a list of separator characters and tries to split on them in order until the chunks are small enough.

**Note for small documents:**
Documents that are already small (e.g., a single page with a few words) do not need chunking; splitting is mainly useful for larger documents. The three documents here are short, so the splitter cell below is commented out and each document is used whole.
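
The "split on separators in order until the chunks are small enough" idea can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it drops empty pieces between repeated separators.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces no longer than chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Split on the coarsest separator, recursing with finer ones where needed.
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, tuple(rest)))
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "alpha beta gamma\n\ndelta epsilon"
toy_chunks = recursive_split(sample, chunk_size=12)
print(toy_chunks)
```

Paragraph breaks are tried first, then line breaks, then spaces, so chunks tend to end at natural boundaries rather than mid-word.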

In [22]:
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size = 200,
#     chunk_overlap = 2
# )

# doc = text_splitter.split_documents(documents)
# doc[0]
# len(doc)

In [23]:
doc = documents

In [24]:
len(doc)

3

In [25]:
doc

[Document(page_content="Prapatsorn Alongkornpradub \n \nCORE COMPETENCIES AND SKILLS \n \n● Technical Skills: \n○ Microsoft Office Suite (Excel, Word, PowerPoint), G Suite, Google Data Studio \n○ Beginner in Tableau, Power BI, Google Analytics, Microsoft Excel VBA \n● Language Skills: \n○ Thai (Native), English (IELTS: 6.5/9.0) \n● Key Strengths: \n○ Analytical, logical, and problem-solving skills \n○ Fast learner, open-minded, detail-oriented, and teamwork with a “can do” attitude \n○ Experience working under time pressure and cultural diversity \nEDUCATION \nAsian Institute of Technology (AIT), Pathum Thani, Thailand \n \nMaster of Science in Data Science and Artificial Intelligence \n \n \n• Expected Graduation: May 2025 \nMahidol University, Bangkok, Thailand \n \nBachelor of Science in Actuarial Science (International Program) \n \n \n• Graduation: December 2020 \nWORK EXPERIENCE \nData Analyst \n \n \n \n \n \n \n December 2023 – July 2024 \n \nThairath Group (Bangkok, Thailand) 

### 2.3 Text Embedding Models
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
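
"Most similar in the vector space" is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce much longer vectors, and the numbers here are assumptions rather than output from the Instructor model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings" for two pieces of text.
toy_vectors = {
    "my master's degree": [0.9, 0.1, 0.0],
    "my work as an analyst": [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "What did you study?"

# Semantic search: pick the text whose vector points in the most similar direction.
best_match = max(toy_vectors, key=lambda text: cosine_similarity(query_vector, toy_vectors[text]))
print(best_match)
```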

*Note* Instructor Model: [Huggingface](https://huggingface.co/hkunlp/instructor-base) | [Paper](https://arxiv.org/abs/2212.09741)

In [26]:
import torch
from langchain.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    model_kwargs = {"device" : device}
)

load INSTRUCTOR_Transformer
max_seq_length  512




### 2.4 Vector Stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.
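
A minimal sketch of what "storing embedded data and performing vector search" means, assuming a brute-force scan with Euclidean distance. `ToyVectorStore` is a hypothetical class for illustration only; libraries such as FAISS are designed to do this lookup efficiently at scale.

```python
import math

class ToyVectorStore:
    """Brute-force store: keeps every vector and scans them all at query time."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    def add(self, vector, text):
        self._vectors.append(list(vector))
        self._texts.append(text)

    def search(self, query_vector, k=1):
        """Return the k stored texts closest to query_vector (Euclidean distance)."""
        scored = sorted(
            zip(self._vectors, self._texts),
            key=lambda pair: math.dist(query_vector, pair[0]),
        )
        return [text for _, text in scored[:k]]

# Made-up two-dimensional vectors standing in for real embeddings.
toy_store = ToyVectorStore()
toy_store.add([1.0, 0.0], "resume")
toy_store.add([0.0, 1.0], "challenges and goals")
toy_store.add([0.7, 0.7], "technology and society")

print(toy_store.search([0.9, 0.1], k=2))
```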

In [27]:
#locate vectorstore
vector_path = 'vector-store'
if not os.path.exists(vector_path):
    os.makedirs(vector_path)
    print('create path done')

In [28]:
#save vector locally
from langchain.vectorstores import FAISS

vectordb = FAISS.from_documents(
    documents = doc,
    embedding = embedding_model
)

db_file_name = 'personal_info'

vectordb.save_local(
    folder_path = os.path.join(vector_path, db_file_name),
    index_name = 'personal' # custom index name (the library default is 'index')
)

### 2.5 Retrievers
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
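
The point that a retriever only needs to return documents, not store vectors, can be illustrated with a keyword-based retriever. Everything below (`ToyRetriever`, `KeywordRetriever`, `answer_sources`) is a hypothetical sketch; only the method name mirrors the `get_relevant_documents` call used later in this notebook.

```python
from typing import List, Protocol

class ToyRetriever(Protocol):
    """Anything with this method counts as a retriever in this sketch."""
    def get_relevant_documents(self, query: str) -> List[str]: ...

class KeywordRetriever:
    """Returns texts sharing at least one word with the query; no vectors involved."""

    def __init__(self, texts: List[str]):
        self.texts = texts

    def get_relevant_documents(self, query: str) -> List[str]:
        query_words = set(query.lower().split())
        return [t for t in self.texts if query_words & set(t.lower().split())]

def answer_sources(retriever: ToyRetriever, query: str) -> List[str]:
    # The caller depends only on the interface, not on how documents are found.
    return retriever.get_relevant_documents(query)

keyword_retriever = KeywordRetriever(
    ["thai and english language skills", "data analyst experience"]
)
print(answer_sources(keyword_retriever, "which language do you speak"))
```

Because the caller only relies on the method, a vector-store-backed retriever could be swapped in without changing `answer_sources`.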

In [29]:
#calling vector from local
vector_path = 'vector-store'
db_file_name = 'personal_info'

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'personal', # must match the name used in save_local
    allow_dangerous_deserialization=True  # Allow loading pickle file
)   

In [30]:
vectordb

<langchain_community.vectorstores.faiss.FAISS at 0x7d7ffef56870>

In [31]:
#ready to use
retriever = vectordb.as_retriever()

In [32]:
retriever.get_relevant_documents("What programming languages are you proficient in?")

[Document(page_content="Prapatsorn Alongkornpradub \n \nCORE COMPETENCIES AND SKILLS \n \n● Technical Skills: \n○ Microsoft Office Suite (Excel, Word, PowerPoint), G Suite, Google Data Studio \n○ Beginner in Tableau, Power BI, Google Analytics, Microsoft Excel VBA \n● Language Skills: \n○ Thai (Native), English (IELTS: 6.5/9.0) \n● Key Strengths: \n○ Analytical, logical, and problem-solving skills \n○ Fast learner, open-minded, detail-oriented, and teamwork with a “can do” attitude \n○ Experience working under time pressure and cultural diversity \nEDUCATION \nAsian Institute of Technology (AIT), Pathum Thani, Thailand \n \nMaster of Science in Data Science and Artificial Intelligence \n \n \n• Expected Graduation: May 2025 \nMahidol University, Bangkok, Thailand \n \nBachelor of Science in Actuarial Science (International Program) \n \n \n• Graduation: December 2020 \nWORK EXPERIENCE \nData Analyst \n \n \n \n \n \n \n December 2023 – July 2024 \n \nThairath Group (Bangkok, Thailand) 

## 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the ChatMessageHistory class. This is a super lightweight wrapper that provides convenience methods for saving HumanMessages, AIMessages, and then fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.


In [33]:
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history

InMemoryChatMessageHistory(messages=[])

In [34]:
history.add_user_message('hi')
history.add_ai_message('Whats up?')
history.add_user_message('How are you')
history.add_ai_message('I\'m quite good. How about you?')

In [35]:
history

InMemoryChatMessageHistory(messages=[HumanMessage(content='hi'), AIMessage(content='Whats up?'), HumanMessage(content='How are you'), AIMessage(content="I'm quite good. How about you?")])

### 3.1 Memory types

There are many different types of memory. Each has its own parameters and return types, and each is useful in different scenarios. This notebook covers two:
- Conversation Buffer
- Conversation Buffer Window

**What variables get returned from memory**

Before going into the chain, various variables are read from memory. These have specific names which need to align with the variables the chain expects. You can see what these variables are by calling `memory.load_memory_variables({})`. Note that the empty dictionary that we pass in is just a placeholder for real variables. If the memory type you are using is dependent upon the input variables, you may need to pass some in.

In the examples below, `load_memory_variables` returns a single key, `history`. This means that your chain (and likely your prompt) should expect an input named `history`. You can usually control this variable through parameters on the memory class. For example, to have the memory variables returned under the key `chat_history`, pass `memory_key="chat_history"` when creating the memory, as the chain in Section 4 does.

#### Conversation Buffer
This memory allows for storing messages and then extracts the messages in a variable.

In [36]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

In [37]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi'),
  AIMessage(content="What's up?"),
  HumanMessage(content='How are you?'),
  AIMessage(content="I'm quite good. How about you?")]}

#### Conversation Buffer Window
- it keeps a list of the interactions of the conversation over time. 
- it only uses the last K interactions. 
- it can be useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.
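
The sliding-window behaviour can be sketched with a bounded `deque`. `ToyWindowMemory` is a hypothetical stand-in written for this illustration, not the LangChain class; its `memory_key` parameter only mimics the idea of choosing the key under which the history is returned.

```python
from collections import deque

class ToyWindowMemory:
    """Keeps only the last k (human, ai) exchanges."""

    def __init__(self, k=1, memory_key="history"):
        self.memory_key = memory_key
        self._turns = deque(maxlen=k)  # older exchanges fall off automatically

    def save_context(self, human, ai):
        self._turns.append((human, ai))

    def load_memory_variables(self):
        lines = []
        for human, ai in self._turns:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        return {self.memory_key: "\n".join(lines)}

toy_memory = ToyWindowMemory(k=1, memory_key="chat_history")
toy_memory.save_context("hi", "What's up?")
toy_memory.save_context("How are you?", "I'm quite good. How about you?")
print(toy_memory.load_memory_variables())
```

With `k=1`, the first exchange has already been dropped by the time the variables are loaded.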

In [38]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=1)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: How are you?\nAI: I'm quite good. How about you?"}

## 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- It consists of a `PromptTemplate` and a language model (either an LLM or a chat model).
- It formats the prompt template using the input key values provided (and also memory key values, if available).
- It passes the formatted string to the LLM and returns the LLM output.
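
The three steps above can be sketched without any library. `ToyLLMChain` and `echo_llm` are hypothetical names for this illustration; the placeholder model just reports how long the prompt was instead of generating text.

```python
class ToyLLMChain:
    """Formats a prompt from inputs, calls the model, and returns its output."""

    def __init__(self, template, llm):
        self.template = template
        self.llm = llm  # any callable mapping a prompt string to a string

    def run(self, **inputs):
        # Fill the template, then hand the finished string to the model.
        prompt = self.template.format(**inputs)
        return {**inputs, "text": self.llm(prompt)}

def echo_llm(prompt):
    # Hypothetical placeholder model: reports the prompt length.
    return f"(model saw {len(prompt)} characters)"

toy_chain = ToyLLMChain("Question: {question}\nAnswer:", echo_llm)
result = toy_chain.run(question="How old are you?")
print(result)
```

The result keeps the inputs alongside the model output under a `text` key, similar in shape to the dictionaries printed by the chain cells later in this notebook.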

Note : [Download Fastchat Model Here](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)

In [39]:
# %cd ./models
# !git clone https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

In [40]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from transformers import BitsAndBytesConfig
from langchain_community.llms import HuggingFacePipeline
import torch

In [41]:
model_id = 'lmsys/fastchat-t5-3b-v1.0'

tokenizer = AutoTokenizer.from_pretrained(model_id,use_fast=False)
tokenizer.save_pretrained('fastchat-t5-3b-v1.0')

tokenizer.pad_token_id = tokenizer.eos_token_id

# bitsandbyte_config = BitsAndBytesConfig(
#     load_in_4bit = True,
#     bnb_4bit_quant_type = "nf4",
#     bnb_4bit_compute_dtype = torch.float16,
#     bnb_4bit_use_double_quant = True
# )

# model = AutoModelForSeq2SeqLM.from_pretrained(
#     model_id,
#     quantization_config = bitsandbyte_config, #caution Nvidia
#     device_map = 'auto',
#     load_in_8bit = True
# )

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    device_map = 'cpu',
)

pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens = 256,
    model_kwargs = {
        "temperature" : 0,
        "repetition_penalty": 1.5
    }
)

llm = HuggingFacePipeline(pipeline = pipe)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### [Class ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/_modules/langchain/chains/conversational_retrieval/base.html#ConversationalRetrievalChain)

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator`: The chain used to generate a new question for the sake of retrieval. This chain will take in the current question (with variable question) and any chat history (with variable chat_history) and will produce a new standalone question to be used later on.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function to get a string of the chat history. If None is provided, will use a default.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.
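
How these pieces interact can be sketched as a plain function. This is a simplified illustration under stated assumptions, not the library's code: all names are hypothetical, the three components are trivial stand-ins, and the rule of skipping the condense step when the history is empty is inferred from the verbose output of the first chatbot question later in this notebook.

```python
def toy_conversational_retrieval(question, chat_history, condense, retrieve, combine):
    """Condense the follow-up, retrieve documents, then answer from them."""
    # Step 1: with no history there is nothing to condense, so keep the question as is.
    standalone = condense(chat_history, question) if chat_history else question
    # Step 2: fetch documents for the standalone question.
    docs = retrieve(standalone)
    # Step 3: combine documents and question into an answer.
    answer = combine(docs, standalone)
    return {"question": question, "generated_question": standalone,
            "answer": answer, "source_documents": docs}

# Hypothetical stand-ins for question_generator, retriever, and combine_docs_chain.
def condense(history, question):
    return f"{question} (context: {history[-1][0]})"

def retrieve(query):
    return ["Master of Science in Data Science and Artificial Intelligence"]

def combine(docs, question):
    return docs[0]

first = toy_conversational_retrieval("What is your degree?", [], condense, retrieve, combine)
follow_up = toy_conversational_retrieval(
    "Where?", [("What is your degree?", first["answer"])], condense, retrieve, combine
)
print(follow_up["generated_question"])
```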


`question_generator`

In [42]:
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain

In [43]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')

In [44]:
question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

In [45]:
text = "This is a simple test."
tokens = tokenizer.tokenize(text)
print(tokens)

['▁This', ' ', '▁is', ' ', '▁', 'a', ' ', '▁simple', ' ', '▁test', '.']


In [46]:
# Convert tokens back to a string
text_from_tokens = tokenizer.convert_tokens_to_string(tokens)
print(text_from_tokens)

This  is  a  simple  test.


In [47]:
query = 'What are the differences between your work and studies?'
chat_history = """
Human: What is your highest level of education?
AI: I am currently pursuing a Master's degree in Data Science and AI.
Human: What type of work have you done?
AI: I have 4 years of experience as a Data Analyst.
"""

question_generator({'chat_history' : chat_history, "question" : query})



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: What is your highest level of education?
AI: I am currently pursuing a Master's degree in Data Science and AI.
Human: What type of work have you done?
AI: I have 4 years of experience as a Data Analyst.

Follow Up Input: What are the differences between your work and studies?
Standalone question:



> Finished chain.


{'chat_history': "\nHuman: What is your highest level of education?\nAI: I am currently pursuing a Master's degree in Data Science and AI.\nHuman: What type of work have you done?\nAI: I have 4 years of experience as a Data Analyst.\n",
 'question': 'What are the differences between your work and studies?',
 'text': 'What   are   the   main   differences   between   your   work   and   studies? \n'}

`combine_docs_chain`

In [48]:
doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)
doc_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. \n    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, \n    and I'll provide answers to help you learn more about me. \n    Just let me know what you're wondering about, and I'll do my best to share.\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x7d7ff2b4e8d0>)), document_variable_name='context')

In [49]:
query = "What are your research interests?"
input_document = retriever.get_relevant_documents(query)

doc_chain({'input_documents':input_document, 'question':query})



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. 
    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, 
    and I'll provide answers to help you learn more about me. 
    Just let me know what you're wondering about, and I'll do my best to share.
    The most challenging part of being a master’s student has been the transition from my undergraduate studies in Actuarial Science to my current master's program in Data Science and AI. Since my background isn't directly related to data science, I sometimes find it difficult to fully grasp some of the more technical concepts. I am working hard to bridge the gap in my knowledge, but it has been challenging to keep up with the advanced topics.

As for my research interests, I honestly haven't d

{'input_documents': [Document(page_content="The most challenging part of being a master’s student has been the transition from my undergraduate studies in Actuarial Science to my current master's program in Data Science and AI. Since my background isn't directly related to data science, I sometimes find it difficult to fully grasp some of the more technical concepts. I am working hard to bridge the gap in my knowledge, but it has been challenging to keep up with the advanced topics.\n\nAs for my research interests, I honestly haven't decided on a specific topic yet. However, my main goal during my master's studies is to build a strong understanding of both the theory and the practical skills required in data science and AI. I hope this will help me feel confident in applying my knowledge and eventually secure a job as a data scientist.", metadata={'source': 'Challenges-and-Goals.txt'}),
  Document(page_content="Prapatsorn Alongkornpradub \n \nCORE COMPETENCIES AND SKILLS \n \n● Technic

In [50]:
memory = ConversationBufferWindowMemory(
    k=3, 
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)

In [51]:
chain

ConversationalRetrievalChain(memory=ConversationBufferWindowMemory(output_key='answer', return_messages=True, memory_key='chat_history', k=3), verbose=True, combine_docs_chain=StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. \n    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, \n    and I'll provide answers to help you learn more about me. \n    Just let me know what you're wondering about, and I'll do my best to share.\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x7d7ff2b4e8d0>)), document_variable_name='context'), question_generator=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['chat_h

## 5. Chatbot

In [52]:
prompt_question = "How old are you?"
answer = chain({"question":prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. 
    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, 
    and I'll provide answers to help you learn more about me. 
    Just let me know what you're wondering about, and I'll do my best to share.
    Prapatsorn Alongkornpradub 
 
CORE COMPETENCIES AND SKILLS 
 
● Technical Skills: 
○ Microsoft Office Suite (Excel, Word, PowerPoint), G Suite, Google Data Studio 
○ Beginner in Tableau, Power BI, Google Analytics, Microsoft Excel VBA 
● Language Skills: 
○ Thai (Native), English (IELTS: 6.5/9.0) 
● Key Strengths: 
○ Analytical, logical, and problem-solving skills 
○ Fast learner, open-minded, detail-oriented, and teamwork with 

{'question': 'How old are you?',
 'chat_history': [],
 'answer': '  As   of   2023,   I   am   24   years   old. \n',
 'source_documents': [Document(page_content="Prapatsorn Alongkornpradub \n \nCORE COMPETENCIES AND SKILLS \n \n● Technical Skills: \n○ Microsoft Office Suite (Excel, Word, PowerPoint), G Suite, Google Data Studio \n○ Beginner in Tableau, Power BI, Google Analytics, Microsoft Excel VBA \n● Language Skills: \n○ Thai (Native), English (IELTS: 6.5/9.0) \n● Key Strengths: \n○ Analytical, logical, and problem-solving skills \n○ Fast learner, open-minded, detail-oriented, and teamwork with a “can do” attitude \n○ Experience working under time pressure and cultural diversity \nEDUCATION \nAsian Institute of Technology (AIT), Pathum Thani, Thailand \n \nMaster of Science in Data Science and Artificial Intelligence \n \n \n• Expected Graduation: May 2025 \nMahidol University, Bangkok, Thailand \n \nBachelor of Science in Actuarial Science (International Program) \n \n \n• Graduat

In [53]:
prompt_question = "What is your highest level of education?"
answer = chain({"question":prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How old are you?'), AIMessage(content='  As   of   2023,   I   am   24   years   old. \n')]
Follow Up Input: What is your highest level of education?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. 
    Whether you're curious about my age, education, career, or personal beliefs, feel free to ask, 
    and I'll provide answers to help you learn more about me. 
    Just let me know what you're wondering about,

{'question': 'What is your highest level of education?',
 'chat_history': [HumanMessage(content='How old are you?'),
  AIMessage(content='  As   of   2023,   I   am   24   years   old. \n')],
 'answer': '  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n',
 'source_documents': [Document(page_content="The most challenging part of being a master’s student has been the transition from my undergraduate studies in Actuarial Science to my current master's program in Data Science and AI. Since my background isn't directly related to data science, I sometimes find it difficult to fully grasp some of the more technical concepts. I am working hard to bridge the gap in my knowledge, but it has been challenging to keep up with the advanced topics.\n\nAs for my research interests, I honestly haven't decided on a specific topic yet. However, my main goal during my master's studies is to build a strong understanding of both the theory and the practical skills required 

In [54]:
prompt_question = "What major or field of study did you pursue during your education?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How old are you?'), AIMessage(content='  As   of   2023,   I   am   24   years   old. \n'), HumanMessage(content='What is your highest level of education?'), AIMessage(content='  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n')]
Follow Up Input: What major or field of study did you pursue during your education?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm MookBot, your friendly assistant, here to answer any personal questions you have about my background and experiences. 
    Whether you

{'question': 'What major or field of study did you pursue during your education?',
 'chat_history': [HumanMessage(content='How old are you?'),
  AIMessage(content='  As   of   2023,   I   am   24   years   old. \n'),
  HumanMessage(content='What is your highest level of education?'),
  AIMessage(content='  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n')],
 'answer': '  Actuarial   Science. \n',
 'source_documents': [Document(page_content="The most challenging part of being a master’s student has been the transition from my undergraduate studies in Actuarial Science to my current master's program in Data Science and AI. Since my background isn't directly related to data science, I sometimes find it difficult to fully grasp some of the more technical concepts. I am working hard to bridge the gap in my knowledge, but it has been challenging to keep up with the advanced topics.\n\nAs for my research interests, I honestly haven't decided on a specific topi

In [55]:
prompt_question = "How many years of work experience do you have?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How old are you?'), AIMessage(content='  As   of   2023,   I   am   24   years   old. \n'), HumanMessage(content='What is your highest level of education?'), AIMessage(content='  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n'), HumanMessage(content='What major or field of study did you pursue during your education?'), AIMessage(content='  Actuarial   Science. \n')]
Follow Up Input: How many years of work experience do you have?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm MookBot, your 

{'question': 'How many years of work experience do you have?',
 'chat_history': [HumanMessage(content='How old are you?'),
  AIMessage(content='  As   of   2023,   I   am   24   years   old. \n'),
  HumanMessage(content='What is your highest level of education?'),
  AIMessage(content='  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n'),
  HumanMessage(content='What major or field of study did you pursue during your education?'),
  AIMessage(content='  Actuarial   Science. \n')],
 'answer': '  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n',
 'source_documents': [Document(page_content="Prapatsorn Alongkornpradub \n \nCORE COMPETENCIES AND SKILLS \n \n● Technical Skills: \n○ Microsoft Office Suite (Excel, Word, PowerPoint), G Suite, Google Data Studio \n○ Beginner in Tableau, Power BI, Google Analytics, Microsoft Excel VBA \n● Language Skills: \n○ Thai (Nati

In [56]:
prompt_question = "What type of work or industry have you been involved in?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What is your highest level of education?'), AIMessage(content='  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n'), HumanMessage(content='What major or field of study did you pursue during your education?'), AIMessage(content='  Actuarial   Science. \n'), HumanMessage(content='How many years of work experience do you have?'), AIMessage(content='  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n')]
Follow Up Input: What type of work or industry have you been involved in?
Standalone question:

> Finished chain.


> Enterin

{'question': 'What type of work or industry have you been involved in?',
 'chat_history': [HumanMessage(content='What is your highest level of education?'),
  AIMessage(content='  Master   of   Science   in   Data   Science   and   Artificial   Intelligence \n'),
  HumanMessage(content='What major or field of study did you pursue during your education?'),
  AIMessage(content='  Actuarial   Science. \n'),
  HumanMessage(content='How many years of work experience do you have?'),
  AIMessage(content='  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n')],
 'answer': '  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   companies   in   Thailand.   I   have   also   worked   as   a   Data   Analyst   Intern   at   WorkVenture   Technologies   Comp
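
From this turn on, the chat history passed to the question-rephrasing prompt no longer contains the first exchange ("How old are you?"); only the three most recent exchanges appear. That is consistent with a windowed conversation memory (for example, LangChain's `ConversationBufferWindowMemory`), but the memory configuration is defined outside this section, so the window size of 3 is an inference from the logs. The sliding-window behaviour can be sketched in plain Python, using a hypothetical `WindowMemory` class rather than LangChain's own memory classes:

```python
from collections import deque


class WindowMemory:
    """Keep only the k most recent (question, answer) exchanges."""

    def __init__(self, k: int = 3):
        # A deque with maxlen discards the oldest item once it is full.
        self.turns = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def history(self) -> list:
        return list(self.turns)


memory = WindowMemory(k=3)  # k=3 is inferred from the logs, not confirmed
memory.save("How old are you?", "As of 2023, I am 24 years old.")
memory.save("What is your highest level of education?",
            "Master of Science in Data Science and Artificial Intelligence")
memory.save("What major or field of study did you pursue during your education?",
            "Actuarial Science.")
memory.save("How many years of work experience do you have?",
            "As of 2023, I have 2 years of work experience as a Data Analyst "
            "and 1 year as a Data Analyst Intern.")

# The oldest exchange has been dropped; three remain.
for question, _ in memory.history():
    print(question)
```

A bounded window keeps the rephrasing prompt short, at the cost of the model losing access to earlier turns such as the age question.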

In [57]:
prompt_question = "Can you describe your current role or job responsibilities?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What major or field of study did you pursue during your education?'), AIMessage(content='  Actuarial   Science. \n'), HumanMessage(content='How many years of work experience do you have?'), AIMessage(content='  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n'), HumanMessage(content='What type of work or industry have you been involved in?'), AIMessage(content='  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   compani

{'question': 'Can you describe your current role or job responsibilities?',
 'chat_history': [HumanMessage(content='What major or field of study did you pursue during your education?'),
  AIMessage(content='  Actuarial   Science. \n'),
  HumanMessage(content='How many years of work experience do you have?'),
  AIMessage(content='  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n'),
  HumanMessage(content='What type of work or industry have you been involved in?'),
  AIMessage(content='  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   companies   in   Thailand.   I   have   also   worked   as   a   Data   Analyst   Intern   at   WorkVenture   Technologies   Company   Limited   in   Bangkok. \n')],
 'answer': "  Sure,   here   are   some   m

In [58]:
prompt_question = "What are your core beliefs regarding the role of technology in shaping society?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='How many years of work experience do you have?'), AIMessage(content='  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n'), HumanMessage(content='What type of work or industry have you been involved in?'), AIMessage(content='  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   companies   in   Thailand.   I   have   also   worked   as   a   Data   Analyst   Intern   at   WorkVenture   Technologies   Company   Limited   in

{'question': 'What are your core beliefs regarding the role of technology in shaping society?',
 'chat_history': [HumanMessage(content='How many years of work experience do you have?'),
  AIMessage(content='  As   of   2023,   I   have   2   years   of   work   experience   as   a   Data   Analyst   and   1   year   as   a   Data   Analyst   Intern. \n'),
  HumanMessage(content='What type of work or industry have you been involved in?'),
  AIMessage(content='  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   companies   in   Thailand.   I   have   also   worked   as   a   Data   Analyst   Intern   at   WorkVenture   Technologies   Company   Limited   in   Bangkok. \n'),
  HumanMessage(content='Can you describe your current role or job responsibilities?'),
  AIMessage(content="  Sure,   here   are   some   more   details   about   my   

In [59]:
prompt_question = "How do you think cultural values should influence technological advancements?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What type of work or industry have you been involved in?'), AIMessage(content='  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   companies   in   Thailand.   I   have   also   worked   as   a   Data   Analyst   Intern   at   WorkVenture   Technologies   Company   Limited   in   Bangkok. \n'), HumanMessage(content='Can you describe your current role or job responsibilities?'), AIMessage(content="  Sure,   here   are   some   more   details   about   my   previous   roles   and   responsibilities   as   a   data   

{'question': 'How do you think cultural values should influence technological advancements?',
 'chat_history': [HumanMessage(content='What type of work or industry have you been involved in?'),
  AIMessage(content='  I   have   been   involved   in   various   industries   such   as   finance,   insurance,   and   technology.   I   have   worked   as   a   Data   Analyst   and   Reporting   Analyst   in   various   companies   in   Thailand.   I   have   also   worked   as   a   Data   Analyst   Intern   at   WorkVenture   Technologies   Company   Limited   in   Bangkok. \n'),
  HumanMessage(content='Can you describe your current role or job responsibilities?'),
  AIMessage(content="  Sure,   here   are   some   more   details   about   my   previous   roles   and   responsibilities   as   a   data   analyst: \n 1.   Reporting   Analyst:   As   a   reporting   analyst   at   Thairath   Group,   I   was   responsible   for   recording   and   updating   the   database   for   all   team

In [60]:
prompt_question = "As a master’s student, what is the most challenging aspect of your studies so far?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='Can you describe your current role or job responsibilities?'), AIMessage(content="  Sure,   here   are   some   more   details   about   my   previous   roles   and   responsibilities   as   a   data   analyst: \n 1.   Reporting   Analyst:   As   a   reporting   analyst   at   Thairath   Group,   I   was   responsible   for   recording   and   updating   the   database   for   all   team's   usage.   I   also   managed   multiple   reports   to   ensure   all   data   was   accurate   and   organized.   I   generated   Excel   calculation   files   to   reduce   input   time   and   improve   productivity.   I   supported   managers   and   performed   other   adm

{'question': 'As a master’s student, what is the most challenging aspect of your studies so far?',
 'chat_history': [HumanMessage(content='Can you describe your current role or job responsibilities?'),
  AIMessage(content="  Sure,   here   are   some   more   details   about   my   previous   roles   and   responsibilities   as   a   data   analyst: \n 1.   Reporting   Analyst:   As   a   reporting   analyst   at   Thairath   Group,   I   was   responsible   for   recording   and   updating   the   database   for   all   team's   usage.   I   also   managed   multiple   reports   to   ensure   all   data   was   accurate   and   organized.   I   generated   Excel   calculation   files   to   reduce   input   time   and   improve   productivity.   I   supported   managers   and   performed   other   administrative   duties   as   assigned. \n 2.   Data   Analyst:   As   a   Data   Analyst   at   Toyota   Nakornping   Chiangmai   Company   Limited,   I   was   responsible   for   analyzi

In [61]:
prompt_question = "What specific research interests or academic goals do you hope to achieve during your time as a master’s student?"
answer = chain.invoke({"question": prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='What are your core beliefs regarding the role of technology in shaping society?'), AIMessage(content='                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               '), HumanMessage(content='How do you think cultural values 

{'question': 'What specific research interests or academic goals do you hope to achieve during your time as a master’s student?',
 'chat_history': [HumanMessage(content='What are your core beliefs regarding the role of technology in shaping society?'),
  AIMessage(content='                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               '),
  HumanMessage(content='How do you think cultural values should influence technological advancements?'),
  AIMessage(content='              As   technology   becomes   part   of   everyday   life,   it   should 