# Retrieval-Augmented Generation (RAG)

Introducing `AITBot`, a chatbot designed to help members of AIT's information staff answer questions about AIT. It is built from four components:


1. Retrieval: find the relevant sources related to AIT
2. Prompt: design the prompt template
3. Memory: let the model see the previous questions and answers
4. Chain: combine the retriever, prompt, memory, and LLM

## Task 1

In [3]:
# #langchain library
# !pip install langchain==0.1.6
# !pip install langchain-community==0.0.19
# #LLM
# !pip install accelerate==0.25.0
# !pip install transformers==4.36.2
# !pip install bitsandbytes==0.41.2 
# #Text Embedding
# !pip install sentence-transformers==2.2.2
# !pip install InstructorEmbedding==1.0.1
# #vectorstore
# !pip install pymupdf==1.23.8
# !pip install faiss-gpu==1.7.2
# !pip install faiss-cpu==1.7.4
# !pip install requests beautifulsoup4


In [4]:
import os
import torch
# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

### Find all relevant sources related to AIT

### 1. Retrieval

1. `Document loaders`: Load documents from many different sources (HTML, PDF, code).
2. `Document transformers`: Break large documents into smaller, relevant chunks, an essential step for improving retrieval.
3. `Text embedding models`: Embeddings capture the semantic meaning of text, allowing you to quickly and efficiently find other pieces of text that are similar.
4. `Vector stores`: Databases that support efficient storage and searching of these embeddings.
5. `Retrievers`: Once the data is in the database, you still need to retrieve it.

#### 1.1 Document Loaders 
I load text from the AIT website and save each page as an HTML file.

In [3]:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

def sanitize_filename(url):
    # Remove special characters from the URL
    filename = re.sub(r'[^a-zA-Z0-9_-]', '_', url)
    return filename

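def crawl_website(url, visited_pages=None, max_depth=3, html_directory='html_files'):
    # Use a fresh set per top-level call; a mutable default would be shared across calls
    if visited_pages is None:
        visited_pages = set()
    if max_depth == 0:
        return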
    
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Add the current URL to the visited pages set
        visited_pages.add(url)
        
        # Save the page as HTML
        html_content = str(soup)
        file_name = f"{sanitize_filename(url)}.html"
        with open(os.path.join(html_directory, file_name), 'w', encoding='utf-8') as f:
            f.write(html_content)
        
        # Find all links on the page
        for link in soup.find_all('a'):
            # Get the absolute URL of the link
            absolute_link = urljoin(url, link.get('href'))
            
            # Check if the link is within the same domain and has not been visited yet
            if absolute_link.startswith(url) and absolute_link not in visited_pages:
                # Recursively crawl the new page
                crawl_website(absolute_link, visited_pages, max_depth - 1, html_directory)


# Example usage
url = 'https://ait.ac.th/' 
html_directory = 'html_files'
os.makedirs(html_directory, exist_ok=True)
crawl_website(url, html_directory=html_directory)


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
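
Note that `absolute_link.startswith(url)` compares each link against the page currently being crawled, so deeper levels of the crawl only follow links nested under that page's path. If the intent is to stay on the same site, one possible alternative is to compare host names with `urllib.parse`. The helper below is a sketch, not part of the original notebook, and `is_same_site` is a hypothetical name:

```python
from urllib.parse import urljoin, urlparse


def is_same_site(base_url: str, link: str) -> bool:
    """Return True if `link` (relative or absolute) resolves to the same host as `base_url`."""
    absolute = urljoin(base_url, link)
    return urlparse(absolute).netloc == urlparse(base_url).netloc


# Hypothetical links, for illustration only
print(is_same_site("https://ait.ac.th/about/", "/study/"))               # True
print(is_same_site("https://ait.ac.th/about/", "https://example.com/"))  # False
```

Non-HTTP links such as `mailto:` addresses have no host, so this check rejects them as well.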


In [4]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
#!pip install unstructured

# Make document from HTML file
def make_document(html_directory):
    documents = []
    html_files = [os.path.join(html_directory, f) for f in os.listdir(html_directory) if f.endswith('.html')]
    for file in html_files:
        loader = UnstructuredHTMLLoader(file)
        documents += loader.load()
    return documents

documents = make_document(html_directory)

In [5]:
# List all documents
documents

[Document(page_content='Welcome to AIT\n\nAIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs prepare graduates for professional success and leadership roles in Asia and beyond.\n\nLearn More\n\nAIT Professors Shine Globally\n\nSixteen professors from AIT have been recognized among the top 2% of influential scientists globally in their respective fields, as per the latest annual rankings published by Stanford University on 4th October 2023.\n\nLearn More\n\nAIT’s MBA and BADT Programs Achieve Top Ranking\n\nLearn More\n\nTargeted Businesses to Enjoy Privileges from New BOI Designated Zone for AIT\n\nLearn More\n\nFlexible PhD Option\n\nThe “Flexible PhD Option” is a new alternative to our traditional “On-campus PhD Option” to conduct PhD studies at AIT.\n\nLearn More\n\nFlexible Master\'s Option\n\nThe “Flexible Master’s Option” is an altern

In [6]:
# Count the number of pages
len(documents)

102

#### 1.2 Document Transformers

Split the documents into smaller chunks.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 25
)

doc = text_splitter.split_documents(documents)

In [8]:
doc[1]

Document(page_content='AIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs', metadata={'source': 'html_files\\https___ait_ac_th_.html'})

In [9]:
len(doc)

2015
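
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch in plain Python. It cuts fixed-size character windows only, whereas `RecursiveCharacterTextSplitter` first tries to split at natural separators such as paragraphs, lines, and spaces. The function name `split_with_overlap` is hypothetical:

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 200, chunk_overlap: int = 25) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window shares `chunk_overlap` characters with the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 500  # placeholder text
print([len(c) for c in split_with_overlap(sample)])  # [200, 200, 150]
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbouring chunks, which can help retrieval at the cost of some duplication.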

#### 1.3 Text Embedding Models
Embed the text with a model from Hugging Face.

In [6]:
import torch
from langchain.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    model_kwargs = {"device" : device}
)



load INSTRUCTOR_Transformer




max_seq_length  512
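
Once text is embedded, "similar meaning" becomes "nearby vectors". A common way to compare two embeddings is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have far more dimensions, and the values here are assumptions, not output from `instructor-base`:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(query, similar) > cosine_similarity(query, unrelated))  # True
```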


#### 1.4 Vector Stores
Create a vector store to hold the embedded data; it also performs the vector search for you.

In [11]:
# Locate vectorstore
vector_path = './vector-store'
if not os.path.exists(vector_path):
    os.makedirs(vector_path)
    print('create path done')

create path done


In [12]:
# Save vector locally
from langchain.vectorstores import FAISS

vectordb = FAISS.from_documents(
    documents = doc,
    embedding = embedding_model
)

db_file_name = 'AIT'

vectordb.save_local(
    folder_path = os.path.join(vector_path, db_file_name),
    index_name = 'ait' # custom index name (the library default is "index")
)
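
Conceptually, a vector store keeps each chunk's embedding next to its text and returns the stored entries closest to a query vector. The toy class below is a hypothetical pure-Python stand-in that searches exhaustively with Euclidean distance (an assumption made for this sketch); FAISS provides far more efficient index structures for the same job:

```python
import math
from typing import List, Sequence, Tuple


class TinyVectorIndex:
    """Toy in-memory index: stores (vector, text) pairs and searches exhaustively."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: Sequence[float], text: str) -> None:
        self._items.append((list(vector), text))

    def search(self, query: Sequence[float], k: int = 4) -> List[Tuple[float, str]]:
        """Return the k stored texts with the smallest distance to `query`."""
        scored = [(math.dist(query, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


index = TinyVectorIndex()
# Made-up 2-D vectors standing in for chunk embeddings
index.add([0.0, 0.0], "chunk about admissions")
index.add([1.0, 1.0], "chunk about scholarships")
index.add([5.0, 5.0], "chunk about the campus map")

results = index.search([0.9, 1.2], k=2)
print([text for _, text in results])
```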

#### 1.5 Retrievers
A retriever is an interface that returns documents given an unstructured query.

In [7]:
# Load the vector store from local disk
vector_path = './vector-store'
db_file_name = 'AIT'

from langchain.vectorstores import FAISS

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'ait', # custom index name used when saving (the library default is "index")
)   

In [8]:
#ready to use
retriever = vectordb.as_retriever()

In [9]:
retriever.get_relevant_documents("What is AIT?")

[Document(page_content='documents from AIT.', metadata={'source': 'html_files\\https___ait_ac_th_study_open-master-of-engineering-science-in-interdisciplinary-studies-omis_.html'}),
 Document(page_content='Welcome to AIT', metadata={'source': 'html_files\\https___ait_ac_th_.html'}),
 Document(page_content='Home >\n\nAbout\n\nAbout AIT\n\nAIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies.\n\nIn this section\n\nAbout AIT', metadata={'source': 'html_files\\https___ait_ac_th_about_.html'}),
 Document(page_content='AIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs', metadata={'source': 'html_files\\https___ait_ac_th_.html'})]

In [10]:
retriever.get_relevant_documents("What is DSAI?")

[Document(page_content='Computer Science (CS)\n\nConstruction, Engineering and Infrastructure Management (CEIM)\n\nData Science and AI (DSAI)\n\nDevelopment and Sustainability (DS)', metadata={'source': 'html_files\\https___ait_ac_th_academics_programs_.html'}),
 Document(page_content='Data Science and AI (DSAI)\n\nSchool of Environment, Resources, and Development (SERD)\n\nDEPARTMENT OF FOOD, AGRICULTURE AND BIORESOURCES\n\nAgribusiness Management (ABM)', metadata={'source': 'html_files\\https___ait_ac_th_study_flexible-masters-option_.html'}),
 Document(page_content='Data Science and AI (DSAI)\n\nSchool of Environment, Resources and Development (SERD)\n\nDEPARTMENT OF FOOD, AGRICULTURE, AND BIORESOURCES\n\nAgriBusiness Management (ABM)', metadata={'source': 'html_files\\https___ait_ac_th_study_flexible-phd-option_.html'}),
 Document(page_content='cooperation within the framework of DDAM, on the occasion of the 7 th DAAM international symposium to celebrate the 1000 th anniversary of 

### 2. Prompt

A prompt is a set of instructions or input provided by a user to guide the model's response. It helps the model understand the context and generate relevant, coherent output.

In [11]:
from langchain import PromptTemplate

prompt_template = """
    I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    {context}
    Question: {question}
    Answer:
    """.strip()

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

PROMPT
# PromptTemplate uses str.format-style placeholders,
# written in curly brackets: {context} and {question}

PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    {context}\n    Question: {question}\n    Answer:")
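
The template shown above keeps the four-space indentation of the triple-quoted string (visible as `\n    ` in the output), because `.strip()` only trims the two ends. If that indentation is unwanted, `textwrap.dedent` from the standard library is one way to remove it. The sketch below uses plain `str.format` and a shortened, hypothetical template instead of `PromptTemplate`:

```python
import textwrap

raw_template = """
    You are AITBot. Use the context to answer the question.
    {context}
    Question: {question}
    Answer:
    """

# dedent removes the common leading whitespace; strip drops the blank first and last lines
template = textwrap.dedent(raw_template).strip()

prompt = template.format(
    context="AIT is an international postgraduate institution.",
    question="What is AIT?",
)
print(prompt)
```

Whether the extra indentation affects answer quality for a given model is an open question; removing it mainly keeps the prompt predictable.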

In [12]:
PROMPT.format(
    context = "The Asian Institute of Technology (AIT) is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs prepare graduates for professional success and leadership roles in Asia and beyond.",
    question = "What is AIT"
)

"I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    The Asian Institute of Technology (AIT) is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs prepare graduates for professional success and leadership roles in Asia and beyond.\n    Question: What is AIT\n    Answer:"

### 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the ChatMessageHistory class. This is a super lightweight wrapper that provides convenience methods for saving HumanMessages, AIMessages, and then fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.


In [13]:
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history

ChatMessageHistory(messages=[])

In [14]:
history.add_user_message('hi')
history.add_ai_message('Whats up?')
history.add_user_message('How are you')
history.add_ai_message('I\'m quite good. How about you?')

In [15]:
history

ChatMessageHistory(messages=[HumanMessage(content='hi'), AIMessage(content='Whats up?'), HumanMessage(content='How are you'), AIMessage(content="I'm quite good. How about you?")])

### 3.1 Memory types

There are many different types of memory. Each has its own parameters and return types, and each is useful in different scenarios. Two are covered here:
- Conversation Buffer
- Conversation Buffer Window

What variables get returned from memory

Before going into the chain, various variables are read from memory. These have specific names which need to align with the variables the chain expects. You can see what these variables are by calling `memory.load_memory_variables({})`. The empty dictionary is just a placeholder for real variables; if the memory type you are using depends on the input variables, you may need to pass some in.

In the examples that follow, `load_memory_variables` returns a single key, `history`. This means that your chain (and likely your prompt) should expect an input named `history`. You can usually control this name through parameters on the memory class. For example, passing `memory_key = "chat_history"` returns the memory variables under the key `chat_history`, as the `ConversationBufferWindowMemory` in the Chain section does.

#### Conversation Buffer
This memory stores messages and then extracts them into a variable.

In [16]:
# The most common usage: the history is returned as a single string
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

In [17]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi'),
  AIMessage(content="What's up?"),
  HumanMessage(content='How are you?'),
  AIMessage(content="I'm quite good. How about you?")]}

#### Conversation Buffer Window
- It keeps a list of the interactions of the conversation over time.
- It only uses the last K interactions.
- It is useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.

In [18]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=1) #Keep last one
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: How are you?\nAI: I'm quite good. How about you?"}
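
The sliding-window behaviour above can be approximated in a few lines with `collections.deque`. This is only an illustrative sketch of the idea; `WindowBuffer` is a hypothetical class, not LangChain's implementation:

```python
from collections import deque


class WindowBuffer:
    """Hypothetical stand-in for a windowed memory: keeps only the last k exchanges."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen drops the oldest exchange automatically when full
        self._turns = deque(maxlen=k)

    def save_context(self, user_input: str, ai_output: str) -> None:
        self._turns.append((user_input, ai_output))

    def load_history(self) -> str:
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self._turns)


buffer = WindowBuffer(k=1)
buffer.save_context("hi", "What's up?")
buffer.save_context("How are you?", "I'm quite good. How about you?")
print(buffer.load_history())
```

With `k=1`, only the second exchange is printed, matching the output of the `ConversationBufferWindowMemory` cell.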

### 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- It consists of a `PromptTemplate` and a language model (either an LLM or a chat model).
- It formats the prompt template using the input key values provided (and also memory key values, if available).
- It passes the formatted string to the LLM and returns the LLM output.
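
The three steps in this list can be sketched without any library. The code below is a minimal illustration under stated assumptions: `make_chain` and `fake_llm` are hypothetical names, and the fake model only reports the prompt length instead of generating text:

```python
from typing import Callable, Dict


def make_chain(template: str, llm: Callable[[str], str]) -> Callable[[Dict[str, str]], Dict[str, str]]:
    """Build a function that formats `template`, calls `llm`, and returns inputs plus output."""

    def run(inputs: Dict[str, str]) -> Dict[str, str]:
        prompt = template.format(**inputs)  # 1. fill the template with the input values
        text = llm(prompt)                  # 2. pass the formatted string to the model
        return {**inputs, "text": text}     # 3. return the inputs together with the output

    return run


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: reports how many characters it received."""
    return f"(model saw {len(prompt)} characters)"


chain = make_chain("Question: {question}\nAnswer:", fake_llm)
result = chain({"question": "What is AIT?"})
print(result["text"])
```

Swapping `fake_llm` for any callable that maps a prompt string to a completion gives the same flow with a real model.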

Note: [Download the FastChat model here](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)

In [35]:
# Create a local directory for the downloaded model
model_path = './models'
if not os.path.exists(model_path):
    os.makedirs(model_path)
    print('create path done')

create path done


In [36]:
%cd ./models
!git clone https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

c:\Users\earth\AIT\Sem2 23\NLP\Assignment\A7\NLP_A7_AIT_GPT_Chatbot\models


Cloning into 'fastchat-t5-3b-v1.0'...
Filtering content: 100% (2/2), 2.24 GiB | 5.70 MiB/s, done.


In [20]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from transformers import BitsAndBytesConfig
from langchain import HuggingFacePipeline
import torch

model_id = './models/fastchat-t5-3b-v1.0/'

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id


model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens = 256,
    model_kwargs = {
        "temperature" : 0,
        "repetition_penalty": 1.5
    }
)

llm = HuggingFacePipeline(pipeline = pipe)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### [Class ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/_modules/langchain/chains/conversational_retrieval/base.html#ConversationalRetrievalChain)

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator`: The chain used to generate a new question for the sake of retrieval. This chain will take in the current question (with variable question) and any chat history (with variable chat_history) and will produce a new standalone question to be used later on.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function to get a string of the chat history. If None is provided, will use a default.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.
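
How these pieces fit together can be sketched with plain functions. This is a simplified illustration, not the library's code: `conversational_retrieval`, `condense`, `retrieve`, and `answer` are hypothetical stand-ins for the chain, `question_generator`, `retriever`, and `combine_docs_chain`, and the two-sentence corpus is made up:

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (human message, AI message) pairs


def conversational_retrieval(
    question: str,
    chat_history: History,
    condense: Callable[[History, str], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[List[str], str], str],
) -> Dict[str, object]:
    # Rephrase only when there is history to resolve references against
    standalone = condense(chat_history, question) if chat_history else question
    docs = retrieve(standalone)       # fetch documents for the standalone question
    reply = answer(docs, standalone)  # combine documents and question into an answer
    return {
        "question": question,
        "generated_question": standalone,
        "answer": reply,
        "source_documents": docs,
    }


def condense(history: History, question: str) -> str:
    return f"{question} (in the context of: {history[-1][0]})"


def retrieve(query: str) -> List[str]:
    corpus = ["AIT offers scholarships.", "AIT is in Thailand."]  # made-up corpus
    words = set(query.lower().replace("?", "").split())
    return [d for d in corpus if words & set(d.lower().replace(".", "").split())]


def answer(docs: List[str], question: str) -> str:
    return docs[0] if docs else "No information found."


first = conversational_retrieval("Are there scholarships?", [], condense, retrieve, answer)
print(first["answer"])
```

With an empty history the question is used as-is, which matches the first chatbot turn in this notebook, where no question-rephrasing step appears in the verbose log.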


`question_generator`

In [21]:
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain

In [22]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')

In [23]:
question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

In [24]:
query = 'Comparing both of them'
chat_history = "Human:What is CS\nAI:\nHuman:What s DSAI\nAI:"

question_generator({'chat_history' : chat_history, "question" : query})

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
Human:What is CS
AI:
Human:What s DSAI
AI:
Follow Up Input: Comparing both of them
Standalone question:

> Finished chain.


{'chat_history': 'Human:What is CS\nAI:\nHuman:What s DSAI\nAI:',
 'question': 'Comparing both of them',
 'text': '<pad> What  is  the  difference  between  CS  and  DS  AI?\n'}
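
The generated text above starts with a `<pad>` token and contains doubled spaces. A small post-processing helper could normalize it before the answer is shown to a user. The function below is an assumption added for illustration (`clean_generation` is a hypothetical name and is not used elsewhere in this notebook):

```python
import re


def clean_generation(text: str) -> str:
    """Remove pad-token markers and collapse repeated whitespace."""
    text = re.sub(r"<\s*pad\s*>", " ", text)  # matches "<pad>" and "< pad>"
    return re.sub(r"\s+", " ", text).strip()


raw = "<pad> What  is  the  difference  between  CS  and  DS  AI?\n"
print(clean_generation(raw))  # What is the difference between CS and DS AI?
```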

`combine_docs_chain`

In [25]:
doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)
doc_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x000001F283100690>)), document_variable_name='context')

In [26]:
query = "What is AIT?"
input_document = retriever.get_relevant_documents(query)

doc_chain({'input_documents':input_document, 'question':query})

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    documents from AIT.

Welcome to AIT

Home >

About

About AIT

AIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies.

In this section

About AIT

AIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs
    Question: What is AIT?
    Answer:

> Finished chain.

> Finished chain.


{'input_documents': [Document(page_content='documents from AIT.', metadata={'source': 'html_files\\https___ait_ac_th_study_open-master-of-engineering-science-in-interdisciplinary-studies-omis_.html'}),
  Document(page_content='Welcome to AIT', metadata={'source': 'html_files\\https___ait_ac_th_.html'}),
  Document(page_content='Home >\n\nAbout\n\nAbout AIT\n\nAIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies.\n\nIn this section\n\nAbout AIT', metadata={'source': 'html_files\\https___ait_ac_th_about_.html'}),
  Document(page_content='AIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies. AIT’s rigorous academic, research, and experiential outreach programs', metadata={'source': 'html_files\\https___ait_ac_th_.html'})],
 'question': 'What is AIT?',
 'output_text': '<pad> AIT  is  an  international  English-speaking  postgraduate  institut

In [27]:
memory = ConversationBufferWindowMemory(
    k=3, 
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)
chain

ConversationalRetrievalChain(memory=ConversationBufferWindowMemory(output_key='answer', return_messages=True, memory_key='chat_history', k=3), verbose=True, combine_docs_chain=StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x000001F283100690>)), document_variable_name='context'), question_generator=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the fo

### 5. Chatbot

In [28]:
prompt_question = "Who are you by the way?"
answer = chain({"question":prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    Home >

Privacy Policy

Privacy Policy

Who we are

Our website address is https://ait.ac.th.

you. We hope that will facilitate your personal and professional development.

you. We hope that will facilitate your personal and professional development.

Citizenship of an English speaking country. Applicants who are citizens of and have been educated in an English- speaking country (Australia, Canada, Ireland, New Zealand, the UK, and the USA) are
    Question: Who are you by the way?
    Answer:

> Finished chain.

> Fi

{'question': 'Who are you by the way?',
 'chat_history': [],
 'answer': "<pad> I'm  AITBot,  a  chatbot  created  by  Noppawee  Teeraratchanon  to  assist  AIT  information  member  to  answer  the  question  people  may  have  about  AIT.  I'm  here  to  help  you  with  any  questions  you  may  have  about  AIT,  so  feel  free  to  ask  me  anything  you  need  help  with!\n",
 'source_documents': [Document(page_content='Home >\n\nPrivacy Policy\n\nPrivacy Policy\n\nWho we are\n\nOur website address is https://ait.ac.th.', metadata={'source': 'html_files\\https___ait_ac_th_privacy-policy_.html'}),
  Document(page_content='you. We hope that will facilitate your personal and professional development.', metadata={'source': 'html_files\\https___ait_ac_th_study_flexible-masters-option_.html'}),
  Document(page_content='you. We hope that will facilitate your personal and professional development.', metadata={'source': 'html_files\\https___ait_ac_th_study_flexible-phd-option_.html'}),
  D

In [29]:
prompt_question = "What is AIT?"
answer = chain({"question":prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='Who are you by the way?'), AIMessage(content="<pad> I'm  AITBot,  a  chatbot  created  by  Noppawee  Teeraratchanon  to  assist  AIT  information  member  to  answer  the  question  people  may  have  about  AIT.  I'm  here  to  help  you  with  any  questions  you  may  have  about  AIT,  so  feel  free  to  ask  me  anything  you  need  help  with!\n")]
Follow Up Input: What is AIT?
Standalone question:

> Finished chain.

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT inf

{'question': 'What is AIT?',
 'chat_history': [HumanMessage(content='Who are you by the way?'),
  AIMessage(content="<pad> I'm  AITBot,  a  chatbot  created  by  Noppawee  Teeraratchanon  to  assist  AIT  information  member  to  answer  the  question  people  may  have  about  AIT.  I'm  here  to  help  you  with  any  questions  you  may  have  about  AIT,  so  feel  free  to  ask  me  anything  you  need  help  with!\n")],
 'answer': '<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.\n',
 'source_documents': [Document(page_content='documents from AIT.', metadata={'source': 'html_files\\https___ait_ac_th_study_open-master-of-engineering-science-in-interdisciplinary-studies-omis_.html'}),
  Document(page_content='Welcome to AIT', metadata={'source': 'html_files\\https___ait_ac_th_.html'}),
  Document(page_content='Home >\n\nAbout\n\nAbout AIT\n\nAIT is an international English-spe

In [30]:
prompt_question = "Are there any scholarship"
answer = chain({"question":prompt_question})
answer

> Entering new ConversationalRetrievalChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='Who are you by the way?'), AIMessage(content="<pad> I'm  AITBot,  a  chatbot  created  by  Noppawee  Teeraratchanon  to  assist  AIT  information  member  to  answer  the  question  people  may  have  about  AIT.  I'm  here  to  help  you  with  any  questions  you  may  have  about  AIT,  so  feel  free  to  ask  me  anything  you  need  help  with!\n"), HumanMessage(content='What is AIT?'), AIMessage(content='<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.\n')]
Follow Up Input: Are there any scholarship
Standalone question:

> Finished chain.



{'question': 'Are there any scholarship',
 'chat_history': [HumanMessage(content='Who are you by the way?'),
  AIMessage(content="<pad> I'm  AITBot,  a  chatbot  created  by  Noppawee  Teeraratchanon  to  assist  AIT  information  member  to  answer  the  question  people  may  have  about  AIT.  I'm  here  to  help  you  with  any  questions  you  may  have  about  AIT,  so  feel  free  to  ask  me  anything  you  need  help  with!\n"),
  HumanMessage(content='What is AIT?'),
  AIMessage(content='<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.\n')],
 'answer': '<pad> Yes,  we  have  various  forms  of  financial  packages  available  to  highly-qualified  applicants  from  funds  granted  by  donors  as  well  as  from  AIT  itself.  Please  check  this  site:\n Scholarships\n',
 'source_documents': [Document(page_content='9. Do I need to apply for the scholarships listed on the

1) Analyze the model’s performance in retrieving information.<br>
    For the three questions above, AITBot's answers were reasonably good and relevant to each question.
2) Address any issues related to the model providing unrelated information.<br>
    For question 2 ("What is AIT?"), one retrieved source document is uninformative: its only content is "Welcome to AIT". For question 1 ("Who are you by the way?") and question 3 ("Are there any scholarship"), some source documents have identical content. Both issues stem from the dataset used in this assignment and could be addressed by cleaning the data.
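
One possible form of that data cleaning is to drop very short chunks and exact duplicates before building the vector store. The sketch below is an assumption, not something run in this notebook: it uses plain dictionaries instead of `Document` objects, and both `clean_chunks` and the 30-character threshold are hypothetical choices:

```python
from typing import Dict, List


def clean_chunks(chunks: List[Dict[str, str]], min_chars: int = 30) -> List[Dict[str, str]]:
    """Keep the first occurrence of each chunk text and drop chunks shorter than `min_chars`."""
    seen = set()
    kept = []
    for chunk in chunks:
        text = " ".join(chunk["page_content"].split())  # normalize whitespace before comparing
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(chunk)
    return kept


# Examples modelled on the retrieved chunks discussed above
chunks = [
    {"page_content": "Welcome to AIT", "source": "home"},
    {"page_content": "you. We hope that will facilitate your personal and professional development.", "source": "masters"},
    {"page_content": "you. We hope that will facilitate your personal and professional development.", "source": "phd"},
]
print([c["source"] for c in clean_chunks(chunks)])  # ['masters']
```

A length threshold is a blunt tool: it also removes short but meaningful chunks, so the value would need tuning against the actual data.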

## Task 3 Demo

In [57]:
from flask import Flask, render_template, request
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from langchain import HuggingFacePipeline
import torch
from langchain import PromptTemplate
from langchain_community.vectorstores import FAISS
import os
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    model_kwargs = {"device" : 'cpu'}
)


# Load the previously saved FAISS vector store from local disk
vector_path = '../vector-store'
db_file_name = 'AIT'

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'ait', # name the index was saved under
)

retriever = vectordb.as_retriever()


prompt_template = """
    I'm your friendly chatbot named AITBot created by Noppawee Teeraratchanon, here to assist AIT information member to answer the question people may have about AIT.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    {context}
    Question: {question}
    Answer:
    """.strip()

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

model_id = './fastchat-t5-3b-v1.0/'

tokenizer = AutoTokenizer.from_pretrained(
    model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens = 256,
    model_kwargs = {
        "temperature" : 0,
        "repetition_penalty": 1.5
    }
)

llm = HuggingFacePipeline(pipeline = pipe)


question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)

memory = ConversationBufferWindowMemory(
    k=3, 
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)


load INSTRUCTOR_Transformer


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


max_seq_length  512


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
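
The `ConversationBufferWindowMemory(k=3)` configured above keeps only the last three question/answer exchanges in `chat_history`. The sketch below illustrates that windowing idea in plain Python. It is a toy stand-in, not LangChain's implementation, and the names `WindowMemory` and `memory_demo` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowMemory:
    """Toy stand-in that keeps only the last k (question, answer) exchanges."""

    def __init__(self, k: int = 3) -> None:
        # A deque with maxlen silently discards the oldest item when full.
        self.exchanges: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.exchanges.append((question, answer))

    def history(self) -> List[Tuple[str, str]]:
        return list(self.exchanges)


memory_demo = WindowMemory(k=3)
for i in range(1, 6):
    memory_demo.save(f"question {i}", f"answer {i}")

# Only exchanges 3, 4 and 5 remain; 1 and 2 were pushed out of the window.
print(memory_demo.history())
```

A small window keeps the condensed-question prompt short, at the cost of forgetting anything said more than `k` turns ago.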


In [71]:
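search_query = " What is AIT?"
# Call the chain once: every call also writes to memory,
# so calling it twice stores the same exchange twice in chat_history.
result = chain({"question": search_query})
answer = result['answer']
source = result['source_documents']
print(answer)
print(source)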



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content=' What is AIT?'), AIMessage(content='<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.\n'), HumanMessage(content='Are there any scholarship'), AIMessage(content='<pad> Yes,  there  are  several  scholarships  available  for  postgraduate  studies  at  AIT.  Please  check  the  scholarship  website  for  more  information.\n'), HumanMessage(content='Are there any scholarship'), AIMessage(content='<pad> Yes,  there  are  several  scholarships  available  for  postgraduate  studies  at  AIT.  Please  check  the  scholarship  website  for  more  information.\n')]
F

In [72]:
print(answer)
for i,doc in enumerate(source):
    print(i+1,doc)

<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.

1 page_content='documents from AIT.' metadata={'source': 'html_files\\https___ait_ac_th_study_open-master-of-engineering-science-in-interdisciplinary-studies-omis_.html'}
2 page_content='Welcome to AIT' metadata={'source': 'html_files\\https___ait_ac_th_.html'}
3 page_content='Home >\n\nAbout\n\nAbout AIT\n\nAIT is an international English-speaking postgraduate institution, focusing on engineering, environment, and management studies.\n\nIn this section\n\nAbout AIT' metadata={'source': 'html_files\\https___ait_ac_th_about_.html'}
4 page_content='Explore AIT campus' metadata={'source': 'html_files\\https___ait_ac_th_.html'}
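
The answer above still contains stray `<pad>` tokens and doubled spaces from the model's decoding. Before showing answers in the demo, a small post-processing step could strip them. The sketch below is one possible approach and is not part of the original notebook; the helper name `clean_answer` is hypothetical.

```python
import re


def clean_answer(text: str) -> str:
    """Remove stray pad tokens and collapse repeated whitespace."""
    # Matches "<pad>" as well as the spaced variant "< pad>" seen in the output.
    text = re.sub(r"<\s*pad\s*>", "", text)
    # split() with no arguments also drops newlines and leading/trailing spaces.
    return " ".join(text.split())


raw = "<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution.\n"
print(clean_answer(raw))
```

Applying the helper to `answer` right after the chain call would leave the retrieved `source_documents` untouched.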


In [73]:
search_query = "Are there any scholarship"
# Call the chain once so the exchange is stored in memory only once.
result = chain({"question": search_query})
answer = result['answer']
source = result['source_documents']



> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
[HumanMessage(content='Are there any scholarship'), AIMessage(content='<pad> Yes,  there  are  several  scholarships  available  for  postgraduate  studies  at  AIT.  Please  check  the  scholarship  website  for  more  information.\n'), HumanMessage(content=' What is AIT?'), AIMessage(content='<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.\n'), HumanMessage(content=' What is AIT?'), AIMessage(content='<pad> < pad>  AIT  is  an  international  English-speaking  postgraduate  institution,  focusing  on  engineering,  environment,  and  management  studies.\n')]
Follow Up Input: Ar

In [74]:
print(answer)
for i,doc in enumerate(source):
    print(i+1,doc)

<pad> Yes,  there  are  several  scholarships  available  for  postgraduate  studies  at  AIT.  Please  check  the  scholarship  website  for  more  information.

1 page_content='AIT offers a limited number of financial awards in the form of scholarships for the Master’s and Doctoral programs, on a highly competitive basis, to applicants who have been evaluated as' metadata={'source': 'html_files\\https___ait_ac_th_admissions_frequently-asked-questions_.html'}
2 page_content='9. Do I need to apply for the scholarships listed on the AIT scholarship website?' metadata={'source': 'html_files\\https___ait_ac_th_admissions_frequently-asked-questions_.html'}
3 page_content='Yes, we have academic exchange programs with various partner universities in Asia and Europe. AIT has a limited number of scholarships to assist our students to go on academic exchange. You can apply' metadata={'source': 'html_files\\https___ait_ac_th_admissions_frequently-asked-questions_.html'}
4 page_content='Yes, we h