# Natural Language Processing

# Retrieval-Augmented Generation (RAG)

RAG is a technique for augmenting LLM knowledge with additional, often private or real-time, data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, which ends at a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the model's knowledge with the specific information it needs.

<img src="../figures/RAG-process.png" >

Introducing `ChakyBot`, an innovative chatbot designed to assist Chaky (the instructor) and TA (Gun) in explaining the lesson of the NLP course to students. Leveraging LangChain technology, ChakyBot excels in retrieving information from documents, ensuring a seamless and efficient learning experience for students engaging with the NLP curriculum.

1. Prompt
2. Retrieval
3. Memory
4. Chain

In [2]:
#langchain library
!pip install langchain==0.0.350
#LLM
!pip install accelerate==0.25.0
!pip install transformers==4.36.2
!pip install bitsandbytes==0.41.2
#Text Embedding
!pip install sentence-transformers==2.2.2
!pip install InstructorEmbedding==1.0.1
#vectorstore
!pip install pymupdf==1.23.8
!pip install faiss-gpu==1.7.2
!pip install faiss-cpu==1.7.4

'pip' is not recognized as an internal or external command,
operable program or batch file.

In [3]:
import os
import torch
# Set GPU device
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device


IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html



device(type='cpu')

## 1. Prompt

A prompt is a set of instructions or input provided by a user to guide the model's response. It helps the model understand the context and generate relevant, coherent output, such as answering questions, completing sentences, or engaging in a conversation.

In [5]:
from langchain import PromptTemplate

prompt_template = """
    I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. 
    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.
    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.
    {context}
    Question: {question}
    Answer:
    """.strip()

    # I'm your friendly NLP chatbot named ChakyBot, here to assist Chaky and Gun with any questions they have about Natural Language Processing (NLP). 
    # If you're curious about how probability works in the context of NLP, feel free to ask any questions you may have. 
    # Whether it's about probabilistic models, language models, or any other related topic, 
    # I'm here to help break down complex concepts into easy-to-understand explanations.
    # Just let me know what you're wondering about, and I'll do my best to guide you through it!

PROMPT = PromptTemplate.from_template(
    template = prompt_template
)

PROMPT
#using str.format 
#The placeholder is defined using curly brackets: {} {}

PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. \n    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.\n    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.\n    {context}\n    Question: {question}\n    Answer:")

In [6]:
PROMPT.format(
    context = "I heard you want to go to AIT",
    question = "What is AIT"
)

"I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. \n    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.\n    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.\n    I heard you want to go to AIT\n    Question: What is AIT\n    Answer:"

Note : [How to improve prompting (Zero-shot, Few-shot, Chain-of-Thought, etc.)](https://github.com/chaklam-silpasuwanchai/Natural-Language-Processing/blob/main/Code/05%20-%20RAG/advance/cot-tot-prompting.ipynb)
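
As a rough illustration of what the curly-bracket placeholders do, the sketch below reproduces the idea with the standard library only: `string.Formatter` discovers the placeholder names and `str.format` fills them in. The shortened template text and the helper name `input_variables` are assumptions made for this example; this is not LangChain's implementation.

```python
from string import Formatter

# Shortened, illustrative template (not the notebook's full prompt).
template = "Context: {context}\nQuestion: {question}\nAnswer:"

def input_variables(template):
    """Return the sorted, de-duplicated placeholder names found in a template."""
    names = {field for _, field, _, _ in Formatter().parse(template) if field}
    return sorted(names)

variables = input_variables(template)
filled = template.format(
    context="I heard you want to go to AIT",
    question="What is AIT",
)

print(variables)
print(filled)
```

Calling `format` with a missing placeholder raises `KeyError`, which is one reason a template object that knows its own input variables is convenient.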

## 2. Retrieval

1. `Document loaders` : Load documents from many different sources (HTML, PDF, code). 
2. `Document transformers` : One of the essential steps in document retrieval is breaking down a large document into smaller, relevant chunks to enhance the retrieval process.
3. `Text embedding models` : Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar.
4. `Vector stores` : Databases that support efficient storage and searching of these embeddings.
5. `Retrievers` : Once the data is in the database, you still need to retrieve it.

### 2.1 Document Loaders 
Use document loaders to load data from a source as `Document` objects. A `Document` is a piece of text and its associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

[PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)

[Download Document](https://web.stanford.edu/~jurafsky/slp3/)

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader

nlp_docs = r'C:\Users\Tairo Kageyama\Documents\GitHub\Python-fo-Natural-Language-Processing-main\lab7\Data\AIT_web_reduced.pdf'

loader = PyMuPDFLoader(nlp_docs)
documents = loader.load()

# import PyPDF2
# from spacy import displacy

# reader = PyPDF2.PdfReader("Data/Resume.pdf")
# page = reader.pages[0]
# documents = page.extract_text()

In [8]:
documents

[Document(page_content='Home > Data Science and AI (DSAI)\nData Science and AI (DSAI)\nData Science (DS) is concerned with the extraction of useful knowledge from data sets.\xa0 It is closely\nrelated to the ﬁelds of computer science, mathematics, and statistics.\xa0 It is a relatively new term for a\nbroad set of skills spanning the more established ﬁelds of machine learning, data mining, databases, and\nvisualization, along with their applications in various ﬁelds.\xa0 In 2012, Harvard Business Review called\ndata science “The Sexiest Job of the 21\n Century”.\nAr�ﬁcial Intelligence (AI) is the broad ﬁeld conceived in 1956 as the automation or simulation of human\nintelligence.\xa0 AI has two primary “levels”.\xa0 The ﬁrst level, “narrow AI”, concerns perception, statistical\ninference, and actuation, drawing on data science, sensors, and robotics.\xa0 The second level, sometimes\ncalled “artiﬁcial general intelligence (AGI), is concerned with more complex or ﬂexible reasoning and\nd

In [9]:
len(documents)

8

In [10]:
documents[0]

Document(page_content='Home > Data Science and AI (DSAI)\nData Science and AI (DSAI)\nData Science (DS) is concerned with the extraction of useful knowledge from data sets.\xa0 It is closely\nrelated to the ﬁelds of computer science, mathematics, and statistics.\xa0 It is a relatively new term for a\nbroad set of skills spanning the more established ﬁelds of machine learning, data mining, databases, and\nvisualization, along with their applications in various ﬁelds.\xa0 In 2012, Harvard Business Review called\ndata science “The Sexiest Job of the 21\n Century”.\nAr�ﬁcial Intelligence (AI) is the broad ﬁeld conceived in 1956 as the automation or simulation of human\nintelligence.\xa0 AI has two primary “levels”.\xa0 The ﬁrst level, “narrow AI”, concerns perception, statistical\ninference, and actuation, drawing on data science, sensors, and robotics.\xa0 The second level, sometimes\ncalled “artiﬁcial general intelligence (AGI), is concerned with more complex or ﬂexible reasoning and\nde

### 2.2 Document Transformers

`RecursiveCharacterTextSplitter` is the recommended text splitter for generic text. It is parameterized by a list of separator characters and tries to split on them in order until the chunks are small enough.
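
To make the "try separators in order" idea concrete, here is a minimal pure-Python sketch. It is a simplification under stated assumptions: the function name `recursive_split` and the default separator list are chosen for this example, chunk overlap is omitted, and separators are dropped rather than kept, so it does not reproduce LangChain's exact chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is split
    again with the next, finer separator. Overlap is omitted for brevity.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

text = (
    "Paragraph one is short.\n\n"
    "Paragraph two is a little bit longer than the first one.\n\n"
    "Three."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Paragraph breaks are preferred as split points, and only the paragraph that exceeds the limit is broken further at word boundaries.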

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)

doc = text_splitter.split_documents(documents)

In [12]:
doc[1]

Document(page_content='intelligence.\xa0 AI has two primary “levels”.\xa0 The ﬁrst level, “narrow AI”, concerns perception, statistical\ninference, and actuation, drawing on data science, sensors, and robotics.\xa0 The second level, sometimes\ncalled “artiﬁcial general intelligence (AGI), is concerned with more complex or ﬂexible reasoning and\ndecision-making in less constrained domains.\nThe AIT Masters in DS&AI was designed in partnership with the Erasmus+ DS&AI consortium, a group\nof 15 European and Asian universities with the mission of bringing European-standard advanced\neducation to Asia.\nst\nWe use cookies on our website to give you the most relevant experience by remembering your preferences and', metadata={'source': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_reduced.pdf', 'file_path': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_reduced.pdf'

In [13]:
len(doc)

15

### 2.3 Text Embedding Models
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.

*Note* Instructor Model : [Huggingface](https://huggingface.co/hkunlp/instructor-base) | [Paper](https://arxiv.org/abs/2212.09741)
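
The notion of "most similar in the vector space" is usually measured with cosine similarity. The sketch below shows the computation on a toy stand-in for an embedding model: the `embed` function here just counts words, which is an assumption for illustration only. Word counts capture word overlap, not meaning, so this is not a substitute for the Instructor model.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = embed("data science program")
close = embed("the data science and ai program at ait")
far = embed("tuition fees and scholarships")

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

A real embedding model replaces `embed` with a dense vector produced by a neural network; the similarity computation stays the same.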

In [14]:
import torch
from langchain.embeddings import HuggingFaceInstructEmbeddings

model_name = 'hkunlp/instructor-base'

embedding_model = HuggingFaceInstructEmbeddings(
    model_name = model_name,
    model_kwargs = {"device" : device}
)

load INSTRUCTOR_Transformer
max_seq_length  512


### 2.4 Vector Stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.
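
The store-then-search behaviour described above can be sketched in a few lines. `TinyVectorStore` is a hypothetical class written for this example, with hand-made 3-dimensional vectors standing in for real embeddings; it scans every row, whereas a library such as FAISS uses optimized index structures.

```python
import heapq
import math

class TinyVectorStore:
    """Hypothetical in-memory store that keeps (unit vector, text) pairs."""

    def __init__(self):
        self._rows = []

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else list(vector)

    def add(self, vector, text):
        # Vectors are L2-normalized so the inner product equals cosine similarity.
        self._rows.append((self._normalize(vector), text))

    def similarity_search(self, query_vector, k=2):
        """Return the k (score, text) pairs with the highest inner product."""
        q = self._normalize(query_vector)
        scored = ((sum(a * b for a, b in zip(q, v)), t) for v, t in self._rows)
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "admissions")
store.add([0.0, 1.0, 0.0], "tuition")
store.add([0.9, 0.1, 0.0], "how to apply")

results = store.similarity_search([1.0, 0.05, 0.0], k=2)
print(results)
```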

In [15]:
#locate vectorstore
vector_path = '../vector-store'
if not os.path.exists(vector_path):
    os.makedirs(vector_path)
    print('create path done')

In [16]:
#save vector locally
from langchain.vectorstores import FAISS

vectordb = FAISS.from_documents(
    documents = doc,
    embedding = embedding_model
)

db_file_name = 'nlp_stanford'

vectordb.save_local(
    folder_path = os.path.join(vector_path, db_file_name),
    index_name = 'nlp' #custom index name (the default is 'index')
)

### 2.5 Retrievers
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
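
To illustrate that a retriever only has to return documents for a query, the sketch below implements one without any vector store: it ranks stored texts by word overlap. The class `KeywordRetriever` and its sample texts are hypothetical; the method name `get_relevant_documents` is chosen to mirror the call used in this notebook.

```python
class KeywordRetriever:
    """Hypothetical retriever that ranks texts by shared lowercase words."""

    def __init__(self, texts, k=2):
        self.texts = list(texts)
        self.k = k

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(text.lower().split())), text)
            for text in self.texts
        ]
        # Python's sort is stable, so equally scored texts keep their order.
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [text for score, text in ranked[: self.k] if score > 0]

retriever_demo = KeywordRetriever([
    "data science and ai program overview",
    "tuition fees and payment schedule",
    "research areas and laboratories",
])
hits = retriever_demo.get_relevant_documents("data science program")
print(hits)
```

Any object exposing the same method could be swapped in, which is what makes the retriever interface more general than a vector store.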

In [17]:
#calling vector from local
vector_path = '../vector-store'
db_file_name = 'nlp_stanford'

from langchain.vectorstores import FAISS

vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings = embedding_model,
    index_name = 'nlp', #must match the index_name used in save_local
    # allow_dangerous_deserialization=True
)   

In [18]:
#ready to use
retriever = vectordb.as_retriever()

In [19]:
retriever.get_relevant_documents("What is main topic?")

[Document(page_content='integration into the global economy. AIT focuses on assisting stakeholders build their capacity to\npromote sustainability through appropriate technology, relevant and applied research, sustainable\nframeworks for development and planning, informed policy making and practice applications in the\nregion.\xa0AIT’s ﬁve thematic areas of research are, namely, Climate Change; Smart Communities;\nInfrastructure; Technology, Policy, Society and Water-Energy-Food.\xa0\nIn this sec�on\nWe use cookies on our website to give you the most relevant experience by remembering your preferences and\nrepeat visits. By clicking “Accept All”, you consent to the use of ALL the cookies. However, you may visit "Cookie', metadata={'source': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_reduced.pdf', 'file_path': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_

In [20]:
retriever.get_relevant_documents("What is Claasfication?")

[Document(page_content='Home > Data Science and AI (DSAI)\nData Science and AI (DSAI)\nData Science (DS) is concerned with the extraction of useful knowledge from data sets.\xa0 It is closely\nrelated to the ﬁelds of computer science, mathematics, and statistics.\xa0 It is a relatively new term for a\nbroad set of skills spanning the more established ﬁelds of machine learning, data mining, databases, and\nvisualization, along with their applications in various ﬁelds.\xa0 In 2012, Harvard Business Review called\ndata science “The Sexiest Job of the 21\n Century”.\nAr�ﬁcial Intelligence (AI) is the broad ﬁeld conceived in 1956 as the automation or simulation of human', metadata={'source': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_reduced.pdf', 'file_path': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_reduced.pdf', 'page': 0, 'total_pages': 8, 'format': 'P

## 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the ChatMessageHistory class. This is a super lightweight wrapper that provides convenience methods for saving HumanMessages, AIMessages, and then fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.


In [21]:
# from langchain.memory import ChatMessageHistory

# history = ChatMessageHistory()
# history

In [22]:
# history.add_user_message('hi')
# history.add_ai_message('Whats up?')
# history.add_user_message('How are you')
# history.add_ai_message('I\'m quite good. How about you?')

In [23]:
# history

### 3.1 Memory types

There are many different types of memory. Each has its own parameters and return types, and each is useful in different scenarios. 
- Conversation Buffer
- Conversation Buffer Window

#### What variables get returned from memory

Before going into the chain, various variables are read from memory. These have specific names, which need to align with the variables the chain expects. You can see what these variables are by calling `memory.load_memory_variables({})`. Note that the empty dictionary passed in is just a placeholder for real variables. If the memory type you are using depends on the input variables, you may need to pass some in.

In the examples below, `load_memory_variables` returns a single key, `history`. This means that your chain (and likely your prompt) should expect an input named `history`. You can usually control this variable through parameters on the memory class. For example, to have the memory variables returned under the key `chat_history`, pass `memory_key = "chat_history"` when constructing the memory, as the final chain in Section 4 does.

#### Conversation Buffer
This memory allows for storing messages and then extracts the messages in a variable.

In [24]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: hi\nAI: What's up?\nHuman: How are you?\nAI: I'm quite good. How about you?"}

In [25]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages = True)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': [HumanMessage(content='hi'),
  AIMessage(content="What's up?"),
  HumanMessage(content='How are you?'),
  AIMessage(content="I'm quite good. How about you?")]}

#### Conversation Buffer Window
- it keeps a list of the interactions of the conversation over time. 
- it only uses the last K interactions. 
- it can be useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large.
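
The sliding-window behaviour can be sketched with a bounded `deque`. `WindowMemory` is a hypothetical class written for this example; its method names mirror the ones used in this notebook, but it is not LangChain's implementation.

```python
from collections import deque

class WindowMemory:
    """Hypothetical sliding-window memory keeping only the last k exchanges."""

    def __init__(self, k=1, memory_key="history"):
        self.memory_key = memory_key
        self._turns = deque(maxlen=k)  # older exchanges fall off automatically

    def save_context(self, inputs, outputs):
        self._turns.append((inputs["input"], outputs["output"]))

    def load_memory_variables(self, _inputs=None):
        lines = [f"Human: {human}\nAI: {ai}" for human, ai in self._turns]
        return {self.memory_key: "\n".join(lines)}

window = WindowMemory(k=1, memory_key="chat_history")
window.save_context({"input": "hi"}, {"output": "What's up?"})
window.save_context({"input": "How are you?"}, {"output": "I'm quite good. How about you?"})
print(window.load_memory_variables())
```

With `k=1` only the most recent exchange survives, and the `memory_key` argument decides the dictionary key the chain will see.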

In [26]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=1)
memory.save_context({'input':'hi'}, {'output':'What\'s up?'})
memory.save_context({"input":'How are you?'},{'output': 'I\'m quite good. How about you?'})
memory.load_memory_variables({})

{'history': "Human: How are you?\nAI: I'm quite good. How about you?"}

## 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- it consists of a `PromptTemplate` and a language model (either an LLM or a chat model).
- it formats the prompt template using the input key values provided (and also memory key values, if available), 
- it passes the formatted string to LLM and returns the LLM output.
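
The three steps above can be sketched without any model at all. In this example `fake_llm` is a stub that always returns the same text and `run_llm_chain` is a hypothetical helper; both are assumptions for illustration, and the returned `text` key simply mirrors the output shape seen later in this notebook.

```python
seen_prompts = []  # records what the stub model receives

def fake_llm(prompt):
    """Stand-in for a real model call; always returns the same text."""
    seen_prompts.append(prompt)
    return "stub answer"

def run_llm_chain(template, llm, inputs, memory_variables=None):
    """Format the template with inputs (plus memory), then call the model."""
    values = {**(memory_variables or {}), **inputs}
    prompt = template.format(**values)
    return {**inputs, "text": llm(prompt)}

template = "History: {history}\nQuestion: {question}\nAnswer:"
result = run_llm_chain(
    template,
    fake_llm,
    inputs={"question": "What is AIT"},
    memory_variables={"history": "Human: hi"},
)
print(seen_prompts[0])
print(result)
```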

Note : [Download Fastchat Model Here](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)

In [27]:
%cd ./models
!git clone https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

c:\Users\Tairo Kageyama\Documents\GitHub\Python-fo-Natural-Language-Processing-main\lab7\models


fatal: destination path 'fastchat-t5-3b-v1.0' already exists and is not an empty directory.


In [28]:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM
from transformers import BitsAndBytesConfig
from langchain import HuggingFacePipeline
from langchain_community.llms.huggingface_endpoint import (
    HuggingFaceEndpoint,
)
import torch

model_id = '../models/fastchat-t5-3b-v1.0/'
print(1)
tokenizer = AutoTokenizer.from_pretrained(
    model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id
print(1)
bitsandbyte_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.float16,
    bnb_4bit_use_double_quant = True
)
print(2)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    # quantization_config = bitsandbyte_config, #caution Nvidia
    # device_map = 'auto',
    load_in_8bit = False
)
print(3)
pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens = 1024,
    model_kwargs = {
        "temperature" : 0,
        "repetition_penalty": 1.5
    }
)
print(4)
# hf_model_id = "facebook/bart-large-cnn"
# hf_endpoint_url = f"https://api-inference.huggingface.co/models/{hf_model_id}"
# llm = HuggingFaceEndpoint(
#     task="summarization",
#     endpoint_url=hf_endpoint_url,
# )
# llm = HuggingFacePipeline.from_model_id(
#     model_id=model_id,
#     task="text2text-generation",
#     model_kwargs={"temperature": 0, "max_length": 1000},
# )

llm = HuggingFacePipeline(pipeline = pipe)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1
1
2
3
4


### [Class ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/_modules/langchain/chains/conversational_retrieval/base.html#ConversationalRetrievalChain)

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator`: The chain used to generate a new question for the sake of retrieval. This chain will take in the current question (with variable question) and any chat history (with variable chat_history) and will produce a new standalone question to be used later on.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function to get a string of the chat history. If None is provided, will use a default.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.
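
How these components fit together can be sketched as a three-stage flow: condense the question, retrieve documents, then answer from the combined context. Everything below is a stub under stated assumptions: `condense`, `retrieve`, and `answer` are hypothetical stand-ins for `question_generator`, `retriever`, and `combine_docs_chain`, and the two corpus sentences paraphrase facts from the loaded PDF. Skipping the condense step when the history is empty matches the trace shown later in this notebook, where the first question goes straight to the document chain.

```python
import re

STOP_WORDS = {"what", "is", "the", "a", "how", "to"}

def keywords(text):
    """Lowercase word set without a few stop words (toy tokenizer)."""
    return set(re.findall(r"\w+", text.lower())) - STOP_WORDS

def condense(chat_history, question):
    """Stub question_generator: attach the previous user turn for context."""
    previous_question = chat_history[-1][0]
    return f"{question} (follow-up to: {previous_question})"

def retrieve(query):
    """Stub retriever over two sentences paraphrased from the loaded PDF."""
    corpus = ["DSAI stands for Data Science and AI.", "Coursework is 36 credits."]
    return [doc for doc in corpus if keywords(query) & keywords(doc)]

def answer(context, question):
    """Stub combine_docs_chain: a real one would prompt the LLM here."""
    return f"(stub) answered '{question}' from {len(context)} characters of context"

def conversational_retrieval(question, chat_history):
    # With no history there is nothing to resolve, so the question is used as-is.
    standalone = condense(chat_history, question) if chat_history else question
    docs = retrieve(standalone)
    context = "\n\n".join(docs)  # 'stuff' strategy: concatenate every document
    return {
        "question": question,
        "generated_question": standalone,
        "answer": answer(context, standalone),
        "source_documents": docs,
    }

first = conversational_retrieval("What is DSAI?", chat_history=[])
history = [("What is DSAI?", first["answer"])]
second = conversational_retrieval("How many credits is the coursework?", history)
print(first["source_documents"])
print(second["generated_question"])
```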


`question_generator`

In [29]:
from langchain.chains import LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain

In [30]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:')

In [32]:
question_generator = LLMChain(
    llm = llm,
    prompt = CONDENSE_QUESTION_PROMPT,
    verbose = True
)

In [33]:
query = 'What is AIT'
chat_history = "I am thinking to go to AIT"

# question_generator({'context' : chat_history, "question" : query})
question_generator({'chat_history' : chat_history, "question" : query})




> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
I am thinking to go to AIT
Follow Up Input: What is AIT
Standalone question:



> Finished chain.


{'chat_history': 'I am thinking to go to AIT',
 'question': 'What is AIT',
 'text': '<pad> What  is  AIT?\n'}

`combine_docs_chain`

In [34]:
llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x000001B69DD611B0>)

In [35]:
doc_chain = load_qa_chain(
    llm = llm,
    chain_type = 'stuff',
    prompt = PROMPT,
    verbose = True
)
doc_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. \n    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.\n    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x000001B69DD611B0>)), document_variable_name='context')

In [36]:
query = "What is Transformers?"
input_document = retriever.get_relevant_documents(query)

doc_chain({'input_documents':input_document, 'question':query})



> Entering new StuffDocumentsChain chain...




> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. 
    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.
    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.
    integration into the global economy. AIT focuses on assisting stakeholders build their capacity to
promote sustainability through appropriate technology, relevant and applied research, sustainable
frameworks for development and planning, informed policy making and practice applications in the
region. AIT’s ﬁve thematic areas of research are, namely, Climate Change; Smart Communities;
Infrastructure; Technology, Policy, Society and Water-Energy-Food. 
In this sec�on
We use cookies on our website to give you the most r

{'input_documents': [Document(page_content='integration into the global economy. AIT focuses on assisting stakeholders build their capacity to\npromote sustainability through appropriate technology, relevant and applied research, sustainable\nframeworks for development and planning, informed policy making and practice applications in the\nregion.\xa0AIT’s ﬁve thematic areas of research are, namely, Climate Change; Smart Communities;\nInfrastructure; Technology, Policy, Society and Water-Energy-Food.\xa0\nIn this sec�on\nWe use cookies on our website to give you the most relevant experience by remembering your preferences and\nrepeat visits. By clicking “Accept All”, you consent to the use of ALL the cookies. However, you may visit "Cookie', metadata={'source': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\lab7\\Data\\AIT_web_reduced.pdf', 'file_path': 'C:\\Users\\Tairo Kageyama\\Documents\\GitHub\\Python-fo-Natural-Language-Processing-main\\

In [37]:
memory = ConversationBufferWindowMemory(
    k=3, 
    memory_key = "chat_history",
    return_messages = True,
    output_key = 'answer'
)

chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True,
    memory=memory,
    verbose=True,
    get_chat_history=lambda h : h
)
chain

ConversationalRetrievalChain(memory=ConversationBufferWindowMemory(output_key='answer', return_messages=True, memory_key='chat_history', k=3), verbose=True, combine_docs_chain=StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], template="I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. \n    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.\n    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.\n    {context}\n    Question: {question}\n    Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x000001B69DD611B0>)), document_variable_name='context'), question_generator=LLMChain(verbose=True, prompt=PromptTemp

In [38]:
# import pickle
# with open("chain.pkl", "wb") as f:
#     pickle.dump(chain, f)

PicklingError: Can't pickle <function <lambda> at 0x000001B73E133910>: attribute lookup <lambda> on __main__ failed

## 5. Chatbot

In [None]:
prompt_question = "Who are you by the way?"
answer = chain({"question":prompt_question})
answer



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. 
    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.
    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.
    Intelligence: Problem Solving and Planning · Ar�ﬁcial
Intelligence: Knowledge Representa�on and Reasoning
· Computer Vision · Ar�ﬁcial Intelligence: Natural
Language Understanding · Recent Trend in Machine
Learning · Mul�criteria Op�miza�on and Decision
Analysis
Total Credits
Coursework
36 credits
Research Study
Credits
12 credits
TOTAL CREDIT
REQUIREMENT
48 credits

intelligence.  AI has two primary “l

{'question': 'Who are you by the way?',
 'chat_history': [],
 'answer': "<pad>  I  am  AIT-GPT,  a  chatbot  designed  to  assist  you  with  information  about  AIT.  I  am  a  language  model  trained  on  AIT's  website  and  can  provide  information  about  AIT's  programs,  degrees,  admissions,  tuitions  fee,  ongoing  research,  anything  about  AIT.  I  am  not  a  human  being  and  I  am  not  able  to  make  decisions  or  take  actions  on  my  own.  I  am  here  to  assist  you  with  any  questions  you  have  about  AIT.\n",
 'source_documents': [Document(page_content='Intelligence: Problem Solving and Planning · Ar�ﬁcial\nIntelligence: Knowledge Representa�on and Reasoning\n· Computer Vision · Ar�ﬁcial Intelligence: Natural\nLanguage Understanding · Recent Trend in Machine\nLearning · Mul�criteria Op�miza�on and Decision\nAnalysis\nTotal Credits\nCoursework\n36 credits\nResearch Study\nCredits\n12 credits\nTOTAL CREDIT\nREQUIREMENT\n48 credits', metadata={'source': 'C

In [39]:
prompt_question = "What is DSAI?"
context = "I want to go to AIT"
answer = chain({"question" : prompt_question})
answer['answer']



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
I'm your friendly chatbot named AIT-GPT, here to assist you with any questions you have about AIT. 
    If you are interesting about Background of AIT, Programs, Degrees, Admissions, Tuitions fee, Ongoing Research, anything about AIT, please feel free to ask.
    I am covering everything in official web page. So if you felt lazy to search in web page by yourself, no hesitate to asking me.
    Home > Data Science and AI (DSAI)
Data Science and AI (DSAI)
Data Science (DS) is concerned with the extraction of useful knowledge from data sets.  It is closely
related to the ﬁelds of computer science, mathematics, and statistics.  It is a relatively new term for a
broad set of skills spanning the more established ﬁelds of machine learning, data mining, databases, and
visualization, along with their 

'<pad>  Data  Science  and  Artificial  Intelligence  (DSAI)  is  a  broad  term  for  a  set  of  skills  spanning  the  more  established  fields  of  machine  learning,  data  mining,  databases,  and  visualization,  along  with  their  applications  in  various  fields.  It  is  concerned  with  the  extraction  of  useful  knowledge  from  data  sets  and  the  application  of  these  skills  in  various  fields.\n'

In [None]:
# prompt_question = "What is Eligibility?"
# answer = chain({'context' : chat_history, "question":prompt_question})
# answer
