In [None]:
%%capture
# We first install all the packages required to build our chatbot.
# %%capture must be the first line of the cell; it suppresses the long trail of installation output.
!pip install -Uqqq pip --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq transformers==4.33.2 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -qqq xformers==0.0.21 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off
!pip install -qqq tokenizers==0.14.0 --progress-bar off
!pip install -qqq optimum==1.13.1 --progress-bar off
!pip install -qqq auto-gptq==0.4.2 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --progress-bar off
!pip install -qqq unstructured==0.10.16 --progress-bar off

In [None]:
!pip install -q langchain pypdf


Using Llama 2 is as easy as using any other Hugging Face model. We simplify the process further with the HuggingFacePipeline wrapper from LangChain. To load the 7B chat version of the model on a single GPU, we use a GPTQ-quantized version of it (GPTQ is a post-training quantization method for GPT-style models).

## Setup of the model

In [None]:
# We import the packages and classes that we will use in the setup part.
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

In [None]:
# We store the Hugging Face model ID in MODEL_NAME.
MODEL_NAME = "TheBloke/Llama-2-7B-Chat-GPTQ"

Llama 2 is a collection of pretrained and fine-tuned generative text models.

The Llama 2 models range from 7 to 70 billion parameters. They were developed by Meta, are openly available, and serve as the base model for our chatbot.

In [None]:
# We load the tokenizer that is associated with the pre-trained model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

# AutoTokenizer is a class that provides a unified interface for loading the tokenizers of various
# pre-trained language models.

# from_pretrained loads the pre-trained tokenizer that belongs to the given model.
# In our case it is the Llama 2 model.

# use_fast=True indicates that the fast version of the tokenizer should be used if it is available.
# Fast tokenizers are optimized for speed.




In [None]:
# We load the model with the AutoModelForCausalLM that is used to load pre-trained models
# designed for causal language modeling tasks (predicting the next word in sequence).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)

# torch_dtype=torch.float16 loads the model's floating-point tensors in half precision to save GPU memory.
# This can result in some loss of precision in the model's calculations.

# trust_remote_code=True allows custom model code shipped in the model repository
# (for example, on the Hugging Face Model Hub) to be executed when loading. Only enable it for repositories you trust.

# device_map="auto" controls the device on which the model is loaded.
# With "auto", the library places the model weights on the available devices automatically.



In [None]:
# We load the default generation configuration that is published with the model.
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)

In [None]:
# We set the maximum number of new tokens in the generated text to 1024.
# 1024 is a conventional power-of-two choice, not a hard requirement.
generation_config.max_new_tokens = 1024

# max_new_tokens caps how many tokens the model may generate in a single
# generation call (the tokens of the prompt are not counted).

# This configuration is useful when we want to control the length of the generated text to avoid
# excessively long outputs.

# This means that we limit the response length of our chatbot to 1024 tokens.

In [None]:
# We set the temperature of the model.
# The temperature is a parameter that controls the randomness of the generated text.
generation_config.temperature = 0.0001

# When the temperature is set very low, e.g. 0.0001, the generation becomes nearly deterministic and focused.
# This is useful since we would like more controlled and predictable responses in the text generation.
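
To illustrate the effect of the temperature, here is a minimal sketch in plain Python. It is not the code that transformers runs internally; the function name `softmax_with_temperature` and the three example scores are assumptions made up for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(logits, 1.0)
probs_cold = softmax_with_temperature(logits, 0.0001)

print([round(p, 3) for p in probs_default])  # probability is spread over all tokens
print([round(p, 3) for p in probs_cold])     # almost all probability on the top token
```

With a temperature close to zero the highest-scoring token takes practically all of the probability mass, which is why the chatbot's answers become repeatable.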

In [None]:
# We set the top_p parameter.
# With top_p set to 0.95, the model samples only from the smallest set of most probable tokens
# whose cumulative probability reaches 95 %.
generation_config.top_p = 0.95

# This helps in selecting from a diverse set of words while avoiding extremely low-probability words.
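
The idea behind top_p (nucleus) filtering can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation; the helper `top_p_filter` and the example probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so that the kept probabilities sum to 1 again.
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-token probabilities.
probs = {"the": 0.50, "a": 0.30, "this": 0.16, "zebra": 0.04}
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

In this example the unlikely token "zebra" falls outside the 95 % nucleus and can no longer be sampled.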

In [None]:
# We choose to use sampling during the generation. This allows us to introduce some degree of
# randomness and creativity.
generation_config.do_sample = True

# This is done so that the model can explore different possibilities in the generated text
# rather than always choosing the most probable next word.
# Note that with the very low temperature set above, the sampling is nearly greedy in practice.

In [None]:
# We set the repetition penalty to 1.15 (values above 1.0 penalize tokens that have already appeared)
# to discourage repetition and encourage the chatbot to give more diverse and varied responses.
generation_config.repetition_penalty = 1.15

# This can be an important step when the model generates longer sequences of text.
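
As a rough sketch of how such a penalty can act on the model's scores, the snippet below lowers the score of every token that has already been generated. The function `apply_repetition_penalty` and the example values are assumptions for illustration; the rule (divide positive scores, multiply negative ones) follows the commonly used formulation.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token id that was already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores are multiplied,
        # so the token becomes less likely in both cases.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.3, -1.0, 0.5, 1.2]  # hypothetical scores for token ids 0..3
generated_ids = [0, 1, 0]       # ids 0 and 1 have already been produced
penalised = apply_repetition_penalty(logits, generated_ids, penalty=1.15)
print(penalised)
```

Tokens that have not appeared yet (ids 2 and 3 here) keep their original scores.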

In [None]:
# We create a text generation pipeline using the pipeline function.
# This ties together the model, the tokenizer and the generation settings defined above.
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

# We wrap the text generation pipeline in LangChain's HuggingFacePipeline and pass a temperature of 0 via model_kwargs.
# A temperature of 0 means that the text will be more deterministic. A higher value would give more creative answers.
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})


## Loading external data

To work with external files, LangChain provides data loaders that can be used to load documents from various sources. Combining LLMs with external data is generally referred to as Retrieval Augmented Generation (RAG).

In [None]:
# We import the loader that we will use to read the PDF that serves as our database.
from langchain.document_loaders import PyPDFLoader

In [None]:
# With the PyPDFLoader class we are able to read the text of the PDF.
loader = PyPDFLoader("https://github.com/SeniorHreff/Assignment4chatbot/raw/main/AAURulebook dokumenter 2.pdf")

In [None]:
# We load the document and store the result in "docs" (one entry per page).
docs = loader.load()

# This process will take a while due to the size of the database.

In [None]:
# After we load in the data, we check the length of the database to check that it has loaded all 288 pages.
len(docs)

288

The file we're loading is the AAU Staff Handbook, which has information for staff at Aalborg University on rules, policies and procedures.

We manually joined all of the sections of the handbook together in one document to be able to load it as one document.

We want to split the document into smaller chunks. To do this we will use RecursiveCharacterTextSplitter.

In [None]:
# We import the necessary package that we will use to split the text.
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# We create a text splitter to separate the document into smaller chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)

# chunk_size is measured in characters; 1024 is a convenient power-of-two value.
# chunk_overlap=64 is 1/16 of the chunk size, so neighbouring chunks share a little context.
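
To show what chunk_size and chunk_overlap mean, here is a deliberately simplified sketch that cuts a string into fixed-size windows. It is not the algorithm of RecursiveCharacterTextSplitter, which additionally tries to split on paragraph, line and word boundaries; the function `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"  # placeholder text, 26 characters
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk starts with the last two characters of the previous one, which is the role the 64-character overlap plays for our 1024-character chunks.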

In [None]:
# We apply the text splitter to the loaded pages, using the settings defined in the cell above.
texts = text_splitter.split_documents(docs)

In [None]:
# We check to see if it has split the text into small chunks.
len(texts)

# If the number of chunks is larger than the number of pages in the database, we know that the split works.
# Here we get 831 chunks from 288 pages, i.e. roughly 2.9 chunks per page.
# The exact ratio depends on the chunk size and on how much text each page contains.

831

Splitting the document into chunks is required due to the limited number of tokens an LLM can look at once (4096 for Llama 2). Next, we will use the HuggingFaceEmbeddings class to create embeddings for the chunks:

In [None]:
# We import the necessary package that we will use to run our text embedder
from langchain.embeddings import HuggingFaceEmbeddings

In [None]:
# We setup the function that will be embedding our database.
embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# We use the gte-large (General Text Embedding) model from Hugging Face. The GTE models are based on the BERT framework.
# Compared to gte-base, gte-large produces larger embeddings (1024 instead of 768 dimensions); it is bigger and slower,
# but typically more accurate. The model works exclusively on English texts.
# model_kwargs defines which device the model runs on. Here it is "cuda", meaning that a GPU is used to embed the text.
# This is why it is important to check that a GPU is attached to the runtime when running the code in Google Colab.
# encode_kwargs specifies that the embeddings should be normalized (scaled to unit length).


In [None]:
# We now generate an embedding using the Hugging Face model.
query_result = embeddings.embed_query(texts[0].page_content)

# embed_query(texts[0].page_content) takes a piece of text as its query
# (here the page content of the first chunk in the 'texts' list) and returns its embedding vector.

In [None]:
# The embedding vector has 1024 dimensions, which is the output size of gte-large.
# That it equals our chunk_size is a coincidence; chunk_size counts characters.
print(len(query_result))

1024


We use a free, open embedding model hosted on Hugging Face. We store the embeddings in a Chroma database, which caches them and makes them easy to search.

To combine the LLM with the database, we'll use the RetrievalQA chain.

In [None]:
# We import the necessary package that we will use to store/cache the embeddings
from langchain.vectorstores import Chroma

In [None]:
# We build a Chroma vector store from the chunks and their embeddings, so that the chatbot
# can search the whole database when generating an output.
db = Chroma.from_documents(texts, embeddings, persist_directory="db")

# When persist_directory is specified, the collection will be persisted there.
# Otherwise, the data will be ephemeral in-memory.

## Similarity check

In [None]:
# We do a similarity check to see whether the embeddings work and whether the search is able to find
# something similar to what we ask for.
results = db.similarity_search("alcohol", k=2)

# By setting k=2 we decide on the number of neighbors or similar texts to retrieve from the search.
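
Conceptually, a similarity search compares the embedding of the query with the stored embeddings and returns the k closest texts. The sketch below shows this with cosine similarity on tiny made-up vectors. The functions `cosine_similarity` and `top_k`, the three stored texts and their 3-dimensional vectors are all hypothetical, and Chroma's actual index and distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Made-up 3-dimensional "embeddings"; real gte-large vectors have 1024 dimensions.
stored = [
    ("alcohol policy", [0.9, 0.1, 0.0]),
    ("sick leave", [0.1, 0.9, 0.1]),
    ("staff parties", [0.7, 0.2, 0.3]),
]
query_vector = [1.0, 0.0, 0.1]
nearest = top_k(query_vector, stored, k=2)
print(nearest)
```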

In [None]:
# We now print out the first result that it found.
print(results[0].page_content)

to moderate their alcohol consumption. Exceptions At special occasions, such as
Christmas parties, anniversaries, receptions, parties, various social events,
etc., the management may authorise the consumption of alcohol during working
hours. Staff members’ and managers’ rights and duties All managers are
responsible for ensuring that alcohol or drugs are not part of the working
culture. All managers have a duty to intervene if staff members disregard the
rules by using or abusing drugs and alcohol. If a manager becomes aware of a
staff member having an addiction problem, the manager should support their
rehabilitation by finding out what services are available to the staff member
through their municipality of residence. The staff member has a duty – as well
as a responsibility – to work on solving the addiction problem and to improve
work performance. Treatment All staff members can avail themselves of help and
treatment for addiction through their municipality of residence. Staff memb

## Prompt template

In [None]:
# We import the classes that we will use to build the retrieval chain and the prompt template.
from langchain.chains import RetrievalQA
from langchain import PromptTemplate

In [None]:
# We set up the template for the chatbot.
# This decides on how the model will react and how it will answer based on the prompt that it is given.
# By specifying that it is a guidance counselor, it keeps the answers more professional
# and keeps it in line with the purpose of its use.
# To make sure that the model does not give us complex answers we ask it to answer in a way that an average person can understand.
template = """
<s>[INST] <<SYS>>
Act as a guidance counselor that aims at providing a clear and simple answer. If relevant, provide your clear and simple answer with a further explanation referring to exceptions and limitations.
Keep the answers simple, short and precise only referring to currently active rules and guidelines.
<</SYS>>

{context}

{question} [/INST]
"""

In [None]:
# We set up the PromptTemplate with two input variables: the question asked to the chatbot
# and the context retrieved for that question.
# This way the chatbot answers the question based on the input together with the needed context.
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
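
To see what the template produces once the placeholders are filled, here is a small sketch using Python's built-in str.format, which follows the same {placeholder} convention. The shortened system message, the context sentence and the question are placeholders made up for this illustration.

```python
# A shortened, hypothetical version of the template above.
demo_template = """
<s>[INST] <<SYS>>
Act as a guidance counselor that gives clear and simple answers.
<</SYS>>

{context}

{question} [/INST]
"""

filled = demo_template.format(
    context="Management may authorise alcohol at special occasions.",
    question="Can I drink champagne at a colleague's birthday?",
)
print(filled)
```

The model therefore always sees the system instruction first, then the retrieved handbook text, and finally the user's question.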

## Retrieval function and answering questions

In [None]:
# We set up the retrieval chain so that the chatbot is able to retrieve the information it needs to answer the prompt.
# chain_type="stuff" inserts ("stuffs") all retrieved chunks into the {context} slot of a single prompt; other chain types combine the chunks differently.
# The remaining arguments reuse the objects defined above: the LLM, the Chroma retriever (k=2 chunks per question) and the prompt.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
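
The "stuff" strategy can be sketched in plain Python: the retrieved chunks are joined into one context string, which is then placed into the prompt. The helper names `stuff_documents` and `build_prompt`, the blank-line separator and the example chunks are assumptions for illustration, not LangChain's own code.

```python
def stuff_documents(chunks, separator="\n\n"):
    """Join the retrieved chunks into a single context string."""
    return separator.join(chunks)

def build_prompt(template, chunks, question):
    """Fill the template with the stuffed context and the user's question."""
    return template.format(context=stuff_documents(chunks), question=question)

# Stand-ins for the k=2 chunks a retriever would return.
retrieved = ["Chunk about sickness reporting.", "Chunk about part-time staff."]
prompt_text = build_prompt("{context}\n\n{question}", retrieved, "What are the rules?")
print(prompt_text)
```

Because everything ends up in one prompt, the number of chunks (k) times the chunk size has to stay within the model's context window.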

In [None]:
# We import the necessary package that we will use to wrap the text and limit the width of the input
from textwrap import fill
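
As a quick illustration of how fill works together with strip(), consider the following sketch; the example sentence and the width of 40 are arbitrary choices for the demonstration.

```python
from textwrap import fill

# Hypothetical raw answer with leading spaces and a trailing newline.
raw = "   The management may authorise the consumption of alcohol during working hours at special occasions such as anniversaries.\n"

# strip() removes the surrounding whitespace, fill() breaks the text into lines of at most 40 characters.
wrapped = fill(raw.strip(), width=40)
print(wrapped)
```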

In [None]:
## Question 1

# Now that we have set up all the details of the chatbot, we can ask it questions to see if it works.
# We call the qa_chain defined above with the question and save the returned answer in the variable result.
result = qa_chain(
    "I work as a part-time employee at the university of AAU and i've now become sick. What are the rules regarding sickness for part-time employees?"
)

## Answer 1

# We now print out the answer of the chatbot to see whether it answers the question that we asked.
# The .strip() method is added to make sure that it only prints a cleaned-up version to the console.
# We limit the width of the output so that we don't have to scroll back and forth to read the full response.
print(fill(result["result"].strip(), width=80))

As a guidance counselor, I must inform you that the rules on reporting sickness
for part-time employees are similar to those for full-time employees, with some
exceptions and limitations. Here is an overview of the relevant rules and
procedures: Part-time employees must report their sickness to their section or
department, following the procedures applicable at their place of work, either
when they start working hours or at the time of day the sickness absence begins.
They must also provide information about the cause of their sickness absence,
including their own illness, child's illness, pregnancy-related illness, or
work-related injury (Section 56). To report sick, part-time employees are
obligated to provide information about the cause of their sickness absence,
which may include their own illness, child's illness, pregnancy-related illness,
or work-related injury. This information must be provided in order to register
the sick leave correctly and ensure possible reimbursement. All

For this question the model gives a very long answer.

This could be because the question does not provide enough context for the model to answer more precisely. It could also be because there are many rules regarding sickness, which makes it harder for the model to find the specific information we are interested in and present it in a short and simple answer.

In [None]:
## Question 2

# We use the qa_chain to perform question-answering with a new question.
# The input is a question related to the alcohol policies and exceptions thereof for staff members at Aalborg University during working hours.
# The result contains the generated answer by the QA system.
result = qa_chain(
    "My colleague turns 60 years old and has invited the section to cake and champagne. We are invited to come to the staff kitchen at 2 pm which is during my normal work time. Can I drink alcohol for this occassion?"
)

## Answer 2

# Print the generated answer after applying text wrapping.
# We limit the width of the output so that we don't have to scroll back and forth to read the full response.
print(fill(result["result"].strip(), width=80))

As a guidance counselor, it is important to note that these rules apply to all
staff members, including managers. The university takes the safety and well-
being of its employees seriously, and therefore prohibits the consumption of
alcohol during working hours. However, there are exceptions to this rule, such
as special occasions like Christmas parties or anniversaries where the
management may authorize the consumption of alcohol during working hours.  In
this case, since it is a special occasion and you have been explicitly invited
by your colleague, you may consume alcohol during your normal worktime. Please
keep in mind to always drink responsibly and to not overconsume, as the
university also provides support services for addiction problems.


For this question the model gives a rather short and precise answer.

We see that the answer starts out by explaining the general rules for alcohol consumption during working hours, and then states that there are some exceptions to the rules. It ends by saying which exception applies in this instance, based on the context provided in the question.

We consider this answer to be good since it is short, simple, and easy to understand.

In [None]:
## Question 3

# We use the qa_chain to perform question-answering with a new question.
# The input is a question related to the policies and rules that apply when a staff member experiences offensive behavior.
# The result contains the generated answer by the QA system.
result = qa_chain(
    "I am a staff member that have experienced offensive behavior from my colleague. I have forgotten the rules that apply in this situation. What should I do?"
)

## Answer 3

# Print the generated answer after applying text wrapping.
# We limit the width of the output so that we don't have to scroll back and forth to read the full response.
print(fill(result["result"].strip(), width=80))

The rules you are referring to are outlined in the following sections:  Section
1 - Reporting Offensive Behavior  You may report any offensive behavior by
speaking directly to the person engaging in such conduct or by reporting it
through proper channels. You can also seek assistance from your immediate
manager or supervisor if you feel uncomfortable addressing the issue yourself.
Section 2 - Investigating Offensive Behavior  When an incident of offensive
behavior is reported, management will investigate the matter promptly and take
appropriate action. Management will consider all factors before taking
disciplinary action against the offender. The investigation process includes
gathering information from witnesses, interviewing parties involved, and
reviewing any available documentation related to the incident.  Section 3 -
Taking Action Against Offenders  Management will take appropriate disciplinary
action when necessary based on the findings of the investigation. Disciplinary
termin

This answer refers directly to the specific sections in the staff handbook where you can find information regarding offensive behavior. It tells you in the first section what you should do, then follows that up with the proceedings that will take place after you have reported the instance. Due to the context that is given for the question it makes sense that the model's answer gives you more than just the initial action you should take.

The model also states that reporting/addressing this behavior is important.

We generally find this to be a good answer since it is simple and rather precise, but also tells you what will follow your report.

In this assignment we implemented an NLP pipeline using the LangChain library and Hugging Face Transformers. We set up a Question-Answering (QA) system employing a RetrievalQA chain with a predefined prompt template. This system uses a language model (llm) and a document vector store (Chroma) for efficient retrieval and processing. We then demonstrated the system's capability by answering questions related to rules and rights for the staff.

Additionally, we integrated text splitting to handle large documents efficiently and embedded the text content using a pre-trained language model.

Furthermore, we performed similarity search to find relevant information based on user queries. Finally, we summarized the results and presented them with proper text wrapping for readability. The overall system allows for querying, retrieval, and summarization of information.