[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Sciform/llm-assistant-rag-system/blob/main/mini-llm-assistant.ipynb)


# Mini LLM-assitant based on a RAG architecture.

This notebook contains a very small LLM assistant (chatbot) based on a RAG system.
RAG systems are necessary, if additional information such as company internal documents
is needed by the LLM to answer questions. The additional information is provided in a vector store/vector db in a RAG architecture.

In the docs-folder you find 3 sample documents, which are embedded in the vector store FAISS based on free embedding model accessible through HuggingFace, such as the LLM.
The LLM is also small enough to run the notebook locally or in the Google Colab.
The LLM assitant itself is built with LangChain.
If you ask questions related to your provided documents, you will see that the LLm will base it answers on the content of the relevant documents.

In [32]:
#%pip install --upgrade pip setuptools wheel
%pip install langchain langchain-huggingface langchain-community faiss-cpu sentence-transformers transformers huggingface_hub



In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

import os

# Set your Hugging Face Hub API access token (free signup at the HuggingFace website), remove the <>, but keep ""
# The token is needed to access the HuggingFace embedding model and LLM 
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "<YOUR_HUGGINGFACE_API_TOKEN>"

## Load and split your documents


In [None]:
# 1. Load documents

# Download the sample text files
!wget https://raw.githubusercontent.com/sciform/fhnw-mini-rag-system/main/docs/sample1.txt -P docs/
!wget https://raw.githubusercontent.com/sciform/fhnw-mini-rag-system/main/docs/sample2.txt -P docs/
!wget https://raw.githubusercontent.com/sciform/fhnw-mini-rag-system/main/docs/sample3.txt -P docs/

# TODO You can omit documents and check how the answer changes

# Load the documents from the text files
loader1 = TextLoader("docs/sample1.txt")
loader2 = TextLoader("docs/sample2.txt")
loader3 = TextLoader("docs/sample3.txt")
documents = loader1.load() + loader2.load() + loader3.load()

# 2. Split into chunks - only for larger documents
# text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=10)
# documents = text_splitter.split_documents(documents)

--2025-06-15 09:32:52--  https://raw.githubusercontent.com/sciform/fhnw-mini-rag-system/main/docs/sample1.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 363 [text/plain]
Saving to: ‘docs/sample1.txt.2’


2025-06-15 09:32:53 (5.61 MB/s) - ‘docs/sample1.txt.2’ saved [363/363]

--2025-06-15 09:32:53--  https://raw.githubusercontent.com/sciform/fhnw-mini-rag-system/main/docs/sample2.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 199 [text/plain]
Saving to: ‘docs/sample2.txt.2’


2025-06-15 09:32:53 (3.26 MB/s) - ‘docs/sample

## Setup a vector db and embed your documents

Create your vector database based here on FAISS and embed your documents.
Check whether you have the correct number of documents embedded and test the similarity search.

In [None]:
# 3. Embed document as vectors and store in the vector store FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(documents, embeddings)

In [36]:
# Check if all documents in the vector store (FAISS index) - there should be 3
print(f"Number of documents in vector store: {vector_store.index.ntotal}")

Number of documents in vector store: 3


In [37]:
# Test the functionality of the vector store to retrieve semantically similar documents
# Set search type therefore to similarity, k determines the number of returned documents

retriever = vector_store.as_retriever(search_type="similarity", k=1)
query = "What is a blue whale ?"
retrieved_docs = retriever.invoke(query, k=1)

# Print the retrieved documents - the content of the document should match
# the content of the question semantically
print("Documents retrieved:")
for doc in retrieved_docs[:len(retrieved_docs)]:
    print(f"Document: {doc.page_content}")

Documents retrieved:
Document: The blue whale is the largest animal known to have ever existed. Blue whales are marine mammals and can reach lengths of up to 30 meters. They primarily feed on tiny shrimp-like animals called krill.


## Load a small and free to use LLM from HuggingFace

In [None]:
# 4. Use a small free LLM from HuggingFace
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline

model_lama_base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model_falcon_base = "tiiuae/falcon-rw-1b-instruct"
model_falcon_instruct = "ericzzz/falcon-rw-1b-instruct-openorca"
model_mistral = "mistralai/Mistral-7B-v0.1"

# TODO You can adjust the temperature and max_new_tokens parameters
# and see how the answer changes

# Load text-generation pipeline locally (replace model with a small one for CPU)
pipe = pipeline(
    "text-generation",
    model=model_falcon_instruct,  # or smaller like "bigscience/bloomz-560m"
    temperature=0.1,      # specifies the "creativity" of the model
    max_new_tokens=100,
    device=-1  # GPU: 0, CPU: -1
)

# load the LLM
llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cpu


## Create the LLM Assitant (ChatBot)

In [None]:
# 5. Create RetrievalQA chain = ChatBot
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Create prompt

# TODO Here you can do some more prompt engineering i.e. add 
# "Use only the most relevant document, if unsure say 'I don't know'. 
# Answer as short as possible. Do not confuse animals."

prompt = ChatPromptTemplate.from_template(
    """Answer using only this documents:
    {context}

    Question: {input}"""
)

# Create document chain
document_chain = create_stuff_documents_chain(
    llm,
    prompt,
    document_separator="\n")

# append the vector store in retriever form to the chatbot (qa_chain)
retriever = vector_store.as_retriever(search_type="similarity", k=1)
llm_assistant = create_retrieval_chain(
    retriever,
    document_chain)


## Ask the LLM assistant questions ...


In [None]:
# 6. Ask the Chatbot a question

# TODO Here you can ask various questions, rerun the notebook and observe what the LLM-assistant answers

# pose your question here
question = "Who climbs Mount Everest?"
# ask the question by calling the invoke method
chatbot_complete_answer = llm_assistant.invoke({"input": question})


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## ... and print the answer

In [41]:
# 7. print the question and the answer obtained by the llm-assistant
print("Question: ", question)
print("\n")
print("LLM-Assistant:\n ", chatbot_complete_answer["answer"])
# The answer repeats the entire filled-out prompt, then the question and finally its answer

Question:  Who climbs Mount Everest?


LLM-Assistant:
  Human: Answer using only this documents:
    Mount Everest is Earth's highest mountain above sea level, located in the Himalayas. Its summit is 8,848 meters above sea level. Many climbers attempt to reach the top each year, despite the extreme conditions.
The blue whale is the largest animal known to have ever existed. Blue whales are marine mammals and can reach lengths of up to 30 meters. They primarily feed on tiny shrimp-like animals called krill.
The honey bee (Apis mellifera) is one of the most important and fascinating creatures on the planet, playing a critical role in the health of ecosystems and the production of food. Known for its distinctive buzzing sound and the production of honey, the honey bee is a key pollinator, facilitating the growth of many plants by transferring pollen between flowers.

    Question: Who climbs Mount Everest?

Answer: Many climbers attempt to reach the top of Mount Everest each year despite 

In [42]:
# print the entire data structure returned by the chatbot
import json

# Extract serializable data from the Document objects in the context list
serializable_answer = chatbot_complete_answer.copy()
serializable_answer['context'] = [doc.page_content for doc in chatbot_complete_answer['context']]

print(json.dumps(serializable_answer, indent=2))

{
  "input": "Who climbs Mount Everest?",
  "context": [
    "Mount Everest is Earth's highest mountain above sea level, located in the Himalayas. Its summit is 8,848 meters above sea level. Many climbers attempt to reach the top each year, despite the extreme conditions.",
    "The blue whale is the largest animal known to have ever existed. Blue whales are marine mammals and can reach lengths of up to 30 meters. They primarily feed on tiny shrimp-like animals called krill.",
    "The honey bee (Apis mellifera) is one of the most important and fascinating creatures on the planet, playing a critical role in the health of ecosystems and the production of food. Known for its distinctive buzzing sound and the production of honey, the honey bee is a key pollinator, facilitating the growth of many plants by transferring pollen between flowers."
  ],
  "answer": "Human: Answer using only this documents:\n    Mount Everest is Earth's highest mountain above sea level, located in the Himalayas.