# RAG chatbot project

**Disclaimer: Educational Use and Copyright Notice**

This chatbot is a non-commercial, educational project developed as part of the "DS24 Deep Learning" course requirements. The knowledge base for this chatbot was created using content from official Dragon Age game guides for research and demonstration purposes.

The use of this copyrighted material is for educational purposes only. This project is not intended for any commercial use, and there is no financial gain derived from its creation or operation. All copyrights and intellectual property rights for the Dragon Age franchise, its characters, lore, and related materials belong to their respective owners, primarily BioWare and Electronic Arts (EA).

No copyright infringement is intended. This project is a demonstration of the technical implementation of a Retrieval Augmented Generation (RAG) system.

In [1]:
# ==============================================================================
# AUTHOR: Amanda Sumner
# COURSE: DS24 Deep Learning - Kunskapskontroll 2
# DATE: May 25, 2025
#
# PROJECT: RAG Chatbot for Dragon Age Lore
#
# DESCRIPTION: This project implements a Retrieval Augmented Generation (RAG)
# chatbot designed to answer questions about the Dragon Age universe. The
# knowledge base is sourced from text-based PDF game guides. The chatbot
# uses the LangChain framework for the pipeline, Google's Gemini
# models for text embedding and generation, and ChromaDB as a persistent
# vector store for retrieval. The application is presented through
# a Streamlit web interface and adopts the persona of a Dragon Age NPC, 
# Chantry scholar Brother Genitivi, for its responses.
# ==============================================================================

In [2]:
import os
import getpass
import pandas as pd
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.chat_models import init_chat_model
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser



## LangChain

LangChain is a framework that simplifies and standardises the process of building an application on top of a Large Language Model (LLM). It speeds up development and reduces complexity compared with the alternative of using independent libraries and connecting each component from scratch.  
For a RAG chatbot I need to connect several different components: a document loader, a text splitter, a vector store, a retriever, a prompt template, and the LLM. LangChain connects all these pieces into a RAG chain.  
LangChain provides prebuilt integrations, which removes the need to write low-level code for each component and makes it easy to swap out a component, such as the vector store or the LLM, without having to rewrite the entire chain.  
The RAG chain defines the flow of data: the user's question goes to the retriever -> the retrieved context goes into the prompt -> the prompt goes to the LLM -> the LLM's output is parsed.
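
To make the data flow concrete, here is a minimal sketch in plain Python. It does not use LangChain: `toy_retriever`, `toy_prompt`, `toy_llm`, `toy_parser`, and `make_chain` are hypothetical stand-ins invented for illustration, and the tiny `knowledge` dictionary replaces the real vector store. The point is only that each step's output becomes the next step's input, and that one step can be swapped without touching the others.

```python
from functools import reduce
from typing import Callable

# Hypothetical stand-ins for the real components (these are NOT LangChain APIs).
def toy_retriever(question: str) -> dict:
    """Pretend retrieval: return every stored text whose key appears in the question."""
    knowledge = {"blight": "The Fifth Blight began in 9:30 Dragon."}
    hits = [text for key, text in knowledge.items() if key in question.lower()]
    return {"question": question, "context": "\n\n".join(hits)}

def toy_prompt(inputs: dict) -> str:
    """Fill the context and the question into a prompt string."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"

def toy_llm(prompt_text: str) -> str:
    """A real LLM call would go here; this stub only reports the prompt size."""
    return f"  (stub answer based on a {len(prompt_text)}-character prompt)  "

def toy_parser(raw_output: str) -> str:
    """Clean up the raw model output."""
    return raw_output.strip()

def make_chain(*steps: Callable) -> Callable:
    """Compose steps left to right, so each output feeds the next step."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

chain = make_chain(toy_retriever, toy_prompt, toy_llm, toy_parser)
print(chain("When did the fifth blight start?"))

# Swapping one component (here the LLM stub) leaves the rest of the chain untouched.
def other_llm(prompt_text: str) -> str:
    return "another stub"

swapped_chain = make_chain(toy_retriever, toy_prompt, other_llm, toy_parser)
print(swapped_chain("When did the fifth blight start?"))
```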

### LangSmith 

Loading a LangSmith project to use with LangChain: <https://smith.langchain.com/>  
LangSmith is a platform for monitoring and evaluating LLM applications.   
Here I enable tracing for the project "chatbot", mainly to track the app runs and LLM calls and to monitor the API token count and cost.

In [6]:
# Load variables from a local .env file, if one exists
load_dotenv()

if "LANGSMITH_API_KEY" not in os.environ:
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass(
        prompt="Enter your LangSmith API key (optional): "
    )

LANGSMITH_PROJECT_NAME = "chatbot"
if os.environ.get("LANGSMITH_API_KEY"):
    # A key is available: enable tracing for the "chatbot" project
    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_PROJECT"] = LANGSMITH_PROJECT_NAME
else:
    # No key was entered: switch tracing off instead of tracing without credentials
    os.environ["LANGSMITH_TRACING"] = "false"
    os.environ.pop("LANGSMITH_PROJECT", None)


### Reading PDF files from a directory

In [7]:
pdf_directory_path = "files/"

loader = DirectoryLoader(
    pdf_directory_path,
    glob="*.pdf", 
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
    )

all_loaded_pages = loader.load()

print(f"\n-*- Successfully loaded a total of {len(all_loaded_pages)} pages from all PDFs. -*-")

100%|██████████| 9/9 [01:40<00:00, 11.12s/it]


-*- Successfully loaded a total of 1929 pages from all PDFs. -*-





### Verification and quality check

Checking that all files have loaded and how many pages were loaded from each file.   
Then checking for empty pages.   
Finally, displaying a snippet from the middle page in each document to verify that text has loaded properly.

In [8]:
# Verification and quality check
if 'all_loaded_pages' in locals() and all_loaded_pages:
    # Creating a list of dictionaries, where each dict represents a page
    page_data = []
    for page_doc in all_loaded_pages:
        page_data.append({
            "source_file": os.path.basename(page_doc.metadata.get('source', 'N/A')), # Get the filename
            "page_number": page_doc.metadata.get('page', -1) + 1, # Page numbers are 0-indexed
            "content_length": len(page_doc.page_content),
            "content_snippet": page_doc.page_content[:300].replace('\n', ' ') + "..." # A text snippet
        })
    
    # Creating a pandas DataFrame
    df = pd.DataFrame(page_data)

    print("\n-*- Verification of Loaded Documents -*-")

    # 1. Checking if all PDFs were loaded and how many pages from each file
    print("\n[INFO] Page count per source file:")
    print(df['source_file'].value_counts())

    # 2. Checking for empty pages (content_length == 0)
    empty_pages = df[df['content_length'] == 0]
    if not empty_pages.empty:
        print("\n[WARNING] Found empty pages in the following files:")
        print(empty_pages)
    else:
        print("\n[INFO] No empty pages found. All loaded pages have content.")

else:
    print("Warning: 'all_loaded_pages' is not defined or is empty. Run the PDF loading block first.")


-*- Verification of Loaded Documents -*-

[INFO] Page count per source file:
source_file
DAIprimaguide.pdf    355
DAOguide2.pdf        338
CodexDAI.pdf         310
DAIguide.pdf         216
CodexDAO.pdf         177
DAOguide.pdf         152
DA2guide.pdf         136
CodexDA2.pdf         132
DAOAguide.pdf        113
Name: count, dtype: int64

           source_file  page_number  content_length content_snippet
722  DAIprimaguide.pdf           13               0             ...
843  DAIprimaguide.pdf          134               0             ...
844  DAIprimaguide.pdf          135               0             ...
881  DAIprimaguide.pdf          172               0             ...
886  DAIprimaguide.pdf          177               0             ...
903  DAIprimaguide.pdf          194               0             ...
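
The check above flags six pages in DAIprimaguide.pdf with no extracted text. One possible follow-up, not part of the original notebook, is to drop such pages before chunking. The sketch below assumes only that each page object exposes a `page_content` string, as the loaded pages do; `PageStub` and `drop_empty_pages` are hypothetical names used for illustration. Whether the text splitter already skips such pages on its own is not verified here.

```python
from dataclasses import dataclass, field

@dataclass
class PageStub:
    """Hypothetical stand-in for a loaded page (mirrors page_content and metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_empty_pages(pages):
    """Keep only pages whose text is non-empty after stripping whitespace."""
    return [page for page in pages if page.page_content.strip()]

pages = [
    PageStub("The Chantry teaches that...", {"page": 0}),
    PageStub("", {"page": 1}),       # empty page, e.g. a full-page image
    PageStub("   \n", {"page": 2}),  # whitespace only
]
kept = drop_empty_pages(pages)
print(f"Kept {len(kept)} of {len(pages)} pages.")  # prints: Kept 1 of 3 pages.
```

In the notebook this would amount to something like `all_loaded_pages = drop_empty_pages(all_loaded_pages)` before the splitting step.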


In [9]:
# Quality check: Display snippets from the middle of each document    
# Get a list of the unique source files from the DataFrame
source_files = df['source_file'].unique()
    
for source_file in source_files:
    print(f"\n-*- Checking snippets from: {source_file} -*-")
        
    # Get the subset of the DataFrame for the current file
    df_file = df[df['source_file'] == source_file]
                      
    # Check a page from the middle of the document, if it exists
    if len(df_file) > 2:
        middle_page_index = len(df_file) // 2
        print(f"  Snippet from page {df_file.iloc[middle_page_index]['page_number']}:")
        print(f"    '{df_file.iloc[middle_page_index]['content_snippet']}'")



-*- Checking snippets from: DA2guide.pdf -*-
  Snippet from page 69:
    'Dragon Age II Dragon Age II  Official Digital Strategy Guide Official Digital Strategy Guide    Staves Staves deal damage at range with bolts of energy and up close with physical blows (though that may not suit the disposition of most mages). They can score critical hits with their basic attacks lik...'

-*- Checking snippets from: DAOAguide.pdf -*-
  Snippet from page 57:
    'Dragon Age: Origins – Awakening Dragon Age: Origins – Awakening  Official Digital Strategy Guide Official Digital Strategy Guide    for PC, PS3, Xbox 360for PC, PS3, Xbox 360 Grey Warden Companions If you thought the companions who rallied with you against the archdemon were a fascinating lot, wait ...'

-*- Checking snippets from: DAOguide.pdf -*-
  Snippet from page 77:
    'Dragon Age: Origins Dragon Age: Origins  Official Digital Strategy Guide Official Digital Strategy Guide    for PC, PS3, Xbox 360for PC, PS3, Xbox 360 The party's h

### Chunking the loaded text

A chunk size of 1000 characters and a chunk overlap of 200 characters are a common starting point for RAG systems, balancing semantic context against retrieval precision. A chunk of 1000 characters contains approximately one or two complete paragraphs, so the immediate context is kept together. Considering the text format in the source documents, this size suits retrieving a paragraph that answers a specific question. A smaller chunk could lose the point and its context, while a much larger chunk would include irrelevant details.   
The overlap of 200 characters is 20% of the chunk size and provides a safety margin that reduces the risk of a sentence being cut off at a chunk boundary and its meaning being lost.
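
The interplay of chunk size and overlap can be illustrated with a deliberately simplified fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper that only shows how consecutive chunks share their boundary region.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = sliding_window_chunks(sample)
print([len(chunk) for chunk in chunks])    # prints [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # prints True: the overlap is shared
```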

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    add_start_index=True,
    )
all_chunks = text_splitter.split_documents(all_loaded_pages)
print(f"Split the total content into {len(all_chunks)} chunks.")

print("\n-*- Document Chunking Complete -*-")
print("Variable 'all_chunks' is now ready for embedding and storage.")

Split the total content into 10491 chunks.

-*- Document Chunking Complete -*-
Variable 'all_chunks' is now ready for embedding and storage.


### Embedding and storing data

The Google Gemini API offers several embedding models that can be used for semantic search, text classification, clustering, code retrieval, etc. When building a RAG system, text embeddings are used to measure the relatedness of strings and to perform a semantic similarity search.   
At the time of writing, the newest is the experimental Gemini embedding model `gemini-embedding-exp-03-07`. However, I chose the stable release model `text-embedding-004`. It creates embeddings with 768 dimensions for text of up to 2048 tokens and has a rate limit of 1500 requests per minute.
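
As an illustration of how embeddings measure relatedness, the sketch below ranks a few made-up 3-dimensional vectors by cosine similarity to a query vector. The vectors, the chunk names, and the helpers `cosine_similarity` and `top_k` are assumptions for this example; real embeddings from `text-embedding-004` have 768 dimensions, and the vector store may be configured with a different distance metric than cosine.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Made-up toy vectors, purely for illustration.
chunk_vectors = {
    "blight_history": [0.8, 0.2, 0.1],
    "staff_stats": [0.1, 0.9, 0.2],
    "recipe": [0.0, 0.1, 0.95],
}
query_vector = [0.9, 0.1, 0.0]
print(top_k(query_vector, chunk_vectors, k=2))  # prints ['blight_history', 'staff_stats']
```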

In [11]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [12]:
vector_store = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    collection_name="chatbot_project",
    persist_directory="./chatbot_db"
)
print(f"\n-*- Documents successfully embedded and stored. -*-")
print(f"Number of vectors/chunks in the store: {vector_store._collection.count()}")
  


-*- Documents successfully embedded and stored. -*-
Number of vectors/chunks in the store: 10491


### RAG chain

Setting up the RAG chain: 
* the Google API and the `gemini-2.0-flash` model for generation
* a retriever that returns the k most similar chunks
* a prompt template
* LangChain utilities to format the retrieved documents into a string and parse the output
* the resulting `rag_chain` variable for use in queries

In [13]:
# Checking that the GOOGLE_API_KEY is set
if "GOOGLE_API_KEY" not in os.environ or not os.environ["GOOGLE_API_KEY"]:
    print("Warning: GOOGLE_API_KEY not found or is empty.")

In [14]:
print("\n-*- Setting up the RAG Chain -*-")

# Initialising LLM for generation
# Using Gemini 2.0 Flash model which is suitable for RAG tasks.
GENERATION_MODEL_NAME = "gemini-2.0-flash"
llm = ChatGoogleGenerativeAI(model=GENERATION_MODEL_NAME, temperature=0.3) # temperature controls creativity
print(f"LLM '{GENERATION_MODEL_NAME}' initialized.")

# Creating a retriever from the vector store
# k is the number of documents to retrieve.
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
print(f"Retriever created. Will retrieve {retriever.search_kwargs['k']} chunks.")

# Designing the prompt template
# Persona prompt for Brother Genitivi (a character from Dragon Age lore)
template = """
### ROLE & PERSONA ###
You are Brother Ferdinand Genitivi, a renowned Chantry scholar, historian, and author from the world of Thedas.

### TONE & STYLE ###
- **Core Style:** Your tone is academic and formal, but filled with an eccentric and obsessive passion for history. 
You frame your answers as if documenting your findings for the historical record, blending the principles of a "man of science and of God".
For mundane or irrelevant questions, you may be slightly dismissive.
- **Vocabulary:** Use a rich, scholarly vocabulary (e.g., "postulate," "empirical," "fallacious," "persevere") 
and naturally incorporate Chantry-specific terms like "the Maker" and "the Chant of Light."
- **Sentence Structure:** Employ complex sentences with multiple clauses, reflecting a thoughtful and detailed writing process.
- **Rhetorical Approach:** Emphasize your role as a seeker of "truth" over superstition and dogma.
- **First-Person Narrative:** Frame your knowledge through the lens of your personal experiences and scholarly struggles.


### TASK ###
Answer the user's question based *only* on the provided context below.
- If the context contains the answer, synthesize it and respond in your persona as Brother Genitivi.
- If the answer is not present in the context, you must state that the information is not within the texts you have at hand.
- Do not, under any circumstances, make up an answer or use knowledge from outside the provided context.

### CONTEXT ###
{context}

### QUESTION ###
{question}

### YOUR ANSWER ###
"""
prompt = ChatPromptTemplate.from_template(template)
print("Prompt template created.")

# Constructing the RAG Chain using LangChain's RunnableParallel and RunnablePassthrough
# Helper function to format retrieved documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

print("\n-*- RAG chain constructed successfully! -*-")
print("The 'rag_chain' variable is now ready for queries.")


-*- Setting up the RAG Chain -*-
LLM 'gemini-2.0-flash' initialized.
Retriever created. Will retrieve 5 chunks.
Prompt template created.

-*- RAG chain constructed successfully! -*-
The 'rag_chain' variable is now ready for queries.


### Testing the RAG chain

The test asks two questions: one that can be answered from the text of the source documents, and one that falls outside the provided context, which the chatbot should answer with a version of "I don't know", as instructed in the prompt.   
The test succeeds if the chatbot answers the first question correctly and, for the second, replies that it cannot answer instead of making something up.
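
The second criterion can be checked by eye, but a rough automated check is also possible. The sketch below is a simple keyword heuristic and not part of the notebook: `REFUSAL_MARKERS` and `looks_like_refusal` are hypothetical, and the marker phrases are assumptions that would need tuning against real answers, since the persona words its refusals freely.

```python
# Hypothetical phrases that suggest the chatbot declined to answer.
REFUSAL_MARKERS = ("not within the texts", "no mention", "cannot answer")

def looks_like_refusal(answer: str, markers=REFUSAL_MARKERS) -> bool:
    """Return True if the answer contains any of the refusal phrases (case-insensitive)."""
    lowered = answer.lower()
    return any(marker in lowered for marker in markers)

in_context_answer = "Based on the texts at hand, I can confirm that the Fifth Blight began in the year 9:30 Dragon."
out_of_context_answer = "Alas, within the texts at my disposal, there is no mention of my preferred sandwich."

print(looks_like_refusal(in_context_answer))      # prints False
print(looks_like_refusal(out_of_context_answer))  # prints True
```

A keyword check like this can miss a politely phrased refusal or flag a genuine answer that happens to contain a marker, so it complements manual review of the responses and does not replace it.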

In [15]:
# Testing the RAG chain
if 'rag_chain' in locals() and rag_chain:
    print("\n-*- Testing the RAG chain -*-")

    # Test 1: A question that should be answerable from the source documents
    test_question_relevant = "When did the fifth blight start?"
   
    print(f"\nInvoking chain with question: '{test_question_relevant}'")
    try:
        response_relevant = rag_chain.invoke(test_question_relevant)
        print(f"\nQuestion: {test_question_relevant}")
        print(f"Answer: {response_relevant.get('answer', 'N/A - Answer key not found in response')}")
    except Exception as e_invoke_relevant:
        print(f"Error invoking RAG chain for relevant question: {type(e_invoke_relevant).__name__} - {e_invoke_relevant}")

    # Test 2: A question outside the context to check "I don't know" behavior
    test_question_outside_context = "What is your favourite sandwich?"
    print(f"\nInvoking chain with question: '{test_question_outside_context}'")
    try:
        response_unknown = rag_chain.invoke(test_question_outside_context)
        print(f"\nQuestion: {test_question_outside_context}")
        print(f"Answer: {response_unknown.get('answer', 'N/A - Answer key not found in response')}")
    except Exception as e_invoke_unknown:
        print(f"Error invoking RAG chain for out-of-context question: {type(e_invoke_unknown).__name__} - {e_invoke_unknown}")

else:
    print("Error: 'rag_chain' is not defined. Please run RAG chain setup first.")


-*- Testing the RAG chain -*-

Invoking chain with question: 'When did the fifth blight start?'

Question: When did the fifth blight start?
Answer: Based on the texts at hand, I can confirm that the Fifth Blight began in the swamps of the Korcari Wilds on the southeastern border of Ferelden in the year 9:30 Dragon. It is a matter of historical record, though some contemporaries dispute whether it was a true Blight or merely a large darkspawn resurgence.

Invoking chain with question: 'What is your favourite sandwich?'

Question: What is your favourite sandwich?
Answer: Alas, esteemed seeker of knowledge, within the texts at my disposal, there is no mention of my preferred sandwich. However, I have documented a superb dish of deboned fish, boiled eggs, dried fruit, spices, and thickened cream, all topped with a light crust. Perhaps this would be of interest to you instead?


## Streamlit application

### Overview of Streamlit Application Development for the RAG Chatbot

This project culminated in an interactive web application built with Streamlit, which provides a user-friendly interface for the Dragon Age lore RAG (Retrieval Augmented Generation) chatbot. The goal was to create an accessible platform where users can ask questions and receive in-character responses from "Brother Genitivi" based on the knowledge extracted from the provided game guide PDFs.   
The app was deployed on Streamlit Cloud and is available at the following link: <https://da-chatbot.streamlit.app/>

#### Key technologies and components used:

Streamlit: Chosen as the framework for building the web application due to its ease of use, rapid development capabilities, and Python-centric approach.    
LangChain Components: The core RAG chain, previously constructed, was integrated. This involved:
- Loading the pre-existing ChromaDB vector store (persisted in the ./chatbot_db directory).
- Initialising the GoogleGenerativeAIEmbeddings model (`text-embedding-004`) for embedding queries.
- Initialising the ChatGoogleGenerativeAI model (`gemini-2.0-flash`) for response generation.
- Using the persona prompt for Brother Genitivi.

Streamlit Features:
- `st.set_page_config` for basic page layout and title.
- `st.sidebar` for displaying "About" information and the project disclaimer.
- `st.chat_message` and `st.chat_input` for creating a conversational interface.
- `st.session_state` for maintaining chat history across user interactions.
- `st.markdown` with `unsafe_allow_html=True` for injecting custom CSS to achieve a thematic visual style.
- `st.cache_resource` for caching the RAG chain initialisation, ensuring an efficient user experience by avoiding re-loading models and data on every interaction.  

Custom image avatars for user and assistant messages to enhance the thematic immersion.  

#### Development process:

The development of the Streamlit application involved several key stages and iterative refinements:

Initial setup and RAG chain integration: The first step was to adapt the RAG chain logic to load the persisted ChromaDB vector store and initialise all LangChain and Google AI components within a function cached by st.cache_resource. This ensured that the potentially time-consuming setup (loading models, vector store) happened only once.  
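
The effect of caching the setup function can be shown without Streamlit. The sketch below uses `functools.lru_cache` from the standard library as a stand-in for `st.cache_resource`; the two are not the same mechanism, but both return the already-built object on repeated calls. `get_rag_chain`, `setup_calls`, and the placeholder return value are hypothetical.

```python
from functools import lru_cache

setup_calls = 0  # counts how often the expensive setup actually runs

@lru_cache(maxsize=None)  # stand-in for the st.cache_resource decorator used in the app
def get_rag_chain():
    """Pretend to load the vector store and models, which is slow in the real app."""
    global setup_calls
    setup_calls += 1
    return {"name": "rag_chain_placeholder"}

first = get_rag_chain()
second = get_rag_chain()  # served from the cache, the setup does not run again

print(setup_calls)      # prints 1
print(first is second)  # prints True: the same object is reused
```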

User interface design: A chat-like interface was implemented using st.chat_input for user queries and st.chat_message to display the conversation history. Session state was used to keep track of messages.  

Thematic styling (CSS): Custom CSS was injected through st.markdown to give the interface a visual style that matches the Dragon Age theme.


#### Deployment considerations (Streamlit Cloud):

Problem solved (GitHub file size limits): The ChromaDB vector store files (chroma.sqlite3 and data_level0.bin) were initially too large for direct GitHub commits. This was resolved by implementing Git LFS to track these large files, allowing the pre-built database to be included in the repository for Streamlit Cloud deployment.  

Problem solved (Dependency and Python version conflicts): Several ModuleNotFoundError and TypeError issues (related to langchain_google_genai, protobuf, and distutils) were encountered during deployment attempts. These were systematically resolved by:
- Ensuring all necessary LangChain integration packages were explicitly listed in requirements.txt.
- Pinning the protobuf library to version 3.20.3 to avoid compatibility issues with pre-generated code in dependencies.
- Migrating the local development environment and the Streamlit Cloud app configuration from Python 3.13 to Python 3.11 to resolve the distutils ModuleNotFoundError (distutils was removed from the standard library in Python 3.12, and some dependencies had not yet adapted).  
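
A quick way to diagnose the distutils problem in a given environment is to check whether the module can be found at all. This is a small diagnostic sketch, not code from the project; `distutils_available` is a hypothetical helper. Note that an installed setuptools can provide its own distutils replacement, so the result may be True even on Python 3.12 or later.

```python
import importlib.util
import sys

def distutils_available() -> bool:
    """Return True if a module named 'distutils' can be located in this environment."""
    return importlib.util.find_spec("distutils") is not None

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(f"distutils importable: {distutils_available()}")
```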

Problem solved (SQLite version on Streamlit Cloud): A RuntimeError from ChromaDB indicated an unsupported sqlite3 version on the Streamlit Cloud environment. This was fixed by adding pysqlite3-binary to requirements.txt and including a snippet at the top of app.py to instruct Python to use this newer SQLite version.  
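
The exact snippet from app.py is not reproduced here. A commonly used form of this workaround, shown as a sketch, swaps the standard `sqlite3` module for the one bundled with pysqlite3-binary before any library that needs SQLite is imported. The `try`/`except` fallback is an addition for this example so that the code also runs where the package is not installed.

```python
import sys

try:
    # pysqlite3-binary installs a module named "pysqlite3" that bundles a newer SQLite.
    import pysqlite3  # noqa: F401
    # Register it under the name "sqlite3" so later imports pick up the newer version.
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
except ImportError:
    # Package not installed (for example on a local machine): keep the standard module.
    pass

import sqlite3

print(sqlite3.sqlite_version)  # the SQLite library version that is actually in use
```

For the swap to take effect, it has to run before the vector store library is imported, which is why it belongs at the very top of the file.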

Problem solved (API key management): Ensured the GOOGLE_API_KEY was correctly managed using Streamlit's secrets management for secure deployment.  
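
One way to support both the deployed app and local development is to look for the key in a secrets mapping first and fall back to environment variables. The sketch below is an assumption about how this could be done, not the project's actual code: `resolve_api_key` is a hypothetical helper, and plain dictionaries stand in for the secrets object and the environment.

```python
import os
from typing import Mapping, Optional

def resolve_api_key(
    secrets: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
    name: str = "GOOGLE_API_KEY",
) -> Optional[str]:
    """Return the key from the secrets mapping, else from the environment, else None."""
    value = secrets.get(name) or env.get(name)
    return value or None

# Plain dictionaries stand in for the deployment secrets and the local environment.
print(resolve_api_key({"GOOGLE_API_KEY": "from-secrets"}, env={}))  # prints from-secrets
print(resolve_api_key({}, env={"GOOGLE_API_KEY": "from-env"}))      # prints from-env
print(resolve_api_key({}, env={}))                                  # prints None
```

Keeping the key out of the source code and the repository is the point in both cases; only the lookup location differs between deployment and local runs.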

The resulting Streamlit application successfully provides an interactive and thematically styled interface to the Dragon Age RAG chatbot, demonstrating the integration of various LangChain components and handling several common deployment challenges.

## Project Reflection

This project has resulted in a functional RAG (Retrieval Augmented Generation) chatbot capable of answering questions based on a specific knowledge base sourced from Dragon Age game guides. What follows is a discussion regarding the model's real-world application, challenges, opportunities, and ethical considerations, in accordance with the requirements.

### Real-world Use

A RAG chatbot like this one has several practical applications, especially within the world of gaming. The primary benefit is offering immediate and contextual information to players, which significantly enhances the user experience.  

Integrated in-game help: Instead of pausing the game to search an external wiki, a player could ask a question directly via an in-game overlay or a companion app. This reduces friction and helps the player remain immersed in the game world.  
Community support: On platforms like Discord or Reddit, such a chatbot could serve as a first line of defense by automatically answering frequently asked questions about the game's lore, mechanics, or specific quests. This frees up time for human moderators to focus on more complex discussions.  
Onboarding for new players: Large games like Dragon Age can be overwhelming for beginners. A RAG bot can act as a personal guide, helping the player understand fundamental concepts without revealing spoilers.  

### Potential challenges

Despite the model's success in this project, there are significant challenges in implementing and maintaining such a system in a real-world environment.  

Data maintenance and version control: Most games are continuously updated with patches and expansions, so the knowledge base (the PDF files) can become outdated. This is less of a concern for the Dragon Age titles covered here, because their development has ended, unless a remaster is released in the future. A robust process for updating the source documents, re-running the chunking and embedding processes, and validating the new information is necessary, which entails an ongoing operational cost.  
Handling ambiguous queries: Users rarely ask perfect questions. A query like "Where do I find the best sword?" is subjective and lacks the semantic precision required to match a specific text chunk. The model might then fail to find relevant context or retrieve information that doesn't match the user's intent, leading to irrelevant answers.  
Inaccuracies in source data: The RAG model is entirely dependent on its knowledge base. If a game guide contains a factual error, the chatbot will present this error as truth. The system lacks the capacity for critical thinking or external fact-checking.  
LLM limitations and hallucinations: Even with a strong instruction to only use the provided context, there is a risk that the language model will "hallucinate" or misinterpret information, especially if the retrieved chunks are ambiguous or contradictory. The in-game character persona might also inadvertently encourage creativity where fact-based recall is desired.  
Scalability: The system built here works well for a single user. Handling thousands of concurrent users would require a transition to a cloud architecture to manage API calls (costs and rate limits) and vector database queries.  

### Opportunities

Despite the challenges, the commercial and user-centric opportunities are significant.  

Commercialization: Game developers could offer such a chatbot as a premium feature in an official app or as part of a subscription service. It could also be used as a marketing tool on the game's website to increase engagement.  
Personalization: A more advanced version could connect to a player's save file. By knowing the player's progress, the chatbot could provide tailored advice and avoid spoilers for quests or areas the player has not yet discovered.  
Accessibility: For players with certain disabilities, a voice- or text-based chat interface might be a more accessible way to obtain information compared to navigating complex menus or websites.  

### Ethical Considerations

Finally, it is important to reflect on the ethical implications.  

Bias in source material: Game guides and lore books are written by people and may contain cultural or other forms of bias. The RAG system will inherit and amplify these biases without question, presenting them as objective fact.  
Dissemination of misinformation: As previously mentioned, errors in the source data will lead to incorrect answers. An authoritative and convincing persona might cause users to trust this incorrect information, which could lead to frustration and the spread of incorrect lore within the game's community.  
Data privacy and logging: What happens to the questions users ask? If they are logged to improve the system, there must be a clear policy on how the data is stored, who has access to it, and for how long it is kept. User anonymity and privacy must be protected.  
Persona and manipulation: A well-written persona can create a form of emotional connection. There is an ethical responsibility not to design these personas in a way that could be manipulative or misleading to vulnerable users.  