## Colab Notebook for Building a RAG System with LlamaIndex and OpenAI

Open this notebook in [Colab](https://colab.research.google.com/github/Chair-of-Banking-and-Finance/Bachelor_thesis_24_25_Template/blob/main/GPT_RAG/rag_openAI.ipynb).

### Install Required Libraries

In [16]:
!pip install llama-index
!pip install openai
!pip install faiss-cpu
!pip install requests
!pip install PyMuPDF
!pip install llama-index-vector-stores-faiss
!pip install chromadb
!pip install llama-index-vector-stores-chroma
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


### Import Libraries

In [2]:
import os
import openai
import faiss
import numpy as np
import fitz  # PyMuPDF for PDF handling
from llama_index.core import VectorStoreIndex, ServiceContext, PromptTemplate, Document, StorageContext, SimpleDirectoryReader, load_index_from_storage
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.vector_stores.faiss import FaissVectorStore
import chromadb
from llama_index.vector_stores.chroma.base import ChromaVectorStore

### Load the text file with the OpenAI key into Colab

In [3]:
!wget -O OPEN_AI_KEY.txt 'https://raw.githubusercontent.com/Chair-of-Banking-and-Finance/Bachelor_thesis_24_25_Template/main/GPT_RAG/OPEN_AI_KEY.txt'

--2025-01-02 13:21:24--  https://raw.githubusercontent.com/Chair-of-Banking-and-Finance/Bachelor_thesis_24_25_Template/main/GPT_RAG/OPEN_AI_KEY.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/plain]
Saving to: ‘OPEN_AI_KEY.txt’


2025-01-02 13:21:24 (1010 KB/s) - ‘OPEN_AI_KEY.txt’ saved [53/53]



### Insert your OpenAI key and overwrite the file with the new content

How to Create an OpenAI API Key
1. **Sign Up/Log In**: Go to [OpenAI's website](https://platform.openai.com/) and sign up for an account if you don’t already have one, or log in if you do.

2. **Navigate to API Keys**: After logging in, go to your account settings (accessible via the user icon at the top-right corner). Select "API Keys" from the menu.

3. **Create a New Key**: Click on the "Create API Key" button. A new API key will be generated. Make sure to copy and save it securely, as you won’t be able to view it again.

4. **Use the Key in Your Code**: Reference the key in your code to authenticate requests to the OpenAI API (the line below).

In [4]:
OPENAI_API_KEY = 'sk-proj-xxx-xxx'

In [5]:
with open('OPEN_AI_KEY.txt', 'w') as file:
    file.write(OPENAI_API_KEY)

### Load OpenAI API key from file

In [6]:
if OPENAI_API_KEY:
    openai.api_key = OPENAI_API_KEY
else:
    with open("OPEN_AI_KEY.txt", "r") as key_file:
        openai.api_key = key_file.read().strip()
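
A key typed into a cell is easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reads the key from an environment variable first and falls back to the key file. The helper name `load_api_key` and the assumption that the variable `OPENAI_API_KEY` has been exported are illustrative; the notebook itself does not set this up.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY", key_file="OPEN_AI_KEY.txt"):
    """Return the API key from the environment, else from a local file.

    Both the variable name and this helper are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if os.path.isfile(key_file):
        with open(key_file, "r", encoding="utf-8") as f:
            key = f.read().strip()
    if not key:
        raise ValueError(f"No API key found in ${env_var} or '{key_file}'.")
    return key
```

It could then be used as `openai.api_key = load_api_key()` in place of the cell above.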

### Create a 'data' directory

In [7]:
!mkdir -p data

### Load text files and convert PDFs from the "data" folder

In [8]:
import os

# Define the text to be written to the file
roman_empire_text = """
The Roman Empire: An Overview
The Roman Empire was one of the most influential civilizations in human history, spanning over a millennium and leaving a legacy that shaped the world in areas such as governance, architecture, engineering, and law. Officially beginning in 27 BCE with the rise of Augustus Caesar, Rome transitioned from a republic to an empire, dominating vast territories that stretched from Britain in the northwest to Egypt in the southeast.

Formation and Expansion
The Roman Empire's foundation was built on centuries of conquest during the Roman Republic. Under Augustus, the empire ushered in a period of peace and stability known as the Pax Romana (Roman Peace), lasting about 200 years. During this time, Rome expanded its borders, solidifying control over Europe, North Africa, and parts of the Middle East.

The empire was characterized by a vast network of cities connected by advanced roads and aqueducts, facilitating trade, military movements, and cultural exchange. Notable conquests include Gaul (modern-day France) under Julius Caesar, the annexation of Egypt after Cleopatra's defeat, and the consolidation of power in regions such as Spain and the Balkans.

Culture and Society
Roman society was highly stratified, with a clear distinction between the elite patricians, common plebeians, and enslaved individuals. Roman culture blended Latin traditions with influences from Greece and the regions it conquered. This fusion led to remarkable achievements in literature (Virgil’s Aeneid), philosophy (Cicero, Seneca), and architecture (the Colosseum, aqueducts, and the Pantheon).

The Roman Empire was also a melting pot of religions. Initially polytheistic, it later became a cradle for Christianity, with Emperor Constantine legalizing the faith in 313 CE and Emperor Theodosius I declaring it the state religion by 380 CE.

Governance and Law
Rome was renowned for its administrative prowess and legal systems. The empire was divided into provinces, each governed by an appointed official. Roman law, codified in the Twelve Tables and later expanded, formed the foundation for many modern legal systems. Concepts like innocent until proven guilty and legal representation have their roots in Roman jurisprudence.

Decline and Fall
The decline of the Roman Empire was a gradual process influenced by internal and external factors. Political instability, economic struggles, and military overreach weakened the empire. The division of the empire into Eastern and Western halves in 395 CE further strained its cohesion. While the Western Roman Empire fell in 476 CE after being overrun by Germanic tribes, the Eastern Roman Empire, known as the Byzantine Empire, endured for another thousand years until the fall of Constantinople in 1453.

Legacy
The Roman Empire profoundly shaped Western civilization. Its contributions to governance, infrastructure, and culture remain influential today. Latin, the language of Rome, evolved into the Romance languages (Italian, French, Spanish, etc.), and Roman architecture inspired countless generations. The very concept of a republic and the rule of law owe much to Rome’s enduring influence.

In essence, the Roman Empire stands as a testament to humanity’s capacity for organization, innovation, and adaptation, making it a cornerstone of global history.
"""

# Specify the directory and file name
output_dir = "./data"
file_name = "roman_empire_overview.txt"
file_path = os.path.join(output_dir, file_name)

# Ensure the output directory exists; if not, create it
os.makedirs(output_dir, exist_ok=True)
print(f"Directory '{output_dir}' is ready.")

# Write the text to the file with UTF-8 encoding
try:
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(roman_empire_text)
    print(f"Text successfully written to '{file_path}'.")
except Exception as e:
    print(f"An error occurred while writing to the file: {e}")


Directory './data' is ready.
Text successfully written to './data/roman_empire_overview.txt'.


In [9]:
text_folder = "data"
texts = []

for filename in os.listdir(text_folder):
    file_path = os.path.join(text_folder, filename)
    if filename.endswith(".txt"):
        with open(file_path, "r", encoding="utf-8") as file:
            texts.append(file.read())
    elif filename.endswith(".pdf"):
        with fitz.open(file_path) as pdf:
            pdf_text = ""
            for page_num in range(pdf.page_count):
                page = pdf.load_page(page_num)
                pdf_text += page.get_text()
            texts.append(pdf_text)

### Manually add texts to the vector store

In [10]:
def chunk_text(text, max_tokens=1000):
    """
    Splits text into smaller chunks based on token limit.
    Assumes average of ~4 characters per token for rough estimation.
    """
    chunks = []
    words = text.split()  # Split text into words
    chunk = []
    char_count = 0

    for word in words:
        char_count += len(word) + 1  # Include space
        # Character budget per chunk: max_tokens * ~4 characters per token
        if char_count > max_tokens * 4 and chunk:
            chunks.append(" ".join(chunk))
            chunk = []
            char_count = len(word) + 1
        chunk.append(word)

    if chunk:  # Add remaining words
        chunks.append(" ".join(chunk))
    return chunks
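
To see the budget at work on a small input, here is a self-contained sketch of the same word-based strategy with the character budget passed in directly. The names `chunk_words` and `max_chars` are illustrative and not part of the notebook. With `max_tokens=1000`, the function above works with a budget of roughly 4,000 characters.

```python
def chunk_words(text, max_chars):
    """Group whitespace-separated words into chunks of roughly max_chars characters."""
    chunks, current, count = [], [], 0
    for word in text.split():
        count += len(word) + 1  # word plus one separating space
        if count > max_chars and current:
            chunks.append(" ".join(current))  # flush the full chunk
            current, count = [], len(word) + 1
        current.append(word)
    if current:  # keep the remaining words
        chunks.append(" ".join(current))
    return chunks

print(chunk_words("aa bb cc dd", max_chars=6))  # ['aa bb', 'cc dd']
```

A single word longer than the budget still ends up in a chunk of its own, so the limit is a target rather than a hard cap.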

In [25]:
import os
import chromadb
from chromadb.utils import embedding_functions
import openai
import uuid
import PyPDF2

def pdf_to_txt_in_memory(pdf_path):
    """
    Reads a PDF file and returns the extracted text as a single string.
    No local .txt file is created or saved.
    """
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"The file '{pdf_path}' does not exist.")

    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        extracted_text = ""
        for page in pdf_reader.pages:
            extracted_text += page.extract_text() + '\n'

    return extracted_text

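def chunk_text(text, chunk_size=1000, overlap=100):
    """
    Splits `text` into chunks of `chunk_size` characters, with `overlap`
    characters shared between consecutive chunks.
    Note: this redefines the word-based `chunk_text` from the earlier cell.
    Adjust these parameters for your use case.
    """
    if overlap >= chunk_size:
        # Otherwise `start` would never advance and the loop would not terminate
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)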

    while start < text_length:
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        # Move start forward by chunk_size - overlap
        start += chunk_size - overlap

    return chunks

def create_and_populate_chroma_db(
    data_dir: str = "./data",
    chroma_db_path: str = "./chroma_db",
    collection_name: str = "quickstart"
):
    """
    Creates a ChromaDB collection and populates it with documents from the specified directory.

    :param data_dir: Directory containing `.txt` (and `.pdf`) files to be added to ChromaDB.
    :param chroma_db_path: Path where ChromaDB data will be stored.
    :param collection_name: Name of the ChromaDB collection.
    """
    # Initialize ChromaDB
    client = chromadb.PersistentClient(path=chroma_db_path)
    print(f"ChromaDB client initialized with path '{chroma_db_path}'.")

    collection = client.get_or_create_collection(name=collection_name)
    print(f"ChromaDB collection '{collection_name}' created or retrieved.")

    # Lists to store all documents, embeddings, metadata, and IDs
    documents = []
    embeddings = []
    metadatas = []
    ids = []

    # Set your OpenAI API key
    openai.api_key = OPENAI_API_KEY

    for file_name in os.listdir(data_dir):
        file_path = os.path.join(data_dir, file_name)
        if os.path.isfile(file_path):
            # For PDF files, convert to text in-memory
            if file_name.endswith('.pdf'):
                try:
                    print(f"Converting PDF '{file_name}' to text in memory...")
                    content = pdf_to_txt_in_memory(file_path)
                    print(f"'{file_name}' successfully converted.")
                except Exception as e:
                    print(f"Error converting PDF '{file_name}': {e}")
                    continue

            # For text files, just read the content
            elif file_name.endswith('.txt'):
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
            else:
                # Skip other file types
                continue

            # Now chunk the content (whether from PDF or TXT)
            for chunk in chunk_text(content):
                try:
                    response = openai.embeddings.create(input=chunk, model="text-embedding-ada-002")
                    embedding = response.data[0].embedding
                except Exception as e:
                    # Skip chunks that fail to embed so all four lists stay aligned
                    print(f"Error creating embedding for chunk in '{file_name}': {e}")
                    continue

                documents.append(chunk)
                embeddings.append(embedding)
                metadatas.append({"source": file_name})
                ids.append(str(uuid.uuid4()))

    print(f"Loaded {len(documents)} total chunks from '{data_dir}'.")

    if not documents:
        print("No documents found to add to ChromaDB.")
        return

    try:
        # Add all documents and their embeddings to ChromaDB
        collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids,
            embeddings=embeddings
        )
        print(f"Documents added to collection '{collection_name}' successfully.")
    except Exception as e:
        print(f"Error adding documents to ChromaDB: {e}")

if __name__ == "__main__":
    create_and_populate_chroma_db()

ChromaDB client initialized with path './chroma_db'.
ChromaDB collection 'quickstart' created or retrieved.
Converting PDF '18Q1-aic-Quarterly-Statement.pdf' to text in memory...
'18Q1-aic-Quarterly-Statement.pdf' successfully converted. (No local .txt saved.)
Loaded 291 total chunks from './data'.
Documents added to collection 'quickstart' successfully.
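
The overlap logic in `chunk_text` is easier to follow on a tiny input. The sketch below is a standalone illustration (the name `sliding_chunks` is hypothetical) of the same stride, `chunk_size - overlap`, applied to a ten-character string.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Character windows of chunk_size that advance by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

print(sliding_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

With the defaults used above (1000 and 100), each chunk starts 900 characters after the previous one, and the final chunk can be a short tail that its predecessor already covers.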


In [27]:
import os
import chromadb

def inspect_chroma_db(
    chroma_db_path: str = "./chroma_db",
    collection_name: str = "quickstart"
):
    """
    Inspect the contents of a ChromaDB collection.

    :param chroma_db_path: Path to the ChromaDB storage.
    :param collection_name: Name of the ChromaDB collection to inspect.
    """
    # Initialize ChromaDB client
    client = chromadb.PersistentClient(path=chroma_db_path)
    print(f"ChromaDB client initialized with path '{chroma_db_path}'.")

    # Retrieve the collection
    try:
        collection = client.get_collection(name=collection_name)
        print(f"Collection '{collection_name}' retrieved successfully.")
    except Exception as e:
        print(f"Error retrieving collection '{collection_name}': {e}")
        return

    # Fetch all data from the collection (embeddings are only returned when requested)
    try:
        data = collection.get(include=["metadatas", "documents", "embeddings"])
        print(f"\nCollection '{collection_name}' contains the following data:")
        ids = data.get("ids", [])
        metadatas = data.get("metadatas", [])
        documents = data.get("documents", [])
        embeddings = data.get("embeddings")

        for idx, doc_id in enumerate(ids):
            print(f"\nChunk {idx + 1}:")
            print(f"  ID: {doc_id}")
            print(f"  Metadata: {metadatas[idx] if metadatas else 'No metadata'}")
            print(f"  Document: {documents[idx] if documents else 'No document'}")
            # Compare against None: embeddings may be a NumPy array, which has no single truth value
            print(f"  Embedding: {len(embeddings[idx])} values" if embeddings is not None else "  No embedding")
            print("-" * 40)
    except Exception as e:
        print(f"Error fetching data from collection '{collection_name}': {e}")

if __name__ == "__main__":
    # Example usage
    inspect_chroma_db()


Streaming output truncated to the last 5000 lines.
9.2 Has the code of ethics for senior managers been amended? Yes [  ]      No [ X ]
9.21 If the response to 9.2 is Yes, provide information related to amendment(s).
9.3 Have any provisions of the code of ethics been waived for any of the specified officers? Yes [  ]      No [ X ]
9.31 If the response to 9.3 is Yes, provide the nature of any waiver(s).
FINANCIAL
10.1 Does the reporting entity report any amounts due from parent, subsidiaries or affiliates on Page 2 of this statement? Yes [ X ]      No [  ]
10.2 If yes, indicate any amounts receivable from parent included in the Page 2 amount: $ 1,649,261
INVESTMENT
11.1 Were any of the stocks, bonds, or other assets of the reporting entity loaned, placed under option agreement, or otherwise made available for
use by another person?  (Exclude securities under securities lending agreements.) Yes [  ]      No [ X ]
11.2 If yes, give full and complete informatio

In [32]:
import os
import openai
import chromadb
from chromadb.utils import embedding_functions

# For exact token counting
try:
    import tiktoken
except ImportError:
    tiktoken = None
    print("tiktoken library not found. Install via 'pip install tiktoken' for exact token counting.")

def num_tokens_from_string(string: str, model: str = "gpt-4") -> int:
    """
    Counts the number of tokens in a string using the tiktoken library (if available).
    Fallback to approximate calculation if tiktoken is not installed.
    """
    if tiktoken:
        try:
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            # Unknown model name: fall back to "cl100k_base", used by GPT-3.5 and GPT-4
            encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(string))
    else:
        # Fallback: approximate by counting whitespace-separated words
        return len(string.split())

def query_chroma_db(
    query_text: str,
    chroma_db_path: str = "./chroma_db",
    collection_name: str = "quickstart",
    top_k: int = 5,
    OPENAI_API_KEY: str = None,
    max_context_tokens: int = 10000,  # New parameter to limit context tokens
    openai_model: str = "gpt-4o-mini",
):
    """
    Queries the ChromaDB collection and retrieves the top_k most similar documents.
    Optionally limits the context (query + retrieved documents) to max_context_tokens tokens.

    :param query_text: The query string.
    :param chroma_db_path: Path to the ChromaDB storage.
    :param collection_name: Name of the ChromaDB collection.
    :param top_k: Number of top similar documents to retrieve.
    :param OPENAI_API_KEY: Your OpenAI API key.
    :param max_context_tokens: Hard limit on the context size in tokens.
    :param openai_model: The OpenAI model to use (default "gpt-4o-mini").
    """
    # Ensure API key is set
    if not OPENAI_API_KEY:
        raise ValueError("OpenAI API key not provided.")
    openai.api_key = OPENAI_API_KEY

    # Initialize ChromaDB client
    client = chromadb.PersistentClient(path=chroma_db_path)
    print(f"ChromaDB client initialized with path '{chroma_db_path}'.")

    # Retrieve the collection (get_collection raises if it does not exist)
    try:
        collection = client.get_collection(name=collection_name)
    except Exception as e:
        raise FileNotFoundError(f"Collection '{collection_name}' not found in ChromaDB: {e}")
    print(f"ChromaDB collection '{collection_name}' retrieved.")

    # Generate embedding for the query using OpenAI
    try:
        embed_response = openai.embeddings.create(
            input=query_text,
            model="text-embedding-ada-002"
        )
        query_embedding = embed_response.data[0].embedding
        print("Query embedding successfully created.")
    except Exception as e:
        raise RuntimeError(f"Error generating embedding for query: {e}")

    # Query the collection
    try:
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=['metadatas', 'documents']
        )
    except Exception as e:
        raise RuntimeError(f"Error querying ChromaDB: {e}")

    # Process results
    if not results or 'documents' not in results or not results['documents'][0]:
        print("No results found.")
        return "No relevant documents found."

    # Start building combined context
    combined_context = f"Query: {query_text}\n\nRetrieved Documents:\n"
    for idx, document in enumerate(results['documents'][0]):
        metadata = results['metadatas'][0][idx]
        source = metadata.get('source', 'Unknown')
        combined_context += (
            f"Document {idx + 1} (Source: {source}):\n{document}\n\n"
        )

    # --- NEW SECTION: Token limit check and truncation logic ---
    current_tokens = num_tokens_from_string(combined_context, model=openai_model)
    # We also need to account for tokens in the system message. Let's approximate:
    system_message = "You are an assistant that synthesizes answers based on the given context."
    system_tokens = num_tokens_from_string(system_message, model=openai_model)

    # If we exceed the maximum, we must truncate or reduce the documents
    if (current_tokens + system_tokens) > max_context_tokens:
        print(
            f"[Warning] Combined context has {current_tokens + system_tokens} tokens. "
            f"Truncating to fit under {max_context_tokens} tokens."
        )
        # Truncate from the end until within the limit
        # A simple approach is to slice the combined_context from the end
        # but this might chop words in half. Consider more sophisticated approaches.
        while (current_tokens + system_tokens) > max_context_tokens and len(combined_context) > 0:
            # Remove last 100 characters at a time (heuristic)
            combined_context = combined_context[:-100]
            current_tokens = num_tokens_from_string(combined_context, model=openai_model)

    # Prepare the messages for ChatCompletion
    messages = [
        {
            "role": "system",
            "content": system_message
        },
        {
            "role": "user",
            "content": combined_context
        }
    ]

    # Send the combined context to OpenAI
    try:
        openai_response = openai.chat.completions.create(
            model=openai_model,  # e.g., "gpt-4"
            messages=messages,
            max_tokens=1000  # Adjust as needed
        )
        synthesized_answer = openai_response.choices[0].message.content.strip()
        print("Response from OpenAI generated successfully.")
    except Exception as e:
        raise RuntimeError(f"Error querying OpenAI: {e}")

    return synthesized_answer
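
Trimming 100 characters at a time can cut a retrieved document mid-sentence. One alternative, sketched here under assumptions, is to keep whole documents in rank order until the token budget is used up. `fit_documents` is a hypothetical helper, and its default whitespace counter is only a stand-in for `num_tokens_from_string`.

```python
def fit_documents(documents, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the leading documents whose cumulative token count fits max_tokens."""
    kept, used = [], 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > max_tokens:
            break  # stop at the first document that would exceed the budget
        kept.append(doc)
        used += cost
    return kept

docs = ["one two three", "four five", "six seven eight nine"]
print(fit_documents(docs, max_tokens=5))  # ['one two three', 'four five']
```

Inside `query_chroma_db` this could be applied to `results['documents'][0]` before the context string is built, with `count_tokens=lambda s: num_tokens_from_string(s, model=openai_model)` and a budget that leaves room for the query and the system message.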


### Query the index

In [33]:
# Example usage
user_query = "What are the key achievements of the Roman Empire?"
response = query_chroma_db(user_query, OPENAI_API_KEY = OPENAI_API_KEY)
print("\nSynthesized Response from OpenAI:")
print(response)

ChromaDB client initialized with path './chroma_db'.
ChromaDB collection 'quickstart' retrieved.
Query embedding successfully created.
Response from OpenAI generated successfully.

Synthesized Response from OpenAI:
I'm sorry, none of the documents provided contain information related to the key achievements of the Roman Empire. Please provide relevant documents.
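
The answer above suggests that none of the five retrieved chunks came from the Roman Empire text. One quick check, sketched here, is to count how many stored chunks each source file contributed. The helper `chunks_per_source` is hypothetical; it expects the dictionary shape returned by `collection.get()` in the inspection cell, and the sample data is made up for illustration.

```python
from collections import Counter

def chunks_per_source(data):
    """Count chunks per 'source' metadata value in a collection.get()-style dict."""
    metadatas = data.get("metadatas") or []
    return Counter((m or {}).get("source", "Unknown") for m in metadatas)

# Made-up sample mimicking the structure used in inspect_chroma_db()
sample = {
    "ids": ["a", "b", "c"],
    "metadatas": [{"source": "report.pdf"}, {"source": "report.pdf"}, {"source": "notes.txt"}],
}
print(chunks_per_source(sample))  # Counter({'report.pdf': 2, 'notes.txt': 1})
```

Against the real collection this would be called as `chunks_per_source(collection.get())`. If `roman_empire_overview.txt` does not appear in the counts, the file was not indexed when the collection was populated.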
