## Naive RAG - Q&A System

This notebook implements a naive retrieval-augmented generation (RAG) question-answering system using LangChain, a Chroma vector store, and OpenAI's GPT-4 model. Key steps include:

- **Environment Setup:** Loads API keys from `.env`.
- **Model Initialization:** Sets up GPT-4 for generating responses.
- **Document Indexing:** Loads text documents, embeds them, and stores in a Chroma vector store for retrieval.
- **Retrieval & Formatting:** Retrieves the documents most relevant to a question, computes the cosine similarity between the query embedding and each document embedding, and formats the results as context for the prompt.
- **Async Execution:** Uses async functions to retrieve documents and generate a concise answer of at most three sentences.
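
The retrieval and formatting step listed above can be tried without any API calls. The following is a minimal sketch, assuming toy 3-dimensional vectors in place of real OpenAI embeddings and using only the standard library instead of NumPy. `ToyDoc` and `format_ranked` are hypothetical names introduced for illustration; they are not part of the notebook or of LangChain.

```python
import math
from dataclasses import dataclass


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    denom = math.sqrt(sum(a * a for a in vec1)) * math.sqrt(sum(b * b for b in vec2))
    # Assumption: a zero vector is treated as having no similarity to anything.
    return dot / denom if denom else 0.0


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a retrieved document with a precomputed embedding."""
    source: str
    text: str
    vector: tuple


def format_ranked(docs, query_vector, max_docs=5, max_chars=500):
    """Keep one entry per source, cap the entry count, and truncate each text."""
    seen, blocks = set(), []
    for doc in docs:
        if doc.source in seen:
            continue  # skip repeated sources
        seen.add(doc.source)
        score = cosine_similarity(query_vector, doc.vector)
        blocks.append(
            f"Source document: {doc.source}\n\n"
            f"Cosine Similarity: {score:.4f}\n\n"
            f"{doc.text.strip()[:max_chars]}"
        )
        if len(blocks) == max_docs:
            break
    return "\n\n".join(blocks)


docs = [
    ToyDoc("a.txt", "MagentaTV overview " * 50, (1.0, 0.0, 0.0)),
    ToyDoc("a.txt", "Duplicate source, skipped", (1.0, 0.0, 0.0)),
    ToyDoc("b.txt", "Unrelated tariff details", (0.0, 1.0, 0.0)),
]
context = format_ranked(docs, query_vector=(1.0, 0.0, 0.0), max_docs=2)
print(context)
```

Capping both the number of entries and the characters per entry keeps the prompt size predictable; the limits of 5 documents and 500 characters mirror the notebook's defaults and are starting points to tune, not recommendations.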

Run the `main()` function to see the system in action with a sample question.
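
The cell ends with a top-level `await main()`, which works in Jupyter/IPython because an event loop is already running there. Outside a notebook, a coroutine has to be started explicitly. The sketch below shows one way to do that with `asyncio.run()`; the `main()` here is a hypothetical stand-in that returns a fixed string, and `run_main` is an illustrative helper name, not something defined in the notebook.

```python
import asyncio


async def main():
    # Hypothetical stand-in for the notebook's main(); the real one calls the
    # retriever and the model.
    await asyncio.sleep(0)
    return "done"


def run_main():
    """Run main() from synchronous code such as a plain .py script."""
    # asyncio.run() creates a new event loop, runs the coroutine to completion,
    # and closes the loop. It raises RuntimeError when another loop is already
    # running in the same thread, which is why a notebook cell awaits directly.
    return asyncio.run(main())


result = run_main()
print(result)
```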

In [2]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai.embeddings import OpenAIEmbeddings
import numpy as np
import asyncio

# Load environment variables from the .env file
load_dotenv()

# API Keys
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")

# Model Initialization
model = ChatOpenAI(model="gpt-4", api_key=OPENAI_API_KEY)

#### INDEXING ####

# Base directory for loading documents
base_directory = "rag_data/website/organized_data"

# Load all .txt files from the specified directory using DirectoryLoader
loader = DirectoryLoader(base_directory, glob="**/*.txt", loader_cls=TextLoader)

# Load documents
docs = loader.load()

# Embedding
embedding = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
vectorstore = Chroma.from_documents(documents=docs, embedding=embedding)

# Define the retriever
retriever = vectorstore.as_retriever()

#### RETRIEVAL and GENERATION ####

# Prompt template for question-answering
custom_prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template(
        """
        You are a Telekom-Hilfe assistant for question-answering tasks, providing answers to Telekom customers or potential customers.
        Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
        Use three sentences maximum and keep the answer concise.
        Question: {question}
        Context: {context}
        Answer:
        """
    )
])

# Cosine Similarity calculation function
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.
    """
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# Function to format documents with a limit on the number of documents included
def format_docs(docs, query_embedding, max_docs=5):
    """
    Formats the documents for inclusion in the context, limiting the total number of documents.
    Args:
        docs: List of retrieved documents.
        query_embedding: Embedding vector for the query.
        max_docs: Maximum number of documents to include.
    Returns:
        A string containing formatted documents.
    """
    unique_sources = set()  # To keep track of unique sources
    formatted_docs = []

    # Only the first `max_docs` retrieved documents are considered, and each
    # document's content is truncated to 500 characters. Both limits help keep the
    # prompt within the GPT-4 token limit and avoid a BadRequestError.
    # Adjust `max_docs` and the content length as needed after further testing.

    for doc in docs[:max_docs]:  # Limit the number of documents to `max_docs`
        source = doc.metadata.get("source")  # Get the source from metadata
        if source and source not in unique_sources:
            unique_sources.add(source)
            document_embedding = embedding.embed_query(doc.page_content)  # Compute embedding
            similarity = cosine_similarity(query_embedding, document_embedding)  # Cosine similarity
            content = doc.page_content.strip()[:500]  # Trim content to reduce size
            formatted_docs.append(
                f"Source document: {source}\n\nCosine Similarity: {similarity:.4f}\n\n{content}"
            )

    return "\n\n".join(formatted_docs)

# Define the processing chain.
# The chain is invoked with a dict holding "context" and "question",
# so the prompt template can consume that dict directly.
rag_chain = custom_prompt | model | StrOutputParser()

# Function to retrieve and format documents and generate an answer
async def retrieve_and_format_docs(question):
    """
    Retrieves relevant documents and formats them to answer a given question.
    Args:
        question: The input question as a string.
    Returns:
        The generated answer and formatted documents.
    """
    # Get the query embedding
    query_embedding = embedding.embed_query(question)
    retrieved_docs = await retriever.ainvoke(question)  # Retrieve relevant documents

    # Format documents with a limit to avoid exceeding token limit
    formatted_docs = format_docs(retrieved_docs, query_embedding, max_docs=5)
    
    # Generate the answer using the formatted context.
    # `ainvoke` is the awaitable counterpart of `invoke`.
    answer = await rag_chain.ainvoke({"context": formatted_docs, "question": question})
    
    return answer, formatted_docs

# Main function to run the retrieval and answer generation process
async def main():
    """
    Main function to execute the retrieval and answer generation process.
    """
    question = "What is Magenta TV?"
    answer, source_docs = await retrieve_and_format_docs(question)
    print("Answer:", answer)
    print("\nSources:")
    print(source_docs)

# Execute main() with top-level await (supported in Jupyter/IPython, which already runs an event loop)
await main()

Answer: MagentaTV is a high-quality streaming service offered by Telekom. It provides a vast and unique selection of series, films, shows, documentaries, and content for children, including many originals and exclusives only available on MagentaTV+. It can be enjoyed at home or on the go using a TV receiver, the MagentaTV One, or the MagentaTV App, all offering numerous comfortable features to enhance your online TV experience.

Sources:
Source document: rag_data/website/organized_data/Others/https_www_telekom_de_unterhaltung_serien_und_filme.txt

Cosine Similarity: 0.8683

Source URL: https://www.telekom.de/unterhaltung/serien-und-filme

Question: What is MagentaTV+?
Answer: MagentaTV+ is a high-quality streaming offering that is always included with MagentaTV. Here you will find a huge and unique selection of series, films, shows, documentaries, and children's content. This includes many originals and exclusives that are available only on MagentaTV+. These include, for example:
In addition