# LangChain Document Structure

## from langchain_core.documents import Document

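### Core Components:
- `page_content` (str): the text of the document
- `metadata` (dict): key-value information about the document, such as source, page, or author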

### LangChain Loader
- Reads a data source (CSV, PDF, ...) and returns its content as a list of `Document` objects

# Data Ingestion

In [19]:
# Document Structure
from langchain_core.documents import Document

In [20]:
doc = Document(
    page_content = "this is the main text content I am using to create RAG",
    metadata = {
        'source': 'example.txt',
        'pages': 1,
        'author': "Farnaz Nouri",
        'date_created': '2025-10-01'
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Farnaz Nouri', 'date_created': '2025-10-01'}, page_content='this is the main text content I am using to create RAG')
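
Because every document carries a `metadata` dict, it can be filtered by source, page, or author before retrieval. The sketch below illustrates this with a hypothetical `SimpleDocument` dataclass that only mimics the two fields shown above; it is a stand-in so the example runs without LangChain, not the library's own class. The helper `filter_by_metadata` is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDocument], key: str, value: Any) -> List[SimpleDocument]:
    """Keep only the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


docs = [
    SimpleDocument("Intro to RAG", {"source": "example.txt", "pages": 1}),
    SimpleDocument("Attention notes", {"source": "attention.pdf", "pages": 3}),
]

matches = filter_by_metadata(docs, "source", "example.txt")
print([d.page_content for d in matches])
```

The same list comprehension should work on real `Document` objects, since they expose `page_content` and `metadata` under the same names.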

In [21]:
# Create a simple txt file
import os
os.makedirs('../data/text_files', exist_ok=True)

In [22]:
sample_texts = {
    "../data/text_files/python_intro.txt":'''Python is a high-level, interpreted programming language known for its readability and versatility. Created by Guido van Rossum and released in 1991, it has become one of the most popular languages for various applications.
Key Characteristics:
Readability: Python's syntax emphasizes clarity and conciseness, often allowing developers to express concepts in fewer lines of code compared to other languages. This is partly due to its use of indentation to define code blocks.
Interpreted: Python code is executed line by line by an interpreter, which facilitates rapid prototyping and interactive testing.
Dynamically Typed: Variable types are automatically determined at runtime, simplifying code writing as explicit type declarations are not always required.
Multi-paradigm: Python supports various programming paradigms, including object-oriented, procedural, and functional programming.
Extensive Standard Library: Python comes with a large standard library that provides modules and functions for a wide range of tasks, reducing the need to write code from scratch.
Cross-platform: Python applications can be developed and run on different operating systems like Windows, macOS, and Linux.
Common Applications:
Web Development: Used for server-side web applications with frameworks like Django and Flask.
Data Science and Machine Learning: A popular choice due to its powerful libraries such as NumPy, Pandas, and scikit-learn.
Automation and Scripting: Ideal for automating repetitive tasks and system administration.
Software Development: Used for creating desktop applications, games, and internal tools.
Scientific Computing and Research: Employed in various scientific fields for data analysis and modeling.
    ''',
    "../data/text_files/machine_learning.txt": ''' 
Machine learning allows computers to learn from data and improve performance on tasks without explicit programming. The core process involves collecting and preparing data, selecting an algorithm, training a model, and evaluating its accuracy to make predictions. The three primary types of machine learning are supervised learning (using labeled data for tasks like classification and regression), unsupervised learning (finding patterns in unlabeled data, like clustering), and reinforcement learning (learning from rewards and penalties in an environment).
Key Concepts
Data is Crucial: High-quality, diverse data is the foundation of machine learning, providing the examples for models to learn from. 
Algorithms & Models: An algorithm is a set of instructions that enables the computer to learn from data, while a model is the trained output of the algorithm. 
Features: These are the attributes or characteristics extracted from data that are used by the model to learn and make decisions. 
Types of Machine Learning
Supervised Learning:
How it works: Uses labeled datasets, where the correct input-output relationships are known, to train the model. 
Examples: Image recognition (labeling photos as "cat" or "dog") and spam email filtering. 
Unsupervised Learning:
How it works: Works with unlabeled data to identify hidden patterns, similarities, and groupings on its own. 
Examples: Clustering data points to find common characteristics or detecting anomalies in datasets. 
Reinforcement Learning:
How it works: An agent learns by interacting with an environment, receiving feedback in the form of rewards or penalties for its actions. 
Examples: Training a robot to navigate or a program to play a game by playing against itself.
'''
}

for filepath, content in sample_texts.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("sample text file created!")

sample text file created!


In [23]:
# TextLoader
# (older releases exposed this as langchain.document_loaders, which is deprecated)
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding='utf-8')
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python is a high-level, interpreted programming language known for its readability and versatility. Created by Guido van Rossum and released in 1991, it has become one of the most popular languages for various applications.\nKey Characteristics:\nReadability: Python's syntax emphasizes clarity and conciseness, often allowing developers to express concepts in fewer lines of code compared to other languages. This is partly due to its use of indentation to define code blocks.\nInterpreted: Python code is executed line by line by an interpreter, which facilitates rapid prototyping and interactive testing.\nDynamically Typed: Variable types are automatically determined at runtime, simplifying code writing as explicit type declarations are not always required.\nMulti-paradigm: Python supports various programming paradigms, including object-oriented, procedural, and functional programming.\nExtensive Standard 

In [24]:
# Directory Loader -> if you have all the documents in your directory 
# and want to load all of them
from langchain_community.document_loaders import DirectoryLoader

# Load all the text files from the directory
dir_loader = DirectoryLoader(
    '../data/text_files',
    glob="**/*.txt",  # recursive pattern: every .txt file under the directory
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=False
)
documents = dir_loader.load()
documents

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python is a high-level, interpreted programming language known for its readability and versatility. Created by Guido van Rossum and released in 1991, it has become one of the most popular languages for various applications.\nKey Characteristics:\nReadability: Python's syntax emphasizes clarity and conciseness, often allowing developers to express concepts in fewer lines of code compared to other languages. This is partly due to its use of indentation to define code blocks.\nInterpreted: Python code is executed line by line by an interpreter, which facilitates rapid prototyping and interactive testing.\nDynamically Typed: Variable types are automatically determined at runtime, simplifying code writing as explicit type declarations are not always required.\nMulti-paradigm: Python supports various programming paradigms, including object-oriented, procedural, and functional programming.\nExtensive Standard 

In [25]:
# Directory Loader for PDF files
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Make sure the pdf_files directory exists; place your PDF files in it before loading
import os
os.makedirs('../data/pdf_files', exist_ok=True)

# Load all the PDF files from the directory
dir_loader = DirectoryLoader(
    '../data/pdf_files',
    glob="**/*.pdf",  # recursive pattern: every .pdf file under the directory
    loader_cls=PyPDFLoader,
    show_progress=False
)
pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'attention', 'source': '../data/pdf_files/attention.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='NLP,  attention  mechanisms  enable  a  model  to  dynamically  weigh  the  importance  of  different  \nwords\n \nin\n \nan\n \ninput\n \nsequence,\n \nallowing\n \nit\n \nto\n \nfocus\n \non\n \nthe\n \nmost\n \nrelevant\n \nparts\n \nof\n \nthe\n \ncontext\n \nwhen\n \nprocessing\n \ninformation\n \nor\n \ngenerating\n \noutput.\n \nThis\n \nimproves\n \nunderstanding\n \nof\n \nlong-range\n \ndependencies,\n \nlike\n \n"dog"\n \nand\n \n"field"\n \nin\n \n"The\n \ndog\n \nran\n \nacross\n \nthe\n \nfield,"\n \nand\n \nis\n \na\n \nfoundational\n component  of  modern  Transformer  models,  powering  tasks  from  text  generation to  translation.    How  Attention  Works  At  its  core,  attention  works  by  using  the  concepts  of  queries,  ke

# Embedding And VectorStore

# Text Chunking

Before creating embeddings, we need to split our documents into smaller chunks for better retrieval performance.
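
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on the listed separators, so its chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
parts = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for part in parts:
    print(part)
```

Each chunk repeats the last four characters of the previous one, which is the role the 200-character overlap plays in the splitter configured below.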

In [26]:
# Import text splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum characters per chunk
    chunk_overlap=200,      # Overlap between chunks to maintain context
    length_function=len,    # Function to measure chunk length
    separators=["\n\n", "\n", " ", ""]  # Separators to use for splitting
)

# Combine ALL documents (text + PDF) before chunking
print("Combining text and PDF documents...")
all_documents = documents + pdf_documents

print(f"Text documents: {len(documents)}")
print(f"PDF documents: {len(pdf_documents)}")
print(f"Total documents: {len(all_documents)}")

# Split ALL documents into chunks
print("\nSplitting documents into chunks...")
chunks = text_splitter.split_documents(all_documents)

print(f"Number of original documents: {len(all_documents)}")
print(f"Number of chunks after splitting: {len(chunks)}")

# Show examples from both types
print(f"\nExample chunk from text file:")
text_chunks = [c for c in chunks if c.metadata.get('source', '').endswith('.txt')]
if text_chunks:
    print(f"Content: {text_chunks[0].page_content[:200]}...")
    print(f"Metadata: {text_chunks[0].metadata}")

print(f"\nExample chunk from PDF file:")
pdf_chunks = [c for c in chunks if c.metadata.get('source', '').endswith('.pdf')]
if pdf_chunks:
    print(f"Content: {pdf_chunks[0].page_content[:200]}...")
    print(f"Metadata: {pdf_chunks[0].metadata}")

Combining text and PDF documents...
Text documents: 2
PDF documents: 3
Total documents: 5

Splitting documents into chunks...
Number of original documents: 5
Number of chunks after splitting: 15

Example chunk from text file:
Content: Python is a high-level, interpreted programming language known for its readability and versatility. Created by Guido van Rossum and released in 1991, it has become one of the most popular languages fo...
Metadata: {'source': '../data/text_files/python_intro.txt'}

Example chunk from PDF file:
Content: NLP,  attention  mechanisms  enable  a  model  to  dynamically  weigh  the  importance  of  different  
words
 
in
 
an
 
input
 
sequence,
 
allowing
 
it
 
to
 
focus
 
on
 
the
 
most
 
relevant
 
...
Metadata: {'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'attention', 'source': '../data/pdf_files/attention.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}
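
The PDF chunk above contains one word per line, a common artifact of PDF text extraction. A possible cleanup, sketched here under the assumption that paragraph breaks in the PDF text are not needed, is to collapse every run of whitespace into a single space before chunking. `normalize_whitespace` is a hypothetical helper, not part of LangChain.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


raw = "weigh  the  importance  of  different  \nwords\n \nin\n \nan\n \ninput\n \nsequence,"
clean = normalize_whitespace(raw)
print(clean)
```

In the notebook this could be applied to `page_content` of each item in `pdf_documents` before calling `split_documents`. Note that it also removes `\n\n` paragraph breaks, so the splitter would fall back to splitting on spaces.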


In [None]:
import numpy as np
import chromadb
import uuid
from chromadb.config import Settings
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer



In [28]:
class EmbeddingMangager:
    """Handles document embedding generation using SentenceTransformer"""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager
        Args:
            model_name: HuggingFace model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):  # Private helper method
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension; {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """ Generate embeddings for a list of texts
            Args:
                texts: List of text strings to embed
            Returns:
                numpy array of embeddings with shape(len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} text...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generating embeddings with shape: {embeddings.shape}")
        return embeddings
    
# Initialize the embedding manager
embedding_manager = EmbeddingMangager()
embedding_manager
    




Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension; 384


<__main__.EmbeddingMangager at 0x16ab5e900>
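
Embeddings are useful because texts with similar meaning end up pointing in similar directions. As a small illustration, the sketch below ranks toy 3-dimensional vectors by cosine similarity in plain Python. The vectors are made up for the example (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `cosine_sim` is a hypothetical helper named so that it does not shadow the `cosine_similarity` import above.

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
candidates = {
    "close": [0.9, 0.1, 1.1],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, -1.0],
}

# Sort candidate names from most to least similar to the query
ranked = sorted(candidates, key=lambda name: cosine_sim(query, candidates[name]), reverse=True)
print(ranked)
```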

# Vector Store

In [29]:
class VectorStore:
    """Manages document embeddings in a ChromaDB vector store"""
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        """ Initialize the vector store
            Args:
                collection_name: Name of the ChromaDB collection
                persist_directory: Directory to persist the vector store
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()
        
    def _initialize_store(self):
        """ Initialize ChromaDB and collection"""
        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok = True)
            self.client = chromadb.PersistentClient(path= self.persist_directory)

            # Get or create collection
            self.collection = self.client.get_or_create_collection(
                name= self.collection_name,
                metadata={"description" : "PDF document embeddings for RAG"}
            )
            print(f"Vector Store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error initializing ChromaDB: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """ Add documents and their embeddings to the vector store
        Args:
            documents: List of LangChain documents
            embeddings: Corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"Adding {len(documents)} documents to vector store")

        # prepare data for ChromaDB
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document content
            documents_text.append(doc.page_content)

            # Embedding
            embeddings_list.append(embedding.tolist())

        # Add to collection
        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text
            )
            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise
            

vectorstore = VectorStore()
vectorstore

Vector Store initialized. Collection: pdf_documents
Existing documents in collection: 12


<__main__.VectorStore at 0x16abcc650>

In [30]:
chunks

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python is a high-level, interpreted programming language known for its readability and versatility. Created by Guido van Rossum and released in 1991, it has become one of the most popular languages for various applications.\nKey Characteristics:\nReadability: Python's syntax emphasizes clarity and conciseness, often allowing developers to express concepts in fewer lines of code compared to other languages. This is partly due to its use of indentation to define code blocks.\nInterpreted: Python code is executed line by line by an interpreter, which facilitates rapid prototyping and interactive testing.\nDynamically Typed: Variable types are automatically determined at runtime, simplifying code writing as explicit type declarations are not always required.\nMulti-paradigm: Python supports various programming paradigms, including object-oriented, procedural, and functional programming."),
 Document(metadat

In [31]:
# Convert the text to embeddings
texts = [doc.page_content for doc in chunks]

# Generate the Embeddings
embeddings = embedding_manager.generate_embeddings(texts)

# Store in the vector Database
vectorstore.add_documents(chunks, embeddings)



Generating embeddings for 15 text...


Batches: 100%|██████████| 1/1 [00:00<00:00,  4.88it/s]

Generating embeddings with shape: (15, 384)
Adding 15 documents to vector store
Successfully added 15 documents to vector store
Total documents in collection: 27
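
The collection already held 12 documents and grew to 27, because `add_documents` mints a fresh random ID on every run, so re-running the notebook stores the same chunks again. One possible remedy, sketched below, is to derive the ID from the chunk's source and content so that the same chunk always gets the same ID. `stable_chunk_id` is a hypothetical helper. Pairing it with ChromaDB's `collection.upsert` in place of `collection.add` is an assumption about how it would be wired in, not something the notebook does.

```python
import hashlib


def stable_chunk_id(source: str, content: str) -> str:
    """Build a deterministic ID from the chunk's source path and text."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"


first = stable_chunk_id("../data/text_files/python_intro.txt", "Python is a high-level language.")
second = stable_chunk_id("../data/text_files/python_intro.txt", "Python is a high-level language.")
other = stable_chunk_id("../data/text_files/python_intro.txt", "Python is dynamically typed.")

print(first == second)  # identical input yields an identical ID
print(first == other)   # different content yields a different ID
```

A trade-off of this design is that two chunks with identical text from the same source collapse into a single ID.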





# Retriever Pipeline from the Vector Store

In [39]:
class RAGRetriver:
    """Handles query-based retrieval from the vector store"""

    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingMangager):
        """ Initialize the retriever
        Args:
            vector_store: Vector store containing document embeddings
            embedding_manager: Manager for generating query embeddings
        """
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:
        """ Retrieve relevant documents for a query
        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold

        Returns:
            List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"top K: {top_k}, Score threshold: {score_threshold}")

        # generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            retrieved_docs = []
            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
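                    # Convert distance to a similarity-style score. This equals cosine similarity
                    # only if the collection uses cosine distance ("hnsw:space": "cosine");
                    # with ChromaDB's default squared L2 distance the score can be negative.
                    similarity_score = 1 - distance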

                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i + 1
                        })
                print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
            else:
                print("No documents found")
            return retrieved_docs
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []
        
rag_retriever = RAGRetriver(vectorstore, embedding_manager)
rag_retriever

<__main__.RAGRetriver at 0x17b214f80>

In [33]:
rag_retriever.retrieve("why attention is important")

Retrieving documents for query: 'why attention is important'
top K: 5, Score threshold: 0.0
Generating embeddings for 1 text...


Batches: 100%|██████████| 1/1 [00:00<00:00, 128.57it/s]

Generating embeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)





[{'id': 'doc_afd086a1_6',
  'content': 'representation\n \nthat\n \nemphasizes\n \nthe\n \nmost\n \nimportant\n \ncontext.\n \n Key  Types  of  Attention  ●  Self-Attention:  A  crucial  innovation  in  the  Transformer  architecture,  where  queries,  keys,  and  values  are  all  derived  from  the  same  input  sequence.  This  allows  tokens  within  the  \nsequence\n \nto\n \nattend\n \nto\n \neach\n \nother,\n \ncapturing\n \ninternal\n \nrelationships\n \nand\n \ncontext.\n \n ●  Multi-Head  Attention:  In  this  technique,  multiple  attention  mechanisms  are  run  in  parallel.  Each  "head"  can  learn  different  types  of  relationships,  allowing  the  model  to  grasp  \nvarious\n \npatterns\n \nand\n \nnuances\n \nwithin\n \nthe\n \ntext.\n \n Why  Attention  is  Important  ●  Improved  Contextual  Understanding:  Attention  allows  models  to  capture  dependencies  \nbetween\n \nwords\n \nthat\n \nare\n \nfar\n \napart\n \nin\n \na\n \nsentence,\n \nleading\n \nto\n \
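
Only one of the five requested results passed a threshold of 0.0, which suggests that some scores were negative, that is, some distances were greater than 1. A likely explanation, assuming the collection uses ChromaDB's default squared L2 space (no `hnsw:space` was set when it was created) and that the embeddings are unit length, is that `1 - distance` is not a cosine similarity here. For unit vectors the two are related by `cosine = 1 - distance / 2`, which the sketch below checks numerically. All helper names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(distance: float) -> float:
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - distance / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 2.0, 1.0])

distance = squared_l2(a, b)
cos_direct = sum(x * y for x, y in zip(a, b))
cos_from_distance = cosine_from_squared_l2(distance)

print(f"squared L2 distance:  {distance:.4f}")
print(f"cosine (direct):      {cos_direct:.4f}")
print(f"cosine (from dist):   {cos_from_distance:.4f}")
```

If that assumption holds for this collection, either the conversion in `retrieve` or the collection's distance setting would need to change for `score_threshold` to behave like a cosine cutoff.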

# Generation Pipeline
## Integrating the Vector DB Context Pipeline with LLM Output

In [46]:
# Simple RAG pipeline with Groq LLM
from langchain_groq import ChatGroq
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get the API key from environment variables
groq_api_key = os.getenv("GROQ_API_KEY")

if not groq_api_key:
    print("Warning: GROQ_API_KEY not found in environment variables")
    print("Please create a .env file with your GROQ_API_KEY")
else:
    print("✅ GROQ_API_KEY loaded successfully")

# Initialize the Groq LLM
llm = ChatGroq(
    api_key=groq_api_key,
    model="llama-3.1-8b-instant",
    temperature=0.1,
    max_tokens=1024
)

print("LLM initialized successfully!")

# Simple RAG function: retrieve context + generate response
def rag_simple(query, retriever, llm, top_k=3):
    """
    Simple RAG pipeline that retrieves context and generates an answer
    """
    print(f"🔍 Processing query: '{query}'")
    
    # Retrieve the context
    results = retriever.retrieve(query, top_k=top_k)
    
    if not results:
        return "No relevant context found to answer the question."
    
    # Combine retrieved documents into context
    context_pieces = []
    for i, doc in enumerate(results, 1):
        context_pieces.append(f"Context {i}:\n{doc['content']}")
        print(f"✅ Retrieved context {i} (score: {doc['similarity_score']:.3f})")
    
    context = "\n\n".join(context_pieces)
    
    # Create the prompt
    prompt = f"""Use the following context to answer the question accurately and concisely.

Context:
{context}

Question: {query}

Answer:"""

    print("🤖 Generating response...")
    
    # Generate the answer using Groq LLM
    try:
        response = llm.invoke(prompt)
        return response.content
    except Exception as e:
        return f"Error generating response: {str(e)}"

✅ GROQ_API_KEY loaded successfully
LLM initialized successfully!


In [47]:
# Test different Groq models to find one that works
def test_groq_models():
    """Test various Groq models to find supported ones"""
    test_models = [
        "llama-3.1-8b-instant",
        "llama-3.2-90b-text-preview", 
        "llama-3.2-11b-text-preview",
        "llama-3.2-3b-preview",
        "llama-3.2-1b-preview",
        "mixtral-8x7b-32768",
        "gemma2-9b-it"
    ]
    
    groq_api_key = os.getenv("GROQ_API_KEY")
    if not groq_api_key:
        print("No API key found")
        return None
    
    for model in test_models:
        try:
            print(f"Testing model: {model}")
            test_llm = ChatGroq(api_key=groq_api_key, model=model, temperature=0.1, max_tokens=50)
            response = test_llm.invoke("Hello, this is a test.")
            print(f"✅ {model} works!")
            return model
        except Exception as e:
            print(f"❌ {model} failed: {str(e)[:100]}...")
            continue
    
    print("No working models found")
    return None

# Find a working model
working_model = test_groq_models()
if working_model:
    print(f"\n🎉 Using working model: {working_model}")
    llm = ChatGroq(api_key=groq_api_key, model=working_model, temperature=0.1, max_tokens=1024)
else:
    print("❌ No supported models found. Please check Groq documentation for current models.")

Testing model: llama-3.1-8b-instant
✅ llama-3.1-8b-instant works!

🎉 Using working model: llama-3.1-8b-instant


In [48]:
answer = rag_simple("what is attention mechanism?", rag_retriever, llm)
print(answer)

🔍 Processing query: 'what is attention mechanism?'
Retrieving documents for query: 'what is attention mechanism?'
top K: 3, Score threshold: 0.0
Generating embeddings for 1 text...


Batches: 100%|██████████| 1/1 [00:00<00:00, 18.58it/s]

Generating embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)
✅ Retrieved context 1 (score: 0.171)
✅ Retrieved context 2 (score: 0.148)
🤖 Generating response...





An attention mechanism is a technique in Natural Language Processing (NLP) that enables a model to dynamically weigh the importance of different words in an input sequence, allowing it to focus on the most relevant parts of the context when processing information or generating output.
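
`rag_simple` concatenates every retrieved chunk into the prompt, which can grow large as `top_k` increases. The sketch below shows one way to cap the context by a character budget while keeping the same prompt layout. `build_prompt` and `max_context_chars` are hypothetical names introduced for illustration, and a character count is only a rough proxy for the model's token limit.

```python
from typing import List


def build_prompt(query: str, contexts: List[str], max_context_chars: int = 2000) -> str:
    """Assemble a RAG prompt, truncating the combined context to a character budget."""
    pieces = []
    used = 0
    for i, text in enumerate(contexts, 1):
        remaining = max_context_chars - used
        if remaining <= 0:
            break  # budget exhausted: drop the lower-ranked contexts
        snippet = text[:remaining]
        pieces.append(f"Context {i}:\n{snippet}")
        used += len(snippet)
    context = "\n\n".join(pieces)
    return (
        "Use the following context to answer the question accurately and concisely.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


prompt = build_prompt("q?", ["a" * 10, "b" * 10], max_context_chars=15)
print(prompt)
```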


# Enhanced RAG Pipeline Features

In [55]:
# Advanced RAG Pipeline features
def rag_advanced(query, retriever, llm, top_k=5, min_score=0.2, return_context=False):
    """
    RAG pipeline with extra features:
    - Returns the answer, sources, a confidence score, and optionally the full context
    """
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        # Use the same keys as the success path so callers can index the result uniformly
        return {'answer': 'No relevant context found.', 'sources': [], 'confidence': 0.0, 'context': ''}

    # Prepare context and sources
    context = "\n\n".join(doc['content'] for doc in results)
    sources = [{
        'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        'page': doc['metadata'].get('page', 'unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:300] + '...'
    } for doc in results]
    # Confidence is the best similarity score among the retrieved chunks
    confidence = max(doc['similarity_score'] for doc in results)

    # Generate answer
    prompt = f"""Use the following context to answer the question accurately and concisely.

    Context:
    {context}

    Question: {query}

    Answer:"""

    # The f-string is already fully formatted; calling .format() on it again
    # would fail whenever the retrieved context contains curly braces
    response = llm.invoke(prompt)

    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }
    if return_context:
        output['context'] = context
    return output


# Example usage
result = rag_advanced("what is the process in embedding?", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print('answer:', result['answer'])
print('sources:', result['sources'])
print('confidence:', result['confidence'])
print('context preview:', result['context'][:300])


Retrieving documents for query: 'what is the process in embedding?'
top K: 3, Score threshold: 0.1
Generating embeddings for 1 text...


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.50it/s]

Generating embeddings with shape: (1, 384)
Retrieved 3 documents (after filtering)





answer: The process in embedding involves three main steps:

1. **Data Transformation**: Non-numerical data (text, images, graphs) is converted into numerical vectors, where each dimension represents a specific feature of the data.
2. **Vector Space**: These vectors are placed into an n-dimensional space, creating a dense numerical representation.
3. **Semantic Relationships**: The embedding process captures the nuances and context of the original data, where words with similar meanings will have embeddings that are closer together in the vector space.
sources: [{'source': '../data/pdf_files/embedding.pdf', 'page': 0, 'score': 0.36253780126571655, 'preview': '●   An  embedding  is  a  data  representation  that  uses  low-dimensional  numerical  vectors  to  capture  \nthe\n \nsemantic\n \nmeaning\n \nand\n \nrelationships\n \nof\n \nnon-numerical\n \ndata\n \nlike\n \nwords,\n \nimages,\n \nor\n \ngraphs,\n \nmaking\n \nthem\n \nunderstandable\n \nfor\n \nmachine\n \nlearning\n \nmode