<div style="text-align: center;">
    <h1 style="color: #FF6347;">Self-Guided Lab: Retrieval-Augmented Generation (RAGs)</h1>
</div>

<div style="text-align: center;">
    <img src="https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExZ3FsdzRveTBrenMxM3VnbDMwaTJxN2NnZm50aGFibXk1NzNnY2Q0MCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/LR5ZBwZHv02lmpVoEU/giphy.gif" alt="NLP Gif" style="width: 300px; height: 150px; object-fit: cover; object-position: center;">
</div>

<h1 style="color: #FF6347;">Data Storage & Retrieval</h1>


<h2 style="color: #FF8C00;">PyPDFLoader</h2>

`PyPDFLoader` is a LangChain document loader (shipped in `langchain_community`) that uses the `pypdf` package to load and parse PDF documents for text processing tasks. It is particularly useful in Retrieval-Augmented Generation (RAG) workflows where text must be extracted from PDFs.

- **What Does PyPDFLoader Do?**
  - Extracts the text of each page and returns it as a `Document` with metadata such as the source file and page number. Visual layout is generally not preserved.
  - Simplifies the preprocessing of document-based datasets.
  - Can load pages lazily, which helps when working with large PDFs.

- **Key Features:**
  - Compatible with popular NLP libraries and frameworks.
  - Handles multi-page PDFs; text inside embedded images requires an optional OCR setup.
  - Provides flexible configurations for structured text extraction.

- **Use Cases:**
  - Preparing PDF documents for retrieval-based systems in RAGs.
  - Automating the text extraction pipeline for document analysis.
  - Creating datasets from academic papers, technical manuals, and reports.


In [2]:
pip install langchain langchain_community pypdf


Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install termcolor langchain_openai langchain-huggingface sentence-transformers chromadb langchain_chroma tiktoken openai python-dotenv


In [16]:
import os
import warnings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


<h3 style="color: #FF8C00;">Loading the Documents</h3>

In [17]:
# File path for the document

file_path = "/Users/Shyam/Desktop/Ironhack-Bootcamp/Week 7/D5/lab-intro-rag/ai-for-everyone.pdf"

<h3 style="color: #FF8C00;">Documents into pages</h3>

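`PyPDFLoader` loads a PDF and splits it into smaller, manageable `Document` objects for NLP tasks. Note that `load_and_split()` applies LangChain's default text splitter when none is passed, so the number of documents returned can differ from the PDF's page count. Use `load()` if you want exactly one document per page.

This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).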


In [18]:
# Load and split the document
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()
len(pages)

297

<h3 style="color: #FF8C00;">Pages into Chunks</h3>


####  RecursiveCharacterTextSplitter in LangChain

The `RecursiveCharacterTextSplitter` is the **recommended splitter** in LangChain when you want to break down long documents into smaller, semantically meaningful chunks. It is especially useful in **RAG pipelines**, where clean context chunks lead to better LLM responses.

####  Parameters

| Parameter       | Description                                                                 |
|-----------------|-----------------------------------------------------------------------------|
| `chunk_size`    | The **maximum number of characters** allowed in a chunk (e.g., `1000`).     |
| `chunk_overlap` | The number of **overlapping characters** between consecutive chunks (e.g., `200`). This helps preserve context continuity. |
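
To make the two parameters concrete, here is a minimal sketch that cuts a string into fixed-size windows sharing a few characters. The helper `sliding_chunks` is hypothetical and deliberately simpler than LangChain's splitters, which split on separators first and only then merge pieces; it only illustrates what "size" and "overlap" mean.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


demo_windows = sliding_chunks("0123456789", chunk_size=4, chunk_overlap=2)
print(demo_windows)
```

Each window repeats the last two characters of the previous one, which is how overlap keeps an idea that straddles a boundary visible in both chunks.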

####  How it works
`RecursiveCharacterTextSplitter` attempts to split the text **intelligently**, trying the following separators in order and moving to the next one only when a piece is still larger than `chunk_size`:
1. Paragraphs (`"\n\n"`)
2. Lines (`"\n"`)
3. Words (`" "`)
4. Individual characters (`""`, as a last resort)

This makes it well suited to **natural language documents**, such as PDFs, articles, or long reports, because it avoids breaking paragraphs and sentences in awkward places whenever the chunk size allows.
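
The separator fallback described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is ignored, and separators are dropped at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text on the coarsest separator that yields pieces <= chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
demo_chunks = recursive_split(demo_text, chunk_size=30)
for chunk in demo_chunks:
    print(repr(chunk))
```

Short paragraphs stay whole, while the long middle paragraph is broken at word boundaries rather than mid-word.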



In [19]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(pages)

len(chunks)

1096

####  Alternative: CharacterTextSplitter

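`CharacterTextSplitter` is a simpler splitter that splits text on a **single separator** (by default `"\n\n"`) and then merges the pieces into chunks of up to `chunk_size` characters. It does not fall back to finer separators, so it makes no further attempt to preserve natural language structure.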

##### Example:
```python
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
```

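This method is simpler, typically a little faster, and more predictable, but it has no fallback: a passage that contains no separator can end up in a chunk longer than `chunk_size`, and a fine-grained separator such as a space can cut text in the middle of a sentence or paragraph. Either outcome can hurt performance in downstream tasks like retrieval or QA.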

---

#### Comparison Table

| Feature                        | RecursiveCharacterTextSplitter | CharacterTextSplitter             |
| ------------------------------ | ------------------------------ | --------------------------------- |
| Structure-aware splitting      | Yes                            | No                                |
| Preserves sentences/paragraphs | Yes, where chunk size allows   | No                                |
| Risk of splitting mid-sentence | Lower                          | Higher                            |
| Ideal for RAG/document QA      | Highly recommended             | Only for uniformly structured text |
| Performance speed              | Slightly slower                | Faster                            |

---

#### Recommendation

Use `RecursiveCharacterTextSplitter` for most real-world document processing tasks, especially when building RAG pipelines or working with structured natural language content like PDFs or articles.

## Best Practices for Choosing Chunk Size in RAG


| Factor                      | Recommendation                                                                                                                                                                                          |
| ---------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **LLM context limit**       | Choose a chunk size that lets you retrieve multiple chunks **without exceeding the model's token limit**. For example, GPT-4o supports 128k tokens, but with GPT-3.5 Turbo (16k) or GPT-4 (8k, or 32k for the extended variant), keep it modest. |
| **Chunk size (in characters)** | Typically **500–1,000 characters** per chunk, roughly 125–250 tokens at about 4 characters per token for English text. This leaves room for several retrieved chunks plus the prompt without context overflow. |
| **Chunk size (in tokens)**  | If using a token-based splitter (e.g. `TokenTextSplitter`), aim for **100–300 tokens** per chunk. |
| **Chunk overlap**           | Use an **overlap of 10–30%** (e.g., 100–300 characters or ~50 tokens) to preserve context across chunk boundaries and avoid cutting off important ideas mid-sentence. |
| **Document structure**      | Use **`RecursiveCharacterTextSplitter`** to preserve semantic boundaries (paragraphs, sentences) instead of arbitrary cuts. |
| **Task type**               | For **question answering**, smaller chunks (~500–800 chars) reduce noise.<br>For **summarization**, slightly larger chunks (~1000–1500 chars) are OK. |
| **Embedding model**         | Some models (e.g., `text-embedding-3-large`) can handle long input. Smaller chunks still give **finer-grained retrieval**, which tends to improve relevance. |
| **Query type**              | If users ask **very specific questions**, small focused chunks are better. For broader queries, bigger chunks might help. |
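
As a rough way to sanity-check these numbers, the sketch below converts a character count into an approximate token count and estimates how many chunks fit into a context window. The constant `CHARS_PER_TOKEN` and the `reserved_tokens` parameter are assumptions for illustration; for exact counts, use a real tokenizer such as `tiktoken`, which this lab installs.

```python
import math

# Rough average for English text; an assumption, not a tokenizer
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars):
    """Approximate token count for a given number of characters."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def max_chunks_that_fit(context_limit, chunk_size_chars, reserved_tokens):
    """How many chunks fit once tokens for the prompt and answer are set aside."""
    available = context_limit - reserved_tokens
    per_chunk = estimate_tokens(chunk_size_chars)
    return max(available // per_chunk, 0)


tokens_per_chunk = estimate_tokens(1000)
chunk_budget = max_chunks_that_fit(
    context_limit=16_000,    # e.g. a 16k-token model
    chunk_size_chars=1000,
    reserved_tokens=5_000,   # hypothetical space for instructions and the answer
)
print(tokens_per_chunk, chunk_budget)
```

With these assumed numbers, a 1,000-character chunk costs about 250 tokens, so retrieving `k=10` chunks stays well inside a 16k window.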


### Rule of Thumb

| Use Case                 | Chunk Size      | Overlap |
| ------------------------| --------------- | ------- |
| Factual Q&A              | 500–800 chars   | 100–200 |
| Summarization            | 1000–1500 chars | 200–300 |
| Technical documents      | 400–700 chars   | 100–200 |
| Long reports/books       | 800–1200 chars  | 200–300 |
| Small LLMs (≤16k tokens) | ≤800 chars      | 100–200 |


### Avoid

- Chunks >2000 characters: risks context overflow.
- No overlap: may lose key information between chunks.



<h2 style="color: #FF8C00;">Embeddings</h2>

Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.

- **What are OpenAI Embeddings?**
  - Pre-trained embeddings like `text-embedding-3-large` generate high-quality vector representations for text.
  - Encapsulate semantic relationships in the text, enabling robust NLP applications.

- **Key Features of `text-embedding-3-large`:**
  - Large-scale embedding model optimized for accuracy and versatility.
  - Handles diverse NLP tasks, including retrieval, classification, and clustering.
  - Ideal for applications with high-performance requirements.

- **Benefits:**
  - Reduces the need for extensive custom training.
  - Provides state-of-the-art performance in retrieval-augmented systems.
  - Works well in RAG pipelines for building context-aware applications.


In [20]:
from langchain_openai import OpenAIEmbeddings

In [21]:
from dotenv import load_dotenv

In [22]:
load_dotenv()

True

In [23]:
api_key = os.getenv("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

<h2 style="color: #FF8C00;">ChromaDB</h2>

ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.

### Workflow Overview:
- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).
- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.
- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.

### Key Features of ChromaDB:
- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.
- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.
- **Integration:** Supports integration with popular frameworks and libraries for embedding generation.
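
The three-step workflow described above (embed, store, search) can be imitated with a toy in-memory store. This is only a sketch under loud assumptions: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and `ToyVectorStore` does a brute-force scan, whereas ChromaDB uses a proper index.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)  # Counter gives 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by similarity."""

    def __init__(self):
        self._items = []

    def add(self, texts):
        # Steps 1 and 2: embed each text and store the vector next to it
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def similarity_search(self, query, k=2):
        # Step 3: embed the query and return the k closest stored texts
        query_vec = toy_embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add([
    "chroma stores embedding vectors",
    "the cat sat on the mat",
    "vectors enable similarity search",
])
toy_results = store.similarity_search("similarity search over vectors", k=2)
print(toy_results)
```

The method name `similarity_search` mirrors the call used later in this lab so the analogy is easy to follow; everything else about the class is illustrative.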

In [24]:
from langchain_community.vectorstores import Chroma

In [25]:
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db_LAB")
print("ChromaDB created with document embeddings.")

ChromaDB created with document embeddings.


<h1 style="color: #FF6347;">Retrieving Documents</h1>


### Exercise 1: Write a user question that someone might ask about your book's topic or content.

In [26]:
user_question = "" # User question
retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve

In [16]:
# Display top results
for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results
    print(f"Document {i+1}:\n{doc.page_content[36:1000]}") # Display content

Document 1:
f Human Communication. Palo Alto, CA: 
Science and Behavior Books.
Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to 
Calculation. San Francisco: W . H. Freeman.
Document 2:
– and when the difference between human 
and machine is affirmed at the cost of their unity that is negated – done so by 
disconnections. The way out is the establishment of a relation through affirm -
ing both the identity of, and the difference between, the two sides – as done by
Document 3:
ne, Not a Camera: How Financial Models Shape 
Markets. (1st edn.). Cambridge, MA: The MIT Press.
Malik, M. M. 2020. A Hierarchy of Limitations in Machine Learning. 
ArXiv:2002.05193 [Cs, Econ, Math, Stat] , February. http://arxiv.org 
/abs/2002.05193.
Marcus, G. 2018. Deep Learning: A Critical Appraisal. ArXiv:1801.00631 [Cs, 
Stat], January. http://arxiv.org/abs/1801.00631.
McQuillan, D. 2015. Algorithmic States of Exception. European Journal  
of Cultural Studies  18 (4–5), 564–576

<h2 style="color: #FF8C00;">Preparing Content for GenAI</h2>

In [17]:
def _get_document_prompt(docs):
    prompt = "\n"
    for doc in docs:
        prompt += "\nContent:\n"
        prompt += doc.page_content + "\n\n"
    return prompt

In [18]:
# Generate a formatted context from the retrieved documents
formatted_context = _get_document_prompt(retrieved_docs)
print("Context formatted for GPT model.")

Context formatted for GPT model.
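
For the retrieved text to influence the answer, it has to appear in the prompt that is sent to the model. A minimal sketch of that assembly step follows; the helper `build_rag_prompt` and the sample strings are hypothetical, and the instruction wording is only one possible choice.

```python
def build_rag_prompt(question, context_blocks):
    """Combine retrieved text and the user's question into one prompt string."""
    context = "\n\n".join(
        f"[Document {i}]\n{block}" for i, block in enumerate(context_blocks, 1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


demo_prompt = build_rag_prompt(
    "What is RAG?",
    [
        "RAG combines retrieval with generation.",
        "Chunks are retrieved by similarity.",
    ],
)
print(demo_prompt)
```

In this notebook the same idea would apply by passing `doc.page_content` for each retrieved document as the context blocks.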


<h2 style="color: #FF8C00;">ChatBot Architecture</h2>

### Exercise 2: Write a prompt that is relevant and tailored to the content and style of your book.

In [20]:
# Include the retrieved context in the prompt; without it the model cannot use the documents
prompt = f"""Based on the provided context from 'AI for Everyone? Critical Perspectives',
please answer the following question with a critical and analytical approach:

Context:
{formatted_context}

Question: How do the authors challenge the notion of technological determinism in AI, 
and what alternative frameworks do they propose for understanding AI's role in society?

Please provide:
1. Key arguments from the text regarding power structures and AI inequalities
2. Specific examples or case studies mentioned in the context
3. Any theoretical frameworks or critical perspectives discussed
4. The implications for policy and governance

Use direct references from the context where possible, and maintain an academic tone 
consistent with critical theory perspectives."""

In [21]:
import openai

### Exercise 3: Tune parameters like temperature and penalties to control how creative, focused, or varied the model's responses are.

In [23]:
# Set up GPT client and parameters
client = openai.OpenAI()
model_params = {
    'model': 'gpt-4o',
    'temperature': 0.7,  # Increase creativity
    'max_tokens': 4000,  # Allow for longer responses
    'top_p': 0.9,        # Use nucleus sampling
    'frequency_penalty': 0.5,  # Reduce repetition
    'presence_penalty': 0.6    # Encourage new topics
}
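
To build intuition for the `temperature` value set above, the sketch below applies the textbook formulation: scores are divided by the temperature before a softmax turns them into probabilities. The logits are made-up numbers, and the exact sampling procedure inside the OpenAI API is not described in this lab, so treat this as an illustration of the general idea only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=1.5)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A low temperature concentrates probability on the top candidate (more focused output), while a higher one spreads it across alternatives (more varied output).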

<h1 style="color: #FF6347;">Response</h1>


In [24]:
messages = [{'role': 'user', 'content': prompt}]
completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)

In [25]:
answer = completion.choices[0].message.content
print(answer)

The authors of "AI for Everyone? Critical Perspectives" challenge the notion of technological determinism by emphasizing that AI does not develop in a vacuum but is deeply embedded within existing social, political, and economic structures. They argue that viewing AI as an autonomous force that inevitably shapes society overlooks the significant influence of human agency, power dynamics, and socio-economic contexts on its development and deployment.

1. **Key Arguments Regarding Power Structures and AI Inequalities:**
   - The text highlights how AI systems often reinforce existing power hierarchies and inequalities rather than disrupt them. This is due to the fact that those who control AI technologies typically come from dominant social groups with specific interests and biases.
   - The authors point out that AI can exacerbate disparities by embedding biases into algorithms, thereby perpetuating discrimination in areas such as hiring practices, law enforcement, and access to resourc

<img src="https://miro.medium.com/v2/resize:fit:824/1*GK56xmDIWtNQAD_jnBIt2g.png" alt="NLP Gif" style="width: 500px">

<h2 style="color: #FF6347;">Cosine Similarity</h2>

**Cosine similarity** is a metric used to measure the alignment or similarity between two vectors, calculated as the cosine of the angle between them. It is the **most common metric used in RAG pipelines** for vector retrieval. It provides a scale from -1 to 1:

- **-1**: Vectors point in exactly opposite directions.
- **0**: Vectors are orthogonal (uncorrelated or unrelated).
- **1**: Vectors point in exactly the same direction (identical up to scale).
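
The three reference points above can be checked with a few lines of plain Python. This is a minimal sketch for small lists of numbers; the function name `cosine_similarity` is chosen here for illustration, and real pipelines would normally rely on a vector database or a numerical library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])   # same direction, different length
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Note that the first pair scores (approximately) 1 even though the vectors have different lengths: cosine similarity compares direction, not magnitude.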


<img src="https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg" alt="NLP Gif" style="width: 700px">

<h2 style="color: #FF6347;">Keyword Highlighting</h2>

Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query.

In [3]:
pip install termcolor

Collecting termcolor
  Using cached termcolor-3.2.0-py3-none-any.whl.metadata (6.4 kB)
Using cached termcolor-3.2.0-py3-none-any.whl (7.7 kB)
Installing collected packages: termcolor
Successfully installed termcolor-3.2.0
Note: you may need to restart the kernel to use updated packages.


In [4]:
from termcolor import colored

The `highlight_keywords` function highlights specific keywords within a given text. It replaces each keyword in the text with a highlighted version using the `colored` function from the `termcolor` library. Because it relies on `str.replace`, matching is case-sensitive: the keyword `power` will not match `Power`.


In [5]:
def highlight_keywords(text, keywords):
    for keyword in keywords:
        text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))
    return text

### Exercise 4: Add your keywords

In [29]:
query_keywords = ["AI", "power", "critical", "technology", "inequality"] # add your keywords
for i, doc in enumerate(retrieved_docs[:1]):
    snippet = doc.page_content[:200]
    highlighted = highlight_keywords(snippet, query_keywords)
    print(f"Snippet {i+1}:\n{highlighted}\n{'-'*80}")

Snippet 1:
Watzlawick, P . 1964. An Anthology of Human Communication. Palo Alto, CA: 
Science and Behavior Books.
Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to 
Calculation. San Francisc
--------------------------------------------------------------------------------


1. `query_keywords` is a list of keywords to be highlighted.
2. The loop iterates over the first document in `retrieved_docs`.
3. For each document, a snippet of the first 200 characters is extracted.
4. The `highlight_keywords` function is called to highlight the keywords in the snippet.
5. The highlighted snippet is printed along with a separator line.
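
If keywords should match regardless of capitalization, one option is a regular expression with `re.IGNORECASE`. The sketch below uses only the standard library and raw ANSI escape codes instead of `termcolor`; the function name `highlight_keywords_ci` and the `start`/`end` parameters are hypothetical.

```python
import re

ANSI_BOLD_GREEN = "\033[1;32m"
ANSI_RESET = "\033[0m"


def highlight_keywords_ci(text, keywords, start=ANSI_BOLD_GREEN, end=ANSI_RESET):
    """Wrap every case-insensitive keyword match in start/end markers."""
    keywords = [k for k in keywords if k]  # ignore empty strings
    if not keywords:
        return text
    # Longest keywords first so a phrase wins over a shorter word it contains
    alternation = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    pattern = re.compile(alternation, re.IGNORECASE)
    # Re-insert the matched text itself so the original casing is preserved
    return pattern.sub(lambda match: f"{start}{match.group(0)}{end}", text)


demo_highlight = highlight_keywords_ci(
    "Computer Power and Human Reason",
    ["power", "reason"],
    start="[",
    end="]",
)
print(demo_highlight)
```

Plain brackets are used in the demo so the result is readable anywhere; leaving the defaults in place prints bold green text in terminals that support ANSI colors.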

<h1 style="color: #FF6347;">Bonus</h1>

**Try loading one of your own PDF books and go through the steps again to explore how the pipeline works with your content**:


In [46]:
"""
Complete RAG Pipeline for Your Own PDF
=======================================
This script walks you through the entire RAG process with your own PDF document.
"""

import os
import warnings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv
from termcolor import colored

warnings.filterwarnings('ignore')

# ============================================================================
# STEP 1: CONFIGURATION
# ============================================================================
print("=" * 80)
print("STEP 1: CONFIGURATION")
print("=" * 80)

# TODO: Replace with your PDF file path
YOUR_PDF_PATH = "/Users/Shyam/Desktop/Ironhack-Bootcamp/Week 7/D5/lab-intro-rag/steel_axle_thesis.pdf"
# Load environment variables (for OpenAI API key)
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    print("⚠️  WARNING: OPENAI_API_KEY not found in environment variables!")
    print("Please create a .env file with your API key or set it as an environment variable.")
else:
    print("✅ OpenAI API key loaded successfully")

# ============================================================================
# STEP 2: LOAD YOUR PDF DOCUMENT
# ============================================================================
print("\n" + "=" * 80)
print("STEP 2: LOADING PDF DOCUMENT")
print("=" * 80)

try:
    loader = PyPDFLoader(YOUR_PDF_PATH)
    pages = loader.load_and_split()
    print(f"✅ Successfully loaded PDF: {YOUR_PDF_PATH}")
    print(f"📄 Total pages: {len(pages)}")
    print(f"\n📖 Sample from first page (first 300 characters):")
    print("-" * 80)
    print(pages[0].page_content[:300] + "...")
    print("-" * 80)
except FileNotFoundError:
    print(f"❌ ERROR: File not found at '{YOUR_PDF_PATH}'")
    print("Please update YOUR_PDF_PATH with the correct path to your PDF.")
    exit(1)
except Exception as e:
    print(f"❌ ERROR loading PDF: {e}")
    exit(1)

# ============================================================================
# STEP 3: SPLIT DOCUMENTS INTO CHUNKS
# ============================================================================
print("\n" + "=" * 80)
print("STEP 3: SPLITTING DOCUMENT INTO CHUNKS")
print("=" * 80)

# Configure the text splitter
CHUNK_SIZE = 450  # Maximum characters per chunk
CHUNK_OVERLAP = 50  # Overlapping characters between chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunks = text_splitter.split_documents(pages)
print(f"✅ Document split into {len(chunks)} chunks")
print(f"⚙️  Chunk size: {CHUNK_SIZE} characters")
print(f"⚙️  Chunk overlap: {CHUNK_OVERLAP} characters")
print(f"\n📝 Sample chunk:")
print("-" * 80)
print(chunks[0].page_content[:400] + "...")
print("-" * 80)

# ============================================================================
# STEP 4: CREATE EMBEDDINGS
# ============================================================================
print("\n" + "=" * 80)
print("STEP 4: CREATING EMBEDDINGS")
print("=" * 80)

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
print("✅ Embeddings model initialized: text-embedding-3-large")

# ============================================================================
# STEP 5: CREATE VECTOR DATABASE (ChromaDB)
# ============================================================================
print("\n" + "=" * 80)
print("STEP 5: CREATING VECTOR DATABASE")
print("=" * 80)

db = Chroma.from_documents(
    chunks, 
    embeddings, 
    persist_directory="./my_custom_chroma_db"
)
print("✅ ChromaDB created successfully")
print(f"💾 Database stored at: ./my_custom_chroma_db")

# ============================================================================
# STEP 6: QUERY YOUR DOCUMENT (RETRIEVAL)
# ============================================================================
print("\n" + "=" * 80)
print("STEP 6: QUERYING YOUR DOCUMENT")
print("=" * 80)

# TODO: Customize these questions based on your PDF content
sample_questions = [
    "What is the vickers hardness?",
    "What is the bending force?",
    "Summarize the main arguments presented."
]

print("Sample questions you can ask:")
for i, q in enumerate(sample_questions, 1):
    print(f"  {i}. {q}")

# Use the first question as an example
user_question = sample_questions[0]
print(f"\n🔍 Searching for: '{user_question}'")

# Retrieve relevant documents
k = 3  # Number of relevant chunks to retrieve
retrieved_docs = db.similarity_search(user_question, k=k)

print(f"✅ Retrieved {len(retrieved_docs)} relevant chunks")

# ============================================================================
# STEP 7: DISPLAY RESULTS WITH KEYWORD HIGHLIGHTING
# ============================================================================
print("\n" + "=" * 80)
print("STEP 7: RETRIEVED RESULTS (with keyword highlighting)")
print("=" * 80)

# TODO: Add keywords relevant to your query
keywords = ["main", "topic", "hardness", "bending force", "microstructure analysis"]

import re

def highlight_keywords(text, keywords):
    """Highlight keywords in text (case-insensitive) using colored output"""
    for keyword in keywords:
        # Case-insensitive match on the literal keyword
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        # Re-insert the matched text itself so the original casing is preserved
        text = pattern.sub(lambda m: colored(m.group(0), 'green', attrs=['bold']), text)
    return text

# Display top 3 results with highlighting
for i, doc in enumerate(retrieved_docs[:3]):
    print(f"\n📄 Result {i+1}:")
    print("-" * 80)
    snippet = doc.page_content[:500]  # Show first 500 characters
    highlighted = highlight_keywords(snippet, keywords)
    print(highlighted)
    if len(doc.page_content) > 500:
        print("... (truncated)")
    print("-" * 80)

# ============================================================================
# STEP 8: PREPARE CONTEXT FOR LLM
# ============================================================================
print("\n" + "=" * 80)
print("STEP 8: PREPARING CONTEXT FOR LLM")
print("=" * 80)

def format_documents_for_llm(docs):
    """Format retrieved documents into a context string for LLM"""
    context = "\n"
    for i, doc in enumerate(docs, 1):
        context += f"\n[Document {i}]:\n"
        context += doc.page_content + "\n"
    return context

context = format_documents_for_llm(retrieved_docs)
print(f"✅ Context prepared ({len(context)} characters)")
print(f"\n💡 This context can now be sent to an LLM along with your question.")

# ============================================================================
# STEP 9: EXAMPLE LLM PROMPT
# ============================================================================
print("\n" + "=" * 80)
print("STEP 9: EXAMPLE LLM PROMPT")
print("=" * 80)

example_prompt = f"""Based on the following context from the document, please answer this question:

Question: {user_question}

Context:
{context[:1000]}... (truncated for display)

Please provide a detailed answer based solely on the information in the context provided."""

print("Example prompt structure:")
print("-" * 80)
print(example_prompt)
print("-" * 80)

# ============================================================================
# SUMMARY & NEXT STEPS
# ============================================================================
print("\n" + "=" * 80)
print("✅ RAG PIPELINE COMPLETED SUCCESSFULLY!")
print("=" * 80)

print("\n📊 Summary:")
print(f"  • PDF loaded: {len(pages)} pages")
print(f"  • Chunks created: {len(chunks)}")
print(f"  • Embeddings model: text-embedding-3-large")
print(f"  • Vector database: ChromaDB")
print(f"  • Retrieved chunks: {k}")

print("\n🎯 Next Steps:")
print("  1. Modify YOUR_PDF_PATH to point to your PDF")
print("  2. Customize the questions in sample_questions")
print("  3. Adjust keywords for highlighting")
print("  4. Connect to an LLM (OpenAI, Claude, etc.) to generate answers")
print("  5. Experiment with different chunk sizes and overlap values")
print("  6. Try different values of k (number of retrieved documents)")

print("\n💡 Tips:")
print("  • For technical documents, use smaller chunks (500-800 chars)")
print("  • For narrative content, use larger chunks (1000-1500 chars)")
print("  • Increase k if you want more context, but watch token limits")
print("  • Use specific keywords from your domain for better highlighting")

print("\n" + "=" * 80)

STEP 1: CONFIGURATION
✅ OpenAI API key loaded successfully

STEP 2: LOADING PDF DOCUMENT
✅ Successfully loaded PDF: /Users/Shyam/Desktop/Ironhack-Bootcamp/Week 7/D5/lab-intro-rag/steel_axle_thesis.pdf
📄 Total pages: 80

📖 Sample from first page (first 300 characters):
--------------------------------------------------------------------------------
Investigation on material properties
of hardened steel axle and influence of
die-casting process and data analysis
Submitted in the partial fulfillment of the requirements for the
award of degree Masters of Science In
Metallic Materials Technology
Submitted by
Shyam sunder Chiliveri
Matriculation Nu...
--------------------------------------------------------------------------------

STEP 3: SPLITTING DOCUMENT INTO CHUNKS
✅ Document split into 397 chunks
⚙️  Chunk size: 450 characters
⚙️  Chunk overlap: 50 characters

📝 Sample chunk:
--------------------------------------------------------------------------------
Investi