# 2-step RAG
The initial code is mostly based on the LangChain documentation:
https://docs.langchain.com/oss/python/langchain/rag#rag-chains

## Load
The provided PDF is a scan, so its text cannot be read directly and split into chunks. I need OCR to extract the text first.

In [24]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# OCR the scanned pages to extract their text
loader = UnstructuredPDFLoader(
    "../data/raw/exhibit_1.pdf",
    mode="single",  # "elements" was tried first, but it returned captions instead of the article text
    strategy="ocr_only",  # force OCR
)
raw_documents = loader.load()



## Chunking
Split the 11 pages of extracted text into smaller, overlapping chunks.

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Previously chunk_size=500, chunk_overlap=100
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(raw_documents)

# Report how many chunks were produced
print(f"You now have {len(chunks)} pieces of text ready for your collection.")

You now have 31 pieces of text ready for your collection.
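
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on separators such as paragraphs and sentences. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each piece repeats the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.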


## Database
Store the chunks in a Chroma collection, which embeds and indexes them for similarity search.

In [26]:
# Vector store setup
import chromadb

chroma_client = chromadb.Client()  # in-memory client
collection = chroma_client.get_or_create_collection(name="thesis_project")

# Chroma needs a unique ID and the text itself for every chunk
idsList = [f"id_{index}" for index in range(len(chunks))]
docsList = [doc.page_content for doc in chunks]

collection.add(ids=idsList, documents=docsList)
print(f"Successfully indexed {collection.count()} chunks into ChromaDB.")

Successfully indexed 175 chunks into ChromaDB.
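
The count reported here (175) is larger than the 31 chunks produced by the splitter. A likely explanation, which is an assumption rather than something verified in this notebook, is that the in-memory collection still held entries from an earlier run with smaller chunks, because `get_or_create_collection` reuses an existing collection. One way to avoid this is to drop the collection before re-indexing. The sketch below shows that idea; `rebuild_collection` is a hypothetical helper, and `client` is expected to be a Chroma client such as `chroma_client`.

```python
def rebuild_collection(client, name: str, texts: list[str]):
    """Recreate a collection from scratch so no entries from earlier runs survive."""
    try:
        client.delete_collection(name=name)
    except Exception:
        # Assumption: the client raises when the collection does not exist yet.
        # The exact exception type differs between versions, so catch broadly.
        pass
    collection = client.get_or_create_collection(name=name)
    if texts:  # skip the call when there is nothing to index
        ids = [f"id_{index}" for index in range(len(texts))]
        collection.add(ids=ids, documents=texts)
    return collection
```

Calling `rebuild_collection(chroma_client, "thesis_project", docsList)` in place of the `get_or_create_collection` and `add` calls should leave `collection.count()` equal to `len(docsList)`.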


## Retrieval

In [27]:
query = "What is Zostera marina?" # Example question

# Ask Chroma to find the 5 most relevant chunks
results = collection.query(
    query_texts=[query],
    n_results=5
)

# This is your retrieved context
retrieved_context = " ".join(results['documents'][0])
print("Retrieved Context:", retrieved_context[:200], "...")

Retrieved Context: Zostera marina Linnaeus, Figs 53-59 Figs 53-59. Zostera marina. 53. Apex with seven veins (Ve). 54. Surface of leaf. 55. Surface of seed coat. 56. Longitudinal section of seed coat. 57. Transverse sec ...
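
Chroma returns one list of hits per query text, which is why the code indexes `results['documents'][0]`. As a small sketch of that structure, the snippet below uses a hand-written dictionary in place of a real query result (the texts and distances are made up for illustration) and a hypothetical helper, `format_context`, that numbers each chunk instead of joining them with a single space.

```python
def format_context(results: dict, query_index: int = 0) -> str:
    """Build a numbered context string from the hits of one query."""
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    parts = []
    for rank, (text, distance) in enumerate(zip(documents, distances), start=1):
        parts.append(f"[{rank}] (distance {distance:.2f}) {text}")
    return "\n\n".join(parts)


# Hand-written stand-in for a query result; values are illustrative only
mock_results = {
    "ids": [["id_3", "id_7"]],
    "documents": [["First chunk.", "Second chunk."]],
    "distances": [[0.21, 0.48]],
}
context = format_context(mock_results)
print(context)
```

Keeping chunks visibly separated makes it easier to see which retrieved passage an answer draws on, and the distances give a rough sense of how relevant each hit is.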


## Generation
I use the OpenAI API with gpt-5-nano to generate an answer from the query and the retrieved context.



In [None]:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load variables from a .env file, if one is found, into the environment
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

prompt = f"""Answer the question based on the context below.

Context: {retrieved_context}

Question: {query}
Answer:"""

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)


OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
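
This error means `OPENAI_API_KEY` was not in the environment when the client was created, for example because `load_dotenv()` did not find a `.env` file or the file does not define that variable. A minimal sketch of failing early with a clearer message follows; `require_env` is a hypothetical helper, not part of the `openai` or `python-dotenv` packages, and `EXAMPLE_KEY` is a placeholder used only for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Define it in a .env file or export it "
            "before starting the notebook."
        )
    return value


# Demonstration with a placeholder variable, not a real key
os.environ["EXAMPLE_KEY"] = "dummy-value"
print(require_env("EXAMPLE_KEY"))
```

In the cell above, this could be used as `client = OpenAI(api_key=require_env("OPENAI_API_KEY"))` after the `load_dotenv()` call.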