# Khipus.ai
## Retrieval Augmented Generation
### Case Study: RAG Pipeline
### LangChain + Azure OpenAI + Pinecone
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

### Retrieval-Augmented Generation (RAG) for question answering using PDF documents


### Note: This notebook requires Python 3.11. You can download it from https://www.python.org/ftp/python/3.11.0/python-3.11.0rc2-amd64.exe


In [7]:
#%pip install -r requirements.txt

### Step 1: Import Dependencies 

In [8]:
# Step 1: Import Dependencies
import os

import openai
# Loaders and splitters are imported from their current packages
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import AzureChatOpenAI
from pinecone import Pinecone, ServerlessSpec

### Step 2: Set Pinecone and Azure OpenAI Environment Variables

In [None]:
# Step 2: Set Pinecone and Azure OpenAI environment variables
# Replace each placeholder with your own value; do not commit real keys.
os.environ["AZURE_OPENAI_API_KEY"] = "YOUR_AZURE_OPENAI_API_KEY"  # key from the Azure OpenAI resource
os.environ["AZURE_OPENAI_API_BASE"] = "YOUR_AZURE_OPENAI_API_BASE"  # endpoint URL of the Azure OpenAI resource
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "text-embedding-ada-002"  # name of the embedding deployment
os.environ["AZURE_OPENAI_API_VERSION"] = "2023-05-15"
os.environ["PINECONE_API_KEY"] = "YOUR_PINECONE_API_KEY"  # key from your Pinecone account

# Keep the same settings on the openai module; Step 4 reads them from here
openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]
openai.api_base = os.environ["AZURE_OPENAI_API_BASE"]
openai.api_type = "azure"
openai.api_version = os.environ["AZURE_OPENAI_API_VERSION"]
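Because the cell above ships with `YOUR_...` placeholders, a forgotten value only surfaces later as an authentication error. The following is a minimal sketch of a guard that fails early instead. The helper name `require_env` and the `DEMO_SETTING` variable are hypothetical and not part of any library used in this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Treat the notebook's "YOUR_..." placeholders as missing values too.
    if not value or value.startswith("YOUR_"):
        raise RuntimeError(f"Environment variable {name} is not set to a real value")
    return value


# Example with a throwaway variable (hypothetical name, for illustration only)
os.environ["DEMO_SETTING"] = "abc123"
print(require_env("DEMO_SETTING"))
```

Calling `require_env("PINECONE_API_KEY")` before connecting would stop the notebook at the point where the configuration is wrong rather than at the first remote call.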



### Step 3: Load your PDF and split into chunks

In [10]:
# Step 3: Load your PDF and split into chunks
pdf_path = "./docs/corollacross_brochure.pdf"  # Adjust the file path if needed
loader = PyPDFLoader(pdf_path)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} document(s) and split into {len(docs)} chunks.")

Loaded 23 document(s) and split into 78 chunks.
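To make the `chunk_size` and `chunk_overlap` parameters more concrete, here is a simplified sketch of overlapping chunking on a short string. It uses a fixed-size sliding window, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` itself prefers to split on separators such as paragraphs and line breaks, so its chunks will generally differ from these. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so text that straddles a chunk boundary still appears whole in at least one chunk.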


### Step 4: Initialize the Azure OpenAI embeddings object using LangChain.

In [11]:
# Step 4: Initialize the Azure OpenAI embeddings object using LangChain.
embeddings = AzureOpenAIEmbeddings(
    openai_api_key=openai.api_key,
    azure_endpoint=openai.api_base,  
    openai_api_version=openai.api_version,
    deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"]
)

### Step 5: Connect to the Pinecone client and list all available indexes

In [None]:

# Read the Pinecone API key set in Step 2
api_key = os.environ["PINECONE_API_KEY"]

# Create an instance of the Pinecone class using the new API

pc = Pinecone(api_key=api_key)

# List indexes to check connectivity
print("Available indexes:", pc.list_indexes().names())


Available indexes: ['langchain-demo2', 'langchain-demo', 'assignment4']


### Step 6: Create the index if it does not already exist

In [13]:
index_name = "langchain-demo2"

# Create the index only if it does not already exist
if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-ada-002 produces 1536-dimensional vectors
        metric="cosine",  # cosine is one of the most common distance metrics for text embeddings such as text-embedding-ada-002
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
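The `metric="cosine"` setting means vectors are compared by the angle between them rather than by their length. Below is a minimal pure-Python sketch of that computation on toy 3-dimensional vectors. The vectors are made up for illustration; real `text-embedding-ada-002` embeddings have 1536 components, and Pinecone performs this comparison server-side.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        # Same reason the index dimension must match the embedding model's output size
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # longer, but points the same way
orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

print(round(cosine_similarity(query_vec, same_direction), 6))  # 1.0
print(round(cosine_similarity(query_vec, orthogonal), 6))      # 0.0
```

The length check mirrors why `dimension=1536` has to agree with the embedding model: vectors of different sizes cannot be compared.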

### Step 7: Create and store embeddings using the PineconeVectorStore

In [14]:
# Create and store embeddings using the PineconeVectorStore

index_name = "langchain-demo2"

vectorstore = PineconeVectorStore(
    index_name=index_name, 
    embedding=embeddings
    )

# Assuming 'docs' contains your document chunks
vectorstore.add_documents(docs)

print("Embeddings have been successfully stored in Pinecone!")

Embeddings have been successfully stored in Pinecone!
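Re-running the cell above may store the same chunks a second time if new IDs are generated on every call, which is my understanding of the default behavior when no IDs are supplied. One way to make ingestion repeatable is to derive a deterministic ID from each chunk, since a Pinecone upsert with an existing ID overwrites that vector instead of adding another. The sketch below shows only the ID derivation. The helper name `stable_chunk_id` is hypothetical, and passing `ids=` to `add_documents` is an assumption to verify against the installed `langchain_pinecone` version.

```python
import hashlib


def stable_chunk_id(source: str, page: int, text: str) -> str:
    """Derive a deterministic ID from a chunk's source, page number, and content."""
    payload = f"{source}|{page}|{text}".encode("utf-8")
    # A 32-character hex prefix of SHA-256 is ample for a small demo index
    return hashlib.sha256(payload).hexdigest()[:32]


# Toy values for illustration; in the notebook they would come from
# doc.metadata["source"], doc.metadata["page"], and doc.page_content.
first = stable_chunk_id("brochure.pdf", 2, "example chunk text")
again = stable_chunk_id("brochure.pdf", 2, "example chunk text")
print(first == again)  # True: the same chunk always maps to the same ID

# Assumed usage (not verified here):
# ids = [stable_chunk_id(d.metadata["source"], d.metadata["page"], d.page_content) for d in docs]
# vectorstore.add_documents(docs, ids=ids)
```

Because the source path is part of the ID, loading the same file from a different path would still produce different IDs.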


### Step 8: Initialize the LLM and QA chain, then run a similarity search to answer a question

In [15]:
# Step 8: Initialize the LLM and load the QA chain

# Initialize the language model using Azure Chat OpenAI
llm = AzureChatOpenAI(
    temperature=0,
    openai_api_base=os.environ["AZURE_OPENAI_API_BASE"],
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    openai_api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-10-21"),
    deployment_name=os.environ.get("AZURE_OPENAI_GPT4_MODEL_NAME", "gpt-4o")
)

# Load the QA chain
chain = load_qa_chain(llm, chain_type="stuff")

Output (deprecation warnings, truncated in this export). The warnings point at `llm = AzureChatOpenAI(` and `chain = load_qa_chain(llm, chain_type="stuff")` and list these migration guides by chain type:
stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag

In [16]:

# Define your query
query = "What is the engine size of the Toyota Corolla Cross?"
# Alternative query: "What is the estimated fuel efficiency of the Corolla Cross Hybrid?"

# Retrieve the chunks most similar to the query from the vector store
# (a separate name keeps the Step 3 chunks in `docs` intact)
matched_docs = vectorstore.similarity_search(query)

# Print each retrieved chunk's metadata (source file, page number, etc.)
for doc in matched_docs:
    print("Metadata:", doc.metadata)

# Pass the retrieved chunks and the question to the QA chain to get the answer
result = chain.run(input_documents=matched_docs, question=query)

print(f"Answer: \n\n{result}")


Metadata: {'creationdate': '2024-01-17T14:56:22-05:00', 'creator': 'Adobe InDesign 18.5 (Macintosh)', 'moddate': '2024-01-17T14:57:41-05:00', 'page': 2.0, 'page_label': '3', 'producer': 'Adobe PDF Library 17.0', 'source': './corollacross_brochure.pdf', 'total_pages': 23.0, 'trapped': '/False'}
Metadata: {'creationdate': '2024-01-17T14:56:22-05:00', 'creator': 'Adobe InDesign 18.5 (Macintosh)', 'moddate': '2024-01-17T14:57:41-05:00', 'page': 2.0, 'page_label': '3', 'producer': 'Adobe PDF Library 17.0', 'source': './docs/corollacross_brochure.pdf', 'total_pages': 23.0, 'trapped': '/False'}
Metadata: {'creationdate': '2024-01-17T14:56:22-05:00', 'creator': 'Adobe InDesign 18.5 (Macintosh)', 'moddate': '2024-01-17T14:57:41-05:00', 'page': 1.0, 'page_label': '2', 'producer': 'Adobe PDF Library 17.0', 'source': './docs/corollacross_brochure.pdf', 'total_pages': 23.0, 'trapped': '/False'}
Metadata: {'creationdate': '2024-01-17T14:56:22-05:00', 'creator': 'Adobe InDesign 18.5 (Macintosh)', 'mo


Answer: 

The Toyota Corolla Cross features a 2.0-liter engine.
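The `chain_type="stuff"` option used above places all retrieved chunks into a single prompt and sends it to the model in one call. The following is a rough sketch of that idea in plain Python. The prompt wording, the function name `build_stuff_prompt`, and the sample chunk strings are all hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate every retrieved chunk into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk text, standing in for the page_content of retrieved documents
retrieved = ["Text of retrieved chunk A.", "Text of retrieved chunk B."]
prompt = build_stuff_prompt(retrieved, "What is the engine size?")
print(prompt)
```

Since every chunk goes into one prompt, this approach works only while the combined text fits in the model's context window. The other chain types listed in the migration guides (`map_reduce`, `refine`, `map_rerank`) process chunks in several calls instead.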
