# Khipus.ai
## Retrieval Augmented Generation
### RAG Assignment 4
### LangChain + Azure OpenAI + Pinecone
### Name: (Your Name here)
<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>


### Note: This notebook requires Python 3.11. You can download the Windows installer from https://www.python.org/ftp/python/3.11.0/python-3.11.0rc2-amd64.exe


### Retrieval-Augmented Generation (RAG) for question answering using PDF documents

In [1]:
#%pip install -r requirements.txt

### Step 1: Import Dependencies 

In [2]:
# Step 1: Import Dependencies
import os

import openai
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import AzureChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

### Step 2: Read Pinecone and Azure OpenAI Environment Variables
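The cell below assigns placeholder values directly to `os.environ`. If you prefer to set the variables outside the notebook, a minimal sketch of reading them and failing early when one is missing could look like this. The helper name `require_env` is hypothetical, and the demo value is a placeholder so the sketch runs on its own.

```python
import os


def require_env(names: list[str]) -> dict[str, str]:
    """Return the values of the named environment variables, or raise if any are unset."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Demo only: provide a placeholder so this sketch runs without real credentials
os.environ.setdefault("PINECONE_API_KEY", "placeholder-key")

settings = require_env(["PINECONE_API_KEY"])
print(sorted(settings))
```

Failing at this point gives a clearer error than a later authentication failure inside the Pinecone or Azure OpenAI client.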

In [None]:
# Step 2: Set the Pinecone and Azure OpenAI environment variables, then read them back
os.environ["AZURE_OPENAI_API_KEY"] = "YOUR_AZURE_OPENAI_KEY"  # key from the Azure OpenAI resource
os.environ["AZURE_OPENAI_API_BASE"] = "YOUR_AZURE_OPENAI_ENDPOINT"  # e.g. https://azure-openai-<your-resource-name>.openai.azure.com/
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "text-embedding-ada-002"  # name of the embeddings deployment
os.environ["AZURE_OPENAI_API_VERSION"] = "2023-05-15"
os.environ["PINECONE_API_KEY"] = "YOUR_PINECONE_API_KEY"  # key from your Pinecone account

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_API_BASE = os.environ["AZURE_OPENAI_API_BASE"]
AZURE_OPENAI_DEPLOYMENT = os.environ["AZURE_OPENAI_DEPLOYMENT"]
AZURE_OPENAI_API_VERSION = os.environ["AZURE_OPENAI_API_VERSION"]


openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]
openai.api_base = os.environ["AZURE_OPENAI_API_BASE"]
openai.api_type = "azure"
openai.api_version = os.environ["AZURE_OPENAI_API_VERSION"]



### Step 3: Load your PDF and split into chunks

In this step, you will load `employee_handbook.pdf` and split it into smaller chunks. Each chunk is small enough to be embedded on its own, and the resulting embeddings are what we store in the vector database.

To use your own employee handbook:
1. Place your PDF file in the `./docs/` directory
2. Update the `pdf_path` variable in the next cell to point to your file
3. The handbook will be automatically chunked into segments of 1,000 characters with a 100-character overlap
4. These chunks will later be embedded and stored in Pinecone for retrieval

This approach allows you to query information from the employee handbook using natural language questions in the final step.
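
To make the chunk size and overlap concrete, here is a minimal sketch of fixed-size character chunking in plain Python. It is a simplified sliding window, not the `RecursiveCharacterTextSplitter` algorithm, which also tries to break on separators such as paragraphs and sentences. The function name `chunk_text` and the sample text are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample: 2,500 characters of repeating letters
sample = "".join(chr(97 + i % 26) for i in range(2500))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 700]
```

The last 100 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk.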

In [None]:
# Your code here
# Step 3: Load your PDF and split into chunks


Loaded 11 document(s) and split into 23 chunks.


### Step 4: Initialize the Azure OpenAI embeddings object using LangChain.

In [5]:
# Step 4: Initialize the Azure OpenAI embeddings object using LangChain.
embeddings = AzureOpenAIEmbeddings(
    openai_api_key=openai.api_key,
    azure_endpoint=openai.api_base,  
    openai_api_version=openai.api_version,
    deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"]
)

### Step 5: Connect to the Pinecone client and create the index if it doesn't exist

In [None]:

# Replace these values as needed
api_key = PINECONE_API_KEY
index_name = "assignment4"

# Create an instance of the Pinecone class using the new API

pc = Pinecone(api_key=api_key)

# List indexes to check connectivity
print("Available indexes:", pc.list_indexes().names())

# Create (or connect to) your Pinecone index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"  
        )
    )
    print(f"Created index: {index_name}")
else:
    print(f"Index '{index_name}' already exists.")




Available indexes: ['langchain-demo2']
Created index: assignment4


### Step 6: Create and store embeddings using the PineconeVectorStore

In [None]:
# Step 6: Create and store embeddings using the PineconeVectorStore and call the result "vectorstore"



Embeddings have been successfully stored in Pinecone!


### Step 7: Perform a similarity search and retrieve the most relevant documents
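The index in Step 5 is created with `metric="cosine"`, so a similarity search ranks stored chunks by the cosine similarity between their embedding and the query embedding. The sketch below illustrates that ranking in plain Python with toy 3-dimensional vectors. It is not how Pinecone implements search internally; the names `cosine_similarity`, `top_k`, and the chunk ids are hypothetical, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k document ids with the highest cosine similarity to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration; the real embeddings in this notebook have 1536 dimensions
doc_vecs = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], doc_vecs, k=2)
print([doc_id for doc_id, _ in results])  # ['chunk-a', 'chunk-b']
```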

In [9]:
# Step 7: Perform a similarity search and retrieve the most relevant documents

# Initialize the language model using Azure Chat OpenAI
llm = AzureChatOpenAI(
    temperature=0,
    openai_api_base=AZURE_OPENAI_API_BASE,
    openai_api_key=AZURE_OPENAI_API_KEY,
    openai_api_version=AZURE_OPENAI_API_VERSION,
    deployment_name=os.environ.get("AZURE_OPENAI_GPT4_MODEL_NAME", "gpt-4o")
)







### Step 8: Load the QA chain and run the query
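The cell below loads the chain with `chain_type="stuff"`, which means all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The following sketch shows that idea in plain Python. The prompt wording and the helper `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template; the chunk texts are placeholders.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one prompt (illustrative wording only)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks; in the notebook these would come from the vector store
retrieved = [
    "First retrieved chunk of the handbook.",
    "Second retrieved chunk of the handbook.",
]
prompt = build_stuff_prompt("How often are employee performance reviews conducted?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep the chunks small and retrieve only a few of them.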

In [None]:
# Step 8: Load the QA chain and run the query
# Load the QA chain
chain = load_qa_chain(llm, chain_type="stuff")

# Define your query
query = "What is Contoso Electronics’ mission statement?"
#How often are employee performance reviews conducted?

# Retrieve similar documents from the vector store (removed include_metadata)
# Answer here

# Get the answer from the t
# Answer here



### Step 9: Test Your RAG System

Test your Retrieval-Augmented Generation system with the following questions. For each question, set the `query` variable to that question, run the Step 8 code again, and check how well your system retrieves and answers based on the document content:

1. How can employees report unethical or illegal conduct confidentially?
2. How often are employee performance reviews conducted?
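
As an alternative to editing `query` by hand, the questions can be run in a loop. The sketch below assumes you wrap your own retrieval and QA code in a function that takes a question and returns an answer string. `run_questions` and `fake_answer` are hypothetical names, and `fake_answer` is only a stand-in so the sketch runs without Pinecone or Azure OpenAI.

```python
from typing import Callable


def run_questions(questions: list[str], answer_fn: Callable[[str], str]) -> dict[str, str]:
    """Run each question through answer_fn and collect the answers in order."""
    return {question: answer_fn(question) for question in questions}


questions = [
    "How can employees report unethical or illegal conduct confidentially?",
    "How often are employee performance reviews conducted?",
]


# Stand-in for your own function that retrieves documents and calls the QA chain
def fake_answer(query: str) -> str:
    return f"(answer for: {query})"


answers = run_questions(questions, fake_answer)
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```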


In [None]:
# Your code here
# How can employees report unethical or illegal conduct confidentially?

In [None]:
# Your code here
# How often are employee performance reviews conducted?