### Cell 1: Import Libraries

## Libraries Overview:

1. **os**: A standard library that allows interaction with the operating system, such as environment variable access.

2. **dotenv**: Used to load environment variables from a `.env` file. This is useful for managing sensitive information like API keys.

3. **AzureOpenAI (from openai)**: The client class from the `openai` package for the Azure OpenAI service. It communicates with Azure-hosted OpenAI models, such as `GPT-4` and embedding models.

4. **langchain.document_loaders.PyPDFLoader**: A LangChain utility for loading PDF files into a format that can be used for further text processing.

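5. **langchain.text_splitter.RecursiveCharacterTextSplitter**: A LangChain utility that splits long documents into smaller chunks by trying a sequence of separators (paragraph breaks, line breaks, spaces, then individual characters), with a configurable overlap between consecutive chunks so that context is preserved across chunk boundaries.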

6. **langchain.vectorstores.faiss.FAISS**: LangChain's vector store wrapper around a FAISS index. It stores document embeddings and supports fast similarity-based retrieval.

7. **langchain_openai.AzureOpenAIEmbeddings**: LangChain's embeddings client for Azure OpenAI. It implements the `Embeddings` base interface defined in `langchain_core.embeddings`.

8. **langchain.docstore.InMemoryDocstore**: Stores documents in memory for quick retrieval.

9. **faiss**: A fast library for nearest neighbor search in high-dimensional vector spaces, crucial for creating vector stores.

10. **numpy**: A scientific computing library used here for handling arrays and matrices, particularly embedding vectors.


In [1]:
import os
from dotenv import load_dotenv
from openai import AzureOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain_openai import AzureOpenAIEmbeddings
from langchain.docstore import InMemoryDocstore
import faiss
import numpy as np



In [30]:
load_dotenv('variables.env')
# Azure OpenAI configuration
azure_openai_api_key = os.getenv('AZURE_OPENAI_API_KEY')
azure_openai_api_version = os.getenv('AZURE_OPENAI_API_VERSION')
azure_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
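
Because `os.getenv` returns `None` for a missing variable, a misnamed entry in `variables.env` would only surface later, for example as an authentication error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of `os` or `dotenv`.

```python
import os

def require_env(names):
    """Return {name: value} for the given variables, failing fast if any is unset or empty."""
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values
```

In this notebook it could be called as `require_env(["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT"])` right after `load_dotenv`.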

In [31]:
# Confirm the configuration loaded without printing the secret key itself
print("API key loaded:", bool(azure_openai_api_key))
print(azure_openai_api_version)
print(azure_endpoint)

API key loaded: True
2024-02-15-preview
https://home-openai-01.openai.azure.com/


## Custom Embeddings Class:

- This custom class **`CustomAzureEmbeddings`** is used to embed documents and queries using Azure OpenAI's embedding model.

### Class Structure:
- **`__init__` Method**: 
   - Initializes the class and sets up the Azure OpenAI client by providing an API key, API version, and the Azure endpoint.
   - The class interacts with Azure OpenAI's embeddings service to generate document embeddings.
  
- **`embed_documents` Method**: 
   - Takes a list of texts as input and returns embeddings for each text.
  
- **`embed_query` Method**:
   - Embeds a single text (query) by calling Azure OpenAI’s `embeddings.create` method.
   - Uses the `text-embedding-ada-002` model to generate embeddings.

#### List comprehension
`embed_documents` is written as a list comprehension, a concise Python construct that builds a new list by applying an expression to each element of an iterable (such as a list or range), optionally keeping only the elements that satisfy a condition. The general form is `[expression for item in iterable if condition]`.
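
As a small illustration of the list comprehension form described above, the sketch below computes the length of each non-empty string in a made-up list and compares the result with the equivalent explicit loop. The sample data is hypothetical.

```python
texts = ["first chunk", "", "second chunk"]

# Expression (len) applied to every item, keeping only non-empty strings.
lengths = [len(text) for text in texts if text]

# Equivalent explicit loop.
lengths_loop = []
for text in texts:
    if text:
        lengths_loop.append(len(text))

print(lengths)  # [11, 12]
```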


In [2]:
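from langchain_core.embeddings import Embeddings

class CustomAzureEmbeddings(Embeddings):
    """Embeds texts through the Azure OpenAI embeddings endpoint."""

    def __init__(self, api_key, api_version, azure_endpoint):
        # Use the raw AzureOpenAI client, which exposes `embeddings.create`
        self.client = AzureOpenAI(
            api_key=api_key,
            api_version=api_version,
            azure_endpoint=azure_endpoint
        )

    def embed_documents(self, texts):
        # Embed each text individually and collect the vectors
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text):
        response = self.client.embeddings.create(
            model="text-embedding-ada-002",  # On Azure, this is the deployment name
            input=text
        )
        return response.data[0].embedding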

## Encode PDF and Create FAISS Vector Store:

This function loads a PDF document, processes it into smaller chunks, embeds the chunks, and stores the embeddings in a FAISS index for efficient document retrieval.

### Detailed Steps:
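1. **Azure OpenAI Configuration**:
   - Reads the API key, API version, and endpoint from the variables loaded earlier.
   
2. **Embeddings Client Initialization**:
   - Initializes the embeddings client for Azure OpenAI’s embedding model. The code below uses LangChain's built-in `AzureOpenAIEmbeddings` rather than `CustomAzureEmbeddings`; both provide `embed_documents` and `embed_query`.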

3. **Load and Process PDF**:
   - Loads the PDF document using `PyPDFLoader`.
   - Splits the document into smaller chunks (with overlapping sections) using `RecursiveCharacterTextSplitter`.

4. **Generate Embeddings**:
   - Embeds the text chunks using the custom embeddings class.
   - Converts embeddings into a NumPy array for further processing.

5. **Create FAISS Index**:
   - Initializes a FAISS index with the dimensionality of the embeddings.
   - Adds the embeddings to the index.

6. **In-Memory Document Store**:
   - Uses `InMemoryDocstore` to store the original documents and maps the document embeddings in the FAISS index to the corresponding document IDs.

7. **Return Vector Store**:
   - Returns the FAISS vector store, which can now be used for retrieval based on similarity.
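
To make the role of `chunk_size` and `chunk_overlap` in step 3 concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is an assumption-laden stand-in: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraph and line breaks, so its actual chunks will differ.

```python
def chunk_text(text, chunk_size=300, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 200 on a chunk size of 300, consecutive chunks share up to two thirds of their text, so neighbouring chunks are likely to look very similar at retrieval time.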


In [7]:

def encode_pdf(path, chunk_size=300, chunk_overlap=200):
    """Load a PDF, split it into overlapping chunks, embed them, and return a FAISS vector store."""
    try:
        # Initialize LangChain's Azure OpenAI embeddings client
        embeddings_client = AzureOpenAIEmbeddings(
            api_key=azure_openai_api_key,
            api_version=azure_openai_api_version,
            azure_endpoint=azure_endpoint
        )

        # Load and process the PDF
        loader = PyPDFLoader(path)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        texts = text_splitter.split_documents(documents)
        text_list = [doc.page_content for doc in texts]

        # Generate embeddings
        embeddings = embeddings_client.embed_documents(text_list)
        embeddings_array = np.array(embeddings, dtype=np.float32)  # FAISS works with float32 vectors
        print(f"Document embeddings shape: {embeddings_array.shape}")

        # Create FAISS index
        dimension = embeddings_array.shape[1]
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings_array)

        print(f"FAISS index dimension: {index.d}")
        print(f"Number of vectors in FAISS index: {index.ntotal}")

        # Create InMemoryDocstore and index mapping
        docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(texts)})
        index_to_docstore_id = {i: str(i) for i in range(len(texts))}

        # Create FAISS vector store
        vectorstore = FAISS(
            embedding_function=embeddings_client.embed_query,
            index=index,
            docstore=docstore,
            index_to_docstore_id=index_to_docstore_id,
        )

        return vectorstore

    except Exception as e:
        print(f"An error occurred: {e}")
        raise



## Main Execution for Vector Store Creation:

This section defines the path to the PDF file and calls the `encode_pdf` function to create a FAISS vector store from the PDF content. If any errors occur during the process, they are caught and printed.


In [8]:
# Path to your PDF
pdf_path = './data/Understanding_Climate_Change.pdf'

# Encode the PDF and create the vector store
try:
    vectorstore = encode_pdf(pdf_path)
    print("Vector store created successfully.")
except Exception as e:
    print(f"An error occurred during vector store creation: {e}")

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


Document embeddings shape: (738, 1536)
FAISS index dimension: 1536
Number of vectors in FAISS index: 738
Vector store created successfully.


## Document Retrieval:

This part uses the FAISS vector store to retrieve the most relevant documents based on the user’s query.

### Steps:
1. **Create a Retriever**: 
   - A retriever is created from the vector store using `as_retriever()`.
   - `search_kwargs={"k": 5}` specifies that the top 5 relevant documents should be retrieved.

2. **Define Query**: 
   - The query asks about the main cause of climate change.

3. **Retrieve Documents**:
   - The retriever finds the top 5 relevant documents and concatenates their content into a `context`.
   - If an error occurs during retrieval, it is caught and printed.
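
Conceptually, the top-`k` lookup behind the retriever is an exhaustive nearest-neighbour search: a flat L2 index compares the query vector against every stored vector and keeps the closest ones. The sketch below reproduces that idea in plain Python on toy 2-D vectors; `top_k_l2` is a hypothetical helper, and the real index performs the same comparison far more efficiently on 1536-dimensional embeddings.

```python
def top_k_l2(query, vectors, k=5):
    """Return (index, squared L2 distance) pairs for the k vectors closest to query."""
    distances = [
        (i, sum((q - v) ** 2 for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]

# Toy 2-D "embeddings" with the same id mapping style as index_to_docstore_id.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
index_to_docstore_id = {i: str(i) for i in range(len(vectors))}

hits = top_k_l2([0.9, 1.2], vectors, k=2)
doc_ids = [index_to_docstore_id[i] for i, _ in hits]
print(doc_ids)  # ['1', '0']
```

The position of each hit in the index is translated to a document id through the mapping, which is how the vector store finds the original chunk text in the docstore.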


In [9]:
# Create a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Define a query
query = "What is the main cause of climate change?"

# Retrieve relevant documents
try:
    # `invoke` replaces the deprecated `get_relevant_documents`
    docs = retriever.invoke(query)
    context = "\n\n".join(doc.page_content for doc in docs)
    print("\nRetrieved context:")
    print(context)
except Exception as e:
    print(f"An error occurred during document retrieval: {e}")



Retrieved context:
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is  essential

Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous

driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the

provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission

## Generate Answer Using Azure OpenAI:

This section calls an Azure OpenAI `gpt-4o` chat model deployment to generate an answer to the user’s query based on the retrieved document context.

### Steps:
1. **Initialize Azure OpenAI Chat Client**: 
   - The chat client is initialized using the Azure OpenAI API key, version, and endpoint.

2. **Create Chat Completion Request**: 
   - A completion request is made to Azure OpenAI, providing it with the retrieved context and user’s query.
   - The system message sets the assistant's behavior as helpful.

3. **Generate Answer**:
   - The `gpt-4o` model generates an answer to the query based on the context.
   - If an error occurs, it is caught and printed.
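
The request in step 2 boils down to two messages: a system instruction and a user turn that embeds the retrieved context ahead of the question. A minimal sketch of that assembly is shown below; `build_messages` is a hypothetical helper, and the sample context and question are made up for illustration.

```python
def build_messages(context, query, system_prompt="You are a helpful assistant."):
    """Assemble the chat messages: a system instruction plus a context-grounded user turn."""
    user_content = (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer the question based on the provided context."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

messages = build_messages("Greenhouse gases trap heat.", "What traps heat?")
print(messages[1]["content"])
```

Keeping the prompt assembly in one function makes it easier to inspect exactly what text the model receives before any request is sent.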


In [10]:

# Initialize the Azure OpenAI chat client
chat_client = AzureOpenAI(
    api_key=azure_openai_api_key,
    api_version=azure_openai_api_version,  # Ensure this version supports chat completions
    azure_endpoint=azure_endpoint
)

# Generate an answer using the chat model
try:
    response = chat_client.chat.completions.create(
        model="gpt-4o",  # Replace with your chat model deployment name
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer the question based on the provided context."}
        ]
    )
    print("\nAnswer:")
    print(response.choices[0].message.content)
except Exception as e:
    print(f"An error occurred during answer generation: {e}")



Answer:
The main cause of recent climate change is the increase in greenhouse gases in the atmosphere, primarily driven by human activities, particularly the emission of greenhouse gases such as carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O).


https://bitpeak.com/chunking-methods-in-rag-methods-comparison/