# Semantic Chunking 
## Overview
This notebook implements a semantic chunking approach for processing and retrieving information from PDF documents. Unlike traditional methods that split text at fixed character or word counts, semantic chunking groups neighbouring sentences whose embeddings are similar, producing more meaningful, context-aware text segments.

## Motivation
Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.
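
To make the problem concrete, here is a minimal sketch of fixed-size splitting. The helper `fixed_size_chunks` and the sample sentence are illustrative only and are not part of the notebook; the point is that the cut positions depend on character counts, so a chunk can end in the middle of a word.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative sample text (not taken from the PDF used later)
sample = "Greenhouse gases trap heat. Sea levels are rising."
chunks = fixed_size_chunks(sample, 20)

for chunk in chunks:
    # repr() makes the mid-word cut points visible
    print(repr(chunk))
```

The first chunk stops inside the word "trap", which is exactly the kind of arbitrary break that semantic chunking tries to avoid.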

## Key Components
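1. PDF processing and text extraction with PyMuPDF
2. Semantic chunking with a custom `semantic_split` function based on embedding similarity (a hand-rolled alternative to LangChain's `SemanticChunker`)
3. Vector store creation using FAISS and Azure OpenAI embeddings
4. Retriever setup for querying the processed documents
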
## Method Details
### Document Preprocessing
1. The PDF is read and converted to a string using the custom `read_pdf_to_string` function.


### Step 1: Import Libraries and Load the Azure OpenAI Credentials
In this step, we import the necessary libraries and load the environment variables needed to access the Azure OpenAI API.

Instead of relying on LangChain's `SemanticChunker`, this notebook defines its own splitter, `semantic_split`, which chunks the text based on meaning rather than fixed size. The Azure OpenAI API key, API version, and endpoint are loaded from the `variables.env` file so that they never have to be hard-coded in the notebook.

### Cell 1: Import Necessary Libraries
This cell imports all required libraries for handling Azure OpenAI, document embeddings, vector storage, document loading, and processing.

#### Explanation:
The imports cover the Azure OpenAI client, LangChain's embeddings interface and `Document` class, FAISS for the vector store, NumPy and scikit-learn for the similarity computations, and PyMuPDF (`fitz`) for reading PDF files.

In [3]:
import os
from dotenv import load_dotenv
# The OpenAI SDK client exposes .embeddings.create and .chat.completions.create, which later cells call
from openai import AzureOpenAI
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import fitz  # PyMuPDF


In [4]:
load_dotenv('variables.env')
# Azure OpenAI configuration
azure_openai_api_key = os.getenv('AZURE_OPENAI_API_KEY')
azure_openai_api_version = os.getenv('AZURE_OPENAI_API_VERSION')
azure_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
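
`os.getenv` returns `None` for a variable that is not set, so a typo in `variables.env` would only surface later as a confusing client error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of the notebook or of `python-dotenv`.

```python
import os


def require_env(names):
    """Return the named environment variables as a dict, raising if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Possible usage after load_dotenv('variables.env'):
# settings = require_env([
#     "AZURE_OPENAI_API_KEY",
#     "AZURE_OPENAI_API_VERSION",
#     "AZURE_OPENAI_ENDPOINT",
# ])
```

Failing fast here keeps the error message close to its cause.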

In [5]:
# Confirm that the settings were loaded without echoing the secret key itself
print("API key loaded:", bool(azure_openai_api_key))
print(azure_openai_api_version)
print(azure_endpoint)

API key loaded: True
2024-02-15-preview
https://home-openai-01.openai.azure.com/


### Cell 2: Configure Azure OpenAI API
This cell sets up the Azure OpenAI client with your credentials (API key, endpoint, and API version).

Explanation: Initializes the Azure OpenAI client with the API key, endpoint, and API version loaded above. Make sure your `variables.env` file contains your own Azure credentials.

In [6]:

# Set up Azure OpenAI client
azure_client = AzureOpenAI(
    api_key=azure_openai_api_key,
    api_version=azure_openai_api_version,
    azure_endpoint=azure_endpoint
)


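### Cell 3: Define Custom Embeddings Class for Azure OpenAI
This cell defines a custom class that interacts with Azure OpenAI to embed documents and queries using the embeddings model.

Explanation: The class implements LangChain's `Embeddings` interface by calling an Azure OpenAI embedding model (`text-embedding-ada-002`). On Azure, the `model` argument must match the name of your embedding deployment. `embed_query` embeds a single string, and `embed_documents` embeds a list of strings by calling `embed_query` once per text.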

In [7]:
class CustomAzureEmbeddings(Embeddings):
    def __init__(self, client):
        self.client = client

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text):
        response = self.client.embeddings.create(
            model="text-embedding-ada-002",
            input=text
        )
        return response.data[0].embedding
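
Because `embed_documents` issues one request per text, a long document triggers one API call per sentence. The embeddings endpoint of the OpenAI SDK also accepts a list of strings as `input`, so requests can be batched. The sketch below illustrates the idea under assumptions: `embed_in_batches`, `batched`, the batch size of 16, and the deployment name are all hypothetical, and a stand-in client is used so the example runs without network access.

```python
from types import SimpleNamespace


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(client, texts, model, batch_size=16):
    """Embed `texts` with one request per batch instead of one request per text."""
    vectors = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of its input; sort to preserve order
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors


class _FakeEmbeddings:
    """Stand-in for client.embeddings: returns a 1-d 'embedding' equal to the text length."""

    def __init__(self):
        self.calls = 0

    def create(self, model, input):
        self.calls += 1
        data = [SimpleNamespace(index=i, embedding=[float(len(t))]) for i, t in enumerate(input)]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = embed_in_batches(
    fake_client, ["a", "bb", "ccc"], model="my-embedding-deployment", batch_size=2
)
print(vectors, "requests:", fake_client.embeddings.calls)
```

With a real client, the same function could back `embed_documents`; the appropriate batch size depends on your deployment's request limits.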


### Cell 4: Read PDF and Extract Text
This function extracts text from the PDF file using PyMuPDF.

Explanation: The function opens the PDF, concatenates the text of every page in order, and returns it as a single string, which is later used for chunking and embedding.

In [8]:
def read_pdf_to_string(path):
    # The context manager closes the file handle even if extraction fails
    with fitz.open(path) as doc:
        # Concatenate the text of every page, in page order
        return "".join(page.get_text() for page in doc)
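
Text extracted from a PDF usually keeps the hard line breaks of the page layout, which is why the retrieved context later in this notebook is broken across short lines. An optional clean-up step is sketched below; `normalize_whitespace` is a hypothetical helper that is not part of the notebook, and it deliberately discards paragraph breaks along with line breaks.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into a single space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "rapid increase in global temperatures, sea levels, \nand extreme weather events. "
print(normalize_whitespace(raw))
```

Applying such a step to the output of `read_pdf_to_string` before chunking would keep layout artefacts out of the chunks, at the cost of losing paragraph boundaries.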


### Cell 5: Semantic Text Splitting
This function splits the text into semantically coherent chunks using cosine similarity between sentence embeddings.

Explanation: The text is split into sentences and each sentence is embedded. Walking through the sentences in order, a sentence is appended to the current chunk when its cosine similarity to the chunk's running embedding exceeds `similarity_threshold` and the chunk would stay under `max_chunk_size` characters; otherwise a new chunk is started.
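
The merge decision rests on cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The following dependency-free sketch computes it for made-up 2-dimensional vectors (real embeddings here have 1536 dimensions) and applies the same 0.7 threshold that `semantic_split` uses by default. The vectors and the `cosine` helper are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


threshold = 0.7
chunk_vec = [3.0, 4.0]        # toy embedding of the current chunk
similar_vec = [4.0, 3.0]      # points in nearly the same direction
unrelated_vec = [-4.0, 3.0]   # orthogonal to chunk_vec

sim_similar = cosine(chunk_vec, similar_vec)
sim_unrelated = cosine(chunk_vec, unrelated_vec)

print(sim_similar > threshold)    # the sentence would join the current chunk
print(sim_unrelated > threshold)  # the sentence would start a new chunk
```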

In [9]:
def semantic_split(text, embeddings_client, max_chunk_size=1000, similarity_threshold=0.7):
    # Naive sentence split on periods; empty fragments are dropped
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if not sentences:
        return []  # nothing to chunk (e.g. an empty or image-only PDF)
    sentence_embeddings = embeddings_client.embed_documents(sentences)

    chunks = []
    current_chunk = sentences[0]
    current_embedding = sentence_embeddings[0]

    for sentence, embedding in zip(sentences[1:], sentence_embeddings[1:]):
        similarity = cosine_similarity([current_embedding], [embedding])[0][0]
        # Extend the chunk only if it stays under the size limit and the sentence is similar enough
        if len(current_chunk) + len(sentence) < max_chunk_size and similarity > similarity_threshold:
            current_chunk += '. ' + sentence
            # Update the chunk embedding as the mean of its previous value and the new sentence
            current_embedding = np.mean([current_embedding, embedding], axis=0)
        else:
            chunks.append(current_chunk)
            current_chunk = sentence
            current_embedding = embedding

    chunks.append(current_chunk)
    return chunks
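
Splitting on every period is simple but also cuts decimals such as "1.5" in two and drops the sentence-ending punctuation. One possible alternative, sketched here as an assumption rather than as the notebook's method, splits only where sentence punctuation is followed by whitespace. The helper name `split_sentences` is hypothetical, and abbreviations such as "e.g. " would still be split.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "Temperatures rose by 1.5 degrees. Why? Emissions grew."

naive = [s.strip() for s in text.split('.') if s.strip()]
regex_based = split_sentences(text)

print(naive)
print(regex_based)
```

If this splitter replaced the first line of `semantic_split`, sentences would need to be joined with `' '` instead of `'. '`, since they keep their own punctuation.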


### Cell 6: Encode PDF and Create FAISS Index
This cell encodes the PDF, embeds the chunks, and creates a FAISS index.

Explanation: This function reads the PDF, splits it semantically, embeds the chunks using Azure OpenAI, and stores them in a FAISS vector store for efficient retrieval.

In [10]:
def encode_pdf(path):
    try:
        embeddings_client = CustomAzureEmbeddings(azure_client)

        # Extract the text and split it into semantic chunks
        content = read_pdf_to_string(path)
        chunks = semantic_split(content, embeddings_client)
        texts = [Document(page_content=chunk) for chunk in chunks]

        # Embed every chunk and report the resulting matrix shape
        embeddings = embeddings_client.embed_documents([t.page_content for t in texts])
        embeddings_array = np.array(embeddings)
        print(f"Document embeddings shape: {embeddings_array.shape}")

        # Build the vector store from (text, embedding) pairs
        text_embeddings = list(zip([t.page_content for t in texts], embeddings))
        index = FAISS.from_embeddings(text_embeddings, embeddings_client)

        # ntotal is the number of vectors stored in the underlying FAISS index
        print(f"FAISS index contains {index.index.ntotal} documents")

        return index

    except Exception as e:
        print(f"An error occurred: {e}")
        raise
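
Conceptually, a flat vector index answers a query by comparing the query embedding with every stored embedding and returning the closest ones. The sketch below shows that idea in plain Python with squared Euclidean distance. It illustrates exhaustive nearest-neighbour search under assumed toy data, not FAISS's actual implementation, and `top_k` is a hypothetical helper.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec (squared L2 distance)."""

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    ranked = sorted(range(len(doc_vecs)), key=lambda i: sq_dist(query_vec, doc_vecs[i]))
    return ranked[:k]


# Toy 2-dimensional "embeddings" standing in for the chunk vectors
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
nearest = top_k([1.0, 1.0], doc_vecs, k=2)
print(nearest)
```

FAISS performs the same kind of search with optimized native code, which is what makes it practical for much larger collections than the 78 chunks in this example.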


### Cell 7: Main Execution for Vector Store Creation
This cell handles the main execution for creating the vector store from the PDF.

Explanation: Calls `encode_pdf` on the sample PDF, which reads the file, chunks it, and stores the chunk embeddings in a FAISS index.

In [11]:
# Main execution
pdf_path = "./data/Understanding_Climate_Change.pdf"

try:
    vectorstore = encode_pdf(pdf_path)
    print("Vector store created successfully.")
except Exception as e:
    print(f"An error occurred during vector store creation: {e}")


Document embeddings shape: (78, 1536)
FAISS index contains 78 documents
Vector store created successfully.


### Cell 8: Retrieve Relevant Documents
This cell creates a retriever using the FAISS vector store and retrieves relevant documents based on a query.

Explanation: This part retrieves the most relevant documents from the FAISS index based on a user's query. It prints the context of the retrieved documents.

In [12]:
# Create a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Define a query
query = "What is the main cause of climate change?"

# Retrieve relevant documents
try:
    docs = retriever.invoke(query)  # replaces the deprecated get_relevant_documents
    context = "\n\n".join([doc.page_content for doc in docs])
    print("\nRetrieved context:")
    print(context)
except Exception as e:
    print(f"An error occurred during document retrieval: {e}")




Retrieved context:
During the Holocene epoch, which 
began at the end of the last ice age, human societies flourished, but the industrial era has seen 
unprecedented changes. Modern Observations 
Modern scientific observations indicate a rapid increase in global temperatures, sea levels, 
and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has 
documented these changes extensively. Ice core samples, tree rings, and ocean sediments 
provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhouse gases. Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating 

### Cell 9: Generate an Answer Using Azure OpenAI GPT-4o
This cell uses Azure OpenAI GPT-4o to generate an answer to the user's query based on the retrieved context.

Explanation: The retrieved context and the question are combined into a single prompt and sent to a GPT-4o chat deployment on Azure OpenAI, which is asked to answer based on that context.

In [13]:
# Generate an answer using the chat model
try:
    response = azure_client.chat.completions.create(
        model="gpt-4o",  # Replace with your chat model deployment name
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer the question based on the provided context."}
        ]
    )
    print("\nAnswer:")
    print(response.choices[0].message.content)
except Exception as e:
    print(f"An error occurred during answer generation: {e}")



Answer:
The main cause of recent climate change, as outlined in the provided context, is the increase in greenhouse gases in the atmosphere. Human activities, particularly the emission of greenhouse gases such as carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O), are the primary drivers of this increase.
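
The prompt assembled in Cell 9 can be factored into a small helper, which also makes it easy to cap the amount of context sent to the model. This is a minimal sketch: `build_messages` and the 6000-character limit are assumptions for illustration, not part of the notebook.

```python
def build_messages(context: str, query: str, max_context_chars: int = 6000):
    """Build the chat messages for a context-grounded question, trimming overly long context."""
    trimmed = context[:max_context_chars]
    user_content = (
        f"Context:\n{trimmed}\n\n"
        f"Question: {query}\n\n"
        "Answer the question based on the provided context."
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "Greenhouse gases trap heat from the sun.",
    "What is the main cause of climate change?",
)
print(messages[1]["content"])
```

The returned list could be passed as the `messages` argument of `azure_client.chat.completions.create`.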
