### Setting Up the Azure OpenAI Client
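1. **Importing Libraries**: We import `requests` for calling the Azure OpenAI and Azure AI Search REST APIs, `PyPDF2` for PDF text extraction, `re` and `json` for text handling, and `dotenv` for loading environment variables from a `.env` file.
2. **Loading Environment Variables**: We load the Azure OpenAI API key, API version, and endpoint, together with the Azure AI Search service name and admin key, from `variables.env` using `load_dotenv()`, so that secrets stay out of the notebook itself.
3. **Configuring Endpoints**: We build the search endpoint URL from the service name and set the model deployment name. No `AzureOpenAI` client object is created here; later cells call the REST APIs directly with `requests`.
4. **Environment Check**: To verify that everything is set up properly, we print the loaded values. This prints both API keys in plain text, so clear the cell output before sharing the notebook.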
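
The environment check in step 4 can also be done without echoing full secrets. The following is a minimal sketch, assuming a hypothetical helper `mask_secret` (not part of the notebook) that reveals only the last four characters of a value:

```python
import os

def mask_secret(value, visible=4):
    """Return a display-safe version of a secret (hypothetical helper)."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Variable names match the ones the notebook reads from variables.env.
for name in ("AZURE_OPENAI_API_KEY", "AZURE_ADMIN_KEY"):
    print(f"{name}: {mask_secret(os.getenv(name))}")
```

A value that is missing prints as `<not set>`, which still confirms whether `load_dotenv()` found the file.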


In [1]:
import os
import requests
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
import json
import re
import PyPDF2 
from typing import List, Dict
from dotenv import load_dotenv


In [None]:
load_dotenv('variables.env')
# Azure OpenAI configuration
azure_openai_api_key = os.getenv('AZURE_OPENAI_API_KEY')
azure_openai_api_version = os.getenv('AZURE_OPENAI_API_VERSION')
azure_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
service_name = os.getenv('AZURE_SERVICE_NAME') 
admin_key = os.getenv('AZURE_ADMIN_KEY')  
endpoint = f"https://{service_name}.search.windows.net"
openai_deployment = "gpt-4o"  # Replace with your model deployment name

print(azure_openai_api_key)
print(azure_openai_api_version)
print(azure_endpoint)
print(service_name)
print(admin_key)
print(endpoint)

# Azure AI Search Index Creation

This cell creates an Azure Cognitive Search index for climate change documents.

Key points:
- Sets up index schema with fields for id, content, and chapter
- Uses REST API to create the index
- Prints success message if index is created, error details if it fails

Note: Ensure `admin_key` and `endpoint` are correctly set before running.
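
As an illustration of what a completed schema and request might look like, here is a sketch that only assembles the URL, headers, and body without sending anything. The function name `build_index_request`, the index name `climate-index`, and the placeholder service URL are assumptions for this example, and it assumes the index is created with a POST to the `/indexes` collection; the notebook's own REST call may differ.

```python
import json

def build_index_request(endpoint, index_name, admin_key, api_version="2020-06-30"):
    """Assemble URL, headers, and body for a create-index call (nothing is sent)."""
    schema = {
        "name": index_name,
        "fields": [
            # The key field uniquely identifies each document.
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            # Searchable enables full-text search over the chunk text.
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Filterable and facetable allow narrowing and grouping by chapter.
            {"name": "chapter", "type": "Edm.String", "filterable": True, "facetable": True},
        ],
    }
    url = f"{endpoint}/indexes?api-version={api_version}"
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    return url, headers, schema

url, headers, schema = build_index_request(
    "https://example-service.search.windows.net",  # placeholder service URL
    "climate-index",                                # hypothetical index name
    "<admin-key>",
)
print(url)
print(json.dumps(schema, indent=2))
```

Sending it would then be `requests.post(url, headers=headers, json=schema)`, after which the status code can be checked as the hints in the cell below describe.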

In [None]:
# [Previous code remains unchanged]

# TODO: Complete the index schema
index_schema = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": # Your code here},
        {"name": "chapter", "type": # Your code here}
    ]
}

# HINT: For the 'content' field, consider using type "Edm.String" and making it searchable.
# For the 'chapter' field, you might want to make it filterable and facetable.

# [REST API call remains unchanged]

# TODO: Add code to check the response status and print appropriate message
if response.status_code == # Your code here:
    print(# Your code here)
else:
    print(# Your code here)

# HINT: Use response.status_code to check if the request was successful (status code 200 or 201).
# Print a success message if the index was created, or an error message with response.text if it failed.

# PDF Text Extraction

This cell extracts text from a PDF file using PyPDF2.

Key operations:
- Defines a function to read and extract text from PDF
- Extracts text from "Understanding_Climate_Change.pdf"
- Prints a preview of the extracted text (first 500 characters)
- Saves the full extracted text to "extracted_text.txt"

Note: Ensure the PDF file path is correct before running.
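
The extraction loop and the save step can be sketched independently of any real PDF. The example below assumes only that each page object exposes an `extract_text()` method, as the pages of a `PyPDF2.PdfReader` do; `FakePage` and the output file name are stand-ins invented so the sketch runs without a PDF on disk.

```python
import os
import tempfile

def extract_text_from_pages(pages):
    """Concatenate the text of each page; pages that yield no text contribute ''."""
    return "".join((page.extract_text() or "") for page in pages)

class FakePage:
    """Stand-in for a PDF page, used only so this sketch runs without a PDF file."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

pages = [FakePage("Chapter 1: Introduction\n"), FakePage(None), FakePage("Climate basics.")]
text = extract_text_from_pages(pages)
print(text[:500])  # preview of at most the first 500 characters

# Write with an explicit encoding so non-ASCII characters survive the round trip.
out_path = os.path.join(tempfile.gettempdir(), "extracted_text_demo.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)
```

With a real file, the same function could be called as `extract_text_from_pages(PyPDF2.PdfReader(file).pages)` inside the `with open(pdf_path, 'rb')` block.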

In [None]:
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in # Your code here:
            text += # Your code here
    return text

# HINT: Use a for loop to iterate through reader.pages. For each page, use the extract_text() method and append the result to the text variable.

# [PDF extraction code remains unchanged]

# TODO: Save the extracted text to a file named 'extracted_text.txt'
with open("extracted_text.txt", "w", encoding="utf-8") as # Your code here:
    # Your code here

# HINT: Use the with open() statement in write mode ('w') and the write() method to save the text to a file.
# Don't forget to specify the encoding as 'utf-8' to handle special characters correctly.

# Content Chunking and Document Upload

This cell processes the extracted PDF text and uploads it to Azure Cognitive Search.

Key operations:
- Chunks the content into smaller segments
- Extracts chapter information
- Creates documents with ID, content, and chapter
- Uploads the documents to the search index

Note: Ensure the 'content' variable contains the actual PDF text before running.
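
To make the chunking and chapter lookup concrete, here is a small self-contained sketch run on an invented four-line sample. The function names `chunk_lines` and `find_chapters` and the sample text are assumptions for illustration; this is one possible reading of the operations listed above, not the notebook's reference solution.

```python
import re

def chunk_lines(text, max_chunk_size=5000):
    """Greedy line-based chunking: start a new chunk when a line would exceed the limit."""
    chunks, current = [], ""
    for line in text.split("\n"):
        if current and len(current) + len(line) > max_chunk_size:
            chunks.append(current.strip())
            current = line
        else:
            current += " " + line
    if current.strip():
        chunks.append(current.strip())
    return chunks

def find_chapters(text):
    """Return chapter titles of the form 'Chapter <number>: <title>'."""
    return re.findall(r"Chapter \d+: .+", text)

sample = (
    "Chapter 1: Basics\nGreenhouse gases trap heat.\n"
    "Chapter 2: Impacts\nSea levels are rising."
)
chunks = chunk_lines(sample, max_chunk_size=50)
chapters = find_chapters(sample)

documents = [
    {
        "id": f"doc{i}",
        "content": chunk,
        # First chapter title that appears in the chunk, else a fallback label.
        "chapter": next((ch for ch in chapters if ch in chunk), "Unknown Chapter"),
    }
    for i, chunk in enumerate(chunks)
]
for doc in documents:
    print(doc)
```

A single line longer than `max_chunk_size` still becomes one oversized chunk in this sketch, and a chunk that contains no chapter title falls back to "Unknown Chapter".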

In [None]:
def chunk_content(text, max_chunk_size=5000):
    chunks = []
    current_chunk = ""
    for line in # Your code here:
        if len(current_chunk) + len(line) > max_chunk_size:
            chunks.append(# Your code here)
            current_chunk = # Your code here
        else:
            current_chunk += # Your code here
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# HINT: Split the text into lines and iterate through them. Add lines to the current_chunk until it reaches max_chunk_size, then append it to chunks and start a new current_chunk.

def extract_chapters(text):
    return re.findall(# Your code here)

# HINT: Use re.findall() with a pattern like r'Chapter \d+: .+' to find chapter titles in the text.

# [Code to read content from file remains unchanged]

documents = []
for i, chunk in enumerate(chunked_content):
    chapter = next((ch for ch in chapters if ch in chunk), "Unknown Chapter")
    doc = {
        "id": f"doc{i}",
        "content": # Your code here,
        "chapter": # Your code here
    }
    documents.append(doc)

# HINT: Iterate through chunked_content, creating a dictionary for each chunk with 'id', 'content', and 'chapter' keys. Append each dictionary to the documents list.

# [Upload API call remains unchanged]

# TODO: Check response status and print appropriate message
if response.status_code == # Your code here:
    print(f"Successfully uploaded {# Your code here} documents.")
else:
    print(# Your code here)

# HINT: Check response.status_code. If it's 200 or 201, print a success message with the number of documents uploaded. Otherwise, print an error message with response.text.

# Index Verification

This cell verifies the content uploaded to Azure Cognitive Search.

Key operations:
- Defines functions to:
  - Get total document count
  - Retrieve a sample of documents
- Prints total document count
- Displays sample documents with ID and chapter

Note: This helps confirm successful upload and indexing of documents.
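
The response handling for both checks can be tried offline. The sketch below uses hypothetical helpers `parse_count` and `summarize_samples` and an invented response body. Stripping a byte-order mark is a defensive assumption; the notebook does not state whether the service sends one.

```python
def parse_count(body_text):
    """Parse the plain-text body of a $count response into an int."""
    # Remove a leading byte-order mark (if any) and surrounding whitespace.
    return int(body_text.lstrip("\ufeff").strip())

def summarize_samples(response_json):
    """Format each document under the 'value' key as a one-line summary."""
    return [f"ID: {d['id']}, Chapter: {d['chapter']}" for d in response_json.get("value", [])]

# Invented payloads shaped like the responses the cell below expects.
print(parse_count("\ufeff12\n"))
sample_json = {"value": [{"id": "doc0", "chapter": "Chapter 1: Basics"}]}
for line in summarize_samples(sample_json):
    print(line)
```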

In [None]:
def get_document_count():
    count_url = f"{endpoint}/indexes/{index_name}/docs/$count?api-version=2020-06-30"
    response = requests.get(# Your code here)
    if response.status_code == 200:
        return int(# Your code here)
    else:
        print(f"Failed to get document count. Status code: {response.status_code}")
        return None

# HINT: Use requests.get() to make the API call. If successful, return int(response.text). Handle potential errors and return None if the call fails.

def get_sample_documents(sample_size=5):
    search_url = f"{endpoint}/indexes/{index_name}/docs?api-version=2020-06-30"
    # In a GET query, OData options carry a '$' prefix; 'search' does not.
    params = {
        'search': '*',
        '$top': sample_size,
        '$select': 'id,chapter'
    }
    response = requests.get(# Your code here)
    if response.status_code == 200:
        return response.json()[# Your code here]
    else:
        print(f"Failed to get sample documents. Status code: {response.status_code}")
        return None

# HINT: Use requests.get() with the search_url and params. If successful, return response.json()['value']. Handle potential errors and return None if the call fails.

# [Code to get document count and sample documents remains unchanged]

if sample_docs:
    print("\nSample documents:")
    for doc in sample_docs:
        print(f"ID: {doc[# Your code here]}, Chapter: {doc[# Your code here]}")

# HINT: Use a for loop to iterate through sample_docs. For each document, print its 'id' and 'chapter' values.

# Document Search and Results Display

This cell demonstrates searching the Azure Cognitive Search index.

Key operations:
- Defines a function to search documents with highlighting
- Performs test searches on climate change topics
- Displays search results including:
  - Document ID
  - Chapter
  - Highlighted relevant content

Note: This showcases the search functionality and result presentation.
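
Result presentation can be sketched on a hand-written result. The helper name `format_result` and the sample dictionary are assumptions for illustration; the `<em>` tags mirror the service's default highlight markers. Unlike a plain `[0]` lookup, this version also copes with an empty highlight list.

```python
def format_result(result):
    """Build a printable summary of one search hit, with a fallback when no highlight exists."""
    highlights = result.get("@search.highlights", {}).get("content", [])
    snippet = highlights[0] if highlights else "No highlight"
    return (
        f"ID: {result['id']}\n"
        f"Chapter: {result['chapter']}\n"
        f"Highlighted content: {snippet}"
    )

# Invented hit shaped like an entry of response.json()['value'].
hit = {
    "id": "doc3",
    "chapter": "Chapter 2: Impacts",
    "content": "Sea levels are rising.",
    "@search.highlights": {"content": ["<em>Sea</em> levels are rising."]},
}
print(format_result(hit))
print(format_result({"id": "doc4", "chapter": "Unknown Chapter"}))
```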

In [None]:
def search_documents(query, top=3):
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "top": top,
        "select": "id,chapter,content",
        "highlight": "content"
    }
    response = requests.post(# Your code here)
    if response.status_code == 200:
        return response.json()[# Your code here]
    else:
        print(f"Search failed. Status code: {response.status_code}")
        return []

# HINT: Use requests.post() with the search_url, headers, and body. If successful, return response.json()['value']. Handle potential errors and return an empty list if the call fails.

# [Test searches code remains unchanged]

for query in test_queries:
    print(f"\nSearching for: {query}")
    results = search_documents(query)
    if results:
        for result in results:
            print(f"ID: {result[# Your code here]}")
            print(f"Chapter: {result[# Your code here]}")
            print(f"Highlighted content: {result.get('@search.highlights', {}).get('content', ['No highlight'])[0]}")
            print()
    else:
        print("No results found.")

# HINT: For each result, print the 'id' and 'chapter'. For the highlighted content, access result.get('@search.highlights', {}).get('content', ['No highlight'])[0].

# Content Rechunking and Reindexing

This cell improves the document chunking process and reindexes the content.

Key operations:
- Redefines chunking function to preserve chapter information
- Loads extracted PDF text from file
- Creates new documents with improved chapter tracking
- Uploads rechunked documents to the search index

Note: This process enhances the organization and searchability of the content.
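
The idea of carrying the current chapter along with each chunk can be sketched as follows. The function name `chunk_by_chapter` and the sample text are invented for this example, and the sketch is one possible way to fill in the approach described above rather than the notebook's reference solution.

```python
def chunk_by_chapter(text, max_chunk_size=5000):
    """Split text into (chapter, chunk) pairs, starting a new chunk at every chapter heading."""
    chunks = []
    current_chunk = ""
    current_chapter = "Unknown Chapter"
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("Chapter"):
            # Close the chunk belonging to the previous chapter, if any.
            if current_chunk.strip():
                chunks.append((current_chapter, current_chunk.strip()))
            current_chunk = stripped
            current_chapter = stripped
        elif current_chunk and len(current_chunk) + len(line) > max_chunk_size:
            chunks.append((current_chapter, current_chunk.strip()))
            current_chunk = line
        else:
            current_chunk += " " + line
    if current_chunk.strip():
        chunks.append((current_chapter, current_chunk.strip()))
    return chunks

sample = (
    "Preface text.\nChapter 1: Basics\nGreenhouse gases trap heat.\n"
    "Chapter 2: Impacts\nSea levels are rising."
)
pairs = chunk_by_chapter(sample)
documents = [
    {"id": f"doc{i}", "content": chunk, "chapter": chapter}
    for i, (chapter, chunk) in enumerate(pairs)
]
for doc in documents:
    print(doc)
```

Because the chapter is tracked while reading, a chapter split across several chunks keeps its label on every chunk. Any line that merely begins with the word "Chapter" is treated as a heading, so a stricter pattern may be needed for some documents.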

In [None]:


def chunk_content(text, max_chunk_size=5000):
    chunks = []
    current_chunk = ""
    current_chapter = "Unknown Chapter"
    
    for line in text.split('\n'):
        if line.strip().startswith("Chapter"):
            if current_chunk:
                chunks.append((# Your code here))
            current_chunk = # Your code here
            current_chapter = # Your code here
        elif len(current_chunk) + len(line) > max_chunk_size:
            chunks.append((# Your code here))
            current_chunk = # Your code here
        else:
            current_chunk += # Your code here
    
    if current_chunk:
        chunks.append((# Your code here))
    
    return chunks

# HINT: When appending to chunks, use (current_chapter, current_chunk.strip())
# For the 'else' case, put a space between the existing current_chunk and the new line so words do not run together

# Load the extracted PDF text
with open("extracted_text.txt", "r", encoding="utf-8") as file:
    pdf_content = file.read()

chunked_content = chunk_content(# Your code here)

documents = []
for i, (chapter, chunk) in enumerate(chunked_content):
    doc = {
        "id": f"doc{i}",
        "content": # Your code here,
        "chapter": # Your code here
    }
    documents.append(doc)

# HINT: Use the 'chunk' and 'chapter' variables to fill in the 'content' and 'chapter' fields of the doc dictionary

# Upload documents
upload_url = f"{endpoint}/indexes/{index_name}/docs/index?api-version=2020-06-30"
response = requests.post(upload_url, headers=headers, json={"value": documents})

if response.status_code == # Your code here:
    print(f"Successfully reindexed {# Your code here} documents.")
else:
    print(f"Failed to reindex documents. Status code: {response.status_code}")
    print(# Your code here)

# HINT: Check if the status code is 200 for a successful request
# Use len(documents) to get the number of reindexed documents
# Print response.text in case of an error

# RAG Pipeline Implementation

This cell implements a Retrieval-Augmented Generation (RAG) pipeline for climate change questions.

Key components:
- Document retrieval from Azure Cognitive Search
- Prompt creation using retrieved documents
- Response generation using Azure OpenAI

Features:
- Utilizes both search and OpenAI APIs
- Creates context-aware prompts
- Implements an interactive Q&A loop

Note: Ensure all API keys and endpoints are correctly set before running.
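
The three stages can be wired together and exercised without any network access by passing the retrieval and generation steps in as functions. In this sketch, `fake_retrieve` and `fake_generate` are stubs invented for illustration, and the prompt wording is shortened. In the notebook, the real stages call Azure AI Search and Azure OpenAI.

```python
from typing import Callable, Dict, List

def create_prompt(query: str, docs: List[Dict]) -> str:
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(
        f"Chapter: {d['chapter']}\nContent: {d['content']}" for d in docs
    )
    return (
        "Use the following information to answer the user's question. "
        "If the information is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nUser Question: {query}\n\nAnswer:"
    )

def rag_pipeline(
    query: str,
    retrieve: Callable[[str], List[Dict]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve documents, build the prompt, then generate the answer."""
    docs = retrieve(query)
    prompt = create_prompt(query, docs)
    return generate(prompt)

def fake_retrieve(query: str) -> List[Dict]:
    # Stub standing in for the search call.
    return [{"chapter": "Chapter 1: Basics", "content": "Greenhouse gases trap heat."}]

def fake_generate(prompt: str) -> str:
    # Stub standing in for the chat completion call.
    return f"(stub) received a prompt of {len(prompt)} characters"

print(rag_pipeline("What traps heat?", fake_retrieve, fake_generate))
```

Injecting the two stages this way is a design choice of the sketch. It lets the prompt construction be tested in isolation before any API keys are involved.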

In [None]:
# [Headers setup remains unchanged]

def retrieve_documents(query: str, top: int = 3) -> List[Dict]:
    return search_documents(# Your code here)

# HINT: Use the search_documents() function you created earlier, passing the query and top parameters.

def create_prompt(query: str, docs: List[Dict]) -> str:
    context = "\n\n".join([f"Chapter: {doc['chapter']}\nContent: {doc['content']}" for doc in docs])
    return f"""You are an AI assistant specializing in climate change. Use the following information to answer the user's question. If you can't answer the question based on the provided information, say so.

Context:
{# Your code here}

User Question: {# Your code here}

Answer:"""

# HINT: Construct a string that includes context from the documents and the user's query. Consider using f-strings to format the prompt.

def generate_response(prompt: str) -> str:
    url = f"{azure_endpoint}/openai/deployments/{openai_deployment}/chat/completions?api-version=2023-05-15"
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that answers questions about climate change."},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 150,
        "temperature": 0.7,
        "n": 1
    }
    response = requests.post(# Your code here)
    if response.status_code == 200:
        return response.json()['choices'][0]['message'][# Your code here]
    else:
        print(f"OpenAI API request failed. Status code: {response.status_code}")
        return "Sorry, I couldn't generate a response at this time."

# HINT: Use requests.post() with the url, headers, and json payload. Extract and return the generated text from the response JSON.

def rag_pipeline(query: str) -> str:
    docs = retrieve_documents(# Your code here)
    prompt = create_prompt(# Your code here)
    response = generate_response(# Your code here)
    return response

# HINT: Call retrieve_documents(), then create_prompt() with the retrieved documents, and finally generate_response() with the created prompt.

# [Example usage code remains unchanged]

# Enhanced RAG Pipeline with Customization Options

This cell implements an advanced Retrieval-Augmented Generation (RAG) pipeline for climate change queries.

Key features:
- Flexible document retrieval from Azure Cognitive Search
- Customizable prompt styles (default, detailed, concise)
- Adjustable response length with max_tokens
- Interactive user input for query parameters

Components:
- Document retrieval function
- Dynamic prompt creation based on style
- Azure OpenAI integration for response generation
- Main loop for continuous user interaction

Note: Ensure all API keys and endpoints are correctly configured before running.
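
Since the parameters come from free-form user input, it may help to normalize them before they reach the pipeline. The helpers `normalize_style` and `clamp_int` below are hypothetical additions, not part of the notebook. The ranges (1-5 documents, 50-500 tokens) are the ones shown in the input prompts, and the fallbacks match the function defaults.

```python
VALID_STYLES = ("default", "detailed", "concise")

def normalize_style(raw):
    """Map user input to a known prompt style, falling back to 'default'."""
    style = raw.strip().lower()
    return style if style in VALID_STYLES else "default"

def clamp_int(raw, low, high, fallback):
    """Parse an integer from user input and keep it within [low, high]."""
    try:
        value = int(raw)
    except ValueError:
        return fallback
    return max(low, min(high, value))

# Simulated user answers instead of input() so the sketch runs unattended.
num_docs = clamp_int("9", 1, 5, fallback=3)
max_tokens = clamp_int("lots", 50, 500, fallback=150)
prompt_style = normalize_style("  Detailed ")
print(num_docs, max_tokens, prompt_style)
```

The cleaned values could then be passed on as `rag_pipeline(user_query, num_docs, prompt_style, max_tokens)`.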

In [None]:
search_headers = {
    'Content-Type': 'application/json',
    'api-key': admin_key
}

openai_headers = {
    'Content-Type': 'application/json',
    'api-key': azure_openai_api_key
}

def retrieve_documents(query: str, top: int = 3) -> List[Dict]:
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "top": top,
        "select": "content,chapter",
        "highlight": "content"
    }
    response = requests.post(# Your code here)
    if response.status_code == 200:
        return response.json()[# Your code here]
    else:
        print(f"Search failed. Status code: {response.status_code}")
        return []

# HINT: Use the search_url, search_headers, and body for the POST request. Return the 'value' key from the JSON response.

def create_prompt(query: str, docs: List[Dict], prompt_style: str = "default") -> str:
    context = "\n\n".join([f"Chapter: {doc['chapter']}\nContent: {doc['content']}" for doc in docs])
    
    if prompt_style == "detailed":
        return f"""You are an AI assistant specializing in climate change. Use the following information to answer the user's question. If you can't answer the question based on the provided information, say so. Provide a detailed explanation and, if possible, include specific examples or data from the context.

Context:
{context}

User Question: {query}

Detailed Answer:"""
    
    elif prompt_style == "concise":
        return f"""You are an AI assistant specializing in climate change. Provide a brief and concise answer to the user's question based on the following information. If you can't answer the question based on the provided information, say so briefly.

Context:
{context}

User Question: {query}

Concise Answer:"""
    
    else:  # default prompt style
        return # Your code here

# HINT: For the default prompt style, use a format similar to the "detailed" and "concise" styles, but with a balance between detail and brevity.

def generate_response(prompt: str, max_tokens: int = 150) -> str:
    url = f"{azure_endpoint}/openai/deployments/{openai_deployment}/chat/completions?api-version=2023-05-15"
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that answers questions about climate change."},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "n": 1
    }
    response = requests.post(# Your code here)
    if response.status_code == 200:
        return response.json()['choices'][0]['message'][# Your code here]
    else:
        print(f"OpenAI API request failed. Status code: {response.status_code}")
        return "Sorry, I couldn't generate a response at this time."

# HINT: Use the url, openai_headers, and json payload for the POST request. Return the 'content' of the generated message.

def rag_pipeline(query: str, num_docs: int = 3, prompt_style: str = "default", max_tokens: int = 150) -> str:
    docs = retrieve_documents(# Your code here)
    prompt = create_prompt(# Your code here)
    response = generate_response(# Your code here)
    return response

# HINT: Pass the appropriate parameters to each function call in the pipeline.

# Example usage
if __name__ == "__main__":
    while True:
        user_query = input("Enter your question about climate change (or 'quit' to exit): ")
        if user_query.lower() == 'quit':
            break
        
        num_docs = int(input("Enter the number of documents to retrieve (1-5): "))
        prompt_style = input("Enter prompt style (default/detailed/concise): ")
        max_tokens = int(input("Enter maximum number of tokens for the response (50-500): "))
        
        answer = rag_pipeline(# Your code here)
        print(f"\nAnswer: {answer}\n")

# HINT: Pass all the user inputs to the rag_pipeline function.