### Setting Up the Environment
1. **Importing Libraries**: We import the necessary libraries, including `requests` for calling the Azure AI Search and Azure OpenAI REST APIs, `PyPDF2` for PDF text extraction, and `dotenv` for loading environment variables from a `.env` file.
2. **Loading Environment Variables**: We load the Azure OpenAI API key, API version, and endpoint, plus the Azure AI Search service name and admin key, from `variables.env` using `load_dotenv()`, so that credentials stay out of the notebook itself.
3. **Building the Search Endpoint**: We derive the search endpoint URL from the service name and set the name of the model deployment.
4. **Environment Check**: To verify that everything is set up properly, we print a summary of the loaded settings.


In [1]:
import os
import requests
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
import json
import re
import PyPDF2 
from typing import List, Dict
from dotenv import load_dotenv


In [None]:
load_dotenv('variables.env')
# Azure OpenAI configuration
azure_openai_api_key = os.getenv('AZURE_OPENAI_API_KEY')
azure_openai_api_version = os.getenv('AZURE_OPENAI_API_VERSION')
azure_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
service_name = os.getenv('AZURE_SERVICE_NAME') 
admin_key = os.getenv('AZURE_ADMIN_KEY')  
endpoint = f"https://{service_name}.search.windows.net"
openai_deployment = "gpt-4o"  # Replace with your model deployment name

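# Print the non-secret settings; for keys, only report whether they are set
# so that credentials do not end up in the saved notebook output
print(azure_openai_api_version)
print(azure_endpoint)
print(service_name)
print(endpoint)
print("AZURE_OPENAI_API_KEY set:", bool(azure_openai_api_key))
print("AZURE_ADMIN_KEY set:", bool(admin_key))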
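
The environment check can also be made explicit. The following is a minimal sketch, not part of the original notebook: `missing_settings` and `mask` are hypothetical helper names, and the list of required variables simply mirrors the names read with `os.getenv` above.

```python
import os

# Variable names taken from the os.getenv calls in the cell above
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_SERVICE_NAME",
    "AZURE_ADMIN_KEY",
]

def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

def mask(secret, visible=4):
    """Hide all but the last few characters of a secret for safe printing."""
    if not secret:
        return "<missing>"
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `missing_settings()` right after `load_dotenv(...)` and stopping if the list is non-empty gives an earlier and clearer failure than a rejected request later in the notebook.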

# Azure AI Search Index Creation

This cell creates an Azure AI Search (formerly Azure Cognitive Search) index for the climate change documents.

Key points:
- Sets up index schema with fields for id, content, and chapter
- Uses REST API to create the index
- Prints success message if index is created, error details if it fails

Note: Ensure `admin_key` and `endpoint` are correctly set before running.

In [None]:



# Index name
index_name = "climate-change-index"

# Headers for REST API calls
headers = {
    'Content-Type': 'application/json',
    'api-key': admin_key
}

# Updated Index schema
index_schema = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True, "filterable": False, "sortable": False, "facetable": False},
        {"name": "chapter", "type": "Edm.String", "filterable": True, "sortable": True, "facetable": True}
    ]
}

# Create index
create_index_url = f"{endpoint}/indexes/{index_name}?api-version=2020-06-30"
response = requests.put(create_index_url, headers=headers, json=index_schema)

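# PUT returns 201 when the index is created and 204 when an existing index is updated,
# so re-running this cell should not be reported as a failure
if response.status_code == 201:
    print(f"Index '{index_name}' created successfully.")
elif response.status_code == 204:
    print(f"Index '{index_name}' already existed and was updated.")
else:
    print(f"Failed to create index. Status code: {response.status_code}")
    print(response.text)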
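
Before sending the schema, it can help to check it locally. The sketch below is an illustration under stated assumptions: `validate_index_schema` is a hypothetical helper, and it only checks two rules (unique field names, and exactly one key field of type `Edm.String`), not everything the service enforces.

```python
def validate_index_schema(schema):
    """Return a list of problems found in an index schema dict (empty if none)."""
    problems = []
    fields = schema.get("fields", [])

    # Field names must be unique within an index
    names = [field.get("name") for field in fields]
    if len(names) != len(set(names)):
        problems.append("duplicate field names")

    # An index needs exactly one key field, and it must be a string
    key_fields = [field for field in fields if field.get("key")]
    if len(key_fields) != 1:
        problems.append(f"expected exactly one key field, found {len(key_fields)}")
    elif key_fields[0].get("type") != "Edm.String":
        problems.append("key field must be of type Edm.String")

    return problems
```

With the `index_schema` defined above, `validate_index_schema(index_schema)` returns an empty list.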

# PDF Text Extraction

This cell extracts text from a PDF file using PyPDF2.

Key operations:
- Defines a function to read and extract text from PDF
- Extracts text from "Understanding_Climate_Change.pdf"
- Prints a preview of the extracted text (first 500 characters)
- Saves the full extracted text to "extracted_text.txt"

Note: Ensure the PDF file path is correct before running.

In [None]:

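def extract_text_from_pdf(pdf_path):
    """Read a PDF and return the text of all pages, one page per line break."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages with no extractable text (e.g. scanned images)
            text += (page.extract_text() or "") + "\n"
    return text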

# Path to your PDF file
pdf_path = "./data/Understanding_Climate_Change.pdf"  # Replace with the actual path to your PDF file

# Extract text from the PDF
pdf_text = extract_text_from_pdf(pdf_path)

# Print the first 500 characters to verify the extraction
print("Extracted text preview (first 500 characters):")
print(pdf_text[:500] + "...")

# Save the extracted text to a file (optional, but useful for verification)
with open("extracted_text.txt", "w", encoding="utf-8") as text_file:
    text_file.write(pdf_text)

print(f"\nFull text has been saved to 'extracted_text.txt'")

# Content Chunking and Document Upload

This cell processes the extracted PDF text and uploads it to Azure Cognitive Search.

Key operations:
- Chunks the content into smaller segments
- Extracts chapter information
- Creates documents with ID, content, and chapter
- Uploads the documents to the search index

Note: This cell reads the extracted text from the `pdf_text` variable, so run the PDF extraction cell first.

In [None]:


def chunk_content(text, max_chunk_size=5000):
    """Group lines into chunks of roughly max_chunk_size characters."""
    chunks = []
    current_chunk = ""
    for line in text.split('\n'):
        # Start a new chunk once adding this line would exceed the limit;
        # a single line longer than the limit still becomes its own chunk
        if current_chunk and len(current_chunk) + len(line) + 1 > max_chunk_size:
            chunks.append(current_chunk.strip())
            current_chunk = line
        else:
            current_chunk += " " + line
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

def extract_chapters(text):
    """Return every heading of the form 'Chapter N: Title'."""
    chapters = re.findall(r'Chapter \d+: .+', text)
    return chapters

# Use the text extracted from the PDF in the previous cell
content = pdf_text

chapters = extract_chapters(content)
chunked_content = chunk_content(content)

documents = []
for i, chunk in enumerate(chunked_content):
    chapter = next((ch for ch in chapters if ch in chunk), "Unknown Chapter")
    doc = {
        "id": f"doc{i}",
        "content": chunk,
        "chapter": chapter
    }
    documents.append(doc)

# Upload documents
upload_url = f"{endpoint}/indexes/{index_name}/docs/index?api-version=2020-06-30"
response = requests.post(upload_url, headers=headers, json={"value": documents})

if response.status_code == 200:
    print("Documents uploaded successfully.")
else:
    print(f"Failed to upload documents. Status code: {response.status_code}")
    print(response.text)
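
The upload above sends every document in a single request. The indexing API accepts at most 1000 documents per request, so larger document sets need batching. The following is a minimal sketch: `batched` and `to_index_batch` are hypothetical helper names, and the choice of `mergeOrUpload` as the action is an assumption made so that re-running the upload updates existing documents.

```python
def batched(items, batch_size=1000):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def to_index_batch(docs, action="mergeOrUpload"):
    """Wrap documents in the request body shape used by the docs/index endpoint.

    Valid actions are "upload", "merge", "mergeOrUpload", and "delete".
    """
    return {"value": [{"@search.action": action, **doc} for doc in docs]}
```

Each batch can then be posted in turn, for example `requests.post(upload_url, headers=headers, json=to_index_batch(batch))` inside `for batch in batched(documents):`.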

# Index Verification

This cell verifies the content uploaded to Azure Cognitive Search.

Key operations:
- Defines functions to:
  - Get total document count
  - Retrieve a sample of documents
- Prints total document count
- Displays sample documents with ID and chapter

Note: This helps confirm successful upload and indexing of documents.

In [None]:



# Function to get document count
def get_document_count():
    count_url = f"{endpoint}/indexes/{index_name}/docs/$count?api-version=2020-06-30"
    response = requests.get(count_url, headers=headers)
    if response.status_code == 200:
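        # Decode with utf-8-sig so a leading byte-order mark, if present, does not break int()
        return int(response.content.decode('utf-8-sig').strip())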
    else:
        print(f"Failed to get document count. Status code: {response.status_code}")
        return None

# Function to get a sample of documents
def get_sample_documents(sample_size=5):
    search_url = f"{endpoint}/indexes/{index_name}/docs?api-version=2020-06-30"
    # The GET form of the Search Documents API expects OData-style '$top' and '$select'
    params = {
        'search': '*',
        '$top': sample_size,
        '$select': 'id,chapter'
    }
    response = requests.get(search_url, headers=headers, params=params)
    if response.status_code == 200:
        return response.json()['value']
    else:
        print(f"Failed to get sample documents. Status code: {response.status_code}")
        return None

# Verify content
doc_count = get_document_count()
if doc_count is not None:
    print(f"Total documents in the index: {doc_count}")

sample_docs = get_sample_documents()
if sample_docs:
    print("\nSample documents:")
    for doc in sample_docs:
        print(f"ID: {doc['id']}, Chapter: {doc['chapter']}")

# Document Search and Results Display

This cell demonstrates searching the Azure Cognitive Search index.

Key operations:
- Defines a function to search documents with highlighting
- Performs test searches on climate change topics
- Displays search results including:
  - Document ID
  - Chapter
  - Highlighted relevant content

Note: This showcases the search functionality and result presentation.

In [None]:


def search_documents(query, top=3):
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "top": top,
        "select": "id,chapter,content",
        "highlight": "content"
    }
    response = requests.post(search_url, headers=headers, json=body)
    if response.status_code == 200:
        return response.json()['value']
    else:
        print(f"Search failed. Status code: {response.status_code}")
        return None

# Test searches
test_queries = [
    "climate change impacts",
    "renewable energy solutions",
    "carbon capture technologies"
]

for query in test_queries:
    print(f"\nSearching for: {query}")
    results = search_documents(query)
    if results:
        for result in results:
            print(f"\nID: {result['id']}")
            print(f"Chapter: {result['chapter']}")
            print(f"Relevant content: {result.get('@search.highlights', {}).get('content', ['No highlight'])[0]}")
    else:
        print("No results found.")

# Content Rechunking and Reindexing

This cell improves the document chunking process and reindexes the content.

Key operations:
- Redefines chunking function to preserve chapter information
- Loads extracted PDF text from file
- Creates new documents with improved chapter tracking
- Uploads rechunked documents to the search index

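Note: Because each chunk now carries the heading of the chapter it falls under, chunks that follow a chapter heading are labeled correctly even when they do not contain the heading text themselves.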

In [None]:


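def chunk_content(text, max_chunk_size=5000):
    """Split text into (chapter, chunk) pairs, starting a new chunk at each chapter heading."""
    chunks = []
    current_chunk = ""
    current_chapter = "Unknown Chapter"
    
    for line in text.split('\n'):
        # Only treat lines shaped like "Chapter 7: Title" as headings, so a sentence
        # that merely begins with the word "Chapter" does not start a new chapter
        if re.match(r'Chapter \d+:', line.strip()):
            if current_chunk.strip():
                chunks.append((current_chapter, current_chunk.strip()))
            current_chunk = line
            current_chapter = line.strip()
        elif current_chunk and len(current_chunk) + len(line) + 1 > max_chunk_size:
            chunks.append((current_chapter, current_chunk.strip()))
            current_chunk = line
        else:
            current_chunk += " " + line
    
    if current_chunk.strip():
        chunks.append((current_chapter, current_chunk.strip()))
    
    return chunks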

# Load the extracted PDF text
with open("extracted_text.txt", "r", encoding="utf-8") as file:
    pdf_content = file.read()

chunked_content = chunk_content(pdf_content)

documents = []
for i, (chapter, chunk) in enumerate(chunked_content):
    doc = {
        "id": f"doc{i}",
        "content": chunk,
        "chapter": chapter
    }
    documents.append(doc)

# Upload documents
upload_url = f"{endpoint}/indexes/{index_name}/docs/index?api-version=2020-06-30"
response = requests.post(upload_url, headers=headers, json={"value": documents})

if response.status_code == 200:
    print(f"Successfully reindexed {len(documents)} documents.")
else:
    print(f"Failed to reindex documents. Status code: {response.status_code}")
    print(response.text)

# RAG Pipeline Implementation

This cell implements a Retrieval-Augmented Generation (RAG) pipeline for climate change questions.

Key components:
- Document retrieval from Azure Cognitive Search
- Prompt creation using retrieved documents
- Response generation using Azure OpenAI

Features:
- Utilizes both search and OpenAI APIs
- Creates context-aware prompts
- Implements an interactive Q&A loop

Note: Ensure all API keys and endpoints are correctly set before running.

In [10]:

search_headers = {
    'Content-Type': 'application/json',
    'api-key': admin_key
}

openai_headers = {
    'Content-Type': 'application/json',
    'api-key': azure_openai_api_key
}

def retrieve_documents(query: str, top: int = 3) -> List[Dict]:
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "top": top,
        "select": "content,chapter",
        "highlight": "content"
    }
    response = requests.post(search_url, headers=search_headers, json=body)
    if response.status_code == 200:
        return response.json()['value']
    else:
        print(f"Search failed. Status code: {response.status_code}")
        return []

def create_prompt(query: str, docs: List[Dict]) -> str:
    context = "\n\n".join([f"Chapter: {doc['chapter']}\nContent: {doc['content']}" for doc in docs])
    return f"""You are an AI assistant specializing in climate change. Use the following information to answer the user's question. If you can't answer the question based on the provided information, say so.

Context:
{context}

User Question: {query}

Answer:"""

def generate_response(prompt: str) -> str:
    url = f"{azure_endpoint}/openai/deployments/{openai_deployment}/chat/completions?api-version=2023-05-15"
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that answers questions about climate change."},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 150,
        "temperature": 0.7,
        "n": 1
    }
    response = requests.post(url, headers=openai_headers, json=payload)
    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        print(f"OpenAI API request failed. Status code: {response.status_code}")
        return "Sorry, I couldn't generate a response at this time."

def rag_pipeline(query: str) -> str:
    # Retrieve relevant documents
    docs = retrieve_documents(query)
    
    # Create prompt
    prompt = create_prompt(query, docs)
    
    # Generate response
    response = generate_response(prompt)
    
    return response

# Example usage
if __name__ == "__main__":
    while True:
        user_query = input("Enter your question about climate change (or 'quit' to exit): ")
        if user_query.lower() == 'quit':
            break
        
        answer = rag_pipeline(user_query)
        print(f"\nAnswer: {answer}\n")
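
With chunks of up to 5000 characters and three retrieved documents, the context can grow large relative to the short answer requested. The sketch below shows one way to cap it. `build_context` is a hypothetical helper and the 6000-character default is an arbitrary illustration, not a recommended value. It produces the same `Chapter: ... / Content: ...` layout as `create_prompt`.

```python
def build_context(docs, max_chars=6000):
    """Join chapter/content entries, truncating so the result never exceeds max_chars."""
    parts = []
    used = 0
    for doc in docs:
        entry = f"Chapter: {doc['chapter']}\nContent: {doc['content']}"
        remaining = max_chars - used
        if remaining <= 0:
            break
        if len(entry) > remaining:
            # Cut the last entry short rather than dropping it entirely
            entry = entry[:remaining]
        parts.append(entry)
        # Count the "\n\n" separator that precedes the next entry
        used += len(entry) + 2
    return "\n\n".join(parts)
```

An empty result from `retrieve_documents` yields an empty context string, which a caller could use as a signal to skip the model call altogether.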

# Enhanced RAG Pipeline with Customization Options

This cell implements an advanced Retrieval-Augmented Generation (RAG) pipeline for climate change queries.

Key features:
- Flexible document retrieval from Azure Cognitive Search
- Customizable prompt styles (default, detailed, concise)
- Adjustable response length with max_tokens
- Interactive user input for query parameters

Components:
- Document retrieval function
- Dynamic prompt creation based on style
- Azure OpenAI integration for response generation
- Main loop for continuous user interaction

Note: Ensure all API keys and endpoints are correctly configured before running.

In [None]:

search_headers = {
    'Content-Type': 'application/json',
    'api-key': admin_key
}

openai_headers = {
    'Content-Type': 'application/json',
    'api-key': azure_openai_api_key
}

def retrieve_documents(query: str, top: int = 3) -> List[Dict]:
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "top": top,
        "select": "content,chapter",
        "highlight": "content"
    }
    response = requests.post(search_url, headers=search_headers, json=body)
    if response.status_code == 200:
        return response.json()['value']
    else:
        print(f"Search failed. Status code: {response.status_code}")
        return []

def create_prompt(query: str, docs: List[Dict], prompt_style: str = "default") -> str:
    context = "\n\n".join([f"Chapter: {doc['chapter']}\nContent: {doc['content']}" for doc in docs])
    
    if prompt_style == "detailed":
        return f"""You are an AI assistant specializing in climate change. Use the following information to answer the user's question. If you can't answer the question based on the provided information, say so. Provide a detailed explanation and, if possible, include specific examples or data from the context.

Context:
{context}

User Question: {query}

Detailed Answer:"""
    
    elif prompt_style == "concise":
        return f"""You are an AI assistant specializing in climate change. Provide a brief and concise answer to the user's question based on the following information. If you can't answer the question based on the provided information, say so briefly.

Context:
{context}

User Question: {query}

Concise Answer:"""
    
    else:  # default prompt style
        return f"""You are an AI assistant specializing in climate change. Use the following information to answer the user's question. If you can't answer the question based on the provided information, say so.

Context:
{context}

User Question: {query}

Answer:"""

def generate_response(prompt: str, max_tokens: int = 150) -> str:
    url = f"{azure_endpoint}/openai/deployments/{openai_deployment}/chat/completions?api-version=2023-05-15"
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that answers questions about climate change."},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "n": 1
    }
    response = requests.post(url, headers=openai_headers, json=payload)
    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        print(f"OpenAI API request failed. Status code: {response.status_code}")
        return "Sorry, I couldn't generate a response at this time."

def rag_pipeline(query: str, num_docs: int = 3, prompt_style: str = "default", max_tokens: int = 150) -> str:
    # Retrieve relevant documents
    docs = retrieve_documents(query, num_docs)
    
    # Create prompt
    prompt = create_prompt(query, docs, prompt_style)
    
    # Generate response
    response = generate_response(prompt, max_tokens)
    
    return response

# Example usage
if __name__ == "__main__":
    while True:
        user_query = input("Enter your question about climate change (or 'quit' to exit): ")
        if user_query.lower() == 'quit':
            break
        
        num_docs = int(input("Enter the number of documents to retrieve (1-5): "))
        prompt_style = input("Enter prompt style (default/detailed/concise): ")
        max_tokens = int(input("Enter maximum number of tokens for the response (50-500): "))
        
        answer = rag_pipeline(user_query, num_docs, prompt_style, max_tokens)
        print(f"\nAnswer: {answer}\n")
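
The interactive loop above passes user input straight to `int()` and accepts any prompt style, so a typo raises `ValueError` and the suggested ranges are not enforced. A minimal sketch of input validation follows. `parse_int_in_range` and `normalize_style` are hypothetical helpers, and falling back to a default instead of asking again is a design choice made for brevity.

```python
ALLOWED_STYLES = ("default", "detailed", "concise")

def parse_int_in_range(raw, low, high, default):
    """Parse raw as an integer clamped to [low, high]; return default if it is not a number."""
    try:
        value = int(raw.strip())
    except ValueError:
        return default
    return max(low, min(high, value))

def normalize_style(raw):
    """Map user input to one of the known prompt styles, falling back to 'default'."""
    style = raw.strip().lower()
    return style if style in ALLOWED_STYLES else "default"
```

For example, `num_docs = parse_int_in_range(input(...), 1, 5, 3)` keeps the retrieval count within the range shown in the prompt text.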

####    Perform a filtered search on the Azure AI Search index.

  This function allows you to search for documents that match both a search query
  and a specific filter condition. Filtering is useful for narrowing down search 
  results based on specific criteria.

  Parameters:
  - query (str): The search query string.
  - filter_condition (str): An OData filter expression. This allows for complex
    filtering based on field values. Only fields marked as filterable in the index
    (here `id` and `chapter`, not `content`) can be used. Examples include:
    - "id eq 'doc14'"  # Exact match on id
    - "chapter eq 'Chapter 7: The Economics of Climate Change'"  # Exact match on chapter
    - "id ge 'doc10' and id le 'doc20'"  # Range of document ids
  - top (int, optional): The maximum number of results to return. Default is 10.

  Returns:
  - dict: A dictionary containing the search results if successful, None otherwise.
    The 'value' key in the dictionary contains the list of matching documents.

  Usage example:
  result = filtered_search("climate change", "id eq 'doc14'")

In [None]:
def filtered_search(query, filter_condition, top=10):
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "filter": filter_condition,
        "top": top,
        "select": "id,content"
    }
    try:
        response = requests.post(search_url, headers=headers, json=body)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        # e.response is None when no HTTP response was received (e.g. a connection error)
        if e.response is not None:
            print(f"Response status code: {e.response.status_code}")
            print(f"Response content: {e.response.content}")
        return None

# Example usage
filter_condition = "id eq 'doc14'"  # Filter for a specific document
result = filtered_search("climate change", filter_condition)

if result:
    print(f"Filtered search successful. Number of results: {len(result.get('value', []))}")
    for item in result.get('value', []):
        print(f"ID: {item.get('id')}")
        print(f"Content snippet: {item.get('content')[:100]}...")
        print("---")
else:
    print("Filtered search failed. Please check the error messages above.")
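
Filter values that contain a single quote need escaping, because OData string literals are delimited by single quotes and represent an embedded quote by doubling it. The helpers below are a minimal sketch with hypothetical names (`odata_string`, `eq_filter`, `any_of_filter`). The chapter title containing an apostrophe is made up for illustration.

```python
def odata_string(value):
    """Quote a Python string as an OData string literal (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"

def eq_filter(field, value):
    """Build an exact-match filter such as: id eq 'doc14'."""
    return f"{field} eq {odata_string(value)}"

def any_of_filter(field, values):
    """Build a filter matching any of the given values, joined with 'or'."""
    return " or ".join(eq_filter(field, value) for value in values)

# Made-up title with an apostrophe, to show the escaping
print(eq_filter("chapter", "Chapter 2: Earth's Climate"))
```

The resulting strings can be passed as `filter_condition`, for example `filtered_search("climate change", eq_filter("id", "doc14"))`.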

#### Description:

- **Purpose**: Performs a search and returns faceted results.
- **Parameters**:
  - `query`: The search term(s).
  - `facets`: List of fields to facet on.
  - `top`: Maximum number of results to return.
- **Usage**: Helpful for creating categorized search results, allowing users to drill down into specific categories.

In [None]:
def faceted_search(query, facets, top=10):
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "facets": facets,
        "top": top,
        "select": "id,content"
    }
    try:
        response = requests.post(search_url, headers=headers, json=body)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        # e.response is None when no HTTP response was received (e.g. a connection error)
        if e.response is not None:
            print(f"Response status code: {e.response.status_code}")
            print(f"Response content: {e.response.content}")
        return None

# Example usage
# Facet on 'chapter', which is declared facetable in the index schema;
# faceting on the unique 'id' would only yield one bucket per document
facets = ["chapter"]
result = faceted_search("climate change", facets)

if result:
    print(f"Faceted search successful. Number of results: {len(result.get('value', []))}")
    print("Facets:")
    for facet in result.get('@search.facets', {}).get('chapter', []):
        print(f"- {facet['value']}: {facet['count']}")
    print("\nTop results:")
    for item in result.get('value', [])[:3]:
        print(f"ID: {item.get('id')}")
        print(f"Content snippet: {item.get('content')[:100]}...")
        print("---")
else:
    print("Faceted search failed. Please check the error messages above.")
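
Facet buckets arrive in the response under `@search.facets`, keyed by field name, each with a `value` and a `count`. The sketch below turns them into a sorted list that is easy to display. `summarize_facets` is a hypothetical helper, and the sample response is made-up data in the shape the code above reads.

```python
def summarize_facets(response, field):
    """Return (value, count) pairs for one facet field, largest count first."""
    buckets = response.get("@search.facets", {}).get(field, [])
    pairs = [(bucket["value"], bucket["count"]) for bucket in buckets]
    # Sort by descending count, then alphabetically for a stable order
    return sorted(pairs, key=lambda pair: (-pair[1], str(pair[0])))

# Made-up response fragment for illustration only
sample_response = {
    "@search.facets": {
        "chapter": [
            {"value": "Chapter A", "count": 2},
            {"value": "Chapter B", "count": 4},
        ]
    },
    "value": [],
}

for value, count in summarize_facets(sample_response, "chapter"):
    print(f"{value}: {count}")
```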

#### Description:

- **Purpose**: Performs a search and returns results with highlighted matching terms.
- **Parameters**:
  - `query`: The search term(s).
  - `top`: Maximum number of results to return.
- **Usage**: Enhances readability of search results by visually emphasizing the matched terms in the content.

Each of these functions enhances the search experience in a different way:

- Filtered search allows for precise querying based on specific field values.
- Faceted search provides a way to categorize and navigate through search results.
- Highlighted search helps users quickly identify where their search terms appear in the results.

In [None]:
def highlighted_search(query, top=10):
    search_url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2020-06-30"
    body = {
        "search": query,
        "highlight": "content",
        "highlightPreTag": "<em>",
        "highlightPostTag": "</em>",
        "top": top,
        "select": "id,content"
    }
    try:
        response = requests.post(search_url, headers=headers, json=body)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        # e.response is None when no HTTP response was received (e.g. a connection error)
        if e.response is not None:
            print(f"Response status code: {e.response.status_code}")
            print(f"Response content: {e.response.content}")
        return None

# Example usage
result = highlighted_search("climate change impacts")

if result:
    print(f"Highlighted search successful. Number of results: {len(result.get('value', []))}")
    for item in result.get('value', [])[:3]:
        print(f"ID: {item.get('id')}")
        print("Highlighted content:")
        for highlight in item.get('@search.highlights', {}).get('content', []):
            print(f"- {highlight}")
        print("---")
else:
    print("Highlighted search failed. Please check the error messages above.")
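
The highlight fragments contain the `<em>` and `</em>` tags configured in the request, which read poorly in plain notebook output. The following sketch converts them for display and extracts the matched terms. Both helper names are hypothetical, and the `**` marker is an arbitrary choice.

```python
import re

def render_highlight(fragment, pre_tag="<em>", post_tag="</em>", marker="**"):
    """Replace the highlight tags with a plain-text marker."""
    return fragment.replace(pre_tag, marker).replace(post_tag, marker)

def highlighted_terms(fragment, pre_tag="<em>", post_tag="</em>"):
    """Return the text found between each pair of highlight tags."""
    pattern = re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag)
    return re.findall(pattern, fragment)

# Made-up fragment in the shape returned under '@search.highlights'
sample = "The <em>climate</em> is <em>changing</em>."
print(render_highlight(sample))
print(highlighted_terms(sample))
```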

#### Azure AI Search UI Features and Example Queries

Here's a list of features you can test in the Azure AI Search UI, along with example queries for each:

1. Basic Full-Text Search:
```
search=climate change
```

2. Filtering:
```
search=climate&$filter=chapter eq 'Chapter 7: The Economics of Climate Change'
```

3. Sorting:
```
search=climate&$orderby=id desc
```

4. Faceting:
```
search=climate&facet=chapter
```

5. Selecting Specific Fields:
```
search=climate&$select=id,chapter
```

6. Limiting Results:
```
search=climate&$top=5
```

7. Skipping Results (for pagination):
```
search=climate&$skip=10&$top=10
```

8. Highlighting:
```
search=climate&highlight=content
```

9. Count:
```
search=climate&$count=true
```

10. Fuzzy Search (requires the full Lucene query syntax):
```
search=climte~1&queryType=full
```

11. Wildcard Search:
```
search=clim*
```

12. Regular Expression (requires the full Lucene query syntax):
```
search=/clim(ate|e)/&queryType=full
```

13. Search Multiple Fields:
```
search=climate change&searchFields=content,chapter
```

14. Term Boosting (requires the full Lucene query syntax):
```
search=climate^2 change&queryType=full
```

15. Complex Queries:
```
search=climate change&$filter=chapter eq 'Chapter 13: Climate Change and Social Justice'&facet=id&$orderby=id&$select=id,chapter&$top=5&highlight=content&$count=true
```

16. Semantic Search (if enabled):
```
search=impact of climate change on agriculture&queryType=semantic&semanticConfiguration=default
```

17. Autocomplete (if suggesters are configured):
```
/autocomplete?search=cli&suggesterName=sg
```

Remember to replace field names (like 'chapter') with the actual names in your index if they differ.
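
The query strings above can also be assembled in code, which takes care of URL encoding for spaces, quotes, and the `$` prefix. This is a minimal sketch using only the standard library. `build_search_url` is a hypothetical helper, the service URL is a placeholder, and the API version default matches the one used elsewhere in this notebook.

```python
from urllib.parse import urlencode, quote

def build_search_url(endpoint, index_name, params, api_version="2020-06-30"):
    """Build a GET URL for the docs endpoint from a dict of query parameters."""
    # quote_via=quote encodes spaces as %20 instead of '+'
    query = urlencode({**params, "api-version": api_version}, quote_via=quote)
    return f"{endpoint}/indexes/{index_name}/docs?{query}"

# Placeholder service URL for illustration
url = build_search_url(
    "https://example.search.windows.net",
    "climate-change-index",
    {
        "search": "climate change",
        "$filter": "chapter eq 'Chapter 7: The Economics of Climate Change'",
        "$top": 5,
    },
)
print(url)
```

The encoded form (for example `%24top` for `$top`) is equivalent to the readable form shown in the list above; the service decodes it on arrival.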

## Tips for Testing:
1. Start with simple queries and gradually add complexity.
2. Pay attention to the JSON response to understand how each parameter affects the results.
3. Use the 'Request URL' generated by the UI to understand how queries are constructed.
4. Experiment with combining different features to see how they interact.

## Note:
Some features (like Semantic Search or Autocomplete) may require additional configuration in your search service. If these are not set up, the corresponding queries may not work.