# Needle(s) in a Haystack Test

How many 'needles' can GPT4o extract from a large amount of context? For reference, there are about 21,600 tokens in the Microsoft Build document.

Let's have GPT4o generate a structured table based on the announcements from the Microsoft Build 2024 Book of News!

## Import Libraries 🧑‍💻

We are bringing in a few libraries here, most of them LangChain libraries:

1. AzureAIDocumentIntelligenceLoader again to load and convert the PDF to Markdown

2. AzureChatOpenAI to send and receive API requests from GPT4o

3. ChatPromptTemplate so we can build a prompt to ask GPT4o to structure data for us

4. StrOutputParser to ensure the output from the LLM is in string format. This is important since we are going to leverage Pandas to query the data once it is generated

5. Pandas so we can do some light filtering on the generated data

In [15]:
import os

import pandas as pd
from dotenv import load_dotenv
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

# Load API keys and endpoints from a local .env file into the environment
load_dotenv()
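
The cells in this notebook read four environment variables. As a minimal sketch (the helper name `missing_env_vars` is an assumption for illustration, not part of any library), a quick check after `load_dotenv()` can surface a missing key before any API call fails with a less obvious error:

```python
import os

# Variable names taken from the cells in this notebook
REQUIRED_ENV_VARS = [
    "DOCUMENT_INTELLIGENCE_KEY",
    "DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.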

## Load Microsoft Build Document 📄

Load the Book of News document.

In [None]:
loader = AzureAIDocumentIntelligenceLoader(file_path="C:\\Users\\conne\\development\\repos\\chunking_for_rag\\Book_Of_News.pdf", 
                                           api_key=os.environ.get('DOCUMENT_INTELLIGENCE_KEY'), 
                                           api_endpoint=os.environ.get('DOCUMENT_INTELLIGENCE_ENDPOINT'))
book_of_build = loader.load()

## Bring in GPT4o 🤖

Bring in GPT4o with its extra-large context window of 128,000 tokens (roughly 96,000 words). This is the LLM we are going to feed the roughly 21,600 tokens of the Book of News document to.

In [16]:
llm = AzureChatOpenAI(
    azure_deployment="gpt4o",
    temperature=0,
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-01"
)
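
To put the sizes in perspective, here is a rough sketch of the context budget. It assumes a 128,000-token window for GPT-4o (which corresponds to roughly 96,000 words) and the approximately 21,600 tokens quoted for the document. The `context_usage` helper and the 4,096 tokens reserved for the reply are illustrative assumptions, not settings used in this notebook:

```python
# Assumed figures: a 128,000-token window and ~21,600 document tokens
CONTEXT_WINDOW_TOKENS = 128_000
DOCUMENT_TOKENS = 21_600


def context_usage(prompt_tokens, window_tokens, reserved_for_output=4_096):
    """Return (fraction of window used by the prompt, tokens left after reserving output room)."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    remaining = window_tokens - prompt_tokens - reserved_for_output
    return prompt_tokens / window_tokens, remaining


fraction, remaining = context_usage(DOCUMENT_TOKENS, CONTEXT_WINDOW_TOKENS)
print(f"{fraction:.1%} of the window used, {remaining:,} tokens to spare")
```

Under these assumptions the document takes up well under a fifth of the window, so the test here is about recall across a long input rather than about hitting the size limit.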

## Make an LLM Call to Make Unstructured Data Structured 📞

Let's see how many 'needles' (i.e., facts) GPT4o can extract from roughly 21,600 tokens and effectively structure for us.

In [17]:
prompt = ChatPromptTemplate.from_template("""
You are an assistant that will summarize all of the Azure AI announcements in the below context. Make sure to put the data into the following table. Make sure to only respond with the table and nothing else.

| Service      | Announcement |
|--------------|--------------|
|              |              |

Context:
{docs_string}
""")

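# Concatenate the Markdown content of every loaded page into a single string
docs_string = "".join(page.page_content for page in book_of_build)

# Pipe the prompt into the model, then parse the reply into a plain string
chain = prompt | llm | StrOutputParser()
# Pass the concatenated text, not the raw list of Document objects
table = chain.invoke({"docs_string": docs_string})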

## Print LLM Generated Table 🤖

Let's have a look at the table GPT4o generated for us and see how many needles it found.

In [18]:
print(table)

| Service      | Announcement |
|--------------|--------------|
| Azure AI Services | Announcing Azure Patterns and Practices for Private Chatbots |
| Azure AI Services | Announcing Custom Generative Mode in Preview Soon |
| Azure AI Services | Azure AI Search Features Search Relevance Updates and New Integrations |
| Azure AI Services | Azure AI Studio Lets Developers Responsibly Build and Deploy Custom Copilots |
| Azure AI Services | Azure OpenAI Service Features Key AI Advancements |
| Azure AI Services | Khan Academy and Microsoft Announce Partnership |
| Azure AI Services | Microsoft Adds Multimodal Phi-3 Model Phi-3-Vision |
| Azure AI Services | Safeguard Copilots with New Azure AI Content Safety Capabilities |
| Azure AI Services | Speech Analytics, Video Dubbing in Preview in Azure AI Speech |
| Azure Data | Introducing Real-Time Intelligence in Microsoft Fabric |
| Azure Data | New AI Capabilities in Azure Database for PostgreSQL |
| Azure Data | New Capabilities and Updates

## Minor Data Cleaning 🧼

Let's get the LLM-generated table into a format we can query in Pandas 🐼

In [19]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

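# Skip the header and separator rows (the first two lines), then split each row on '|'
rows = [row.strip().split('|') for row in table.strip().split('\n')[2:]]
# Drop the first and last pieces of each split (empty strings for a well-formed row) and trim whitespace
data_list = [[value.strip() for value in row[1:-1]] for row in rows]
df = pd.DataFrame(data_list, columns=["Service", "Announcement"])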
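
The slicing above assumes every row is well formed, which is not guaranteed when the model's output is cut off, as in the last row of the printed table. A slightly more defensive sketch follows; `parse_markdown_table` is a hypothetical helper written for this illustration, and it assumes cell text never contains a `|` character:

```python
def parse_markdown_table(table_text, n_columns=2):
    """Parse a pipe-delimited Markdown table into (header, rows).

    Lines that are not fully enclosed in pipes, separator rows, empty rows,
    and rows with the wrong number of cells are skipped.
    """
    parsed = []
    for line in table_text.strip().splitlines():
        line = line.strip()
        # A complete row starts and ends with a pipe; truncated rows do not
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip separator rows such as |---|---| and rows with only empty cells
        if all(set(cell) <= set("-:") for cell in cells):
            continue
        if len(cells) != n_columns:
            continue
        parsed.append(cells)
    if not parsed:
        return [], []
    # The first surviving row is assumed to be the header
    return parsed[0], parsed[1:]


sample = """
| Service      | Announcement |
|--------------|--------------|
| Azure AI Services | Microsoft Adds Multimodal Phi-3 Model Phi-3-Vision |
| Azure Data | New Capabilities and Updates
"""
header, rows = parse_markdown_table(sample)
print(header, rows)
```

The returned values can be handed to Pandas with `pd.DataFrame(rows, columns=header)`. Dropping truncated rows is a design choice: it keeps the DataFrame clean at the cost of silently losing a partial needle, so it may be worth logging the skipped lines in a real run.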

## Query using Pandas 🐼

Leverage Pandas to filter data that the LLM generated.

In [20]:
azure_ai_df = df[df['Service'] == 'Azure AI Services']
print(azure_ai_df)

             Service                                                                  Announcement
0  Azure AI Services                  Announcing Azure Patterns and Practices for Private Chatbots
1  Azure AI Services                             Announcing Custom Generative Mode in Preview Soon
2  Azure AI Services        Azure AI Search Features Search Relevance Updates and New Integrations
3  Azure AI Services  Azure AI Studio Lets Developers Responsibly Build and Deploy Custom Copilots
4  Azure AI Services                             Azure OpenAI Service Features Key AI Advancements
5  Azure AI Services                               Khan Academy and Microsoft Announce Partnership
6  Azure AI Services                            Microsoft Adds Multimodal Phi-3 Model Phi-3-Vision
7  Azure AI Services              Safeguard Copilots with New Azure AI Content Safety Capabilities
8  Azure AI Services                 Speech Analytics, Video Dubbing in Preview in Azure AI Speech
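
To turn "how many needles did it find" into a number, the extracted announcements can be compared against a reference list. The sketch below is only an illustration: `needle_recall` is a hypothetical helper, and the `expected` list (including its deliberately made-up third entry) is placeholder data, not a ground truth taken from the Book of News:

```python
def needle_recall(expected, found):
    """Return (recall, missed), matching needles case- and whitespace-insensitively."""
    def normalize(text):
        return " ".join(text.lower().split())

    found_set = {normalize(item) for item in found}
    missed = [item for item in expected if normalize(item) not in found_set]
    recall = (len(expected) - len(missed)) / len(expected) if expected else 0.0
    return recall, missed


# Placeholder reference list; the last entry is invented to show a miss
expected = [
    "Microsoft Adds Multimodal Phi-3 Model Phi-3-Vision",
    "Khan Academy and Microsoft Announce Partnership",
    "A hypothetical announcement the model missed",
]
found = [
    "Khan Academy and Microsoft Announce Partnership",
    "microsoft adds multimodal phi-3 model  phi-3-vision",
]

recall, missed = needle_recall(expected, found)
print(f"Recall: {recall:.0%}, missed: {missed}")
```

In the notebook, `found` could be `df["Announcement"].dropna().tolist()`, with `expected` compiled by hand from the source document.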
