# Needle(s) in a Haystack Test

How many 'needles' can GPT-4o extract from a large amount of context? For reference, the Microsoft Build document contains about 21,600 tokens.

Let's have GPT-4o generate a structured table based on the announcements from Microsoft Build 2024!

## Import Libraries 🧑‍💻

We are bringing in a few libraries here, most of them from LangChain:

1. AzureAIDocumentIntelligenceLoader, again, to load the PDF and convert it to Markdown

2. AzureChatOpenAI to send API requests to GPT-4o and receive its responses

3. ChatPromptTemplate so we can build a prompt that asks GPT-4o to structure the data for us

4. StrOutputParser to ensure the output from the LLM is a plain string. This matters because we parse that string into a Pandas DataFrame and query it later

In [1]:
import os

import pandas as pd
from dotenv import load_dotenv
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

# Read API keys and endpoints from a local .env file into the environment
load_dotenv()

## Load Microsoft Build Document 📄

Load the Book of News document.

In [None]:
loader = AzureAIDocumentIntelligenceLoader(
    file_path="C:\\Users\\conne\\development\\repos\\chunking_for_rag\\Book_Of_News.pdf",
    api_key=os.environ.get('DOCUMENT_INTELLIGENCE_KEY'),
    api_endpoint=os.environ.get('DOCUMENT_INTELLIGENCE_ENDPOINT'),
    api_model="prebuilt-layout",
)
book_of_build = loader.load()
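
Both the loader above and the model client below read their credentials with `os.environ.get`, which quietly returns `None` when a variable is unset, so a missing key only surfaces later as a confusing authentication error. A minimal sketch of an up-front check follows; the variable names are the ones used in this notebook, while the helper `missing_env_vars` is a hypothetical name introduced here for illustration.

```python
import os

# Environment variables this notebook reads
REQUIRED_ENV_VARS = [
    "DOCUMENT_INTELLIGENCE_KEY",
    "DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` would make a misconfigured `.env` file obvious before any API call is attempted.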

## Bring in GPT-4o 🤖

Bring in GPT-4o with its extra-large context window of 128,000 tokens (roughly 96,000 words). This is the LLM we are going to feed the 21,600 words of the Book of News document to.

In [None]:
llm = AzureChatOpenAI(
    azure_deployment="gpt4o",
    temperature=0,
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-01"
)
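
The word and token figures quoted in this notebook can be related with a rough rule of thumb. The sketch below assumes about 0.75 English words per token, which is only an approximation (a real tokenizer gives exact counts), and a 128,000-token context window for GPT-4o. It uses the 21,600-token figure from the introduction; the helper names are illustrative and not part of the notebook.

```python
WORDS_PER_TOKEN = 0.75           # assumption: rough average for English text
CONTEXT_WINDOW_TOKENS = 128_000  # assumption: GPT-4o context window size


def words_to_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many tokens a given number of words would take."""
    return round(word_count / words_per_token)


def context_share(token_count, window=CONTEXT_WINDOW_TOKENS):
    """Fraction of the context window a prompt of this size would occupy."""
    return token_count / window


# Express the window size in words using the same rule of thumb
window_in_words = round(CONTEXT_WINDOW_TOKENS * WORDS_PER_TOKEN)

doc_tokens = 21_600  # document size quoted in the introduction
share = context_share(doc_tokens)

print(f"Window holds roughly {window_in_words:,} words")
print(f"A {doc_tokens:,}-token document uses about {share:.0%} of the window")
```

Under these assumptions the document occupies well under a quarter of the window, leaving plenty of room for the prompt instructions and the generated table.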

## Make an LLM Call to Make Unstructured Data Structured 📞

Let's see how many 'needles' GPT-4o can extract from 21,600 words and how effectively it structures them for us.

In [None]:
prompt = ChatPromptTemplate.from_template("""
You are an assistant that will summarize all of the Azure AI and Data Services announcements in the below context. Make sure to put the data into the following table. Make sure to only respond with the table and nothing else.

| Service      | Announcement |
|--------------|--------------|
|              |              |             

Context:
{docs_string}
""")

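# Concatenate the content of every loaded page into one string for the prompt
docs_string = "".join(page.page_content for page in book_of_build)

chain = prompt | llm | StrOutputParser()
# Pass the concatenated text, not the list of Document objects
table = chain.invoke({"docs_string": docs_string})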

## Print LLM Generated Table 🤖

Let's have a look at the table GPT-4o generated for us.

In [None]:
print(table)

## Minor Data Cleaning 🧼

Let's get the LLM-generated table into a format we can query in Pandas 🐼

In [None]:
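# Show full cell contents when printing the DataFrame
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

# Skip the header and separator rows, then split each remaining row on the pipe character
rows = [row.strip().split('|') for row in table.strip().split('\n')[2:]]
# Drop the empty strings left by the leading and trailing pipes and trim whitespace
data_list = [[value.strip() for value in row[1:-1]] for row in rows]
df = pd.DataFrame(data_list, columns=["Service", "Announcement"])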
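
The slicing approach above assumes the response starts directly with the header row and contains nothing but well-formed table lines. If the model adds a sentence before the table or emits an empty row, the `[2:]` offset or the column count can be thrown off. Below is a slightly more defensive sketch in plain Python; `parse_markdown_table` is a hypothetical helper, and the sample text is made up for illustration rather than real model output.

```python
def parse_markdown_table(text):
    """Return (header, rows) from the first Markdown table found in text."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        # Only keep lines that look like table rows: | cell | cell |
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the separator row, whose cells contain only dashes and colons
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        parsed.append(cells)
    if not parsed:
        return [], []
    header, body = parsed[0], parsed[1:]
    # Drop rows that are empty or whose width does not match the header
    body = [row for row in body if len(row) == len(header) and any(row)]
    return header, body


# Made-up sample standing in for an LLM response
sample = """Here is the table:

| Service | Announcement |
|---------|--------------|
| Service A | First example announcement |
| Service B | Second example announcement |
|           |                             |
"""

header, body = parse_markdown_table(sample)
print(header)
print(body)
```

The result can be handed to Pandas in the same way as before, for example `pd.DataFrame(body, columns=header)`. Note that a cell containing a literal pipe character would still be split incorrectly by this sketch.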

## Query using Pandas 🐼

Write a few Pandas queries to filter the data that the LLM generated.

In [None]:
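# Rows the model labelled as Azure AI Services
azure_ai_df = df[df['Service'] == 'Azure AI Services']
print(azure_ai_df)

# Rows the model labelled as Developer Tools & DevOps
dev_tools_df = df[df['Service'] == 'Developer Tools & DevOps']
print(dev_tools_df)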
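
Exact equality filters like the ones above return nothing if the model labels a service slightly differently, for example with different casing or a longer name. The sketch below shows a more forgiving check in plain Python, using made-up rows and hypothetical helper names. In Pandas, the rough counterparts would be `df['Service'].value_counts()` for counting and `df['Service'].str.contains("azure ai", case=False)` for matching.

```python
from collections import Counter

# Made-up rows standing in for the parsed LLM table
rows = [
    ("Azure AI Services", "Example announcement one"),
    ("azure ai services", "Example announcement two"),
    ("Developer Tools & DevOps", "Example announcement three"),
]


def count_by_service(rows):
    """Count extracted rows per service, ignoring case and outer whitespace."""
    return Counter(service.strip().lower() for service, _ in rows)


def filter_by_service(rows, keyword):
    """Keep rows whose service name contains the keyword, ignoring case."""
    keyword = keyword.lower()
    return [row for row in rows if keyword in row[0].lower()]


counts = count_by_service(rows)
ai_rows = filter_by_service(rows, "azure ai")

print(counts)
print(len(ai_rows))
```

Counting rows per service is also a quick way to see how many 'needles' the model pulled out for each category.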