In [11]:
import requests
from IPython.display import display, Markdown
from pathlib import Path

from document_ai_agents.document_parsing_agent import (
    DocumentLayoutParsingState,
    DocumentParsingAgent,
)
from document_ai_agents.document_rag_agent import DocumentRAGAgent, DocumentRAGState

In [2]:
path = "docs.pdf"

In [3]:
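response = requests.get("https://www.seas.upenn.edu/~cis5190/fall2017/lectures/01_introduction.pdf", timeout=30)
response.raise_for_status()  # fail fast rather than saving an error page as a PDF
Path(path).write_bytes(response.content)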

In [4]:
agent1 = DocumentParsingAgent()

2024-11-27 14:15:24.829 | INFO     | document_ai_agents.document_parsing_agent:__init__:47 - Using Gemini model with schema: {'properties': {'layout_items': {'items': {'properties': {'element_type': {'description': 'Type of detected Item. Find Tables, figures and images. Use Text-Block for everything else, be as exhaustive as possible. Return 10 Items at most.', 'enum': ['Table', 'Figure', 'Image', 'Text-block'], 'type': 'string'}, 'summary': {'description': 'A detailed description of the layout Item.', 'type': 'string'}}, 'required': ['element_type', 'summary'], 'type': 'object'}, 'type': 'array'}}, 'type': 'object'}
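
The schema logged above constrains each layout item to an `element_type` drawn from four values plus a free-text `summary`. As a minimal sketch of what a response matching that schema looks like, the helper below checks a parsed payload by hand. The name `validate_layout_response` is hypothetical and not part of `document_ai_agents`, and the 10-item cap is only requested in the field description, so enforcing it here is an assumption.

```python
from __future__ import annotations

# Allowed values copied from the 'enum' in the logged schema.
ALLOWED_TYPES = {"Table", "Figure", "Image", "Text-block"}
# Assumption: the schema only asks for "10 Items at most" in a description,
# it does not enforce it, so this cap is our own choice.
MAX_ITEMS = 10


def validate_layout_response(payload: dict) -> list[dict]:
    """Hypothetical helper: check a parsed JSON payload against the logged schema."""
    # 'layout_items' is not listed as required at the top level, so default to empty.
    items = payload.get("layout_items", [])
    if not isinstance(items, list):
        raise ValueError("layout_items must be a list")
    if len(items) > MAX_ITEMS:
        raise ValueError(f"expected at most {MAX_ITEMS} items, got {len(items)}")
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {i} must be an object")
        # Both fields are marked 'required' and typed as strings in the schema.
        for key in ("element_type", "summary"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"item {i}: '{key}' must be a string")
        if item["element_type"] not in ALLOWED_TYPES:
            raise ValueError(f"item {i}: unknown element_type {item['element_type']!r}")
    return items


# Made-up payload for illustration; not actual model output.
sample = {
    "layout_items": [
        {"element_type": "Figure", "summary": "Line chart with a fitted trend."},
        {"element_type": "Text-block", "summary": "Slide title and bullet points."},
    ]
}
checked = validate_layout_response(sample)
```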


In [5]:
agent2 = DocumentRAGAgent()

2024-11-27 14:15:24.850 | INFO     | chromadb.telemetry.product.posthog:__init__:22 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.


In [6]:
state1 = DocumentLayoutParsingState(
    document_path=path
)

In [7]:
result1 = agent1.graph.invoke(state1)

2024-11-27 14:15:25.037 | INFO     | document_ai_agents.document_utils:extract_images_from_pdf:10 - Extracting images from PDF: docs.pdf
2024-11-27 14:15:25.038 | INFO     | document_ai_agents.document_utils:extract_images_from_pdf:13 - Converting PDF to images using temporary directory: /tmp/tmpc6nigtsm
2024-11-27 14:15:30.066 | INFO     | document_ai_agents.document_utils:extract_images_from_pdf:15 - Extracted 51 images from the PDF.
2024-11-27 14:15:30.905 | INFO     | document_ai_agents.document_parsing_agent:find_layout_items:86 - Processing page 1
2024-11-27 14:15:30.906 | INFO     | document_ai_agents.document_parsing_agent:find_layout_items:86 - Processing page 2
2024-11-27 14:15:30.908 | INFO     | d

In [12]:
state2 = DocumentRAGState(
    question="What is this document about?",
    document_path=path,
    pages_as_base64_jpeg_images=result1["pages_as_base64_jpeg_images"],
    documents=result1["documents"],
)
result2 = agent2.graph.invoke(state2)
display(Markdown(result2["response"]))

2024-11-27 14:18:29.150 | INFO     | document_ai_agents.document_rag_agent:index_documents:55 - Documents for this file are already indexed, exiting this node
2024-11-27 14:18:33.866 | INFO     | document_ai_agents.document_rag_agent:answer_question:73 - Responding to question What is this document about?


This document is about supervised learning, specifically regression. It explains that in regression, a function f(x) is learned to predict a real-valued y given x, using a set of data points (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). A graph illustrating a regression model for September Arctic Sea Ice Extent is provided as an example.
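
To make the regression setup in the response concrete, here is a minimal sketch that learns a linear f(x) from data points (x₁, y₁), ..., (xₙ, yₙ) by ordinary least squares. The choice of a straight-line model, the function names, and the data are assumptions for illustration; the numbers are synthetic and are not the sea ice measurements from the slides.

```python
def fit_line(xs, ys):
    """Fit y ≈ slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and the x-y co-deviation.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the learned function f(x) at a new input."""
    return slope * x + intercept


# Synthetic training data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
prediction = predict(4.0, slope, intercept)
```

With noisy real data the fitted line would not pass through every point; it would minimize the sum of squared vertical distances to them.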

In [13]:
state2 = DocumentRAGState(
    question="Explain automatic speech recognition architecture step by step from the document",
    document_path=path,
    pages_as_base64_jpeg_images=result1["pages_as_base64_jpeg_images"],
    documents=result1["documents"],
)
result2 = agent2.graph.invoke(state2)
display(Markdown(result2["response"]))

2024-11-27 14:18:35.718 | INFO     | document_ai_agents.document_rag_agent:index_documents:55 - Documents for this file are already indexed, exiting this node
2024-11-27 14:18:41.345 | INFO     | document_ai_agents.document_rag_agent:answer_question:73 - Responding to question Explain automatic speech recognition architecture step by step from the document


Here's a step-by-step explanation of automatic speech recognition (ASR) architecture based on the provided document:

1. **Input:** The process begins with the input audio signal.

2. **Feature Extraction:**  The raw audio is processed to extract relevant features.  This step converts the waveform into a representation (like a spectrogram) that highlights the acoustic properties important for speech recognition.  The image shows a spectrogram as the output of this step.

3. **Neural Network:** The extracted features are fed into a neural network.  This is the core of the system, learning complex patterns and relationships within the speech data.  The depth of the network (number of hidden layers) significantly impacts performance, as shown in the table.  Deep learning models have yielded state-of-the-art results.

4. **Decoder:** The neural network's output is then passed to a decoder. The decoder interprets the neural network's predictions and converts them into a sequence of phonemes or words.

5. **Language Model (and Transducer):** The decoder works in conjunction with a language model. This model incorporates linguistic knowledge (grammar, word probabilities) to refine the output, making the ASR system more accurate and robust to errors. The transducer component links the acoustic and linguistic information.

6. **Output:** The final output is a transcription of the spoken audio, representing the system's understanding of what was said.

The document emphasizes the use of Machine Learning (ML) and, specifically, deep learning, to improve the accuracy of the prediction of phone states (basic units of speech sound) from the sound spectrogram during the neural network stage.
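
The stage ordering in the response can be sketched as a chain of functions, each consuming the previous stage's output. Every stage below is a toy stand-in chosen for illustration (frame energy instead of a spectrogram, a threshold instead of a neural network, a vocabulary filter instead of a language model); all names are hypothetical and none of this is taken from the lecture slides.

```python
def extract_features(samples, frame_size=4):
    """Toy stand-in for a spectrogram: mean absolute amplitude per frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]


def acoustic_model(features, threshold=0.5):
    """Placeholder for the neural network: assign one state label per frame."""
    return ["speech" if f >= threshold else "sil" for f in features]


def decode(states):
    """Toy decoder: collapse runs of identical states into single tokens."""
    tokens = []
    for state in states:
        if not tokens or tokens[-1] != state:
            tokens.append(state)
    return tokens


def apply_language_model(tokens, vocabulary):
    """Crude stand-in for a language model: keep only in-vocabulary tokens."""
    return [t for t in tokens if t in vocabulary]


def transcribe(samples):
    """Run the stages in the order listed: features, network, decoder, language model."""
    features = extract_features(samples)
    states = acoustic_model(features)
    tokens = decode(states)
    return " ".join(apply_language_model(tokens, vocabulary={"speech"}))


# Synthetic "audio": silence, a loud burst, then silence again.
audio = [0.0] * 4 + [1.0] * 8 + [0.0] * 4
transcript = transcribe(audio)
```

The point of the sketch is the data flow between stages, not the stages themselves; a real system would swap in a learned model at each step.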
