In [2]:
import pypdf

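def extract_text_from_pdf(pdf_path):
    """Extract the text layer of every page and return it as one string."""
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Fall back to "" so a page without extractable text cannot break the concatenation
        text += (page.extract_text() or "") + "\n"
    return text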

pdf_text = extract_text_from_pdf("attention.pdf")
print("Extracted Text:", pdf_text)


Extracted Text: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutio
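
The extracted text above keeps the hard line breaks of the PDF layout, so sentences are split mid-line. A minimal sketch of a post-processing step follows, assuming blank lines mark paragraph breaks; `normalize_extracted_text` is a hypothetical helper, not part of pypdf. Its de-hyphenation rule is a guess that will also merge genuine compounds that happen to break at a line end, and it will flatten blocks such as the author list into a single line.

```python
import re


def normalize_extracted_text(text: str) -> str:
    """Rejoin hard-wrapped lines from PDF text extraction (hypothetical helper)."""
    # Rejoin words hyphenated across a line break: "convolu-\ntional" -> "convolutional"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse every run of whitespace (including newlines) to one space
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

It could be applied to the result of the cell above as `normalize_extracted_text(pdf_text)` before the text is printed or sent to a model.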

In [20]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import io
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI Chat Model
llm = ChatOpenAI(model="gpt-4o", api_key=openai_api_key)

def extract_images_from_pdf(pdf_path):
    """Extract embedded images from the PDF file and return them as a list of PIL images."""
    extracted_images = []

    # Open the document as a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # Cross-reference number of the image object
                base_image = doc.extract_image(xref)  # Dict with image bytes and metadata
                image_data = base_image["image"]  # Encoded image bytes

                # Convert the image bytes to a PIL Image
                image = Image.open(io.BytesIO(image_data))
                extracted_images.append(image)

    return extracted_images

def extract_text_from_images(images):
    """Extract text from a list of images using OCR."""
    extracted_texts = []
    
    for idx, image in enumerate(images):
        text = pytesseract.image_to_string(image).strip()
        extracted_texts.append(f"Image {idx + 1}:\n{text}\n")
    
    return "\n".join(extracted_texts)

def process_text_with_llm(text):
    """Process extracted text using GPT-4o for summarization or analysis."""
    if not text.strip():
        return "No readable text found in the images."
    
    response = llm.invoke(f"Extract meaning and summarize this text:\n{text}")
    return response.content  # Extract content from AI response

# Step 1: Extract images from PDF
pdf_path = "attention.pdf"
images = extract_images_from_pdf(pdf_path)

# Step 2: Extract text from images
extracted_text = extract_text_from_images(images)
print("\n📝 Extracted Text from Images:\n", extracted_text)

# Step 3: Process text with LLM
if extracted_text.strip():
    llm_output = process_text_with_llm(extracted_text)
    print("\n🔍 LLM Processed Output:\n", llm_output)
else:
    print("\n⚠️ No valid text to process with LLM.")



📝 Extracted Text from Images:
 Image 1:
Output
Probabilities

Add & Norm
Feed
Forward
Add & Norm

Multi- Head
Attention

Add & Norm

Add & Norm

Nx | Gada. Norm
Add & Norm Masked
Multi- Head Multi-Head
Attention Attention
SE a, of

Positional Positional

Encoding @ © © @ Encoding
Input Output

Embedding Embedding

Inputs Outputs
(shifted right)

Image 2:


Image 3:
Linear

YY
Es 2
as
Scaled Dot-Product }
Attention y

4


🔍 LLM Processed Output:
 The text describes various components of a transformer neural network architecture, commonly used in deep learning, particularly in natural language processing tasks. It mentions key elements like multi-head attention, feed-forward components, and positional encoding. The architecture includes processes like adding and normalizing (Add & Norm), and the inputs and outputs are embedded with specific representations. The structure generally involves multiple layers (denoted as Nx) of repeating these components for enhanced processing. The mention
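
The OCR output above contains fragments such as `YY`, `Es 2` and a bare `4` that carry no meaning, and they are passed to the model unchanged. One possible cleanup, sketched under the assumption that real labels are mostly alphabetic, is to drop lines with too few letters. `filter_ocr_lines` and both thresholds are hypothetical and untuned; the heuristic also discards short genuine labels (for example a standalone `Nx`) and still lets through letter-heavy noise such as `SE a, of`.

```python
def filter_ocr_lines(ocr_text: str, min_alpha_ratio: float = 0.6, min_letters: int = 3) -> str:
    """Keep OCR lines that look like words (heuristic; thresholds are assumptions)."""
    kept = []
    for line in ocr_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        non_space = [ch for ch in stripped if not ch.isspace()]
        letters = sum(ch.isalpha() for ch in non_space)
        # Require a minimum number of letters and a mostly alphabetic line
        if letters >= min_letters and letters / len(non_space) >= min_alpha_ratio:
            kept.append(stripped)
    return "\n".join(kept)
```

In the cell above it would sit between the two steps, for example `pytesseract.image_to_string(image)` followed by `filter_ocr_lines(...)` on the returned string.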

In [22]:
import pdfplumber
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI Chat Model
llm = ChatOpenAI(model="gpt-4o", api_key=openai_api_key)

def extract_tables_from_pdf(pdf_path):
    """Extract tables from a PDF and return them as formatted text."""
    extracted_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()
            
            for table_index, table in enumerate(tables, start=1):
                # Join cells with " | " and rows with newlines; empty (None) cells become ""
                rows = [" | ".join(str(cell) if cell else "" for cell in row) for row in table]
                table_text = "\n".join(rows)
                formatted_table = f"Table {table_index} from Page {page_num}:\n{table_text}\n"
                extracted_tables.append(formatted_table)

    return extracted_tables

def process_table_with_llm(table_text):
    """Send extracted tables to an LLM for structured processing."""
    if not table_text.strip():
        return "No valid table data found."
    
    response = llm.invoke(f"Extract useful insights from this table:\n{table_text}")
    return response.content  # Extract content from AI response

# Step 1: Extract tables from PDF
pdf_path = "attention.pdf"
pdf_tables = extract_tables_from_pdf(pdf_path)

# Step 2: Process extracted tables using an LLM
for idx, table in enumerate(pdf_tables):
    print(f"\n📊 Extracted Table {idx + 1}:\n", table)
    llm_output = process_table_with_llm(table)
    print("\n🔍 LLM Processed Output:\n", llm_output)
    print("-" * 50)



📊 Extracted Table 1:
 Table 1 from Page 9:
train
N d d h d d P ϵ
model ff k v drop ls steps
6 512 2048 8 64 64 0.1 0.1 100K
1 512 512
4 128 128
16 32 32
32 16 16
16
32
2
4
8
256 32 32
1024 128 128
1024
4096
0.0
0.2
0.0
0.2
positionalembeddinginsteadofsinusoids
6 1024 4096 16 0.3 300K


🔍 LLM Processed Output:
 The table you've provided seems to contain configurations for different machine learning models, likely used in training neural networks. Let's break down the values and extract some insights from it:

1. **Key Variables**: 
   - **N** seems to represent the number of layers or network depth.
   - **d** might refer to dimensions, likely of model components such as embeddings.
   - **h** could indicate the number of attention heads in a transformer model.
   - **P** might denote some hyperparameter, potentially dropout probability.
   - **ϵ** could be a small constant for numerical stability or another hyperparameter.
   - **model** column might suggest different configurations o
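
The model above has to guess what the columns mean because the table reaches it as loose lines of numbers. When the extractor does return separate cells, rendering the rows as a Markdown table with an explicit header row may give the model more structure to work with. The sketch below assumes the list-of-rows shape returned by `page.extract_tables()` (cells are strings or `None`); `table_to_markdown` is a hypothetical helper. It cannot repair rows that were already collapsed into a single cell, which appears to be what happened in the output above.

```python
def table_to_markdown(table):
    """Render a list of rows (cells are str or None) as a Markdown table."""
    if not table:
        return ""
    width = max(len(row) for row in table)

    def clean(cell):
        # None marks an empty cell; a newline inside a cell would break the row
        return "" if cell is None else str(cell).replace("\n", " ").strip()

    # Pad short rows so every row has the same number of columns
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

The first row is treated as the header, which is an assumption that does not hold for every table; multi-line headers like the one on page 9 would need to be merged first.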

In [26]:
import os
import io
import fitz  # PyMuPDF
import pypdf
import pytesseract
import pdfplumber
from PIL import Image
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI Chat Model
llm = ChatOpenAI(model="gpt-4o", api_key=openai_api_key)

def extract_images_from_pdf(pdf_path):
    """Extract embedded images from a PDF and return them as a list of PIL images."""
    extracted_images = []

    # Open the document as a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # Cross-reference number of the image object
                base_image = doc.extract_image(xref)  # Dict with image bytes and metadata
                image_data = base_image["image"]  # Encoded image bytes

                # Convert the image bytes to a PIL Image (no resizing is applied)
                image = Image.open(io.BytesIO(image_data))
                extracted_images.append(image)

    return extracted_images

def extract_text_from_images(images):
    """Extract text from images using OCR."""
    extracted_texts = []
    
    for idx, image in enumerate(images):
        text = pytesseract.image_to_string(image).strip()
        if text:
            extracted_texts.append(f"Image {idx + 1}:\n{text}\n")
    
    return "\n".join(extracted_texts)

def extract_tables_from_pdf(pdf_path):
    """Extract tables from a PDF and return them as formatted text."""
    extracted_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()
            
            for table_index, table in enumerate(tables, start=1):
                # Join cells with " | " and rows with newlines; empty (None) cells become ""
                rows = [" | ".join(str(cell) if cell else "" for cell in row) for row in table]
                table_text = "\n".join(rows)
                formatted_table = f"Table {table_index} from Page {page_num}:\n{table_text}\n"
                extracted_tables.append(formatted_table)

    return "\n".join(extracted_tables)

def extract_text_from_pdf(pdf_path):
    """Extract the text layer of a PDF. Text inside tables is included; image content is not."""
    reader = pypdf.PdfReader(pdf_path)
    extracted_text = ""
    for page in reader.pages:
        text = page.extract_text()
        if text:
            extracted_text += text + "\n"
    return extracted_text.strip()

def process_text_with_llm(text, stage):
    """Process extracted text with an LLM in sequential stages."""
    if not text.strip():
        return f"No readable text found in the {stage}."

    response = llm.invoke(f"Process the following {stage} data and extract key insights:\n{text}")
    return response.content  # Extract content from AI response

# Step 1: Extract data from PDF
pdf_path = "attention.pdf"

# Process Images First
images = extract_images_from_pdf(pdf_path)
image_text = extract_text_from_images(images)

# Send Image Text to LLM First
if image_text.strip():
    llm_images_output = process_text_with_llm(image_text, "images")
    
else:
    llm_images_output = "\n⚠️ No valid text extracted from images."

# Process Tables Next
tables_text = extract_tables_from_pdf(pdf_path)


# Send Table Text to LLM Next
if tables_text.strip():
    llm_tables_output = process_text_with_llm(tables_text, "tables")
    
else:
    llm_tables_output = "\n⚠️ No valid text extracted from tables."

# Process Regular Text Last
pdf_text = extract_text_from_pdf(pdf_path)

# Combine All Outputs
final_combined_output = f"""
{llm_images_output}
{llm_tables_output}
{pdf_text}
"""

print(final_combined_output)



The images you've provided appear to be related to neural network architectures, possibly from a research paper or educational resource focused on attention mechanisms or transformers. Let me break down the key components and insights from these descriptions:

**Image 1:**
- **Components Described:**
  - **Add & Norm:** These are likely referring to the addition and normalization layers commonly found in transformer models. They involve adding the input to the output of a previous layer and then normalizing the result, usually with layer normalization.
  - **Feed Forward:** This refers to the feed-forward neural network layer that processes the embeddings independently in transformer models.
  - **Multi-Head Attention:** This is a core part of the transformer architecture, which allows the model to focus on different parts of the input sequence simultaneously.
  - **Masked Multi-Head Attention:** In the context of transformer models (like those used in language modeling), this allows 
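
The combined output above runs the image summary, the table summary and the raw document text together with no markers between them. A small sketch of one way to label each part follows; `build_combined_report` and the `=== title ===` delimiter are illustrative choices, not something the notebook defines.

```python
def build_combined_report(sections):
    """Join (title, body) pairs into one report, skipping empty bodies."""
    parts = []
    for title, body in sections:
        body = (body or "").strip()
        if body:
            # Prefix every non-empty section with a visible heading
            parts.append(f"=== {title} ===\n{body}")
    return "\n\n".join(parts)
```

With the notebook's variables this could be called as `build_combined_report([("Image insights", llm_images_output), ("Table insights", llm_tables_output), ("Document text", pdf_text)])`.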