# OpenAI Summarization Service Demo

This notebook demonstrates how to use the `OpenAISummarizationService` with fake input data to test summarization capabilities including:
- Single content summarization
- Long content with map-reduce strategy
- Batch summarization
- Error handling for empty or whitespace-only input

## Setup and Imports

First, load environment variables from `.env` and import the service class.

In [2]:
from dotenv import load_dotenv
load_dotenv(".env")

from backend.app.infrastructure.services.openai_summarization_service import OpenAISummarizationService

## Initialize the Service

Create an instance of the summarization service. Make sure you have your OpenAI API key set in the environment or settings.

In [5]:
# Initialize the service
# Note: API key should be set in your environment variables or settings
service = OpenAISummarizationService(
    model="gpt-4o-mini",
    max_tokens=500,
    temperature=0.4,
    chunk_size=8000
)

print("✓ Summarization service initialized successfully")

✓ Summarization service initialized successfully
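
The cell above assumes the API key is already available. Below is a minimal sketch of a pre-flight check. It assumes the key is exposed through the `OPENAI_API_KEY` environment variable, which is the name the official OpenAI Python client reads by default; whether this service reads that variable or a settings object is not shown in this notebook. `api_key_configured` is a hypothetical helper, not part of the service.

```python
import os
from typing import Mapping


def api_key_configured(
    env: Mapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",  # assumed variable name
) -> bool:
    """Return True if the named variable is present and not blank."""
    return bool(env.get(name, "").strip())
```

Calling `api_key_configured()` right after `load_dotenv(".env")` gives an early, readable failure instead of an authentication error on the first request.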


## Test 1: Short Content Summarization

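Summarize a short article about artificial intelligence. At 7,612 characters, it fits under the `chunk_size` of 8000 configured above, so it should be summarized in a single pass.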

In [6]:
# Fake short content
short_content = """
Should LLMs just treat text content as an image?
Several days ago, DeepSeek released a new OCR paper. OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on1. But there’s a more subtle reason why really good OCR might have deep implications for AI models.
Optical compression
According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself?
This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead.
Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service, an open-source project, and even a benchmark. It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it.
The DeepSeek paper suggests an interesting way2 to use tighter optical compression for long-form text contexts. As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper suggests an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail.
Why would this work?
Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself?
In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text?
The first explanation is that text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens.
Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information. This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens. When you send a LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token)3. So it’s not that surprising that you could do better than text tokens.
Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works. To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily-understandable as image content but require you to “already know the trick” as text content4.
Final thoughts
Of course, AI research is full of ideas that sounds promising but just don’t work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn’t worked well enough to become common practice.
Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text?
You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check if the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc). Alternatively, you could ask the model to output the next word as a token. But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers.
edit: this post got some comments on Hacker News. Some commenters are suspicious that image tokenization could ever be better than text tokenization, while other commenters say they’ve already been supplying text-as-image prompts successfully.
edit: I also remembered a relevant point from my past amateur research into owl call identification. State of the art bird call identifier systems like BirdNet do so by visual processing, not audio processing - they convert the audio into a spectrogram and then run that image through a CNN, instead of directly embedding the audio stream.
-
AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%.
↩ -
See Figure 13.
↩ -
Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea.
↩ -
Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries.
↩
If you liked this post, consider subscribing to email updates about my new posts, or sharing it on Hacker News.
October 21, 2025 │ Tags: ai
"""

# Summarize the content
summary = await service.summarize(short_content)

print("Original length:", len(short_content), "characters")
print("\nSummary:")
print("-" * 80)
print(summary)
print("-" * 80)

Original length: 7612 characters

Summary:
--------------------------------------------------------------------------------
DeepSeek-OCR introduces "contexts optical compression," converting text into images to reduce token usage by up to 20x, maintaining 97% accuracy at 10x compression and 60% at 20x. This method enhances efficiency in processing extensive documents, surpassing traditional OCR models. ([deepseek-ocr.ai](https://www.deepseek-ocr.ai/?utm_source=openai))

The model comprises DeepEncoder, which compresses high-resolution text into images, and DeepSeek3B-MoE-A570M, a decoder that reconstructs text from these images. This approach is particularly effective for complex documents, including those with tables and charts. ([deepseekocr.io](https://deepseekocr.io/?utm_source=openai))

While the concept of treating text as images is innovative, it may not be universally applicable. The effectiveness of this method depends on the specific requirements of the task and the nature of
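
The summary above breaks off mid-sentence. One plausible explanation, which is an assumption rather than something the output confirms, is that generation hit the `max_tokens=500` cap set at initialization. A minimal sketch of a heuristic check follows; `looks_truncated` is a hypothetical helper, not part of the service.

```python
def looks_truncated(summary: str) -> bool:
    """Heuristic: text that does not end in closing punctuation was probably cut off."""
    stripped = summary.rstrip()
    if not stripped:
        return False  # nothing to judge
    # Sentence-ending characters, plus closers that often follow them
    return not stripped.endswith((".", "!", "?", ")", '"', "'"))
```

If `looks_truncated(summary)` returns `True`, raising `max_tokens` or asking for a shorter summary are two options worth trying.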

## Test 2: Long Content with Map-Reduce

Test the map-reduce strategy with a longer article that will be split into chunks.

In [None]:
# Fake long content (simulating a long article)
long_content = """
The Evolution of Web Development: A Comprehensive Overview

Introduction:
Web development has undergone tremendous changes since the early days of the internet. 
What started as simple static HTML pages has evolved into complex, interactive 
applications that power modern digital experiences. This article explores the major 
milestones in web development history and emerging trends shaping its future.

The Early Days (1990s):
In the beginning, websites were primarily static HTML documents with basic styling 
provided by CSS. Developers had limited tools and browsers had inconsistent support 
for web standards. JavaScript was introduced in 1995, enabling basic interactivity, 
but it was primitive compared to today's capabilities. Web design was constrained 
by slow internet connections and limited browser capabilities.

The Rise of Dynamic Web (2000s):
The 2000s saw the emergence of server-side technologies like PHP, ASP, and JSP, 
enabling dynamic content generation. AJAX revolutionized user experience by allowing 
partial page updates without full reloads. Content Management Systems like WordPress 
and Drupal made it easier for non-technical users to create and manage websites. 
Web 2.0 brought user-generated content and social networking to the forefront.

Modern Web Development (2010s):
The 2010s introduced powerful JavaScript frameworks like Angular, React, and Vue.js, 
enabling the creation of Single Page Applications (SPAs). Mobile-first design became 
essential as smartphone usage exploded. Responsive web design techniques ensured 
websites worked seamlessly across different screen sizes. RESTful APIs and later 
GraphQL enabled better separation between frontend and backend systems.

Cloud Computing and DevOps:
Cloud platforms like AWS, Azure, and Google Cloud transformed how applications are 
deployed and scaled. Containerization with Docker and orchestration with Kubernetes 
became standard practices. Continuous Integration and Continuous Deployment (CI/CD) 
pipelines automated the development workflow. Serverless architectures emerged, 
allowing developers to focus on code rather than infrastructure management.

Current Trends and Future Directions:
Progressive Web Apps (PWAs) blur the line between web and native applications, 
offering offline functionality and app-like experiences. WebAssembly enables 
near-native performance for compute-intensive tasks in the browser. JAMstack 
architecture promotes better performance, security, and scalability through 
pre-rendering and CDN distribution. Artificial Intelligence and Machine Learning 
are being integrated into web applications for personalization and automation.

Performance Optimization:
Modern web development emphasizes performance optimization through techniques like 
code splitting, lazy loading, and image optimization. Core Web Vitals have become 
important metrics for measuring user experience. Edge computing brings computation 
closer to users, reducing latency. Build tools like Webpack, Vite, and esbuild 
optimize bundle sizes and improve load times.

Security Considerations:
Web security has become increasingly critical with the rise of cyber threats. 
HTTPS is now standard, with browsers warning users about insecure connections. 
Content Security Policy (CSP) helps prevent XSS attacks. OAuth and JWT provide 
secure authentication mechanisms. Regular security audits and dependency updates 
are essential practices for maintaining secure applications.

The Future of Web Development:
Looking ahead, WebGPU promises to bring high-performance graphics to the browser. 
Web3 technologies are exploring decentralized applications and blockchain integration. 
Voice interfaces and AR/VR experiences are becoming more prevalent on the web. 
The focus on accessibility ensures websites are usable by everyone, regardless of 
abilities. Low-code and no-code platforms are democratizing web development.

Conclusion:
Web development continues to evolve at a rapid pace, driven by technological 
advancements and changing user expectations. Developers must stay current with 
new tools, frameworks, and best practices to build effective web applications. 
The future promises even more exciting innovations that will further transform 
how we create and interact with web content.
""" * 3  # Repeat 3x so the text exceeds the 8000-character chunk size

# Summarize the long content
print(f"Long content length: {len(long_content)} characters")
print(f"Chunk size: {service.chunk_size} characters")
print("\nSummarizing (this may take a moment for map-reduce)...")

long_summary = await service.summarize(long_content)

print("\nSummary:")
print("-" * 80)
print(long_summary)
print("-" * 80)
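
The service's internals are not shown in this notebook, so the following is only a sketch of the general map-reduce pattern under stated assumptions: the text is split into fixed-size character chunks, each chunk is summarized independently (map), and the partial summaries are joined and summarized once more (reduce). `split_into_chunks` and `map_reduce_summarize` are hypothetical names, and `summarize` is any async callable that turns a string into a shorter string.

```python
import asyncio
from typing import Awaitable, Callable

Summarizer = Callable[[str], Awaitable[str]]


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive slices of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


async def map_reduce_summarize(text: str, chunk_size: int, summarize: Summarizer) -> str:
    # Short input: a single pass is enough
    if len(text) <= chunk_size:
        return await summarize(text)
    # Map: summarize every chunk concurrently
    chunks = split_into_chunks(text, chunk_size)
    partials = await asyncio.gather(*(summarize(chunk) for chunk in chunks))
    # Reduce: summarize the combined partial summaries
    return await summarize("\n".join(partials))
```

A production version would likely split on paragraph or sentence boundaries instead of raw character offsets, and repeat the reduce step if the combined partial summaries still exceed the chunk size.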

## Test 3: Batch Summarization

Summarize multiple articles concurrently using batch processing.

In [None]:
# Fake batch content - multiple articles
batch_contents = [
    """
    Quantum computing represents a paradigm shift in computational power. Unlike 
    classical computers that use bits (0s and 1s), quantum computers use qubits 
    that can exist in superposition. This allows them to perform certain calculations 
    exponentially faster than classical computers. Companies like IBM, Google, and 
    Microsoft are racing to build practical quantum computers. Applications include 
    cryptography, drug discovery, and optimization problems.
    """,
    
    """
    Climate change is one of the most pressing challenges facing humanity. Rising 
    global temperatures are causing more frequent extreme weather events, melting 
    ice caps, and rising sea levels. Scientists agree that human activities, 
    particularly greenhouse gas emissions, are the primary cause. Solutions include 
    transitioning to renewable energy, improving energy efficiency, and developing 
    carbon capture technologies. International cooperation is essential to address 
    this global challenge.
    """,
    
    """
    The gig economy has transformed how people work, offering flexibility and 
    independence. Platforms like Uber, Airbnb, and Upwork connect workers with 
    customers directly. While this provides opportunities for supplemental income 
    and entrepreneurship, it also raises concerns about job security, benefits, 
    and worker protections. Policymakers are grappling with how to regulate these 
    new forms of employment while preserving their benefits.
    """,
    
    """
    Space exploration has entered a new era with private companies joining government 
    agencies. SpaceX has successfully launched reusable rockets, dramatically reducing 
    launch costs. NASA's Artemis program aims to return humans to the Moon and 
    eventually reach Mars. Commercial space stations and space tourism are becoming 
    reality. The next frontier includes asteroid mining and establishing permanent 
    human presence beyond Earth.
    """
]

# Batch summarize
print(f"Summarizing {len(batch_contents)} articles concurrently...\n")

batch_summaries = await service.summarize_batch(batch_contents)

# Display results
for i, (original, summary) in enumerate(zip(batch_contents, batch_summaries), 1):
    print(f"Article {i}:")
    print(f"Original length: {len(original)} characters")
    print(f"Summary: {summary}")
    print("-" * 80)
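
How `summarize_batch` schedules its requests is not shown here. One common approach, sketched below as an assumption, is to run the calls with `asyncio.gather` while a semaphore caps how many are in flight at once. `summarize_batch_sketch` and `max_concurrency` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def summarize_batch_sketch(
    contents: list[str],
    summarize: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> list[str]:
    """Summarize all items concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> str:
        async with semaphore:
            return await summarize(text)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(bounded(text) for text in contents)))
```

Because `asyncio.gather` preserves input order, pairing results with their originals via `zip`, as the cell above does, stays correct regardless of which request finishes first.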

## Test 4: Error Handling

Test how the service handles edge cases and errors.

In [None]:
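# Count checks where the expected ValueError was NOT raised
failures = 0

# Test empty content
print("Testing empty content handling...")
try:
    await service.summarize("")
except ValueError as e:
    print(f"✓ Correctly caught error: {e}")
else:
    failures += 1
    print("✗ No ValueError raised for empty content")

# Test whitespace-only content
print("\nTesting whitespace-only content...")
try:
    await service.summarize("   \n\t   ")
except ValueError as e:
    print(f"✓ Correctly caught error: {e}")
else:
    failures += 1
    print("✗ No ValueError raised for whitespace-only content")

# Only report success if every check raised as expected
if failures == 0:
    print("\n✓ Error handling works as expected")
else:
    print(f"\n✗ {failures} check(s) did not raise ValueError")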
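
The checks above expect a `ValueError` for empty and whitespace-only input. The service's own validation is not shown in this notebook; a minimal sketch of a guard with that behavior, using the hypothetical name `validate_content`, could look like this:

```python
def validate_content(content: str) -> str:
    """Return content unchanged, or raise ValueError if there is nothing to summarize."""
    if not isinstance(content, str):
        raise ValueError("content must be a string")
    if not content.strip():
        raise ValueError("content must not be empty or whitespace-only")
    return content
```

Running a guard like this before any API call avoids spending a request on input that cannot produce a meaningful summary.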

## Summary

This notebook demonstrated:
1. ✓ Basic single content summarization
2. ✓ Long content summarization with automatic chunking (map-reduce)
3. ✓ Batch summarization of multiple items concurrently
4. ✓ Error handling for invalid inputs

The `OpenAISummarizationService` handles content of various lengths and automatically switches to a map-reduce strategy when content exceeds the chunk size.