# Text Summarization

In this recipe, we show techniques for improving an LLM’s ability to summarize a long text, ranging from a simple prompt (e.g. `"Summarize this text: {text}..."`) to more complex prompting and chaining techniques. We will use OpenAI’s GPT-4o-mini model (128k input token limit), but you can use any model you’d like to implement these summarization techniques, as long as it has a large context window.

<div class="admonition tip">
<p class="admonition-title">Mirascope Concepts Used</p>
<ul>
<li><a href="../../../learn/prompts/">Prompts</a></li>
<li><a href="../../../learn/calls/">Calls</a></li>
<li><a href="../../../learn/response_models/">Response Models</a></li>
</ul>
</div>


<div class="admonition note">
<p class="admonition-title">Background</p>
<p>
    Large Language Models (LLMs) have revolutionized text summarization by enabling more coherent and contextually aware abstractive summaries. Unlike earlier models that primarily extracted or rearranged existing sentences, LLMs can generate novel text that captures the essence of longer documents while maintaining readability and factual accuracy.

</p>
</div>

## Simple Call

For our examples, we’ll use the [Wikipedia article on Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum), the creator of Python. We will refer to the downloaded article as `wikipedia-guido_van_rossum.html`.

The command below downloads the article to your local machine using `curl`. If you don't have `curl` installed, you can download the article manually from the link above and save it as `wikipedia-guido_van_rossum.html`.

In [1]:
!curl "https://en.wikipedia.org/wiki/Guido_van_Rossum" -o wikipedia-guido_van_rossum.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  166k  100  166k    0     0   273k      0 --:--:-- --:--:-- --:--:--  273k
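
The downloaded file is raw HTML, so any markup in it is sent to the model along with the article text. If you would rather pass plain text, the sketch below shows one way to strip tags using only Python's standard library. `html_to_text` and `_TextExtractor` are hypothetical helpers that are not part of the original recipe, and the extraction is deliberately crude (it keeps navigation and footer text, for example).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML document, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.chunks)
```

With this helper, you could call `text = html_to_text(file.read())` when loading the article, which should reduce the number of input tokens.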


We will be using a simple call as our baseline:

In [3]:
from mirascope.core import openai, prompt_template

with open("wikipedia-guido_van_rossum.html") as file:
    text = file.read()


@openai.call(model="gpt-4o-mini")
@prompt_template(
    """
    Summarize the following text:
    {text}
    """
)
def simple_summarize_text(text: str): ...


print(simple_summarize_text(text))

Guido van Rossum is a Dutch programmer, best known as the creator of the Python programming language. Born on January 31, 1956, in The Hague, he graduated with a master's degree in mathematics and computer science from the University of Amsterdam in 1982. Van Rossum developed Python in December 1989 as a hobby project during the Christmas holidays. He served as the "benevolent dictator for life" (BDFL) of Python until he stepped down in July 2018. He has worked with various organizations, including Centrum Wiskunde & Informatica, Google, Dropbox, and currently Microsoft as a Distinguished Engineer. Van Rossum continues to contribute significantly to Python's development and the wider programming community. He has received several awards for his contributions, including the Award for the Advancement of Free Software and the C&C Prize.


LLMs excel at summarizing shorter texts, but they often struggle with longer documents, failing to capture the overall structure while sometimes including minor, irrelevant details that detract from the summary's coherence and relevance.

One simple improvement is to change the prompt so that the model first writes an outline of the text and then adheres to that outline when writing the summary.

## Simple Call with Outline

This prompt engineering technique is an example of [Chain of Thought](https://www.promptingguide.ai/techniques/cot) (CoT), forcing the model to write out its thinking process. It also involves little work and can be done by modifying the text of the single call. With an outline, the summary is less likely to lose the general structure of the text.


In [4]:
@openai.call(model="gpt-4o-mini")
@prompt_template(
    """
    Summarize the following text by first creating an outline with a nested structure,
    listing all major topics in the text with subpoints for each of the major points.
    The number of subpoints for each topic should correspond to the length and
    importance of the major point within the text. Then create the actual summary using
    the outline.
    {text}
    """
)
def summarize_text_with_outline(text: str): ...


print(summarize_text_with_outline(text))

### Outline

1. **Introduction**
   - Brief Overview
   - Notable Role as Python Creator

2. **Personal Life**
   - Birth and Early Life
   - Education and Academic Achievements
   - Family Life

3. **Career Milestones**
   - Early Work (Centrum Wiskunde & Informatica)
   - Contributions to Python Development
   - Roles at Prominent Companies
     - Google
     - Dropbox
     - Microsoft

4. **Contributions to Python**
   - Development of Python
   - Notable Achievements and Innovations
   - Impact on Programming Community

5. **Awards and Recognition**
   - Honors and Awards Received 
   - Notable Recognitions in the Tech Community

6. **Conclusion**
   - Retirement and Current Status
   - Legacy in Programming

### Summary

Guido van Rossum, born on January 31, 1956, in The Hague, Netherlands, is a prominent computer programmer best known for creating the Python programming language, where he previously served as the "benevolent dictator for life" until stepping down in July 2018. He

By asking for an outline first, we enable the LLM to better adhere to the original article's structure, resulting in a more coherent and representative summary.
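
Because the response contains both the outline and the summary, you may want to separate the two before displaying or storing the result. The sketch below assumes the model labels the parts with Markdown headings named `Outline` and `Summary`, as in the sample output above; that format is not guaranteed, so the hypothetical helper `split_outline_and_summary` falls back to treating the whole response as the summary.

```python
import re


def split_outline_and_summary(response_text: str) -> tuple[str, str]:
    """Split a response into (outline, summary) at a 'Summary' heading.

    Assumes headings like '### Outline' and '### Summary'. If no summary
    heading is found, returns an empty outline and the full text as summary.
    """
    match = re.search(
        r"^#{1,6}\s*Summary\s*$",
        response_text,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    if match is None:
        return "", response_text.strip()

    outline = response_text[: match.start()]
    # Drop a leading 'Outline' heading if the model included one.
    outline = re.sub(
        r"\A\s*#{1,6}\s*Outline\s*\n", "", outline, flags=re.IGNORECASE
    )
    summary = response_text[match.end():]
    return outline.strip(), summary.strip()
```

For example, `outline, summary = split_outline_and_summary(str(summarize_text_with_outline(text)))` would give you the two parts separately when the headings are present.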

For our next iteration, we'll explore segmenting the document by topic, requesting summaries for each section, and then composing a comprehensive summary using both the outline and these individual segment summaries.

## Segment then Summarize

This more comprehensive approach not only ensures that the model adheres to the original text's structure but also naturally produces a summary whose length is proportional to the source document, as we combine summaries from each subtopic.

To apply this technique, we create a `SegmentedSummary` Pydantic `BaseModel` to contain the outline and section summaries. The `summarize_by_section()` call extracts it from the text, and `summarize_text_chaining()` chains that result into a second call that composes the final summary:


In [None]:
from pydantic import BaseModel, Field


class SegmentedSummary(BaseModel):
    outline: str = Field(
        ...,
        description="A high level outline of major sections by topic in the text",
    )
    section_summaries: list[str] = Field(
        ..., description="A list of detailed summaries for each section in the outline"
    )


@openai.call(model="gpt-4o", response_model=SegmentedSummary)
@prompt_template(
    """
    Extract a high level outline and summary for each section of the following text:
    {text}
    """
)
def summarize_by_section(text: str): ...


@openai.call(model="gpt-4o")
@prompt_template(
    """
    The following contains a high level outline of a text along with summaries of a
    text that has been segmented by topic. Create a composite, larger summary by putting
    together the summaries according to the outline.
    Outline:
    {outline}

    Summaries:
    {summaries}
    """
)
def summarize_text_chaining(text: str) -> openai.OpenAIDynamicConfig:
    segmented_summary = summarize_by_section(text)
    return {
        "computed_fields": {
            "outline": segmented_summary.outline,
            "summaries": segmented_summary.section_summaries,
        }
    }


print(summarize_text_chaining(text))
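
In the chained call above, `section_summaries` is passed to the template as a Python list. Assuming the template renders it with the default string conversion, the prompt would contain list syntax such as brackets and quotes. One option is to join the summaries into labeled plain text first. `format_section_summaries` is a hypothetical helper, not part of Mirascope.

```python
def format_section_summaries(section_summaries: list[str]) -> str:
    """Join section summaries into numbered, blank-line-separated paragraphs."""
    # Drop empty entries first so the numbering has no gaps.
    cleaned = [summary.strip() for summary in section_summaries if summary.strip()]
    return "\n\n".join(
        f"Section {index}: {summary}"
        for index, summary in enumerate(cleaned, start=1)
    )
```

To use it, the computed field could be set to `format_section_summaries(segmented_summary.section_summaries)` instead of the raw list.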

<div class="admonition tip">
<p class="admonition-title">Additional Real-World Applications</p>
<ul>
<li><b>Meeting Notes</b>: Transcribe meeting audio with speech-to-text, then summarize the transcript for reference.</li>
<li><b>Education</b>: Create study guides or slides from textbook material using summaries.</li>
<li><b>Productivity</b>: Summarize email chains, Slack threads, and Word documents for your day-to-day work.</li>
</ul>
</div>

When adapting this recipe to your specific use-case, consider the following:

- Refine your prompts to provide clear instructions and relevant context for text summarization.
- Experiment with different model providers and versions to balance quality and speed.
- Provide a feedback loop: use an LLM to evaluate the quality of the summary against a set of criteria, and feed that evaluation back into the prompt for refinement.
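
The feedback loop mentioned above could be structured roughly as follows. This is a minimal sketch, not a definitive implementation: `refine_summary` is a hypothetical helper, and `summarize` and `evaluate` are placeholder callables that you would back with your own LLM calls (for example, a summarization call whose prompt accepts feedback, and an evaluation call that returns a verdict and a critique).

```python
from typing import Callable


def refine_summary(
    text: str,
    summarize: Callable[[str, str], str],
    evaluate: Callable[[str, str], tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Summarize, evaluate, and re-summarize until accepted or out of rounds.

    summarize(text, feedback) returns a summary; feedback is "" on the first try.
    evaluate(text, summary) returns (accepted, feedback_for_next_attempt).
    """
    feedback = ""
    summary = summarize(text, feedback)
    for _ in range(max_rounds):
        accepted, feedback = evaluate(text, summary)
        if accepted:
            break
        # Feed the evaluator's critique into the next summarization attempt.
        summary = summarize(text, feedback)
    return summary
```

Capping the number of rounds with `max_rounds` keeps cost bounded when the evaluator never accepts a summary.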

