# Abstractive Summarization

This notebook demonstrates abstractive summarization with Chain of Density (CoD) prompting, the technique introduced in the paper linked below.

https://arxiv.org/abs/2309.04269

Griffin Adams et al. introduce "Chain of Density" (CoD), an iterative GPT-4 prompting method generating increasingly entity-dense summaries at fixed length. CoD identifies 1-3 missing salient entities, incorporating them over five steps. Tested on 100 CNN/DailyMail articles, human evaluation (by the first four authors) reveals preference for summaries with entity density (entities/token) ~0.15, surpassing vanilla GPT-4 (0.122) and matching human-written (0.151) summaries. Entity density increases from 0.089 (step 1) to 0.167 (step 5), with step 3 (0.148) most preferred (expected step: 3.06). Low inter-annotator agreement (Fleiss' kappa: 0.112) indicates task subjectivity. Abstractiveness (extractive density), fusion (relative ROUGE gain), and lead bias measured; NLTK/Spacy used for tokenization/entity recognition. GPT-4 Likert-scale assessments show informativeness peaks at step 4, while Quality/Coherence decline after steps 2/1. Qualitative analysis reveals coherence/informativeness trade-off, with factual correctness challenging at high densities. CoD outperforms previous entity-based summarization approaches. Study limitations: news-only focus, evaluation subjectivity. 500 annotated/5,000 unannotated CoD summaries open-sourced for further research, enabling density distillation into open-source models.

The summary above is generation 3 from Claude 3.5 Sonnet, which summarized the text of the PDF using the initial prompt `Summarize this PDF.` followed by the refinement prompt `Identify 10 key missing points, and any errors in the summary, and output a new, improved summary. Increase the density of the resulting summary, but do not increase the length.`
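The entity density quoted in the summary above (entities per token) can be illustrated with a minimal sketch. This is not the paper's code: the summary says NLTK and spaCy were used for tokenization and entity recognition, whereas this version assumes the entities are already extracted and uses plain whitespace tokenization. The function name `entity_density` and the choice to count each distinct entity once are assumptions made for illustration.

```python
def entity_density(entities, text):
    """Return distinct entities per whitespace-delimited token.

    Assumption: `entities` is a pre-extracted list of entity strings;
    a real pipeline would obtain it from an NER model.
    """
    tokens = text.split()  # simplification: whitespace tokens, not NLTK tokens
    if not tokens:
        return 0.0
    return len(set(entities)) / len(tokens)


# Hypothetical sample sentence and hand-labelled entities.
sample = "Lone Jack Lake Conservation Area has a 35-acre lake in Jackson County"
sample_entities = ["Lone Jack Lake Conservation Area", "Jackson County"]
density = entity_density(sample_entities, sample)
print(round(density, 3))
```

Under these assumptions, a denser summary is one that names more distinct entities in the same number of tokens, which is the quantity the refinement prompt tries to raise while holding length fixed.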

In [1]:
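import logging

import pandas as pd
from dotenv import load_dotenv

# Set up logging so provider timing messages are shown
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables (such as API keys) from a local .env file
load_dotenv()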

True

### Step 1: Pick your model provider

In [2]:
from biagen.llm import CohereProvider, GroqProvider, AIStudioProvider

llm = AIStudioProvider.from_env()

### Step 2: Load some data

In [7]:
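# Load the conservation-area table; pandas infers gzip compression from the .gz extension
area_df = pd.read_csv('assets/mo_conservation.tsv.gz', sep='\t')

# Select one row by index label and pull out the article text to summarize
data = area_df.loc[432]
article = data['area_info']

# Display the selected row
data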

id                                                         432
area_id                       lone-jack-lake-conservation-area
area_name                     Lone Jack Lake Conservation Area
area_info     Lone Jack Lake Conservation Area Lone Jack La...
Name: 432, dtype: object

In [8]:
summaries = []

In [9]:
prompt_a = f"""Analyze the purpose and main topics of this article. Provide a detailed and accurate analysis spanning all covered topics.

{article}

## Analysis (the most significant topics, with one sentence commentary)
"""

analysis = llm.generate_one(prompt_a, max_tokens=1024, temperature=0.7, stop_sequences=None)

print(analysis)
len(analysis.split())

INFO:root:<function AIStudioProvider.generate at 0x120f2a200>: Exec: 3.0078041553497314s


## Analysis of Lone Jack Lake Conservation Area Article

**Purpose:** This article is a comprehensive guide to Lone Jack Lake Conservation Area, providing information for visitors, hunters, anglers, and anyone interested in the area's natural resources. 

**Main Topics:**

* **Location and History:** The article details the conservation area's location in southeastern Jackson County, Missouri, and its historical significance as a site of the Battle of Lone Jack during the Civil War.
* **Lake and Fishing:** The article describes the 35-acre lake, its fish populations (largemouth bass, bluegill, redear sunfish, and channel catfish), and the available fishing amenities, such as a concrete boat launch and fishing jetties.
* **Wildlife and Habitat:** The article highlights the area's diverse wildlife, including waterfowl, deer, wild turkey, quail, rabbits, and squirrels, and explains the management practices used to maintain habitat for these species.
* **Area Map and Brochure:**  The artic

297

In [10]:
prompt_s = f"""Given the article and the following analysis, provide a detailed, erudite, succinct, and accurate summary.

## Article

```article
{article}
```

## Analysis

{analysis}

## Summary (5 paragraphs)

"""

summary = llm.generate_one(prompt_s, max_tokens=1024, temperature=0.5, stop_sequences=None)

summaries.append(summary)

print(summary)
len(summary.split())

INFO:root:<function AIStudioProvider.generate at 0x120f2a200>: Exec: 2.3711349964141846s


Lone Jack Lake Conservation Area, situated in southeastern Jackson County, Missouri, offers a unique blend of natural beauty and historical significance. Established in 1989, the 295-acre area encompasses a 35-acre lake stocked with popular fish species like largemouth bass, bluegill, redear sunfish, and channel catfish, making it a prime destination for anglers. Visitors can enjoy a concrete boat launch and two fishing jetties, one of which is accessible for individuals with disabilities. 

Beyond the lake, the conservation area boasts diverse habitats, including woodlands, grasslands, croplands, and old fields, which provide year-round sustenance and shelter for a variety of wildlife. The area is home to a thriving population of waterfowl, deer, wild turkey, quail, rabbits, and squirrels, attracting nature enthusiasts and birdwatchers alike. The Missouri Department of Conservation actively manages the area's habitats through controlled burning, haying, and native plant plantings to e

318
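The prompt above asks for five paragraphs, while the cell only reports a word count. A small helper along these lines could check both at once. It is a sketch: the name `summary_stats` is hypothetical, and it assumes paragraphs are separated by blank lines, which may not hold for every model's output.

```python
import re


def summary_stats(text):
    """Count words and blank-line-separated paragraphs in a summary."""
    # Split on one or more blank lines (lines containing only whitespace count as blank).
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {"words": len(text.split()), "paragraphs": len(paragraphs)}


# Hypothetical three-paragraph text for illustration.
stats = summary_stats("a b\n\nc\n \nd")
print(stats)
```

Calling `summary_stats(summary)` after each generation would make it easy to see whether the model followed the requested paragraph count and how the length changes between iterations.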

### Improvement Loop

In [19]:
# Get our last summary

summary = summaries[-1]

In [20]:
# Check for missing information in the summary

prompt_m = f"""Given the article and the current summary, identify missing information or ways to improve the summary.

## Article

```article
{article}
```

## Current Summary

{summary}

## Missing Information (provide 12 entries of novel information not contained in the summary)
"""

missing = llm.generate_one(prompt_m, max_tokens=1024, temperature=0.7, stop_sequences=None)

print(missing)
len(missing.split())

INFO:root:<function AIStudioProvider.generate at 0x120f2a200>: Exec: 3.3386199474334717s


Here are 12 pieces of novel information not contained in the summary, based on the provided article:

1. **Historical Significance:** The summary mentions the Battle of Lone Jack, but it doesn't elaborate on its significance. The article states it was a "pivotal event" in Civil War history.  The summary could expand on this, perhaps mentioning the battle's strategic importance or the impact it had on the local area.

2. **Accessibility:**  The summary mentions that one jetty is accessible for individuals with disabilities, but it doesn't mention if the concrete boat launch is also accessible.  The article states that it is, so the summary could be updated to reflect this.

3. **Cave Pond:** The summary mentions Cave Pond, but it doesn't provide details about its size or location. The article states it's a one-acre pond by the entrance.  

4. **Wildlife Management:** The summary mentions habitat management through controlled burning and haying, but it doesn't mention the use of native p

437

In [21]:
prompt_i = f"""Using the current summary and the identified missing information, identify what information is important and unimportant.

## Article

```article
{article}
```

## Current Summary

{summary}

## Missing Information

{missing}

## Most Important Information
"""

important = llm.generate_one(prompt_i, max_tokens=2048, temperature=0.5, stop_sequences=None)

print(important)
len(important.split())

INFO:root:<function AIStudioProvider.generate at 0x120f2a200>: Exec: 3.26546311378479s


Here's a breakdown of the most important and unimportant information based on your analysis:

**Most Important Information**

* **Location and Access:** The summary accurately conveys the location of the conservation area and its accessibility via Brown Road.  This is crucial for visitors to find the area.
* **Key Features:**  The summary highlights the main attractions, including the 35-acre lake, the fishing jetties (with accessibility information), and Cave Pond. This helps visitors understand what the area offers.
* **Wildlife:** The summary mentions the diversity of wildlife present, including waterfowl, deer, turkey, quail, rabbits, and squirrels. This is a major draw for nature enthusiasts.
* **Hours of Operation and General Regulations:** The summary provides the daily operating hours and mentions that certain activities are allowed 24/7. It also briefly mentions regulations about target shooting and trapping. This helps visitors understand the general rules.
* **Hunting Season

421

In [22]:
prompt_r = f"""Using the current summary and the identified missing information, create an improved summary of the article.

## Article

```article
{article}
```

## Current Summary

{summary}

## Missing Information

{missing}

## Ideas To Consider

{important}

## Improved Summary (reorganized, expanded, and including new information; 6 paragraphs)
"""

resummarized = llm.generate_one(prompt_r, max_tokens=2048, temperature=0.5, stop_sequences=None)

summaries.append(resummarized)

print(resummarized)
len(resummarized.split())

INFO:root:<function AIStudioProvider.generate at 0x120f2a200>: Exec: 3.1773760318756104s


## Improved Summary

Lone Jack Lake Conservation Area, nestled in southeastern Jackson County, Missouri, offers a unique blend of natural beauty, historical significance, and recreational opportunities.  Just one mile northwest of Lone Jack on Brown Road, the area is easily accessible for outdoor enthusiasts.  The conservation area encompasses 295 acres, including a 35-acre lake stocked with largemouth bass, bluegill, redear sunfish, and channel catfish.  This makes it a popular destination for anglers, who can enjoy a concrete boat launch and two fishing jetties, both of which are accessible for individuals with disabilities.  The area also features a one-acre pond called Cave Pond, located by the entrance and stocked with similar fish species, offering another fishing option.

Beyond the lake, Lone Jack Lake Conservation Area boasts diverse habitats, including woodlands, grasslands, croplands, and old fields, providing year-round sustenance and shelter for a variety of wildlife.  The

557
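The three cells of the improvement loop (find missing information, rank it, rewrite the summary) can be wrapped in a single function so that several rounds run in sequence. The following is a sketch under stated assumptions, not part of the original notebook: `improve_summary` and the `generate` callable are hypothetical names, and the templates are abridged versions of the prompts above (the `article` code fence and the entry and paragraph counts are omitted for brevity).

```python
from typing import Callable, List

# Abridged versions of the notebook's prompts (assumption: shortened for the sketch).
MISSING_TMPL = (
    "Given the article and the current summary, identify missing information "
    "or ways to improve the summary.\n\n"
    "## Article\n\n{article}\n\n"
    "## Current Summary\n\n{summary}\n\n"
    "## Missing Information\n"
)
IMPORTANT_TMPL = (
    "Using the current summary and the identified missing information, "
    "identify what information is important and unimportant.\n\n"
    "## Article\n\n{article}\n\n"
    "## Current Summary\n\n{summary}\n\n"
    "## Missing Information\n\n{missing}\n\n"
    "## Most Important Information\n"
)
REVISE_TMPL = (
    "Using the current summary and the identified missing information, "
    "create an improved summary of the article.\n\n"
    "## Article\n\n{article}\n\n"
    "## Current Summary\n\n{summary}\n\n"
    "## Missing Information\n\n{missing}\n\n"
    "## Ideas To Consider\n\n{important}\n\n"
    "## Improved Summary\n"
)


def improve_summary(
    article: str,
    summary: str,
    generate: Callable[[str], str],
    rounds: int = 1,
) -> List[str]:
    """Run `rounds` improvement passes and return every summary, oldest first."""
    history = [summary]
    for _ in range(rounds):
        current = history[-1]
        # Step 1: ask what the current summary leaves out.
        missing = generate(MISSING_TMPL.format(article=article, summary=current))
        # Step 2: ask which of those gaps matter most.
        important = generate(
            IMPORTANT_TMPL.format(article=article, summary=current, missing=missing)
        )
        # Step 3: rewrite the summary using both intermediate outputs.
        revised = generate(
            REVISE_TMPL.format(
                article=article, summary=current, missing=missing, important=important
            )
        )
        history.append(revised)
    return history
```

In this notebook it could be called as `improve_summary(article, summaries[-1], lambda p: llm.generate_one(p, max_tokens=2048, temperature=0.5, stop_sequences=None), rounds=2)`, reusing the `generate_one` arguments from the cells above. Passing the model call in as a plain callable keeps the loop independent of any one provider and makes it easy to test with a stub.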