# Objective

Goal:
Automatically generate concise summaries from long documents.

This notebook:

- Uses BBC news articles
- Runs fully on CPU
- Uses a pretrained Transformer model
- Avoids heavy training

# 🧠 Why Summarization Matters (Business View)
| **Industry** | **Use Case**            |
| ------------ | ----------------------- |
| Finance      | Earnings call summaries |
| Legal        | Contract abstraction    |
| Media        | News highlights         |
| Healthcare   | Clinical notes summary  |


# Imports

In [1]:
import pandas as pd
import torch
from transformers import pipeline




# Load Dataset (Raw Text)

In [2]:
bbc_df = pd.read_csv(
    "../data/bbc-news-data.csv",
    sep="\t",
    engine="python"
)

bbc_df["full_text"] = bbc_df["title"] + ". " + bbc_df["content"]
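
The concatenation above yields a missing value for any row whose `title` or `content` is empty. A minimal sketch of a more defensive version follows, assuming the file has `title` and `content` columns as used above. The helper name `build_full_text` and the two sample rows are made up for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real tab-separated file (illustrative data only).
SAMPLE_TSV = (
    "category\ttitle\tcontent\n"
    "business\tAd sales boost profit\tQuarterly profits jumped.\n"
    "sport\tCup final preview\t\n"
)


def build_full_text(df: pd.DataFrame) -> pd.Series:
    """Join title and content, treating missing values as empty strings."""
    missing = {"title", "content"} - set(df.columns)
    if missing:
        raise KeyError(f"missing expected columns: {sorted(missing)}")
    title = df["title"].fillna("").str.strip()
    content = df["content"].fillna("").str.strip()
    # Without fillna, a NaN in either column would make the whole result NaN.
    return (title + ". " + content).str.strip()


sample_df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
sample_df["full_text"] = build_full_text(sample_df)
print(sample_df["full_text"].tolist())
```

In the notebook, the same helper could be applied as `bbc_df["full_text"] = build_full_text(bbc_df)`.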


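# Choose a CPU-Friendly Model

We use:

t5-small

Why?

- Handles summarization through the "summarize: " task prefix
- Small (about 60 million parameters), far lighter than full BART
- Runs on CPU (slow but stable)

sshleifer/distilbart-cnn-12-6, a distilled BART fine-tuned for summarization, is a heavier alternative; the cell below loads t5-small.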

In [3]:
summarizer = pipeline(
    "text2text-generation",
    model="t5-small",
    device=-1
)

Device set to use cpu


# Handle Long Text

In [4]:
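# Transformer models accept a limited number of input tokens, so long articles are shortened first.

def summarize_text(text, max_input_length=1024):
    # Crude guard: keep only the first max_input_length characters (characters, not tokens).
    # Anything after that point in the article cannot influence the summary.
    text = text[:max_input_length]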

    # Generate summary using the summarizer pipeline
    summary = summarizer(
        "summarize: " + text,
        max_length=130,
        min_length=40,
        do_sample=False
    )
    
    # Return the generated summary text
    return summary[0]["generated_text"]
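
Truncation discards everything after the first 1024 characters. One alternative, sketched here and not used in this notebook, is to split the article into sentence-aligned chunks, summarize each chunk, and join the partial summaries. The names `split_into_chunks`, `summarize_long_text`, and `first_sentence` are hypothetical, and `first_sentence` is only a stand-in so the example runs without downloading a model.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 1024) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than max_chars is hard-cut into pieces.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(
    text: str, summarize_fn: Callable[[str], str], max_chars: int = 1024
) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in split_into_chunks(text, max_chars))


# Stand-in summarizer for demonstration: keeps the first sentence of each chunk.
def first_sentence(chunk: str) -> str:
    return re.split(r"(?<=[.!?])\s+", chunk)[0]


demo_text = "Profits rose sharply. Sales were flat. Costs fell. Shares jumped."
demo_chunks = split_into_chunks(demo_text, max_chars=40)
demo_summary = summarize_long_text(demo_text, first_sentence, max_chars=40)
print(demo_chunks)
print(demo_summary)
```

With the notebook's own function, the call would look like `summarize_long_text(sample_text, summarize_text)`. Each chunk costs one model call, so CPU time grows with article length.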


# Test on a Single Article

In [5]:
sample_text = bbc_df.loc[0, "full_text"]

print("ORIGINAL TEXT:\n", sample_text[:500])
print("\nSUMMARY:\n", summarize_text(sample_text))


ORIGINAL TEXT:
 Ad sales boost Time Warner profit.  Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.  The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.  Time 


Both `max_new_tokens` (=256) and `max_length`(=130) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



SUMMARY:
 quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier . fourth quarter sales rose 2% to $11.1bn from $10.9bn . profits buoyed by one-off gains which offset a profit dip at Warner Bros .


# Apply to Small Subset (CPU-safe)

In [6]:
bbc_df_subset = bbc_df.sample(10, random_state=42)

bbc_df_subset["summary"] = bbc_df_subset["full_text"].apply(summarize_text)

bbc_df_subset[["category", "summary"]]




| index | category      | summary                                            |
| ----- | ------------- | -------------------------------------------------- |
| 414   | business      | house prices dipped slightly in November, the ...  |
| 420   | business      | the London Stock Exchange is planning to annou...  |
| 1644  | sport         | Imanol Harinordoquy has been dropped from the ...  |
| 416   | business      | shares in barclays have risen on Monday follow...  |
| 1232  | politics      | labour and the Conservatives are still telepho...  |
| 1544  | sport         | appoints former coach Glenn Hoddle as the new ...  |
| 1748  | sport         | Daniela Hantuchova now faces serena Williams i...  |
| 1264  | politics      | the comments come a day ahead of a high court ...  |
| 629   | entertainment | partridge worked with marley from 1977 until t...  |
| 1043  | politics      | three councillors in Birmingham caught operati...  |


## Evaluation Strategy

Summarization does not have a single correct output.
Therefore, evaluation was performed qualitatively by reviewing generated summaries for:

- Coherence and readability
- Coverage of key information
- Factual consistency with the source document
- Conciseness

Manual review of this kind is a common complement to automatic metrics such as ROUGE, which were not computed in this notebook.
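
Two of the criteria above can be given rough numeric proxies to help prioritize which summaries to read first. The sketch below is an assumption-laden illustration, not the evaluation performed in this notebook: `compression_ratio` approximates conciseness, and `source_overlap` flags summaries containing many words absent from the source. A low overlap does not prove a factual error, and a high one does not rule one out. Both function names and the example strings are hypothetical.

```python
import re


def _words(text: str) -> list:
    """Lowercase word and number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compression_ratio(source: str, summary: str) -> float:
    """Summary length divided by source length, in words (lower is more concise)."""
    source_words = _words(source)
    return len(_words(summary)) / len(source_words) if source_words else 0.0


def source_overlap(source: str, summary: str) -> float:
    """Fraction of summary words that also occur in the source."""
    summary_words = _words(summary)
    if not summary_words:
        return 0.0
    vocabulary = set(_words(source))
    return sum(word in vocabulary for word in summary_words) / len(summary_words)


example_source = "Quarterly profits at the firm jumped 76% while sales rose 2%."
example_summary = "profits jumped 76% and revenue rose 2%."
print(round(compression_ratio(example_source, example_summary), 2))
print(round(source_overlap(example_source, example_summary), 2))
```

Applied to the subset, these could be computed row by row from the `full_text` and `summary` columns, with the lowest-overlap rows reviewed first.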
