# **ABSTRACTIVE TEXT SUMMARIZATION USING TRANSFORMER**

In [1]:
# Uninstall wandb first so that Trainer does not prompt for a Weights & Biases login during training
!pip uninstall -y wandb






## `Loading the Pretrained BART Model`

In [2]:
!pip install transformers



In [3]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# Load model
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")




## `Loading the Dataset`

In [4]:
from datasets import load_dataset

# Load BillSum dataset
dataset = load_dataset("billsum")

# Show dataset structure
print(dataset)

# Show how many samples
print("\nTraining samples:", len(dataset["train"]))
print("Test samples:", len(dataset["test"]))


DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 18949
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 3269
    })
    ca_test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 1237
    })
})

Training samples: 18949
Test samples: 3269


In [5]:
print("\nNumber of Training Samples:", len(dataset["train"]))
print("Number of Test Samples:", len(dataset["test"]))

print("\nColumn Names:\n")
print(dataset["train"].column_names)

# Show example bill text and summary
print("\nSample Bill Text (first 500 chars):\n")
print(dataset["train"][0]["text"][:500])

print("\nSample Summary:\n")
print(dataset["train"][0]["summary"])


Number of Training Samples: 18949
Number of Test Samples: 3269

Column Names:

['text', 'summary', 'title']

Sample Bill Text (first 500 chars):

SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES 
              TO NONPROFIT ORGANIZATIONS.

    (a) Definitions.--In this section:
            (1) Business entity.--The term ``business entity'' means a 
        firm, corporation, association, partnership, consortium, joint 
        venture, or other form of enterprise.
            (2) Facility.--The term ``facility'' means any real 
        property, including any building, improvement, or appurtenance.
            (3) Gros

Sample Summary:

Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is

## `Testing the Model on a Sample Article`

In [6]:
sample_text = """
The United Nations on Monday announced a new global initiative aimed at reducing carbon emissions by 40% before the year 2035.
During a press conference held in New York, the UN Secretary-General emphasized the importance of immediate climate action,
stating that countries must work together to limit global warming. The initiative includes investments in renewable energy,
restrictions on fossil fuel usage, and support for developing nations that are most vulnerable to climate change.
Environmental experts around the world welcomed the decision, calling it a crucial step toward a sustainable future.
"""

# Encode input text
inputs = tokenizer([sample_text], max_length=1024, return_tensors="pt", truncation=True)

# Generate summary
summary_ids = model.generate(inputs["input_ids"], max_length=60, min_length=20, length_penalty=2.0)

# Decode summary
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))


The United Nations announced a new global initiative aimed at reducing carbon emissions by 40% before the year 2035. The initiative includes investments in renewable energy, restrictions on fossil fuel usage and support for developing nations.


In [7]:
# Pick one sample article from the training set
sample_article = dataset["train"][0]["text"]
sample_summary = dataset["train"][0]["summary"]

print("Original Article (first 700 chars):\n")
print(sample_article[:700])

print("\nReference Summary:\n")
print(sample_summary)


Original Article (first 700 chars):

SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES 
              TO NONPROFIT ORGANIZATIONS.

    (a) Definitions.--In this section:
            (1) Business entity.--The term ``business entity'' means a 
        firm, corporation, association, partnership, consortium, joint 
        venture, or other form of enterprise.
            (2) Facility.--The term ``facility'' means any real 
        property, including any building, improvement, or appurtenance.
            (3) Gross negligence.--The term ``gross negligence'' means 
        voluntary and conscious conduct by a person with knowledge (at 
        the time of the conduct) that the conduct is likely to be 
        h

Reference Summary:

Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business 

## `Fine-Tuning the Model on BillSum dataset`

> Using only samples from BillSum



`Taking a Small Subset (200 samples)`



In [8]:
# Take the first 200 samples from the training set
small_train = dataset["train"].select(range(200))
small_test = dataset["test"].select(range(50))   # small eval set

print("Train subset:", len(small_train))
print("Test subset:", len(small_test))


Train subset: 200
Test subset: 50
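
Taking the first 200 rows is simple, but the subset then depends on how the dataset happens to be ordered. A seeded random sample is one alternative. The sketch below only builds the index list in plain Python; the seed value 42 is an arbitrary assumption, and the commented last line shows how the indices could be applied with `select`.

```python
import random

TRAIN_SIZE = 18949   # number of training rows reported by print(dataset)
SUBSET_SIZE = 200

rng = random.Random(42)  # fixed seed (arbitrary choice) for reproducibility
indices = sorted(rng.sample(range(TRAIN_SIZE), SUBSET_SIZE))

print(len(indices), indices[:5])

# With the dataset loaded, the same indices could be applied as:
# small_train = dataset["train"].select(indices)
```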


`Tokenizing the Dataset`

In [9]:
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large-cnn")

max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )
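    # Tokenize the summaries as targets; text_target replaces the
    # deprecated as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )
    # Replace padding ids with -100 so the loss ignores padded positions
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label]
        for label in labels["input_ids"]
    ]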
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = small_train.map(preprocess_function, batched=True)
tokenized_test = small_test.map(preprocess_function, batched=True)
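
The `-100` replacement in `preprocess_function` matters because PyTorch's cross-entropy loss skips targets equal to `-100` by default, so padded positions do not contribute to the training loss. Below is a minimal pure-Python sketch of that masking step on made-up token ids. It assumes a pad token id of 1 (the value used by BART's tokenizer), and `mask_padding` is a hypothetical helper, not part of `transformers`.

```python
PAD_TOKEN_ID = 1      # assumption: BART's pad token id
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_batch, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so the loss skips them."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in label]
        for label in label_batch
    ]


# Two toy label sequences padded to length 6 (token ids are made up).
batch = [
    [0, 713, 16, 2, 1, 1],
    [0, 250, 2, 1, 1, 1],
]
masked = mask_padding(batch)
print(masked)
```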





`Data Collator`

In [10]:
from transformers import DataCollatorForSeq2Seq

# Pass the model object (not its name) so the collator can build decoder_input_ids from the labels
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)


`Training Arguments`

In [11]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mini-bart-model",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=1,    # only 1 epoch — very fast
    learning_rate=5e-5,
    fp16=True,
    logging_steps=10,
)


`Trainer & Train`



In [12]:
from transformers import BartForConditionalGeneration, Trainer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()




| Step | Training Loss |
|------|---------------|
| 10   | 2.4207        |
| 20   | 2.1121        |
| 30   | 2.0684        |
| 40   | 1.9181        |
| 50   | 2.0929        |




TrainOutput(global_step=50, training_loss=2.1224359130859374, metrics={'train_runtime': 81.6805, 'train_samples_per_second': 2.449, 'train_steps_per_second': 0.612, 'total_flos': 433420920422400.0, 'train_loss': 2.1224359130859374, 'epoch': 1.0})
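
The `global_step=50` in the `TrainOutput` follows from the batch settings: 200 samples at a per-device batch size of 2 give 100 batches, and accumulating gradients over 2 batches gives 50 optimizer updates with an effective batch size of 4. The sketch below reproduces that arithmetic and the reported throughput figures. It assumes a single device, and `optimizer_steps` is an illustrative helper that approximates how the Trainer counts steps.

```python
import math


def optimizer_steps(num_samples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Approximate the effective batch size and number of optimizer updates."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch * epochs


effective_batch, steps = optimizer_steps(200, per_device_batch=2, grad_accum=2)
print("effective batch size:", effective_batch)
print("optimizer steps:", steps)

train_runtime = 81.6805  # seconds, taken from the TrainOutput
print("samples per second:", round(200 / train_runtime, 3))
print("steps per second:", round(steps / train_runtime, 3))
```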

`Save the Lightly Fine-Tuned Model`

In [13]:
trainer.save_model("mini_bart_finetuned")
tokenizer.save_pretrained("mini_bart_finetuned")


('mini_bart_finetuned/tokenizer_config.json',
 'mini_bart_finetuned/special_tokens_map.json',
 'mini_bart_finetuned/vocab.json',
 'mini_bart_finetuned/merges.txt',
 'mini_bart_finetuned/added_tokens.json',
 'mini_bart_finetuned/tokenizer.json')

## `Create the Summarization Function`

> Takes text → returns a summary (run inline here with the pretrained checkpoint; the reusable function is defined in the Gradio section)



In [14]:
from transformers import BartForConditionalGeneration, BartTokenizerFast

# Load pretrained BART for summarization
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Tokenize your sample article
inputs = tokenizer(
    [sample_article],
    max_length=1024,
    truncation=True,
    return_tensors="pt"
)

# Generate summary
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=150,
    min_length=30,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True
)

generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\nPretrained Model Summary:\n")
print(generated_summary)



Pretrained Model Summary:

A business entity shall not be subject to civil liability relating to any injury or death occurring at a facility of the business entity in connection with a use of such facility by a nonprofit organization. Nonprofit organization means any not-for-profit organization organized and operated for public benefit.
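
The notebook compares the generated and reference summaries by eye. One way to put a number on the overlap is a unigram F1 score in the spirit of ROUGE-1. The sketch below is a simplified pure-Python approximation (lowercased whitespace tokens, no stemming), not the official ROUGE implementation. `rouge1_f1` is a hypothetical helper name, and the two input strings are short fragments of the summaries shown earlier.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge1_f1(
    "a business entity shall not be subject to civil liability",
    "shields a business entity from civil liability",
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For reported results, an established package such as `rouge_score` or the Hugging Face `evaluate` library would be the usual choice.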


## `Building the Gradio App Interface`

In [15]:
!pip install gradio transformers



In [16]:
import gradio as gr
from transformers import BartTokenizerFast, BartForConditionalGeneration


In [17]:
# Load pretrained BART
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Summarization function
def summarize_text(article):
    inputs = tokenizer(
        [article],
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )

    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=150,
        min_length=30,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Description shown under the app title
description_text = """
### **What This Summarizer Works Best On**
- News articles
- Long paragraphs
- Reports and factual content
- Well-structured informational text
- Formal writing with clear meaning
"""

# Gradio Interface
interface = gr.Interface(
    fn=summarize_text,
    inputs=gr.Textbox(lines=12, label="Enter text to summarize"),
    outputs=gr.Textbox(lines=6, label="Summary"),
    title="Text Summarizer",
    description=description_text
)

# Launch app
interface.launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2169a1700197d6bcaf.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
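
`summarize_text` passes whatever is typed into the textbox straight to the model, including empty input. A small guard can be layered on top without changing the function itself. The sketch below assumes a hypothetical wrapper named `guarded` and an arbitrary minimum word count, and it uses a stand-in summarizer so that it runs without loading BART.

```python
def guarded(summarize_fn, min_words=30):
    """Wrap a summarizer so blank or very short input gets a message instead."""
    def wrapper(article):
        text = (article or "").strip()
        if not text:
            return "Please enter some text to summarize."
        if len(text.split()) < min_words:
            return "Text is too short to summarize; returning it unchanged:\n" + text
        return summarize_fn(text)
    return wrapper


# Stand-in for summarize_text so the example runs without a model.
def fake_summarizer(text):
    return text.split(".")[0] + "."


safe_summarize = guarded(fake_summarizer, min_words=5)
print(safe_summarize(""))
print(safe_summarize("One two three. Four five six seven."))
```

In the app, `fn=guarded(summarize_text)` could be passed to `gr.Interface` in place of `fn=summarize_text`.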


