# Day 4: Mastering Summarization & Paraphrasing
Welcome to today's intensive session. We're moving beyond basic implementation to truly understand and control the AI models that power TextMorph. You'll see how they handle diverse texts and learn to tune their parameters like a pro.

## Today's Agenda
### Part 1: The Art of Summarization

Detailed explanation of abstractive vs. extractive summarization.

Code examples with technical, business, and creative texts.

An interactive "Summarizer Studio" to experiment with your own text.

In-depth guide to tuning parameters.

### Part 2: The Craft of Paraphrasing

Practical use-cases for paraphrasing.

Code examples with formal, casual, and marketing language.

An interactive "Paraphraser Playground" to rephrase any sentence.



### Section 0: Session Setup
Let's get our environment ready by installing and importing the necessary libraries.

In [None]:
#@title 0.1: Install and Import Libraries
# Install the required libraries quietly.
!pip install transformers sentencepiece --quiet
print("✅ Libraries installed.")

# Import the necessary classes from the transformers library.
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
# 'textwrap' is a great tool for formatting our output nicely.
import textwrap
print("✅ Libraries imported successfully.")

✅ Libraries installed.
✅ Libraries imported successfully.


## Part 1: The Art of Summarization
### Section 1.1: What is Abstractive Summarization?
Before we code, let's understand the magic. Imagine you have a pile of Lego bricks (the words in an article).

Extractive Summarization: This is like picking the most important-looking Lego bricks and presenting them as the summary. It selects key sentences directly from the original text. It is fast and stays faithful to the source wording, but the result can read as disjointed because the selected sentences were not written to stand together.

Abstractive Summarization: This is like looking at the Lego pile, understanding the idea (e.g., "it's a car"), and then building a smaller, new car using your own Lego bricks. Our T5 model works this way: it encodes the text and then generates new sentences to form a summary. The output is usually more fluent, but because the wording is generated rather than copied, it can occasionally state something the source did not say.
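
To make the contrast concrete, here is a minimal sketch of the extractive approach, assuming a simple word-frequency score. The `extractive_summary` helper and its scoring rule are illustrative assumptions, not part of TextMorph or of any library; real extractive systems use stronger scoring methods.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Hypothetical helper: return the highest-scoring sentences verbatim."""
    # Split into sentences on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Count how often each word occurs in the whole text.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # A sentence scores the summed frequency of its words.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Pick the top sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

sample = (
    "Solar power is growing fast. Cats sleep a lot. "
    "Solar power costs keep falling as solar panels improve."
)
print(extractive_summary(sample, num_sentences=1))
```

Every sentence in the output is copied unchanged from the input, which is what separates this approach from the abstractive T5 summaries produced later in this session. Summing frequencies also favors long sentences, one reason this sketch is only a starting point.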

### Section 1.2: Loading Our T5 Summarization Model
We'll use t5-base, the roughly 220-million-parameter version of Google's T5 model. It is small enough to load quickly in Colab while still producing usable summaries, which makes it a reasonable starting point.

In [None]:
#@title 1.2: Load the T5 Model and Tokenizer
# The model name we'll be using.
t5_model_name = 't5-base'

# The Tokenizer is responsible for converting text into a format the model understands.
t5_tokenizer = T5Tokenizer.from_pretrained(t5_model_name)

# The Model is the pre-trained AI that performs the summarization.
t5_model = T5ForConditionalGeneration.from_pretrained(t5_model_name)

print(f"✅ T5 Model ('{t5_model_name}') is loaded and ready!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


✅ T5 Model ('t5-base') is loaded and ready!


### Section 1.3: The Summarizer Function
This is our core function. We're adding a new parameter, no_repeat_ngram_size, to prevent the model from repeating the same phrases.
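
As a rough illustration of what `no_repeat_ngram_size` does, the sketch below computes which next tokens would complete an n-gram that already appears in the generated sequence. The `banned_next_tokens` helper is hypothetical and works on plain word lists for readability; the real model applies the same idea to token IDs during generation.

```python
def banned_next_tokens(generated, ngram_size=3):
    """Hypothetical helper: tokens that would repeat an already-seen n-gram."""
    if len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the next possible n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    # Check every n-gram already present in the sequence.
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            # Emitting this token next would repeat the n-gram, so block it.
            banned.add(ngram[-1])
    return banned

tokens = "the cat sat on the cat".split()
print(banned_next_tokens(tokens, ngram_size=3))  # {'sat'}
```

Here the sequence ends in "the cat", and "the cat sat" has already occurred, so "sat" is blocked as the next token.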

In [None]:
#@title 1.3: Define the Core Summarizer Function
def generate_summary(text, min_len=40, max_len=120, beams=4):
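    """
    Generates an abstractive summary for a given text using the T5 model.

    min_len and max_len are measured in tokens, not words, so a summary can
    stop mid-sentence when it reaches max_len.
    """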
    # T5 models require a "summarize: " prefix to know which task to perform.
    input_text = "summarize: " + text.strip().replace("\n", " ")

    # Tokenize the text, truncating anything beyond 1024 tokens.
    # Note: T5 was pre-trained on 512-token inputs, so very long inputs may summarize less reliably.
    inputs = t5_tokenizer.encode(input_text, return_tensors='pt', max_length=1024, truncation=True)

    # Generate the summary using our specified parameters.
    summary_ids = t5_model.generate(
        inputs,
        max_length=max_len,
        min_length=min_len,
        num_beams=beams,
        no_repeat_ngram_size=3, # Prevents any 3-token sequence from appearing twice.
        length_penalty=2.0,
        early_stopping=True
    )

    # Decode the result back into human-readable text.
    summary = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
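
The `length_penalty=2.0` argument above affects how beam search ranks finished candidates. As a rough sketch, assuming the score is the summed token log-probability divided by the candidate length raised to `length_penalty` (the exact length used can differ between library versions), values above 0 favor longer outputs. The `beam_score` helper and the log-probabilities below are hypothetical and only illustrate the arithmetic.

```python
def beam_score(token_logprobs, length_penalty=2.0):
    """Hypothetical helper: length-normalized score of one finished candidate."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

short_hyp = [-0.5, -0.5]              # 2 tokens, total log-prob -1.0
long_hyp = [-0.5, -0.5, -0.5, -0.5]   # 4 tokens, total log-prob -2.0

for penalty in (0.0, 2.0):
    short_score = beam_score(short_hyp, penalty)
    long_score = beam_score(long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f} -> {winner} wins")
```

With no penalty the shorter candidate wins because it has the higher total log-probability; with a penalty of 2.0 the longer candidate is ranked first.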

### Section 1.4: Code Examples with Diverse Texts
Let's test our summarizer on different styles of writing to see how it performs.

In [None]:
#@title 1.4.1: Example 1 - Technical Text (Quantum Computing)
technical_text = """
Quantum computing is a revolutionary type of computation that harnesses the collective properties of quantum states,
such as superposition and entanglement, to perform calculations. While classical computers use bits that can be either
a 0 or a 1, a quantum computer uses qubits, which can be a 0, a 1, or both at the same time. This fundamental difference
allows quantum computers to solve complex problems that are intractable for even the most powerful classical supercomputers,
with potential applications in cryptography, materials science, and drug discovery.
"""

summary = generate_summary(technical_text, min_len=25, max_len=50)

print("----------- TECHNICAL TEXT -----------")
print(textwrap.fill(technical_text, width=100))
print("\n✨---------- T5 SUMMARY -----------✨")
print(textwrap.fill(summary, width=100))

----------- TECHNICAL TEXT -----------
 Quantum computing is a revolutionary type of computation that harnesses the collective properties
of quantum states, such as superposition and entanglement, to perform calculations. While classical
computers use bits that can be either a 0 or a 1, a quantum computer uses qubits, which can be a 0,
a 1, or both at the same time. This fundamental difference allows quantum computers to solve complex
problems that are intractable for even the most powerful classical supercomputers, with potential
applications in cryptography, materials science, and drug discovery.

✨---------- T5 SUMMARY -----------✨
quantum computing is a revolutionary type of computation that harnesses quantum states . a quantum
computer uses qubits, which can be a 0, a 1, or both at the same time . quantum computers can solve
complex problems


In [None]:
#@title 1.4.2: Example 2 - Business Text (Market Analysis)
business_text = """
The global market for renewable energy is projected to experience robust growth over the next decade. Key drivers include
increasing government incentives for clean energy, declining costs of solar and wind technologies, and growing consumer
awareness regarding climate change. However, challenges remain, such as the intermittency of renewable sources and the
need for significant grid infrastructure upgrades. Companies that can innovate in energy storage solutions and grid
management are best positioned to capitalize on this market trend.
"""

summary = generate_summary(business_text, min_len=30, max_len=90)

print("----------- BUSINESS TEXT -----------")
print(textwrap.fill(business_text, width=100))
print("\n✨---------- T5 SUMMARY -----------✨")
print(textwrap.fill(summary, width=100))

----------- BUSINESS TEXT -----------
 The global market for renewable energy is projected to experience robust growth over the next
decade. Key drivers include increasing government incentives for clean energy, declining costs of
solar and wind technologies, and growing consumer awareness regarding climate change. However,
challenges remain, such as the intermittency of renewable sources and the need for significant grid
infrastructure upgrades. Companies that can innovate in energy storage solutions and grid management
are best positioned to capitalize on this market trend.

✨---------- T5 SUMMARY -----------✨
the global market for renewable energy is projected to experience robust growth over the next decade
. key drivers include increasing government incentives for clean energy . challenges remain, such as
the intermittency of renewable sources and the need for significant grid upgrades .


In [None]:
#@title 1.4.3: Example 3 - Creative Text (Literary Description)
creative_text = """
The ancient library was a labyrinth of shadows and whispered knowledge. Sunlight struggled through a high,
arched window, illuminating motes of dust that danced like tiny sprites in the golden shafts of light.
The air smelled of aging paper, leather, and a faint, sweet hint of vanilla. Every towering bookshelf
was a gateway to another world, each leather-bound volume a silent promise of adventure, history,
or forgotten magic. It was a place where time itself seemed to slow down, holding its breath in reverence
for the stories it contained.
"""

summary = generate_summary(creative_text, min_len=20, max_len=70)

print("----------- CREATIVE TEXT -----------")
print(textwrap.fill(creative_text, width=100))
print("\n✨---------- T5 SUMMARY -----------✨")
print(textwrap.fill(summary, width=100))

----------- CREATIVE TEXT -----------
 The ancient library was a labyrinth of shadows and whispered knowledge. Sunlight struggled through
a high, arched window, illuminating motes of dust that danced like tiny sprites in the golden shafts
of light. The air smelled of aging paper, leather, and a faint, sweet hint of vanilla. Every
towering bookshelf was a gateway to another world, each leather-bound volume a silent promise of
adventure, history, or forgotten magic. It was a place where time itself seemed to slow down,
holding its breath in reverence for the stories it contained.

✨---------- T5 SUMMARY -----------✨
the ancient library was a labyrinth of shadows and whispered knowledge . each leather-bound volume a
silent promise of adventure, history, or forgotten magic . a place where time itself seemed to slow
down in reverence for the stories it contained .


### Section 1.5: The Interactive Summarizer Studio
Now it's your turn! Paste your own text, experiment with the settings, and see how you can craft the perfect summary.

In [None]:
#@title 1.5: Your Interactive Summarizer Studio! ⚡️
#@markdown ### 👈 Paste your text below and tune the parameters!
input_text = 'The James Webb Space Telescope (JWST) is a space telescope designed primarily to conduct infrared astronomy. As the largest optical telescope in space, its high resolution and sensitivity allow it to view objects too old, distant, or faint for the Hubble Space Telescope. This has enabled investigations in many fields of astronomy and cosmology, such as observation of the first stars, the formation of the first galaxies, and detailed atmospheric characterization of potentially habitable exoplanets. The U.S. National Aeronautics and Space Administration (NASA) led JWST\'s development in collaboration with the European Space Agency (ESA) and the Canadian Space Agency (CSA).' #@param {type:"string"}
min_length = 45 #@param {type:"slider", min:10, max:100, step:5}
max_length = 140 #@param {type:"slider", min:50, max:200, step:10}
num_beams = 5 #@param {type:"slider", min:2, max:8, step:1}

# --- Run the summarizer with your settings ---
generated_summary = generate_summary(input_text, min_len=min_length, max_len=max_length, beams=num_beams)

# --- Display the results and analysis ---
original_word_count = len(input_text.split())
summary_word_count = len(generated_summary.split())
reduction = 100 - (summary_word_count / original_word_count * 100)

print("----------- YOUR INPUT TEXT -----------")
print(textwrap.fill(input_text, width=100))
print("\n✨---------- GENERATED SUMMARY -----------✨")
print(textwrap.fill(generated_summary, width=100))
print("\n📊---------- ANALYSIS -----------📊")
print(f"Original Word Count: {original_word_count}")
print(f"Summary Word Count: {summary_word_count}")
print(f"Text Reduction: {reduction:.1f}%")

----------- YOUR INPUT TEXT -----------
The James Webb Space Telescope (JWST) is a space telescope designed primarily to conduct infrared
astronomy. As the largest optical telescope in space, its high resolution and sensitivity allow it
to view objects too old, distant, or faint for the Hubble Space Telescope. This has enabled
investigations in many fields of astronomy and cosmology, such as observation of the first stars,
the formation of the first galaxies, and detailed atmospheric characterization of potentially
habitable exoplanets. The U.S. National Aeronautics and Space Administration (NASA) led JWST's
development in collaboration with the European Space Agency (ESA) and the Canadian Space Agency
(CSA).

✨---------- GENERATED SUMMARY -----------✨
the James Webb Space Telescope (JWST) is the largest optical telescope in space . its high
resolution and sensitivity allow it to view objects too old, distant, or faint . this has enabled
investigations in many fields of astronomy and c

## Part 2: The Craft of Paraphrasing
### Section 2.1: Why is Paraphrasing Useful?
Paraphrasing is more than just changing a few words. It's a powerful tool for:

Improving Clarity: Rephrasing a complex sentence can make it easier to understand.

Content Creation: Generating multiple versions of a marketing headline or social media post.

Avoiding Plagiarism: Expressing someone else's idea in your own unique words (while still citing them!).

Enhancing Writing: Finding more creative or engaging ways to say something.

### Section 2.2: Loading Our PEGASUS Paraphrasing Model
We'll use a PEGASUS model that has been specifically fine-tuned for the task of paraphrasing.

In [None]:
#@title 2.2: Load the PEGASUS Model and Tokenizer
paraphrase_model_name = 'tuner007/pegasus_paraphrase'

pegasus_tokenizer = PegasusTokenizer.from_pretrained(paraphrase_model_name)
pegasus_model = PegasusForConditionalGeneration.from_pretrained(paraphrase_model_name)

print(f"✅ PEGASUS Model ('{paraphrase_model_name}') is loaded and ready!")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ PEGASUS Model ('tuner007/pegasus_paraphrase') is loaded and ready!


### Section 2.3: The Paraphraser Function
This function will take our input sentence and generate a list of high-quality alternatives.
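
The function relies on beam search, controlled by `num_beams` and `num_return_sequences`. The toy sketch below shows the idea under simplifying assumptions: the `NEXT_TOKEN_PROBS` table is a made-up stand-in for a real model, and `beam_search` is a hypothetical helper that omits length penalties and other refinements.

```python
import math

# Hypothetical next-token probabilities standing in for a real model.
NEXT_TOKEN_PROBS = {
    "<s>": {"skills": 0.6, "learning": 0.4},
    "skills": {"matter": 0.55, "help": 0.45},
    "learning": {"matters": 0.9, "helps": 0.1},
    "matter": {"</s>": 1.0},
    "help": {"</s>": 1.0},
    "matters": {"</s>": 1.0},
    "helps": {"</s>": 1.0},
}

def beam_search(num_beams=2, num_return=2, max_steps=5):
    """Toy beam search: keep the num_beams best partial sequences at each step."""
    if num_return > num_beams:
        raise ValueError("num_return must not exceed num_beams")
    beams = [(0.0, ["<s>"])]  # (summed log-probability, tokens so far)
    finished = []
    for _ in range(max_steps):
        # Extend every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the best num_beams candidates; set finished ones aside.
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Drop the start and end markers before returning text.
    return [" ".join(tokens[1:-1]) for _, tokens in finished[:num_return]]

print(beam_search(num_beams=1, num_return=1))  # ['skills matter']
print(beam_search(num_beams=2, num_return=2))  # ['learning matters', 'skills matter']
```

With a single beam the search commits to the most likely first word and misses the better overall sequence; with two beams it finds it. The sketch also shows why the number of returned sequences cannot exceed the number of beams.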



In [None]:
#@title 2.3: Define the Core Paraphraser Function
def generate_paraphrases(text, num_return=5, beams=10):
    """
    Generates multiple high-quality paraphrases for a given text using the PEGASUS model.
    """
    # Tokenize the input text.
    inputs = pegasus_tokenizer.encode(text, return_tensors='pt', truncation=True)

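    # Beam search cannot return more sequences than it has beams,
    # so raise the beam count if fewer beams than results were requested.
    beams = max(beams, num_return)

    # Generate the paraphrases using beam search.
    paraphrase_ids = pegasus_model.generate(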
        inputs,
        max_length=60,
        num_beams=beams,
        num_return_sequences=num_return,
        early_stopping=True
    )

    # Decode the results back into text.
    paraphrases = pegasus_tokenizer.batch_decode(paraphrase_ids, skip_special_tokens=True)
    return paraphrases

### Section 2.4: Code Examples with Diverse Sentences
Let's see how PEGASUS handles different kinds of language.

In [None]:
#@title 2.4.1: Example 1 - Formal / Academic Sentence
formal_sentence = "The empirical data indicates a statistically significant correlation between the two variables."
paraphrases = generate_paraphrases(formal_sentence, num_return=4)

print(f"----------- ORIGINAL FORMAL SENTENCE -----------\n'{formal_sentence}'\n")
print("🤖---------- PEGASUS PARAPHRASES ----------🤖")
for i, p in enumerate(paraphrases):
    print(f"  {i+1}. {p}")

----------- ORIGINAL FORMAL SENTENCE -----------
'The empirical data indicates a statistically significant correlation between the two variables.'

🤖---------- PEGASUS PARAPHRASES ----------🤖
  1. There is a statistically significant correlation between the two variables.
  2. The data shows a statistically significant correlation between the two variables.
  3. The empirical data shows a correlation between the two variables.
  4. The data shows a correlation between the two variables.


In [None]:
#@title 2.4.2: Example 2 - Casual / Idiomatic Sentence
casual_sentence = "To be honest, that new project is a real pain in the neck."
paraphrases = generate_paraphrases(casual_sentence, num_return=4)

print(f"----------- ORIGINAL CASUAL SENTENCE -----------\n'{casual_sentence}'\n")
print("🤖---------- PEGASUS PARAPHRASES ----------🤖")
for i, p in enumerate(paraphrases):
    print(f"  {i+1}. {p}")

----------- ORIGINAL CASUAL SENTENCE -----------
'To be honest, that new project is a real pain in the neck.'

🤖---------- PEGASUS PARAPHRASES ----------🤖
  1. It's a real pain in the neck for that new project.
  2. It is a real pain in the neck to have that new project.
  3. The new project is a real pain in the neck.
  4. It is a real pain in the neck to have a new project.


In [None]:
#@title 2.4.3: Example 3 - Marketing Call-to-Action
marketing_sentence = "Don't miss out on our exclusive offer – shop now to save 50%!"
paraphrases = generate_paraphrases(marketing_sentence, num_return=4)

print(f"----------- ORIGINAL MARKETING SENTENCE -----------\n'{marketing_sentence}'\n")
print("🤖---------- PEGASUS PARAPHRASES ----------🤖")
for i, p in enumerate(paraphrases):
    print(f"  {i+1}. {p}")

----------- ORIGINAL MARKETING SENTENCE -----------
'Don't miss out on our exclusive offer – shop now to save 50%!'

🤖---------- PEGASUS PARAPHRASES ----------🤖
  1. Don't forget to take advantage of our exclusive offer and save 50%.
  2. Shop now to save 50% on our exclusive offer.
  3. Don't forget to shop now to save 50% on our exclusive offer.
  4. Don't forget to take advantage of our exclusive offer and save 50%!
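
Candidates 1 and 4 above differ only in their final punctuation mark, which is common with beam search. One possible way to filter such near-duplicates is sketched below; the `drop_near_duplicates` helper and its normalization rule are assumptions for illustration, not part of the notebook's pipeline.

```python
import re

def drop_near_duplicates(candidates):
    """Hypothetical helper: keep the first of any candidates that match after normalization."""
    seen = set()
    unique = []
    for text in candidates:
        # Lowercase, strip punctuation, and collapse whitespace before comparing.
        key = " ".join(re.sub(r"[^a-z0-9\s]", "", text.lower()).split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

candidates = [
    "Don't forget to take advantage of our exclusive offer and save 50%.",
    "Shop now to save 50% on our exclusive offer.",
    "Don't forget to shop now to save 50% on our exclusive offer.",
    "Don't forget to take advantage of our exclusive offer and save 50%!",
]
for line in drop_near_duplicates(candidates):
    print(line)
```

This keeps the first three candidates and drops the fourth, which matches the first once punctuation is ignored.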


### Section 2.5: The Interactive Paraphraser Playground
Your turn! Enter any sentence and generate creative new ways to phrase it.

In [None]:
#@title 2.5: Your Interactive Paraphraser Playground! ⚡️
#@markdown ### 👈 Type your sentence and choose your settings!
input_sentence = "Learning new skills is essential for career growth." #@param {type:"string"}
num_paraphrases = 5 #@param {type:"slider", min:1, max:10, step:1}
quality_vs_speed_beams = 10 #@param {type:"slider", min:2, max:15, step:1}

# --- Run the paraphraser with your settings ---
generated_paraphrases = generate_paraphrases(input_sentence, num_return=num_paraphrases, beams=quality_vs_speed_beams)

# --- Display the results ---
print(f"----------- ORIGINAL SENTENCE -----------\n'{input_sentence}'\n")
print(f"🤖---------- {len(generated_paraphrases)} GENERATED PARAPHRASES (Quality: {quality_vs_speed_beams}) ----------🤖")
for i, p in enumerate(generated_paraphrases):
    print(f"  {i+1}. {p}")

----------- ORIGINAL SENTENCE -----------
'Learning new skills is essential for career growth.'

🤖---------- 5 GENERATED PARAPHRASES (Quality: 10) ----------🤖
  1. It's important to learn new skills for career growth.
  2. It's important for career growth to learn new skills.
  3. Career growth can be achieved by learning new skills.
  4. Career growth is dependent on learning new skills.
  5. New skills are needed for career growth.
