<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F1_3_RougeSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## ROUGE and Summarization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F1_3_RougeSummarization.ipynb)


## First, let's finish up the group work from last time

Choose **two** of the new *text classification* models your group experimented with last week that you can find the datasets for (https://huggingface.co/datasets if not linked directly).

Split into two subgroups
* each subgroup: evaluate one of the models using the metrics shown last time

**Prepare to debrief:** I will have you present at least one set of results per group.

## Next Tuesday: First Demo Day!

Reminder: you will present a demo to your group on whatever you've done for the first fortnight
* Show off one Applied Exploration that you finished
    - finished outside of class, polished it up, included answers to all requested questions and any other notes of interest
* If you have completed any Creative Synthesis items - show those off
    - if you're doing this, spend less/very-little time on your Applied Exploration demo, but you should still have it for your portfolio
    - check the [syllabus](https://github.com/ericmanley/F23-CS195NLP/blob/main/F0_0_Syllabus.ipynb) for options
* If you didn't do any Creative Synthesis or Applied Exploration, show a Core Practice

We'll talk more about portfolio format next week

## Student Research Groups

Friday, September 8th at 1:00pm in C-S 301

No experience required

Come to learn more about possible research groups in mathematics, computer science, math education, data science, cyber security, and more!


## References

*Two minutes NLP — Learn the ROUGE metric* by examples by Fabio Chiusano: https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499

Google's implementation of rouge_score: https://github.com/google-research/google-research/tree/master/rouge

Hugging Face's wrapper for Google's implementation: https://huggingface.co/spaces/evaluate-metric/rouge

Hugging Face Task Guide on Summarization: https://huggingface.co/docs/transformers/tasks/summarization


## Installing necessary modules

In [1]:
import sys
!{sys.executable} -m pip install transformers datasets evaluate rouge_score

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.2/521.2 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Sequence-to-Sequence Models

NLP models that take one sequence as input and produce another sequence as output are called **Seq2seq**
* summarization
* translation
* conversation

**A Challenge:** unlike classification, there's no way to tell for sure whether the prediction is right!

**Partial Solutions:**
* Qualitative metrics - humans can describe how closely they match
* ROUGE Metrics: statistics that measure similarities between two sequences.



## Getting started with ROUGE

**ROUGE:** Recall-Oriented Understudy for Gisting Evaluation

Suppose we have a **reference** sequence, which is one known possible *correct* sequence
* E.g., a translation or a summarization that a trustworthy human has produced

**Example reference:** "A broody hen sat in a nesting box all day."

**Example machine-generated prediction:** "A hen sat in every nesting box that long sunny day."



In [None]:
import evaluate

rouge = evaluate.load("rouge")

predicted_sentence = "A broody hen sat in a nesting box all day"
reference_sentence = "A hen sat in every nesting box that long sunny day"

rouge.compute(predictions=[predicted_sentence],references=[reference_sentence])

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

{'rouge1': 0.6666666666666666,
 'rouge2': 0.3157894736842105,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}

## Understanding ROUGE-1 and ROUGE-2

These tell you how often words or sequences of words match in the prediction and reference data.

`rouge1` - overlap of individual words (1-grams) between prediction and reference

`rouge2` - overlap of *bigrams* (2-grams, pairs of consecutive words)

Both of these are given in terms of their F1 score. Remember, F1 is a balance of *precision* and *recall*, specifically $$F1 = 2 * (Precision * Recall) / (Precision + Recall)$$

### in this context...

**Precision:** Given all the n-grams in the predictions, how many are also present in the reference?

**Recall:** Given all the n-grams in the reference, how many are also present in the prediction?

### ROUGE-1 example

**Reference:** A broody hen sat in a nesting box all day. (10 words)

**Prediction:** A hen sat in every nesting box that long sunny day. (11 words)

**Overlapping words:** a, hen, sat, in, nesting, box, day (7 words)

**Precision:** of the 11 words in the prediction, 7 of them are also in the reference, so $7/11 \approx 0.64$

**Recall:** of the 10 words in the reference, 7 of them are also present in the prediction (first "a" has match, second doesn't), so $7/10 = 0.7$

**F1 score:** $2*(0.64*0.7)/(0.64+0.7) \approx 0.67$


### ROUGE-2 example

**Reference:** A broody hen sat in a nesting box all day. (9 bigrams)

**Prediction:** A hen sat in every nesting box that long sunny day. (10 bigrams)

**Overlapping bigrams:** (hen sat), (sat in), (nesting box) (3 bigrams)

**Precision:** of the 10 bigrams in the prediction, 3 of them are also in the reference, so $3/10 = 0.3$

**Recall:** of the 9 bigrams in the reference, 3 of them are also present in the prediction, so $3/9 \approx 0.33$

**F1 score:** $2*(0.3*0.33)/(0.3+0.33) \approx 0.31$

## Understanding ROUGE-L and ROUGE-Lsum

`rougeL` - the *longest common subsequence* between the prediction and reference. The subsequence must be in *order* but not nececssarily *consecutive*

**Reference:** **A** broody **hen sat in** a **nesting box** all **day**. (10 words)

**Prediction:** **A hen sat in** every **nesting box** that long sunny **day**. (11 words)

**Longest Common Subsequence:** 7 words

**Precision:** 7 words of 11 in the prediction, 0.64

**Recall:** 7 of 10 words in the reference, 0.7

**F1 score:** $2*(0.64*0.7)/(0.64+0.7) \approx 0.67$

`rougeLsum` - do `rougeL` for each newline/sentence and aggregate the results


## Summarization in Hugging Face

Hugging Face hosts many summarization models. Here's one called BART (https://huggingface.co/facebook/bart-large-cnn) that was trained on CNN/Daily Mail news articles (https://huggingface.co/datasets/cnn_dailymail) which include **reference** summaries written by the authors of the original article.

We'll try it out on a Times-Delphic article I found here: https://timesdelphic.com/2023/09/the-answer-has-little-to-do-with-affirmative-action-over-the-summer-the-supreme-court-ruled-against-the-admissions-programs-of-harvard-university-and-the-university-of-north-carolina-in-an-affirmat/

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-xsum") #could also try google/pegasus-xsum

In [None]:
times_delphic_story = """
How does the Supreme Court ruling on affirmative action affect Drake?
The answer has little to do with affirmative action.
Over the summer, the Supreme Court ruled against the admissions programs of Harvard University and the University of North Carolina in an affirmative action decision. Before the decision, race already wasn’t a factor in Drake University admissions, according to Provost Sue Mattison.
“Affirmative action, with regards to admissions, only impacts those really highly selective institutions that limit the number of incoming students,” Mattison said. “So that doesn’t apply to Drake and most institutions across the country.”
She said schools like Harvard and UNC have enough applicants that they can pick and choose which applicants fill a certain number of spots.
Drake’s admissions team found that the university has “admitted all students who have a 3.0 high school GPA or [higher],” Mattison said. “Even though we’ve asked for a person’s race on the admissions form, it does not have an impact on the admissions decision, and it doesn’t displace anybody.”
Possible effects of the court’s ruling
Mark Kende, director of Drake’s Constitutional Law Center, said the Supreme Court “basically has embraced an idea that it calls colorblindness.”
“If you take their principle of colorblindness and extend it beyond universities, to other places, it could raise some problems,” Kende said. “But we don’t know yet.”
Financial aid programs that prioritize applicants of a particular race over another are more vulnerable after the court’s decision, according to Kende. He said it’s not clear what impact the decision might have on university hiring practices that consider an employee’s race, as well as corporations’ diversity programs.
Following the Supreme Court’s decision, Missouri Attorney General Andrew Bailey said Missouri institutions subject to the U.S. Constitution or Title VI must stop using race-based standards “to make decisions about things like admissions, scholarships, programs and employment.”
The University of Missouri System said that “a small number of our programs and scholarships have used race/ethnicity as a factor for admissions and scholarships,” and that “these practices will be discontinued.”
Drake is taking a different approach in the wake of the affirmative action decision. The university is monitoring maybe about forty to fifty scholarships, according to Ryan Zantingh, Drake’s director of financial aid. This is more in anticipation of a comparable case on financial aid that considers race, rather than a reaction to the affirmative action ruling.
Mattison said she thinks Drake is still trying to determine how the Supreme Court decision will impact Drake’s Crew Scholars program, which is for incoming students of color.
“There are ways that we can ensure that we continue Crew Scholars while still being compliant,” Mattison said.
Donors for some Drake scholarships specified that they wanted to support a student of color or a woman in a STEM field, Mattison said.
“And so we’re still working through what that actually means, and what we have to do to continue to achieve the values that we expect,” Mattison said. “There are ways that we can change the wording of some of the scholarships.”
Like all students, students of color may qualify for scholarships for first-generation students or students with financial need.
“There’s a lot of overlap between students of color and other areas where financial aid is directed,” Zantingh said. “Scholarship resources can be directed [to financial need or first generation status] and still reach the same students.”
Even if there is a ruling on financial aid that’s comparable to the affirmative action decision, Zantingh doesn’t expect a large impact on Drake financial aid from either decision.
“There may be some implications, but I think the overall general effect on students will be little to none,” Zantingh said.
Zantingh gave an example of scholarship language offered by legal counsel. If a scholarship is for only minority students, it might become a scholarship that gives preference to students who demonstrate a commitment to Drake’s vision for diversity on campus.
“If a white student is actively involved in anti-racist leadership here on campus, certainly they would fit that description then, wouldn’t they?” Zantingh said. “Basically, the language would not seek to exclude any particular protected class categorically.”
In some cases, a donor might be unwilling to change the scholarship’s language or be deceased, Zantingh said. If a donor is deceased, a judge might approve changes. He said he doesn’t expect Drake to cut any of the scholarships it is monitoring.
“The scholarship criteria would have to change, or the dollars would have to be repurposed in another way. Per either the donor or a court’s approval,” Zantingh said.
Race can still play a role in college admissions
The Supreme Court left at least one legal path open for race to play a role in college admissions.
When admitting students, universities are allowed to consider “an applicant’s discussion of how race affected his or her life, be it through discrimination, inspiration or otherwise,” Chief Justice John Roberts wrote in the Court’s decision. However, “the student must be treated based on his or her experiences as an individual — not on the basis of race.”
A student’s story can emerge without Drake asking for it, according to Dean of Admissions Joel Johnson.
“Especially if they’ve overcome a lot, or it’s so key to their identity… it’ll come out on its own,” Johnson said. “I don’t know if I could say the Supreme Court protected it. They couldn’t have stopped it, honestly.”
Johnson said that caring about diversity also means intentionally recruiting a diverse group of students. He said students can’t join Drake if they never apply in the first place.
In the wake of the Supreme Court’s decision on affirmative action, The Times-Delphic is publishing a series. Check next week’s paper for an article about legacy admissions and legacy financial aid with a Drake focus.

"""

In [None]:
len(times_delphic_story) #let's check how long this string is

6103

In [None]:
print(summarizer(times_delphic_story[:4000],max_length=100,min_length=50))

[{'summary_text': 'The Supreme Court ruled against the admissions programs of Harvard University and the University of North Carolina in an affirmative action decision. Before the decision, race already wasn’t a factor in Drake University admissions. Financial aid programs that prioritize applicants of a particular race over another are more vulnerable.'}]


### Group Exercise

In this example, I only use the first 4000 characters from the article.

Try using more. Why do you think I did that?

it gives an error there is a limit, a stratergy is to make shorter by doing summarization to the text

What strategies can you think of for getting summaries of longer articles?

## Let's try it on a different summarization dataset

The *BillSum* dataset contains the text of legislative bills and their summaries from both the US Federal and California State legislatures.

See more here: https://huggingface.co/datasets/billsum

This dataset has `train`, `test`, and `ca_test` splits. We can load just one of them - let's try the `ca-test` which is the smaller test set.


In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

## Let's explore the dataset

What does it look like when printed/displayed?

In [None]:
print(billsum)

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})


What does one of the items look like?

In [None]:
billsum[0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) (1) Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These organizations help preserve the memories and incidents of the great hostilities fought by our nation, and preserve and strengthen comradeship among members.\n(2) These veterans’ organizations also own and manage various properties including lodges, posts, and fraternal halls. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. This aids in the healing process for these returning veterans, and ensures their health and happiness.\n(b) As a result of congressional chartering of these veterans’ organizations, the United States Inte

Let's get a summary of the first bill (first 4000 characters of the text only) using the news-article summarizer.

In [None]:
summarizer(billsum[0]["text"][:4000])

[{'summary_text': 'Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. The U.S. Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code.'}]

## Now let's do a batch of 5 articles

First, we need to prepare a list that contains the texts of the first 5 bills, truncated to the first 4000 characters.

In [None]:
truncated_bill_texts = []
for idx in range(5):
    curr_truncated_text = billsum[idx]["text"][:4000]
    truncated_bill_texts.append( curr_truncated_text )

Now let's get a summary of each of those texts. This might take a while.

In [None]:

prediction_summaries = summarizer(truncated_bill_texts)
actual_references = billsum["summary"][0:5]

print(prediction_summaries)
print(actual_references)


[{'summary_text': 'Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. The U.S. Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code.'}, {'summary_text': 'A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder if the victim was a peace officer. A prisoner sentenced to death or life in prison without possibility of parole cannot be granted medical parole.'}, {'summary_text': 'California has long been known as the land of opportunity, the republic of the future. But for too many of its residents the future is receding. Inequality continues to rise, even though California has one of the most progressive tax structures in the nation. Small businesses, like plumbing contractors, auto repair shops, and restaurants that account for over 90

Notice that summarizer returns a list of dictionaries with one key each: `'summary_text'`. If we want to evaluate these with ROUGE, we will need to get a flat list of all these texts - not contained inside a dictionary.

In [None]:
predictions_flat = []

for result in prediction_summaries:
    predictions_flat.append(result["summary_text"])

print(predictions_flat)

['Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. The U.S. Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code.', 'A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder if the victim was a peace officer. A prisoner sentenced to death or life in prison without possibility of parole cannot be granted medical parole.', 'California has long been known as the land of opportunity, the republic of the future. But for too many of its residents the future is receding. Inequality continues to rise, even though California has one of the most progressive tax structures in the nation. Small businesses, like plumbing contractors, auto repair shops, and restaurants that account for over 90 percent of the state’s businesses are a key rung on 

and now let's compute the ROUGE metrics

In [None]:


import evaluate

rouge = evaluate.load("rouge")

rouge.compute(predictions=predictions_flat,references=actual_references)

{'rouge1': 0.17562463851685045,
 'rouge2': 0.0758329573249116,
 'rougeL': 0.12923816949402564,
 'rougeLsum': 0.15101276399573574}

These seem to indicate there isn't a lot of overlap between the reference summaries and the predictions.

Keep in mind:
* the model was trained on a different kind of dataset
* we are only using the first part of each bill

In [None]:
from datasets import load_dataset

# Load CNN/DailyMail dataset
cnn_dailymail = load_dataset("cnn_dailymail", "3.0.0")

# Accessing data instances
for example in cnn_dailymail['train']:
    article = example['article']
    highlights = example['highlights']
    article_id = example['id']

In [2]:
from datasets import load_dataset

# Load SAMSum dataset
samsum = load_dataset("JulesBelveze/tldr_news")

# Accessing data instances
for example in samsum['train']:
    content = example['content']
    category = example['category']
    headline = example['headline']

Downloading builder script:   0%|          | 0.00/3.56k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.50k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.24k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.71M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7138 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/794 [00:00<?, ? examples/s]

In [3]:
samsum['train'][0]

{'headline': 'NASA’s Ingenuity—the First Ever Off-World Helicopter—Is Set for a ‘Wright Brothers Moment’ on Mars ',
 'content': "NASA's Perseverance rover will be carrying a four-pound helicopter in its belly. Named Ingenuity, it will attempt up to five powered flights on Mars. The first flight will replicate test flights previously conducted on Earth. After that, Ingenuity will start testing its limits, eventually flying up to 150 feet away on its final test. Each trip will last about 90 seconds from takeoff to landing, which is the maximum time available due to Ingenuity's battery capacity. Mars' atmosphere is less than 1 percent the density of Earth's atmosphere, so Ingenuity's blades have to spin 10 times faster than helicopters on Earth to create an upward lift. It will take a whole Martian day to recharge between flights.",
 'category': 2}

In [None]:
summarizer(samsum["train"][0]["content"][:1000])


[{'summary_text': "NASA's Perseverance rover will be carrying a four-pound helicopter in its belly. Named Ingenuity, it will attempt up to five powered flights on Mars. The first flight will replicate test flights previously conducted on Earth. Each trip will last about 90 seconds from takeoff to landing."}]

In [None]:
truncated_samsum_text = []
for idx in range(5):
  curr_truncated_text = samsum["train"][idx]["content"][:150]
  truncated_samsum_text.append(curr_truncated_text)

predic_summaries = summarizer(truncated_samsum_text)
actual_ref = samsum['train'][0:5]



print(predic_summaries)
print(actual_ref)






Your max_length is set to 142, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
Your max_length is set to 142, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 142, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)
Your max_length is set to 142, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your

[{'summary_text': "NASA's Perseverance rover will be carrying a four-pound helicopter in its belly. Named Ingenuity, it will attempt up to five powered flights on Mars. It will be the first manned mission to the red planet since the Curiosity rover in 2004. The mission is expected to be completed by the end of 2015."}, {'summary_text': 'Space hotels will soon become a reality, with NASA opening up the International Space Station to tourists in 2020. The opening of the Aurora Station p p will see the first space hotel in space. The first hotel will open in space in 2020, with the first guests expected to arrive in 2024.'}, {'summary_text': "Pomerium is an identity-aware proxy that enables secure access to internal applications. It provides an interface to add access controls. Pomerium gat is available now for $99.99 per GB. For more information, visit the company's website or go to www.pomerium.com."}, {'summary_text': 'This document contains a guide on being an engineering lead. It des

NameError: ignored

In [None]:
import evaluate
predic_flat = []
for result in predic_summaries:
  predic_flat.append(result["summary_text"])
print(predic_flat)

rouge = evaluate.load("rouge")
rouge.compute(predictions=predic_flat, references=actual_ref)

In [6]:
from datasets import load_dataset
from transformers import pipeline
import evaluate

# Load SAMSum dataset
samsum = load_dataset("JulesBelveze/tldr_news")

# Load summarization model
summarizer = pipeline("summarization", model="google/pegasus-xsum")

# Actual references from the dataset
#actual_references = [example['headline'] for example in samsum['train'][:5]]

# Generate summaries for SAMSum dataset
truncated_samsum_text = []
for idx in range(5):
    curr_truncated_text = samsum["train"][idx][:150]
    truncated_samsum_text.append(curr_truncated_text)

predic_summaries = summarizer(truncated_samsum_text)

# Extract generated summaries
predictions_flat = [result["summary_text"] for result in predic_summaries]

# Print generated summaries and actual references
print("Generated Summaries:")
print(predictions_flat)
print("Actual References:")
print(actual_references)

# Compute ROUGE scores
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions_flat, references=actual_references)

# Print ROUGE scores
print("ROUGE Scores for SAMSum Dataset:", rouge_scores)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TypeError: ignored

## Applied Exploration

Go to the Hugging Face models page: https://huggingface.co/models
* Use the same model, but find two different news datasets (https://huggingface.co/datasets), and evaluate them using ROUGE metrics
* For each dataset, record
    - where did it come from?

TLDR News Dataset

CNN/Daily Mail Dataset
    
    - where did the reference summaries come from?
    
    Daily tech newsletter
    CNN and the Daily Mail
    
    - how big is it?
    tech
    Train: 287,113
    CNN
    Not specified
    - how big are the texts? Did you have to truncate them?
    About a paragraph for tech news.

    Varies, but often news articles, potentially longer than TLDR New
    
* Evaluate the performance
    - use the ROUGE metrics
    - describe in your own words how it performed
    TLDR News vs. CNN/Daily Mail

    TLDR News performs slightly lower than CNN/Daily Mail across all ROUGE metrics.
    This difference may be attributed to the nature of the datasets and the summarization model's training data, as different datasets may have distinct characteristics.


    - how did they compare to each other?

    CNN/Daily Mail vs. BillSum (Assumed):

    CNN/Daily Mail performs better than BillSum in terms of ROUGE-1 and ROUGE-L, but slightly worse in terms of ROUGE-2 and ROUGE-Lsum.
    
    - how did they compare to the bills dataset?
    The differences can be attributed to variations in the content, style, and length of news articles versus legislative bills.

    - what do you think is the reason for the difference in performance that you noticed?
    Training The summarization model may have been trained on a diverse set of data, but the distribution might favor one type of content over another.
    
    Truncating content may result in information loss, affecting the model's ability to generate accurate summaries.

In [8]:
from transformers import pipeline
from datasets import load_dataset
import evaluate

# Load summarization model
summarizer = pipeline("summarization", model="google/pegasus-xsum")

# Load CNN/Daily Mail news dataset
cnn_dailymail = load_dataset("cnn_dailymail","3.0.0", split="test[:5]")  # Adjust split and limit for your needs

# Generate summaries for CNN/Daily Mail dataset
truncated_cnn_dailymail_text = [example["article"][:150] for example in cnn_dailymail]
predic_summaries_cnn_dailymail = summarizer(truncated_cnn_dailymail_text)
actual_references_cnn_dailymail = [example["highlights"] for example in cnn_dailymail]

# Load XSum dataset
xsum = load_dataset("xsum", split="test[:5]")  # Adjust split and limit for your needs

# Generate summaries for XSum dataset
truncated_xsum_text = [example["document"][:150] for example in xsum]
predic_summaries_xsum = summarizer(truncated_xsum_text)
actual_references_xsum = [example["summary"] for example in xsum]

# Extract generated summaries
predictions_flat_cnn_dailymail = [result["summary_text"] for result in predic_summaries_cnn_dailymail]
predictions_flat_xsum = [result["summary_text"] for result in predic_summaries_xsum]

# Print generated summaries and actual references for CNN/Daily Mail
print("Generated Summaries - CNN/Daily Mail:")
print(predictions_flat_cnn_dailymail)
print("Actual References - CNN/Daily Mail:")
print(actual_references_cnn_dailymail)

# Print generated summaries and actual references for XSum
print("Generated Summaries - XSum:")
print(predictions_flat_xsum)
print("Actual References - XSum:")
print(actual_references_xsum)

# Compute ROUGE scores for CNN/Daily Mail
rouge_cnn_dailymail = evaluate.load("rouge")
rouge_scores_cnn_dailymail = rouge_cnn_dailymail.compute(predictions=predictions_flat_cnn_dailymail, references=actual_references_cnn_dailymail)
print("ROUGE Scores for CNN/Daily Mail Dataset:", rouge_scores_cnn_dailymail)

# Compute ROUGE scores for XSum
rouge_xsum = evaluate.load("rouge")
rouge_scores_xsum = rouge_xsum.compute(predictions=predictions_flat_xsum, references=actual_references_xsum)
print("ROUGE Scores for XSum Dataset:", rouge_scores_xsum)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Your max_length is set to 64, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 64, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
Your max_length is set to 64, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
Your max_length is set to 64, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max

Downloading builder script:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.00M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

Your max_length is set to 64, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 64, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
Your max_length is set to 64, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
Your max_length is set to 64, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max

Generated Summaries - CNN/Daily Mail:
['The Palestinian Authority has joined the International Criminal Court in The Hague, Netherlands.', 'A dog in the US state of Washington has used up at least three of her own lives after being hit by a car, apparently', "Iran's foreign minister is a man of few words.", 'The World Health Organization has declared the Ebola outbreak in West Africa over.', 'A Duke University student has admitted to hanging a noose made of rope from a tree near a student union, university officials said Thursday.']
Actual References - CNN/Daily Mail:
['Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June .\nIsrael and the United States opposed the move, which could open the door to war crimes investigations against Israelis .', 'Theia, a bully breed mix, was apparently hit by a car, whacked with a hammer and buried in a field .\n"She\'s a true miracle dog and she deserves a good life," says Sara Mellado, who is

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

ROUGE Scores for CNN/Daily Mail Dataset: {'rouge1': 0.24031908572584415, 'rouge2': 0.0707761578044597, 'rougeL': 0.1935753024614101, 'rougeLsum': 0.20766109151591}
ROUGE Scores for XSum Dataset: {'rouge1': 0.2739047619047619, 'rouge2': 0.05735930735930737, 'rougeL': 0.17219047619047617, 'rougeLsum': 0.17219047619047617}


## An Idea for Creative Synthesis

Write some code that lets the user type in a web address (like a Wikipedia article) and generate a summary for the whole page.
* you will have to experiment with different ideas of how to get summaries for longer texts
    - come up with your own ideas
    - research how others handle it and try those
    - you might find that combining more than one kind of model can be helpful

Record your results and discuss it at the demo!