**STARTING OVER**

In [1]:
!pip install sentencepiece
!pip install torch==1.9.0
!pip install transformers==4.11.3
!pip install rouge-score==0.0.4
!pip install datasets==1.14.0
!pip install rouge
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**The code block** contains several pip install commands to install required packages such as sentencepiece, torch, transformers, rouge-score, datasets, rouge, and scikit-learn.

The first line installs sentencepiece, a library for subword tokenization. The second line installs version 1.9.0 of the PyTorch deep learning framework, while the third line installs version 4.11.3 of the transformers library, a popular library for natural language processing tasks such as text summarization. The fourth line installs version 0.0.4 of the rouge-score package, which is used for evaluating the quality of text summaries. The fifth line installs version 1.14.0 of the datasets package, which provides access to a large number of datasets for machine learning tasks. The sixth line installs the rouge package, which is an alternative package for evaluating the quality of text summaries. Finally, the last line installs scikit-learn, a popular machine learning library for Python.
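Because several of these packages are pinned to exact versions, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name check_versions and the EXPECTED mapping are illustrative assumptions, not part of the original notebook.

```python
from importlib import metadata

# Pins copied from the install cell; None means the package was installed without a pin.
EXPECTED = {
    "sentencepiece": None,
    "torch": "1.9.0",
    "transformers": "4.11.3",
    "rouge-score": "0.0.4",
    "datasets": "1.14.0",
    "rouge": None,
    "scikit-learn": None,
}

def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu102" when comparing to the pin.
        base = installed.split("+")[0] if installed else None
        ok = installed is not None and (pinned is None or base == pinned)
        report[name] = (installed, ok)
    return report

for name, (installed, ok) in check_versions(EXPECTED).items():
    print(f"{name:15s} installed={installed} ok={ok}")
```

A package that reports ok=False is either missing or at a different version than the one pinned above.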

In [2]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sklearn.datasets import fetch_20newsgroups
from rouge import Rouge
import plotly.express as px
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import DataLoader
from rouge_score import rouge_scorer

**This code block** imports necessary libraries for the text summarization task. Specifically, the libraries imported are:

1. torch: PyTorch, a popular deep learning library
2. T5Tokenizer and T5ForConditionalGeneration from the transformers library: these classes are used to tokenize input texts and generate summaries using a pre-trained T5 model.
3. fetch_20newsgroups from sklearn.datasets: this is used to load the 20 newsgroups dataset, a collection of newsgroup documents.
4. Rouge from the rouge library: this is used to evaluate the quality of the generated summaries.
5. plotly.express as px: this is used to visualize the results.
6. pandas as pd: this is used to manipulate dataframes to create the visualization.
7. numpy as np: this is used for numerical computing.
8. CountVectorizer from sklearn.feature_extraction.text: this is used to count the frequency of each word in the corpus.
9. DataLoader from torch.utils.data: this is used to create a PyTorch DataLoader object to iterate through the test dataset.
10. rouge_scorer from rouge_score: this is used to initialize the ROUGE scorer for evaluation.




In [3]:
# Load the 20 newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
input_texts = newsgroups_data.data
target_texts = [text.split('\n\n')[0] for text in input_texts]

# Select only 1% of the data
num_samples = int(0.01 * len(input_texts))
input_texts = input_texts[:num_samples]
target_texts = target_texts[:num_samples]

# Tokenize inputs and targets
tokenizer = T5Tokenizer.from_pretrained('t5-small')
input_encodings = tokenizer(input_texts, truncation=True, padding=True)
target_encodings = tokenizer(target_texts, truncation=True, padding=True)

# Convert the dataset to a DataLoader
class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings['input_ids'])

test_dataset = NewsGroupsDataset(input_encodings)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=True)

# Instantiate the model and load pre-trained weights
model = T5ForConditionalGeneration.from_pretrained('t5-small')
model.eval()

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)


**This code block** loads the 20 newsgroups dataset and prepares it for use with a T5 model. The fetch_20newsgroups function from sklearn.datasets retrieves the test subset, with headers, footers, and quotes removed. input_texts holds the full documents, and target_texts holds the text of each document up to its first blank line (two consecutive newline characters). Because the headers have been removed, this is the opening paragraph of the post rather than a title, so the reference "summary" is simply the first paragraph of the input.
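The split on the first blank line can be tried on a small string. This is a minimal sketch with a made-up document; the helper name first_paragraph is an assumption introduced for illustration.

```python
def first_paragraph(text):
    """Return the text before the first blank line, or the whole text if there is none."""
    return text.split("\n\n")[0]

# Made-up document with three paragraphs.
doc = "Line one of the opening paragraph.\nLine two.\n\nSecond paragraph.\n\nThird."

print(repr(first_paragraph(doc)))
# A text that begins with a blank line yields an empty target.
print(repr(first_paragraph("\n\nBody only.")))
```

Texts that start with a blank line produce an empty target, which is worth checking for before scoring.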

Next, only the first 1% of the dataset is kept: num_samples is set to int(0.01 * len(input_texts)), and input_texts and target_texts are sliced to that length. The T5Tokenizer then tokenizes both the inputs and the targets with padding and truncation enabled, producing input_encodings and target_encodings. Only input_encodings is used afterwards; the targets are later compared as plain strings.
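To make the effect of truncation=True and padding=True concrete, here is a simplified sketch of what happens to a batch of token id lists. It assumes made-up token ids and a hypothetical helper pad_and_truncate; the real tokenizer also keeps the end-of-sequence token when truncating, which this sketch does not.

```python
PAD_ID = 0  # T5 uses id 0 for its padding token; the other ids below are made up

def pad_and_truncate(sequences, max_length):
    """Cut each sequence to max_length, then right-pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[5, 6, 7, 8, 9, 1], [3, 4, 1]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 8], [3, 4, 1, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets a model distinguish real tokens from padding, so it is usually passed along with input_ids.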

The NewsGroupsDataset class subclasses torch.utils.data.Dataset so that the encodings can be served by a PyTorch DataLoader. It defines three methods: __init__, which stores the encodings as an attribute; __getitem__, which returns a dictionary of tensors (input_ids and attention_mask) for one example; and __len__, which returns the number of examples.

A test_dataset object is created by instantiating NewsGroupsDataset with input_encodings, and a test_loader object is created by instantiating DataLoader with test_dataset, with batch_size set to 2 and shuffle set to True.

A T5ForConditionalGeneration model is created from the t5-small pretrained model using from_pretrained, and the model.eval() method is called to put the model in evaluation mode.

Finally, a RougeScorer object is instantiated for rouge1 with stemming enabled. It is replaced in In [5] by a scorer that also reports ROUGE-2 and ROUGE-L, which is the one used to evaluate the generated summaries against the target summaries.

In [4]:
# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

test_texts = newsgroups_test.data[:int(len(newsgroups_test.data)*0.01)]
target_summaries = []
generated_summaries = []

# Tokenize inputs and targets
tokenizer = T5Tokenizer.from_pretrained('t5-small')
test_inputs = tokenizer.batch_encode_plus(test_texts, padding=True, truncation=True, return_tensors='pt')

for i, input_ids in enumerate(test_inputs['input_ids']):
    # Decode the input_ids to string
    input_str = tokenizer.decode(input_ids, skip_special_tokens=True)
    input_str = input_str.replace('\n','')
    # Generate a summary
    summary_ids = model.generate(input_ids.unsqueeze(0), num_beams=4, max_length=50, early_stopping=True)
    generated_summary = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
    # Remove any extra whitespace and add to the list
    generated_summary = generated_summary.strip()
    generated_summaries.append(generated_summary)
    
    # Add target summary for scoring
    target_summary = newsgroups_test.data[i].split('\n\n')[0].strip()
    target_summaries.append(target_summary)

# Score the summaries using ROUGE
rouge = Rouge()
scores = rouge.get_scores(generated_summaries, target_summaries, avg=True)

# Print the ROUGE scores
print(f"ROUGE-1: {scores['rouge-1']}")
print(f"ROUGE-2: {scores['rouge-2']}")
print(f"ROUGE-L: {scores['rouge-l']}")


ROUGE-1: {'r': 0.6050164339617042, 'p': 0.7729578400679762, 'f': 0.6663070449450748}
ROUGE-2: {'r': 0.5674891937355944, 'p': 0.7377997800442674, 'f': 0.6283404349271963}
ROUGE-L: {'r': 0.6050164339617042, 'p': 0.7729578400679762, 'f': 0.6663070449450748}


**This code block** performs text summarization on a subset of the 20 newsgroups dataset using the T5 model. The dataset is loaded again with the fetch_20newsgroups function from the sklearn.datasets module, this time without the remove argument, so headers, footers, and quotes are kept. Only the test subset is used.

The first 1% of the test documents is selected as input. For each one, the target summary is the text before the first occurrence of two consecutive newline characters. Since the headers are kept in this cell, that text is normally the header block of the post (lines such as From: and Subject:), and the same header block is also the beginning of the input. This overlap likely contributes to the high ROUGE scores printed above.

A T5 tokenizer is instantiated with the from_pretrained method using the t5-small model. The batch_encode_plus method tokenizes the input texts with padding and truncation, and the resulting encodings are assigned to test_inputs as PyTorch tensors.

A for loop iterates over the rows of test_inputs['input_ids']. Each row is decoded back to a string (input_str), but that string is not used afterwards; the token ids themselves are passed to the generate method of the T5 model with num_beams, max_length, and early_stopping set to 4, 50, and True, respectively. The resulting summary is decoded to a string and appended to a list of generated summaries. The target summary for the input is also extracted and appended to a list of target summaries.

The Rouge class from the rouge package is then instantiated. Its get_scores method computes the ROUGE-1, ROUGE-2, and ROUGE-L scores for the generated summaries against the target summaries, averaged over all pairs (avg=True). The resulting recall (r), precision (p), and F-score (f) values are printed to the console.
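To show what the recall, precision, and F-score values mean, here is a simplified from-scratch sketch of ROUGE-1 based on unigram overlap. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match either ROUGE package exactly; the function name rouge1 and the example sentences are illustrative.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram-overlap recall, precision, and F1 between two strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"r": recall, "p": precision, "f": f1}

# Four of the six words on each side overlap: the, cat, on, mat.
print(rouge1("the cat sat on the mat", "the cat lay on a mat"))
```

Recall is the share of reference words that the candidate recovers, and precision is the share of candidate words that appear in the reference.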

**The next code block**, together with the setup in In [3], evaluates a pre-trained T5 model on the task of summarizing newsgroup posts. The setup loads the test subset of the 20 newsgroups dataset with headers, footers, and quotes removed, and keeps a small percentage (1%) of the data so that the code runs in reasonable time in a standard Google Colaboratory notebook.

The inputs and target texts are tokenized using the T5Tokenizer from the transformers library, with truncation and padding enabled. The input encodings are then wrapped in a DataLoader to facilitate batch processing.

Next, the model is instantiated and its pre-trained weights are loaded. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used as the evaluation metric for the generated summaries.

**In the code block below**, the test DataLoader is converted to a list of batches to make it easier to loop over. For each batch, the first input is decoded to a string, a summary is generated using the model, and the target summary is looked up in target_texts by batch position. ROUGE scores are then calculated with a RougeScorer that is re-created in this block to report ROUGE-1, ROUGE-2, and ROUGE-L. The results are printed for each batch along with the input text, target summary, and generated summary. Because test_loader was created with shuffle=True and batch_size=2, batch position i does not identify the document at index i of target_texts; the sample output below pairs a post about tape drives with a target about Bonneville models. Creating the loader with shuffle=False and batch_size=1 would make target_texts[i] line up with the decoded input.
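The loop in the next cell looks up each reference as target_texts[i], where i is the position of a batch coming out of a loader built with shuffle=True and batch_size=2. One way to keep inputs and references paired regardless of batch order is to carry the dataset index with each example. The following is a minimal sketch with toy data; shuffled_batches is a hypothetical stand-in for a shuffling loader, not the PyTorch DataLoader itself.

```python
import random

# Toy stand-ins for input_texts and target_texts.
texts = ["doc zero", "doc one", "doc two", "doc three"]
targets = ["t0", "t1", "t2", "t3"]

def shuffled_batches(n_items, batch_size, seed):
    """Yield lists of dataset indices in random order, as a shuffling loader would."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    for start in range(0, n_items, batch_size):
        yield order[start:start + batch_size]

pairs = []
for batch_indices in shuffled_batches(len(texts), batch_size=2, seed=0):
    for idx in batch_indices:
        # Look up both sides by the same dataset index, not by batch position.
        pairs.append((texts[idx], targets[idx]))

for text, target in pairs:
    print(text, "->", target)
```

With the index carried along, every input in a batch is scored against its own reference, and no example is skipped.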

In [5]:
# Convert the test DataLoader to a list of inputs
test_inputs = [inputs for inputs in test_loader]

# Loop over the test inputs and generated summaries to calculate ROUGE scores
target_summaries = []
generated_summaries = []
rouge_scores = []
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
for i, input_ids in enumerate(test_inputs):
    # Decode the input_ids to string
    input_str = tokenizer.decode(input_ids['input_ids'][0], skip_special_tokens=True)

    # Generate a summary using the model; the attention mask tells it to ignore padding tokens
    summary_ids = model.generate(input_ids=input_ids['input_ids'], attention_mask=input_ids['attention_mask'], num_beams=4, max_length=50, early_stopping=True)
    summary_str = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Add the target summary and generated summary to the lists
    target_summary = target_texts[i]
    generated_summaries.append(summary_str)
    target_summaries.append(target_summary)

    # Calculate ROUGE scores for the generated summary
    rouge = scorer.score(target_summary, summary_str)
    rouge_scores.append(rouge)
    print(f"Input {i+1}:\n{input_str}\nTarget summary: {target_summary}\nGenerated summary: {summary_str}\nROUGE scores: {rouge}\n")


Input 1:
I said what a SILLY boy i was, now i have zillions of messages like "does that include shipping" "is it scsi" "what rom version is it" "will it work on a maximegalon gargantuabrain 9000" ok, the deal is this - if you live in the twin cities, email me, and set up a time, sure, you can drop round and grab one for a tenner. Else Min order $20 (2 drives) + shipping. No guarantees they are good for any purpose at all (they look newish & clean), no technical negotiations. They are model 525 floppytape, part # 960273-639 revision D. 17 pin floppy style connector on the back Else They go in the bin - life is too short for extended negotiations over $10 items :-) cheers Mike.
Target summary: I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value ca

Visualization Code block 1: Bar chart to visualize top N words in the corpus

**The following code block** creates a bar chart of the top N words by frequency in the corpus. The first step normalizes the whitespace in each text of input_texts, producing a list of strings called corpus. A CountVectorizer object from scikit-learn is then created with English stop words removed, and it counts the frequency of each remaining word across the corpus. The words are sorted by frequency in descending order using a lambda function as the sort key, and the top N are kept. The words and freqs variables contain the top N words and their frequencies, respectively, and a DataFrame is created to store them. Finally, a bar chart is created with the Plotly Express library, where the x-axis shows the words and the y-axis shows their frequencies, with the chart title indicating the top N words in the corpus.
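The counting step can be sketched without scikit-learn to show what is being computed. This version assumes a tiny illustrative stop word list rather than scikit-learn's English list, and mirrors CountVectorizer's defaults of lowercasing and keeping only tokens of two or more word characters; the function name top_n_words and the two-sentence corpus are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop word list, not scikit-learn's English list.
STOP_WORDS = {"the", "is", "of", "and", "to", "in"}

def top_n_words(corpus, n, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for doc in corpus:
        # Lowercase, then keep tokens made of two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        counts.update(tok for tok in tokens if tok not in stop_words)
    return counts.most_common(n)

toy_corpus = ["The drive is a tape drive", "Tape drives and the price of a drive"]
print(top_n_words(toy_corpus, 2))  # [('drive', 3), ('tape', 2)]
```

Note that "drive" and "drives" are counted separately, because no stemming is applied at this stage.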

In [6]:
# Collapse runs of whitespace in each text; corpus remains a list of strings, one per document
corpus = [' '.join(text.split()) for text in input_texts]

# Create a CountVectorizer object to count the frequency of each word in the corpus
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

# Get the top N words by frequency
N = 20
counts = X.sum(axis=0)
word_freq = [(word, counts[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
word_freq_sorted = sorted(word_freq, key=lambda x: x[1], reverse=True)[:N]
words = [x[0] for x in word_freq_sorted]
freqs = [x[1] for x in word_freq_sorted]

# Create a DataFrame to store the top N words and their frequencies
df = pd.DataFrame({'word': words, 'frequency': freqs})

# Create a bar chart
fig = px.bar(df, x='word', y='frequency', title=f'Top {N} Words in the Corpus')
fig.show()


Visualization Code block 2: Scatter plot to visualize the relationship between input length and ROUGE scores

**The following code** creates a scatter plot that shows the relationship between the input length and the ROUGE-1 scores of the generated summaries. It starts by creating a pandas DataFrame that stores the ROUGE-1, ROUGE-2, and ROUGE-L F-measures, as well as the input length for each entry of test_inputs. The scores are extracted from the rouge_scores list created in In [5]. The input length is the number of characters in the string returned by tokenizer.decode for the first input of each batch, so it reflects the text after truncation by the tokenizer.

Next, the DataFrame is plotted as a scatter plot using the px.scatter function from the plotly.express library. The x argument is set to 'input_length' and the y argument is set to 'rouge1' to plot the ROUGE-1 scores against the input length. The title argument is used to set the title of the plot. The resulting plot shows how the ROUGE-1 scores are distributed across the input lengths of the test set.

In [7]:
# Create a DataFrame to store the ROUGE scores and input lengths (in characters)
df = pd.DataFrame({'rouge1': [score['rouge1'].fmeasure for score in rouge_scores],
                   'rouge2': [score['rouge2'].fmeasure for score in rouge_scores],
                   'rougeL': [score['rougeL'].fmeasure for score in rouge_scores],
                   'input_length': [len(tokenizer.decode(input_ids['input_ids'][0], skip_special_tokens=True)) for input_ids in test_inputs]})

# Create a scatter plot
fig = px.scatter(df, x='input_length', y='rouge1', title='ROUGE-1 Scores vs. Input Length')
fig.show()


Visualization Code block 3: Heatmap of Correlation Matrix:

**This final code block** is creating a heatmap to visualize the correlation matrix between ROUGE scores. It starts by creating a Pandas DataFrame df that stores the ROUGE scores for each generated summary. The columns of the DataFrame correspond to the different ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L), and the rows correspond to each generated summary. The ROUGE scores are extracted from the rouge_scores list of dictionaries, which was generated in a previous step.

The corr variable stores the correlation matrix calculated using the corr() function of Pandas. This function returns a correlation matrix that shows the pairwise correlation between each pair of columns in the DataFrame.

Finally, a heatmap is created with the Plotly Express px.imshow() function, which takes the correlation matrix as input and maps its values to a color scale. The scale is the diverging red-blue scale (color_continuous_scale='RdBu'), which in Plotly runs from red at the low end to blue at the high end. Because no color range is set, the scale spans the smallest to the largest correlation in the matrix rather than -1 to 1, so a color cannot be read directly as the sign of a correlation. The heatmap shows how closely the three ROUGE metrics track one another across the generated summaries.
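The values in the correlation matrix are Pearson correlation coefficients, which is the default method of the pandas corr() function. A minimal pure-Python sketch of the same computation follows; the score lists are hypothetical placeholders, not results from this notebook, and the helper names pearson and correlation_matrix are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long lists; NaN if either one is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else float("nan")

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict mapping column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical per-summary F-measures, for illustration only.
toy_scores = {
    "rouge1": [0.2, 0.4, 0.6, 0.5],
    "rouge2": [0.1, 0.2, 0.4, 0.3],
    "rougeL": [0.2, 0.3, 0.6, 0.5],
}
matrix = correlation_matrix(toy_scores)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Each diagonal entry is 1 because every metric correlates perfectly with itself, and the matrix is symmetric.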

In [8]:
# Create a DataFrame to store the ROUGE scores
df = pd.DataFrame({'rouge1': [score['rouge1'].fmeasure for score in rouge_scores],
                   'rouge2': [score['rouge2'].fmeasure for score in rouge_scores],
                   'rougeL': [score['rougeL'].fmeasure for score in rouge_scores]})

# Calculate the correlation matrix
corr = df.corr()

# Create a heatmap
fig = px.imshow(corr, x=['ROUGE-1', 'ROUGE-2', 'ROUGE-L'], y=['ROUGE-1', 'ROUGE-2', 'ROUGE-L'], 
                color_continuous_scale='RdBu', title='Correlation Matrix of ROUGE Scores')
fig.show()
