# **T5**
**T5 (Text-To-Text Transfer Transformer)** is an encoder-decoder language model developed by Google that casts every natural language processing task, including text summarization, as a text-to-text problem. Here are some pros and cons of using T5 for text summarization:



Pros:

* High accuracy: T5 reported state-of-the-art results on many natural language processing benchmarks, including summarization, when it was released, so it is a strong baseline for this task.
* Customizable: T5 allows customization of the text summarization model based on specific requirements and domains, making it highly adaptable to various use cases.
* Multilingual: The T5 architecture can be trained on other languages (the mT5 variant is pre-trained on multilingual data), making it a valuable tool for summarizing text in multiple languages.
* Abstractive summarization: T5 can perform abstractive summarization, which means it can generate summaries by synthesizing new sentences that are not present in the original text, providing more context and nuance.

Cons:

* Resource-intensive: Training T5 for text summarization requires a considerable amount of computational resources, making it difficult to train and deploy for small-scale projects.
* Technical complexity: T5 is a complex model that requires advanced technical knowledge to set up, train, and deploy, making it less accessible to non-experts.
* Limited interpretability: As with other deep learning models, T5's inner workings can be difficult to interpret, making it challenging to understand why the model produces specific summaries.
* Limited scalability: T5's computational requirements and complexity make it challenging to scale up for large-scale text summarization projects.

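These are the ROUGE-1 and BLEU scores we achieved with T5-small on the sample news article, scored against the article text itself:

      ROUGE-1 Score: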
      Precision: 0.913
      Recall: 0.417
      F1-Score: 0.573

      BLEU Score: 0.683


In [None]:
!pip install -U transformers
!pip install sentencepiece
!pip install rouge
!pip install nltk
import json

import nltk
import torch
from rouge import Rouge
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

nltk.download('punkt')

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
device = torch.device('cpu')

text ="""
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
preprocess_text = text.strip().replace("\n","")
t5_prepared_Text = "summarize: "+preprocess_text
print ("original text preprocessed: \n", preprocess_text)

tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)

original text preprocessed: 
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vac
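The preprocessing above deletes newlines with `replace("\n","")`, which can glue the last word of one line to the first word of the next. A minimal sketch of an alternative that collapses every whitespace run to a single space before adding the `summarize: ` prefix is shown below; the helper name `prepare_t5_input` is hypothetical and not part of the notebook.

```python
def prepare_t5_input(text: str, prefix: str = "summarize: ") -> str:
    """Collapse whitespace runs to single spaces and prepend the T5 task prefix."""
    normalized = " ".join(text.split())
    return prefix + normalized


sample = "First line ends here.\nSecond line starts here.\n"

# The notebook's approach joins the two sentences without a space
print(sample.strip().replace("\n", ""))

# The sketch keeps a single space between them
print(prepare_t5_input(sample))
```

The result can be passed to `tokenizer.encode(...)` exactly as `t5_prepared_Text` is in the cell above.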

In [None]:
summary_ids = model.generate(tokenized_text,
                                    num_beams=4,
                                    no_repeat_ngram_size=2,
                                    min_length=30,
                                    max_length=700)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print ("\n\nSummarized text: \n",output)



Summarized text: 
 the move is expected to cover an additional 270 million people. it is a major step towards achieving herd immunity and controlling the spread of the virus in india. india has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the united states.




T5-large-Summarized text:

 the move is expected to cover an additional 270 million people. it is a major step towards achieving herd immunity and controlling the spread of the virus in india. india has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the united states.



T5-small-Summarized text:

 the move is expected to cover an additional 270 million people. decision was taken after a meeting of the national expert group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of vaccination program. the nvc suggested private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-january, starting with healthcare and frontline workers. since then, over 13 million doses have been administered across the country.
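Both summaries above reuse a lot of wording from the source article. One rough way to quantify how abstractive a summary is would be the share of its n-grams that never occur in the source. The sketch below is illustrative only: the function names are hypothetical, and it uses lowercased whitespace tokenization rather than the T5 tokenizer.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that do not appear anywhere in the source."""
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


source = "the cat sat on the mat while the dog slept"
copied = "the cat sat on the mat"        # purely extractive
rewritten = "a cat rested on a rug"      # no bigram shared with the source

print(novel_ngram_ratio(source, copied))
print(novel_ngram_ratio(source, rewritten))
```

A ratio near 0 suggests the summary mostly copies spans of the source; a ratio near 1 suggests heavy rephrasing. Calling `novel_ngram_ratio(text, output)` would apply it to the notebook's article and summary.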


In [None]:
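rouge = Rouge()
# get_scores(hypothesis, reference): the full source article serves as the
# reference here, because the notebook has no separate human-written summary
scores = rouge.get_scores(output, text)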
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 0.925
Recall: 0.245
F1-Score: 0.387


T5-large-ROUGE Score:
Precision: 0.925
Recall: 0.245
F1-Score: 0.387

T5-small-ROUGE Score:
Precision: 0.913
Recall: 0.417
F1-Score: 0.573
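Because the full article is passed as the reference, precision is high (most summary words occur in the article) while recall is low (the summary covers only part of it). The sketch below shows the standard ROUGE-1 arithmetic with clipped unigram counts. It is a simplified illustration under whitespace tokenization; the `rouge` package may differ in details such as tokenization and how repeated words are counted, so the numbers will not match it exactly.

```python
from collections import Counter


def rouge1(hypothesis: str, reference: str) -> dict:
    """ROUGE-1 precision, recall and F1 from clipped unigram overlap."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"p": precision, "r": recall, "f": f1}


reference = "the vaccination drive will be expanded to people over sixty"
hypothesis = "the vaccination drive will be expanded"

print(rouge1(hypothesis, reference))
```

Here every hypothesis word appears in the reference, so precision is 1.0, while only 6 of the 10 reference words are covered, so recall is 0.6.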

In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]

    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

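# Every sentence of the source article is treated as a separate reference,
# and the generated summary is the single hypothesis
reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = output
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

# sentence_bleu(references, hypothesis) expects a list of token lists and a token list
score = sentence_bleu(reference_summary, predicted_summary)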
print(score)

0.7765680128156733


T5-large: 0.7765680128156733

T5-small: 0.6831686514342962
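`sentence_bleu` combines modified n-gram precisions (for 1-grams to 4-grams by default) with a brevity penalty. The sketch below illustrates only the modified precision step with multiple references, which is the part that rewards a summary for copying spans from any source sentence. The function name is hypothetical and this is not NLTK's implementation.

```python
from collections import Counter


def modified_precision(references, hypothesis, n=1):
    """Clipped n-gram precision of a token list against several reference token lists."""

    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp_counts = ngram_counts(hypothesis)
    if not hyp_counts:
        return 0.0

    # For each n-gram, the largest count seen in any single reference
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngram_counts(reference).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Each hypothesis n-gram is credited at most as often as it occurs in a reference
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()

print(modified_precision(references, degenerate, n=1))
```

The degenerate hypothesis repeats "the" seven times, but the word occurs at most twice in a single reference, so the clipped precision is 2/7 rather than 7/7.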

In [None]:
print("BLEU Score T5-large: {:.3f}".format(score))

BLEU Score T5-large: 0.777


# **GRADIO Interface**

In [None]:
!pip install gradio


In [None]:
import gradio as gr

def summarizeText(text):
    preprocess_text = text.strip().replace("\n","")
    tokenized_text = tokenizer.encode("summarize: "+preprocess_text, return_tensors="pt").to(device)
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length = 50, max_length=200)

    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output

demo = gr.Interface(fn = summarizeText, inputs=["text"], outputs=["text"])

demo.launch(share = True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://284788a9ea4c583420.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




# **Creation of articles dataset**

In [None]:
# df = pd.read_csv('/content/medium.csv')

In [None]:
# df

Unnamed: 0,Title,Text
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...
...,...,...
1386,Brain: A Mystery,“The most beautiful experience we can have is ...
1387,Machine Learning: Lincoln Was Ahead of His Time,Photo by Jp Valery on Unsplash\n\nIn the 45th ...
1388,AI and Us — an Opera Experience. In my previou...,EKHO COLLECTIVE: OPERA BEYOND SERIES\n\nIn my ...
1389,Digital Skills as a Service (DSaaS),Have you ever thought about what will be in th...


In [None]:
# df_articles = df.Text

In [None]:
# def len_word(word):
#   return len(word.split())

# # Apply the function to the 'Text' column and create a new column called 'len_word'
# df['len_word'] = df['Text'].apply(len_word)

# # Sort the dataframe by the 'len_word' column in ascending order
# df = df.sort_values('len_word', ascending=True)

# # Keep only the article text and its word count
# df_articles = df[['Text', 'len_word']]

In [None]:
# df_articles = df_articles[df_articles['len_word'] > 500]

In [None]:
# df_articles = df_articles[df_articles['len_word'] < 1200]

In [None]:
# df_articles

Unnamed: 0,Text,len_word
993,BigQuery ML is surprisingly simple. The first ...,503
436,Support Vector Machines (SVMs)\n\nA Brief Over...,504
973,An intro to Natural Language Processing\n\nUnd...,505
1269,Exploring SG’s Rentals with an Interactive Web...,510
595,Photo by NASA on Unsplash\n\nData visualizatio...,511
...,...,...
802,Backpropagation algorithm is probably the most...,1188
615,Analyzing the Twitter Profile of India’s Newly...,1189
1059,"This is my first article on medium. Here, I’ll...",1190
1025,"Cross-validation partitions a dataset, trains ...",1191


# Summarizing 10 articles using ***`T5-Large`*** and ***`T5-Small`*** models

In [None]:
import pandas as pd
import time

In [None]:
results = pd.read_excel('/content/Summary_Results.xlsx')

In [None]:
start = time.time()
print(results['Article'])
end = time.time()
total = end - start
print('{:0.3f}'.format(total), 's')

0    In recent times we have been seeing an influx ...
1    Linear regression is arguably one of the most ...
2    Security is a growing concern among businesses...
3    Innocent Interpretations for Some Suspicious S...
4    What is Machine Learning? There are millions o...
5    10 Reads for Data Scientists Getting Started w...
6    Facebook’s latest Semi-Weak Supervised Learn...
7    This is my first article on medium. Here, I’...
8    Augmenting categorical datasets with synthetic...
9    What is Blockchain Getting started with Bitcoi...
Name: Article, dtype: object
0.008 s


### **T5-Large**
Generating summaries (32 to 100 words) of 10 articles ranging in length from 526 to 1183 words

In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
device = torch.device('cpu')



In [None]:
def t5_large(text):
    start_time = time.time()
    preprocess_text = text.strip().replace("\n","")
    tokenized_text = tokenizer.encode("summarize: " + preprocess_text, return_tensors="pt").to(device)
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length = 50, max_length = 200)

    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    end_time = time.time()
    time_taken = '{:0.3f}'.format(end_time - start_time)+' s'
    return output, time_taken

In [None]:
summary_list = []
for i in range(len(results.Article)):
    summary_list.append(t5_large(results['Article'][i]))

In [None]:
summary_list_time, summary_list_sum = [], []
for i in range(len(summary_list)):
    summary_list_time.append(summary_list[i][1])
    summary_list_sum.append(summary_list[i][0])
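Model-specific wrappers such as `t5_large` return the elapsed time as a formatted string, which then has to be unpacked in a second loop. A minimal sketch of a reusable alternative is shown below: it takes any summarizer callable and returns summaries and float timings as two parallel lists. The names `timed_summaries` and `first_sentence` are hypothetical, and `first_sentence` is only a stand-in so the example runs without downloading a model.

```python
import time
from typing import Callable, List, Tuple


def timed_summaries(summarize: Callable[[str], str],
                    articles: List[str]) -> Tuple[List[str], List[float]]:
    """Run `summarize` on each article and record the wall-clock seconds per call."""
    summaries, seconds = [], []
    for article in articles:
        start = time.perf_counter()
        summaries.append(summarize(article))
        seconds.append(time.perf_counter() - start)
    return summaries, seconds


def first_sentence(text: str) -> str:
    """Stand-in summarizer: return the first sentence of the text."""
    return text.split(".")[0].strip() + "."


summaries, seconds = timed_summaries(first_sentence, ["One. Two.", "Alpha beta. Gamma."])
print(summaries)
print(["{:0.3f} s".format(s) for s in seconds])
```

Keeping the timings as floats makes later averaging straightforward; formatting to a string can wait until display time.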

In [None]:
results['T5-Large summaries'] = summary_list_sum  # overwrites the column loaded from the spreadsheet
results['T5-Large summaries time taken'] = summary_list_time

In [None]:
# results.drop('T5', axis = 1, inplace = True)

In [None]:
def len_word(word):
  return len(word.split())

# Apply the function to the 'T5-Large summaries' column to get each summary's word count
results['T5-Large summary length'] = results['T5-Large summaries'].apply(len_word)

In [None]:
results

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Article,Length,T5-Large summaries,T5-Large summary length,T5-Small summaries,T5-Small summary length,BART summaries,BART summary length,T5-Large summariesss,T5-Large summaries time taken
0,0,0,In recent times we have been seeing an influx ...,526,in recent times we have been seeing an influx ...,82,the true essence of creativity lies in repeati...,61,In recent times we have been seeing an influx ...,71,in recent times we have been seeing an influx ...,99.119s
1,1,1,Linear regression is arguably one of the most ...,534,"in this blog post, i will walk you through the...",75,"we will be using the Ames Housing dataset, whi...",60,Linear regression is arguably one of the most ...,61,"in this blog post, i will walk you through the...",74.884s
2,2,2,Security is a growing concern among businesses...,802,security is a growing concern among businesses...,49,security is an arena that DevOps is starting t...,49,Security is a growing concern among businesses...,55,security is a growing concern among businesses...,112.387s
3,3,3,Innocent Interpretations for Some Suspicious S...,858,"some results appear weird, but is there eviden...",48,"some results appear weird, sure, but is there ...",50,Innocent Interpretations for Some Suspicious S...,46,"some results appear weird, but is there eviden...",88.121s
4,4,4,What is Machine Learning? There are millions o...,939,Machine Learning is an area of science that he...,100,machine learning is an area of science that he...,64,Machine Learning is an area of science that he...,66,Machine Learning is an area of science that he...,115.524s
5,5,5,10 Reads for Data Scientists Getting Started w...,1019,business understanding isn’t taught or evang...,61,this post is a great place to start if you’r...,54,10 Reads for Data Scientists Getting Started w...,67,business understanding isn’t taught or evan...,125.561s
6,6,6,Facebook’s latest Semi-Weak Supervised Learn...,1086,Facebook has been exploring the use of Instagr...,32,researchers at Facebook have been exploring th...,49,Researchers at Facebook have been exploring th...,40,Facebook has been exploring the use of Instagr...,169.256s
7,7,7,"This is my first article on medium. Here, I’...",1183,dimensionality reduction is the process of red...,53,dimensionality reduction is the process of red...,47,Dimensionality reduction is the process of re...,56,dimension reduction is the process of reducing...,207.354s
8,8,8,Augmenting categorical datasets with synthetic...,1175,a common problem for machine learning is unbal...,41,you need to build a classifier to assign sampl...,86,"In this article, we will look at augmenting ca...",45,a data scientist needs to balance the base rat...,182.349s
9,9,9,What is Blockchain Getting started with Bitcoi...,900,"in this post, i show how to run a query of the...",71,this post examines the usability of Python on ...,58,"In this post, I show a simple and straightforw...",52,"in this post, i show how to run a query of the...",116.093s


### **T5-Small**
Generating summaries (47 to 86 words) of 10 articles ranging in length from 526 to 1183 words

In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')



In [None]:
def t5_small(text):
    start_time = time.time()
    preprocess_text = text.strip().replace("\n","")
    tokenized_text = tokenizer.encode("summarize: " + preprocess_text, return_tensors="pt").to(device)
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length = 50, max_length = 200)

    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    end_time = time.time()
    time_taken = '{:0.3f}'.format(end_time - start_time)+' s'
    return output, time_taken

In [None]:
summary_list2 = []
for i in range(len(results.Article)):
    summary_list2.append(t5_small(results['Article'][i]))

In [None]:
summary_list2_time, summary_list2_sum = [], []
for i in range(len(summary_list2)):
    summary_list2_time.append(summary_list2[i][1])
    summary_list2_sum.append(summary_list2[i][0])

In [None]:
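# Assign the list itself (one summary per row); wrapping it in str() would
# broadcast a single string to every row
results['T5-Small summaries'] = summary_list2_sum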
results['T5-Small summaries time taken'] = summary_list2_time

In [None]:
results['T5-Small summary length'] = results['T5-Small summaries'].apply(len_word)

In [None]:
results

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Article,Length,T5-Large summaries,T5-Large summary length,T5-Small summaries,T5-Small summary length,BART summaries,BART summary length,T5-Large summariesss,T5-Large summaries time taken,T5-Small summaries time taken
0,0,0,In recent times we have been seeing an influx ...,526,in recent times we have been seeing an influx ...,82,['if an AI artist can get inspired from the hu...,601,In recent times we have been seeing an influx ...,71,in recent times we have been seeing an influx ...,99.119s,6.624 s
1,1,1,Linear regression is arguably one of the most ...,534,"in this blog post, i will walk you through the...",75,['if an AI artist can get inspired from the hu...,601,Linear regression is arguably one of the most ...,61,"in this blog post, i will walk you through the...",74.884s,7.630 s
2,2,2,Security is a growing concern among businesses...,802,security is a growing concern among businesses...,49,['if an AI artist can get inspired from the hu...,601,Security is a growing concern among businesses...,55,security is a growing concern among businesses...,112.387s,13.519 s
3,3,3,Innocent Interpretations for Some Suspicious S...,858,"some results appear weird, but is there eviden...",48,['if an AI artist can get inspired from the hu...,601,Innocent Interpretations for Some Suspicious S...,46,"some results appear weird, but is there eviden...",88.121s,18.381 s
4,4,4,What is Machine Learning? There are millions o...,939,Machine Learning is an area of science that he...,100,['if an AI artist can get inspired from the hu...,601,Machine Learning is an area of science that he...,66,Machine Learning is an area of science that he...,115.524s,12.446 s
5,5,5,10 Reads for Data Scientists Getting Started w...,1019,business understanding isn’t taught or evang...,61,['if an AI artist can get inspired from the hu...,601,10 Reads for Data Scientists Getting Started w...,67,business understanding isn’t taught or evan...,125.561s,11.989 s
6,6,6,Facebook’s latest Semi-Weak Supervised Learn...,1086,Facebook has been exploring the use of Instagr...,32,['if an AI artist can get inspired from the hu...,601,Researchers at Facebook have been exploring th...,40,Facebook has been exploring the use of Instagr...,169.256s,17.129 s
7,7,7,"This is my first article on medium. Here, I’...",1183,dimensionality reduction is the process of red...,53,['if an AI artist can get inspired from the hu...,601,Dimensionality reduction is the process of re...,56,dimension reduction is the process of reducing...,207.354s,11.583 s
8,8,8,Augmenting categorical datasets with synthetic...,1175,a common problem for machine learning is unbal...,41,['if an AI artist can get inspired from the hu...,601,"In this article, we will look at augmenting ca...",45,a data scientist needs to balance the base rat...,182.349s,14.452 s
9,9,9,What is Blockchain Getting started with Bitcoi...,900,"in this post, i show how to run a query of the...",71,['if an AI artist can get inspired from the hu...,601,"In this post, I show a simple and straightforw...",52,"in this post, i show how to run a query of the...",116.093s,12.275 s
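The two time columns in the table above are stored as strings, so they need parsing before they can be compared. The sketch below copies the recorded values from the table and computes the mean time per article for each model; `to_seconds` is a hypothetical helper.

```python
from statistics import mean

# Recorded values copied from the results table above
t5_large_times = ["99.119s", "74.884s", "112.387s", "88.121s", "115.524s",
                  "125.561s", "169.256s", "207.354s", "182.349s", "116.093s"]
t5_small_times = ["6.624 s", "7.630 s", "13.519 s", "18.381 s", "12.446 s",
                  "11.989 s", "17.129 s", "11.583 s", "14.452 s", "12.275 s"]


def to_seconds(value: str) -> float:
    """Parse strings such as '99.119s' or '6.624 s' into float seconds."""
    return float(value.rstrip("s").strip())


large_mean = mean(to_seconds(v) for v in t5_large_times)
small_mean = mean(to_seconds(v) for v in t5_small_times)

print("T5-Large mean: {:.1f} s".format(large_mean))
print("T5-Small mean: {:.1f} s".format(small_mean))
print("Ratio: {:.1f}x".format(large_mean / small_mean))
```

On these CPU runs T5-Large averages about 129 s per article against about 12.6 s for T5-Small, roughly a tenfold difference.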


In [None]:
# index=False avoids writing the row index as an extra 'Unnamed: 0' column
results.to_csv('Summary_Result.csv', index=False)

# Summarizing using ***`BART`*** model

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

In [None]:
# Load the pre-trained BART model and tokenizer once, rather than on every call
bart_model_name = "facebook/bart-large-cnn"
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name)

def generate_summary_with_bart(text):
    # Tokenize input text, truncating to at most 1024 tokens
    inputs = bart_tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary with beam search
    summary_ids = bart_model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

In [None]:
def bart(text):
    start_time = time.time()
    output = generate_summary_with_bart(text)
    end_time = time.time()
    time_taken = '{:0.3f}'.format(end_time - start_time)+' s'
    return output, time_taken

In [None]:
import pandas as pd
results = pd.read_excel('/content/Summary_Results.xlsx')

In [None]:
summary_list3 = []
for i in range(len(results.Article)):
    summary_list3.append(bart(results['Article'][i]))


In [None]:
summary_list3_time, summary_list3_sum = [], []
for i in range(len(summary_list3)):
    summary_list3_time.append(summary_list3[i][1])
    summary_list3_sum.append(summary_list3[i][0])

In [None]:
results['BART summaries'] = summary_list3_sum
results['BART summaries time taken'] = summary_list3_time

def len_word(word):
  return len(word.split())

# Apply the function to the 'BART summaries' column to get each summary's word count
results['BART summary length'] = results['BART summaries'].apply(len_word)

In [None]:
results

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Article,Length,T5-Large summaries,T5-Large summary length,T5-Small summaries,T5-Small summary length,BART summaries,BART summary length,BART summaries time taken
0,0,0,In recent times we have been seeing an influx ...,526,in recent times we have been seeing an influx ...,82,the true essence of creativity lies in repeati...,61,In recent times we have been seeing an influx ...,71,71.604 s
1,1,1,Linear regression is arguably one of the most ...,534,"in this blog post, i will walk you through the...",75,"we will be using the Ames Housing dataset, whi...",60,Linear regression is arguably one of the most ...,61,45.961 s
2,2,2,Security is a growing concern among businesses...,802,security is a growing concern among businesses...,49,security is an arena that DevOps is starting t...,49,Security is a growing concern among businesses...,55,51.703 s
3,3,3,Innocent Interpretations for Some Suspicious S...,858,"some results appear weird, but is there eviden...",48,"some results appear weird, sure, but is there ...",50,Innocent Interpretations for Some Suspicious S...,46,36.362 s
4,4,4,What is Machine Learning? There are millions o...,939,Machine Learning is an area of science that he...,100,machine learning is an area of science that he...,64,Machine Learning is an area of science that he...,66,35.056 s
5,5,5,10 Reads for Data Scientists Getting Started w...,1019,business understanding isn’t taught or evang...,61,this post is a great place to start if you’r...,54,10 Reads for Data Scientists Getting Started w...,67,34.311 s
6,6,6,Facebook’s latest Semi-Weak Supervised Learn...,1086,Facebook has been exploring the use of Instagr...,32,researchers at Facebook have been exploring th...,49,Researchers at Facebook have been exploring th...,40,32.911 s
7,7,7,"This is my first article on medium. Here, I’...",1183,dimensionality reduction is the process of red...,53,dimensionality reduction is the process of red...,47,Dimensionality reduction is the process of re...,56,31.332 s
8,8,8,Augmenting categorical datasets with synthetic...,1175,a common problem for machine learning is unbal...,41,you need to build a classifier to assign sampl...,86,"In this article, we will look at augmenting ca...",45,29.625 s
9,9,9,What is Blockchain Getting started with Bitcoi...,900,"in this post, i show how to run a query of the...",71,this post examines the usability of Python on ...,58,"In this post, I show a simple and straightforw...",52,32.339 s


In [None]:
# index=False stops each save/load round trip from adding another 'Unnamed: 0' column
results.to_excel('Summary_Results.xlsx', index=False)

In [None]:
# rouge = Rouge()
# avg = []
# for i in range(len(summary_list3)):
#     scores = rouge.get_scores(summary_list3[i], results['Article'][i])
#     avg.append(scores)

# print("ROUGE Score:")
# # print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
# # print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
# # print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))
# print(avg)

# results['BART scores'] = avg

ROUGE Score:
[[{'rouge-1': {'r': 0.20205479452054795, 'p': 1.0, 'f': 0.33618233338560566}, 'rouge-2': {'r': 0.14285714285714285, 'p': 1.0, 'f': 0.24999999781250004}, 'rouge-l': {'r': 0.20205479452054795, 'p': 1.0, 'f': 0.33618233338560566}}], [{'rouge-1': {'r': 0.17582417582417584, 'p': 1.0, 'f': 0.29906541801729414}, 'rouge-2': {'r': 0.11847389558232932, 'p': 1.0, 'f': 0.21184919020644713}, 'rouge-l': {'r': 0.17582417582417584, 'p': 1.0, 'f': 0.29906541801729414}}], [{'rouge-1': {'r': 0.11764705882352941, 'p': 1.0, 'f': 0.2105263139058172}, 'rouge-2': {'r': 0.06620689655172414, 'p': 0.9230769230769231, 'f': 0.12355212230321884}, 'rouge-l': {'r': 0.11764705882352941, 'p': 1.0, 'f': 0.2105263139058172}}], [{'rouge-1': {'r': 0.09172259507829977, 'p': 1.0, 'f': 0.16803278534609312}, 'rouge-2': {'r': 0.05520702634880803, 'p': 1.0, 'f': 0.10463733551253322}, 'rouge-l': {'r': 0.09172259507829977, 'p': 1.0, 'f': 0.16803278534609312}}], [{'rouge-1': {'r': 0.12142857142857143, 'p': 0.9622641509
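
The scores above show precision at or near 1.0 with recall around 0.1 to 0.2. That pattern is what one would expect if each summary is scored against the full article (as the commented-out call with `results['Article'][i]` suggests) rather than against a human-written reference summary: almost every word of the summary occurs in the article, but the summary covers only a small part of it. The sketch below illustrates this with a simplified unigram-overlap ROUGE-1. It is an illustration only, not the `rouge` package's exact implementation, and the `article` and `summary` strings are made up.

```python
from collections import Counter


def rouge1(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram-overlap precision, recall and F1."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    p = overlap / max(sum(hyp_counts.values()), 1)
    r = overlap / max(sum(ref_counts.values()), 1)
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}


# Made-up example: the "summary" is copied word for word from the "article".
article = "the cat sat on the mat while the dog slept by the door"
summary = "the cat sat on the mat"

scores = rouge1(summary, article)
print(scores)
```

Because every summary word appears in the article, precision is 1.0, while recall is only the fraction of the article that the summary covers. Scoring against reference summaries would give a more informative comparison between models.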

#Summarizing using ***`BigBird`*** model (Not working: Version Issue)

In [None]:
!pip install --upgrade transformers



In [None]:
# transformers has no class named BigBirdForConditionalGeneration, which is why the
# original import raised:
#   ImportError: cannot import name 'BigBirdForConditionalGeneration' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
# The sequence-to-sequence BigBird variant is BigBirdPegasusForConditionalGeneration.
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

In [None]:
def generate_summary_with_bigbird(text):
    # Load a pre-trained BigBird-Pegasus model and its tokenizer.
    # Assumption: this checkpoint is swapped in for the original "google/bigbird-base-pubmed";
    # it has not been run in this notebook.
    model_name = "google/bigbird-pegasus-large-pubmed"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = BigBirdPegasusForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text (long inputs are truncated to 4096 tokens)
    inputs = tokenizer([text], max_length=4096, return_tensors='pt', truncation=True)

    # Generate summary with beam search
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

In [None]:
def big_bird(text):
    return generate_summary_with_bigbird(text)

# Summaries by **`PEGASUS`**

In [None]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [None]:
def generate_summary_with_pegasus(text):
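    # Load pre-trained Pegasus model and tokenizer.
    # Note: this loads the weights again on every call, so the per-article timings include load time.
    # "google/pegasus-large" is the pre-trained checkpoint, not one fine-tuned on a summarization dataset.
    model_name = "google/pegasus-large"
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)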

    # Tokenize input text
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

In [None]:
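import time

def pega_sus(text):
    # Return the PEGASUS summary together with the wall-clock time it took, e.g. '72.217 s'
    start_time = time.time()
    output = generate_summary_with_pegasus(text)
    end_time = time.time()
    time_taken = '{:0.3f}'.format(end_time - start_time) + ' s'
    return output, time_taken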

In [None]:
# Each entry is a (summary, time_taken) tuple
summary_list4 = [pega_sus(article) for article in results['Article']]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
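
This warning is printed once per article, which suggests that `generate_summary_with_pegasus` reloads the checkpoint on every call and that the recorded times include model loading. One possible way to avoid that is to cache the loader. The sketch below shows the pattern with `functools.lru_cache`; `load_model` and `summarize` are hypothetical stand-ins (they do not call `transformers`), so the example runs without downloading anything.

```python
from functools import lru_cache

load_calls = 0  # counts how often the stand-in loader actually runs


@lru_cache(maxsize=None)
def load_model(model_name: str) -> dict:
    """Hypothetical stand-in for the expensive from_pretrained calls."""
    global load_calls
    load_calls += 1
    return {"name": model_name}  # placeholder for (tokenizer, model)


def summarize(text: str, model_name: str = "google/pegasus-large") -> str:
    model = load_model(model_name)  # served from the cache after the first call
    return f"[{model['name']}] {text[:20]}"  # placeholder for generate + decode


outputs = [summarize(t) for t in ["first article text", "second article text"]]
print(load_calls)
```

In the notebook, the body of `load_model` would hold the two `from_pretrained` calls and return the tokenizer and model, so only the first article pays the loading cost.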



In [None]:
# Split the (summary, time_taken) tuples into two parallel lists
summary_list4_sum = [summary for summary, _ in summary_list4]
summary_list4_time = [time_taken for _, time_taken in summary_list4]

In [None]:
results['PEGASUS summaries'] = summary_list4_sum
results['PEGASUS summaries time taken'] = summary_list4_time

def len_word(text):
    # Number of whitespace-separated words in a summary
    return len(text.split())

# Apply the function to the 'PEGASUS summaries' column and store the word counts in a new column
results['PEGASUS summary length'] = results['PEGASUS summaries'].apply(len_word)

In [None]:
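# Note: each summary is scored against the full article, not a reference summary,
# so precision close to 1.0 and low recall are expected.
# rouge = Rouge()
# avg = []
# for i in range(len(summary_list4_sum)):
#     # summary_list4 holds (summary, time_taken) tuples, so score the text in summary_list4_sum
#     scores = rouge.get_scores(summary_list4_sum[i], results['Article'][i])
#     avg.append(scores)

# print("ROUGE Score:")
# # print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
# # print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
# # print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))
# print(avg)

# results['PEGASUS scores'] = avg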

ROUGE Score:
[[{'rouge-1': {'r': 0.2534246575342466, 'p': 0.9866666666666667, 'f': 0.4032697515164565}, 'rouge-2': {'r': 0.18571428571428572, 'p': 0.9479166666666666, 'f': 0.3105802020384629}, 'rouge-l': {'r': 0.2534246575342466, 'p': 0.9866666666666667, 'f': 0.4032697515164565}}], [{'rouge-1': {'r': 0.19413919413919414, 'p': 0.9814814814814815, 'f': 0.3241590186493842}, 'rouge-2': {'r': 0.12248995983935743, 'p': 0.953125, 'f': 0.21708184851559634}, 'rouge-l': {'r': 0.19413919413919414, 'p': 0.9814814814814815, 'f': 0.3241590186493842}}], [{'rouge-1': {'r': 0.17379679144385027, 'p': 0.9558823529411765, 'f': 0.2941176444552733}, 'rouge-2': {'r': 0.11586206896551725, 'p': 0.9032258064516129, 'f': 0.20537897108981298}, 'rouge-l': {'r': 0.17379679144385027, 'p': 0.9558823529411765, 'f': 0.2941176444552733}}], [{'rouge-1': {'r': 0.015659955257270694, 'p': 0.4666666666666667, 'f': 0.03030302967476248}, 'rouge-2': {'r': 0.006273525721455458, 'p': 0.3125, 'f': 0.012300122615372293}, 'rouge-l':
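
The output above is a nested list with one single-element list of score dictionaries per article, which is hard to compare across models. A minimal sketch of how it could be reduced to one precision, recall and F1 value per metric follows. `average_rouge` is a hypothetical helper, and `sample` contains only the `rouge-1` values of the first two PEGASUS entries copied from the output, so the resulting numbers are not the averages over all ten articles.

```python
def average_rouge(per_article: list, metric: str = "rouge-1") -> dict:
    """Average p, r and f for one metric over a list shaped like the output above."""
    totals = {"p": 0.0, "r": 0.0, "f": 0.0}
    if not per_article:
        return totals
    for entry in per_article:
        scores = entry[0][metric]  # each article's result is a one-element list
        for key in totals:
            totals[key] += scores[key]
    return {key: value / len(per_article) for key, value in totals.items()}


# rouge-1 values of the first two PEGASUS entries, copied from the output above.
sample = [
    [{"rouge-1": {"r": 0.2534246575342466, "p": 0.9866666666666667, "f": 0.4032697515164565}}],
    [{"rouge-1": {"r": 0.19413919413919414, "p": 0.9814814814814815, "f": 0.3241590186493842}}],
]

avg = average_rouge(sample)
print({key: round(value, 3) for key, value in avg.items()})
```

The `rouge` package can also return averaged scores directly when `get_scores` is called with lists of hypotheses and references and `avg=True`.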

In [None]:
results

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Article,Length,T5-Large summaries,T5-Large summary length,T5-Small summaries,T5-Small summary length,BART summaries,BART summary length,BART summaries time taken,PEGASUS summaries,PEGASUS summaries time taken,PEGASUS summary length
0,0,0,In recent times we have been seeing an influx ...,526,in recent times we have been seeing an influx ...,82,the true essence of creativity lies in repeati...,61,In recent times we have been seeing an influx ...,71,71.604 s,"In such works, a machine learns and gets inspi...",145.448 s,98
1,1,1,Linear regression is arguably one of the most ...,534,"in this blog post, i will walk you through the...",75,"we will be using the Ames Housing dataset, whi...",60,Linear regression is arguably one of the most ...,61,45.961 s,Calculating Outliers There are multiple ways t...,76.657 s,67
2,2,2,Security is a growing concern among businesses...,802,security is a growing concern among businesses...,49,security is an arena that DevOps is starting t...,49,Security is a growing concern among businesses...,55,51.703 s,In looking at how DevOps can make an impact in...,98.972 s,96
3,3,3,Innocent Interpretations for Some Suspicious S...,858,"some results appear weird, but is there eviden...",48,"some results appear weird, sure, but is there ...",50,Innocent Interpretations for Some Suspicious S...,46,36.362 s,In the next weeks ITMll explore some weird dat...,93.671 s,34
4,4,4,What is Machine Learning? There are millions o...,939,Machine Learning is an area of science that he...,100,machine learning is an area of science that he...,64,Machine Learning is an area of science that he...,66,35.056 s,If you have built-in some rules in the backend...,84.161 s,126
5,5,5,10 Reads for Data Scientists Getting Started w...,1019,business understanding isn’t taught or evang...,61,this post is a great place to start if you’r...,54,10 Reads for Data Scientists Getting Started w...,67,34.311 s,10 Reads for Data Scientists Getting Started w...,72.217 s,84
6,6,6,Facebook’s latest Semi-Weak Supervised Learn...,1086,Facebook has been exploring the use of Instagr...,32,researchers at Facebook have been exploring th...,49,Researchers at Facebook have been exploring th...,40,32.911 s,https://ai.facebook.com/blog/billion-scale-sem...,78.356 s,80
7,7,7,"This is my first article on medium. Here, I’...",1183,dimensionality reduction is the process of red...,53,dimensionality reduction is the process of red...,47,Dimensionality reduction is the process of re...,56,31.332 s,Your feature set could be a dataset with a hun...,57.397 s,55
8,8,8,Augmenting categorical datasets with synthetic...,1175,a common problem for machine learning is unbal...,41,you need to build a classifier to assign sampl...,86,"In this article, we will look at augmenting ca...",45,29.625 s,One approach is to augment the data and synthe...,72.563 s,110
9,9,9,What is Blockchain Getting started with Bitcoi...,900,"in this post, i show how to run a query of the...",71,this post examines the usability of Python on ...,58,"In this post, I show a simple and straightforw...",52,32.339 s,"For an overview of the project and the posts, ...",69.019 s,83
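
The time columns in this table are stored as strings such as `71.604 s`, so they need to be parsed before the models can be compared numerically. A small sketch of that step, using the ten BART and PEGASUS timings copied from the table above (the `seconds` helper is hypothetical):

```python
from statistics import mean


def seconds(value: str) -> float:
    """Convert a string such as '71.604 s' to a float number of seconds."""
    return float(value.split()[0])


# Timings copied from the 'BART summaries time taken' column.
bart_times = ["71.604 s", "45.961 s", "51.703 s", "36.362 s", "35.056 s",
              "34.311 s", "32.911 s", "31.332 s", "29.625 s", "32.339 s"]
# Timings copied from the 'PEGASUS summaries time taken' column.
pegasus_times = ["145.448 s", "76.657 s", "98.972 s", "93.671 s", "84.161 s",
                 "72.217 s", "78.356 s", "57.397 s", "72.563 s", "69.019 s"]

bart_mean = mean(seconds(t) for t in bart_times)
pegasus_mean = mean(seconds(t) for t in pegasus_times)
print(f"BART mean: {bart_mean:.3f} s, PEGASUS mean: {pegasus_mean:.3f} s")
```

On these ten articles PEGASUS took roughly twice as long per summary as BART (about 85 s versus about 40 s), with the caveat that both figures appear to include model loading.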


In [None]:
# results.drop(['PEGASUS summaries', 'PEGASUS summary length', 'PEGASUS scores'], axis = 1, inplace = True)

In [None]:
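# index=False keeps the row index out of the file, which avoids extra "Unnamed: 0" columns on reload
results.to_excel('Summary_Results.xlsx', index=False)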

#Misc