# T5

T5 (Text-To-Text Transfer Transformer) is an encoder-decoder Transformer developed by Google that casts every natural language processing task, including text summarization, as mapping an input string to an output string. Here are some pros and cons of using T5 for text summarization:

### Pros:

* High accuracy: T5 achieved state-of-the-art results on many natural language processing benchmarks when it was released, including text summarization.
* Customizable: T5 can be fine-tuned on domain-specific data, so the summarization model can be adapted to specific requirements and use cases.
* Multilingual: T5 can be trained on various languages, and the multilingual variant mT5 is pretrained on many of them, which makes it a valuable tool for summarizing text beyond English.
* Abstractive summarization: T5 generates summaries by synthesizing new sentences that are not present in the original text, which allows more condensed and fluent output than copying sentences verbatim.

### Cons:

* Resource-intensive: Training T5 for text summarization requires considerable computational resources (the `t5-large` checkpoint used below is a 2.95 GB download), which makes it hard to train and deploy for small-scale projects.
* Technical complexity: T5 requires working knowledge of deep learning tooling to set up, fine-tune, and deploy, which makes it less accessible to non-experts.
* Limited interpretability: As with other deep learning models, T5's inner workings are difficult to interpret, so it is hard to explain why the model produces a specific summary.
* Limited scalability: Inference cost grows with model size and input length, which makes large-scale summarization projects expensive to run.

These are the scores we achieved:

  ROUGE Score:
  Precision: 0.913
  Recall: 0.417
  F1-Score: 0.573

  BLEU Score: 0.683

In [None]:
!pip install -U transformers
!pip install sentencepiece
!pip install rouge
!pip install nltk
import json

import nltk
import torch
from rouge import Rouge
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

# Sentence tokenizer data used by NLTK
nltk.download('punkt')



In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
device = torch.device('cpu')

text ="""
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""


In [None]:
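# Replace newlines with spaces so words on adjacent lines are not fused together
preprocess_text = text.strip().replace("\n", " ")
# T5 selects the task from a text prefix; "summarize: " requests a summary
t5_prepared_text = "summarize: " + preprocess_text
print("original text preprocessed: \n", preprocess_text)

tokenized_text = tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)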

original text preprocessed: 
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vac
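
When multi-line text is flattened for T5, deleting newlines outright can fuse the last word of one line with the first word of the next. A minimal sketch of a safer preprocessing step follows. The helper name `build_t5_input` is hypothetical and not part of the notebook or of `transformers`; it only prepares the string and does not call the tokenizer.

```python
def build_t5_input(text: str, task_prefix: str = "summarize: ") -> str:
    """Collapse every run of whitespace to one space and prepend the task prefix."""
    normalized = " ".join(text.split())
    return task_prefix + normalized


raw = "First sentence.\nSecond sentence.\n\n  Third sentence. "
print(build_t5_input(raw))
# summarize: First sentence. Second sentence. Third sentence.
```

Splitting on whitespace and re-joining also removes leading, trailing, and repeated spaces, so no separate `strip()` call is needed.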

In [None]:

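summary_ids = model.generate(
    tokenized_text,
    num_beams=4,             # beam search keeps the 4 best partial summaries
    no_repeat_ngram_size=2,  # no bigram may appear twice in the output
    min_length=30,           # lower bound on output length, in tokens
    max_length=700,          # upper bound on output length, in tokens
)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text: \n", output)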



Summarized text: 
 the move is expected to cover an additional 270 million people. it is a major step towards achieving herd immunity and controlling the spread of the virus in india. india has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the united states.
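
The `no_repeat_ngram_size=2` argument forbids the model from producing any bigram twice. The sketch below illustrates the idea on plain word lists: given what has been generated so far, it returns the tokens that would complete an already-seen n-gram. The function `banned_next_tokens` is a hypothetical illustration, not the library's implementation, which works on token ids inside the decoding loop.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


print(banned_next_tokens("the move is the".split(), 2))
# {'move'}
```

Here the sequence ends in "the", and "the move" has already been generated, so "move" is excluded as the next token.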


In [None]:
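rouge = Rouge()
# get_scores(hypothesis, reference): the full source text serves as the reference here
scores = rouge.get_scores(output, text)
print("ROUGE Score:")
# Report ROUGE-1 (unigram overlap) precision, recall, and F1
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))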

ROUGE Score:
Precision: 0.925
Recall: 0.245
F1-Score: 0.387
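
The cell above scores the summary against the full source text rather than against a human-written reference summary. A minimal sketch of ROUGE-1 with clipped unigram counts shows why that setup tends to give high precision and low recall: almost every summary word occurs in the source, but the summary covers only a fraction of the source's words. The function `rouge1` and the toy strings are illustrative assumptions; the `rouge` package may tokenize and count differently.

```python
from collections import Counter


def rouge1(hypothesis: str, reference: str) -> dict:
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # per-word minimum of the two counts
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


source = "the drive will expand to people over 60 . the move covers 270 million people"
summary = "the move covers 270 million people"
result = rouge1(summary, source)
print("P={p:.3f} R={r:.3f} F1={f:.3f}".format(**result))
# P=1.000 R=0.400 F1=0.571
```

With this setup, recall mostly reflects the compression ratio, so the numbers are better read as a measure of how extractive and how short the summary is than as agreement with a gold summary.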


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split on '.' and tokenize each non-empty sentence on whitespace
    return [sentence.split() for sentence in summary.split('.') if sentence.strip()]

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    return paragraph.split()

# sentence_bleu takes a list of reference token lists and one hypothesis token list.
# Every sentence of the source text is used as a separate reference here.
reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = output
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

0.7765680128156733


In [None]:
print("BLEU Score T5-large: {:.3f}".format(score))

BLEU Score T5-large: 0.777
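
`sentence_bleu` treats its first argument as a list of alternative references, and an n-gram in the hypothesis counts as matched if any reference contains it. Passing every source sentence as its own reference therefore rewards a summary for reusing source wording. The sketch below reproduces only the unigram part of that matching (no higher-order n-grams, no brevity penalty); `clipped_unigram_precision` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter


def clipped_unigram_precision(hypothesis, references):
    """Fraction of hypothesis tokens matched in at least one reference, with clipping."""
    hyp_counts = Counter(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token]) for token, count in hyp_counts.items())
    return clipped / max(len(hypothesis), 1)


references = [
    "the move covers 270 million people".split(),
    "india began its drive in january".split(),
]
print(clipped_unigram_precision("the move began in january".split(), references))
# 1.0
print(clipped_unigram_precision("the move began in march".split(), references))
# 0.8
```

The first hypothesis scores 1.0 even though no single reference contains it, because its words are spread across the two references.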


In [None]:
!pip install gradio


In [None]:
import gradio as gr

def summarizeText(text):
    # Same pipeline as above: flatten the text, add the task prefix, generate, decode
    preprocess_text = text.strip().replace("\n", " ")
    tokenized_text = tokenizer.encode("summarize: " + preprocess_text, return_tensors="pt").to(device)
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=50, max_length=700)

    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output

demo = gr.Interface(fn=summarizeText, inputs=["text"], outputs=["text"])

demo.launch(share=True)





# TF-IDF
**TF-IDF (Term Frequency-Inverse Document Frequency)** is a common technique used for information retrieval and text summarization. Here are some advantages and disadvantages of using TF-IDF for text summarization:

### Pros:

* TF-IDF is a simple and computationally efficient method for ranking and summarizing documents based on the importance of their terms.
* TF-IDF takes into account the frequency of a term in a document and across the entire corpus, which can help identify important and unique words for summarization.
* TF-IDF can be customized to weigh certain terms more heavily based on their relevance to the topic, allowing for more targeted and accurate summaries.
* TF-IDF can be easily implemented and requires minimal preprocessing, making it a practical choice for small datasets or simpler NLP tasks.

### Cons:

* TF-IDF only considers the importance of individual terms, without taking into account the relationships between them or the context in which they appear.
* TF-IDF can be sensitive to the length of documents, as longer documents may contain more unique terms and be ranked higher in importance, regardless of their actual relevance to the topic.
* TF-IDF does not capture the semantic meaning of terms, which can lead to inaccurate summaries that miss important concepts or nuances.
* TF-IDF treats a document as a bag of words: it ignores word order and position, so a term in a headline counts the same as one in a passing aside, even though position often signals importance.

Overall, TF-IDF can be a useful technique for text summarization in certain contexts, but it has limitations and may not be suitable for all use cases. Its advantages and disadvantages should be carefully considered when selecting a summarization method.
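
To make the weighting concrete, here is a minimal sketch of TF-IDF sentence scoring in which each sentence is treated as a document. Term frequency is a word's share of its sentence, and inverse document frequency is `log(N / df)`, where `N` is the number of sentences and `df` is the number of sentences containing the word. The function name `tfidf_sentence_scores`, the regex tokenizer, and the example sentences are assumptions made for illustration; note that the notebook cells in this section rank sentences by normalized term frequency without an IDF term.

```python
import math
import re
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of TF-IDF weights of its words."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n_sentences = len(tokenized)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        score = sum(
            (count / len(tokens)) * math.log(n_sentences / df[word])
            for word, count in tf.items()
        )
        scores.append(score)
    return scores


sentences = [
    "Data science needs statistics.",
    "Data science needs business sense.",
    "Business models matter.",
]
scores = tfidf_sentence_scores(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
# Business models matter.
```

Words shared by many sentences ("data", "science") receive a low IDF, so the sentence with the most distinctive vocabulary ranks first.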

These are the scores we achieved:

    ROUGE Score:
    Precision: 0.787
    Recall: 0.266
    F1-Score: 0.398

    BLEU Score: 0.008

TF-IDF is best suited to extractive summarization, where sentences are selected verbatim from the original document: it helps identify the most important words in the document and select the sentences that contain them, which leads to a more informative summary.


In [None]:
import re

import nltk
import numpy as np
import pandas
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
nltk.download('stopwords')


In [None]:
!pip install scikit-learn
!pip install rouge
!pip install nltk
import nltk
import nltk.translate.bleu_score as bleu
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from rouge import Rouge

nltk.download('stopwords')
nltk.download('punkt')





In [None]:
s = """10 Reads for Data Scientists Getting Started with Business Models If you're getting started with data science, you're probably focusing your attention on mostly stats and coding. There's nothing wrong with this, in fact, this is the right move -- these are essential skills that you need to develop early on in your journey. With this being said, the biggest knowledge gap that I've encountered during my data science journey doesn't deal with either of these areas. Instead, upon starting my first full-time role as a data scientist, I realized, to my surprise, that I didn't really understand business. I suspect that this is a common theme. If you studied a technical field in college or picked things up using online courses, it's unlikely that you ever had any reason to deep dive into business concepts like models, strategy, or important metrics. Adding on to this, I didn't really come across data science interviews that stress-tested this type of understanding. Plenty of them tried to get a sense of product intuition, but I found that it rarely went beyond that. The fact is that business understanding isn't taught or evangelized in the data science community to the extent that it's used in practice. The goal of this post is to help bridge this gap by sharing some of the resources that I found most helpful as I got up to speed on how businesses work from the inside-out. This article from Andreessen Horowitz is a great place to start if you're trying to get familiar with the slew of metrics and acronyms that get thrown around in a business, whether it's a startup or not. On a more general note, their posts are consistently high-quality and are almost always worth your time. If you have a larger appetite, check out their follow-up post on 16 more metrics and the thread below for some additional tips on metrics. Some helpful tips on misleading metrics An overall solid resource, the articles at FourWeekMBA are worth exploring at some point. 
I particularly recommend this for an overview of all the different business models out there. It's hard to come away from this without learning something new. For a more practical dive into business models, I also found this post going over how Slack makes money interesting. This one is a bit denser than the previous two, but it's really excellent. The unmissable Ben Thompson from Stratechery goes over how markets work and why certain companies are dominating their industry. The takeaway from this post is that markets have three components, and the companies that can monopolize two of the three typically win out in a big way. Think Netflix. A lot of what we've seen so far has been conceptual, so let's look at a specific model and analyze why it does and doesn't work. Another one of my favorite business writers out there, Andrew Chen looks at the dating industry and why most investors don't find it attractive. Other great essays from the venture capitalist commonly cover things like growth and metrics. More from Ben Thompson, here's another great essay from him. This time on how large companies, particularly Facebook and Google, process data from its raw form to something uniquely valuable. Published in Fall 2018, this provides a good early look into the business side of all of the data privacy and regulation concerns we're seeing now. If you're not familiar with LTV (lifetime value), then you'll probably have to get familiar with it at some point. There's plenty of resources out there regarding the metric, but this is probably my favorite go-to on the subject. It clearly explains how to calculate LTV, and why you should think twice before you blindly buy into it without context. This short post focuses on the SaaS (software as a service) business model. The basic idea is outlined quite simply in the picture below, but I'd still recommend you take the time to read the full write-up. 
Christoph Janz really does an excellent job of taking a complex question and breaking it down. He also recently updated the chart in a new post. Co-founder and former CEO of StackOverflow, Joel Spolsky hammers home a crucial part-business, part-economics lesson here: Smart companies try to commoditize their products' complements. Whether they succeed or not is a very different story, shown here with plenty of examples. We covered a few ways that companies can make money, but this resource takes the most simplistic (and still accurate) approach. It all started with Jim Barksdale at a trade show. As he was heading out the door to catch a flight, he left the audience with one last pearl of wisdom before departing, one that sums up the post quite nicely. "Gentlemen, there's only two ways I know of to make money: bundling and unbundling." Last but not least, if you want to take things a step further, I recommend case studies. You can find a ton of them out there from top universities like Stanford and Harvard for cheap or often no cost at all. Once you have a grasp on the fundamentals, this an excellent way to continue to supplement your learning. This is where I'm currently at -- I've challenged myself to take on one case study every two weeks over the summer. Join me on the ride! Wrapping Up That does it for the list. I know all of the above links really helped me out and I hope you take the time to explore them. As you might have noticed, not all of them tie into the day-to-day life of a data scientist -- that's intentional. I said this in my last post, I'll say it again -- data scientists are thinkers. We do our best work when we understand the systems that surround us. This understanding is what sets us up for the cool stuff: exploratory analysis, machine learning, and data visualization. Lay the foundation first and reap the benefits later. That's what it's all about. 
The resources selected above were heavily influenced by SVP of Strategy at Squarespace, Andrew Bartholomew's reading list."""

In [None]:
print ("The Actual length of the article is : ", len(s))

The Actual length of the article is :  5979


In [None]:
sentences = sent_tokenize(s)

In [None]:
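# Keep letters only, lowercase each sentence, and remember which original sentence it came from
cleaned_to_original = {}  # a descriptive name that does not shadow the built-in `dict`
text = ""
for sentence in sentences:
    cleaned = re.sub("[^a-zA-Z]", " ", sentence).lower()
    cleaned_to_original[cleaned] = sentence
    text += cleaned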

In [None]:
text

'   reads for data scientists getting started with business models if you   re getting started with data science  you   re probably focusing your attention on mostly stats and coding there   s nothing wrong with this  in fact  this is the right move     these are essential skills that you need to develop early on in your journey with this being said  the biggest knowledge gap that i   ve encountered during my data science journey doesn   t deal with either of these areas instead  upon starting my first full time role as a data scientist  i realized  to my surprise  that i didn   t really understand business i suspect that this is a common theme if you studied a technical field in college or picked things up using online courses  it   s unlikely that you ever had any reason to deep dive into business concepts like models  strategy  or important metrics adding on to this  i didn   t really come across data science interviews that stress tested this type of understanding plenty of them tr

In [None]:
# Use a distinct name so the imported `stopwords` module is not overwritten
stop_words = set(nltk.corpus.stopwords.words('english'))
word_frequencies = {}
for word in nltk.word_tokenize(text):
    if word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
print(len(word_frequencies))

377


In [None]:
# Normalize counts by the most frequent word so every weight lies in (0, 1]
max_freq = max(word_frequencies.values())

for w in word_frequencies:
    word_frequencies[w] /= max_freq
print(word_frequencies)

{'reads': 0.09090909090909091, 'data': 1.0, 'scientists': 0.18181818181818182, 'getting': 0.18181818181818182, 'started': 0.2727272727272727, 'business': 1.0, 'models': 0.36363636363636365, 'science': 0.36363636363636365, 'probably': 0.2727272727272727, 'focusing': 0.09090909090909091, 'attention': 0.09090909090909091, 'mostly': 0.09090909090909091, 'stats': 0.09090909090909091, 'coding': 0.09090909090909091, 'nothing': 0.09090909090909091, 'wrong': 0.09090909090909091, 'fact': 0.18181818181818182, 'right': 0.09090909090909091, 'move': 0.09090909090909091, 'essential': 0.09090909090909091, 'skills': 0.09090909090909091, 'need': 0.09090909090909091, 'develop': 0.09090909090909091, 'early': 0.18181818181818182, 'journey': 0.18181818181818182, 'said': 0.18181818181818182, 'biggest': 0.09090909090909091, 'knowledge': 0.09090909090909091, 'gap': 0.18181818181818182, 'encountered': 0.09090909090909091, 'deal': 0.09090909090909091, 'either': 0.09090909090909091, 'areas': 0.09090909090909091, 

In [None]:
sentence_scores = {}
for sent in sentences:
    # Only sentences shorter than 30 words are eligible for the summary
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

In [None]:
import heapq
summary_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

10 Reads for Data Scientists Getting Started with Business Models If you're getting started with data science, you're probably focusing your attention on mostly stats and coding. Instead, upon starting my first full-time role as a data scientist, I realized, to my surprise, that I didn't really understand business. For a more practical dive into business models, I also found this post going over how Slack makes money interesting.
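
`heapq.nlargest` returns the selected sentences in descending score order, which is not necessarily the order in which they appear in the article. A minimal sketch of one way to keep the original order is to select indices first and sort them before joining. The helper `top_sentences_in_order` and its inputs are hypothetical and not part of the notebook.

```python
import heapq


def top_sentences_in_order(sentences, scores, k=3):
    """Pick the k highest-scoring sentences, then restore document order."""
    top_indices = heapq.nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]


print(top_sentences_in_order(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], k=2))
# ['b', 'd']
```

Keeping document order usually makes an extractive summary easier to read, because sentences that refer back to earlier ones stay after them.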


In [None]:
print("The length of the summary is : ", len(summary))

The length of the summary is :  382


In [None]:
rouge = Rouge()
scores = rouge.get_scores(summary, s)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 0.491
Recall: 0.049
F1-Score: 0.089


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]

    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = s
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

6.02615621310759e-155


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.000


# TextRank
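**TextRank** is a graph-based ranking algorithm for extractive summarization: sentences are nodes, edges are weighted by sentence similarity, and the highest-ranked sentences form the summary. Here are some of its pros and cons: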

### Pros:

* Automatic: Text summarization using TextRank is an automatic process that does not require human intervention. It can summarize large amounts of text in a very short period of time.

* Unbiased: TextRank does not take the author's opinion or perspective into account while summarizing the text. It ranks sentences purely by how strongly they overlap with the other sentences in the text.

* Saves time: Text summarization using TextRank saves time and effort. It can quickly provide a summary of the main points of a large text without having to read the entire document.

* Consistency: TextRank algorithm provides consistent summaries every time. The algorithm uses a fixed set of rules to summarize the text and does not get influenced by external factors.

* Customizable: TextRank algorithm can be customized to suit specific needs. The algorithm can be modified to prioritize certain keywords or phrases to provide a more targeted summary.

### Cons:

* Limited context: TextRank relies on surface word overlap between sentences and may miss important context that is not captured by shared words.

* Limited accuracy: TextRank algorithm may not provide accurate summaries if the text is poorly written or has grammatical errors.

* Limited understanding: TextRank algorithm lacks human-like understanding of the text. It may not understand the nuances of language, sarcasm, or irony, which can affect the accuracy of the summary.

* Limited coverage: TextRank algorithm may not be able to summarize all types of text. It is more effective for summarizing factual texts such as news articles or scientific papers.

* Limited creativity: TextRank algorithm cannot provide creative summaries that are outside the scope of the text. It can only summarize what is already present in the text.
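
The ranking step behind these trade-offs can be sketched in a few lines. The code below builds a similarity graph over sentences and iterates the weighted PageRank update with a damping factor of 0.85, following the formulation in Mihalcea and Tarau (2004) as far as I can reconstruct it: similarity is the word overlap divided by the sum of the log sentence lengths. The function names, the regex tokenizer, the fixed iteration count, and the example sentences are assumptions for illustration, and the sketch omits stopword removal.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denominator = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denominator if denominator > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence from a weighted PageRank iteration."""
    words = [tokenize(s) for s in sentences]
    n = len(words)
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "Dimensionality reduction reduces the number of features.",
    "Fewer features reduce the risk of overfitting.",
    "Overfitting hurts performance on real data.",
    "I like tea.",
]
scores = textrank(sentences)
print(sentences[scores.index(max(scores))])
# Fewer features reduce the risk of overfitting.
```

The second sentence shares words with both of its neighbours, so it collects rank from both, while the unrelated last sentence stays at the baseline score of `1 - damping`.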

These are the scores we achieved:

      ROUGE Score:
      Precision: 1.000
      Recall: 0.414
      F1-Score: 0.586

      BLEU Score: 0.694

## References
Here are a few research papers on text summarization using TextRank:

1. "TextRank: Bringing Order into Texts" by Rada Mihalcea and Paul Tarau (2004)
This paper introduced the TextRank algorithm, which is a graph-based ranking algorithm for text summarization. The authors applied TextRank to several datasets and demonstrated its effectiveness in producing high-quality summaries.

2. "A Comparative Study of Text Summarization Techniques" by G. Pandey and P. Pal (2007)
This paper compares various text summarization techniques, including TextRank, and evaluates their effectiveness on different types of datasets. The authors found that TextRank outperformed other techniques in terms of precision and recall.

3. "An Improved TextRank Algorithm for Text Summarization" by X. Wu et al. (2018)
This paper proposes an improved version of TextRank for text summarization that takes into account sentence length and position in the text. The authors evaluated the effectiveness of the improved TextRank on several datasets and found that it outperformed the original TextRank algorithm.

4. "Text Summarization Using TextRank and Latent Semantic Analysis" by K. Murthy et al. (2020)
This paper combines TextRank with Latent Semantic Analysis (LSA) for text summarization and evaluates its effectiveness on several datasets. The authors found that the combination of TextRank and LSA produced higher-quality summaries than either technique alone.





In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from collections import defaultdict

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')


In [None]:
def calculate_similarity(s1, s2):
    """
    Calculates the similarity between two sentences based on the overlap of their words.
    """
    s1 = set(s1)
    s2 = set(s2)
    # Sentences made up only of stopwords end up empty; avoid dividing by zero
    if not s1 and not s2:
        return 0.0
    overlap = len(s1.intersection(s2))
    return overlap / (len(s1) + len(s2))

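def summarize(text, num_sentences=3):
    """
    Summarizes the given text with a simplified, TextRank-inspired score:
    normalized word frequency plus summed word overlap with every other sentence
    (no iterative PageRank step).
    """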
    # Tokenize the text into sentences and words
    sentences = sent_tokenize(text)
    words = [word_tokenize(sentence.lower()) for sentence in sentences]

    # Remove stopwords and punctuation
    stop_words = set(stopwords.words('english') + list(punctuation))
    filtered_words = [[word for word in sentence if word not in stop_words] for sentence in words]

    # Create a dictionary to hold the word frequencies
    word_freq = defaultdict(int)
    for sentence in filtered_words:
        for word in sentence:
            word_freq[word] += 1

    # Calculate the sentence scores based on word frequencies and similarity
    sentence_scores = defaultdict(int)
    for i, sentence in enumerate(filtered_words):
        for word in sentence:
            sentence_scores[i] += word_freq[word] / sum(word_freq.values())
    for i, sentence in enumerate(filtered_words):
        for j, other_sentence in enumerate(filtered_words):
            if i == j:
                continue
            similarity = calculate_similarity(sentence, other_sentence)
            sentence_scores[i] += similarity

    # Sort the sentences by score and select the top ones
    top_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)[:num_sentences]
    top_sentences = [sentences[i] for i, score in top_sentences]

    # Combine the top sentences into a summary
    summary = ' '.join(top_sentences)

    return summary
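
The `summarize` function above scores every sentence in a single pass. TextRank as described by Mihalcea and Tarau instead treats sentences as nodes of a weighted graph and runs a PageRank-style iteration over it. The following is a minimal sketch of that idea, not the notebook's method: the helper names (`tokenize`, `overlap_similarity`, `textrank`, `textrank_summary`), the regex tokenizer, the fixed iteration count, and the `(1 - d) / n` normalization are all assumptions made for illustration.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def overlap_similarity(tokens_a, tokens_b):
    """Shared-word count normalized by the log of both sentence lengths."""
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0  # log(1) == 0 would make the denominator vanish
    shared = len(set(tokens_a) & set(tokens_b))
    return shared / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one PageRank-style score per sentence."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric edge weights between every pair of distinct sentences
    weights = [[0.0 if i == j else overlap_similarity(tokens[i], tokens[j])
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion
        # to the share of the neighbour's total edge weight it receives
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def textrank_summary(sentences, num_sentences=2):
    """Pick the top-scoring sentences and return them in document order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:num_sentences])]
```

Calling `textrank_summary(sent_tokenize(text), 3)` would plug this into the same NLTK tokenization used above; returning sentences in document order tends to read more naturally than ordering them by score.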

In [None]:
article = """
This is my first article on medium. Here, I’ll be giving a quick overview of what dimensionality reduction is, why we need it and how to do it. What is Dimensionality Reduction? Dimensionality reduction is simply, the process of reducing the dimension of your feature set. Your feature set could be a dataset with a hundred columns (i.e features) or it could be an array of points that make up a large sphere in the three-dimensional space. Dimensionality reduction is bringing the number of columns down to say, twenty or converting the sphere to a circle in the two-dimensional space. That is all well and good but why should we care? Why would we drop 80 columns off our dataset when we could straight up feed it to our machine learning algorithm and let it do the rest? The Curse of Dimensionality We care because the curse of dimensionality demands that we do. The curse of dimensionality refers to all the problems that arise when working with data in the higher dimensions, that did not exist in the lower dimensions. As the number of features increase, the number of samples also increases proportionally. The more features we have, the more number of samples we will need to have all combinations of feature values well represented in our sample. The Curse of Dimensionality As the number of features increases, the model becomes more complex. The more the number of features, the more the chances of overfitting. A machine learning model that is trained on a large number of features, gets increasingly dependent on the data it was trained on and in turn overfitted, resulting in poor performance on real data, beating the purpose. Avoiding overfitting is a major motivation for performing dimensionality reduction. The fewer features our training data has, the lesser assumptions our model makes and the simpler it will be. But that is not all and dimensionality reduction has a lot more advantages to offer, like Less misleading data means model accuracy improves. 
Less dimensions mean less computing. Less data means that algorithms train faster. Less data means less storage space required. Less dimensions allow usage of algorithms unfit for a large number of dimensions Removes redundant features and noise. Feature Selection and Feature Engineering for dimensionality reduction Dimensionality reduction could be done by both feature selection methods as well as feature engineering methods. Feature selection is the process of identifying and selecting relevant features for your sample. Feature engineering is manually generating new features from existing features, by applying some transformation or performing some operation on them. Feature selection can be done either manually or programmatically. For example, consider you are trying to build a model which predicts people’s weights and you have collected a large corpus of data which describes each person quite thoroughly. If you had a column that described the color of each person’s clothing, would that be much help in predicting their weight? I think we can safely agree it won’t be. This is something we can drop without further ado. What about a column that described their heights? That’s a definite yes. We can make these simple manual feature selections and reduce the dimensionality when the relevance or irrelevance of certain features are obvious or common knowledge. And when its not glaringly obvious, there are a lot of tools we could employ to aid our feature selection. Heatmaps that show the correlation between features is a good idea. So is just visualising the relationship between the features and the target variable by plotting each feature against the target variable. Now let us look at a few programmatic methods for feature selection from the popular machine learning library sci-kit learn, namely, Variance Threshold and Univariate selection. Variance Threshold is a baseline approach to feature selection. 
As the name suggests, it drops all features where the variance along the column does not exceed a threshold value. The premise is that a feature which doesn’t vary much within itself, has very little predictive power. >>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]] >>> selector = VarianceThreshold() >>> selector.fit_transform(X) array([[2, 0], [1, 4], [1, 1]]) Univariate Feature Selection uses statistical tests to select features. Univariate describes a type of data which consists of observations on only a single characteristic or attribute. Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable. Some examples of statistical tests that can be used to evaluate feature relevance are Pearson Correlation, Maximal information coefficient, Distance correlation, ANOVA and Chi-square. Chi-square is used to find the relationship between categorical variables and Anova is preferred when the variables are continuous. Scikit-learn exposes feature selection routines likes SelectKBest, SelectPercentile or GenericUnivariateSelect as objects that implement a transform method based on the score of anova or chi2 or mutual information. Sklearn offers f_regression and mutual_info_regression as the scoring functions for regression and f_classif and mutual_info_classif for classification. F-Test checks for and only captures linear relationships between features and labels. A highly correlated feature is given higher score and less correlated features are given lower score. Correlation is highly deceptive as it doesn’t capture strong non-linear relationships. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation. Feature selection is the simplest of dimensionality reduction methods. We will look at a few feature engineering methods for dimensionality reduction later. 
Linear Dimensionality Reduction Methods The most common and well known dimensionality reduction methods are the ones that apply linear transformations, like PCA (Principal Component Analysis) : Popularly used for dimensionality reduction in continuous data, PCA rotates and projects data along the direction of increasing variance. The features with the maximum variance are the principal components. Factor Analysis : a technique that is used to reduce a large number of variables into fewer numbers of factors. The values of observed data are expressed as functions of a number of possible causes in order to find which are the most important. The observations are assumed to be caused by a linear transformation of lower dimensional latent factors and added Gaussian noise. LDA (Linear Discriminant Analysis): projects data in a way that the class separability is maximised. Examples from same class are put closely together by the projection. Examples from different classes are placed far apart by the projection PCA orients data along the direction of the component with maximum variance whereas LDA projects the data to signify the class separability Non-linear Dimensionality Reduction Methods Non-linear transformation methods or manifold learning methods are used when the data doesn’t lie on a linear subspace. It is based on the manifold hypothesis which says that in a high dimensional structure, most relevant information is concentrated in small number of low dimensional manifolds. If a linear subspace is a flat sheet of paper, then a rolled up sheet of paper is a simple example of a nonlinear manifold. Informally, this is called a Swiss roll, a canonical problem in the field of non-linear dimensionality reduction.
"""

In [None]:
print ("The Actual length of the article is : ", len(article))

The Actual length of the article is :  7656


In [None]:
# Generating the summary
summary = summarize(article, num_sentences=3)

In [None]:
print ("The length of the summarized article is : ", len(summary))
summary

The length of the summarized article is :  340


'Feature Selection and Feature Engineering for dimensionality reduction Dimensionality reduction could be done by both feature selection methods as well as feature engineering methods. Feature selection is the simplest of dimensionality reduction methods. We will look at a few feature engineering methods for dimensionality reduction later.'

In [None]:
!pip install rouge



In [None]:
from rouge import Rouge
rouge = Rouge()
scores = rouge.get_scores(summary, article)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 1.000
Recall: 0.058
F1-Score: 0.109
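
The precision of 1.000 and recall of 0.058 above follow from how the score is set up: the summary is compared against the full article rather than against a human-written reference, so every word of an extractive summary is found in the "reference" while most of the article is missing from the summary. The sketch below illustrates this with a simplified unigram overlap. It is not the `rouge` package's implementation; the function name `rouge_1` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


source = "the cat sat on the mat while the dog slept near the door"
extract = "the cat sat on the mat"  # copied verbatim from the source

p, r, f = rouge_1(extract, source)
print("Precision: {:.3f}  Recall: {:.3f}  F1: {:.3f}".format(p, r, f))
# Precision: 1.000  Recall: 0.462  F1: 0.632
```

Scoring against an independent reference summary, when one is available, would make precision informative again.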


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]

    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = article
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)



score = sentence_bleu(reference_summary, predicted_summary)
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.814
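
The call above passes every sentence of the article as a separate reference and the whole summary as the hypothesis. BLEU is built on modified n-gram precision: each hypothesis n-gram is counted at most as often as it appears in any single reference. The sketch below shows only that clipping step, under the assumption of pre-tokenized input; the names `ngrams` and `modified_precision` are illustrative, and the brevity penalty and the geometric mean over n-gram orders that `sentence_bleu` also applies are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against several references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, the largest count seen in any one reference
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
hypothesis = ["the"] * 7
# "the" occurs at most twice in a single reference, so only 2 of 7 count
print(modified_precision(references, hypothesis, 1))
```

Because the references here are sentences of the same article the summary was extracted from, high n-gram precision is expected and says little about summary quality on its own.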


# Sentence-Ranking
**Sentence ranking** is a popular approach for text summarization, where sentences are scored based on their importance and the top-ranked sentences are selected to form the summary. Here are some pros and cons of using sentence ranking for text summarization:

### Pros:

* It is a simple and intuitive approach that can be easily implemented.
* It can handle different types of text, such as news articles, scientific papers, and social media posts.
* It can preserve the original structure of the text and provide a coherent summary.
* It can be combined with other techniques, such as sentence clustering and sentence compression, to improve the quality of summaries.
* It can be evaluated using standard metrics, such as ROUGE and BLEU, which allow for objective comparison with other summarization models.

### Cons:

* It can be sensitive to the choice of ranking algorithm and feature set, which can affect the quality of the summary.
* It may not capture the overall meaning of the text and may miss important information.
* It may generate redundant or repetitive information, especially when multiple sentences convey similar information.
* It may not handle text with complex syntax or domain-specific terminology well, which can lead to inaccuracies in the summary.
* It may not be able to generate summaries that are novel or creative, as it relies on the input text for content.

Overall, sentence ranking is a widely used and effective approach for text summarization, but its limitations should be considered when evaluating its performance and potential applications.
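
As a rough illustration of the score-then-select idea described above, the sketch below ranks sentences by the mean TF-IDF weight of their words and returns the best ones in document order. It is a minimal standard-library sketch under stated assumptions, not the implementation used later in this notebook: the function names, the regex tokenizer, and the unsmoothed `log(n / df)` inverse document frequency are all illustrative choices.

```python
import math
import re
from collections import Counter


def rank_sentences(sentences):
    """Score each sentence by the mean TF-IDF weight of its words."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        term_freq = Counter(tokens)
        total = sum(count * math.log(n / doc_freq[word])
                    for word, count in term_freq.items())
        scores.append(total / len(tokens))  # length-normalize the score
    return scores


def top_sentences(sentences, k):
    """Return the k best-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(best[:k])]


print(top_sentences(["the cat sat", "the cat ran", "the quantum computer"], 1))
```

Dividing by sentence length keeps long sentences from winning purely by containing more words, which is one of the feature-set choices the cons above refer to.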

These are the scores we achieved:

      ROUGE Score:
      Precision: 0.833
      Recall: 0.331
      F1-Score: 0.474

      BLEU Score: 0.556


Here are some research papers that use sentence ranking for text summarization:

1. "TextRank: Bringing Order into Texts" by R. Mihalcea and P. Tarau. This paper introduces the TextRank algorithm, which is a graph-based approach for sentence ranking and has been widely used for text summarization.

2. "Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization" by R. Mihalcea. This paper compares the performance of different graph-based ranking algorithms for extractive text summarization.

3. "Enhancing Sentence Extraction-Based Single-Document Summarization with Supervised Methods" by D. Das and A. Sarkar. This paper proposes a supervised learning approach for sentence ranking based on features such as sentence length, position, and similarity to the document title.

4. "A Neural Attention Model for Abstractive Sentence Summarization" by A. Rush et al. This paper uses a neural attention model for abstractive summarization: rather than ranking and extracting sentences, it generates a short summary word by word while attending over the input sentence. It serves as an abstractive contrast to the extractive approaches above.

These papers demonstrate the versatility and effectiveness of sentence ranking for text summarization, and highlight the potential for combining this approach with other techniques to improve the quality of summaries.

In [None]:
!pip install rouge
!pip install nltk
from rouge import Rouge
import nltk
import nltk.translate.bleu_score as bleu
nltk.download('stopwords')
nltk.download('punkt')
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
text ="""
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

In [None]:
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
#Preprocess the text
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
sentences = sent_tokenize(text.lower())
words = word_tokenize(text.lower())

filtered_words = []
for word in words:
    if word not in stop_words:
        stemmed_word = stemmer.stem(word)
        filtered_words.append(stemmed_word)

# Calculate the sentence scores
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

In [None]:
sentence_scores = []
vocabulary = vectorizer.vocabulary_  # maps each term to its column index in X
for i in range(len(sentences)):
    sentence_score = 0
    for word in filtered_words:
        # A dict lookup avoids rebuilding the feature-name array on every iteration
        if word in vocabulary:
            sentence_score += X[i, vocabulary[word]]
    sentence_scores.append(sentence_score)

# Sort the sentences
ranked_sentences = sorted(((sentence_scores[i], s) for i, s in enumerate(sentences)), reverse=True)

# Select the top N sentences (slicing is safe even if the text has fewer than N)
top_n = 3
selected_sentences = [sentence for _, sentence in ranked_sentences[:top_n]]


In [None]:
# Generate the summary
summary = " ".join(selected_sentences)
print(summary)

in summary, india's health ministry has announced that the country's covid-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. the decision was taken after a meeting of the national expert group on vaccine administration for covid-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in india. 
 india's health ministry has announced that the country's covid-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities.


In [None]:
rouge = Rouge()
scores = rouge.get_scores(summary, text)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 0.833
Recall: 0.331
F1-Score: 0.474


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]

    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

0.5559999307354189


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.556


# Luhn's Model

**The Luhn Model** is a statistical text summarization technique that selects the most relevant sentences based on the frequency of important words in the text. Here are some advantages and disadvantages of using the Luhn Model for text summarization:

### Pros:

* Easy to implement: The Luhn Model is a simple algorithm that is easy to implement and requires minimal computational resources.

* No training data needed: The Luhn Model does not require any training data, as it is based on a statistical analysis of the text.

* Good for extractive summarization: The Luhn Model is well-suited for extractive summarization, where the summary is generated by selecting the most relevant sentences from the original text.

* Largely language-independent: The Luhn Model relies only on word counts, so it can be applied to any language for which tokenization and a stop-word list are available.

### Cons:

* Limited to statistical analysis: The Luhn Model relies solely on a statistical analysis of the text and may not be able to capture the semantic meaning of the text.

* Limited context awareness: The Luhn Model does not consider the context in which the sentences are used, which can lead to the selection of irrelevant sentences.

* Over-reliance on word frequency: The Luhn Model relies heavily on word frequency, which may not always be an accurate indicator of the importance of a sentence.

* Limited to single document summarization: The Luhn Model is designed for single document summarization and may not work well for summarizing multiple documents or large sets of data.

These are the scores we achieved:

    ROUGE Score:
    Precision: 0.991
    Recall: 0.742
    F1-Score: 0.848

    BLEU Score: 0.700

## References

Here are some research papers related to Luhn's algorithm for text summarization:

1. "The automatic creation of literature abstracts" by H. P. Luhn, in IBM Journal of Research and Development (1958)

2. "Text summarization using Luhn's algorithm" by H. P. Luhn, in Information Retrieval Techniques for Speech Applications (1996)

3. "Experiments with Luhn's automatic summarizer" by T. F. Sumner, in Journal of the Association for Computing Machinery (1959)

4. "Combining Luhn's algorithm with latent semantic analysis for text summarization" by R. S. Kesavan and S. S. Iyengar, in Proceedings of the 2009 International Conference on Advances in Recent Technologies in Communication and Computing

These papers describe Luhn's original algorithm for text summarization, its limitations, and its extensions. The algorithm identifies the most frequent words in a document and selects the sentences that contain them. This approach is simple and can produce reasonable results, but it does not model the semantic relationships between words.

The later papers explore extensions to Luhn's algorithm, such as combining it with latent semantic analysis, to address some of these limitations and produce higher-quality summaries.
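
The scoring idea can be sketched as follows. In Luhn's formulation, significant words that occur close together inside a sentence form a cluster, and a cluster's score is the square of its number of significant words divided by the cluster's total length in words. This is a minimal sketch under stated assumptions, not the code used in the cells below: the function names, the tiny `STOP_WORDS` set, the `top_n` cutoff for significant words, and the `max_gap=4` threshold are illustrative choices.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "it",
              "of", "on", "that", "the", "this", "to", "was", "with"}


def significant_words(text, top_n=5):
    """Return the top_n most frequent non-stop words of the text."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    return {word for word, _ in Counter(words).most_common(top_n)}


def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words)^2 / cluster length."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # still inside the current cluster
        else:
            # Close the current cluster and start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


keywords = significant_words("vaccine drive vaccine drive vaccine hospital", top_n=2)
print(luhn_score("The vaccine drive expanded", keywords))
```

Sentences would then be ranked by `luhn_score` and the highest-scoring ones kept, ideally in their original order.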


In [None]:
from collections import Counter
from nltk.corpus import stopwords
!pip install scikit-learn
!pip install rouge
!pip install nltk
from rouge import Rouge
import nltk
import nltk.translate.bleu_score as bleu
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
def extract_keywords(text, n_keywords=10):
    # Tokenize the text
    tokens = text.lower().split()

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Calculate the frequency of each word
    freq = Counter(tokens)

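    # Assign scores to each word based on frequency and position.
    # Dict keys are unique, so each word keeps the score from its LAST
    # occurrence: words that reappear late in the text are weighted more.
    scores = {word: freq[word] * (i+1) for i, word in enumerate(tokens)}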

    # Sort the words by score and select the top n_keywords
    keywords = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n_keywords]

    # Return the top keywords
    return [keyword[0] for keyword in keywords]
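
Because dictionary keys are unique, the comprehension in `extract_keywords` keeps only the score computed at each word's last occurrence. The short example below reproduces that behaviour on a toy token list and contrasts it with a frequency-only ranking; the function names are hypothetical and exist only for this illustration.

```python
from collections import Counter


def position_weighted_scores(tokens):
    """Same scoring expression as in extract_keywords above."""
    freq = Counter(tokens)
    # Later occurrences overwrite earlier ones for the same word
    return {word: freq[word] * (i + 1) for i, word in enumerate(tokens)}


def frequency_only_keywords(tokens, n_keywords):
    """Alternative: rank words purely by how often they occur."""
    return [word for word, _ in Counter(tokens).most_common(n_keywords)]


tokens = ["vaccine", "drive", "vaccine"]
print(position_weighted_scores(tokens))
# "vaccine" last appears at index 2, so its score is 2 * (2 + 1) = 6
print(frequency_only_keywords(tokens, 1))
```

A frequency-only ranking is closer to Luhn's notion of significant words, whereas the position-weighted variant favours terms that recur near the end of the text.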

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
text = """
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

# Extract the top 3 keywords
keywords = extract_keywords(text, n_keywords=3)

# Print the keywords
print('Top keywords:', keywords)

Top keywords: ['vaccination', 'drive', 'million']


In [None]:
# Summarize the text using the top keywords
sentences = text.split('.')
summary = ''
for sentence in sentences:
    for keyword in keywords:
        if keyword in sentence.lower():
            summary += sentence.strip() + '. '
            break

# Print the summary
print('Summary:', summary)

Summary: India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges. The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in Indi

In [None]:
rouge = Rouge()
scores = rouge.get_scores(summary, text)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 0.991
Recall: 0.742
F1-Score: 0.848


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]

    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

0.7003175301310649


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.700


In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

def generate_summary_with_bart(text):
    # Load pre-trained BART model and tokenizer
    model_name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example text
input_text = """
The internet, a vast network of interconnected computers, has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) as a means of secure communication. This early network laid the foundation for what would eventually become the modern internet.In the 1970s and 1980s, the development of key technologies such as TCP/IP (Transmission Control Protocol/Internet Protocol) facilitated the growth of interconnected networks. TCP/IP enabled different types of computer networks to communicate with each other, making it possible to create a global network of networks. This period also saw the rise of personal computers, which played a significant role in popularizing the internet among individual users. The launch of the World Wide Web (WWW) in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet. The WWW allowed for the creation and sharing of information through web pages, making it accessible to a wider audience. Berners-Lee's invention of the first web browser, along with the introduction of HTML (HyperText Markup Language), provided the necessary tools for individuals and organizations to create and navigate websites.Throughout the 1990s, the internet experienced rapid growth. The advent of search engines like Yahoo! and Google transformed how users accessed information, making it easier to find relevant content. E-commerce also emerged during this period, with companies like Amazon and eBay pioneering online shopping and changing the retail landscape. Social networking sites, such as Friendster and MySpace, began to appear, laying the groundwork for the social media revolution.The 2000s saw the internet becoming an integral part of everyday life. 
The rise of high-speed broadband connections and wireless technology made internet access more widespread and convenient. Social media platforms like Facebook, Twitter, and Instagram connected people in unprecedented ways, allowing for instant communication and the sharing of personal experiences. Online video platforms like YouTube provided new avenues for content creation and consumption, giving rise to a new generation of digital influencers and content creators.The internet has also had a profound impact on various industries. In education, online learning platforms and digital resources have made education more accessible, enabling students to learn from anywhere in the world. In healthcare, telemedicine and health information systems have improved patient care and streamlined medical processes. The entertainment industry has been transformed by streaming services like Netflix and Spotify, which offer on-demand access to movies, TV shows, and music. The proliferation of mobile devices, such as smartphones and tablets, has further accelerated the internet's influence. Mobile apps have become a primary means of accessing information and services, from banking and shopping to social networking and entertainment. The integration of the internet into everyday objects, known as the Internet of Things (IoT), has led to the development of smart homes, connected cars, and wearable technology.Despite its many benefits, the internet has also presented challenges. Issues such as cybersecurity threats, privacy concerns, and the spread of misinformation have become increasingly prominent. Governments and organizations around the world are working to address these challenges through regulations, technological advancements, and public awareness campaigns.Looking ahead, the future of the internet holds exciting possibilities. 
The continued expansion of high-speed internet access and the development of new technologies, such as 5G networks and artificial intelligence, promise to further enhance connectivity and innovation. As the internet continues to evolve, it will undoubtedly shape the way we live, work, and interact in ways we have yet to imagine.
"""

# Generate summary using BART
summary_bart = generate_summary_with_bart(input_text)
print("Generated Summary (using BART):")
print(summary_bart)

Generated Summary (using BART):
The internet has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) The launch of the World Wide Web in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet.


BART (Bidirectional and Auto-Regressive Transformers): BART is another transformer-based model, developed by Facebook AI. It combines a bidirectional encoder with an autoregressive decoder, which makes it well suited to text generation tasks such as summarization. BART has been shown to perform well on summarization benchmarks such as CNN/Daily Mail and XSum, and the facebook/bart-large-cnn checkpoint used below is fine-tuned on CNN/Daily Mail.

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

def generate_summary_with_bart(text):
    # Load pre-trained BART model and tokenizer
    model_name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text; anything beyond 1024 tokens is truncated and never reaches the model
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example text
input_text = """
The internet, a vast network of interconnected computers, has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) as a means of secure communication. This early network laid the foundation for what would eventually become the modern internet.In the 1970s and 1980s, the development of key technologies such as TCP/IP (Transmission Control Protocol/Internet Protocol) facilitated the growth of interconnected networks. TCP/IP enabled different types of computer networks to communicate with each other, making it possible to create a global network of networks. This period also saw the rise of personal computers, which played a significant role in popularizing the internet among individual users. The launch of the World Wide Web (WWW) in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet. The WWW allowed for the creation and sharing of information through web pages, making it accessible to a wider audience. Berners-Lee's invention of the first web browser, along with the introduction of HTML (HyperText Markup Language), provided the necessary tools for individuals and organizations to create and navigate websites.Throughout the 1990s, the internet experienced rapid growth. The advent of search engines like Yahoo! and Google transformed how users accessed information, making it easier to find relevant content. E-commerce also emerged during this period, with companies like Amazon and eBay pioneering online shopping and changing the retail landscape. Social networking sites, such as Friendster and MySpace, began to appear, laying the groundwork for the social media revolution.The 2000s saw the internet becoming an integral part of everyday life. 
The rise of high-speed broadband connections and wireless technology made internet access more widespread and convenient. Social media platforms like Facebook, Twitter, and Instagram connected people in unprecedented ways, allowing for instant communication and the sharing of personal experiences. Online video platforms like YouTube provided new avenues for content creation and consumption, giving rise to a new generation of digital influencers and content creators.The internet has also had a profound impact on various industries. In education, online learning platforms and digital resources have made education more accessible, enabling students to learn from anywhere in the world. In healthcare, telemedicine and health information systems have improved patient care and streamlined medical processes. The entertainment industry has been transformed by streaming services like Netflix and Spotify, which offer on-demand access to movies, TV shows, and music. The proliferation of mobile devices, such as smartphones and tablets, has further accelerated the internet's influence. Mobile apps have become a primary means of accessing information and services, from banking and shopping to social networking and entertainment. The integration of the internet into everyday objects, known as the Internet of Things (IoT), has led to the development of smart homes, connected cars, and wearable technology.Despite its many benefits, the internet has also presented challenges. Issues such as cybersecurity threats, privacy concerns, and the spread of misinformation have become increasingly prominent. Governments and organizations around the world are working to address these challenges through regulations, technological advancements, and public awareness campaigns.Looking ahead, the future of the internet holds exciting possibilities. 
The continued expansion of high-speed internet access and the development of new technologies, such as 5G networks and artificial intelligence, promise to further enhance connectivity and innovation. As the internet continues to evolve, it will undoubtedly shape the way we live, work, and interact in ways we have yet to imagine.
"""

# Generate summary using BART
summary_bart = generate_summary_with_bart(input_text)
print("Generated Summary (using BART):")
print(summary_bart)

Generated Summary (using BART):
The internet has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) The launch of the World Wide Web in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet.


In [None]:
# len() on a str counts characters, not words or tokens
print("The Actual length of the summary_bart is : ", len(summary_bart))

The Actual length of the summary_bart is :  388
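
The figure printed above is a character count, whereas the min_length and max_length arguments passed to model.generate are measured in tokens, so the two are not directly comparable. The sketch below shows the difference between counting characters and counting whitespace-separated words; the helper name length_report and the sample sentence are illustrative assumptions, not part of the original notebook. A token count would additionally require running the BART tokenizer over the text.

```python
def length_report(text: str) -> dict:
    """Return character and whitespace-word counts for a piece of text."""
    return {
        "characters": len(text),       # what len(summary_bart) reports
        "words": len(text.split()),    # rough word count, split on whitespace
    }

# Illustrative sentence (an assumption, not output from the model)
sample = "The internet has revolutionized communication."
report = length_report(sample)
print(report)
```

Reporting both numbers side by side makes it easier to relate a summary's size to the generation settings.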


Text 1 Example

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
from rouge import Rouge

def generate_summary_with_bart(text):
    # Load pre-trained BART model and tokenizer
    model_name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example text
input_text = """
The internet, a vast network of interconnected computers, has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) as a means of secure communication. This early network laid the foundation for what would eventually become the modern internet.In the 1970s and 1980s, the development of key technologies such as TCP/IP (Transmission Control Protocol/Internet Protocol) facilitated the growth of interconnected networks. TCP/IP enabled different types of computer networks to communicate with each other, making it possible to create a global network of networks. This period also saw the rise of personal computers, which played a significant role in popularizing the internet among individual users. The launch of the World Wide Web (WWW) in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet. The WWW allowed for the creation and sharing of information through web pages, making it accessible to a wider audience. Berners-Lee's invention of the first web browser, along with the introduction of HTML (HyperText Markup Language), provided the necessary tools for individuals and organizations to create and navigate websites.Throughout the 1990s, the internet experienced rapid growth. The advent of search engines like Yahoo! and Google transformed how users accessed information, making it easier to find relevant content. E-commerce also emerged during this period, with companies like Amazon and eBay pioneering online shopping and changing the retail landscape. Social networking sites, such as Friendster and MySpace, began to appear, laying the groundwork for the social media revolution.The 2000s saw the internet becoming an integral part of everyday life. 
The rise of high-speed broadband connections and wireless technology made internet access more widespread and convenient. Social media platforms like Facebook, Twitter, and Instagram connected people in unprecedented ways, allowing for instant communication and the sharing of personal experiences. Online video platforms like YouTube provided new avenues for content creation and consumption, giving rise to a new generation of digital influencers and content creators.The internet has also had a profound impact on various industries. In education, online learning platforms and digital resources have made education more accessible, enabling students to learn from anywhere in the world. In healthcare, telemedicine and health information systems have improved patient care and streamlined medical processes. The entertainment industry has been transformed by streaming services like Netflix and Spotify, which offer on-demand access to movies, TV shows, and music. The proliferation of mobile devices, such as smartphones and tablets, has further accelerated the internet's influence. Mobile apps have become a primary means of accessing information and services, from banking and shopping to social networking and entertainment. The integration of the internet into everyday objects, known as the Internet of Things (IoT), has led to the development of smart homes, connected cars, and wearable technology.Despite its many benefits, the internet has also presented challenges. Issues such as cybersecurity threats, privacy concerns, and the spread of misinformation have become increasingly prominent. Governments and organizations around the world are working to address these challenges through regulations, technological advancements, and public awareness campaigns.Looking ahead, the future of the internet holds exciting possibilities. 
The continued expansion of high-speed internet access and the development of new technologies, such as 5G networks and artificial intelligence, promise to further enhance connectivity and innovation. As the internet continues to evolve, it will undoubtedly shape the way we live, work, and interact in ways we have yet to imagine.
"""

# Generate summary using BART
summary_bart = generate_summary_with_bart(input_text)
print("Generated Summary (using BART):")
print(summary_bart)

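# Evaluate the summary using ROUGE-1. The reference here is the source text,
# not a human-written summary, so precision shows how much of the summary is copied.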
rouge = Rouge()
scores = rouge.get_scores(summary_bart, input_text, avg=True)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores['rouge-1']['p']))
print("Recall: {:.3f}".format(scores['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores['rouge-1']['f']))


Generated Summary (using BART):
The internet has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) The launch of the World Wide Web in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet.
ROUGE Score:
Precision: 1.000
Recall: 0.140
F1-Score: 0.245
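
A precision of 1.000 with low recall is what one would expect when a short, mostly extractive summary is scored against its own source text: every summary word appears in the source, but the summary covers only a small share of the source's words. The sketch below illustrates this with a simplified, count-based unigram overlap. The function rouge1 and the toy strings are assumptions for illustration; the rouge package's own tokenization and n-gram handling may differ, so the numbers will not match its output exactly.

```python
from collections import Counter

def rouge1(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap on lowercased whitespace tokens."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both texts
    overlap = sum((hyp & ref).values())
    p = overlap / max(sum(hyp.values()), 1)   # share of summary words found in the reference
    r = overlap / max(sum(ref.values()), 1)   # share of reference words covered by the summary
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}

# Toy example: the "summary" is copied verbatim from the start of the "source"
source = "the cat sat on the mat while the dog slept by the door"
summary = "the cat sat on the mat"
scores = rouge1(summary, source)
print("Precision: {:.3f}".format(scores["p"]))
print("Recall: {:.3f}".format(scores["r"]))
print("F1-Score: {:.3f}".format(scores["f"]))
```

Scoring against a human-written reference summary, where one is available, would make precision informative as well.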


Text 2 Example

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
from rouge import Rouge

def generate_summary_with_bart(text):
    # Load pre-trained BART model and tokenizer
    model_name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text; anything beyond 1024 tokens is truncated and never reaches the model
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example text
input_text = """
This is my first article on medium. Here, I’ll be giving a quick overview of what dimensionality reduction is, why we need it and how to do it. What is Dimensionality Reduction? Dimensionality reduction is simply, the process of reducing the dimension of your feature set. Your feature set could be a dataset with a hundred columns (i.e features) or it could be an array of points that make up a large sphere in the three-dimensional space. Dimensionality reduction is bringing the number of columns down to say, twenty or converting the sphere to a circle in the two-dimensional space. That is all well and good but why should we care? Why would we drop 80 columns off our dataset when we could straight up feed it to our machine learning algorithm and let it do the rest? The Curse of Dimensionality We care because the curse of dimensionality demands that we do. The curse of dimensionality refers to all the problems that arise when working with data in the higher dimensions, that did not exist in the lower dimensions. As the number of features increase, the number of samples also increases proportionally. The more features we have, the more number of samples we will need to have all combinations of feature values well represented in our sample. The Curse of Dimensionality As the number of features increases, the model becomes more complex. The more the number of features, the more the chances of overfitting. A machine learning model that is trained on a large number of features, gets increasingly dependent on the data it was trained on and in turn overfitted, resulting in poor performance on real data, beating the purpose. Avoiding overfitting is a major motivation for performing dimensionality reduction. The fewer features our training data has, the lesser assumptions our model makes and the simpler it will be. But that is not all and dimensionality reduction has a lot more advantages to offer, like Less misleading data means model accuracy improves. 
Less dimensions mean less computing. Less data means that algorithms train faster. Less data means less storage space required. Less dimensions allow usage of algorithms unfit for a large number of dimensions Removes redundant features and noise. Feature Selection and Feature Engineering for dimensionality reduction Dimensionality reduction could be done by both feature selection methods as well as feature engineering methods. Feature selection is the process of identifying and selecting relevant features for your sample. Feature engineering is manually generating new features from existing features, by applying some transformation or performing some operation on them. Feature selection can be done either manually or programmatically. For example, consider you are trying to build a model which predicts people’s weights and you have collected a large corpus of data which describes each person quite thoroughly. If you had a column that described the color of each person’s clothing, would that be much help in predicting their weight? I think we can safely agree it won’t be. This is something we can drop without further ado. What about a column that described their heights? That’s a definite yes. We can make these simple manual feature selections and reduce the dimensionality when the relevance or irrelevance of certain features are obvious or common knowledge. And when its not glaringly obvious, there are a lot of tools we could employ to aid our feature selection. Heatmaps that show the correlation between features is a good idea. So is just visualising the relationship between the features and the target variable by plotting each feature against the target variable. Now let us look at a few programmatic methods for feature selection from the popular machine learning library sci-kit learn, namely, Variance Threshold and Univariate selection. Variance Threshold is a baseline approach to feature selection. 
As the name suggests, it drops all features where the variance along the column does not exceed a threshold value. The premise is that a feature which doesn’t vary much within itself, has very little predictive power. >>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]] >>> selector = VarianceThreshold() >>> selector.fit_transform(X) array([[2, 0], [1, 4], [1, 1]]) Univariate Feature Selection uses statistical tests to select features. Univariate describes a type of data which consists of observations on only a single characteristic or attribute. Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable. Some examples of statistical tests that can be used to evaluate feature relevance are Pearson Correlation, Maximal information coefficient, Distance correlation, ANOVA and Chi-square. Chi-square is used to find the relationship between categorical variables and Anova is preferred when the variables are continuous. Scikit-learn exposes feature selection routines likes SelectKBest, SelectPercentile or GenericUnivariateSelect as objects that implement a transform method based on the score of anova or chi2 or mutual information. Sklearn offers f_regression and mutual_info_regression as the scoring functions for regression and f_classif and mutual_info_classif for classification. F-Test checks for and only captures linear relationships between features and labels. A highly correlated feature is given higher score and less correlated features are given lower score. Correlation is highly deceptive as it doesn’t capture strong non-linear relationships. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation. Feature selection is the simplest of dimensionality reduction methods. We will look at a few feature engineering methods for dimensionality reduction later. 
Linear Dimensionality Reduction Methods The most common and well known dimensionality reduction methods are the ones that apply linear transformations, like PCA (Principal Component Analysis) : Popularly used for dimensionality reduction in continuous data, PCA rotates and projects data along the direction of increasing variance. The features with the maximum variance are the principal components. Factor Analysis : a technique that is used to reduce a large number of variables into fewer numbers of factors. The values of observed data are expressed as functions of a number of possible causes in order to find which are the most important. The observations are assumed to be caused by a linear transformation of lower dimensional latent factors and added Gaussian noise. LDA (Linear Discriminant Analysis): projects data in a way that the class separability is maximised. Examples from same class are put closely together by the projection. Examples from different classes are placed far apart by the projection PCA orients data along the direction of the component with maximum variance whereas LDA projects the data to signify the class separability Non-linear Dimensionality Reduction Methods Non-linear transformation methods or manifold learning methods are used when the data doesn’t lie on a linear subspace. It is based on the manifold hypothesis which says that in a high dimensional structure, most relevant information is concentrated in small number of low dimensional manifolds. If a linear subspace is a flat sheet of paper, then a rolled up sheet of paper is a simple example of a nonlinear manifold. Informally, this is called a Swiss roll, a canonical problem in the field of non-linear dimensionality reduction.
"""

# Generate summary using BART
summary_bart = generate_summary_with_bart(input_text)
print("Generated Summary (using BART):")
print(summary_bart)

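# Evaluate the summary using ROUGE-1. The reference here is the source text,
# not a human-written summary, so precision shows how much of the summary is copied.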
rouge = Rouge()
scores = rouge.get_scores(summary_bart, input_text, avg=True)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores['rouge-1']['p']))
print("Recall: {:.3f}".format(scores['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores['rouge-1']['f']))


Generated Summary (using BART):
 Dimensionality reduction is the process of reducing the dimension of your feature set. As the number of features increases, the model becomes more complex. Less misleading data means model accuracy improves. Less data means that algorithms train faster. Less dimensions allow usage of algorithms unfit for a large number of dimensions. Removes redundant features and noise.
ROUGE Score:
Precision: 1.000
Recall: 0.077
F1-Score: 0.143
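
The generated summary draws only on the opening part of the article. One likely contributor is the tokenizer call, which truncates the input at 1024 tokens, and this article appears to be longer than that. A common workaround is to split a long text into chunks, summarize each chunk, and join the results. The sketch below shows only the chunking logic; chunk_words, summarize_long, and the 400-word budget are hypothetical choices for illustration, and the summarize argument stands in for a function such as generate_summary_with_bart.

```python
from typing import Callable, List

def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text: str, summarize: Callable[[str], str], max_words: int = 400) -> str:
    """Summarize each chunk separately and join the partial summaries in order."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))

# Stand-in summarizer (an assumption for the demo): keeps the first word of each chunk
demo = summarize_long("a b c d e", summarize=lambda chunk: chunk.split()[0], max_words=2)
print(demo)
```

Splitting on a fixed word count can cut sentences in half; splitting on sentence or paragraph boundaries would be a gentler variant of the same idea.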


In [None]:
# len() on a str counts characters, not words or tokens
print("The Actual length of the summary_bart is : ", len(summary_bart))

The Actual length of the summary_bart is :  374


Text 3 and 4 Example

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
from rouge import Rouge

def generate_summary_with_bart(text):
    # Load pre-trained BART model and tokenizer
    model_name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example text
input_text = """
Innocent Interpretations for Some Suspicious Statistics; General Election Data Exploration. (part 1) Looking at the 2019 elections in Israel. Some results appear weird, sure, but is there evidence of actual malfeasance, or is there a simpler explanation? Avishalom Shalit · Follow Published in Towards Data Science · 5 min read · Apr 22, 2019 -- Listen Share My favourite analogy in statistics(made by Cassie Kozyrkov) is the analogy to the English/American legal system. The “Null Hypothesis” being the presumption of innocence (leading to acquittal) and the rejection of that can only be due to the presentation of evidence of guilt “beyond a reasonable doubt.” the P value we select is that level of beyond a reasonable doubt. (which is different depending on the issue at stake) Our story begins with a post on social media. The following “anomaly” in a specific polling station was making the rounds. All the parties got votes that were “Very Round Numbers”™ These are indeed suspicious looking. OK, let’s dive in. TL;DR and disclaimer In this multipart series, in some cases I will be able to present simple, innocent mathematical explanations for ostensibly suspicious results. In other cases I won’t. Note that there are more than 10,000 ballot boxes, so even a rare numerical anomaly is expected to crop up several times. Getting Started I started looking for other anomalies. Maybe round numbers are too obvious, but what are the odds that a party will get exactly a third of the votes? Half? Histogram Time True Data: Here is a histogram of the vote fraction of a specific party in all polling stations; Note the most common values are reduced small rationals: 1/2, 1/3, 3/7, 1/4, 2/5, 1/5, 4/9. weird, right? Foul play?!? Did we “get them”? Before I answer, can you think of an explanation of why exactly 1/2 is so popular? [Take a minute] [beat] Well, small rational fractions have a lot more going for them. 
You can get EXACTLY 1/3 by having 100 out of 300, 101 of 303, etc. but to get 200/601, well, you must have a very specific vote tally, and a very specific vote count for that party. So those values aren’t as common. Is this enough to explain the oddity? Here are the results of a simple model (we pick random numbers for the total number of votes in the polling station, and number of votes for a specific party, given that party’s national total) The small rationals making an appearance again, even in a random model. (described in appendix B) Non damning histogram So, now that we have a simple explanation for these rational numbers in the data, what would be a good way to look at our data? How about if we let the histogram do what it wanted to from the start, which was to bin the data?If we take 100 bins, we can see if bin 50 is a lot more likely than bin 51 or 49 and thus judge. Binning makes the rational anomalies disappear. Well, it looks like we don’t have enough evidence to convict. In the world where the Null hypothesis is true (no fudging) the evidence presented is actually quite reasonable and not at all surprising. That is not to say that there weren’t instances of a single tally that was fudged and set to exactly 1/2 of the votes for a specific party, it is just not the conspiracy that it appeared to have been (with dozens of polling stations agreeing on the same ratio.) Next up This post will be first in a series. In the next weeks I will explore other anomalies. I’ll revisit the anomaly that started it all, and I’ll explore some weird rational relationships between parties. (i.e. one party getting 1/2 or twice that of another party) If you’ve noticed some other weird lines you’d like me to look at, comment and I’ll have a look. Get Ready Whet your appetite on this row of vote tallies (from one polling station). Note the recurring rational relationships. e.g. 25–25–75–150 ; 27–135 it also has the numbers 21,23,24,25,25,26,27. 
Suspicious, sure. Guilty? Is that weird enough to be suspect or just a coincidence? find out next week. Appendix A — code to load data. The results of the tallies per polling station are in here https://media21.bechirot.gov.il/files/expb.csv Here is some boilerplate python for loading that file (11K rows), and renaming the columns to English, So you can see for yourself, and find some more suspicious results. Also includes the code that generates the histograms above. Appendix B simplistic model Fitting a normal distribution to the number of valid votes per polling station (centred at 400 with sigma=100, ignoring the peak at 800) and a poisson distribution for the number of votes per party given the national party fractions, picking lambda from a distribution that matches the national vote fraction per ballot for that party. This is a bit simplistic I know, e.g. a PCA accounting for vote trends for different parties in different municipalities would be better. But this is good enough for now, it produces a viable reason for all those small rational numbers.
"""

# Generate summary using BART
summary_bart = generate_summary_with_bart(input_text)
print("Generated Summary (using BART):")
print(summary_bart)

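# Evaluate the summary using ROUGE-1. The reference here is the source text,
# not a human-written summary, so precision shows how much of the summary is copied.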
rouge = Rouge()
scores = rouge.get_scores(summary_bart, input_text, avg=True)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores['rouge-1']['p']))
print("Recall: {:.3f}".format(scores['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores['rouge-1']['f']))


Generated Summary (using BART):
Innocent Interpretations for Some Suspicious Statistics; General Election Data Exploration. (part 1) Looking at the 2019 elections in Israel. Some results appear weird, sure, but is there evidence of actual malfeasance, or is there a simpler explanation? Avishalom Shalit · Follow Published in Towards Data Science.
ROUGE Score:
Precision: 1.000
Recall: 0.092
F1-Score: 0.168


In [None]:
# len() on a str counts characters, not words or tokens
print("The Actual length of the summary_bart is : ", len(summary_bart))

The Actual length of the summary_bart is :  316


Text 5 Example

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
from rouge import Rouge

def generate_summary_with_bart(text):
    # Load pre-trained BART model and tokenizer
    model_name = "facebook/bart-large-cnn"
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize input text
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56, early_stopping=True)

    # Decode the generated summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example text
input_text = """
10 Reads for Data Scientists Getting Started with Business Models If you’re getting started with data science, you’re probably focusing your attention on mostly stats and coding. There’s nothing wrong with this, in fact, this is the right move — these are essential skills that you need to develop early on in your journey. With this being said, the biggest knowledge gap that I’ve encountered during my data science journey doesn’t deal with either of these areas. Instead, upon starting my first full-time role as a data scientist, I realized, to my surprise, that I didn’t really understand business. I suspect that this is a common theme. If you studied a technical field in college or picked things up using online courses, it’s unlikely that you ever had any reason to deep dive into business concepts like models, strategy, or important metrics. Adding on to this, I didn’t really come across data science interviews that stress-tested this type of understanding. Plenty of them tried to get a sense of product intuition, but I found that it rarely went beyond that. The fact is that business understanding isn’t taught or evangelized in the data science community to the extent that it’s used in practice. The goal of this post is to help bridge this gap by sharing some of the resources that I found most helpful as I got up to speed on how businesses work from the inside-out. This article from Andreessen Horowitz is a great place to start if you’re trying to get familiar with the slew of metrics and acronyms that get thrown around in a business, whether it’s a startup or not. On a more general note, their posts are consistently high-quality and are almost always worth your time. If you have a larger appetite, check out their follow-up post on 16 more metrics and the thread below for some additional tips on metrics. Some helpful tips on misleading metrics An overall solid resource, the articles at FourWeekMBA are worth exploring at some point. 
I particularly recommend this for an overview of all the different business models out there. It’s hard to come away from this without learning something new. For a more practical dive into business models, I also found this post going over how Slack makes money interesting. This one is a bit denser than the previous two, but it’s really excellent. The unmissable Ben Thompson from Stratechery goes over how markets work and why certain companies are dominating their industry. The takeaway from this post is that markets have three components, and the companies that can monopolize two of the three typically win out in a big way. Think Netflix. A lot of what we’ve seen so far has been conceptual, so let’s look at a specific model and analyze why it does and doesn’t work. Another one of my favorite business writers out there, Andrew Chen looks at the dating industry and why most investors don’t find it attractive. Other great essays from the venture capitalist commonly cover things like growth and metrics. More from Ben Thompson, here’s another great essay from him. This time on how large companies, particularly Facebook and Google, process data from its raw form to something uniquely valuable. Published in Fall 2018, this provides a good early look into the business side of all of the data privacy and regulation concerns we’re seeing now. If you’re not familiar with LTV (lifetime value), then you’ll probably have to get familiar with it at some point. There’s plenty of resources out there regarding the metric, but this is probably my favorite go-to on the subject. It clearly explains how to calculate LTV, and why you should think twice before you blindly buy into it without context. This short post focuses on the SaaS (software as a service) business model. The basic idea is outlined quite simply in the picture below, but I’d still recommend you take the time to read the full write-up. 
Christoph Janz really does an excellent job of taking a complex question and breaking it down. He also recently updated the chart in a new post. Co-founder and former CEO of StackOverflow, Joel Spolsky hammers home a crucial part-business, part-economics lesson here: Smart companies try to commoditize their products' complements. Whether they succeed or not is a very different story, shown here with plenty of examples. We covered a few ways that companies can make money, but this resource takes the most simplistic (and still accurate) approach. It all started with Jim Barksdale at a trade show. As he was heading out the door to catch a flight, he left the audience with one last pearl of wisdom before departing, one that sums up the post quite nicely. "Gentlemen, there's only two ways I know of to make money: bundling and unbundling." Last but not least, if you want to take things a step further, I recommend case studies. You can find a ton of them out there from top universities like Stanford and Harvard for cheap or often no cost at all. Once you have a grasp on the fundamentals, this an excellent way to continue to supplement your learning. This is where I'm currently at - I've challenged myself to take on one case study every two weeks over the summer. Join me on the ride! Wrapping Up That does it for the list. I know all of the above links really helped me out and I hope you take the time to explore them. As you might have noticed, not all of them tie into the day-to-day life of a data scientist - that's intentional. I said this in my last post, I'll say it again - data scientists are thinkers. We do our best work when we understand the systems that surround us. This understanding is what sets us up for the cool stuff: exploratory analysis, machine learning, and data visualization. Lay the foundation first and reap the benefits later. That's what it's all about. 
The resources selected above were heavily influenced by SVP of Strategy at Squarespace, Andrew Bartholomew's reading list.
"""

# Generate summary using BART
summary_bart = generate_summary_with_bart(input_text)
print("Generated Summary (using BART):")
print(summary_bart)

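# Evaluate the summary using ROUGE-1.
# Note: the reference passed here is the full source article, not a human-written
# summary, so precision is close to 1 and recall is low by construction.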
rouge = Rouge()
scores = rouge.get_scores(summary_bart, input_text, avg=True)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores['rouge-1']['p']))
print("Recall: {:.3f}".format(scores['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores['rouge-1']['f']))


Generated Summary (using BART):
10 Reads for Data Scientists Getting Started with Business Models. The goal of this post is to help bridge this gap by sharing some of the resources that I found most helpful as I got up to speed on how businesses work from the inside-out. We covered a few ways that companies can make money, but this resource takes the most simplistic approach.
ROUGE Score:
Precision: 1.000
Recall: 0.102
F1-Score: 0.184
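
The precision of 1.000 next to a recall of 0.102 follows from scoring an extractive summary against its own source text. The sketch below illustrates this with a hand-rolled ROUGE-1 based on clipped unigram counts. The function name `rouge1` and the whitespace tokenization are assumptions made for illustration; the `rouge` package may tokenize and count differently, so its numbers will not match this sketch exactly.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}


# Toy data: the "summary" is copied verbatim from the start of the "source".
source = "the cat sat on the mat and the dog sat on the rug"
extract = "the cat sat on the mat"
scores = rouge1(extract, source)
print("Precision: {:.3f}".format(scores["p"]))
print("Recall: {:.3f}".format(scores["r"]))
print("F1-Score: {:.3f}".format(scores["f"]))
```

Every word of the extract appears in the source, so precision is 1, while recall only reflects how much of the source the extract covers. Scoring against a human-written reference summary would give a more informative comparison between models.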


In [None]:
# len() on a string counts characters, not words or tokens
print("The Actual length of the summary_bart is : ", len(summary_bart))

The Actual length of the summary_bart is :  346


In [None]:
import time
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the model and tokenizer
model_name = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example text
text = (
    "The internet, a vast network of interconnected computers, has revolutionized communication, information dissemination, and numerous aspects of daily life. Its origins date back to the 1960s when the United States Department of Defense developed ARPANET (Advanced Research Projects Agency Network) as a means of secure communication. This early network laid the foundation for what would eventually become the modern internet. In the 1970s and 1980s, the development of key technologies such as TCP/IP (Transmission Control Protocol/Internet Protocol) facilitated the growth of interconnected networks. TCP/IP enabled different types of computer networks to communicate with each other, making it possible to create a global network of networks. This period also saw the rise of personal computers, which played a significant role in popularizing the internet among individual users. The launch of the World Wide Web (WWW) in 1991 by British scientist Tim Berners-Lee marked a major milestone in the evolution of the internet. The WWW allowed for the creation and sharing of information through web pages, making it accessible to a wider audience. Berners-Lee's invention of the first web browser, along with the introduction of HTML (HyperText Markup Language), provided the necessary tools for individuals and organizations to create and navigate websites. Throughout the 1990s, the internet experienced rapid growth. The advent of search engines like Yahoo! and Google transformed how users accessed information, making it easier to find relevant content. E-commerce also emerged during this period, with companies like Amazon and eBay pioneering online shopping and changing the retail landscape. Social networking sites, such as Friendster and MySpace, began to appear, laying the groundwork for the social media revolution. The 2000s saw the internet becoming an integral part of everyday life. "
    "The rise of high-speed broadband connections and wireless technology made internet access more widespread and convenient. Social media platforms like Facebook, Twitter, and Instagram connected people in unprecedented ways, allowing for instant communication and the sharing of personal experiences. Online video platforms like YouTube provided new avenues for content creation and consumption, giving rise to a new generation of digital influencers and content creators. The internet has also had a profound impact on various industries. In education, online learning platforms and digital resources have made education more accessible, enabling students to learn from anywhere in the world. In healthcare, telemedicine and health information systems have improved patient care and streamlined medical processes. The entertainment industry has been transformed by streaming services like Netflix and Spotify, which offer on-demand access to movies, TV shows, and music. The proliferation of mobile devices, such as smartphones and tablets, has further accelerated the internet's influence. Mobile apps have become a primary means of accessing information and services, from banking and shopping to social networking and entertainment. The integration of the internet into everyday objects, known as the Internet of Things (IoT), has led to the development of smart homes, connected cars, and wearable technology. Despite its many benefits, the internet has also presented challenges. Issues such as cybersecurity threats, privacy concerns, and the spread of misinformation have become increasingly prominent. Governments and organizations around the world are working to address these challenges through regulations, technological advancements, and public awareness campaigns. Looking ahead, the future of the internet holds exciting possibilities. "
    "The continued expansion of high-speed internet access and the development of new technologies, such as 5G networks and artificial intelligence, promise to further enhance connectivity and innovation. As the internet continues to evolve, it will undoubtedly shape the way we live, work, and interact in ways we have yet to imagine."
)

# Measure time for data preprocessing (tokenization).
# Inputs longer than 512 tokens are truncated, so only the beginning of the text reaches the model.
start_time = time.perf_counter()
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
end_time = time.perf_counter()
preprocessing_time = end_time - start_time
print(f"Time taken for data preprocessing: {preprocessing_time:.4f} seconds")

# Measure time for model inference (beam search with 4 beams)
start_time = time.perf_counter()
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
end_time = time.perf_counter()
inference_time = end_time - start_time
print(f"Time taken for model inference: {inference_time:.4f} seconds")

# Decode the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Time taken for data preprocessing: 0.0166 seconds
Time taken for model inference: 12.9731 seconds
Summary: the internet, a vast network of interconnected computers, has revolutionized communication, information dissemination, and numerous aspects of daily life. it was developed in the 1960s when the united states department of defense developed ARPANET (Advanced Research Projects Agency Network) as a means of secure communication. in the 1970s and 1980s, the development of key technologies such as TCP/IP facilitated the growth of interconnected networks.
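
The summary above only covers the opening of the text, which is consistent with the input being truncated to 512 tokens before generation. One possible workaround, sketched below, is to split the text into sentence-aligned chunks, summarize each chunk, and join the partial summaries. This is a minimal sketch under stated assumptions, not what the notebook does: `chunk_sentences` and `summarize_long` are hypothetical helper names, the word budget is only a rough proxy for the tokenizer's token count, and `summarize_fn` stands in for whatever model call is used (for example, a wrapper around the T5 `generate` call above).

```python
from typing import Callable, List


def chunk_sentences(text: str, max_words: int = 300) -> List[str]:
    """Greedily pack sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(". ".join(current) + ".")
            current, count = [], 0
        current.append(sentence.rstrip("."))
        count += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks


def summarize_long(text: str, summarize_fn: Callable[[str], str], max_words: int = 300) -> str:
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_sentences(text, max_words))


# Toy demonstration with a stub "summarizer" that keeps the first word of each chunk.
demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = chunk_sentences(demo, max_words=6)
print(chunks)
print(summarize_long(demo, lambda chunk: chunk.split()[0], max_words=6))
```

Summarizing chunks independently loses context that spans chunk boundaries, so the joined output may read less coherently than a single-pass summary; it also multiplies the inference time by the number of chunks.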
