# BERT Extractive Summarization
**BERT extractive summarization** is a text summarization technique that uses a pre-trained BERT model to select the most important sentences or phrases from a document and combine them into a shorter summary. Here are some potential pros and cons of using this technique for news articles:

### Pros:

* Accuracy: BERT extractive summarization models can produce high-quality summaries. Because every sentence is copied verbatim from the article, the summary cannot state anything the article does not, and the models are often effective at capturing an article's main points.

* Efficiency: BERT extractive summarization can help news agencies or journalists save time by quickly generating summaries of articles, making it easier to scan large amounts of information.

* Scalability: As a machine-learning technique, BERT extractive summarization can be applied to large volumes of text with relative ease, making it a scalable solution for news organizations.

### Cons:

* Limited Content: As extractive summarization only uses portions of the original article, it can be difficult to include context and other important details that might not be present in the selected sentences. This can lead to a loss of nuance and a less comprehensive understanding of the original article.

* Dependence on Pre-Trained Models: BERT extractive summarization relies on pre-trained deep learning models, which are not always fine-tuned for specific tasks such as summarizing news articles. This can lead to biases or inaccuracies in the resulting summary.

* Difficulty with Non-Standard Content: BERT extractive summarization can struggle with non-standard text, such as text that contains a lot of jargon or technical terms that are uncommon in everyday language. In these cases, a more traditional human-driven summarization approach might be more effective.
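
The extract-only constraint behind these trade-offs can be illustrated without BERT. The sketch below is a deliberately simple stand-in: it scores sentences by average word frequency instead of using BERT embeddings, and the function names and the sample `article` are hypothetical. The point is only that the output consists of sentences copied verbatim from the input.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_summary(text, num_sentences=2):
    sentences = split_sentences(text)
    # Word frequencies over the whole document (lowercased).
    freq = Counter(re.findall(r"[a-z0-9']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    # Rank sentences by score, keep the best ones, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


# Hypothetical three-sentence article used only for illustration.
article = (
    "The vaccination drive was expanded. "
    "The drive now covers the elderly. "
    "Officials met on Monday."
)
summary = extract_summary(article, num_sentences=2)
print(summary)
```

Whatever scoring function is used, anything not contained in the selected sentences is lost, which is the source of the "Limited Content" drawback.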

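These are the scores we achieved, measured against the full source article rather than a human-written reference summary. With that setup a ROUGE precision of 1.000 is expected, since every word in an extractive summary also appears in the source: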

    ROUGE Score:
    Precision: 1.000
    Recall: 0.375
    F1-Score: 0.545

    BLEU Score: 0.677

In [26]:
!pip install bert-extractive-summarizer
!pip install spacy
!pip install transformers
# neuralcoref is left out: its wheel failed to build in this environment, and the cells below never import it

In [27]:
from summarizer import Summarizer

In [28]:
data = """India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized. India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges. The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest. India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained. In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.
"""

In [29]:
data = data.replace("\ufeff", "")

In [30]:
model = Summarizer()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [31]:
result = model(data, num_sentences=4, min_length=60)
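
The `bert-extractive-summarizer` README describes its approach as embedding each sentence, clustering the embeddings, and keeping the sentences closest to the cluster centroids. The sketch below illustrates only that selection step. It assumes hypothetical 2-D toy vectors in place of real BERT embeddings and a hand-rolled k-means with a fixed initialization, so the library's actual internals may differ.

```python
import math


def distance(a, b):
    # Euclidean distance between two points of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    # Assumption for reproducibility: the first k points are the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids


def select_sentences(embeddings, k):
    # For each centroid, keep the index of the closest sentence embedding.
    chosen = set()
    for centroid in kmeans(embeddings, k):
        chosen.add(min(range(len(embeddings)), key=lambda i: distance(embeddings[i], centroid)))
    return sorted(chosen)  # document order


# Hypothetical embeddings for six sentences that alternate between two topics.
toy_embeddings = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.0), (5.0, 5.4), (0.1, 0.0), (5.0, 5.2)]
print(select_sentences(toy_embeddings, k=2))  # prints [4, 5]
```

In the call above, `num_sentences=4` presumably plays the role of `k`: one representative sentence is kept per cluster.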



In [32]:
full = ''.join(result)  # result is already a single string here, so the join leaves it unchanged

In [33]:
full

"India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world. Since then, over 13 million doses have been administered across the country. In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people."

In [34]:
!pip install scikit-learn
!pip install rouge
!pip install nltk
from rouge import Rouge 
import nltk
import nltk.translate.bleu_score as bleu
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
rouge = Rouge()
scores = rouge.get_scores(full, data)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 1.000
Recall: 0.375
F1-Score: 0.545
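
To see where these three numbers come from, here is a simplified unigram-overlap sketch. It is not the `rouge` package's exact implementation (tokenization is plain whitespace splitting, and the `rouge1` helper and sample strings are hypothetical), but it shows why a summary built from the source's own words gets a precision of 1.0 while recall stays low.

```python
def rouge1(hypothesis, reference):
    # Simplified ROUGE-1: overlap between the sets of lowercase word types.
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    precision = overlap / len(hyp) if hyp else 0.0
    recall = overlap / len(ref) if ref else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical source text and an "extractive" summary taken from it.
source = "the drive was expanded to the elderly and the pace was slow"
summary = "the drive was expanded"
p, r, f = rouge1(summary, source)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints 1.000 0.444 0.615
```

The same harmonic-mean formula reproduces the reported F1: 2 × 1.000 × 0.375 / (1.000 + 0.375) ≈ 0.545.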


In [36]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summ):
    # Split the text into sentences using the '.' character as a separator
    sentences = summ.split('.')
    
    # Convert each non-empty sentence into a list of words; the text after the
    # final '.' is empty and would otherwise become an empty reference
    sentence_lists = [sentence.split() for sentence in sentences if sentence.strip()]
    
    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = data
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = full
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

0.6772960828454511


In [37]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.677
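
For readers unfamiliar with what `sentence_bleu` computes, the following is a simplified sketch of BLEU: clipped n-gram precision combined with a brevity penalty. It uses n-grams up to length 2 for brevity, whereas NLTK defaults to 4, and it applies no smoothing. The helper names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    hyp_counts = Counter(ngrams(hypothesis, n))
    # For each n-gram, the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each hypothesis count so repeated n-grams are not over-rewarded.
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def bleu(references, hypothesis, max_n=2):
    precisions = [modified_precision(references, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty uses the reference whose length is closest to the hypothesis.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references), key=lambda l: (abs(l - hyp_len), l))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the drive was expanded to the elderly".split()
hypothesis = "the drive was expanded".split()
print(f"{bleu([reference], hypothesis):.3f}")  # prints 0.472
```

In this toy case every n-gram of the hypothesis appears in the reference, so both precisions are 1.0 and the score is set entirely by the brevity penalty, exp(1 - 7/4).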
