# SumBasic
**SumBasic** is a simple yet effective algorithm for extractive text summarization. It scores each sentence by the average probability of its words and greedily selects the highest-scoring sentences. Its main advantages and disadvantages are:

### Pros:
*	Simplicity: SumBasic is easy to understand and implement, requiring only basic probabilistic modeling and word frequency analysis.
*	Language independence: SumBasic is largely language-independent; it needs only a tokenizer and, optionally, a stop-word list for the target language.
*	Good for extractive summarization: SumBasic is well-suited for extractive summarization, where the summary consists of selected sentences from the original text.
*	Good for single-document summarization: SumBasic is effective at summarizing single documents, and can produce summaries that are accurate and relevant.

### Cons:
*	Limited coverage: SumBasic favors sentences built from the most frequent words, so it may miss important details that are mentioned only once or twice.
*	Lack of coherence: SumBasic selects sentences independently of one another, so summaries may lack coherence, especially when summarizing longer texts.
*	Inability to handle new information: SumBasic is purely extractive, so it can only copy sentences from the original text and cannot add, paraphrase, or reconcile information that the source does not state.
*	Limited customization: SumBasic is a simple algorithm with limited customization options, which may limit its flexibility in certain applications.

Overall, SumBasic is a useful baseline for extractive summarization of single documents, and is easy to implement and understand. Its limitations in coverage, coherence, and handling new information make it less effective for more complex summarization tasks. Tuning choices such as the stop-word list, the summary length, and the rule for down-weighting already-covered words can mitigate some of these limitations.

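These are the scores we achieved. Both metrics compare the generated summary against the full source text rather than a separate human-written reference summary, so a ROUGE-1 precision of 1.000 mainly reflects that every word in the extract also occurs in the source: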

    ROUGE Score:
    Precision: 1.000
    Recall: 0.417
    F1-Score: 0.589

    BLEU Score: 0.621

## References

Here are some research papers related to the SumBasic algorithm for text summarization:

1. "The Automatic Creation of Literature Abstracts" by H. P. Luhn, in IBM Journal of Research and Development (1958)

1. "The Impact of Frequency on Summarization" by A. Nenkova and L. Vanderwende, Microsoft Research Technical Report MSR-TR-2005-101 (2005)

1. "Sumbasic++: An efficient multi-document summarization approach with topic modeling" by D. Shang, J. Liu, and X. Li, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)

These papers discuss various aspects of the SumBasic algorithm, including its effectiveness in producing high-quality summaries, its comparison with other techniques like LexRank and TextRank, and its extension to multi-document summarization using topic modeling.

The SumBasic algorithm is a simple and effective approach to extractive summarization that scores each sentence by the average frequency (probability) of the words it contains. It repeatedly selects the highest-scoring sentence for the summary, then lowers the weights of the words in that sentence and rescores the text, so that later selections favor content not yet covered.
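As a reference point, the formulation usually given for SumBasic can be written as below. The symbols ($p$, $n(w_i)$, $N$, $S_j$) are notation chosen here for illustration and do not come from the notebook. Some descriptions additionally require the selected sentence to contain the currently most probable word. The code in this notebook approximates the update step by subtracting 1 from raw counts rather than squaring probabilities.

$$
p(w_i) = \frac{n(w_i)}{N}, \qquad
\mathrm{score}(S_j) = \frac{1}{|S_j|} \sum_{w_i \in S_j} p(w_i), \qquad
p_{\text{new}}(w_i) = p_{\text{old}}(w_i)^2 \quad \text{for every } w_i \text{ in the selected sentence}
$$

Here $n(w_i)$ is the number of occurrences of word $w_i$ among the content words, $N$ is the total number of content words, and $|S_j|$ is the number of words in sentence $S_j$. Squaring a probability below 1 shrinks it, so words that the summary already covers contribute less to later scores.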

The papers suggest that SumBasic is a powerful and computationally efficient approach to automatic text summarization, particularly for single-document summarization tasks. The algorithm's simplicity and intuitive nature make it easy to implement and adapt to different domains and languages.


In [None]:
!pip install rouge
!pip install nltk
import nltk
from rouge import Rouge
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer models and stop-word lists used below
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('stopwords')
nltk.download('punkt')


In [None]:
def get_word_frequencies(text):
    """
    Calculates the frequency of each word in the text
    """
    stop_words = set(stopwords.words('english'))
    words = [word.lower() for word in word_tokenize(text) if word.isalpha() and word.lower() not in stop_words]
    freq = nltk.FreqDist(words)
    return freq

In [None]:
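def get_sentence_scores(text, freq):
    """
    Scores each sentence by the average frequency of its words
    """
    sentences = sent_tokenize(text)
    scores = []
    for sentence in sentences:
        # Stop words stay in this list: they are absent from freq, so they add 0 but still count toward the length
        sentence_words = [word.lower() for word in word_tokenize(sentence) if word.isalpha()]
        if not sentence_words:
            # No alphabetic tokens (e.g. a stray number): score 0 instead of dividing by zero
            scores.append((sentence, 0))
            continue
        sentence_score = sum(freq[word] for word in sentence_words) / len(sentence_words)
        scores.append((sentence, sentence_score))
    return scores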

In [None]:
def summarize(text, length):
    """
    Summarizes the text to `length` sentences using a SumBasic-style greedy selection
    """
    freq = get_word_frequencies(text)
    summary = []
    while len(summary) < length:
        # Re-score every sentence with the current (down-weighted) frequencies
        sentence_scores = get_sentence_scores(text, freq)
        top_sentence = max(sentence_scores, key=lambda x: x[1])[0]
        # Selected sentences stay in the candidate pool, so the same sentence can be picked again
        summary.append(top_sentence)
        # Reduce the frequency of each word in the selected sentence by 1 to discourage redundancy
        # (standard SumBasic squares word probabilities instead)
        for word in word_tokenize(top_sentence):
            if word.isalpha():
                freq[word.lower()] -= 1
    return ' '.join(summary)
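The `summarize` function above keeps selected sentences in the candidate pool and lowers counts by 1. A minimal sketch of a variant that never repeats a sentence and squares word probabilities after each pick is shown here for comparison. It is not the notebook's method: the names `split_sentences`, `content_words`, and `sumbasic_no_repeats`, the regex tokenizer, and the tiny stop-word list are all assumptions made so the example runs without NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "on", "it", "that", "for", "with"}


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (a stand-in for sent_tokenize)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def sumbasic_no_repeats(text, length):
    """Pick up to `length` distinct sentences, squaring word probabilities after each pick."""
    sentences = split_sentences(text)
    tokens = [content_words(s) for s in sentences]
    counts = Counter(w for words in tokens for w in words)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    # Track candidates by index; sentences without content words cannot be scored.
    candidates = [i for i, words in enumerate(tokens) if words]
    chosen = []
    while candidates and len(chosen) < length:
        best = max(candidates, key=lambda i: sum(prob[w] for w in tokens[i]) / len(tokens[i]))
        chosen.append(best)
        candidates.remove(best)  # each sentence position can be selected only once
        for w in set(tokens[best]):
            prob[w] **= 2  # down-weight words that the summary already covers
    return " ".join(sentences[i] for i in chosen)


demo = "Cats chase mice. Cats sleep a lot. Dogs bark loudly. Cats chase mice."
print(sumbasic_no_repeats(demo, 2))
```

In this toy input, the second pick moves to a sentence about a different topic once the words of the first pick have been down-weighted, which is the redundancy-avoidance effect the update step is meant to produce.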

In [None]:
text ="""
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

In [None]:
summary = summarize(text, 3)
print(summary)

In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people.


In [None]:
rouge = Rouge()
scores = rouge.get_scores(summary, text)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 1.000
Recall: 0.417
F1-Score: 0.589
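To see why scoring an extract against its own source pushes precision toward 1, ROUGE-1 can be sketched by hand as unigram overlap. This is a simplified illustration, not the `rouge` package's implementation: the function name `rouge1`, the `\w+` tokenizer, and the use of clipped counts are assumptions, and the package may tokenize and count n-grams differently.

```python
import re
from collections import Counter


def rouge1(candidate, reference):
    """Unigram precision, recall and F1 using clipped token counts."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


source = "The cat sat on the mat. The dog slept by the door."
extract = "The cat sat on the mat."
precision, recall, f1 = rouge1(extract, source)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```

When every sentence of the extract is copied once from the source, all of its tokens are found in the reference, so precision is 1 and recall is simply the fraction of the source that the extract covers. A more informative evaluation would use a separately written reference summary.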


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')
    
    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]
    
    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

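# The source text stands in for the reference: each '.'-separated chunk becomes one reference token list
reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
# The generated summary is the single hypothesis, passed as one flat token list
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)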
print(score)

0.6209648794317061


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.621
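To make the BLEU number easier to interpret, the sketch below shows how an unsmoothed BLEU score is assembled from clipped n-gram precisions and a brevity penalty. It is a simplified stand-in written for illustration, not NLTK's `sentence_bleu`: the helper names `ngrams` and `bleu` are assumptions, and NLTK's handling of edge cases (smoothing, warnings, zero counts) may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(references, hypothesis, max_n=4):
    """Unsmoothed BLEU for one hypothesis against several tokenized references."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each hypothesis n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing: one n-gram order without matches zeroes the score
        log_precisions.append(math.log(matched / total))
    # The brevity penalty uses the reference length closest to the hypothesis length.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references), key=lambda length: (abs(length - hyp_len), length))
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(bleu([reference], reference))
print(bleu([reference], "the cat sat on".split()))
```

In the notebook's setup the hypothesis is the whole summary and each reference is a single source sentence, so the hypothesis is longer than every reference and the brevity penalty does not apply. The score is then driven by n-gram precision, which is likely reduced mainly by n-grams that span the boundary between two extracted sentences.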
