# TF-IDF
**TF-IDF (Term Frequency-Inverse Document Frequency)** is a common technique used for information retrieval and text summarization. Here are some advantages and disadvantages of using TF-IDF for text summarization:

### Pros:

* TF-IDF is a simple and computationally efficient method for ranking and summarizing documents based on the importance of their terms.
* TF-IDF takes into account the frequency of a term in a document and across the entire corpus, which can help identify important and unique words for summarization.
* TF-IDF can be customized to weigh certain terms more heavily based on their relevance to the topic, allowing for more targeted and accurate summaries.
* TF-IDF can be easily implemented and requires minimal preprocessing, making it a practical choice for small datasets or simpler NLP tasks.

### Disadvantages:

* TF-IDF only considers the importance of individual terms, without taking into account the relationships between them or the context in which they appear.
* TF-IDF can be sensitive to the length of documents, as longer documents may contain more unique terms and be ranked higher in importance, regardless of their actual relevance to the topic.
* TF-IDF does not capture the semantic meaning of terms, which can lead to inaccurate summaries that miss important concepts or nuances.
* TF-IDF assumes that all terms are equally important within a document, which may not be the case in certain contexts where certain terms carry more weight or have greater impact on the overall meaning.

Overall, TF-IDF can be a useful technique for text summarization in certain contexts, but it has limitations and may not be suitable for all use cases. Its advantages and disadvantages should be carefully considered when selecting a summarization method.

These are the scores we achieved:

    ROUGE Score:
    Precision: 0.787
    Recall: 0.266
    F1-Score: 0.398

    BLEU Score: 0.008

Here are some research papers related to using TF-IDF for text summarization:

1. "Automatic text summarization using TF-IDF weighting scheme" by R. Wan, D. Zhao, and C. Xu, in Proceedings of the 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS)

2. "A comparison study of TF-IDF, LSA and multi-words for text classification" by T. Nasukawa and J. Yi, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL)

3. "Extractive summarization using continuous vector space models" by R. Nallapati, B. Zhou, and C. Gulcehre, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)

4. "Text summarization with TF-IDF weighted word embedding" by J. Nam and E. Han, in Proceedings of the 2017 IEEE International Conference on Big Data and Smart Computing (BigComp)

These papers explore different aspects of using TF-IDF for text summarization, such as its effectiveness in producing high-quality summaries, its comparison with other techniques like latent semantic analysis, and its combination with other techniques like continuous vector space models and word embeddings.

The papers suggest that TF-IDF is a simple and effective approach to summarization, particularly for extractive summarization, where sentences are selected from the original document. The use of TF-IDF can help identify the most important words in the document and select the sentences that contain them, leading to a more informative summary.





In [1]:
from nltk.corpus import stopwords
import numpy as np
import pandas
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
!pip install scikit-learn
!pip install rouge
!pip install nltk
from rouge import Rouge 
import nltk
import nltk.translate.bleu_score as bleu
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
s = """India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

In [4]:
sentences = sent_tokenize(s)

In [5]:
dict = {}
text=""
for a in sentences:
    temp = re.sub("[^a-zA-Z]"," ",a)
    temp = temp.lower()
    dict[temp] = a
    text+=temp

In [6]:
text

'india s health ministry has announced that the country s covid    vaccination drive will now be expanded to include people over the age of    and those over    with co morbidities the move is expected to cover an additional     million people  making it one of the largest vaccination drives in the world the decision was taken after a meeting of the national expert group on vaccine administration for covid     negvac   which recommended the expansion of the vaccination program the negvac also suggested that private hospitals may be allowed to administer the vaccine  although the details of this are yet to be finalized india began its vaccination drive in mid january  starting with healthcare and frontline workers since then  over    million doses have been administered across the country however  the pace of the vaccination drive has been slower than expected  with concerns raised over vaccine hesitancy and logistical challenges the expansion of the vaccination drive to include the eld

In [7]:
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
print (len(word_frequencies))

106


In [9]:
max_freq = max(word_frequencies.values())

for w in word_frequencies :
      word_frequencies[w]/=max_freq
print (word_frequencies)

{'india': 0.8571428571428571, 'health': 0.42857142857142855, 'ministry': 0.42857142857142855, 'announced': 0.2857142857142857, 'country': 0.7142857142857143, 'covid': 0.7142857142857143, 'vaccination': 1.0, 'drive': 0.7142857142857143, 'expanded': 0.2857142857142857, 'include': 0.42857142857142855, 'people': 0.5714285714285714, 'age': 0.14285714285714285, 'co': 0.42857142857142855, 'morbidities': 0.42857142857142855, 'move': 0.14285714285714285, 'expected': 0.2857142857142857, 'cover': 0.14285714285714285, 'additional': 0.2857142857142857, 'million': 0.5714285714285714, 'making': 0.2857142857142857, 'one': 0.14285714285714285, 'largest': 0.14285714285714285, 'drives': 0.14285714285714285, 'world': 0.2857142857142857, 'decision': 0.2857142857142857, 'taken': 0.2857142857142857, 'meeting': 0.2857142857142857, 'national': 0.2857142857142857, 'expert': 0.2857142857142857, 'group': 0.2857142857142857, 'vaccine': 0.5714285714285714, 'administration': 0.2857142857142857, 'negvac': 0.285714285

In [10]:
sentence_scores = {}
for sent in sentences:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [11]:
import heapq
summary_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained. Since then, over 13 million doses have been administered across the country.


In [13]:
rouge = Rouge()
scores = rouge.get_scores(summary, s)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 1.000
Recall: 0.364
F1-Score: 0.534


In [14]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summary.split('.')
    
    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]
    
    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = s
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

0.882936957293955


In [15]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.883
