# Cluster-Based

**Cluster-based algorithms** for text summarization are unsupervised, extractive methods that group similar sentences into clusters and then pick representative sentences from each cluster to form the summary. Their main advantages and disadvantages are listed below.

### Pros:
*	Flexibility: Cluster-based algorithms are very flexible, as they can handle different types of texts, such as news articles, academic papers, and social media posts, among others.
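*	Language independence: These algorithms rely on statistical features such as TF-IDF rather than language-specific grammar, so they are largely language-independent; only the tokenizer and stop-word list need to match the language of the text.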
*	Efficient: Cluster-based algorithms are relatively fast and can summarize large amounts of text quickly.

### Cons:
*	Clustering errors: The quality of the summary depends heavily on the quality of the clustering, and the clustering may not always be accurate, leading to poor summaries.
*	Lack of coherence: The extracted sentences come from different clusters and are placed side by side without connecting context, so the summary may lack coherence.
*	Limited coverage: Only the most representative sentence of each cluster is extracted, so important details that appear in the remaining sentences may be missed.
*	Difficulty in determining optimal number of clusters: One of the key challenges in cluster-based summarization is determining the optimal number of clusters, which can be difficult.
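One common heuristic for the last point is to fit K-means for several candidate values of k and keep the value with the highest silhouette score. The sketch below assumes scikit-learn is installed; the helper name `pick_k`, the candidate range, and the toy sentences are illustrative and are not part of this notebook's pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_k(sentences, candidates=range(2, 5), seed=0):
    """Return (best_k, scores) using the silhouette score (hypothetical helper)."""
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = {}
    for k in candidates:
        # The silhouette score is only defined for 2 <= k <= n_samples - 1
        if not 2 <= k <= X.shape[0] - 1:
            continue
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        if len(set(labels)) < 2:
            continue
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


toy_sentences = [
    "The vaccine drive was expanded to older adults.",
    "Vaccine doses were administered to frontline workers.",
    "Private hospitals may administer the vaccine.",
    "Daily case counts have been declining.",
    "Experts warned that case counts could rise again.",
    "The country reported millions of cases.",
]
best_k, scores = pick_k(toy_sentences)
print(best_k, scores)
```

With only a handful of sentences the silhouette score is noisy, so the result is best treated as a starting point rather than a definitive choice.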

Overall, cluster-based algorithms are a useful approach for summarizing text, but they do have limitations that need to be considered when using them.

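These are the scores we achieved, with ROUGE-1 and BLEU both computed against the full source text rather than a separate reference summary: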

      ROUGE Score:
      Precision: 0.980
      Recall: 0.331
      F1-Score: 0.495

      BLEU Score: 0.896

In [None]:
!pip install rouge
!pip install nltk

import nltk
import numpy as np
from rouge import Rouge
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK 'punkt' tokenizer models
nltk.download('punkt')

In [None]:
text ="""
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

In [None]:
# Split the text into sentences on '. ' (a period followed by a space).
# Sentences joined without a space (e.g. "world.The") are not separated here.
sentences = text.split('. ')

# Store each stripped sentence as a separate document
documents = [sentence.strip() for sentence in sentences]
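Because the split above only fires on a period followed by a space, sentences that run together in the source (such as "world.The decision") stay merged. A minimal sketch of a more tolerant splitter is shown below; the function name `split_sentences` is hypothetical, and the rule (split after `.`, `!` or `?` when the next non-space character is a capital letter) is a heuristic that will over-split abbreviations such as "U.S.". It relies on `re.split` accepting empty matches, which requires Python 3.7 or later.

```python
import re


def split_sentences(paragraph):
    """Split after ., ! or ? when the next non-space character is a capital letter."""
    parts = re.split(r"(?<=[.!?])\s*(?=[A-Z])", paragraph.strip())
    return [p.strip() for p in parts if p.strip()]


sample = "The drive was expanded.The decision followed a meeting. Doses were administered."
for sentence in split_sentences(sample):
    print(sentence)
```

Switching to such a splitter would change the documents that are clustered, and therefore the summary and the scores reported in this notebook.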

In [None]:
documents

["India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities",
 'The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program',
 'The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers',
 'Since then, over 13 million doses have been administered across the country',
 'However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive t

In [None]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Create document-term matrix
doc_term_matrix = vectorizer.fit_transform(documents)

# Perform K-means clustering
k = 2
km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1, verbose=False)
km.fit(doc_term_matrix)

KMeans(max_iter=100, n_clusters=2, n_init=1, verbose=False)

In [None]:
# Get cluster labels and centroids
labels = km.labels_
centroids = km.cluster_centers_

# Pick the sentence closest to each centroid as that cluster's representative
representative_sentences = []
for i in range(k):
    cluster_indices = np.where(labels == i)[0]
    # Reuse the TF-IDF rows computed earlier for the sentences in this cluster
    cluster_vectors = doc_term_matrix[cluster_indices]
    # TF-IDF rows are L2-normalized by default, so the dot product with the
    # centroid ranks the sentences by their similarity to it
    similarity_scores = np.asarray(cluster_vectors.dot(centroids[i])).flatten()
    # argmax picks the highest-scoring sentence in the cluster
    representative_idx = cluster_indices[np.argmax(similarity_scores)]
    representative_sentences.append(documents[representative_idx])
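The selection step can be illustrated on its own with plain NumPy. The sketch below uses small made-up vectors instead of TF-IDF rows and computes the cluster mean directly instead of reading it from a fitted `KMeans` object; the function name `closest_to_centroid` is hypothetical.

```python
import numpy as np


def closest_to_centroid(vectors, labels, cluster_id):
    """Return the row index (into `vectors`) most cosine-similar to the cluster mean."""
    member_idx = np.where(labels == cluster_id)[0]
    members = vectors[member_idx]
    centroid = members.mean(axis=0)
    # Cosine similarity = dot product divided by the product of the norms
    norms = np.linalg.norm(members, axis=1) * np.linalg.norm(centroid)
    sims = members @ centroid / np.where(norms == 0, 1.0, norms)
    return int(member_idx[np.argmax(sims)])


vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.4]])
labels = np.array([0, 0, 1, 0])
print(closest_to_centroid(vectors, labels, 0))  # prints 1
print(closest_to_centroid(vectors, labels, 1))  # prints 2
```

Row 1 wins for cluster 0 because it points in almost the same direction as the mean of rows 0, 1 and 3, while cluster 1 has a single member and returns it directly.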

In [None]:
def listToString(s):
    # Concatenate the sentences with no separator between them
    return "".join(s)

In [None]:
# Post-processing: remove duplicate sentences (note that set() does not preserve order)
final_summary = list(set(representative_sentences))

# Print the resulting summary
summary = listToString(final_summary)
print(summary)

India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbiditiesThe NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers
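In the printed summary the two sentences run together ("co-morbiditiesThe NEGVAC") because they are concatenated without a separator, and `set()` gives no guarantee about their order. A minimal sketch of an alternative is shown below; the helper name `join_sentences` is hypothetical. It keeps the first occurrence of each sentence, restores a closing period where splitting removed it, and joins with single spaces.

```python
def join_sentences(sentences):
    """Drop exact duplicates (keeping first occurrences) and join with single spaces."""
    # dict.fromkeys preserves insertion order, unlike set()
    unique = list(dict.fromkeys(sentences))
    # Add a closing period where a sentence lost it during splitting
    closed = [s if s.endswith((".", "!", "?")) else s + "." for s in unique]
    return " ".join(closed)


print(join_sentences(["First point", "Second point.", "First point"]))
# prints: First point. Second point.
```

The scores reported in this notebook were computed on the unseparated string, so they would change slightly if the summary were built this way.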


In [None]:
rouge = Rouge()
# get_scores(hypothesis, reference): the full source text serves as the reference here
scores = rouge.get_scores(summary, text)
rouge_1 = scores[0]['rouge-1']
print("ROUGE Score:")
print("Precision: {:.3f}".format(rouge_1['p']))
print("Recall: {:.3f}".format(rouge_1['r']))
print("F1-Score: {:.3f}".format(rouge_1['f']))

ROUGE Score:
Precision: 0.980
Recall: 0.331
F1-Score: 0.495
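To make these three numbers concrete, here is a simplified sketch of ROUGE-1 as unigram overlap between a candidate and a reference. It tokenizes on whitespace and uses clipped counts, which is an assumption for illustration; the `rouge` package may tokenize and count differently, so this is not expected to reproduce its output exactly.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram precision, recall and F1 with clipped counts on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(round(p, 3), round(r, 3), round(f, 3))  # prints 1.0 0.5 0.667
```

The same pattern explains the scores above: the summary is extracted from the reference text itself, so almost every summary word also occurs in the reference (precision 0.980), while the summary covers only about a third of the reference (recall 0.331). The reported F1 is consistent with their harmonic mean, 2 × 0.980 × 0.331 / (0.980 + 0.331) ≈ 0.495.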


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the text into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each non-empty sentence into a list of words
    return [sentence.split() for sentence in sentences if sentence.strip()]

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    return paragraph.split()

# Each sentence of the source text acts as one reference (a list of tokens)
references = summary_to_sentences(text)
# The generated summary is the single hypothesis (a flat list of tokens)
hypothesis = paragraph_to_wordlist(summary)

# sentence_bleu(references, hypothesis)
score = sentence_bleu(references, hypothesis)
print(score)

0.8956352427165735


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.896
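BLEU combines clipped n-gram precisions (by default for n = 1 to 4) with a brevity penalty. The sketch below reimplements those two building blocks in plain Python for illustration; the function names are hypothetical and it is not a substitute for `sentence_bleu`, which also takes the weighted geometric mean over n and supports smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most as often
    as it appears in the reference where it is most frequent."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref = Counter()
    for ref in references:
        max_ref |= Counter(ngrams(ref, n))  # element-wise maximum
    clipped = sum((hyp_counts & max_ref).values())
    return clipped / max(sum(hyp_counts.values()), 1)


def brevity_penalty(references, hypothesis):
    """Penalize hypotheses that are shorter than the closest reference length."""
    hyp_len = len(hypothesis)
    if hyp_len == 0:
        return 0.0
    ref_len = min((len(r) for r in references), key=lambda length: (abs(length - hyp_len), length))
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


refs = [["the", "cat", "sat"], ["a", "cat", "sat", "down"]]
hyp = ["the", "the", "cat"]
print(modified_precision(refs, hyp, 1))  # 2 of 3 unigrams survive clipping
print(modified_precision(refs, hyp, 2))  # 1 of 2 bigrams matches
print(brevity_penalty(refs, hyp))        # lengths match, so no penalty
```

In this notebook every sentence of the source text is passed as a reference and the summary consists of sentences copied from that text, so most of the summary's n-grams are found in some reference. That is why the BLEU score is high even though the summary covers only part of the source.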
