# CDS
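Text summarization using **Connected Dominating Set (CDS)** is an extractive technique: the sentences of a document become the nodes of a similarity graph, and the summary is built from a connected dominating set of that graph, that is, a set of mutually connected nodes such that every other node is adjacent to at least one of them. The result is a shorter version of the original made up of its own sentences. Here are some advantages and disadvantages of using CDS for text summarization: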

### Pros:

* Good Coverage: CDS-based summarization techniques can cover most of the important topics in a document, because every sentence in the graph is either selected or directly linked to a selected sentence.

* Improved Coherence: CDS-based techniques can produce more coherent summaries, because a connected set links each selected sentence to at least one other selected sentence with related content.

* Speed: Greedy CDS heuristics are relatively fast and can generate summaries quickly, although building the similarity graph compares every pair of sentences, so the cost grows quadratically with document length.

* Flexibility: CDS-based techniques can be adapted to different types of text documents, such as news articles and research papers.

### Cons:

* Limited Precision: CDS-based summarization techniques may not always select the most important sentences from a document, as they focus more on coverage and coherence than on precision.

* Subjectivity: CDS-based techniques can be subjective, as the selected sentences depend on the criteria used to define importance, such as the similarity measure and the edge threshold used to build the graph.

* Lack of Context: CDS-based techniques may not take into account the context of a sentence, which can lead to the selection of sentences that are not relevant to the main theme or idea.

* Over-simplification: CDS-based techniques can oversimplify complex documents, as they tend to focus on the most important sentences and may leave out important details or nuances.

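These are the scores we achieved on the sample article used in the notebook. Both metrics compare the generated summary against the full source text rather than a separately written reference summary, so a ROUGE-1 precision of 1.000 is expected for an extractive summary: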

    ROUGE Score:
    Precision: 1.000
    Recall: 0.430
    F1-Score: 0.602

    BLEU Score: 0.844

## References 

1. "A new approach for text summarization using connected dominating set in graphs" by M. Sadeghi and M. M. Farsangi, in Proceedings of the 2010 International Conference on Computer, Mechatronics, Control and Electronic Engineering (CMCE)

2. "Text summarization using a graph-based method with connected dominating set" by A. E. Bayraktar and F. Can, in Proceedings of the 2012 International Conference on Computer Science and Engineering (UBMK)

3. "Extractive text summarization based on the connected dominating set in a graph representation" by A. E. Bayraktar and F. Can, in Turkish Journal of Electrical Engineering & Computer Sciences

4. "A novel text summarization technique based on connected dominating set in graph" by M. Sadeghi and M. M. Farsangi, in the Journal of Information Science and Engineering

These papers propose building a graph-based representation of the document and then extracting the summary by selecting the sentences (nodes) that belong to a CDS of that graph. The CDS approach is reported to be effective in identifying the most important nodes in the graph, and can lead to high-quality summaries.
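
To make the definition concrete, here is a minimal sketch of a check for the two CDS conditions, domination and connectivity. The function name `is_connected_dominating_set` and the toy path graph are illustrative assumptions, not part of the papers or of the notebook code. The graph is a plain dict of neighbour sets; a `networkx.Graph` offers the same `graph[node]` access, so it should work there too.

```python
from collections import deque

def is_connected_dominating_set(graph, candidate):
    """Check the two CDS conditions for `candidate` in an undirected graph.

    `graph` maps each node to an iterable of its neighbours (hypothetical
    representation chosen for this sketch).
    """
    candidate = set(candidate)
    if not candidate:
        return False

    # Domination: every node is in the set or adjacent to a member of it.
    for node in graph:
        if node not in candidate and candidate.isdisjoint(graph[node]):
            return False

    # Connectivity: a BFS restricted to the set must reach all of its members.
    start = next(iter(candidate))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour in candidate and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == candidate

# Toy sentence graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(is_connected_dominating_set(path, {1, 2, 3}))  # True
print(is_connected_dominating_set(path, {1, 3}))     # False: dominating but not connected
print(is_connected_dominating_set(path, {1, 2}))     # False: node 4 is not dominated
```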

In [None]:
!pip install rouge
!pip install nltk
import networkx as nx
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from rouge import Rouge 
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import nltk.translate.bleu_score as bleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
def preprocess_text(text):
    """
    Preprocess a given text by tokenizing, removing stop words, and lemmatizing the words.
    """
    # tokenize the text into sentences
    sentences = sent_tokenize(text)

    # remove stop words and lemmatize the words in each sentence
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    preprocessed_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        filtered_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        preprocessed_sentence = " ".join(filtered_words)
        preprocessed_sentences.append(preprocessed_sentence)

    return preprocessed_sentences

In [None]:
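def compute_similarity(sentence1, sentence2):
    """
    Compute the cosine similarity between two sentences using TF-IDF vectors.
    """
    try:
        tfidf = TfidfVectorizer().fit_transform([sentence1, sentence2])
    except ValueError:
        # raised when both sentences are empty after preprocessing (empty vocabulary)
        return 0.0
    # TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity
    similarity_score = (tfidf @ tfidf.T).toarray()[0, 1]
    return similarity_score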
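
As a possible alternative to fitting a new vectorizer for every sentence pair, the sketch below fits TF-IDF once on all sentences and computes the whole similarity matrix in one call. The helper name `pairwise_similarity` and the sample sentences are assumptions made for illustration. Because the IDF weights are then estimated from the whole document rather than from two sentences, the scores will generally differ from those of `compute_similarity`, and the edge threshold may need re-tuning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_similarity(sentences):
    """Return an n x n cosine-similarity matrix for preprocessed sentences."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    return cosine_similarity(tfidf)

# Hypothetical preprocessed sentences, only for demonstration.
sentences = [
    "vaccination drive expanded",
    "vaccination drive slower expected",
    "daily case count declining",
]
sim = pairwise_similarity(sentences)
print(sim.shape)         # (3, 3)
print(sim[0, 1] > 0.0)   # True: the first two sentences share terms
print(sim[0, 2] == 0.0)  # True: no shared terms
```

Entry `sim[i, j]` can then replace the call to `compute_similarity` when adding edges between sentences `i` and `j`.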

In [None]:
def find_minimum_cds(graph):
    """
    Greedily approximate a dominating set of a graph.

    Note: the selected nodes are pairwise non-adjacent, so the result is a
    dominating set but is neither guaranteed to be minimum nor connected.
    """
    cds = set() # selected nodes
    nodes = set(graph.nodes()) # nodes that are not yet dominated

    while nodes:
        max_degree_node = max(nodes, key=lambda n: graph.degree(n)) # undominated node with highest degree
        cds.add(max_degree_node) # select it
        nodes.discard(max_degree_node) # remove node from remaining nodes
        neighbors = set(graph.neighbors(max_degree_node)) # its neighbors are now dominated
        nodes.difference_update(neighbors) # remove neighbors from remaining nodes

    return cds
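
The function above discards the neighbours of every node it selects, so the selected nodes are never adjacent to each other. One way to keep the selection connected is to grow it only through nodes that already touch it. The following is a minimal sketch of such a greedy heuristic, not the method of the cited papers: the name `greedy_connected_dominating_set` is hypothetical, the result is not guaranteed to be minimum, and the graph is assumed to be connected (a `ValueError` is raised otherwise). The graph is given as a dict of neighbour sets; a `networkx.Graph` supports the same `graph[node]` access.

```python
def greedy_connected_dominating_set(graph):
    """Greedy heuristic for a connected (not necessarily minimum) dominating set."""
    if len(graph) == 0:
        return set()

    neighbours = {node: set(graph[node]) for node in graph}

    # Seed with a highest-degree node.
    seed = max(neighbours, key=lambda n: len(neighbours[n]))
    cds = {seed}
    dominated = {seed} | neighbours[seed]

    while len(dominated) < len(neighbours):
        # Only nodes adjacent to the current set may join, so it stays connected.
        frontier = dominated - cds
        best = max(frontier, key=lambda n: len(neighbours[n] - dominated), default=None)
        if best is None or not (neighbours[best] - dominated):
            raise ValueError("graph is not connected; no connected dominating set exists")
        cds.add(best)
        dominated |= neighbours[best]

    return cds

# Toy sentence graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(sorted(greedy_connected_dominating_set(path)))  # [1, 2, 3]
```

On the same path graph, the non-connected greedy selection can return nodes such as 1 and 3, which dominate the graph but share no edge.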

In [None]:
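def summarize_text(text, summary_size, threshold=0.1):
    """
    Summarize a given text using the node set returned by find_minimum_cds on its sentence-similarity graph.

    text: the document to summarize
    summary_size: maximum number of sentences in the summary
    threshold: minimum TF-IDF similarity for two sentences to be linked by an edge
    """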
    # preprocess the text
    preprocessed_sentences = preprocess_text(text)

    # create graph from preprocessed sentences
    graph = nx.Graph()
    for i, sentence in enumerate(preprocessed_sentences):
        for j in range(i+1, len(preprocessed_sentences)):
            similarity_score = compute_similarity(sentence, preprocessed_sentences[j]) # compute similarity score between two sentences
            if similarity_score > threshold:
                graph.add_edge(i, j, weight=similarity_score)

    # find minimum CDS of the graph
    cds = find_minimum_cds(graph)

    # sort the selected nodes by their order of occurrence in the original text
    summary_nodes = sorted(cds)

    # create summary from the first summary_size selected sentences;
    # each sentence keeps its own end punctuation, so join with a space
    original_sentences = sent_tokenize(text)
    summary = " ".join(original_sentences[i] for i in summary_nodes[:summary_size])

    return summary

In [None]:
text ="""
 India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

summary_size = 3 # number of sentences in the summary
summary = summarize_text(text, summary_size)

print(summary)

The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program.. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States.


In [None]:
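rouge = Rouge()
# get_scores(hypothesis, reference): the reference here is the full source text,
# so precision is expected to be high for an extractive summary
scores = rouge.get_scores(summary, text)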
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 1.000
Recall: 0.430
F1-Score: 0.602
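
The precision of 1.000 follows from scoring an extractive summary against its own source: every word of the summary also occurs in the reference. The sketch below illustrates this with a simplified, count-based ROUGE-1 on whitespace tokens. It is an assumption-laden illustration (the function `rouge_1` and the toy strings are made up here), not the exact computation performed by the `rouge` package.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 on lowercase whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy source text and an extract taken verbatim from it.
source = "the drive was expanded the pace was slow cases are declining"
extract = "the drive was expanded"
p, r, f = rouge_1(extract, source)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.364 0.533
```

Recall is the informative number in this setup, since it reflects how much of the source the summary retains; comparing against a separately written reference summary would make precision meaningful as well.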


In [None]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    """Split a text on '.' and return each non-empty sentence as a list of words."""
    # Split the text into sentences using the '.' character as a separator
    sentences = summary.split('.')
    
    # Convert each sentence into a list of words, dropping empty fragments
    sentence_lists = [sentence.split() for sentence in sentences if sentence.strip()]
    
    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

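# The source text, split into sentences, serves as the set of references;
# the generated summary, as one flat word list, is the hypothesis.
reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = summary
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)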
print(score)

0.8435083039960267


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.844
