# CDS
**CDS (Compressive Document Summarization)** is a deep learning-based approach to text summarization that combines extractive and abstractive techniques. Its pros and cons for summarizing news articles are listed below.

### Pros:

* High compression: CDS can generate highly compressed summaries that retain the most important information from the input text, making it useful for summarizing long documents.

* Combines extractive and abstractive techniques: CDS combines the benefits of extractive and abstractive summarization techniques, resulting in summaries that are both informative and concise.

* High accuracy: CDS has reported strong results on benchmark datasets for text summarization, which suggests that it can generate high-quality summaries.

* Customizable: CDS can be fine-tuned on specific domains or use cases, allowing users to generate summaries tailored to their needs.

### Cons:

* Resource-intensive: Training and using CDS for text summarization requires significant computational resources, including high-end GPUs, large amounts of memory, and high-speed storage.

* Large model size: CDS is a large model that requires a lot of disk space to store, making it challenging to deploy on devices with limited storage capacity.

* Dependence on training data: CDS's performance is highly dependent on the quality and relevance of the training data used to train the model. If the training data is biased or limited, the quality of the summaries may be compromised.

* Expertise required: Fine-tuning CDS for specific use cases or domains requires expertise in natural language processing and machine learning.

Overall, CDS is a powerful tool for text summarization that can generate highly compressed and informative summaries. However, it requires significant computational resources and expertise to use effectively, making it best suited for large-scale projects or applications where high accuracy is critical.

These are the scores we achieved:

    ROUGE Score:
    Precision: 
    Recall: 
    F1-Score: 

    BLEU Score: 
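
For orientation, the precision, recall, and F1 values reported under ROUGE can be understood through unigram overlap. The sketch below is a simplified ROUGE-1 written for illustration only: it assumes whitespace tokenization and lowercasing, applies no stemming, and is not the implementation used by the `rouge` package. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(f"Precision: {p:.3f}  Recall: {r:.3f}  F1-Score: {f:.3f}")
```

Under this simplified definition, a summary whose sentences are copied verbatim from the reference always has a precision of 1, so recall is the more informative number when an extractive summary is scored against its own source text.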

## References 

1. "A neural attention model for abstractive sentence summarization" by Alexander M. Rush, Sumit Chopra, and Jason Weston, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2. "A deep reinforced model for abstractive summarization" by Romain Paulus, Caiming Xiong, and Richard Socher, in International Conference on Learning Representations (ICLR), 2018

3. "Compressive document summarization via sparse optimization" by Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao, in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015

4. "Abstractive document summarization with a graph-based attentional neural model" by Jiwei Tan, Xiaojun Wan, and Jianguo Xiao, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017

5. "Neural document summarization by jointly learning to score and select sentences" by Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018

These papers cover neural and graph-based approaches to abstractive, extractive, and compressive summarization, and may provide insight into how to approach this task.






In [1]:
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
!pip install -U transformers sentencepiece rouge nltk

import json

import nltk
import nltk.translate.bleu_score as bleu
import torch
from rouge import Rouge
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
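def pagerank(A, eps=0.0001, d=0.85, max_iter=100):
    """Power-iteration PageRank over a non-negative similarity matrix A."""
    n = A.shape[0]
    P = np.ones(n) / n
    col_sums = A.sum(axis=0, keepdims=True)
    # Column-normalize A; all-zero columns stay zero instead of producing NaN
    A_norm = np.divide(A, col_sums, out=np.zeros_like(A, dtype=float), where=col_sums > 0)
    for _ in range(max_iter):
        # A_norm is column-stochastic, so it is applied without a transpose.
        # With A_norm.T the uniform start vector is already a fixed point,
        # and every sentence would receive the same score.
        new_P = (1 - d) / n + d * A_norm.dot(P)
        delta = abs(new_P - P).sum()
        P = new_P
        if delta <= eps:
            break
    return P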
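
The damped power-iteration update behind PageRank can be tried on a tiny hand-made matrix. The sketch below is a plain-Python illustration rather than the notebook's NumPy code: the name `pagerank_toy`, the `max_iter` cap, and the 3x3 similarity matrix are assumptions made for the example. It normalizes each column to sum to 1 and multiplies by that matrix directly.

```python
def pagerank_toy(A, d=0.85, eps=1e-4, max_iter=100):
    """Damped power iteration on a small non-negative matrix (list of lists)."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    # Column-normalize so that each column distributes a total weight of 1
    M = [[A[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    P = [1.0 / n] * n
    for _ in range(max_iter):
        new_P = [(1 - d) / n + d * sum(M[i][j] * P[j] for j in range(n))
                 for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_P, P))
        P = new_P
        if delta <= eps:
            break
    return P

# Hypothetical similarities: sentence 0 overlaps with both other sentences,
# while sentences 1 and 2 do not overlap with each other
A = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
scores = pagerank_toy(A)
print([round(s, 3) for s in scores])
```

Because every column of the normalized matrix sums to 1, the scores keep summing to 1 across iterations, and the sentence connected to both others ends up with the highest score.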

In [4]:
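def textrank(text, n=3):
    """Return the n highest-ranked sentences of text, in ascending score order."""
    sentences = sent_tokenize(text)
    # TF-IDF over unigrams and bigrams; rows are L2-normalized by default,
    # so X.dot(X.T) holds the pairwise cosine similarities between sentences
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
    X = vectorizer.fit_transform(sentences)
    A = X.dot(X.T).toarray()
    P = pagerank(A)
    # argsort is ascending, so the last n indices are the top-scoring sentences
    idx = P.argsort()[-n:]
    return [sentences[i] for i in idx]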
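
The matrix `A` in `textrank` holds pairwise sentence similarities. A stripped-down version of that step is sketched below, assuming raw word counts in place of TF-IDF weights and whitespace tokenization; the `cosine` helper and the three short sentences are hypothetical and only meant to show the shape of the result.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [
    "vaccination drive expanded",
    "vaccination drive slower than expected",
    "daily case count declining",
]
bags = [Counter(s.split()) for s in sentences]
# A[i][j] is the similarity between sentence i and sentence j
A = [[cosine(x, y) for y in bags] for x in bags]
for row in A:
    print([round(v, 3) for v in row])
```

The result is symmetric with ones on the diagonal, and sentences that share no words get a similarity of 0, so they contribute nothing to each other's rank.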

In [5]:
# Example usage
text = """India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.
India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.
The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.
India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.
In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

summary = textrank(text)
print(summary)

["The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.", 'The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.', "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people."]


In [7]:
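rouge = Rouge()
# textrank() returns a list of sentences, while Rouge.get_scores() requires the
# hypothesis and the reference to have the same type, so join the list first
summary_text = " ".join(summary)
# The full article serves as the reference text
scores = rouge.get_scores(summary_text, text)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))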

In [9]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summary):
    # Split the text into sentences using the '.' character as a separator
    sentences = summary.split('.')

    # Convert each non-empty sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences if sentence.strip()]

    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Drop periods so the tokens match the references, which are split on '.',
    # then split into words using whitespace as a separator
    words = paragraph.replace('.', ' ').split()
    return words

# Each sentence of the article is used as one tokenized reference
reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
# textrank() returns a list of sentences, so join it before tokenizing
predicted_paragraph = " ".join(summary)
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)

In [None]:
print("BLEU Score: {:.3f}".format(score))
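
BLEU combines clipped n-gram precision with a brevity penalty. The sketch below is a simplified, unigram-only version against a single reference, assuming whitespace tokenization; NLTK's `sentence_bleu` uses n-grams up to length 4 by default and accepts several references, so its numbers will differ. The function name `bleu1` and the example sentences are hypothetical.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Unigram BLEU: clipped precision multiplied by the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    # Clip each candidate word count by its count in the reference
    clipped = sum((Counter(cand) & Counter(ref)).values())
    precision = clipped / len(cand)
    # Penalize candidates that are shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("the cat sat", "the cat sat on the mat"), 3))
```

Here every candidate word appears in the reference, so the precision is 1 and the score is determined entirely by the brevity penalty. In this single-reference setting, a short summary compared against a much longer text is penalized heavily by that term.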