<a href="https://colab.research.google.com/github/AYUSH-002/Text-Summarization/blob/master/Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Text Summarization with spaCy and Transformers**

**Installing Required Libraries and Models**

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm
!pip install transformers

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**Importing Libraries**

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest
from transformers import pipeline

**Text Preprocessing Function**

In [None]:
def preprocess_text(text):
    # Build normalized word frequencies for the content words in `text`.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)

    # Treat the newline as punctuation without mutating the imported constant.
    punct = punctuation + '\n'

    word_frequencies = {}
    for token in doc:
        # Lowercase the key so that later lookups by lowercased text match.
        word = token.text.lower()
        if word in STOP_WORDS or word in punct or word.isspace():
            continue
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

    # Scale counts to [0, 1]; skip when the text has no content words.
    if word_frequencies:
        max_frequency = max(word_frequencies.values())
        for word in word_frequencies:
            word_frequencies[word] /= max_frequency

    return word_frequencies, doc
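
The frequency step above can be tried without loading a spaCy model. The following is a minimal sketch, assuming a small hypothetical stop-word set (`DEMO_STOP_WORDS`) and a regex tokenizer in place of spaCy's `STOP_WORDS` and tokenizer; the helper name `normalized_frequencies` is illustrative and not part of the notebook.

```python
from collections import Counter
import re

# Hypothetical stop-word list for illustration; the notebook uses spaCy's STOP_WORDS.
DEMO_STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def normalized_frequencies(text, stop_words=DEMO_STOP_WORDS):
    """Count lowercased content words and scale counts by the largest count."""
    # A simple regex tokenizer stands in for spaCy's tokenizer here.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

freqs = normalized_frequencies("The captain led the team. The team won, and the captain smiled.")
print(freqs)
```

Dividing by the largest count keeps every score in the range 0 to 1, so a sentence score is not dominated by the raw length of the document.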

**Function for Generating Extractive Summary**

In [None]:
def generate_extractive_summary(word_frequencies, doc, num_sentences):
    sentence_tokens = list(doc.sents)
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            # Score each sentence by summing the frequencies of its known words.
            frequency = word_frequencies.get(word.text.lower())
            if frequency is not None:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + frequency

    # Pick the highest-scoring sentences (returned in score order).
    summary = nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
    final_summary = [sent.text.strip() for sent in summary]
    return ' '.join(final_summary)
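
The function above returns sentences in score order. A possible variation, sketched below on plain strings, selects the top sentences with `heapq.nlargest` and then restores their original order so the summary reads in sequence. The helper `top_sentences` and the sample scores are assumptions made for illustration only.

```python
from heapq import nlargest

def top_sentences(sentences, word_scores, k):
    """Return the k highest-scoring sentences, restored to their original order."""
    def score(index):
        # Strip basic punctuation, then sum the scores of the known words.
        words = sentences[index].lower().replace(".", " ").replace(",", " ").split()
        return sum(word_scores.get(w, 0.0) for w in words)
    best = nlargest(k, range(len(sentences)), key=score)
    return [sentences[i] for i in sorted(best)]

sentences = ["The captain led the team.", "It rained.", "The team won the cup."]
word_scores = {"captain": 1.0, "team": 1.0, "led": 0.5, "won": 0.5, "cup": 0.5}
picked = top_sentences(sentences, word_scores, 2)
print(picked)
```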


**Function for Generating Abstractive Summary**

In [None]:
def generate_abstractive_summary(text):
    summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
    summary = summarizer(text, max_length=350, min_length=15, do_sample=False)
    return summary[0]['summary_text']
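
The `max_length` and `min_length` values above are fixed, which may be too long for short inputs. One way to pick them relative to the input size is sketched below; the helper `summary_length_bounds`, its 30% ratio, and its defaults are assumptions for illustration, not values taken from the Transformers library.

```python
def summary_length_bounds(input_tokens, ratio=0.3, floor=15, ceiling=350):
    """Derive (min_length, max_length) from the input size, in tokens."""
    # Aim for a fraction of the input, capped at `ceiling` and kept above `floor`.
    max_length = max(floor + 1, min(ceiling, int(input_tokens * ratio)))
    # min_length must stay strictly below max_length.
    min_length = min(floor, max_length - 1)
    return min_length, max_length

for size in (20, 500, 5000):
    print(size, summary_length_bounds(size))
```

The returned pair could then be passed to the summarizer call as `min_length` and `max_length`.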



**Function for Generating Multi-Document Summary**

In [None]:
def generate_multi_document_summary(documents, max_length=200, min_length=50):
    concatenated_docs = " ".join(documents)
    summarizer = pipeline("summarization")
    # truncation=True keeps the joined text within the model's input limit.
    summary = summarizer(concatenated_docs, max_length=max_length, min_length=min_length, do_sample=False, truncation=True)
    return summary[0]['summary_text']
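
Joining many documents into one string can exceed the input limit of the summarization model. A common workaround is to summarize the text in chunks and join the partial summaries. The sketch below assumes a word count as a rough proxy for the token count and takes the summarizer as a callable; `chunk_words`, `summarize_in_chunks`, and the stand-in `first_two_words` are hypothetical names.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_in_chunks(documents, summarize, max_words=400):
    """Summarize each chunk with the supplied callable and join the partial summaries."""
    partial = []
    for document in documents:
        for chunk in chunk_words(document, max_words):
            partial.append(summarize(chunk))
    return " ".join(partial)

# Stand-in summarizer for demonstration: keeps the first two words of a chunk.
def first_two_words(chunk):
    return " ".join(chunk.split()[:2])

demo = summarize_in_chunks(["one two three four five six seven", "alpha beta"], first_two_words, max_words=4)
print(demo)
```

In the notebook, the `summarize` argument could be a small wrapper that calls the pipeline on one chunk and returns its `summary_text`.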


**Function for Generating Generic Summary**



In [None]:
def generate_generic_summary(text):
    summarizer = pipeline("summarization")
    summary = summarizer(text, max_length=100, min_length=5, do_sample=False)
    return summary[0]['summary_text']


**Function for Generating Domain-specific Summary**

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

def generate_domain_specific_summary(text, max_length=1024, num_beams=4):
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

    # Truncate the input to BART's 1024-token limit instead of repeating the text.
    inputs = tokenizer(text, max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, num_beams=num_beams, max_length=max_length, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)

    return summary

**Function for Generating Query-based Summary**

In [None]:
def generate_query_based_summary(text, query, max_length=300, min_length=50):
    # Only the parsed document is needed here; the frequencies are unused.
    _, doc = preprocess_text(text)

    query_tokens = query.lower().split()
    relevant_sentences = []
    for sentence in doc.sents:
        sentence_text = sentence.text.lower()
        if all(word in sentence_text for word in query_tokens):
            relevant_sentences.append(sentence.text.strip())

    if not relevant_sentences:
        return "No relevant sentences found for the given query."

    # Sentences already carry their own end punctuation, so join with a space.
    relevant_text = ' '.join(relevant_sentences)

    summarizer = pipeline("summarization")
    summary = summarizer(relevant_text, max_length=max_length, min_length=min_length, do_sample=False)

    return summary[0]['summary_text']
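
The filter above uses substring matching, so a query word can match inside a longer word. A stricter alternative, sketched below under the assumption that whole-word matching is what the user wants, compares the query against the set of words in each sentence. The helper `matches_query` and its `require_all` flag are hypothetical.

```python
import re

def matches_query(sentence, query, require_all=True):
    """Check whether query words occur in the sentence as whole words, ignoring case."""
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    query_words = re.findall(r"\w+", query.lower())
    if not query_words:
        return False
    # all() requires every query word; any() accepts a single match.
    check = all if require_all else any
    return check(word in sentence_words for word in query_words)

sentences = [
    "Dhoni made his international debut for India in 2004.",
    "He is known for his down-to-earth nature.",
]
hits = [s for s in sentences if matches_query(s, "india debut")]
print(hits)
```

With this check, the query `art` no longer matches the second sentence through the word "earth".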


**Main Function for Text Summarization**

In [None]:
def main():
    text = """
    MS Dhoni, also known as Mahendra Singh Dhoni, is one of the most iconic and successful cricketers to have ever represented India. Born on July 7, 1981, in Ranchi, Jharkhand, Dhoni rose from humble beginnings to become one of the most revered figures in Indian cricket history. Fondly called "Captain Cool" by fans and teammates alike, Dhoni's calm demeanor, astute leadership, and remarkable cricketing skills have earned him immense respect and admiration both on and off the field.
Dhoni's journey to cricketing stardom is a testament to his perseverance and determination. Coming from a small town with limited resources, Dhoni's rise through the ranks of Indian cricket was nothing short of extraordinary. He made his international debut for India in 2004 and quickly established himself as a talented wicketkeeper-batsman. However, it was his appointment as the captain of the Indian cricket team in 2007 that marked the beginning of a new era in Indian cricket.
Under Dhoni's captaincy, the Indian cricket team achieved unprecedented success, winning major tournaments including the ICC T20 World Cup in 2007, the ICC Cricket World Cup in 2011, and the ICC Champions Trophy in 2013. Dhoni's leadership style, characterized by composure under pressure, innovative tactics, and a strong belief in his players, played a pivotal role in India's triumphs on the global stage. Beyond his achievements as a captain, Dhoni's batting prowess and lightning-fast reflexes behind the stumps have earned him accolades from cricketing legends and fans worldwide.
Off the field, Dhoni's humility, integrity, and philanthropic efforts have further endeared him to millions of fans. He is known for his down-to-earth nature, modesty, and dedication to his craft. Dhoni's contributions to Indian cricket extend beyond his on-field exploits; he has inspired an entire generation of aspiring cricketers and continues to be a role model for millions across the country. As he gracefully transitioned from international cricket to other pursuits, including the Indian Premier League (IPL) and business ventures, Dhoni's legacy as one of India's greatest cricketing icons remains etched in the annals of history, serving as an inspiration for generations to come.
    """
    text1 = """
    Mahendra Singh Dhoni or MS Dhoni is one of the best finishers in the game of cricket and one of the best captains for the Indian national team. Dhoni took over the ODI captaincy from Rahul Dravid in 2007 after debuting in 2004. He holds multiple captaincy records such as most wins for an Indian captain in Tests and ODIs, and most back-to-back wins by an Indian captain in ODIs. His biggest triumphs as captain have been the 2007 World T20, 2010 Asia Cup, 2011 World Cup and 2013 Champions Trophy. His heroic inning of 91 from 79 balls in the 2011 World Cup is one of his biggest highlights as a batsman and one which rose him to fame for his ability to stay cool under pressure thus earning him the moniker of \'Captain Cool\'. In the IPL, he\'s led Chennai Super Kings (CSK) to titles in 2010 and 2011. Dhoni retired from Test cricket in late 2014 to hand Virat Kohli the baton. He is known for his love for motorcycles and away from cricket, co-owns the Indian Super League (ISL) side Chennaiyin FC. In 2016, a movie was made on his life called \"MS Dhoni: The Untold Story\" in a perfect adaption of someone who has lived a rags-to-riches tale from being a TTE in 2001.
    """

    print("Choose the type of summary you want:")
    print("1. Based on input type")
    print("2. Based on purpose")
    print("3. Based on output type")

    summary_type = int(input("Enter your choice (1/2/3): "))

    if summary_type == 1:
        print("Choose the input type:")
        print("1. Single Document")
        print("2. Multi-Document")
        input_type = int(input("Enter your choice (1/2): "))

        if input_type == 1:
            word_frequencies, doc = preprocess_text(text)
            num_sentences = int(input("Enter the number of sentences for the summary: "))
            summary = generate_extractive_summary(word_frequencies, doc, num_sentences)
            print("\nSummary:")
            print(summary)
        elif input_type == 2:
            documents = [text, text1]
            summary = generate_multi_document_summary(documents)
            print("\nMulti-Document Summary:")
            print(summary)

    elif summary_type == 2:
        print("Choose the purpose of the summary:")
        print("1. Generic")
        print("2. Domain-specific")
        print("3. Query-based")
        purpose = int(input("Enter your choice (1/2/3): "))

        if purpose == 1:
            summary = generate_generic_summary(text)
            print("\nGeneric Summary:")
            print(summary)
        elif purpose == 2:
            summary = generate_domain_specific_summary(text)
            print("\nDomain-specific Summary:")
            print(summary)
        elif purpose == 3:
            query = input("Enter your query: ")
            summary = generate_query_based_summary(text, query)
            print("\nQuery-based Summary:")
            print(summary)

    elif summary_type == 3:
        print("Choose the output type:")
        print("1. Extractive")
        print("2. Abstractive")
        output_type = int(input("Enter your choice (1/2): "))

        if output_type == 1:
            word_frequencies, doc = preprocess_text(text)
            num_sentences = int(input("Enter the number of sentences for the summary: "))
            summary = generate_extractive_summary(word_frequencies, doc, num_sentences)
            print("\nExtractive Summary:")
            print(summary)
        elif output_type == 2:
            summary = generate_abstractive_summary(text)
            print("\nAbstractive Summary:")
            print(summary)

    else:
        print("Invalid choice.")

if __name__ == "__main__":
    main()


Choose the type of summary you want:
1. Based on input type
2. Based on purpose
3. Based on output type
Enter your choice (1/2/3): 2
Choose the purpose of the summary:
1. Generic
2. Domain-specific
3. Query-based
Enter your choice (1/2/3): 3
Enter your query: india


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Query-based Summary:
 Mahendra Singh Dhoni is one of the most iconic and successful cricketers to have ever represented India . Born in a small town with limited resources, Dhoni rose from humble beginnings to become one of India's most famous cricketer . His appointment as captain of the Indian cricket team in 2007 marked the beginning of a new era in Indian cricket . Under Dhoni's captaincy, the Indian team achieved unprecedented success, winning major tournaments including the ICC T20 World Cup in 2007 and the ICC Cricket World Cup .
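
The menu in `main()` converts each reply with `int(input(...))`, which raises `ValueError` on non-numeric input, and its inner menus print nothing for an out-of-range choice. A possible alternative is sketched below: a validated prompt plus a dictionary that maps choices to handlers. The names `read_choice`, `PURPOSE_HANDLERS`, and `run_purpose_menu` are hypothetical, and the lambdas stand in for the notebook's summary functions.

```python
def read_choice(prompt, valid, reader=input):
    """Ask until the reply is one of the valid integer choices."""
    while True:
        reply = reader(prompt).strip()
        if reply.isdigit() and int(reply) in valid:
            return int(reply)
        print(f"Please enter one of: {sorted(valid)}")

# Hypothetical handlers standing in for the notebook's summary functions.
PURPOSE_HANDLERS = {
    1: lambda text: f"generic summary of {len(text.split())} words",
    2: lambda text: f"domain-specific summary of {len(text.split())} words",
}

def run_purpose_menu(text, reader=input):
    """Read a validated choice and dispatch to the matching handler."""
    choice = read_choice("Enter your choice (1/2): ", PURPOSE_HANDLERS.keys(), reader)
    return PURPOSE_HANDLERS[choice](text)

# Scripted replies make the example runnable without a keyboard.
replies = iter(["x", "7", "2"])
result = run_purpose_menu("a short sample text", reader=lambda prompt: next(replies))
print(result)
```

Passing the reader as a parameter keeps the menu testable, since scripted replies can replace keyboard input.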
