# IT - 550 Information Retrieval Assignment - 10
### Student ID - 202011032
### Text Summarization using TextRank and LexRank
- Library - sumy
- Dataset - BBC Business News
- Evaluation metrics - ROUGE-1 Score, ROUGE-2 Score
- Sentences Count - 10, 15, 20, 25

## Importing necessary libraries and initializing paths and configurations

In [1]:
import os
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.evaluation.rouge import rouge_1, rouge_2
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# Initialize paths and configurations
dataset_path = os.path.join("BBC Business News", "News Articles", "business")
summaries_path = os.path.join("BBC Business News", "Summaries", "business")

language = "english"
sentence_counts = (10, 15, 20, 25)

## Defining functions for experimenting with TextRank and LexRank summarizers
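
TextRank and LexRank both score sentences by how central they are in a sentence-similarity graph. The sketch below illustrates that general idea only: it is not sumy's implementation, and the helper name `rank_sentences`, the Jaccard word-overlap similarity, and the damping value of 0.85 are assumptions chosen for illustration.

```python
def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style centrality over word overlap (illustrative only)."""
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)

    # Similarity between two sentences = Jaccard overlap of their word sets (no self-links)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and words[i] | words[j]:
                sim[i][j] = len(words[i] & words[j]) / len(words[i] | words[j])

    # Total outgoing weight per sentence, used to normalize each row
    out_weight = [sum(row) for row in sim]

    # Power iteration: each sentence receives score from the sentences similar to it
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Hypothetical sentences, not taken from the dataset
sentences = [
    "Profits rose sharply at the bank",
    "The bank said profits rose on higher lending",
    "Weather was mild in the north",
]
scores = rank_sentences(sentences)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

A summary of `k` sentences would then keep the `k` highest-scoring sentences, which is the role `sentence_count` plays in the functions that follow.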

In [5]:
# Summarize every document in the dataset with TextRank or LexRank
def auto_textsummarize_dataset(dataset_path, language, sentence_count, text_or_lex='text'):
    # List that will hold the generated summary (a list of sentences) for each document
    summary_sentences_list = []

    # Initialize the stemmer and stop words for the language, and the summarizer selected by text_or_lex
    stemmer = Stemmer(language)
    summarizer = TextRankSummarizer(stemmer) if text_or_lex == 'text' else LexRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(language)

    # Sort the paths: os.walk yields files in arbitrary order, and the evaluation
    # pairs summaries with reference files by position
    doc_paths = sorted(os.path.join(dp, fname) for dp, df, filepaths in os.walk(dataset_path) for fname in filepaths)

    for doc_path in doc_paths:
        # Parse and tokenize the text in the file
        doc_parser = PlaintextParser.from_file(doc_path, Tokenizer(language))

        # Summarize the document with the given sentence count and store the sentences
        summary_sentences_list.append(list(summarizer(doc_parser.document, sentence_count)))

    return summary_sentences_list

In [3]:
# Evaluate the generated summaries against the reference summaries with ROUGE-1 and ROUGE-2
def evaluate_rouge_1_2(dataset_path, ref_summaries_path, language, sentence_count, **kwargs):
    summ_sent_list = auto_textsummarize_dataset(dataset_path, language, sentence_count, **kwargs)

    # Per-document scores, averaged after the loop
    rouge_1_scores = []
    rouge_2_scores = []

    # Sorted so that reference summaries line up with the sorted document paths
    ref_summ_paths = sorted(os.path.join(dp, fname)
                            for dp, df, filepaths in os.walk(ref_summaries_path)
                            for fname in filepaths)

    for ref_summ_path, summ_sent in zip(ref_summ_paths, summ_sent_list):
        ref_summ_parser = PlaintextParser.from_file(ref_summ_path, Tokenizer(language))
        ref_sentences = ref_summ_parser.document.sentences

        rouge_1_scores.append(rouge_1(summ_sent, ref_sentences))
        rouge_2_scores.append(rouge_2(summ_sent, ref_sentences))

    avg_rouge_1_score = sum(rouge_1_scores) / len(rouge_1_scores)
    avg_rouge_2_score = sum(rouge_2_scores) / len(rouge_2_scores)

    print(f"For sentence count {sentence_count}:\n")
    print(f"Average Rouge 1 Score: {avg_rouge_1_score}")
    print(f"Average Rouge 2 Score: {avg_rouge_2_score}")
    print("\n")
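
The evaluation pairs each generated summary with a reference file by position, so it depends on both directory listings coming back in the same order. One way to make the pairing explicit is to match files by name. The sketch below assumes that an article and its reference summary share a filename in their respective folders; `pair_by_filename` is a hypothetical helper and is not part of the notebook.

```python
import os
import tempfile


def pair_by_filename(articles_dir, summaries_dir):
    """Return sorted (article_path, summary_path) pairs for names present in both folders."""
    article_names = set(os.listdir(articles_dir))
    summary_names = set(os.listdir(summaries_dir))
    return [
        (os.path.join(articles_dir, name), os.path.join(summaries_dir, name))
        for name in sorted(article_names & summary_names)
    ]


# Demonstration on throwaway folders with made-up filenames
with tempfile.TemporaryDirectory() as root:
    articles = os.path.join(root, "articles")
    summaries = os.path.join(root, "summaries")
    os.makedirs(articles)
    os.makedirs(summaries)

    for name in ("003.txt", "001.txt", "002.txt"):
        open(os.path.join(articles, name), "w").close()
    for name in ("002.txt", "001.txt"):
        open(os.path.join(summaries, name), "w").close()

    pairs = pair_by_filename(articles, summaries)
    paired_names = [os.path.basename(article) for article, _ in pairs]
    print(paired_names)
```

Articles without a matching reference summary are skipped rather than silently shifted onto the wrong reference, which is the failure mode positional pairing cannot detect.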

## Generating summaries and getting Rouge scores
- Summaries are generated for every article in the dataset using the TextRank and LexRank summarizers
- For each sentence count (10, 15, 20, 25), the average ROUGE-1 and ROUGE-2 scores over all articles are calculated
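
To make the metric concrete, here is a minimal sketch of ROUGE-N computed as recall over sets of word n-grams. `word_ngrams` and `rouge_n_recall` are hypothetical helpers written for illustration, with whitespace tokenization as a simplifying assumption; the tokenization and counting inside sumy's `rouge_1` and `rouge_2` may differ.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def rouge_n_recall(candidate, reference, n):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    reference_ngrams = word_ngrams(reference, n)
    if not reference_ngrams:
        raise ValueError("reference has no n-grams of this size")
    overlap = word_ngrams(candidate, n) & reference_ngrams
    return len(overlap) / len(reference_ngrams)


# Made-up example texts, not taken from the dataset
reference = "the bank reported record profits"
candidate = "the bank reported strong profits this year"

rouge_1_example = rouge_n_recall(candidate, reference, 1)
rouge_2_example = rouge_n_recall(candidate, reference, 2)
print(rouge_1_example, rouge_2_example)
```

Under a recall-style definition like this one, a longer summary can only cover more of the reference n-grams, so scores tend to rise as the sentence count grows.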

### Experimenting with TextRank Summarizer

In [6]:
print("TextRank Summarizer:\n")
for sentence_count in sentence_counts:
    evaluate_rouge_1_2(dataset_path, summaries_path, language, sentence_count, text_or_lex='text')

TextRank Summarizer:

For sentence count 10:

Average Rouge 1 Score: 0.9362589465827755
Average Rouge 2 Score: 0.8757167024716055


For sentence count 15:

Average Rouge 1 Score: 0.9851731870285183
Average Rouge 2 Score: 0.9386496636103454


For sentence count 20:

Average Rouge 1 Score: 0.9949702892728245
Average Rouge 2 Score: 0.9512304456123413


For sentence count 25:

Average Rouge 1 Score: 0.9984427176241695
Average Rouge 2 Score: 0.9559380390789242




### Experimenting with LexRank Summarizer

In [7]:
print("LexRank Summarizer:\n")
for sentence_count in sentence_counts:
    evaluate_rouge_1_2(dataset_path, summaries_path, language, sentence_count, text_or_lex='lex')

LexRank Summarizer:

For sentence count 10:

Average Rouge 1 Score: 0.8040694784872598
Average Rouge 2 Score: 0.6988370935573947


For sentence count 15:

Average Rouge 1 Score: 0.9335514662925445
Average Rouge 2 Score: 0.8672927324190142


For sentence count 20:

Average Rouge 1 Score: 0.9770305291537181
Average Rouge 2 Score: 0.9267516799191188


For sentence count 25:

Average Rouge 1 Score: 0.9921044660033763
Average Rouge 2 Score: 0.9472765991587428


