# Extractive Summarization Algorithm

This algorithm aims to reduce the document size by a certain percentage (e.g., 10%, 20%, 30%) through an extractive approach.

## Steps:

1. **Topic Identification:**
   Identify the topic of the text to be summarized. The topic can be represented as a set of NASARI vectors:
   

2. **Context Creation:**
Create the context by collecting the vectors of terms. This step can be repeated, gradually incorporating the contribution of associated terms at each round.

3. **Sentence Retention:**
Retain paragraphs whose sentences contain the most salient terms. Use the Weighted Overlap (WO) metric to determine the salience of terms in paragraphs. The WO metric is calculated as WO(v1, v2).

4. **Paragraph Reranking:**
Re-rank the retained paragraphs' weight by applying at least one of the following approaches:
- Title-based approach
- Cue-based approach
- Phrase-based approach
- Cohesion-based approach



In [6]:
import os
import pandas as pd
from nltk.corpus import stopwords
import string
from collections import namedtuple
from itertools import product


In [7]:
# Global constants
PUNCTUATION = set(string.punctuation)
compression_rate = [10, 20, 30]


# Extractive Summarization Functions Documentation

This documentation provides an overview and explanation of the various functions used in the extractive summarization process.

## 1. `create_nasari_dict()`

This function creates a Nasari dictionary from a given file containing NASARI vectors. It reads the NASARI file, processes each row to extract the key and corresponding vector values, and constructs a dictionary mapping keys (words) to Nasari vectors.

## 2. `read_document(document)`

This function reads a document file and extracts paragraphs from it. It skips comment lines and short paragraphs to assemble a list of extracted paragraphs, which will be used for further processing.

## 3. `get_topic_words(paragraphs)`

This function extracts topic words from the title and selected paragraphs. It combines words from the specified indices (title, first paragraph, last paragraph) to form a set of topic words that capture the main theme of the document.

## 4. `get_context(topic_words, nasari_dict)`

This function generates a Nasari context for a given set of topic words. It retrieves the Nasari vectors corresponding to the topic words from the Nasari dictionary and constructs a set of Nasari vectors representing the context.

## 5. `rank(word, vector)`

This function ranks a word within a Nasari vector. Given a word and a Nasari vector, it returns the rank of the word in the vector. The rank indicates the position of the word in the vector.

## 6. `weighted_overlap(v1, v2)`

This function calculates the weighted overlap between two Nasari vectors. It takes two Nasari vectors and computes their weighted overlap score, which indicates the similarity between the vectors based on the ranks of overlapping words.

## 7. `rank_paragraphs(paragraphs, context, dict_nasari)`

This function ranks paragraphs based on their similarity to a given context represented by Nasari vectors. It computes a rank score for each paragraph by calculating the weighted overlap between the paragraph and the context. Paragraphs are then ranked based on their rank scores.

## 8. `summarize(document, ranks, title, paragraphs, c_rate)`

This function generates a summary for a document based on the provided ranks and compression rate. It writes the summary to a file, considering the compression rate and the accumulated word count of the selected paragraphs. Additionally, it calculates BLEU and ROUGE metrics to evaluate the quality of the generated summary.

## 9. `bleu_rouge(retrieved_document, c_rate, document)`

This function calculates BLEU and ROUGE metrics to assess the quality of the generated summary. It compares the set of retrieved words in the summary with the set of relevant words from the original document to calculate precision and recall scores.


In [8]:
# Function to create a Nasari dictionary from a given file
def create_nasari_dict():
    nasari_dict = {}
    nasari_file_path = "./dd-small-nasari-15.txt"
    
    with open(nasari_file_path, encoding="utf8") as file:
        for row in file:
            row_parts = row.split(";")
            key = row_parts[1].lower()
            
            if len(row_parts) == 17:
                if key in nasari_dict:
                    key = row_parts[2].lower().split(", ")[0]
                value = row_parts[3:16] + [row_parts[16].strip('\n')]
            else:
                value = row_parts[2:15] + [row_parts[15].strip('\n')]
            
            nasari_dict[key] = value
    
    return nasari_dict

# Function to read and extract paragraphs from a document file
def read_document(document):
    """
    Read and extract paragraphs from a document file.

    Args:
        document: The name of the document file.

    Returns:
        The list of extracted paragraphs.
    """
    paragraphs = []
    with open('../obj/{}'.format(document), encoding='utf-8') as file:
        for paragraph in file:
            if not paragraph.startswith('#') and len(paragraph) > 1:
                paragraphs.append(paragraph)
    return paragraphs

# Function to extract the topic words from title and paragraphs
def get_topic_words(paragraphs):
    """
    Extract the topic words from the title and paragraphs.

    Args:
        paragraphs: The list of paragraphs.

    Returns:
        The set of topic words.
    """
# Function to extract the topic words from title and paragraphs
    topic_words = set()
    indices = [0, 1, -1]
    
    for index in indices:
        paragraph = paragraphs[index].lower().translate(str.maketrans('', '', ''.join(PUNCTUATION)))
        words = [word for word in paragraph.split() if word.isalpha() and word not in stop_words]
        topic_words.update(words)
    
    return topic_words

# Function to get the Nasari context for a given topic
def get_context(topic_words, nasari_dict):
    """
    Get the Nasari context for a given topic.

    Args:
        topic_words: The set of topic words.
        nasari_dict: The Nasari dictionary.

    Returns:
        The set of Nasari vectors representing the context.
    """
    context = set()
    for word in topic_words:
        if word in nasari_dict:
            context.update(nasari_dict[word])
    return context


# Function to rank a word within a Nasari vector
def rank(word, vector):
    """
    Rank a word within a Nasari vector.

    Args:
        word: The word to rank.
        vector: The Nasari vector.

    Returns:
        The rank of the word in the vector.
    """
    return vector.index(word) + 1

# Function to calculate weighted overlap between two Nasari vectors
def weighted_overlap(v1, v2):
    """
    Calculates the weighted overlap between two Nasari vectors.
    
    Parameters:
    v1 (list): The first Nasari vector.
    v2 (list): The second Nasari vector.
    
    Returns:
    float: The weighted overlap score between v1 and v2.
    """
    numerator, denominator = 0, 0
    i = 1
    overlap = set(v1).intersection(set(v2))
    
    for word in overlap:
        numerator += pow(rank(word, v1) + rank(word, v2), -1)
        denominator += pow(2 * i, -1)
        i += 1
        
    if denominator == 0:
        return 0
    else:
        return numerator / denominator

# Function to rank paragraphs based on their similarity to a context
def rank_paragraphs(paragraphs, context,dict_nasari):
    """
    Ranks paragraphs based on their similarity to a given context.
    
    Parameters:
    paragraphs (list): List of paragraphs to be ranked.
    context (set): Set of Nasari vectors representing the context.
    
    Returns:
    list: A list of named tuples containing index, rank score, and paragraph.
    """
    ranks = []
    RankedParagraph = namedtuple('RankedParagraph', ['index', 'rank_score', 'paragraph'])
    
    for index, paragraph in enumerate(paragraphs):  # Exclude the title
        rank_score = 0
        
        # Tokenize and preprocess the paragraph
        tokenized_paragraph = set(paragraph.lower().translate(str.maketrans('', '', ''.join(PUNCTUATION))).split()) - stop_words
        
        for pair in product(set([word for word in tokenized_paragraph]), context):
            if pair[0] in dict_nasari and len(pair[1]) > 0 and pair[1][0] in dict_nasari:
                # Calculate the weighted overlap score and accumulate rank score
                rank_score += weighted_overlap(dict_nasari[pair[0]], dict_nasari[pair[1][0]])
        
        # Store the ranked paragraph information
        ranks.append(RankedParagraph(index, rank_score, paragraph))
    
    # Return the ranked paragraphs sorted by rank score in descending order
    return sorted(ranks, key=lambda rp: rp.rank_score, reverse=True)


# ... (existing code)

# Function to generate a summary for a document
def summarize(document, ranks, title, paragraphs, c_rate):
    """
    Generate a summary for a document.

    Args:
        document: The name of the document.
        ranks: The list of ranked paragraphs.
        title: The title of the document.
        paragraphs: The list of paragraphs.
        c_rate: The compression rate.

    Returns:
        None.
    """    
    file = open('../summaries/{}_{}'.format(c_rate, document), 'w+')
    retrieved_document = set()
    file.write(title + '\n')
    document_words = len(" ".join(paragraphs).split())  # Calculate total words in the document
    max_words = int((document_words * (100 - c_rate)) / 100)
    words = 0
    
    for elem in ranks:
        words += len(paragraphs[elem[0]].split())
        
        if words <= max_words:
            file.write(paragraphs[elem[0]])
            for word in paragraphs[elem[0]].translate(str.maketrans('', '', ''.join(PUNCTUATION))).split():
                retrieved_document.add(word)
        else:
            break
    
    file.close()
    bleu_rouge(retrieved_document, c_rate, document)

# Function to calculate BLEU and ROUGE metrics
def bleu_rouge(retrieved_document, c_rate, document):
    """
    Calculate BLEU and ROUGE metrics.

    Args:
        retrieved_document: The set of retrieved words in the document.
        c_rate: The compression rate.
        document: The name of the document.

    Returns:
        None.
    """    
    file =  open('../target/{}_{}'.format(c_rate, document))
    paragraphs = list()
    relevant_document = set()
    
    for paragraph in file:
        if not paragraph.startswith('#') and len(paragraph) > 1:
            paragraphs.append(paragraph)
    
    file.close()
    
    for paragraph in paragraphs:
        for word in paragraph.translate(str.maketrans('', '', ''.join(PUNCTUATION))).split():
            relevant_document.add(word)
    
    original_document_len = len(relevant_document)
    summarized_document_len = len(retrieved_document)
    
    bleu_precision = len(relevant_document.intersection(retrieved_document)) / len(retrieved_document)
    print('\nDocument', document, ' Compression rate', c_rate)
    print('BLEU Precision:', bleu_precision)
    
    rouge_recall = len(relevant_document.intersection(retrieved_document)) / len(relevant_document)
    print('ROUGE Recall:', rouge_recall)



# Extractive Summarization Process Overview

This overview outlines the steps involved in the extractive summarization process using the provided code block. The process involves extracting key information from a collection of documents and generating concise summaries based on identified context and topic words.

## Initialization and Setup

- The process begins with the setup, where required libraries and modules are imported.
- A list of English stop words is prepared to aid in text processing.
- A Nasari dictionary is created to store word-to-Nasari-vector mappings.

## Document Summarization Loop

- A loop iterates through a set of documents in a designated directory, aiming to generate summaries for each document.

### Key Steps in the Loop:

1. **Document Extraction:**
   - Document contents are extracted into paragraphs using the `read_document()` function.

2. **Topic Identification:**
   - Topic words are extracted from the title and specific paragraphs using the `get_topic_words()` function. These words help capture the document's main theme.

3. **Title Extraction:**
   - The title of the document is extracted and removed from the list of paragraphs.

4. **Context Generation:**
   - A context is generated using the topic words and Nasari vectors through the `get_context()` function.

5. **Paragraph Ranking:**
   - The remaining paragraphs are ranked based on their similarity to the generated context using the `rank_paragraphs()` function.

6. **Summarization at Different Compression Rates:**
   - The script iterates over various compression rates, and for each rate, a summary is generated using the `summarize()` function. The generated summaries are saved for evaluation.

## Evaluation

- The generated summaries generated using https://resoomer.com/en are evaluated for quality using BLEU and ROUGE metrics. These metrics measure the precision and recall of the generated summaries compared to the original document's content.




In [10]:
stop_words = set(stopwords.words('english'))
nasari_dict = create_nasari_dict()
os.chdir("./obj/")
for root, _, files in os.walk(os.getcwd()):
    list_files = files

for document in list_files:
    list_paragraphs = read_document(document)
    topic_words = get_topic_words(list_paragraphs)
    
    title = list_paragraphs[0]
    list_paragraphs.pop(0)
    context = get_context(topic_words, nasari_dict)
    ranks = rank_paragraphs(list_paragraphs, context,nasari_dict)
    for rate in compression_rate: 
        summarize(document, ranks, title, list_paragraphs, rate)


Document Ebola-virus-disease.txt  Compression rate 10
BLEU Precision: 0.9678714859437751
ROUGE Recall: 0.9281129653401797

Document Ebola-virus-disease.txt  Compression rate 20
BLEU Precision: 0.8330871491875923
ROUGE Recall: 0.8545454545454545

Document Ebola-virus-disease.txt  Compression rate 30
BLEU Precision: 0.729235880398671
ROUGE Recall: 0.7453310696095077

Document Napoleon-wiki.txt  Compression rate 10
BLEU Precision: 0.9852941176470589
ROUGE Recall: 0.8427672955974843

Document Napoleon-wiki.txt  Compression rate 20
BLEU Precision: 0.830238726790451
ROUGE Recall: 0.839142091152815

Document Napoleon-wiki.txt  Compression rate 30
BLEU Precision: 0.7521865889212828
ROUGE Recall: 0.7588235294117647

Document Life-indoors.txt  Compression rate 10
BLEU Precision: 0.9434628975265018
ROUGE Recall: 0.8585209003215434

Document Life-indoors.txt  Compression rate 20
BLEU Precision: 0.9442231075697212
ROUGE Recall: 0.855595667870036

Document Life-indoors.txt  Compression rate 30
BLEU

# Summary Results and Potential Enhancements

The provided results showcase the performance of the extractive summarization process across various documents and compression rates. The evaluation is conducted using BLEU and ROUGE metrics to measure the precision and recall of the generated summaries compared to the original document content.

## Observations:

- The BLEU precision scores indicate the level of overlap between the generated summary and the original document, where higher scores signify greater overlap.
- The ROUGE recall scores reflect the ability of the summary to capture important information present in the document, with higher scores indicating better recall.

## Document-Specific Insights:

- For the document "Ebola-virus-disease.txt," at a compression rate of 10%, the generated summary achieves high precision and recall, indicating a well-balanced and informative summary.
- The document "Napoleon-wiki.txt" demonstrates consistently high BLEU precision scores across different compression rates, suggesting that the summary effectively captures the core content.
- The document "Life-indoors.txt" showcases notable precision at 10% compression, while there is a decrease in recall as the compression rate increases to 30%.
- In the case of "Trump-wall.txt," higher compression rates lead to a slight decrease in both BLEU precision and ROUGE recall scores, indicating potential loss of context.
- The document "Andy-Warhol.txt" demonstrates a favorable balance between BLEU precision and ROUGE recall, with improved recall at higher compression rates.

