<h1 style="text-align: center;"><b>Summary Generation Using Naive Approach</b><br></h1>

---

# <h3 style="text-align: left;"><b> INTRODUCTION

### <b>Objective</b>

The objective of this project is to develop a simplified and efficient summary generation tool based on the **Naive NLP approach**. This approach seeks to extract key information from a text file or PDF document by leveraging a technique known as **Sentence Scoring**. 

Using Sentence Scoring, this tool evaluates each sentence’s importance based on factors such as:
1. **Sentence Position**: Sentences occurring earlier in the text often introduce key ideas, and thus are weighted more heavily.
2. **Sentence Length**: Optimal-length sentences are likely to convey significant information without unnecessary detail, making them ideal for summarization.
3. **Keyword Frequency**: Sentences containing a higher concentration of topic-relevant keywords are given greater importance, as they likely reflect the primary content.

By analyzing and ranking sentences according to these attributes, the project aims to generate concise summaries that maintain the essence of the original content. This Naive Approach provides a foundation for text summarization without relying on advanced machine learning models, making it both computationally efficient and interpretable. 

Through this project, we aim to demonstrate the effectiveness of rule-based summary generation and explore its potential applications in quickly summarizing large volumes of text across various domains.


### <h5> <b> Installing Dependencies

In [1]:
# NLTK handles tokenization, stopword removal and stemming
%pip install nltk
# scikit-learn provides the TfidfVectorizer imported below
%pip install scikit-learn
# tabulate and rich display the results in a tabular format
%pip install tabulate rich
# contractions expands forms such as "don't" during preprocessing
%pip install contractions
# rouge-score is used for the ROUGE evaluation
%pip install rouge-score
# pdfminer.six is a tool for extracting information from PDF documents
%pip install pdfminer.six

# %pip install cryptography

# The Google Gemini API backs the tentative Q&A support for this project (its deep-learning extension)
%pip install google-generativeai


### <h5> <b>Fetching imports

In [2]:
import nltk

# Regex is used for Pre-Processing
import re

from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.corpus import stopwords

from nltk.stem import PorterStemmer

from nltk.tokenize import sent_tokenize, word_tokenize

from nltk.corpus import words as nltk_words


from tabulate import tabulate

from rich.console import Console

from rich.table import Table

from rich import box

from rich import print


from generateData import convert_pdf_to_txt

### <h5><b> Downloading Necessary NLTK models

In [3]:
# Punkt tokenizer models: used to break a text down into sentences and words
nltk.download("punkt")
# Stopwords corpus: a list of common words that are not useful for text processing
nltk.download("stopwords")
# Words corpus: a list of English words that can be used for spell checking and word validation
nltk.download("words")

[nltk_data] Downloading package punkt to /home/srajan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/srajan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /home/srajan/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

# <h3 style="text-align: left;"><b> METHODOLOGY

### <h5> <b> Text PreProcessing

<b><i>contractions</i></b> provides a mapping from common English contractions such as <b>aren't</b> and <b>don't</b> to their expanded forms (<b>are not</b>, <b>do not</b>). Expanding them standardizes the wording before the text is tokenized and scored.

In [4]:
import contractions
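
The idea behind contraction expansion can be sketched with the standard library alone. The mapping below is a tiny hypothetical sample written for illustration, not the table shipped with the `contractions` package, and `expand_contractions` is an assumed helper name.

```python
import re

# Hypothetical, deliberately tiny mapping; a real table covers far more forms.
CONTRACTION_MAP = {"aren't": "are not", "don't": "do not", "it's": "it is"}

# One alternation pattern that matches any key as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b"
)


def expand_contractions(text):
    """Replace every known contraction in lowercased text with its expansion."""
    return _PATTERN.sub(lambda match: CONTRACTION_MAP[match.group(0)], text)


print(expand_contractions("they aren't ready and we don't mind"))
```

The sketch only handles lowercase input, which matches the point in the pipeline where the text has already been lowercased.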

<b>preprocess_text</b> is a function that applies the preprocessing steps: removing unwanted characters (via regex), sentence and word tokenization (via NLTK), stopword removal (via NLTK), and stemming (using the Porter Stemmer algorithm).

It takes two arguments:
1. <b>file_path</b>: the path of the file to summarize <br>
2. <b>pdf=False</b>: a flag that specifies whether the provided file is a PDF. If it is, the text is extracted with <b>convert_pdf_to_txt</b>; otherwise the file is read as plain text<br>

In [5]:
def preprocess_text(file_path, pdf=False):
    if pdf:
        # If the file is a PDF, convert it to text first
        text = convert_pdf_to_txt(file_path)
    else:
        # Otherwise the file is already provided in .txt format
        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()
    # Lowercase so that later matching is case-insensitive
    text = text.lower()

    # Expand contractions, e.g. "don't" -> "do not"
    text = contractions.fix(text)

    # Format words and remove unwanted characters
    # Remove each URL together with the rest of its line and the line breaks that follow
    text = re.sub(r"https?:\/\/.*[\r\n]*", "", text, flags=re.MULTILINE)
    # Replace the opening of any HTML link with a space
    text = re.sub(r"\<a href", " ", text)
    # Remove the HTML entity for ampersand
    text = re.sub(r"&amp;", "", text)
    # Replace HTML line breaks with a space; this runs before "/" is stripped below,
    # otherwise "<br />" could never match
    text = re.sub(r"<br />", " ", text)
    # Replace special characters with a space
    text = re.sub(r'[_"\-;%()|+&=*,!?:#$@\[\]/]', " ", text)
    # Replace any apostrophe with a space
    text = re.sub(r"\'", " ", text)

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Tokenize each sentence into words
    words = [word_tokenize(sentence) for sentence in sentences]

    # Remove stopwords
    # Each row of 'words' is the list of words of one sentence, hence 'for sentence in words'
    stop_words = set(stopwords.words("english"))
    words = [
        [word for word in sentence if word.lower() not in stop_words]
        for sentence in words
    ]

    # Perform stemming
    # The Porter stemmer removes common morphological and inflectional endings as part of
    # term normalisation, e.g. 'stemming' and 'stemmed' are both reduced to 'stem'
    stemmer = PorterStemmer()
    words = [[stemmer.stem(word) for word in sentence] for sentence in words]

    return sentences, words
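
To see what the cleaning stage does to a concrete string, here is a minimal stand-alone sketch that uses only the standard library. `clean_text` is a hypothetical helper, and it deliberately swaps in a narrower URL pattern (`https?://\S+`) that removes only the URL itself, whereas the pattern in the cell above also removes the rest of the line.

```python
import re


def clean_text(text):
    """Illustrative cleaning pass: lowercase, drop URLs, tags and special characters."""
    text = text.lower()
    # Assumption: remove only the URL token, not the remainder of the line
    text = re.sub(r"https?://\S+", "", text)
    # Strip line-break tags before "/" is removed by the next substitution
    text = re.sub(r"<br\s*/?>", " ", text)
    # Replace special characters with a space
    text = re.sub(r'[_"\-;%()|+&=*,!?:#$@\[\]/]', " ", text)
    # Collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


sample = "Visit https://example.com today!<br />AI & ML: 50% faster."
print(clean_text(sample))  # visit today ai ml 50 faster.
```

The full stop survives on purpose: sentence tokenization relies on it, so it is not part of the stripped character set.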

### <h5> <b> Sentence Scoring </b>

This cell introduces the Sentence Scoring technique, a fundamental part of the Naive Approach for text summarization. Sentence scoring assigns each sentence a score based on certain predefined factors, such as its position in the text, length, and the presence of keywords. This scoring provides a quantitative measure of each sentence's relevance to the main topic, forming the basis for further analysis in summary generation.

<b> Importance of TF-IDF scoring in Naive approach:</b>
* Identify Important Words: TF-IDF helps in identifying the most important words in each sentence by considering both the frequency of the word in the sentence and its rarity across all sentences
* Weighting Terms: It assigns a weight to each term based on its importance, allowing the algorithm to prioritize sentences with higher weighted terms
* Feature Extraction: The TF-IDF matrix serves as a feature representation of the sentences, which can be used for further processing, such as scoring and ranking sentences for summarization.
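
The TF-IDF idea above can be sketched by hand. The snippet below is an illustration rather than the notebook's method: it uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and skips the per-row normalisation that scikit-learn's `TfidfVectorizer` applies by default, so the numbers differ from the real matrix. `tfidf_sentence_scores` and the sample tokens are assumptions.

```python
import math
from collections import Counter


def tfidf_sentence_scores(tokenized_sentences):
    """Return one score per sentence: the sum of tf * idf over its terms."""
    n = len(tokenized_sentences)
    # Document frequency: in how many sentences each term occurs
    df = Counter()
    for tokens in tokenized_sentences:
        df.update(set(tokens))
    # Smoothed idf: terms found in every sentence get the minimum weight of 1
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    scores = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        scores.append(sum(count * idf[term] for term, count in tf.items()))
    return scores


stemmed = [["ai", "improv", "diagnosi"], ["ai", "assist", "surgeon"], ["ai", "ai"]]
for tokens, score in zip(stemmed, tfidf_sentence_scores(stemmed)):
    print(tokens, round(score, 3))
```

The third sentence repeats a term that occurs everywhere, so it scores lower than the sentences that contain rarer terms.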

<b> Importance of Sentence Position in Naive Approach: </b>
* In many texts, especially in structured documents like articles, reports, and academic papers, the most important information is often presented early. The introduction and first few sentences typically contain key points and summaries.
* By giving higher scores to sentences that appear earlier, the algorithm leverages this common writing pattern to identify potentially important sentences.
* Sentence position scoring is a heuristic that helps in quickly identifying important sentences without deep semantic analysis. It is a simple yet effective way to prioritize certain parts of the text
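
A short sketch of the reciprocal position heuristic shows how quickly the weight decays. `linear_position_scores` is a hypothetical alternative included only for comparison; it is not part of this notebook.

```python
def position_scores(num_sentences):
    """Reciprocal decay: 1, 1/2, 1/3, ... for sentences 1, 2, 3, ..."""
    return [1 / (i + 1) for i in range(num_sentences)]


def linear_position_scores(num_sentences):
    """Hypothetical alternative: weights fall in equal steps from 1 towards 0."""
    return [1 - i / num_sentences for i in range(num_sentences)]


print(position_scores(4))         # [1.0, 0.5, 0.3333333333333333, 0.25]
print(linear_position_scores(4))  # [1.0, 0.75, 0.5, 0.25]
```

With reciprocal decay the second sentence already receives half the weight of the first, so the position signal mostly separates the opening sentences from everything else.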


<b> Importance of Sentence Length in Naive Approach: </b>
* Sentences that are too short may lack sufficient context or information, while sentences that are too long may be overly complex and harder to summarize.
* By focusing on sentences of moderate length (e.g., between 5 and 20 words), the algorithm aims to select sentences that are likely to be informative and concise.
* Sentences of moderate length are generally easier to read and understand. They strike a balance between providing enough information and maintaining clarity.
* This consideration helps in generating summaries that are both informative and easy to read.
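
The length rule can be written as a small predicate. The sketch below assumes whitespace-separated words and strict bounds, so a sentence needs 6 to 19 words to score 1; `length_score`, `min_words` and `max_words` are names chosen for illustration.

```python
def length_score(sentence, min_words=5, max_words=20):
    """Return 1 for sentences with more than min_words and fewer than max_words words."""
    word_count = len(sentence.split())
    return 1 if min_words < word_count < max_words else 0


print(length_score("AI helps."))                                    # 0
print(length_score("AI tools help doctors detect diseases early"))  # 1
print(length_score(" ".join(["word"] * 20)))                        # 0
```

Because the bounds are strict, sentences of exactly 5 or exactly 20 words are treated as too short or too long.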

In [6]:
def score_sentences(sentences, words):
    # TF-IDF scoring
    # Creating an instance of the 'TfidfVectorizer' class
    tfidf_vectorizer = TfidfVectorizer()
    # Fitting the vectorizer on the sentences lets it learn the vocabulary and the idf weights
    # Each row of the matrix corresponds to a sentence and each column to a term in the vocabulary
    tfidf_matrix = tfidf_vectorizer.fit_transform(
        [" ".join(sentence) for sentence in words]
    )

    # Sentence position importance
    # We assign a score to each sentence based on its position in the text
    # The first sentence is assigned a score of 1, the second sentence is assigned a score of 1/2, the third sentence is assigned a score of 1/3, and so on
    position_scores = [1 / (i + 1) for i in range(len(sentences))]

    # Sentence length consideration
    # Each sentence is given either a '1' or a '0' based on its length
    # A sentence with more than 5 and fewer than 20 words is assigned a '1', otherwise a '0'
    length_scores = [
        1 if 5 < len(sentence.split()) < 20 else 0 for sentence in sentences
    ]

    # Combine scores (example: simple sum of scores)
    scores = []
    for i in range(len(sentences)):
        tfidf_score = tfidf_matrix[i].sum()
        score = tfidf_score + position_scores[i] + length_scores[i]
        scores.append(score)

    return scores

### <h5> <b> Feature Extraction</b>

The extract_features function is designed to extract specific features from a list of sentences. These features include the count of proper nouns, numerical data, discourse markers, and title word presence. These features can be used to score and rank sentences for summarization.

<b>Purpose of the steps involved:</b>
* <b>Proper Nouns:</b>
    * Proper nouns often indicate important entities such as names, places, and organizations. Sentences with more proper nouns might contain significant information.
* <b>Numerical Data:</b>
    * Numerical data can be crucial in many contexts, such as statistics, dates, and quantities. Sentences with numerical data might be more informative.
* <b>Discourse Markers:</b>
    * Discourse markers help in understanding the structure and flow of the text. Sentences with these markers often summarize or conclude important points, which makes them useful sources of information for summary generation.
* <b>Title Word Presence:</b>
    * Words related to the title or main topics can indicate the relevance of a sentence to the overall content. Sentences with these words might be more central to the main ideas.

In [7]:
def extract_features(sentences):
    # features[i] holds the feature dictionary of sentences[i]
    features = []
    # An example set of title words; replace it with words from the document's own title
    title_words = {"example", "title", "words"}
    # Single-word and multi-word discourse markers are matched separately,
    # because a phrase such as "in conclusion" never equals a single token
    marker_words = {"importantly", "overall", "conclusively"}
    marker_phrases = ["in conclusion"]
    # Iterating over each sentence in the sentences array
    for sentence in sentences:
        # Tokenize once and reuse the tokens for every feature
        tokens = word_tokenize(sentence)
        # Capitalized tokens approximate proper nouns; the count is 0 for text
        # that preprocess_text has already lowercased
        proper_nouns = sum(1 for word in tokens if word.istitle())
        # Tokens made up only of digits (useful for statistical data)
        numerical_data = sum(1 for word in tokens if word.isdigit())
        # Discourse markers: single-word matches plus phrase occurrences
        lowered = sentence.lower()
        discourse_markers = sum(1 for word in tokens if word.lower() in marker_words)
        discourse_markers += sum(lowered.count(phrase) for phrase in marker_phrases)
        # Tokens that also appear in the title words
        title_word_presence = sum(1 for word in tokens if word.lower() in title_words)
        # Lastly, append the features of this sentence as a dictionary
        features.append(
            {
                "proper_nouns": proper_nouns,
                "numerical_data": numerical_data,
                "discourse_markers": discourse_markers,
                "title_word_presence": title_word_presence,
            }
        )

    return features
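
A worked example makes the surface features easier to inspect. The stand-alone sketch below replaces `word_tokenize` with a simple regular expression so that it runs without NLTK data; `sketch_features` and the sample sentence are assumptions made for illustration.

```python
import re

# Markers are matched as phrases against the lowercased sentence, so a
# two-word marker such as "in conclusion" can be counted as well.
DISCOURSE_PHRASES = ["in conclusion", "importantly", "overall", "conclusively"]


def sketch_features(sentence):
    """Count capitalized tokens, digit tokens and discourse markers in one sentence."""
    # Assumption: alphabetic runs and digit runs are good enough as tokens here
    tokens = re.findall(r"[A-Za-z]+|\d+", sentence)
    lowered = sentence.lower()
    return {
        "proper_nouns": sum(1 for token in tokens if token.istitle()),
        "numerical_data": sum(1 for token in tokens if token.isdigit()),
        "discourse_markers": sum(lowered.count(p) for p in DISCOURSE_PHRASES),
    }


print(sketch_features("In conclusion, Dr. Smith treated 42 patients in Boston."))
# {'proper_nouns': 4, 'numerical_data': 1, 'discourse_markers': 1}
```

The capitalization check counts the sentence-initial "In" alongside "Dr", "Smith" and "Boston", which shows that it is a rough proxy for proper nouns. Substring matching has a similar limitation: it would also count "overall" inside "overalls".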

### <h5> <b> Sentence Ranking</b>

This cell ranks sentences based on their combined scores, which are derived from both predefined scores and extracted features.

<b>Purpose of the steps involved:</b>
* The scores of each sentence are combined, the sentences are ranked against each other by their combined scores, and the sentences are displayed with their scores in a tabular format (via rich).

In [8]:
def rank_sentences(sentences, scores, features):
    # We create an empty list to store the combined scores of the sentences
    combined_scores = []
    for i in range(len(sentences)):
        # Sum the feature values of the sentence (every value is a count, see extract_features)
        feature_score = sum(features[i].values())
        # Combine the sentence position, sentence length and TF-IDF scores with the feature score
        combined_score = scores[i] + feature_score
        combined_scores.append((combined_score, i))
    # Sort sentences by combined score in descending order
    # key=lambda x: x[0] sorts the tuples by their first element, the combined score
    combined_scores.sort(reverse=True, key=lambda x: x[0])
    # Create a separate list of ranked sentences, used to display the results in a tabular format
    ranked_sentences = [
        (i + 1, sentences[idx], score) for i, (score, idx) in enumerate(combined_scores)
    ]
    # Display the sentences in a presentable manner
    table = Table(
        show_header=True,
        header_style="bold turquoise4",
        title_justify="center",
        title="Ranked Sentences",
        box=box.HEAVY_HEAD,
    )
    table.add_column("Rank", style="bold", width=6, justify="center")
    # rich expects the lowercase value "full" for full justification
    table.add_column("Sentence", width=80, justify="full")
    table.add_column("Score", style="bold light_slate_blue", width=10, justify="center")
    for rank, sentence, score in ranked_sentences:
        table.add_row(str(rank), sentence, f"{score:.2f}")
    console = Console()
    console.print(table)
    return combined_scores
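
Stripped of the table rendering, the ranking step reduces to a sum and a sort. The sketch below uses made-up scores and feature counts to show the arithmetic; `rank_by_combined_score` is a hypothetical name.

```python
def rank_by_combined_score(scores, features):
    """Return (combined_score, sentence_index) pairs, highest score first."""
    combined = [
        (score + sum(feature.values()), index)
        for index, (score, feature) in enumerate(zip(scores, features))
    ]
    return sorted(combined, key=lambda pair: pair[0], reverse=True)


# Made-up values: base scores and feature counts for three sentences
scores = [1.5, 2.0, 0.5]
features = [{"numerical_data": 0}, {"numerical_data": 1}, {"numerical_data": 3}]
print(rank_by_combined_score(scores, features))
# [(3.5, 2), (3.0, 1), (1.5, 0)]
```

The third sentence wins despite the lowest base score because feature counts are unbounded integers, while the position and length scores never exceed 1. Weighting or normalising the features would be one way to balance the signals.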

### <h5> <b> Summary Generation </b>

The generate_summary function creates a summary by selecting the top-ranked sentences from a list of ranked sentences. It sorts these top sentences based on their original order in the text, extracts the corresponding sentences, and joins them into a single summary string. The function returns this summary text.

In [9]:
def generate_summary(sentences, ranked_sentences, summary_length=7):
    # it takes the number of 'summary_length' sentences from the ranked_sentences array
    top_sentences = sorted(ranked_sentences[:summary_length], key=lambda x: x[1])
    # it appends those sentences to the summary array
    summary = [sentences[i] for _, i in top_sentences]
    # joins all the top ranked sentences to formulate a summary
    summary_text = " ".join(summary).replace("\n", " ").replace("  ", " ")
    return summary_text
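
The selection logic can be traced on a toy input. The sketch below repeats the same two steps (take the top-ranked pairs, then restore document order) on hypothetical data; `select_summary` is an illustrative name.

```python
def select_summary(sentences, ranked_pairs, summary_length):
    """Pick the top-ranked sentences and emit them in their original order."""
    # ranked_pairs holds (score, sentence_index) tuples, best first
    top = sorted(ranked_pairs[:summary_length], key=lambda pair: pair[1])
    return " ".join(sentences[index] for _, index in top)


sentences = ["ai helps doctors.", "it speeds up diagnosis.", "privacy remains a concern."]
ranked_pairs = [(3.5, 2), (3.0, 0), (1.5, 1)]
print(select_summary(sentences, ranked_pairs, summary_length=2))
# ai helps doctors. privacy remains a concern.
```

Restoring the original order keeps the summary readable: the highest-scoring sentence (index 2) still appears after the sentence that precedes it in the source text.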

# <h3 style="text-align: left;"><b> RESULTS

### <h5> <b> Summary Generation</b>

* This cell demonstrates the usage of the functions in our notebook.
* The output table lists every sentence extracted from the text with its rank and combined score, so the highest-scoring sentence appears first.

In [10]:
# Specifying the file path
file_path = "output.txt"

# Extracting the sentences and words from the text provided
sentences, words = preprocess_text(file_path)

# Computing the scores of the sentences from the words and the sentences
scores = score_sentences(sentences=sentences, words=words)

# Extracting the features of the sentences in the text
features = extract_features(sentences)

print("[bold]Features:[/bold]", features[0])

ranked_sentences = rank_sentences(sentences, scores, features)

summary = generate_summary(sentences, ranked_sentences, summary_length=8)

print("[bold]Summary:[/bold]", summary)

In [11]:
len(summary)

1118

### **BLEU & ROUGE Scores**

In [12]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Download necessary NLTK data
nltk.download("punkt")


def compute_bleu(reference, candidate):
    reference_tokens = nltk.word_tokenize(reference)
    candidate_tokens = nltk.word_tokenize(candidate)
    smoothie = SmoothingFunction().method1
    score = sentence_bleu(
        [reference_tokens], candidate_tokens, smoothing_function=smoothie
    )
    return score


def compute_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return scores

[nltk_data] Downloading package punkt to /home/srajan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
# Define the reference summary
reference_summary = """
Artificial Intelligence (AI) has transformed modern healthcare by enhancing diagnostic accuracy, improving surgical precision, accelerating drug discovery, and empowering patients. AI-driven tools assist medical professionals in predicting patient outcomes, identifying diseases early, personalizing treatment plans, and providing real-time support through virtual health assistants. Despite these benefits, challenges such as data privacy, security, and the need for transparent AI systems must be addressed to ensure ethical and effective implementation.
"""

# Compute BLEU score
bleu_score = compute_bleu(reference_summary, summary)
print("BLEU Score:", bleu_score)

# Compute ROUGE scores
rouge_scores = compute_rouge(reference_summary, summary)
print("ROUGE Scores:", rouge_scores)
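
For intuition about what the ROUGE-1 numbers measure, the metric can be approximated by hand as unigram overlap between the reference and the candidate. The sketch below is a simplification: it splits on whitespace and does no stemming, so its values will not match `rouge_scorer` exactly. `rouge1` and the sample strings are assumptions.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge1(
    "ai improves diagnostic accuracy",
    "ai improves surgical accuracy today",
)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
```

Recall tells how much of the reference the summary covers, while precision tells how much of the summary is supported by the reference. An extractive summary that is much longer than the reference tends to score high on recall and low on precision.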

In [None]:
# from chatbot import Chatbot

In [15]:
# question = input("How can I help you?\n")
# chatbot = Chatbot()
# response = chatbot.respond(question, summary)
# print("[bold]Question:[/bold]", question)
# print("[bold]Response:[/bold]", response)

# <h3 style="text-align: left;"><b> CONCLUSION </b>

This notebook presents a Naive NLP-based approach for text summarization, emphasizing Sentence Scoring to efficiently extract essential information from a document. The process involves three primary stages:

* <b>Sentence Scoring:</b> Each sentence is evaluated based on its position, its length, and the TF-IDF weights of its terms, allowing for a quantitative assessment of its relevance to the main content.

* <b>Feature Extraction:</b> Surface features, namely the counts of proper nouns, numerical data, discourse markers, and title words, are extracted from each sentence and added to its score.

* <b>Sentence Ranking:</b> Sentences are ranked according to their scores, with higher-ranked sentences prioritized in the final summary. This ensures that only the most informative sentences are included, providing a concise yet comprehensive summary.

The resulting summary captures the document's core ideas without requiring advanced machine learning models, making it an efficient, interpretable solution for text summarization. This method demonstrates how rule-based techniques can effectively condense large texts, offering practical applications across fields that require quick text analysis and summarization.
