   Copyright 2024 Mohan Krishna G R

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


# Extractive summary
### Rule-based approach (src/Extractive_Summarization.ipynb)

Implemented steps:
- Data preprocessing
    - Lowercasing
    - Tokenization
    - Stop-word removal
    - Lemmatization
    - POS tagging (the tags are computed but not used downstream)
- TF-IDF vectorization
    - Builds a matrix in which each sentence is represented as a vector of TF-IDF scores.
- KMeans clustering
    - The TF-IDF matrix is fed into the KMeans algorithm.
    - Each sentence is assigned a cluster label.
- Selecting representative sentences
    - For each cluster, the sentence closest to the centroid (by Euclidean distance) is selected as the representative sentence.
- Evaluating summaries
    - ROUGE scores are used as performance metrics.
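
As a rough illustration of the TF-IDF and centroid steps listed above, the sketch below rebuilds them in plain Python for a single toy cluster. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 row normalization, which is what scikit-learn's `TfidfVectorizer` applies by default. The helper names `tfidf_vectors` and `closest_to_centroid` and the three sentences are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return L2-normalised TF-IDF rows, one per sentence (whitespace tokens)."""
    tokenized = [s.split() for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(tokenized)
    # Document frequency: in how many sentences each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed to match the scikit-learn default).
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows


def closest_to_centroid(vectors):
    """Index of the row with the smallest Euclidean distance to the mean row."""
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    distances = [math.dist(v, centroid) for v in vectors]
    return distances.index(min(distances))


# Hypothetical, already-cleaned sentences belonging to one cluster.
cluster = [
    "fire broke building kuwait",
    "fire building kuwait",
    "fire devastated building district kuwait",
]
vectors = tfidf_vectors(cluster)
best = closest_to_centroid(vectors)
print(cluster[best])  # -> fire building kuwait
```

The middle sentence wins because it contains only the terms shared by all three sentences, so its vector sits nearest the cluster mean.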

### Results:

The rule-based approach to extractive summarization was implemented and evaluated on the validation set.

ROUGE-1 F-measure (mid) = 0.2472 (24.72%)
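
To make the reported number easier to interpret, here is a minimal sketch of how ROUGE-1 precision, recall and F-measure can be computed for one prediction/reference pair. It assumes plain whitespace tokenization; the `rouge` metric used in the notebook applies its own tokenization and aggregates over the whole validation set, so this only illustrates the per-pair formula. The function name `rouge1` and the two strings are hypothetical.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure for one text pair."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


# Hypothetical pair: 3 of 4 predicted tokens appear in the 5-token reference.
p, r, f = rouge1(
    "fire devastated building kuwait",
    "deadly fire devastated apartment building",
)
print(f"precision={p:.2f} recall={r:.2f} fmeasure={f:.2f}")
```

Precision is the share of predicted tokens found in the reference, recall is the share of reference tokens recovered, and the F-measure is their harmonic mean.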



In [9]:
import pandas as pd
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
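from datasets import load_metric # note: load_metric is deprecated in recent `datasets` releases in favor of the separate `evaluate` package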

nltk.download('punkt') # for sentence tokenization
nltk.download('stopwords') # for removing stopwords
nltk.download('wordnet') # for lemmatization
nltk.download('averaged_perceptron_tagger') # for POS tagging

stop_words = set(stopwords.words('english')) # stop words
lemmatizer = WordNetLemmatizer() # lemmatizer

def preprocess(text):
    """Clean `text` and return a list of cleaned sentence strings."""
    text = text.lower() # Lowercasing
    tokens = word_tokenize(text) # Tokenizing into words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words] # Removing stopwords and lemmatizing
    processed_text = ' '.join(tokens) # Rejoining into a single string
    sentences = nltk.sent_tokenize(processed_text) # Splitting the cleaned text into sentences
    # POS tagging: the tags are computed but discarded, so only the words are kept
    pos_tagged_sentences = [' '.join(word for word, pos in pos_tag(word_tokenize(sentence))) for sentence in sentences]
    return pos_tagged_sentences # Cleaned sentences (no tags attached)

def generate_summary(text, num_clusters=3):
    sentences = preprocess(text) # Preprocessing the input text

    if not sentences:
        return '' # Nothing to summarize

    num_clusters = min(num_clusters, len(sentences)) # KMeans cannot use more clusters than sentences

    vectorizer = TfidfVectorizer(stop_words='english') # TF-IDF Vectorizer with stopwords
    tfidf_matrix = vectorizer.fit_transform(sentences) # Fitting and transforming sentences into TF-IDF matrix

    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42) # KMeans with a fixed seed so the selected sentences are reproducible
    kmeans.fit(tfidf_matrix) # Fitting to TF-IDF matrix
    labels = kmeans.labels_ # the cluster labels for each sentence
    
    representative_sentences = [] # Initializing list to store representative sentences

    for i in range(num_clusters): # Iterating over each cluster
        cluster_indices = np.where(labels == i)[0] # Finding indices of sentences in the current cluster
        cluster_vectors = tfidf_matrix[cluster_indices].toarray() # Getting TF-IDF vectors of the current cluster
        centroid = np.mean(cluster_vectors, axis=0) # Calculating the centroid of the cluster
        closest_index = np.argmin(np.linalg.norm(cluster_vectors - centroid, axis=1)) # Finding the index of the sentence closest to the centroid
        representative_sentences.append(sentences[cluster_indices[closest_index]]) # Adding the representative sentence to the list
    return ' '.join(representative_sentences) # Joining and returning the representative sentences as a summary

def preprocess_validation_data(texts):
    return [' '.join(preprocess(text)) for text in texts] # Preprocess each text and join the sentences

# Load validation data
validation_data = pd.read_csv('/home/mohan/infy/data/fined/validation.csv')
input_texts = validation_data['text'].tolist()
target_texts = validation_data['summary'].tolist()

# Preprocess input texts and target texts
preprocessed_input_texts = preprocess_validation_data(input_texts)
preprocessed_target_texts = preprocess_validation_data(target_texts)

# Generate summaries for the validation set (generate_summary() runs preprocess() again on the already-cleaned text)
generated_summaries = [generate_summary(text) for text in preprocessed_input_texts]
validation_data['generated_summary'] = generated_summaries

# Evaluate summaries using ROUGE
rouge = load_metric('rouge')
rouge_scores = rouge.compute(predictions=generated_summaries, references=preprocessed_target_texts)
print(rouge_scores)


[nltk_data] Downloading package punkt to /home/mohan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/mohan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/mohan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/mohan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
  rouge = load_metric('rouge')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'rouge1': AggregateScore(low=Score(precision=0.32992749398823934, recall=0.24625671174307656, fmeasure=0.24315282916961736), mid=Score(precision=0.3355847726701653, recall=0.2515741685849237, fmeasure=0.24716109515113482), high=Score(precision=0.34076098510111485, recall=0.25621987369010696, fmeasure=0.2509056630676471)), 'rouge2': AggregateScore(low=Score(precision=0.12599312296199636, recall=0.08964834164444435, fmeasure=0.08850973675523258), mid=Score(precision=0.12983989638364446, recall=0.09289135432441595, fmeasure=0.09146407636882564), high=Score(precision=0.1338382414644077, recall=0.09585184961127763, fmeasure=0.09413455831524503)), 'rougeL': AggregateScore(low=Score(precision=0.244552430523978, recall=0.17475109214152934, fmeasure=0.17388626267245735), mid=Score(precision=0.24888545943215054, recall=0.17862081625856158, fmeasure=0.17692926793944208), high=Score(precision=0.2531846351622453, recall=0.1821262280925005, fmeasure=0.17980853790592133)), 'rougeLsum': AggregateScor

In [14]:
validation_data 

Unnamed: 0,text,summary,generated_summary
0,Vladimir Putin is 'alive' but 'neutralised' as...,Vladimir Putin is supposed to hold public meet...,also pointed recent visit patrushev - head put...
1,#Person1#: How old is Keith?\n#Person2#: He's ...,#Person1# and #Person2# talk about the age of ...,"# person2 # : 's father ? # person2 # : , want..."
2,(CNN)The United States has seemingly erupted t...,A Native American from a tribe not recognized ...,"fact , federal rfra designed partly protect na..."
3,#Person1#: When do you want to have the open h...,#Person1# and #Person2# are planning an open h...,# person1 # : 'm excited . # person1 # : like ...
4,neutral pion photoproduction on the proton at ...,we investigate the neutral pion photoproductio...,"-g . @ xcite . close threshold , charged pion ..."
...,...,...,...
5011,The New York Police Department is searching fo...,Driver was at intersection of Broadway and Wes...,woman ran away bus throwing coffee . 45-year-o...
5012,SECTION 1. SHORT TITLE.\n\n This Act may be...,"Preservation of Localism, Program Diversity, a...",`` . 2. finding ; purpose . 3. national televi...
5013,vortex instabilities are observed throughout t...,vortices have been postulated at a range of si...,observational evidence interaction ism agb win...
5014,Daniel Levy reportedly told the Tottenham Hots...,Tottenham Hotspur chairman Daniel Levy has tol...,"elite club , challenge biggest trophy , buy on..."


In [12]:
rouge_scores

{'rouge1': AggregateScore(low=Score(precision=0.32992749398823934, recall=0.24625671174307656, fmeasure=0.24315282916961736), mid=Score(precision=0.3355847726701653, recall=0.2515741685849237, fmeasure=0.24716109515113482), high=Score(precision=0.34076098510111485, recall=0.25621987369010696, fmeasure=0.2509056630676471)),
 'rouge2': AggregateScore(low=Score(precision=0.12599312296199636, recall=0.08964834164444435, fmeasure=0.08850973675523258), mid=Score(precision=0.12983989638364446, recall=0.09289135432441595, fmeasure=0.09146407636882564), high=Score(precision=0.1338382414644077, recall=0.09585184961127763, fmeasure=0.09413455831524503)),
 'rougeL': AggregateScore(low=Score(precision=0.244552430523978, recall=0.17475109214152934, fmeasure=0.17388626267245735), mid=Score(precision=0.24888545943215054, recall=0.17862081625856158, fmeasure=0.17692926793944208), high=Score(precision=0.2531846351622453, recall=0.1821262280925005, fmeasure=0.17980853790592133)),
 'rougeLsum': AggregateS

In [10]:
for key, value in rouge_scores.items():
    print(f"{key}: {value.mid}")

rouge1: Score(precision=0.3355847726701653, recall=0.2515741685849237, fmeasure=0.24716109515113482)
rouge2: Score(precision=0.12983989638364446, recall=0.09289135432441595, fmeasure=0.09146407636882564)
rougeL: Score(precision=0.24888545943215054, recall=0.17862081625856158, fmeasure=0.17692926793944208)
rougeLsum: Score(precision=0.2488984317892282, recall=0.17844838544509925, fmeasure=0.17679462513212732)
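
Each metric above carries a `low`, `mid` and `high` triple. A small helper can turn the F-measures into the percentage form used in the Results section. The sketch below uses stand-in namedtuples with the same field names as the printed output (an assumption about the real result types), and `low` and `high` are assumed to bracket an interval around `mid`. The helper name `as_percent_table` is hypothetical; the numbers are the ROUGE-1 values printed above.

```python
from collections import namedtuple

# Stand-ins mirroring the field names in the printed output (assumption:
# the real objects expose the same attributes).
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def as_percent_table(scores):
    """Map each metric name to 'F=mid [low, high]' with values in percent."""
    table = {}
    for name, agg in scores.items():
        table[name] = (
            f"F={agg.mid.fmeasure * 100:.2f} "
            f"[{agg.low.fmeasure * 100:.2f}, {agg.high.fmeasure * 100:.2f}]"
        )
    return table


# ROUGE-1 values copied from the notebook output.
example_scores = {
    "rouge1": AggregateScore(
        low=Score(0.32992749398823934, 0.24625671174307656, 0.24315282916961736),
        mid=Score(0.3355847726701653, 0.2515741685849237, 0.24716109515113482),
        high=Score(0.34076098510111485, 0.25621987369010696, 0.2509056630676471),
    ),
}
for name, line in as_percent_table(example_scores).items():
    print(f"{name}: {line}")  # -> rouge1: F=24.72 [24.32, 25.09]
```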


In [11]:
text = '''
At least 49 migrant workers, including around 40 Indian citizens, have died in a deadly fire that devastated a building in Kuwait’s southern district of Al-Mangaf. The fire that broke out in the apartment building located in Kuwait’s Al Ahmadi Governorate early on Wednesday also left more than a dozen injured, who were admitted to nearby hospitals, reported the Kuwait News Agency (KUNA). Prime Minister Narendra Modi and External Affairs Minister S. Jaishankar expressed shock over the incident, and Congress leader Rahul Gandhi expressed ‘serious concern’ about the condition of Indians in the Gulf region.  “My thoughts are with all those who have lost their near and dear ones. I pray that the injured recover at the earliest. The Indian Embassy in Kuwait is closely monitoring the situation and working with the authorities there to assist the affected,” said Mr. Modi in a message. Mr. Modi held a review meeting on Wednesday evening about the condition of the affected Indians in Kuwait. He deputed Minister of Stat
'''
summary = generate_summary(text, num_clusters=2)
print(summary)

mr. modi held review meeting wednesday evening condition affected indian kuwait . least 49 migrant worker , including around 40 indian citizen , died deadly fire devastated building kuwait ’ southern district al-mangaf .
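
The printed summary is lowercased, has its stop words removed, and lists sentences in cluster order rather than document order, because `generate_summary` selects from the preprocessed sentences. One possible way to produce a more readable summary is sketched below: keep the original sentences in a list aligned index-for-index with the cleaned ones, pick one index per cluster, and emit the originals in ascending index order. This alignment is an assumption that the notebook's `preprocess` does not currently provide, and `pick_representatives`, `build_summary`, the sentences, the labels and the distances are hypothetical illustration values.

```python
def pick_representatives(labels, distances):
    """For each cluster label keep the sentence index with the smallest distance."""
    best = {}
    for idx, (label, dist) in enumerate(zip(labels, distances)):
        if label not in best or dist < distances[best[label]]:
            best[label] = idx
    # Ascending indices restore the order in which sentences appear in the text.
    return sorted(best.values())


def build_summary(original_sentences, labels, distances):
    """Join the original (uncleaned) sentences chosen for each cluster."""
    chosen = pick_representatives(labels, distances)
    return " ".join(original_sentences[i] for i in chosen)


# Hypothetical inputs: one cluster label and one distance-to-centroid
# per sentence, as a clustering step might produce them.
original = [
    "At least 49 workers died in a fire.",
    "The fire broke out early on Wednesday.",
    "Mr. Modi held a review meeting.",
]
labels = [1, 1, 0]
distances = [0.2, 0.5, 0.0]
print(build_summary(original, labels, distances))
# -> At least 49 workers died in a fire. Mr. Modi held a review meeting.
```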


In [13]:
text= '''
The recent advancements in big data and natural language processing (NLP) have necessitated proficient text mining (TM) schemes that can interpret and analyze voluminous textual data. Text summarization (TS) acts as an essential pillar within recommendation engines. Despite the prevalent use of abstractive techniques in TS, an anticipated shift towards a graph-based extractive TS (ETS) scheme is becoming apparent. The models, although simpler and less resource-intensive, are key in assessing reviews and feedback on products or services. Nonetheless, current methodologies have not fully resolved concerns surrounding complexity, adaptability, and computational demands. Thus, we propose our scheme, GETS, utilizing a graph-based model to forge connections among words and sentences through statistical procedures. The structure encompasses a post-processing stage that includes graph-based sentence clustering. Employing the Apache Spark framework, the scheme is designed for parallel execution, making it adaptable to real-world applications. For evaluation, we selected 500 documents from the WikiHow and Opinosis datasets, categorized them into five classes, and applied the recall-oriented understudying gisting evaluation (ROUGE) parameters for comparison with measures ROUGE-1, 2, and L. The results include recall scores of 0.3942, 0.0952, and 0.3436 for ROUGE-1, 2, and L, respectively (when using the clustered approach). Through a juxtaposition with existing models such as BERTEXT (with 3-gram, 4-gram) and MATCHSUM, our scheme has demonstrated notable improvements, substantiating its applicability and effectiveness in real-world scenarios.
Keywords: text mining; extractive text summarization; sentence scoring scheme; graph analytics; graph-based clustering; opinion mining
1. Introduction
Modern Internet communication has shifted towards microblogging, community formation (on Twitter, Facebook, and other sites), and opinion-based polling. Most of the generated information falls under the wider umbrella of text mining (TM) schemes, where the data or information are gathered from devices and distributed to different users for consumption [1]. Another field is crowdsourcing [2], which includes the involvement of public opinions, work distribution, and community shares from a large group of persons. Mostly, the information is generated by mobile devices and the content is user-generated, with a significant amount of textual information. In social communities, many people post raw data, which need to be analyzed for opinions in real time to make informed decisions. The problem is considered challenging in TM applications as a micro-blog post is usually very short and colloquial, and traditional opinion mining algorithms in the TM scope do not cater to a one-fit-all scheme in such scenarios.
Thus, effective text analytics schemes have been widely studied over the last decade [3,4,5,6], which has improved interactions in human life. Social media, blogs, and e-shopping are those domains that have benefited from this study. A rich environment for expressing ideas about a variety of topics, including service providers, logistics, and e-commerce platforms, has been created within the social community landscape thanks to the growth in feedback channels, customer reviews, and polling systems [7]. The tremendous amount of data that have been produced by this ecosystem makes it difficult to sort through. Specialized mining approaches are needed to analyze these data to find patterns, preferences, and behavior. Deploying artificial intelligence (AI) models is also essential in light of the emergence of open application programming interfaces (APIs), which enable the seamless exchange of information across social media and e-commerce platforms. These models assist in trend forecasting and allow for a more complex comprehension of consumer behavior in the digital market. The concern, however, remains that the data are shared over open public channels (Internet), and, thus, at the security front, crypto-primitives and privacy preservation techniques are important [8]. Once data are collected, different statistical and machine learning (ML) models are built for text summarization (TS), from which effective predictions can be made, which facilitates business decisions in smart communities [9,10].
Considering the scenario of e-commerce portals, the text document generally includes reviews, comments, and feedback provided by users on the purchased product [11]. In such cases, document summarization becomes an effective tool for analysis [12] as it filters out important and necessary information from multiple text documents. However, on the downside, as the number (frequency) of documents generated increases, the prediction analytics models become bulkier and costlier, which limits organizations to implement them due to budgetary constraints [13]. This limitation presents the requirement of robust algorithms for abstractive text summarization (ATS) [14] systems, which generate summaries of the documents based on the provided ML algorithm [15]. Figure 1 shows the classification taxonomy of ATS, where a three-tier classification is presented, namely based on the file size, the summarization approach, and the summarization algorithm approach [14].
Input-based ATS systems are further classified into two types, single-document and multi-document classification [16]. Single-document ATS, as the name suggests, summarizes the content of one document, whereas multiple-document ATS takes many documents in parallel and provides the summarization output [13]. Based on the document retrieval and semantic value obtained from the summary, an ATS is classified as an extractive or abstractive ATS, respectively [14]. An experimental review comparing abstractive and extractive summarization methods is presented in Refs. [17,18]. The review likely examines their effectiveness and performance in generating summaries. Computationally, Extractive ATSs are less computationally expensive but suffer from semantic misinterpretation. An ATS involves semantic nets and grammar rules, which renders it ineffective for real-time applications. It is useful in cases where the quality of information is more important and thus accurate summary generation is highly required. Further, the classifications can be both supervised and unsupervised for the extraction of relationships from the main text document [19,20,21]. The primary goal of this study is to implement extractive text summarization (ETS) in a big data analytics environment. Based on a comprehensive literature survey, we have considered a substantial volume of text for summarization while simultaneously reducing computational expenses. As a result, this paper advocates for a statistical analysis approach to text summarization in contrast to abstractive text summarization. The discussion on this subject is also prominently highlighted in the paper.
Mostly in big data applications, owing to the text content bulk, an extractive ATS is a preferred choice. Figure 2 presents a generic overview of the extractive ATS, which can be further classified into three phases.
'''
summary = generate_summary(text, num_clusters=2)
print(summary)

based document retrieval semantic value obtained summary , at classified extractive abstractive at , respectively [ 14 ] . generated information fall wider umbrella text mining ( tm ) scheme , data information gathered device distributed different user consumption [ 1 ] .
