**Installing required libraries**

In [1]:
# Install required libraries
! pip install -U sentence-transformers
! pip install rouge-score
! pip install datasets


**Importing the libraries**

In [2]:
import nltk
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer 
from nltk.cluster import KMeansClusterer
from datasets import load_dataset
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer



**Loading the Hugging Face Multi-News dataset**

In [3]:
# Load the dataset
dataset = load_dataset("alexfabbri/multi_news", split='train')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



Generating train split:   0%|          | 0/44972 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5622 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5622 [00:00<?, ? examples/s]

In [5]:
article = dataset[0]['document']
reference_summary = dataset[0]['summary']

**Convert the article into a list of sentences using NLTK's sentence tokenizer.**

In [6]:
# Download punkt for sentence tokenization
nltk.download('punkt')

# Tokenize the article into sentences
sentences = nltk.sent_tokenize(article)

# Strip white spaces (leading & trailing)
sentences = [sentence.strip() for sentence in sentences]

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Check the tokenized sentences**

In [7]:
sentences

['National Archives \n \n Yes, it’s that time again, folks.',
 'It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs.',
 'A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month.',
 'Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February.',
 'The unemployment rate is expected to hold steady at 8.3%.',
 'Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires.',
 'Feel free to weigh-in yourself, via the comments section.',
 'And while you’re here, why don’t you sign up to follow us on Twitter.',
 'Enjoy the show.',
 '||||| Employers pulled back sharply on hiring last month, a reminder that the U.S. ec
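
The `|||||` at the start of the last sentence shown above appears to be the separator that Multi-News places between source articles, and the sentence tokenizer leaves it attached to the sentence that follows. A minimal sketch of one way to handle it, assuming it is acceptable to split on the separator before tokenizing. `split_sources` is a hypothetical helper that is not part of this notebook, and the sample string is made up for illustration:

```python
SEPARATOR = "|||||"  # delimiter seen between source articles in the output above

def split_sources(article: str) -> list[str]:
    """Split a concatenated article into its source documents, dropping empty parts."""
    parts = (part.strip() for part in article.split(SEPARATOR))
    return [part for part in parts if part]

# Made-up sample in the same shape as the dataset's 'document' field
sample = "Enjoy the show. ||||| Employers pulled back sharply on hiring last month."
sources = split_sources(sample)
print(sources)
```

Each element of `sources` could then be passed to `nltk.sent_tokenize` separately, so that no sentence starts with the separator.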

**Convert the above list of sentences to a pandas DataFrame.**

In [8]:
# Create a DataFrame with the sentences
df = pd.DataFrame(sentences, columns=['sentences'])
df.head()

Unnamed: 0,sentences
0,"National Archives \n \n Yes, it’s that time ag..."
1,"It’s the first Friday of the month, when for o..."
2,A fresh update on the U.S. employment situatio...
3,"Expectations are for 203,000 new jobs to be cr..."
4,The unemployment rate is expected to hold stea...


In [25]:
print(f'Number of sentences in our article : {len(df)}')

Number of sentences in our article : 17


**Implementing TF-IDF Vectorization**

In [10]:
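# Compute TF-IDF scores: one row per sentence, summed across all terms
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['sentences'])
# Flatten the (n_sentences, 1) result to a 1-D array so it fits a DataFrame column
tfidf_scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()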

In [11]:
# Normalize TF-IDF scores
normalized_tfidf_scores = tfidf_scores / np.sum(tfidf_scores)

df['tfidf_score'] = normalized_tfidf_scores
df.head()

Unnamed: 0,sentences,tfidf_score
0,"National Archives \n \n Yes, it’s that time ag...",0.044904
1,"It’s the first Friday of the month, when for o...",0.076553
2,A fresh update on the U.S. employment situatio...,0.080641
3,"Expectations are for 203,000 new jobs to be cr...",0.072773
4,The unemployment rate is expected to hold stea...,0.046236
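
To make the `tfidf_score` column less of a black box, here is a rough pure-Python sketch of the same computation on three toy sentences. It is intended to mirror the `TfidfVectorizer` defaults as I understand them (lowercasing, tokens of two or more word characters, smoothed IDF, L2-normalised rows); `sentence_tfidf_scores` is a hypothetical name and the toy sentences are illustrative only:

```python
import math
import re
from collections import Counter

def sentence_tfidf_scores(sentences):
    """Sum each sentence's L2-normalised TF-IDF weights, then scale the sums to total 1."""
    # Lowercase and keep tokens of two or more word characters
    docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    n_docs = len(docs)
    # Document frequency: number of sentences that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    row_sums = []
    for doc in docs:
        term_counts = Counter(doc)
        # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
        weights = [count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, count in term_counts.items()]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        row_sums.append(sum(w / norm for w in weights))
    total = sum(row_sums)
    return [value / total for value in row_sums]

toy = [
    "The unemployment rate fell.",
    "The unemployment rate is expected to hold steady at 8.3%.",
    "Enjoy the show.",
]
scores = sentence_tfidf_scores(toy)
print([round(s, 3) for s in scores])
```

Because each row is summed after L2 normalisation, sentences with more distinct terms tend to receive a larger score, which is worth keeping in mind when the score is later used for ranking.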


**Initialize the Sentence Transformer with an STS (Semantic Textual Similarity) model.**

In [12]:
# Initialize Sentence Transformer model
model = SentenceTransformer('stsb-roberta-base')


**A function that takes a sentence as input and returns its dense vector (embedding).**

In [14]:
# Function to get sentence embeddings
def get_sent_embeddings(sent):
    embeddings = model.encode([sent])
    return embeddings[0]

**Applying the above function to get embeddings for each sentence**

In [15]:
# Get embeddings for each sentence
df['embeddings'] = df['sentences'].apply(get_sent_embeddings)
df.head()


Unnamed: 0,sentences,tfidf_score,embeddings
0,"National Archives \n \n Yes, it’s that time ag...",0.044904,"[0.8734846, -0.6017922, 0.71979344, 0.41540766..."
1,"It’s the first Friday of the month, when for o...",0.076553,"[-0.36445275, -0.44911194, 0.12615317, -1.0521..."
2,A fresh update on the U.S. employment situatio...,0.080641,"[-0.2598091, -0.3050253, 0.5102481, -0.7973806..."
3,"Expectations are for 203,000 new jobs to be cr...",0.072773,"[0.18289545, -0.4410061, 0.12599786, 0.2161035..."
4,The unemployment rate is expected to hold stea...,0.046236,"[-0.09652451, 0.0826514, -0.13469806, -0.56875..."


**Clustering the sentence embeddings using NLTK's KMeansClusterer.**

In [16]:
# Set the number of clusters (one summary sentence per cluster) and the number of randomized clustering trials
n_clusters = 10
iterations = 25

# Convert embeddings into numpy array
X = np.array(df['embeddings'].tolist())

# Perform clustering
kcluster = KMeansClusterer(n_clusters, distance=nltk.cluster.util.cosine_distance, repeats=iterations, avoid_empty_clusters=True)
assigned_clusters = kcluster.cluster(X, assign_clusters=True)

In [17]:
assigned_clusters

[7, 4, 4, 4, 0, 8, 6, 8, 9, 3, 3, 1, 2, 1, 4, 2, 5]
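
The clustering above measures closeness with `nltk.cluster.util.cosine_distance`, which is one minus the cosine similarity of two vectors. A small stand-alone sketch of that formula on toy 2-D vectors (the vectors are made up, and this re-implementation is for illustration rather than a replacement for the NLTK function):

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # vectors differ only in length
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # vectors at a right angle
print(same_direction, orthogonal)
```

Cosine distance ignores vector length, so two embeddings that point in the same direction are treated as identical regardless of magnitude.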

**Assigning each sentence its cluster and the corresponding centroid, so that the distance between embedding and centroid can be computed.**

In [18]:
# Assign clusters and centroids to the DataFrame
df['Cluster'] = assigned_clusters
df['Centroid'] = df['Cluster'].apply(lambda x: kcluster.means()[x])
df.head()

Unnamed: 0,sentences,tfidf_score,embeddings,Cluster,Centroid
0,"National Archives \n \n Yes, it’s that time ag...",0.044904,"[0.8734846, -0.6017922, 0.71979344, 0.41540766...",7,"[0.8734846, -0.6017922, 0.71979344, 0.41540766..."
1,"It’s the first Friday of the month, when for o...",0.076553,"[-0.36445275, -0.44911194, 0.12615317, -1.0521...",4,"[-0.3181181, -0.3859373, 0.10603756, -0.718429..."
2,A fresh update on the U.S. employment situatio...,0.080641,"[-0.2598091, -0.3050253, 0.5102481, -0.7973806...",4,"[-0.3181181, -0.3859373, 0.10603756, -0.718429..."
3,"Expectations are for 203,000 new jobs to be cr...",0.072773,"[0.18289545, -0.4410061, 0.12599786, 0.2161035...",4,"[-0.3181181, -0.3859373, 0.10603756, -0.718429..."
4,The unemployment rate is expected to hold stea...,0.046236,"[-0.09652451, 0.0826514, -0.13469806, -0.56875...",0,"[-0.09652451, 0.0826514, -0.13469806, -0.56875..."


**To compute the distance, SciPy's distance_matrix function is used. It returns the Euclidean distance by default, whereas the clustering above used cosine distance.**

In [19]:
# Function to calculate the (Euclidean) distance of a sentence embedding from its cluster centroid
from scipy.spatial import distance_matrix
def distance_from_centroid(row):
    # distance_matrix returns a 1x1 matrix here; extract the single value
    return distance_matrix([row['embeddings']], [row['Centroid'].tolist()])[0][0]

**Combining the TF-IDF score with the distance from the centroid into a single ranking score**

In [20]:
# Calculate distance from centroid for each sentence
df['distance_from_centroid'] = df.apply(distance_from_centroid, axis=1)

# Combine distance from centroid and TF-IDF score by multiplication (sorting happens in a later cell)
df['combined_score'] = df['distance_from_centroid'] * df['tfidf_score']
df.head()

Unnamed: 0,sentences,tfidf_score,embeddings,Cluster,Centroid,distance_from_centroid,combined_score
0,"National Archives \n \n Yes, it’s that time ag...",0.044904,"[0.8734846, -0.6017922, 0.71979344, 0.41540766...",7,"[0.8734846, -0.6017922, 0.71979344, 0.41540766...",0.0,0.0
1,"It’s the first Friday of the month, when for o...",0.076553,"[-0.36445275, -0.44911194, 0.12615317, -1.0521...",4,"[-0.3181181, -0.3859373, 0.10603756, -0.718429...",10.502183,0.80397
2,A fresh update on the U.S. employment situatio...,0.080641,"[-0.2598091, -0.3050253, 0.5102481, -0.7973806...",4,"[-0.3181181, -0.3859373, 0.10603756, -0.718429...",10.22682,0.824706
3,"Expectations are for 203,000 new jobs to be cr...",0.072773,"[0.18289545, -0.4410061, 0.12599786, 0.2161035...",4,"[-0.3181181, -0.3859373, 0.10603756, -0.718429...",12.137827,0.883303
4,The unemployment rate is expected to hold stea...,0.046236,"[-0.09652451, 0.0826514, -0.13469806, -0.56875...",0,"[-0.09652451, 0.0826514, -0.13469806, -0.56875...",0.0,0.0
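
The rows with a distance of 0.0 in the table above belong to clusters that contain a single sentence: the centroid of a one-member cluster is that member itself, so both the distance and the combined score collapse to zero. A toy sketch of this effect, with made-up 2-D vectors and TF-IDF weights standing in for the real embeddings:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy data: cluster 0 has one member, cluster 1 has two
clusters = {0: [[0.5, -0.2]], 1: [[1.0, 0.0], [0.0, 1.0]]}
tfidf = {0: [0.05], 1: [0.08, 0.07]}

results = []
for cluster_id, members in clusters.items():
    centre = centroid(members)
    for vector, weight in zip(members, tfidf[cluster_id]):
        distance = math.dist(vector, centre)  # Euclidean, like distance_matrix with p=2
        results.append((cluster_id, distance, distance * weight))

for row in results:
    print(row)
```

With 10 clusters and only 17 sentences, several clusters are likely to be singletons, and for those the TF-IDF score has no influence on the combined score.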


**The final step is to generate the summary. This can be done with the following steps:**

1. Group the sentences by the Cluster column.

2. Sort each group in ascending order of the combined_score column and select the first row.

3. Sort the selected sentences by their position in the original text.

In [21]:
# Select the sentence with the lowest combined score from each cluster
sents = df.sort_values(by='combined_score', ascending=True).groupby('Cluster').head(1)['sentences'].tolist()
sents

['National Archives \n \n Yes, it’s that time again, folks.',
 'The unemployment rate is expected to hold steady at 8.3%.',
 'Feel free to weigh-in yourself, via the comments section.',
 'Enjoy the show.',
 'But Federal Reserve Chairman Ben Bernanke has cautioned that the current hiring pace is unlikely to continue without more consumer spending.',
 'The unemployment rate dipped, but mostly because more Americans stopped looking for work.',
 'The unemployment rate fell to 8.2 percent, the lowest since January 2009.',
 'The rate dropped because fewer people searched for jobs.',
 'And while you’re here, why don’t you sign up to follow us on Twitter.',
 'The official unemployment tally only includes those seeking work.']
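
The list above is ordered by combined score rather than by position in the article, so the third step (restoring the original order) still has to be applied. A minimal sketch of the selection plus re-ordering logic in plain Python, using made-up `(position, cluster, score)` tuples; `select_in_document_order` is a hypothetical helper, not part of the notebook:

```python
def select_in_document_order(rows):
    """Keep the lowest-scoring row per cluster, then sort by original position."""
    best = {}
    for position, cluster, score in rows:
        # Remember the row with the smallest score seen so far for this cluster
        if cluster not in best or score < best[cluster][1]:
            best[cluster] = (position, score)
    return sorted(position for position, _ in best.values())

# Toy rows: (position in article, cluster id, combined score)
rows = [(0, 2, 0.10), (1, 0, 0.40), (2, 0, 0.30), (3, 1, 0.05), (4, 2, 0.20)]
selected = select_in_document_order(rows)
print(selected)
```

With the DataFrame used here, the same effect could probably be had by calling `.sort_index()` on the result of `.groupby('Cluster').head(1)` before extracting the `sentences` column, since `head` keeps the original row index.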

**Generating the Final Summary**

In [22]:
# Create the final summary
summary = ' '.join(sents)
print("Generated Summary:")
print(summary)

Generated Summary:
National Archives 
 
 Yes, it’s that time again, folks. The unemployment rate is expected to hold steady at 8.3%. Feel free to weigh-in yourself, via the comments section. Enjoy the show. But Federal Reserve Chairman Ben Bernanke has cautioned that the current hiring pace is unlikely to continue without more consumer spending. The unemployment rate dipped, but mostly because more Americans stopped looking for work. The unemployment rate fell to 8.2 percent, the lowest since January 2009. The rate dropped because fewer people searched for jobs. And while you’re here, why don’t you sign up to follow us on Twitter. The official unemployment tally only includes those seeking work.


**Evaluation using ROUGE scores**

In [23]:
# Calculate ROUGE scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, summary)

In [24]:
# Print ROUGE scores
rouge1_score = scores['rouge1'].fmeasure
rouge2_score = scores['rouge2'].fmeasure
rougeL_score = scores['rougeL'].fmeasure

print("ROUGE-1 Score: ", rouge1_score)
print("ROUGE-2 Score: ", rouge2_score)
print("ROUGE-L Score: ", rougeL_score)

ROUGE-1 Score:  0.4020618556701031
ROUGE-2 Score:  0.15625000000000003
ROUGE-L Score:  0.2268041237113402
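
For intuition about what the ROUGE-1 F-measure reports, here is a simplified sketch that counts unigram overlap between a reference and a candidate. It assumes whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly; `rouge1` is a hypothetical function and the two strings are made up:

```python
from collections import Counter

def rouge1(reference: str, candidate: str):
    """Unigram-overlap precision, recall and F-measure (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = rouge1("the unemployment rate fell", "the rate fell sharply last month")
print(p, r, f)
```

Precision is the share of candidate words found in the reference and recall is the share of reference words found in the candidate, so a long extractive summary can score well on recall while losing precision.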
