<a href="https://colab.research.google.com/github/FalineRezvani/simpleInsightTools/blob/main/languageModeling/sentenceTransformersWithLexRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Summarization with Sentence Transformers and LexRank

2025-04-15

Author: Faline Rezvani

This notebook loads a CSV file, cleans and tokenizes the text, embeds each sentence with Sentence Transformers, and uses LexRank to return the most central sentence as a summary.

In [152]:
import tensorflow as tf # Optional: Sentence Transformers runs on PyTorch and does not require TensorFlow
import torch # Backend for Sentence Transformers
from sentence_transformers import SentenceTransformer, util
import logging # Dependency for LexRank
import scipy # Dependency for LexRank (pip install scipy)
from scipy.sparse.csgraph import connected_components
from scipy.special import softmax
import sklearn # Dependency for LexRank
from sumy.summarizers.lex_rank import LexRankSummarizer # sumy's LexRank summarizer (pip install sumy)
from sumy.parsers.plaintext import PlaintextParser # Parses plain text into a sumy document
from sumy.nlp.tokenizers import Tokenizer
from lexrank import LexRank # Text summarization (pip install LexRank)
import nltk # May need nltk.download('punkt_tab')
from nltk.corpus import stopwords # Words commonly used in human language, yet not useful to meaning. May need nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer # The Porter stemming algorithm to strip suffixes
import re, string # Regular expressions for text processing
import numpy as np
import pandas as pd
import csv

In [153]:
#from google.colab import drive
#drive.mount('/content/drive')

In [154]:
# Reading csv file from Google Drive
df = pd.read_csv(r'/content/drive/MyDrive/YourFilePathHere')

In [155]:
df.head()

Unnamed: 0,description
0,Reinforcement learning - GeeksforGeeks\r\nSep ...
1,Machine Learning Tutorial - GeeksforGeeks\r\n5...
2,A Beginner's Guide to Deep Reinforcement Learn...
3,Upper Confidence Bound Algorithm in Reinforcem...
4,On-policy vs off-policy methods Reinforcement ...


Regular Expressions

In [156]:
# Creating list of strings from dataframe
corpus = []

for i in range(0, 9):
  description = re.compile('<.*?>').sub(repl=' ', string=df.iloc[:,0][i]) # Replacing HTML tags with a space
  description = re.compile('[0-9\r\n,-]').sub(' ', description) # Replacing digits, line breaks, commas, and hyphens with a space
  description = description.lower() # Converting to lowercase
  corpus.append(description) # Appending to the list
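
The cleaning step can be tried on a single string before running it over the dataframe. This is a minimal sketch, assuming the intent is to strip HTML tags with the lazy pattern `<.*?>`; the sample text is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample standing in for one row of the dataframe
sample = "Reinforcement learning - <b>GeeksforGeeks</b>\r\nSep 12, 2024"

# '<.*?>' lazily matches one tag at a time, from '<' to the nearest '>'
no_tags = re.sub(r"<.*?>", " ", sample)

# Replace digits, carriage returns, newlines, commas, and hyphens with spaces
cleaned = re.sub(r"[0-9\r\n,-]", " ", no_tags).lower()

print(repr(cleaned))
```

The lazy quantifier matters: a greedy `<.*>` would remove everything between the first `<` and the last `>` on a line, including the text between tags.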

In [157]:
# Creating single string out of list of strings
corpus = ''.join(corpus)
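
Joining with an empty separator leaves no whitespace between the end of one description and the start of the next, which is why the outputs further down contain runs such as `situation.machine learning`. The sketch below illustrates the effect on made-up strings. The `naive_split` helper is a hypothetical stand-in for a real sentence tokenizer, not NLTK's algorithm.

```python
import re

# Hypothetical cleaned descriptions, each ending in a period
parts = ["rl maximizes cumulative rewards.", "machine learning learns from data."]

def naive_split(text):
    # Stand-in for a sentence tokenizer: split on whitespace that follows ., !, or ?
    return re.split(r"(?<=[.!?])\s+", text)

fused = "".join(parts)    # no separator: "...rewards.machine learning..."
spaced = " ".join(parts)  # a space keeps the sentence boundary visible

print(len(naive_split(fused)))   # 1
print(len(naive_split(spaced)))  # 2
```

If separate sentences per description are wanted, joining with a space (or a newline) is one option to consider.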

In [158]:
#print(corpus)

LexRank Summarizer

In [159]:
# Instantiating summarizer
summarizer_lex = LexRankSummarizer()

In [160]:
# Parsing the string into a sumy document, split into sentences with the English tokenizer
parser = PlaintextParser.from_string(corpus, Tokenizer("english"))

In [161]:
# Summarize using sumy LexRank, getting top two sentences
summary = summarizer_lex(parser.document, 2)

In [162]:
# Print the top two sentences
for sentence in summary:
  print(sentence)

reinforcement learning   geeksforgeeks  sep         ... reinforcement learning (rl) is a branch of machine learning focused on making decisions to maximize cumulative rewards in a given situation.machine learning tutorial   geeksforgeeks    days ago ... machine learning  a subset of artificial intelligence  enables computers to learn from data and make predictions through various techniques ...a beginner's guide to deep reinforcement learning   geeksforgeeks  sep          ... a computer science portal for geeks.
it contains well written  well thought and well explained computer science and programming articles  ...upper confidence bound algorithm in reinforcement learning ...  feb          ... in reinforcement learning  we use multi armed bandit problem to formalize the notion of decision making under uncertainty using k armed bandits.on policy vs off policy methods reinforcement learning ...  dec          ... this tutorial aims to demystify the concepts  providing a solid foundation f

NLTK

In [163]:
# Removing 'not' from the stop words so negation is kept; the last line evaluates to True only if 'not' is still in the set
stop_words = set(stopwords.words('english'))
stop_words.remove('not')
'not' in stop_words

False
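
The later cells pass full sentences to the embedding model, so `stop_words` is not applied there. For reference, a minimal sketch of how such a set could filter tokens is shown here. It uses a hypothetical miniature stop-word set rather than NLTK's English list, so that it runs without any downloads.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list
stop_words = {"the", "is", "a", "of", "not", "this"}
stop_words.discard("not")  # keep negation, as in the cell above

tokens = "this is not a branch of supervised learning".split()
kept = [t for t in tokens if t not in stop_words]

print(kept)  # ['not', 'branch', 'supervised', 'learning']
```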

In [164]:
# Split the document into sentences using NLTK sentence tokenizer
sentences = nltk.sent_tokenize(corpus)
#print("Num sentences:", len(sentences))

Sentence Transformers

In [165]:
# Instantiating Sentence Transformers model
# Get the path of this mini model here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [166]:
# Compute the sentence embeddings with Sentence Transformers model
embeddings = model.encode(sentences)

In [167]:
# Compute the similarity scores
similarity_scores = model.similarity(embeddings, embeddings).numpy()
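
To make the similarity matrix concrete, here is a minimal NumPy sketch of pairwise cosine similarity. It assumes the model's similarity function is cosine, which is the Sentence Transformers default. The 2-D vectors are toy values standing in for real embeddings.

```python
import numpy as np

# Toy 2-D "embeddings" standing in for model.encode(sentences)
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Normalize rows to unit length, then take pairwise dot products
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos_sim = unit @ unit.T

print(np.round(cos_sim, 3))
```

The result is a square, symmetric matrix with ones on the diagonal, where entry `[i, j]` scores how close sentence `i` is to sentence `j`.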

LexRank Similarity Scores

In [168]:
# LexRank implementation found here: https://github.com/crabcamp/lexrank/tree/dev
logger = logging.getLogger(__name__)


# Compute degree centrality scores
def degree_centrality_scores(
    similarity_matrix,
    threshold=None,
    increase_power=True,
):
    if not (threshold is None or isinstance(threshold, float) and 0 <= threshold < 1):
        raise ValueError(
            "'threshold' should be a floating-point number from the interval [0, 1) or None",
        )

    if threshold is None:
        markov_matrix = create_markov_matrix(similarity_matrix)

    else:
        markov_matrix = create_markov_matrix_discrete(
            similarity_matrix,
            threshold,
        )

    scores = stationary_distribution(
        markov_matrix,
        increase_power=increase_power,
        normalized=False,
    )

    return scores


def _power_method(transition_matrix, increase_power=True, max_iter=10000):
    eigenvector = np.ones(len(transition_matrix))

    if len(eigenvector) == 1:
        return eigenvector

    transition = transition_matrix.transpose()

    for _ in range(max_iter):
        eigenvector_next = np.dot(transition, eigenvector)

        if np.allclose(eigenvector_next, eigenvector):
            return eigenvector_next

        eigenvector = eigenvector_next

        if increase_power:
            transition = np.dot(transition, transition)

    logger.warning("Maximum number of iterations for power method exceeded without convergence!")
    return eigenvector_next


def connected_nodes(matrix):
    _, labels = connected_components(matrix)

    groups = []

    for tag in np.unique(labels):
        group = np.where(labels == tag)[0]
        groups.append(group)

    return groups


def create_markov_matrix(weights_matrix):
    n_1, n_2 = weights_matrix.shape
    if n_1 != n_2:
        raise ValueError("'weights_matrix' should be square")

    row_sum = weights_matrix.sum(axis=1, keepdims=True)

    # Fall back to a row-wise softmax if any weight is zero or negative
    if np.min(weights_matrix) <= 0:
        return softmax(weights_matrix, axis=1)

    return weights_matrix / row_sum


def create_markov_matrix_discrete(weights_matrix, threshold):
    discrete_weights_matrix = np.zeros(weights_matrix.shape)
    ixs = np.where(weights_matrix >= threshold)
    discrete_weights_matrix[ixs] = 1

    return create_markov_matrix(discrete_weights_matrix)


def stationary_distribution(
    transition_matrix,
    increase_power=True,
    normalized=True,
):
    n_1, n_2 = transition_matrix.shape
    if n_1 != n_2:
        raise ValueError("'transition_matrix' should be square")

    distribution = np.zeros(n_1)

    grouped_indices = connected_nodes(transition_matrix)

    for group in grouped_indices:
        t_matrix = transition_matrix[np.ix_(group, group)]
        eigenvector = _power_method(t_matrix, increase_power=increase_power)
        distribution[group] = eigenvector

    if normalized:
        distribution /= n_1

    return distribution
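

# The core idea of the functions above can be checked on a toy matrix. This is a
# simplified sketch, not the library's implementation: it uses plain power
# iteration, without the matrix squaring of `increase_power` or the
# connected-component grouping, and it normalizes the scores to sum to 1.
# The similarity values are made up.

```python
import numpy as np

# Toy symmetric similarity matrix for three sentences (all entries positive)
sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Row-normalize so each row is a probability distribution
markov = sim / sim.sum(axis=1, keepdims=True)

# Plain power iteration: repeatedly apply the transposed matrix
scores = np.ones(len(markov)) / len(markov)
for _ in range(1000):
    nxt = markov.T @ scores
    if np.allclose(nxt, scores, rtol=0.0, atol=1e-12):
        break
    scores = nxt

print(np.round(scores, 3))
print(int(np.argmax(scores)))  # 1, the sentence most similar to the others
```

# For a symmetric similarity matrix, the stationary scores are proportional to
# the row sums, so the sentence with the largest total similarity ranks first.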

In [169]:
# Compute the centrality for each sentence using LexRank
# Takes similarity scores computed from Sentence Transformers
centrality_scores = degree_centrality_scores(similarity_scores, threshold=None)

In [170]:
# Argsort to put sentence with the highest score as first element
most_central_sentence_indices = np.argsort(-centrality_scores)


# Print sentence with the highest score
print("\n\nSummary:")
for idx in most_central_sentence_indices[0:1]:
    print(sentences[idx].strip())



Summary:
the epsilon ...differences between model free and model based reinforcement ...  jun          ... reinforcement learning (rl) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to ...introduction to machine learning: what is and its applications ...    days ago ... machine learning enables computers to learn from data  identify patterns  and make predictions  driving efficiency and personalization ...
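
The ranking step extends naturally from one sentence to the top `k`. This minimal sketch uses made-up sentences and scores, and prints the selected sentences in their original order, which is a design choice rather than something the notebook does.

```python
import numpy as np

# Hypothetical sentences and centrality scores
sentences = ["first sentence.", "second sentence.", "third sentence.", "fourth sentence."]
scores = np.array([0.9, 1.4, 0.7, 1.1])

k = 2
top = np.argsort(-scores)[:k]  # indices of the k highest scores
for idx in sorted(top):        # print in original document order
    print(sentences[idx])
```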


Ranked by LexRank centrality over the Sentence Transformers similarity scores, the top sentence is returned as a one-sentence summary of the reinforcement learning corpus.