### Extractive Summarization using TextRank

Using a NetworkX graph built from a NumPy similarity matrix, where the sentences are nodes and the cosine similarity
between sentences forms the edge weights

In [1]:
! pwd

/Users/shreyasingh/PycharmProjects/NewsSnap_quick_digestable_news_summaries


In [1]:
# Create the .kaggle directory
!mkdir -p ~/.kaggle

# Copy kaggle.json to the .kaggle directory
!cp kaggle.json ~/.kaggle/

# Set permissions for the Kaggle API token
!chmod 600 ~/.kaggle/kaggle.json

# Confirm Kaggle API setup
!kaggle datasets list -s "bbc-news-summary"

ref                                                           title                                                 size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------  --------------------------------------------------  ------  -------------------  -------------  ---------  ---------------  
pariza/bbc-news-summary                                       BBC News Summary                                       9MB  2018-05-06 11:08:19          14963        177  0.75             
jacopoferretti/bbc-articles-dataset                           BBC Articles Dataset with Extra Features               4MB  2024-11-11 17:50:09           1235         34  1.0              
bhavikjikadara/bbc-news-articles                              BBC News Articles                                      3MB  2024-07-04 08:45:12            564         17  1.0              
dignity45/bbc-news-summarycsv-format                          BBC

In [2]:
!kaggle datasets download -d pariza/bbc-news-summary
!unzip bbc-news-summary.zip -d bbc-news-summary

Dataset URL: https://www.kaggle.com/datasets/pariza/bbc-news-summary
License(s): CC0-1.0
Downloading bbc-news-summary.zip to /content
  0% 0.00/8.91M [00:00<?, ?B/s]
100% 8.91M/8.91M [00:00<00:00, 97.1MB/s]


In [5]:
! pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=72e7a736ab9840266c7f009ef870b1a2a95bd09729f755bf4ac363049c1e94a7
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


### Extractive summarization

In [6]:
import os
import time
import numpy as np
import networkx as nx
import nltk, re
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
from nltk.tokenize import sent_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

### Directory paths

In [7]:
articles_dir = "/content/bbc-news-summary/BBC News Summary/News Articles/"
summaries_dir = "/content/bbc-news-summary/BBC News Summary/Summaries/"

### Read one text file under the business folder

In [8]:
# Open the file in read mode
with open("/content/bbc-news-summary/BBC News Summary/News Articles/business/001.txt", "r") as file:
    content = file.read()

# Print the content
print(content)

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL

### Load articles and summaries


In [10]:
# Function to load all articles or summaries from multiple subfolders
def load_text_files_from_all_categories(base_directory):
    data = {}
    for category in os.listdir(base_directory):  # Loop through categories
        category_path = os.path.join(base_directory, category)
        if os.path.isdir(category_path):  # Check if it's a directory
            for file in os.listdir(category_path):  # Loop through files
                file_path = os.path.join(category_path, file)
                with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                    data[f"{category}/{file}"] = f.read().strip()
    return data

# Load all articles and summaries
articles = load_text_files_from_all_categories(articles_dir)
summaries = load_text_files_from_all_categories(summaries_dir)

print(f"Loaded {len(articles)} articles and {len(summaries)} summaries from all categories.")

Loaded 2225 articles and 2225 summaries from all categories.


In [12]:
def read_text(txt: str = ""):
  # Split the text into sentences; they are returned unmodified so that the
  # extracted summary keeps the original punctuation and casing
  sentences = sent_tokenize(txt)
  return sentences

### Example summary text

In [13]:
summarize_me = """The term Data Science was created in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data which was being amassed at the time. At the time, there was no way of predicting the truly massive amounts of data over the next fifty years. Data Science continues to evolve as a discipline using computer science and statistical methodology to make useful predictions and gain insights in a wide range of fields. While Data Science is used in areas such as astronomy and medicine, it is also used in business to help make smarter decisions.
Statistics, and the use of statistical models, are deeply rooted within the field of Data Science. Data Science started with statistics and has evolved to include concepts/practices such as artificial intelligence, machine learning, and the Internet of Things, to name a few. As more and more data has become available, first by way of recorded shopping behaviors and trends, businesses have been collecting and storing it in ever greater amounts. With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information or big data. Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.
The term Data Science was created in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data which was being amassed at the time. At the time, there was no way of predicting the truly massive amounts of data over the next fifty years. Data Science continues to evolve as a discipline using computer science and statistical methodology to make useful predictions and gain insights in a wide range of fields. While Data Science is used in areas such as astronomy and medicine, it is also used in business to help make smarter decisions.
Statistics, and the use of statistical models, are deeply rooted within the field of Data Science. Data Science started with statistics and has evolved to include concepts/practices such as artificial intelligence, machine learning, and the Internet of Things, to name a few. As more and more data has become available, first by way of recorded shopping behaviors and trends, businesses have been collecting and storing it in ever greater amounts. With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information or big data. Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.
A functional data scientist, as opposed to a general statistician, has a good understanding of software architecture and understands multiple programming languages. The data scientist defines the problem, identifies the key sources of information, and designs the framework for collecting and screening the needed data. Software is typically responsible for collecting, processing, and modeling the data. They use the principles of Data Science, and all the related sub-fields and practices encompassed within Data Science, to gain deeper insight into the data assets under review.
There are many different dates and timelines that can be used to trace the slow growth of Data Science and its current impact on the Data Management industry, some of the more significant ones are outlined below."""


Cosine Distance:

    Cosine distance is often used as a dissimilarity measure. It is defined as:

$$\text{Cosine Distance} = 1 - \text{Cosine Similarity}$$

    Cosine similarity lies in [-1, 1], so cosine distance lies in [0, 2]:
        0 means the vectors point in the same direction (perfect match),
        1 means the vectors are orthogonal (no similarity),
        2 means the vectors point in opposite directions (completely dissimilar).

    Word-count vectors have no negative entries, so for sentences both the similarity and the distance stay within [0, 1].
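
The three reference points above can be checked by hand. The following is a minimal sketch using only the standard library; `cosine_similarity` and `cosine_dist` are hypothetical helper names introduced for illustration and are not the NLTK `cosine_distance` function used later in this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_dist(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

same = cosine_dist([1, 2, 0], [2, 4, 0])        # same direction
orthogonal = cosine_dist([1, 0, 0], [0, 3, 0])  # no shared dimension
opposite = cosine_dist([1, 2, 0], [-1, -2, 0])  # opposite direction
print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three printed values are 0, 1, and 2.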

In [14]:
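def sentence_similarity(sentence1, sentence2, stopwords=None):
  """
  This function computes the cosine similarity between two sentences
  by representing them as vectors of word occurrences.
  """
  stopwords = set(stopwords or [])
  # Iterating over a str yields characters, so split raw sentences into words first
  if isinstance(sentence1, str):
    sentence1 = sentence1.split()
  if isinstance(sentence2, str):
    sentence2 = sentence2.split()
  sentence1 = [word.lower() for word in sentence1]
  sentence2 = [word.lower() for word in sentence2]
  all_words = list(set(sentence1 + sentence2))
  vector1 = [0] * len(all_words)
  vector2 = [0] * len(all_words)
  # First sentence vector
  for word in sentence1:
    if word not in stopwords:
      vector1[all_words.index(word)] += 1
  # Second sentence vector
  for word in sentence2:
    if word not in stopwords:
      vector2[all_words.index(word)] += 1
  # A sentence made up only of stopwords has a zero vector; treat it as dissimilar
  if not any(vector1) or not any(vector2):
    return 0.0
  # Cosine similarity = 1 - cosine distance
  return 1 - cosine_distance(vector1, vector2)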

In [15]:
def sentences_similarity_matrix(sentences, stopwords_):
  """
  This function will output a matrix where each element (i, j) represents the similarity score
  between the i-th and j-th sentences, excluding the specified stopwords.
  We can use this matrix to observe sentences which have the highest similarity scores.
  """
  similarity_matrix = np.zeros((len(sentences), len(sentences))) # N x N
  for i in range(len(sentences)):
      for j in range(len(sentences)):
        if i != j:
          similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j], stopwords_)
  return similarity_matrix
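
Cosine similarity is symmetric, so entry (i, j) always equals entry (j, i). As a sketch of one possible design choice, the matrix can be filled by computing only the pairs with j > i and mirroring them, which roughly halves the number of comparisons. The helpers `bow_cosine` and `symmetric_similarity_matrix` below are hypothetical names, assume already-tokenized sentences, and use plain lists instead of NumPy to stay self-contained.

```python
from collections import Counter
import math

def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def symmetric_similarity_matrix(tokenized_sentences):
    """Fill an N x N matrix by computing each unordered pair once."""
    n = len(tokenized_sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = bow_cosine(tokenized_sentences[i], tokenized_sentences[j])
            matrix[i][j] = score
            matrix[j][i] = score  # mirror across the diagonal
    return matrix

toy = [
    "profits rose at the media firm".split(),
    "the media firm reported higher profits".split(),
    "rain is expected tomorrow".split(),
]
sim = symmetric_similarity_matrix(toy)
for row in sim:
    print([round(value, 3) for value in row])
```

The first two toy sentences share four of their six words, so their score is 4/6; the third shares none and scores 0 against both.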

In [16]:
def sentence_similarity_after_stopword_removal(txt):
  nltk.download('stopwords')
  nltk.download('punkt_tab')
  final_stopwords = stopwords.words('english')
  # Read and tokenize txt
  sentences = read_text(txt)
  # Get similarity matrix by passing the stopwords
  sentence_similarity_matrix = sentences_similarity_matrix(sentences, final_stopwords)
  return sentence_similarity_matrix, sentences

In [None]:
sent_sim_matrix, original_sentences = sentence_similarity_after_stopword_removal(summarize_me)
sent_sim_matrix.shape

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


(23, 23)

In [None]:
sent_sim_matrix

In [None]:
def summarize(sentence_similarity_matrix, top_n, sentences):
  """
  Extractive summarization with TextRank, a graph-based ranking algorithm.
  Each node in the graph represents a sentence, and the edges between nodes
  are weighted by the similarity scores from the matrix.
  """
  # Convert the NumPy similarity matrix into a NetworkX graph
  sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
  # Rank sentences by their similarity to other sentences using PageRank
  scores = nx.pagerank(sentence_similarity_graph)
  # Sort sentences from highest to lowest score
  ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse = True)
  # Keep the top n sentences; slicing avoids an IndexError when there are fewer than top_n
  summarized_text = [sentence for _, sentence in ranked_sentences[:top_n]]
  # Output the summarized version
  summary = " ".join(summarized_text)
  return summary, len(sentences), ranked_sentences
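
To make the ranking step less of a black box, here is a simplified sketch of the power iteration that PageRank is based on, applied to a weighted similarity matrix. It is an illustration under stated assumptions, not the NetworkX implementation: `pagerank_scores` is a hypothetical name, the damping factor of 0.85 is an assumed default, and the toy matrix is made up.

```python
def pagerank_scores(weights, damping=0.85, iterations=100, tol=1e-9):
    """Power iteration over a weighted adjacency matrix (list of lists)."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    ranks = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by nodes with no outgoing weight is shared equally
        dangling = sum(ranks[i] for i in range(n) if row_sums[i] == 0)
        new_ranks = []
        for j in range(n):
            # Each node passes its rank on in proportion to its edge weights
            incoming = sum(
                ranks[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if change < tol:
            break
    return ranks

# Toy similarity matrix: sentence 1 is strongly similar to both others
toy_matrix = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.6],
    [0.1, 0.6, 0.0],
]
scores = pagerank_scores(toy_matrix)
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(top, [round(s, 3) for s in scores])
```

The sentence with the largest total similarity to the others (index 1 in the toy matrix) ends up with the highest score, which is the intuition behind picking the top-ranked sentences as the summary.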


In [None]:
final_summary, sent_len, ranked_sentences = summarize(sent_sim_matrix, 3, original_sentences)

In [None]:
final_summary

'There are many different dates and timelines that can be used to trace the slow growth of Data Science and its current impact on the Data Management industry, some of the more significant ones are outlined below. They use the principles of Data Science, and all the related sub-fields and practices encompassed within Data Science, to gain deeper insight into the data assets under review. The term Data Science was created in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data which was being amassed at the time.'

In [None]:
type(final_summary), type(summarize_me)

(str, str)

In [None]:
ranked_sentences[1][1]

'They use the principles of Data Science, and all the related sub-fields and practices encompassed within Data Science, to gain deeper insight into the data assets under review.'

In [None]:
# convert from tuple to list
final_ranked_sentences = []
# Take top 3 sentences of the summary
for final_ranked in ranked_sentences[:3]:
  final_ranked_sentences.append(final_ranked[1])

In [None]:
# Function to calculate BLEU using unigrams
def calculate_bleu_for_summary_unigram(reference, candidate):
    smoothing = SmoothingFunction().method1
    total_score = 0
    for cand_sentence in candidate:
        tokenized_cand = cand_sentence.split()
        # set weights to (1, 0, 0, 0) for unigram
        sentence_scores = [
            sentence_bleu([ref.split()], tokenized_cand, smoothing_function=smoothing,
                          weights=(1, 0, 0, 0))
            for ref in reference
        ]
        best_score = max(sentence_scores)  # Take the best match
        total_score += best_score
    average_score = total_score / len(candidate)  # Normalize by the number of summary sentences
    return average_score

In [None]:
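# Unigram check. The reference is the source text itself, so every extracted sentence
# matches a reference sentence exactly and the score is 1.0 by construction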
summary_bleu_unigram = calculate_bleu_for_summary_unigram(original_sentences, final_ranked_sentences)
summary_bleu_unigram

1.0

In [None]:
# Function to calculate BLEU using 1-grams to 4-grams
def calculate_bleu_for_summary_4gram(reference, candidate):
    smoothing = SmoothingFunction().method1
    total_score = 0
    for cand_sentence in candidate:
        tokenized_cand = cand_sentence.split()
        # set weights to (0.25, 0.25, 0.25, 0.25) to weight 1-grams to 4-grams equally
        sentence_scores = [
            sentence_bleu([ref.split()], tokenized_cand, smoothing_function=smoothing,
                          weights=(0.25, 0.25, 0.25, 0.25))
            for ref in reference
        ]
        best_score = max(sentence_scores)  # Take the best match
        total_score += best_score
    average_score = total_score / len(candidate)  # Normalize by the number of summary sentences
    return average_score
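
The `weights` tuple controls which n-gram precisions enter the BLEU score: `(1, 0, 0, 0)` uses unigram precision only, while `(0.25, 0.25, 0.25, 0.25)` combines the 1-gram to 4-gram precisions in a geometric mean. The sketch below shows only the clipped n-gram precision that BLEU builds on; it leaves out the brevity penalty and smoothing, and `ngrams` and `clipped_precision` are hypothetical helper names rather than NLTK functions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference_tokens, candidate_tokens, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # An n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
p1 = clipped_precision(reference, candidate, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(reference, candidate, 2)  # 3 of 5 bigrams match
print(round(p1, 3), round(p2, 3))

# Clipping stops a repeated word from inflating the score
repeated = clipped_precision("the cat".split(), "the the the".split(), 1)
print(round(repeated, 3))
```

Higher-order precisions drop faster than unigram precision when word order differs, which is why the 4-gram setting is the stricter of the two.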

In [None]:
# Calculate BLEU
summary_bleu_4gram = calculate_bleu_for_summary_4gram(original_sentences, final_ranked_sentences)

# Display results
print(f"BLEU Score for the summarized text: {summary_bleu_4gram:.4f}")

BLEU Score for the summarized text: 1.0000


### ROUGE score

In [None]:
# Join the sentences so the extracted summary is scored against the full source text
reference_summary = " ".join(original_sentences)
candidate_summary = " ".join(final_ranked_sentences)

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Compute ROUGE scores
scores = scorer.score(reference_summary, candidate_summary)

# Display results
for metric, score in scores.items():
    print(f"{metric.upper()} - Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}")


ROUGE1 - Precision: 1.0000, Recall: 0.1672, F1: 0.2865
ROUGE2 - Precision: 0.9800, Recall: 0.1625, F1: 0.2788
ROUGEL - Precision: 0.5644, Recall: 0.0944, F1: 0.1617
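
The pattern in these numbers (ROUGE-1 precision of 1.0 with low recall) follows from scoring an extract against its own source: every summary word appears in the source, but the summary covers only a fraction of it. The sketch below reproduces that effect with a simplified ROUGE-1 on made-up text. It is an assumption-laden illustration: `rouge1` is a hypothetical helper that splits on whitespace and does no stemming, unlike the `rouge_score` package.

```python
from collections import Counter

def rouge1(reference_tokens, candidate_tokens):
    """Unigram-overlap precision, recall and F1 between two token lists."""
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(len(candidate_tokens), 1)
    recall = overlap / max(len(reference_tokens), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

source = "profits rose sharply sales also rose the outlook is stable".split()
extract = "sales also rose".split()
precision, recall, f1 = rouge1(source, extract)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

Here the extract has 3 tokens, all found in the 10-token source, so precision is 3/3 and recall is 3/10.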


In [None]:
# Function to process all articles
def process_all_articles(articles, summaries, top_n=3):
    stopwords_list = stopwords.words('english')
    final_results = []
    total_bleu = 0
    for file, article in articles.items():
        st = time.time()
        # Split the reference summary into lines
        reference_summary = summaries.get(file, "").splitlines()
        # Tokenize and calculate similarity matrix
        sentences = sent_tokenize(article)
        similarity_matrix = sentences_similarity_matrix(sentences, stopwords_list)

        # Generate summary
        final_summary, sent_len, ranked_sentences = summarize(similarity_matrix, top_n, sentences)

        # Calculate BLEU
        bleu_score = calculate_bleu_for_summary_4gram(reference_summary, final_summary.splitlines())

        # Store results
        final_results.append({
            "file": file,
            "summary": final_summary,
            "bleu": bleu_score
        })

        # Accumulate BLEU scores
        total_bleu += bleu_score
        print("Time taken is :", time.time() - st)

    # Compute average BLEU (0.0 when no articles were processed)
    average_bleu = total_bleu / len(final_results) if final_results else 0.0

    return final_results, average_bleu
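
Because `process_all_articles` looks up references with `summaries.get(file, "")`, an article without a matching summary gets an empty reference list, and the BLEU helper would then raise an error when it calls `max()` on an empty list. A quick pre-check of the two dictionaries can catch this before the long loop starts. The sketch below uses a hypothetical helper, `find_unpaired`, and toy dictionaries in place of the real `articles` and `summaries`.

```python
def find_unpaired(article_texts, summary_texts):
    """Return the keys that appear in only one of the two dictionaries."""
    missing_summaries = sorted(article_texts.keys() - summary_texts.keys())
    missing_articles = sorted(summary_texts.keys() - article_texts.keys())
    return missing_summaries, missing_articles

# Toy data keyed the same way as the loader output: "<category>/<file>"
toy_articles = {"business/001.txt": "article text", "tech/002.txt": "article text"}
toy_summaries = {"business/001.txt": "summary text"}

missing_summaries, missing_articles = find_unpaired(toy_articles, toy_summaries)
print(missing_summaries)  # ['tech/002.txt']
print(missing_articles)   # []
```

If both lists come back empty, every article has a reference summary under the same key.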

In [None]:
# Process all articles
results, avg_bleu = process_all_articles(articles, summaries)

# Display the results
print(f"Processed {len(results)} articles.")
print(f"Average BLEU Score: {avg_bleu:.4f}")

Time taken is : 0.7870628833770752
Time taken is : 0.09772658348083496
Time taken is : 0.3102295398712158
Time taken is : 0.6910796165466309
Time taken is : 11.563915729522705
Time taken is : 0.22063708305358887
Time taken is : 0.13146448135375977
Time taken is : 0.10208725929260254
Time taken is : 0.3287646770477295
Time taken is : 0.05009317398071289
Time taken is : 0.08936476707458496
Time taken is : 0.06034374237060547
Time taken is : 0.218735933303833
Time taken is : 0.09930658340454102
Time taken is : 0.1765279769897461
Time taken is : 0.13111305236816406
Time taken is : 0.3057713508605957
Time taken is : 0.11138033866882324
Time taken is : 0.25009679794311523
Time taken is : 0.3295712471008301
Time taken is : 0.1159367561340332
Time taken is : 0.2402036190032959
Time taken is : 0.3251063823699951
Time taken is : 0.08764076232910156
Time taken is : 0.10184669494628906
Time taken is : 0.26579928398132324
Time taken is : 0.19997477531433105
Time taken is : 0.45491981506347656
Time 

In [None]:
# # Example: Access individual summaries and BLEU scores
# for result in results[:5]:  # Show first 5 summaries
#     print(f"File: {result['file']}")
#     print(f"Generated Summary: {result['summary']}")
#     print(f"BLEU Score: {result['bleu']:.4f}")
#     print("-" * 50)
