# News Text Summarization (Extractive) using TextRank Algorithm
Dataset used: newsarticles.csv, which contains the text of articles covering the same news story from multiple publishers, collected from the news links in
https://www.kaggle.com/datasets/uciml/news-aggregator-dataset

In [None]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Read data

In [None]:
df = pd.read_csv("/content/drive/MyDrive/newsaggr/newsarticles.csv",encoding = 'unicode-escape')

In [None]:
df

Unnamed: 0,ID,TITLE,TEXT,PUBLISHER
0,1,"Fed official says weak data caused by weather,...",Bad weather is largely responsible for some re...,Los Angeles Times
1,2,Fed's Charles Plosser sees high bar for change...,Federal Reserve Bank of Philadelphia president...,Livemint
2,3,Fed's Plosser: Taper pace may be too slow,The Federal Reserve may have to accelerate the...,MarketWatch


## Create Word Embeddings

Split text into sentences

In [None]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['TEXT']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x]
print(sentences[:3])

['Bad weather is largely responsible for some recent weak economic data and should not lead the Federal Reserve to stop reducing a key stimulus program, a top central bank official said Monday.', 'Instead, with economic growth still forecast to pick up this year, the Fed might need to quicken the pullback of its monthly bond-buying program, said Charles Plosser, president of the Federal Reserve Bank of Philadelphia.', '"In recent weeks, there has been a blizzard of economic reports, which have come in weaker than expected," Plosser said in a speech in Paris.']


In [None]:
# Download GloVe Word Embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2023-08-18 16:56:07--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-08-18 16:56:07--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-08-18 16:56:07--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’



In [None]:
# Extract word vectors: each line holds a token followed by its 100 vector components
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
len(word_embeddings)
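
To make the file format concrete, here is a minimal sketch that parses two made-up lines in the same token-then-components layout. The sample text and the name `toy_embeddings` are assumptions for illustration only; the real file has 100 components per line.

```python
import io

# Two made-up lines in GloVe text format: a token, then its vector components.
# Only 3 components are used here for brevity (assumption for illustration).
sample = io.StringIO("the 0.1 0.2 0.3\nfed -0.5 0.0 0.25\n")

toy_embeddings = {}
for line in sample:
    # First field is the token, the rest are the vector components
    token, *components = line.split()
    toy_embeddings[token] = [float(c) for c in components]

print(toy_embeddings["fed"])  # [-0.5, 0.0, 0.25]
```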

In [None]:
# remove punctuation, numbers and special characters
# (regex=True is required; newer pandas versions treat the pattern as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]


In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
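
The cleaning steps can be checked on a single sentence. The sketch below is a standalone illustration using only the standard library; `toy_stop_words` is a tiny hypothetical stand-in for NLTK's English stop-word list, and `clean_sentence` is a helper name introduced here, not part of the notebook.

```python
import re

# Hypothetical, tiny stop-word list standing in for NLTK's English list
toy_stop_words = {"the", "in", "a", "has", "been", "of"}

def clean_sentence(sentence, stop_words):
    """Replace non-letters with spaces, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)

print(clean_sentence("In 2014, the Fed's taper pace has been slow.", toy_stop_words))
# fed s taper pace slow
```

Note the stray `s` left behind by the apostrophe in "Fed's": replacing every non-letter with a space splits possessives and contractions into fragments.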


In [None]:
# Create vectors for our sentences. First fetch vectors (each of size 100 elements) for the constituent words in a sentence and then
# take mean/average of those vectors to arrive at a consolidated vector for the sentence.

sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
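
As a small worked example of the averaging step, the sketch below uses hypothetical 3-dimensional embeddings instead of the 100-dimensional GloVe vectors. The names `toy_embeddings` and `sentence_vector` are assumptions for illustration, and the sketch divides by the exact word count rather than the count plus 0.001 used above.

```python
# Hypothetical 3-dimensional embeddings (the notebook uses 100-dimensional GloVe vectors)
toy_embeddings = {"fed": [1.0, 0.0, 2.0], "taper": [3.0, 2.0, 0.0]}
DIM = 3

def sentence_vector(cleaned_sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = cleaned_sentence.split()
    if not words:
        # Empty sentence: fall back to the zero vector
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        vector = embeddings.get(word, [0.0] * dim)
        total = [t + v for t, v in zip(total, vector)]
    return [t / len(words) for t in total]

print(sentence_vector("fed taper", toy_embeddings, DIM))  # [2.0, 1.0, 1.0]
```

One consequence of this design is that out-of-vocabulary words still count toward the denominator, so they pull the sentence vector toward zero.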

## Create Graph of Similarity Matrix Scores

In [None]:
# Initialize Similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [None]:
# Create Similarity Matrix
from sklearn.metrics.pairwise import cosine_similarity
# Compute all pairwise cosine similarities in one call instead of one call per pair
sim_mat = cosine_similarity(np.vstack(sentence_vectors))
# A sentence is not compared with itself, so zero the diagonal
np.fill_diagonal(sim_mat, 0)
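
Each off-diagonal entry of `sim_mat` is the cosine similarity of two sentence vectors. The sketch below spells out that computation in plain Python on three hypothetical 2-dimensional vectors; `cosine`, `toy_vectors` and `toy_sim` are illustrative names. Returning 0.0 for an all-zero vector is a choice made for this sketch so that empty sentences do not cause a division by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sentence vectors; the last one mimics an empty sentence
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
n = len(toy_vectors)

# Similarity matrix with a zero diagonal, mirroring the i != j condition
toy_sim = [[0.0 if i == j else cosine(toy_vectors[i], toy_vectors[j])
            for j in range(n)] for i in range(n)]

print(round(toy_sim[0][1], 4))  # 0.7071
```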

Convert the similarity matrix sim_mat into a graph. The nodes of this graph represent the sentences, and the edge weights are the similarity scores between them. Applying the PageRank algorithm to this graph yields a score for each sentence, which is used to rank the sentences.

In [None]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
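
To give some intuition for what the PageRank scores measure, here is a simplified power-iteration sketch on a hypothetical 3-sentence similarity matrix. It is not networkx's actual implementation; the function name `pagerank_power_iteration`, the fixed iteration count, and the toy matrix are assumptions for illustration. The damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_power_iteration(sim, damping=0.85, iterations=100):
    """Simplified weighted PageRank on a square similarity matrix (list of lists)."""
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Node i passes its score along edges in proportion to their weight
                    incoming += scores[i] * sim[i][j] / row_sums[i]
                else:
                    # Node with no edges: spread its score evenly
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical similarities: sentences 0 and 1 are close, sentence 2 is an outlier
toy_sim = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.2],
           [0.1, 0.2, 0.0]]

toy_scores = pagerank_power_iteration(toy_sim)
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)  # [1, 0, 2]
```

Sentences that are similar to many other sentences accumulate the highest scores, which is why they are treated as the most representative ones.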

## Create Extractive Summary using Top-ranked Sentences in the Graph

In [None]:
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
# Extract the top 10 sentences as the summary (slicing avoids an IndexError if there are fewer than 10)
for score, sentence in ranked_sentences[:10]:
  print(sentence)

Instead, with economic growth still forecast to pick up this year, the Fed might need to quicken the pullback of its monthly bond-buying program, said Charles Plosser, president of the Federal Reserve Bank of Philadelphia.
Bad weather is largely responsible for some recent weak economic data and should not lead the Federal Reserve to stop reducing a key stimulus program, a top central bank official said Monday.
The Federal Reserve may have to accelerate the pace of tapering to take into account the economic pickup currently ongoing in the U.S. and the improving forecast for the near future, Federal Reserve Bank of Philadelphia President Charles Plosser said Monday.
"Reducing the pace of asset purchases in measured steps is moving in the right direction, but the pace may leave us well behind the curve if the economy continues to play out according to the FOMC forecasts," he said.
New York Fed president William C. Dudley said on 7 March he sees a "reasonably favorable" outlook for the ec