#Extractive Summarization
The extractive text summarization
where important sentences and words from the given
text document are identified and those are combined
into the summary in a meaningful way.
To achieve extractive summarization in this discussion,
we will use the TextRank algorithm. The similarity
scores (the similarities between sentence vectors) are
tabulated and stored in the similarity matrix. The
similarity matrix is then converted into a graph. The
sentences are considered vertices and similarity scores
as edges. Finally, the sentences that are top in the
rankings appear in the required summary.


##Import Modules

In [5]:
import numpy as np
import pandas as pd
import nltk
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

nltk.download('punkt') # one time execution
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

##Convert string to list of sentences

In [6]:
def convertToList(text):
    senlist=[]
    s=''
    for ch in text:
        if ch=='.':
            s=s+ch
            s=s.lstrip()
            senlist.append(s)
            s=''
        else:
            s=s+ch
    return senlist

##Import Glove Embeddings

In [9]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2023-11-29 14:49:18--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-11-29 14:49:18--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-11-29 14:49:19--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

##Extractive Summarize Function

In [10]:
def extractiveSumm(article):

    senlist=convertToList(article)
    print(senlist)
    sentences = []
    for s in senlist:
        sentences.append(sent_tokenize(s))
    sentences = [y for x in sentences for y in x]
    print()

    # Extract word vectors
    word_embeddings = {}
    f = open('glove.6B.100d.txt', encoding='utf-8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
    f.close()

    #remove punctuations, numbers and special characters
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ",regex=False)

    # make alphabets lowercase
    clean_sentences = [s.lower() for s in clean_sentences]

    stop_words = stopwords.words('english')
    # function to remove stopwords
    def remove_stopwords(sen):
        sen_new = " ".join([i for i in sen if i not in stop_words])
        return sen_new
    # remove stopwords from the sentences
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    print(clean_sentences)

    # Extract word vectors
    word_embeddings = {}
    f = open('glove.6B.100d.txt', encoding='utf-8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
    f.close()

    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)

    # similarity matrix
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    print()
    print(sim_mat)
    nx_graph = nx.from_numpy_array(sim_mat)

    print()
    print(nx_graph)
    scores = nx.pagerank(nx_graph)
    print()
    print(scores)


    print()
    summlist=[]
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(senlist)), reverse=True)
    for i in range(len(ranked_sentences)//2):
        summlist.append(ranked_sentences[i][1])
        print(ranked_sentences[i][1], end=" ")
    summary=' '.join([str(elem) for i,elem in enumerate(summlist)])
    return summary

##Article to test Extractive Summarization

In [11]:
article="Gautam Adani has surged back into the top 20 wealthiest individuals globally, propelled by a consecutive market rally that boosted the combined market value of his enterprises by 1.33 lakh crore. Currently occupying the 19th position on the Bloomberg Billionaires Index, Adani has seen his overall net worth rise by $6.5 billion, according to the latest update from Bloomberg. Nonetheless, his total net worth for the year-to-date period remains $53.8 billion lower, as reported by ET."
summary=extractiveSumm(article)
print(summary)

['Gautam Adani has surged back into the top 20 wealthiest individuals globally, propelled by a consecutive market rally that boosted the combined market value of his enterprises by 1.', '33 lakh crore.', 'Currently occupying the 19th position on the Bloomberg Billionaires Index, Adani has seen his overall net worth rise by $6.', '5 billion, according to the latest update from Bloomberg.', 'Nonetheless, his total net worth for the year-to-date period remains $53.', '8 billion lower, as reported by ET.']

['gautam adani surged back top 20 wealthiest individuals globally, propelled consecutive market rally boosted combined market value enterprises 1.', '33 lakh crore.', 'currently occupying 19th position bloomberg billionaires index, adani seen overall net worth rise $6.', '5 billion, according latest update bloomberg.', 'nonetheless, total net worth year-to-date period remains $53.', '8 billion lower, reported et.']

[[0.         0.46055224 0.87585657 0.70120997 0.82233381 0.74823283]
 [