# Extractive Summarization
1. Using the spaCy library
2. Using GloVe embeddings and cosine similarity

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx




In [2]:
from utils import process_text, extract_text_from_url
from spacy_helper import summarize

### Get and process text

In [3]:
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
text = extract_text_from_url(url)

In [4]:
sentences = nltk.sent_tokenize(text)

In [5]:
clean_sentences = [process_text(s) for s in sentences]

In [6]:
len(clean_sentences)

45

## Using spaCy

### spaCy's approach
- Tokenize the text.
- Count how often each word occurs and normalize the counts by the total number of words.
- Score each sentence as the sum of the normalized counts of its words.
- Keep a chosen percentage of the highest-scoring sentences as the summary.
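
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `spacy_helper.summarize` (which is not shown here): the regex sentence split and tokenizer stand in for spaCy's pipeline, and `TOY_STOP_WORDS` and `frequency_summary` are hypothetical names introduced for this example.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (assumption); the notebook imports spaCy's STOP_WORDS.
TOY_STOP_WORDS = {"a", "an", "and", "are", "is", "in", "it", "of", "the", "to"}


def tokens(sentence):
    """Lowercase word tokens with stop words removed (regex stand-in for spaCy)."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in TOY_STOP_WORDS]


def frequency_summary(text, ratio):
    """Return the top `ratio` share of sentences, scored by normalized word counts."""
    # Naive split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word counts over the whole text, normalized by the total number of counted words.
    counts = Counter(w for s in sentences for w in tokens(s))
    if not counts:
        return []
    total = sum(counts.values())
    weights = {w: c / total for w, c in counts.items()}

    # A sentence's score is the sum of the normalized counts of its words.
    scores = {i: sum(weights[w] for w in tokens(s)) for i, s in enumerate(sentences)}

    # Keep at least one sentence, then restore document order.
    k = max(1, int(len(sentences) * ratio))
    top = sorted(nlargest(k, scores, key=scores.get))
    return [sentences[i] for i in top]


sample = (
    "Cats chase mice. Dogs chase cats and dogs chase balls. "
    "The weather is mild. Mice fear cats."
)
summary = frequency_summary(sample, 0.5)
print(summary)
```

Because the score is a plain sum, longer sentences tend to score higher under this scheme, which may be worth keeping in mind when reading the summary further down.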

In [7]:
# the helper function contains the scoring details,
# the notebook is for presentation
# the second argument (0.1) keeps the top 10% highest-scoring sentences
sum_text = summarize(text, 0.1)

In [8]:
st_sent = nltk.sent_tokenize(sum_text)
len(st_sent)

4

In [9]:
st_sent

['That popularity was due partly to a flurry of results showing that such techniques[10][11] can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling[12] and parsing.',
 '[13][14] This is increasingly important in medicine and healthcare, where NLP helps analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care[15] or protect patient privacy.',
 '[16]\n Symbolic approach, i.e., the hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup, was historically the first approach used both by AI in general and by NLP in particular:[17][18] such as by writing grammars or devising heuristic rules for stemming.In 2003, word n-gram model, at the time the best statistical algorithm, was overperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language 

## Using word embeddings

### Using the Stanford GloVe embeddings: https://github.com/stanfordnlp/GloVe

In [10]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2023-12-03 20:29:47--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-12-03 20:29:47--  https://nlp.stanford.edu/data/glove.6B.zip
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-12-03 20:29:48--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8621826

In [11]:
# Extract word vectors; using the 100-dimensional GloVe vectors
word_embeddings = {}

# each line holds a word followed by its space-separated coefficients
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [12]:
sentence_vectors = []

for sent in clean_sentences:
    sent_words = sent.split()
    if sent_words:
        # average the word embeddings of the sentence; out-of-vocabulary
        # words contribute a zero vector, and the small constant in the
        # denominator only guards against division by zero
        vec = sum(word_embeddings.get(word, np.zeros((100, )))
                  for word in sent_words) / (len(sent_words) + 0.001)
    else:
        vec = np.zeros((100, ))
    sentence_vectors.append(vec)
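
To make the averaging step concrete, here is a small dependency-free sketch with made-up 3-dimensional vectors. The names `toy_embeddings` and `sentence_vector` are hypothetical, and plain lists replace the NumPy arrays used in the notebook so that the arithmetic is easy to follow.

```python
# Toy 3-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    "cats": [1.0, 0.0, 2.0],
    "chase": [0.0, 1.0, 0.0],
    "mice": [1.0, 1.0, 0.0],
}
DIM = 3


def sentence_vector(sentence, embeddings, dim, eps=0.001):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = sentence.split()
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    # eps keeps the division defined when the sentence has no words.
    return [t / (len(words) + eps) for t in total]


vec = sentence_vector("cats chase mice", toy_embeddings, DIM)
empty = sentence_vector("", toy_embeddings, DIM)
oov = sentence_vector("cats zebra", toy_embeddings, DIM)  # "zebra" is not in the toy dict
print(vec, empty, oov)
```

An unknown word still counts in the denominator, so it shrinks the vector's length without changing its direction. Since cosine similarity ignores length, this should not affect the similarity scores computed from these vectors.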

In [13]:
similarity_matrix = np.zeros([len(clean_sentences), len(clean_sentences)])

In [14]:
# pairwise cosine similarity between sentence vectors; the diagonal stays 0
for i in range(len(clean_sentences)):
    for j in range(len(clean_sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(
                sentence_vectors[i].reshape(1, 100),
                sentence_vectors[j].reshape(1, 100))[0, 0]

In [15]:
# building a graph enables identification of the key nodes (sentences) in the document
nx_graph = nx.from_numpy_array(similarity_matrix)

# PageRank scores each sentence by how strongly it is connected to other
# well-connected sentences, using the similarities as edge weights
scores = nx.pagerank(nx_graph)
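
The idea behind this ranking step can be shown with a short power-iteration sketch. It is not NetworkX's implementation: `cosine` and `pagerank_sketch` are hypothetical helpers, the vectors are made up, and the sketch assumes non-negative edge weights and a fixed number of iterations rather than a convergence test.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Power iteration on a non-negative weight matrix given as a list of lists."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Node j passes on its score in proportion to the weight of edge j -> i.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Toy sentence vectors (made up): the first two point in similar directions.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
n = len(vectors)
similarity = [
    [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
    for i in range(n)
]
scores = pagerank_sketch(similarity)
print(scores)
```

In this toy graph the middle vector is similar to both of the others, so it receives the highest score; the third vector, which is only weakly tied to the rest, ranks last.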

In [16]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(clean_sentences)), reverse=True)
len(ranked_sentences)

45

In [17]:
for idx in range(5):
    print(ranked_sentences[idx][1])

as example george lakoff offers methodology build natural language processing nlp algorithms perspective cognitive science along findings cognitive linguistics 47 two defining aspects ties cognitive linguistics part historical heritage nlp less frequently addressed since statistical turn 1990s
that popularity due partly flurry results showing techniques 10 11 achieve state-of-the-art results many natural language tasks e.g. language modeling 12 parsing
machine learning approaches include statistical neural networks hand many advantages symbolic approach although rule-based systems manipulating symbols still use 2020 become mostly obsolete advance llms 2023
in 2010s representation learning deep neural network-style featuring many hidden layers machine learning methods became widespread natural language processing
53 likewise ideas cognitive nlp inherent neural models multimodal nlp although rarely made explicit 54 developments artificial intelligence specifically tools technologies usin
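
The ranked output above shows the cleaned sentences, which are hard to read. Since `clean_sentences` is built one-to-one from `sentences`, the scores can be mapped back to the original text by index. The following is a minimal sketch with toy data; `top_original_sentences` is a hypothetical helper, and the sentences and scores below are made up rather than taken from the notebook.

```python
from heapq import nlargest

# Toy stand-ins for the notebook's `sentences` and PageRank `scores`.
sentences = [
    "NLP is a field of AI.",
    "It has many uses!",
    "Early systems used rules.",
    "Neural models came later.",
]
scores = {0: 0.31, 1: 0.12, 2: 0.22, 3: 0.35}


def top_original_sentences(sentences, scores, k):
    """Pick the k best-scoring indices and return the original sentences in document order."""
    best = nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best)]


summary = top_original_sentences(sentences, scores, 2)
print(" ".join(summary))
```

Returning the sentences in document order rather than score order usually makes the extracted summary easier to follow.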