# First approach;  summarise text using the TextRank algorithm

This is adapted from the tutorial at https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/.  This is a single-domain-multiple-documents summarization task, it assumes that similar documents have already been isolated/grouped, and it is only seeing a single cohesive group.

To be tried in future:
- FastText rather than GloVe word embeddings
- Some kind of native sentence embedding
- The clustering-based extractive summarization at link below

https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1

In [1]:
import numpy as np
import pandas as pd
import networkx as nx

import nltk
import re

from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from gensim.models import Doc2Vec

from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Download various corpora/dictionaries for nltk
nltk.download('punkt')
nltk.download('stopwords')

# Download the glove word embeddings
# Only do this once!
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove*.zip

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## 1. Load & clean text article

Idea for the particular problem/source and specific regexes taken from https://stackabuse.com/text-summarization-with-nltk-in-python/.

Article on articifial intelligence from Wiki https://en.wikipedia.org/wiki/Artificial_intelligence.  I just copy-pasted chunks, because I didn't want to complicate this trial with the BeautifulSoup details.

In [11]:
with open("wiki_ai.txt", "r", encoding="latin-1") as f:
    article_text = f.read()

article_text[0:400]

'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloq'

In [12]:
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
article_text = re.sub(r'"', '', article_text)
article_text[0:400]

'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquiall'

## 2. Separate and clean sentences

In [80]:
sentences = sent_tokenize(article_text)
print("Number of retrieved sentences: ", len(sentences))
sentences[0:3]

Number of retrieved sentences:  264


['In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.',
 'Leading AI textbooks define the field as the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.',
 'Colloquially, the term artificial intelligence is often used to describe machines (or computers) that mimic cognitive functions that humans associate with the human mind, such as learning and problem solving.']

## 3.  Insert "marker" sentences

In [81]:
sentences = ['This can be summarised as', 'Describe summarise conclude.', 'To summarize.'] + sentences
len(sentences)

267

In [82]:
def remove_stopwords(sen):
    return( " ".join([word for word in sen.split() if word not in stopwords.words('english')]))

clean_sentences = [remove_stopwords( s.replace('[^a-zA-Z]', ' ').strip(".").lower() ) for s in sentences]
clean_sentences[0:3]

['summarised', 'describe summarise conclude', 'summarize']

## 3. Generate sentence vectors using glove word embeddings

This is for measuring sentence similarity - it works by taking keywords and finding the word embeddings, then summing all the word embeddings within a given sentence, to create a sentence embedding.

In [83]:
word_embeddings = {}

with open('D:/Martin/Documents/GitHub/news_crow/lib/Glove/glove.6B.100d.txt', encoding='utf-8') as f:
    
    for line in f:
        values = line.split()
        
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        
        word_embeddings[word] = coefs

In [84]:
# An example word embedding
word_embeddings['goodbye']

array([ 0.49707  ,  0.23149  ,  0.40713  , -0.45075  , -0.19791  ,
        0.30654  , -0.063001 ,  0.27542  , -0.15643  , -0.47526  ,
        0.41297  ,  0.27763  ,  0.29307  ,  0.030136 ,  0.29642  ,
       -0.057653 ,  0.33991  , -0.10233  ,  0.4065   ,  0.7054   ,
        0.034193 ,  0.14666  , -0.81687  ,  0.08946  ,  0.7575   ,
        0.65597  , -0.73024  ,  0.032863 ,  1.3157   , -0.043748 ,
        0.028642 ,  0.48142  ,  1.0793   ,  0.21798  ,  0.0014403,
       -0.12771  ,  0.33855  , -0.3514   ,  0.41824  , -0.78994  ,
       -0.0030977, -0.33855  , -0.099491 , -0.092215 , -0.41304  ,
        0.16718  , -0.29054  ,  0.2469   ,  0.21102  , -0.61423  ,
       -0.34532  , -0.12433  ,  0.67826  ,  0.12531  , -0.26019  ,
       -1.0047   , -0.21648  ,  0.61789  ,  0.04159  ,  0.13253  ,
       -0.10514  ,  0.74716  , -0.57906  , -0.8061   ,  0.081409 ,
       -0.19144  ,  0.08183  ,  0.44171  , -0.11134  , -1.1417   ,
       -0.21043  ,  0.077252 ,  0.12823  , -0.79143  ,  0.0737

In [85]:
clean_sentence_vectors = []

for s in clean_sentences:
    if len(s) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in s.split()]) / ( len(s.split()) + 0.001 )
    
    else:
        v = np.zeros((100, ))
    
    clean_sentence_vectors.append(v)

## 3b.  Create Doc2Vec embeddings for each sentence using a pre-trained model

I'll first be using a Doc2Vec model trained on the english wikipedia available from https://ibm.ent.box.com/s/3f160t4xpuya9an935k84ig465gvymm2.  Note;  there's a chance this AI demo page is in the model - but that's ok, this is just a demo.  Model was created for publication Han and Baldwin, 2016, "An empirical evaluation of Doc2Vec with practical insights into document embedding generation", https://arxiv.org/abs/1607.05368.

I've taken the liberty of resaving the model, to future-proof changing formats/commands.

The inferred vectors are of size 300.

In [70]:
dbow = Doc2Vec.load("./enwiki_dbow/doc2vec.bin")

FileNotFoundError: [Errno 2] No such file or directory: './enwiki_dbow/doc2vec.bin'

In [None]:
dbow.save("./enwiki_dbow/doc2vec2.bin")

In [None]:
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

clean_doc_vectors = []

for s in clean_sentences:
    if len(s) != 0:
        v = dbow.infer_vector(s.split(), alpha=start_alpha, steps=infer_epoch)
    
    else:
        v = np.zeros((dbow.vector_size, ))
    
    clean_doc_vectors.append(v)

## 4. Extract sentences most similar to artificial "summary type" sentences

In [86]:
# Get the cosine similarity between pairs of sentences
sim_mat = cosine_similarity(clean_sentence_vectors)

sim_mat.shape

(267, 267)

In [87]:
np.argmax(sim_mat[0][1:])

1

In [96]:
index = 2
print(sentences[index],
      np.argmax(sim_mat[index][3:]),
      "\n\n",
      sentences[np.argmax(sim_mat[index][3:]) + 3])

To summarize. 159 

 Computational learning theory can assess learners by computational complexity, by sample complexity (how much data is required), or by other notions of optimization.
