# First approach: summarise text using the TextRank algorithm

This is adapted from the tutorial at https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/.  It is a single-domain, multiple-document summarization task: it assumes that similar documents have already been isolated and grouped, so the algorithm only ever sees a single cohesive group.

To be tried in future:
- FastText rather than GloVe word embeddings
- Some kind of native sentence embedding
- The clustering-based extractive summarization at link below

https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1

In [6]:
import numpy as np
import pandas as pd
import networkx as nx

import nltk
import re

from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from gensim.models import Doc2Vec

from sklearn.metrics.pairwise import cosine_similarity

In [58]:
# Download various corpora/dictionaries for nltk
nltk.download('punkt')
nltk.download('stopwords')

# Download the glove word embeddings
# Only do this once!
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove*.zip

[nltk_data] Downloading package punkt to /home/martin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/martin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 1. Load & clean text article

Idea for the particular problem/source and specific regexes taken from https://stackabuse.com/text-summarization-with-nltk-in-python/.

The article is the Wikipedia entry on artificial intelligence, https://en.wikipedia.org/wiki/Artificial_intelligence.  I just copy-pasted chunks, because I didn't want to complicate this trial with the BeautifulSoup details.

In [47]:
with open("wiki_ai.txt", "r") as f:
    article_text = f.read()

article_text[0:400]

'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Computer science defines AI research as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its'

In [48]:
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
article_text = re.sub(r'"', '', article_text)
article_text[0:400]

'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Computer science defines AI research as the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its g'
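
As a small illustration of what the three substitutions above do, here is a sketch on a made-up string. The sample text and the `clean_text` helper name are assumptions for illustration and are not part of the original notebook.

```python
import re

def clean_text(text):
    """Apply the same three substitutions as the cell above (hypothetical helper)."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)   # replace citation markers such as [12] with a space
    text = re.sub(r'\s+', ' ', text)          # collapse runs of whitespace, including newlines
    text = re.sub(r'"', '', text)             # remove straight double quotes
    return text

sample = 'AI studies "intelligent agents".[1]\nIt is   broad.[23]'
print(repr(clean_text(sample)))
```

Only the straight `"` character is removed, so typographic quotes such as “ and ” pass through unchanged, which is consistent with their presence in the tokenized sentences further down.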

## 2. Separate and clean sentences

In [49]:
sentences = sent_tokenize(article_text)
print("Number of retrieved sentences: ", len(sentences))
sentences[0:3]

Number of retrieved sentences:  444


['In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.',
 'Computer science defines AI research as the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.',
 'More in detail, Kaplan and Haenlein define AI as “a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation”.']

In [50]:
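stop_words = set(stopwords.words('english'))  # build once rather than once per word

def remove_stopwords(sen):
    return " ".join(word for word in sen.split() if word not in stop_words)

# NOTE: str.replace matches '[^a-zA-Z]' literally (it is not a regex), so punctuation
# survives, as the output below shows; re.sub(r'[^a-zA-Z]', ' ', s) would strip it.
clean_sentences = [remove_stopwords(s.replace('[^a-zA-Z]', ' ').lower()) for s in sentences]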
clean_sentences[0:3]

['computer science, artificial intelligence (ai), sometimes called machine intelligence, intelligence demonstrated machines, contrast natural intelligence displayed humans animals.',
 'computer science defines ai research study intelligent agents: device perceives environment takes actions maximize chance successfully achieving goals.',
 'detail, kaplan haenlein define ai “a system’s ability correctly interpret external data, learn data, use learnings achieve specific goals tasks flexible adaptation”.']
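
`str.replace` treats its first argument as a literal string rather than a regular expression, which is why punctuation is still present in the output above. A minimal sketch of a regex-based variant follows. The tiny stopword set is a stand-in for `stopwords.words('english')` so that the sketch runs without any downloads, and the `clean_sentence` name is hypothetical.

```python
import re

# Stand-in for nltk's English stopword list (assumption, for illustration only)
STOPWORDS = {"in", "is", "by", "to", "the", "and", "of", "a"}

def clean_sentence(sentence, stop_words=STOPWORDS):
    """Replace non-letters with spaces, lowercase, then drop stopwords."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', sentence)
    return " ".join(w for w in letters_only.lower().split() if w not in stop_words)

example = "In computer science, artificial intelligence (AI) is studied."

# The literal pattern text never occurs in the sentence, so str.replace changes nothing
print(example.replace('[^a-zA-Z]', ' ') == example)
print(clean_sentence(example))
```

Stripping punctuation this way means tokens such as `science,` become `science`, so they can be matched against the embedding vocabulary in the next step.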

## 3. Generate sentence vectors using GloVe word embeddings

This is for measuring sentence similarity. It works by looking up the word embedding for each remaining keyword in a sentence and then averaging those embeddings to create a single sentence embedding. Words that are missing from the GloVe vocabulary contribute a zero vector.

In [27]:
word_embeddings = {}

with open('glove.6B.100d.txt', encoding='utf-8') as f:
    
    for line in f:
        values = line.split()
        
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        
        word_embeddings[word] = coefs

In [28]:
# An example word embedding
word_embeddings['goodbye']

array([ 0.49707  ,  0.23149  ,  0.40713  , -0.45075  , -0.19791  ,
        0.30654  , -0.063001 ,  0.27542  , -0.15643  , -0.47526  ,
        0.41297  ,  0.27763  ,  0.29307  ,  0.030136 ,  0.29642  ,
       -0.057653 ,  0.33991  , -0.10233  ,  0.4065   ,  0.7054   ,
        0.034193 ,  0.14666  , -0.81687  ,  0.08946  ,  0.7575   ,
        0.65597  , -0.73024  ,  0.032863 ,  1.3157   , -0.043748 ,
        0.028642 ,  0.48142  ,  1.0793   ,  0.21798  ,  0.0014403,
       -0.12771  ,  0.33855  , -0.3514   ,  0.41824  , -0.78994  ,
       -0.0030977, -0.33855  , -0.099491 , -0.092215 , -0.41304  ,
        0.16718  , -0.29054  ,  0.2469   ,  0.21102  , -0.61423  ,
       -0.34532  , -0.12433  ,  0.67826  ,  0.12531  , -0.26019  ,
       -1.0047   , -0.21648  ,  0.61789  ,  0.04159  ,  0.13253  ,
       -0.10514  ,  0.74716  , -0.57906  , -0.8061   ,  0.081409 ,
       -0.19144  ,  0.08183  ,  0.44171  , -0.11134  , -1.1417   ,
       -0.21043  ,  0.077252 ,  0.12823  , -0.79143  ,  0.0737

In [29]:
clean_sentence_vectors = []

for s in clean_sentences:
    if len(s) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in s.split()]) / ( len(s.split()) + 0.001 )
    
    else:
        v = np.zeros((100, ))
    
    clean_sentence_vectors.append(v)
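
To make the averaging in the cell above concrete, here is a toy sketch with made-up two-dimensional vectors in place of the 100-dimensional GloVe ones. The `sentence_vector` name and the toy embeddings are assumptions. The notebook cell divides by `len(s.split()) + 0.001`, whereas this sketch uses the plain mean.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence`; unknown words count as zero vectors."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]

# Made-up embeddings, for illustration only
toy_embeddings = {"machine": [1.0, 0.0], "learning": [0.0, 1.0]}

print(sentence_vector("machine learning", toy_embeddings, dim=2))   # [0.5, 0.5]
print(sentence_vector("machine xyzzy", toy_embeddings, dim=2))      # [0.5, 0.0]
```

The second call shows a side effect of this scheme: an out-of-vocabulary word still counts towards the denominator, so it shrinks the sentence vector towards zero. The direction of the vector, which is what cosine similarity measures, is unaffected.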

In [42]:
clean_sentence_vectors[0]

array([-0.10143692,  0.1497839 ,  0.20410766,  0.05169257,  0.07655364,
       -0.1374847 , -0.04935081, -0.14452541,  0.01069668,  0.13761435,
        0.03155492, -0.27092878,  0.30302594,  0.04063247,  0.18868743,
        0.22012094,  0.17044342,  0.11369723, -0.21930726,  0.00589068,
       -0.01255586, -0.16933864,  0.1564537 , -0.28184678, -0.11547442,
        0.00393545, -0.03242698,  0.0220843 , -0.02674257,  0.0818908 ,
        0.192553  ,  0.28410211, -0.39706817, -0.13418555,  0.2363252 ,
       -0.03458164, -0.19452069,  0.15324204,  0.1308446 , -0.01333038,
       -0.10377957, -0.15500956, -0.1518354 , -0.00741159,  0.01889252,
       -0.10405   ,  0.16874841,  0.00645233, -0.12718376, -0.31850214,
        0.32251486,  0.0064132 ,  0.30856701,  0.69100994,  0.092877  ,
       -1.1390456 ,  0.03049675, -0.09518538,  0.78377701,  0.17820066,
        0.05150691,  0.51316704,  0.06775952, -0.07132871,  0.29580481,
        0.05377457,  0.11555269, -0.12284428,  0.30812029,  0.21

## 3b.  Create Doc2Vec embeddings for each sentence using a pre-trained model

I'll first be using a Doc2Vec model trained on English Wikipedia, available from https://ibm.ent.box.com/s/3f160t4xpuya9an935k84ig465gvymm2.  Note: there's a chance the AI article used in this demo is in the model's training data, but that's OK, this is just a demo.  The model was created for the publication Lau and Baldwin, 2016, "An empirical evaluation of doc2vec with practical insights into document embedding generation", https://arxiv.org/abs/1607.05368.

I've taken the liberty of resaving the model, to future-proof changing formats/commands.

The inferred vectors are of size 300.

In [7]:
dbow = Doc2Vec.load("./enwiki_dbow/doc2vec.bin")



In [8]:
dbow.save("./enwiki_dbow/doc2vec2.bin")

In [31]:
# Inference hyper-parameters (note: gensim >= 4.0 renamed the `steps` argument of infer_vector to `epochs`)
start_alpha=0.01
infer_epoch=1000

clean_doc_vectors = []

for s in clean_sentences:
    if len(s) != 0:
        v = dbow.infer_vector(s.split(), alpha=start_alpha, steps=infer_epoch)
    
    else:
        v = np.zeros((dbow.vector_size, ))
    
    clean_doc_vectors.append(v)

## 4. TextRank Algorithm (applied to the GloVe sentence vectors)

In [32]:
# Get the cosine similarity between pairs of sentences
sim_mat = cosine_similarity(clean_sentence_vectors)

sim_mat.shape

(444, 444)
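
For readers unfamiliar with the similarity measure, the following dependency-free sketch shows what a pairwise cosine similarity matrix contains. The function names and toy vectors are assumptions, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """n x n list of lists; entry [i][j] is the cosine similarity of vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = similarity_matrix(toy_vectors)
print(len(sim), len(sim[0]))      # 3 3
print(round(sim[0][2], 4))        # 0.7071
```

With one vector per sentence, the result is square and symmetric, which matches the `(444, 444)` shape reported above for 444 sentences.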

In [33]:
# Build the similarity graph
sim_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(sim_graph)

In [34]:
ranked_sentences = sorted(((scores[i], s) for i,s in enumerate(sentences)), reverse=True)
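
The ranking step can be sketched without networkx as a plain power iteration over a weighted similarity matrix. This is a simplified illustration and not the networkx implementation: it assumes non-negative weights and a zero diagonal, uses a fixed number of iterations, and ignores rows that sum to zero. The damping factor of 0.85 matches the default `alpha` of `nx.pagerank`; the `pagerank` function and the toy data below are hypothetical.

```python
def pagerank(sim, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on a square similarity matrix (list of lists)."""
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each sentence i passes on its score in proportion to its edge weight to j
            incoming = sum(
                scores[i] * sim[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy data: the middle sentence is strongly similar to both of the others
toy_sentences = ["Sentence A.", "Sentence B.", "Sentence C."]
toy_sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]

toy_scores = pagerank(toy_sim)
toy_ranked = sorted(zip(toy_scores, toy_sentences), reverse=True)
print(toy_ranked[0][1])   # Sentence B.
```

A sentence scores highly when it is similar to many other sentences, which is the sense in which TextRank treats it as "representative" of the document.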

In [35]:
# The most representative 7 sentences
ranked_sentences[0:7]

[(0.0025384314329625793,
  'Many researchers predict that such narrow AI work in different individual domains will eventually be incorporated into a machine with artificial general intelligence (AGI), combining most of the narrow skills mentioned in this article and at some point even exceeding human ability in most or all these areas.'),
 (0.00250385628270578,
  'Some of the learners described below, including Bayesian networks, decision trees, and nearest-neighbor, could theoretically, if given infinite data, time, and memory, learn to approximate any function, including whatever combination of mathematical functions would best describe the entire world.'),
 (0.002497360627519849,
  'The increased successes with real-world data led to increasing emphasis on comparing different approaches against shared test data to see which approach performed best in a broader context than that provided by idiosyncratic toy models; AI research was becoming more scientific.'),
 (0.002494813489506159,

In [36]:
# The least representative 7 sentences
ranked_sentences[-7:]

[(0.0015251645880050766,
  'RNNs can be trained by gradient descent but suffer from the vanishing gradient problem.'),
 (0.001491896620595838,
  'In 1989, Yann LeCun and colleagues applied backpropagation to such an architecture.'),
 (0.001410829874108621,
  'Attendees Allen Newell (CMU), Herbert Simon (CMU), John McCarthy (MIT), Marvin Minsky (MIT) and Arthur Samuel (IBM) became the founders and leaders of AI research.'),
 (0.001361117484945754,
  'Early pioneers also include Alexey Grigorevich Ivakhnenko, Teuvo Kohonen, Stephen Grossberg, Kunihiko Fukushima, Christoph von der Malsburg, David Willshaw, Shun-Ichi Amari, Bernard Widrow, John Hopfield, Eduardo R. Caianiello, and others.'),
 (0.0013017724882350284,
  'champions, Brad Rutter and Ken Jennings, by a significant margin.'),
 (0.001292366421668597,
  'This includes embodied, situated, behavior-based, and nouvelle AI.'),
 (0.00033848585343657686, 'In 2011, a Jeopardy!')]

## 4b. TextRank Algorithm (applied to doc2vec model)

In [37]:
# Get the cosine similarity between pairs of sentences
sim_mat = cosine_similarity(clean_doc_vectors)

sim_mat.shape

(444, 444)

In [38]:
# Build the similarity graph
sim_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(sim_graph)

In [39]:
ranked_sentences = sorted(((scores[i], s) for i,s in enumerate(sentences)), reverse=True)

In [40]:
# The most representative 7 sentences
ranked_sentences[0:7]

[(0.002965890800008051,
  'They solve most of their problems using fast, intuitive judgements.'),
 (0.002906246989735876,
  'Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.'),
 (0.0028779946842064025,
  'Many learning algorithms use search algorithms based on optimization.'),
 (0.0028546425361181127,
  'Computer vision is the ability to analyze visual input.'),
 (0.0028227424624113094,
  'Several different forms of logic are used in AI research.'),
 (0.0027896476788019206,
  'Evolutionary computation uses a form of optimization search.'),
 (0.0027880423325487443,
  'robotics or machine learning), the use of particular tools (logic or artificial neural networks), or deep philosophical differences.')]

In [41]:
# The least representative 7 sentences
ranked_sentences[-7:]

[(0.0017463320009924639,
  'The use of AI in banking can be traced back to 1987 when Security Pacific National Bank in US set-up a Fraud Prevention Task force to counter the unauthorised use of debit cards.'),
 (0.0017421548157783777,
  'The third major approach, extremely popular in routine business AI applications, are analogizers such as SVM and nearest-neighbor: After examining the records of known past patients whose temperature, symptoms, age, and other factors mostly match the current patient, X% of those patients turned out to have influenza.'),
 (0.0017413034886258158,
  'One project that is being worked on at the moment is fighting myeloid leukemia, a fatal cancer where the treatment has not improved in decades.'),
 (0.0017166138626555172,
  'Progress slowed and in 1974, in response to the criticism of Sir James Lighthill and ongoing pressure from the US Congress to fund more productive projects, both the U.S. and British governments cut off exploratory research in AI.'),
 (0

# Conclusions on extractive summarisation approach

With the averaged GloVe word-vector representation (section 4), the top-ranked sentences are, counter-intuitively, NOT those that give the most general overall description of AI.  Bearing in mind the full article content, which goes into reasonable if math-free technical detail, the top-ranked sentences instead pick out several important general technical points. The least representative sentences are generally shorter and contain more specific terminology and names.

The doc2vec representation (section 4b) favours shorter sentences with more general terms. Possibly document vectors represent a sentence's distinct meaning better than averaged word vectors, which tend to favour long sentences that incorporate many keywords. The least representative sentences are generally longer but, as with the word-vector approach, go into specific details that are not needed for a summary or overview.  Given our ultimate task, "extract general descriptions of an unsupervised cluster's topic", the second approach using doc2vec is probably wiser.  Neither method accurately produces what one might term an abstract or summary.
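
The observations about sentence length are based on reading the ranked output by eye. One rough way to quantify them would be to compare mean word counts at the two ends of a ranking. The sketch below uses three `(score, sentence)` pairs copied from the doc2vec output above; the `mean_word_count` helper is hypothetical, and three sentences are far too few to support a conclusion.

```python
def mean_word_count(ranked, k, from_top=True):
    """Mean number of whitespace-separated words in the k top (or bottom) ranked sentences."""
    chosen = ranked[:k] if from_top else ranked[-k:]
    return sum(len(sentence.split()) for _, sentence in chosen) / len(chosen)

# (score, sentence) pairs in descending score order, taken from the doc2vec ranking above
sample_ranked = [
    (0.002965890800008051,
     'They solve most of their problems using fast, intuitive judgements.'),
    (0.0028779946842064025,
     'Many learning algorithms use search algorithms based on optimization.'),
    (0.0017413034886258158,
     'One project that is being worked on at the moment is fighting myeloid leukemia, '
     'a fatal cancer where the treatment has not improved in decades.'),
]

print(mean_word_count(sample_ranked, k=2, from_top=True))
print(mean_word_count(sample_ranked, k=1, from_top=False))
```

Run against the full `ranked_sentences` list from each approach, the same helper would give a quick numerical check on whether one representation systematically prefers shorter sentences.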


### *Opening text of the Wiki Artificial Intelligence article, representing a "human" summary*

> *In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Computer science defines AI research as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] More in detail, Kaplan and Haenlein define AI as “a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation”.[2] Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving".[3]*
*The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip in Tesler's Theorem, "AI is whatever hasn't been done yet."[4] For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology.[5] Modern machine capabilities generally classified as AI include successfully understanding human speech,[6] competing at the highest level in strategic game systems (such as chess and Go),[7] autonomously operating cars, and intelligent routing in content delivery networks and military simulations.*