Centroid - the most relevant tokens; tokens that carry the same (central) meaning
1. Sum up the vector representations of the words that are part of the centroid => get the embedding representation of the centroid.
2. Score every sentence by the cosine similarity between its embedding and the centroid embedding.
3. Select sentences in descending order of score until a certain number of words (a hyperparameter) is reached.
4. Avoid redundancy - if a chosen sentence is too similar to one already in the summary, don't add it (cosine similarity above a predefined threshold).
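
The four steps above can be sketched on toy data. This is only a minimal illustration, not the code from the papers linked below: the function name `centroid_summary`, the hand-picked 2-dimensional embeddings, and the default values of `word_budget` and `redundancy_threshold` are all assumptions made for the example.

```python
import math
from typing import Dict, List

Vector = List[float]

def cosine(v: Vector, w: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return sum(a * b for a, b in zip(v, w)) / norm if norm else 0.0

def embed(words: List[str], lookup: Dict[str, Vector]) -> Vector:
    """Sum the vectors of the known words; unknown words contribute nothing."""
    dim = len(next(iter(lookup.values())))
    total = [0.0] * dim
    for word in words:
        for i, x in enumerate(lookup.get(word, [0.0] * dim)):
            total[i] += x
    return total

def centroid_summary(sentences: List[str], centroid_words: List[str],
                     lookup: Dict[str, Vector], word_budget: int = 10,
                     redundancy_threshold: float = 0.95) -> List[str]:
    centroid = embed(centroid_words, lookup)              # step 1
    vectors = [embed(s.split(), lookup) for s in sentences]
    ranked = sorted(range(len(sentences)),                # step 2
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    chosen, used = [], 0
    for i in ranked:
        if used >= word_budget:                           # step 3
            break
        if any(cosine(vectors[i], vectors[j]) > redundancy_threshold
               for j in chosen):                          # step 4
            continue
        chosen.append(i)
        used += len(sentences[i].split())
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]

# Hypothetical 2-d embeddings, hand-picked for the example.
toy_lookup = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1],
              "stock": [0.0, 1.0], "market": [0.1, 0.9]}
toy_sentences = ["cat kitten", "kitten cat", "stock market"]
summary = centroid_summary(toy_sentences, ["cat", "kitten"], toy_lookup)
print(summary)
```

The second toy sentence has the same embedding as the first, so the redundancy check drops it. Because cosine similarity ignores vector length, summing (as in step 1) and averaging the word vectors give the same ranking.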

https://aclanthology.org/W17-1003.pdf

https://arxiv.org/pdf/1707.02268v3.pdf

News headlines

Web snippets from search results

The text below is from https://en.wikipedia.org/wiki/ReBoot

In [20]:
text = '''
Development
ReBoot was initially conceived in 1984 by the British creative collective The Hub, made up of John Grace, Ian Pearson, Gavin Blair and Phil Mitchell. After about 8 years of development Pearson, Blair and Mitchell moved to Vancouver, British Columbia to produce the series. Pearson and Blair by this time had created some of the first widely seen CGI characters, in the Dire Straits music video "Money for Nothing".[16] However, technology was not yet advanced enough to make the show in the desired way. 3D animation tests began in earnest in 1990 and ReBoot had achieved its detailed look by 1991. Production continued on future episodes and the show aired in 1994 after enough episodes had been produced. This was a painstaking process, as no other company had at this time worked on a 3D animation project of this scale. Furthermore, the software used was new to all in the company.

ReBoot was created on Silicon Graphics workstations using Softimage Creative Environment software.

Network censorship
The show's early jokes at the expense of Board of Standards and Practices (BS&P) came from frustration encountered by the show's makers brought about by an abundance of script and editing changes that were imposed upon Mainframe before episodes were allowed to air. These changes were all aimed at making the show "appropriate" for kids, and to prevent even the slightest appearance of "inappropriate" content, imitable violence or sexuality.[17]

The character Dot was considered too sexualized by the BS&P even though she was "never one to expose much cleavage" so the animators were forced to make her breasts less curvy and form them into a lumpy "monobreast", as lightly referred to by the staff. However, starting with season three, after severing ties with ABC, the "monobreasts" of all adult female characters were replaced with more anatomically correct versions. In another case, the word "hockey", as well as the sport itself, was cut in some countries as it was supposedly used as a vulgar slang term there. In the episode "Talent Night", one scene of Dot giving her brother Enzo "a sisterly kiss on the chin" was cut due to BS&P's fear of promoting incest, an insinuation which Pearson described as "one of the sickest things I've heard."[17]

Episodes
Main article: List of ReBoot episodes
Season 1
Each installment of the first season was a self-contained episode except for the two-part finale. When the User loads a game, a game cube drops on a random location in Mainframe, sealing it off from the rest of the system and turning it into a gamescape. Bob frequently enters the games, reboots to become a game character, and fights the User's character to save the sector. If the User wins a game, the sector the cube fell in is destroyed, and the sprites and binomes who were caught within are turned into energy-draining, worm-like parasites called nulls. When this happens, they are said to be "nullified".


'''

In [21]:
import nltk
from nltk.corpus import stopwords
import numpy as np
import re
import string

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from typing import List
from functools import reduce
import operator

In [22]:
STOP_WORDS = set(stopwords.words('english'))

In [48]:
vector = List[float]

def dot(v: vector, w: vector):
    return sum([vi * wi for vi, wi in zip(v,w)])

def cos_sim(v: vector, w: vector):
    return dot(v, w) / (dot(v,v) * dot(w,w)) ** .5
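
A quick sanity check of cosine similarity on hand-picked vectors. Note that `cos_sim` as written divides by zero when either input is the zero vector. `safe_cos_sim` below is a hypothetical variant, not part of the notebook, that returns 0.0 in that case; whether 0.0 is the right fallback depends on the use case.

```python
from typing import List

def dot(v: List[float], w: List[float]) -> float:
    return sum(vi * wi for vi, wi in zip(v, w))

def safe_cos_sim(v: List[float], w: List[float]) -> float:
    """Cosine similarity with a guard for zero-length vectors (assumed fallback: 0.0)."""
    denom = (dot(v, v) * dot(w, w)) ** 0.5
    return dot(v, w) / denom if denom else 0.0

parallel = safe_cos_sim([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = safe_cos_sim([1.0, 0.0], [0.0, 1.0])   # no shared direction
opposite = safe_cos_sim([1.0, 0.0], [-1.0, 0.0])    # opposite direction
degenerate = safe_cos_sim([0.0, 0.0], [1.0, 1.0])   # zero vector, guarded
```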


In [49]:
class Preprocessing(object):
    def __init__(self, text):
        self.text = text
        self.oryg = text

    def lower(self):
        self.text = self.text.lower() 
        return self.text
    
    def remove_punctuation(self):
        self.text = self.text.translate(self.text.maketrans('', '', string.punctuation.replace('.', '')))
        return self.text 
    
    def remove_stop_words(self):
        self.text = ' '.join([word for word in self.text.split() if word not in STOP_WORDS])
        return self.text
    
    def remove_digits(self):
        # r'[\d+]' is a character class that also matches a literal '+'; r'\d+' matches runs of digits.
        self.text = re.sub(r'\d+', '', self.text)
        return self.text
    
    def sentence_tokenize(self):
        self.text = sent_tokenize(self.text)
        self.text = [sent.replace('.','') for sent in self.text]
        return self.text
    
    def basic_pipeline(self):
        self.lower()
        self.remove_digits()
        self.remove_punctuation()
        self.remove_stop_words()
        self.sentence_tokenize()
        return self.text

    def __call__(self):
        return self.text

In [50]:
cleaned_text = Preprocessing(text)
sentences = cleaned_text.basic_pipeline()

In [121]:
len(sentences)

20

In [51]:
sentences

['development reboot initially conceived british creative collective hub made john grace ian pearson gavin blair phil mitchell',
 'years development pearson blair mitchell moved vancouver british columbia produce series',
 'pearson blair time created first widely seen cgi characters dire straits music video money nothing',
 'however technology yet advanced enough make show desired way',
 'animation tests began earnest reboot achieved detailed look ',
 'production continued future episodes show aired enough episodes produced',
 'painstaking process company time worked animation project scale',
 'furthermore software used new company',
 'reboot created silicon graphics workstations using softimage creative environment software',
 'network censorship shows early jokes expense board standards practices bsp came frustration encountered shows makers brought abundance script editing changes imposed upon mainframe episodes allowed air',
 'changes aimed making show appropriate kids prevent even

In [52]:
tfidf = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer(norm = None, sublinear_tf = False, smooth_idf = False))
])

In [53]:
centroid_vector_all = tfidf.fit_transform(sentences).toarray().sum(axis = 0)
centroid_vector_all = np.divide(centroid_vector_all, centroid_vector_all.max())

In [54]:
relevant_vector_indices = np.where(centroid_vector_all > 0.3)[0]

In [55]:
features = tfidf['count'].get_feature_names_out()
word_list = list(np.array(features)[relevant_vector_indices])
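
The centroid-word selection above can be reproduced by hand on a toy corpus. With `norm=None` and `smooth_idf=False`, scikit-learn's `TfidfTransformer` weights a term as tf * (ln(n / df) + 1), so summing over sentences gives the term's total count times its idf. The sketch below is an assumption-laden approximation: `centroid_words` is a hypothetical helper, and it tokenizes with `str.split()` rather than `CountVectorizer`'s default token pattern.

```python
import math
from collections import Counter
from typing import List

def centroid_words(sentences: List[str], threshold: float = 0.3) -> List[str]:
    docs = [Counter(s.split()) for s in sentences]
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # Document frequency: number of sentences that contain the word.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    # Summed tf-idf per word, using the unsmoothed idf ln(n / df) + 1.
    weight = {w: sum(d[w] for d in docs) * (math.log(n / df[w]) + 1.0)
              for w in vocab}
    # Normalize by the largest weight and keep the words above the threshold.
    top = max(weight.values())
    return [w for w in vocab if weight[w] / top > threshold]

toy = ["reboot show reboot", "show aired", "software new"]
selected = centroid_words(toy, threshold=0.6)
print(selected)
```

Here "reboot" gets the highest weight (two occurrences in a single sentence), "show" scores about 0.67 of that, and the remaining words score 0.5, so only the first two pass a threshold of 0.6.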

In [56]:
from gensim.models import Word2Vec

In [57]:
word_list

['animation',
 'blair',
 'british',
 'bsp',
 'changes',
 'character',
 'characters',
 'company',
 'created',
 'creative',
 'cube',
 'cut',
 'development',
 'dot',
 'enough',
 'episode',
 'episodes',
 'even',
 'first',
 'game',
 'however',
 'mainframe',
 'make',
 'mitchell',
 'one',
 'pearson',
 'reboot',
 'season',
 'sector',
 'show',
 'shows',
 'software',
 'time',
 'used',
 'user']

In [58]:
all_words = [word_tokenize(sent) for sent in sentences]

all_words_flattened = reduce(operator.concat, all_words)

# gensim >= 4.0 names this parameter `vector_size`; gensim 3.x called it `size`.
model = Word2Vec(all_words, window=2, vector_size=100, sg = 1, min_count=1)

model_lookup = dict()
for word in all_words_flattened:
    model_lookup[word] = model.wv[word]

In [59]:
def make_vector_representation(words, model_lookup, model):
    representation = np.zeros(model.vector_size, dtype='float32')

    for word in words:
        representation += model_lookup[word]
    
    representation = np.divide(representation, len(words))

    return representation
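
`make_vector_representation` assumes that `words` is non-empty and that every word has an entry in `model_lookup`; otherwise it could divide by zero or raise a `KeyError`. A hedged variant in plain Python, with the hypothetical name `average_known_vectors`, skips unknown words and falls back to a zero vector when nothing matches:

```python
from typing import Dict, List, Sequence

def average_known_vectors(words: Sequence[str],
                          lookup: Dict[str, List[float]],
                          dim: int) -> List[float]:
    """Average the vectors of the words found in `lookup`."""
    known = [lookup[w] for w in words if w in lookup]
    if not known:
        # No usable words: return a zero vector of the requested size.
        return [0.0] * dim
    # zip(*known) iterates over the vector components column by column.
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical 2-d vectors for illustration only.
toy_lookup = {"game": [1.0, 3.0], "cube": [3.0, 5.0]}
avg = average_known_vectors(["game", "cube", "unseen"], toy_lookup, dim=2)
empty = average_known_vectors([], toy_lookup, dim=2)
```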

In [60]:
centroid_vector = make_vector_representation(word_list, model_lookup, model)

In [61]:
representation = make_vector_representation(all_words_flattened, model_lookup, model)

In [62]:
representation

array([ 1.54214897e-04, -5.09227211e-05,  1.61043630e-04,  2.79348460e-04,
        1.75944559e-04, -1.27278166e-04, -8.18354602e-05, -1.47147352e-04,
        3.15203419e-04, -4.56113921e-05,  1.54879570e-04, -2.16522036e-04,
       -1.50882624e-04,  2.80765555e-04,  1.12474503e-04,  6.38427591e-05,
        1.38025483e-04, -1.50617794e-04, -3.53881449e-04, -9.96003364e-05,
        1.06661784e-04,  1.87515907e-04,  2.32009508e-04,  2.38673310e-04,
       -1.49481406e-04, -1.23086560e-04, -8.60589862e-05,  6.74915373e-06,
        4.36868228e-04,  7.42512784e-05, -2.88902374e-04, -2.28701771e-04,
        3.47384950e-04,  1.62585682e-04,  6.26627007e-05, -2.72104022e-04,
        2.72039179e-05, -1.96584966e-04, -4.81053867e-04, -3.18602659e-04,
       -2.22123039e-04,  6.35146353e-05, -2.93927267e-04, -1.27480176e-04,
        3.92917776e-04, -1.42523641e-04, -3.24721332e-04, -3.14922436e-05,
       -2.51001504e-04,  1.95287852e-04,  1.18146250e-04, -5.57147468e-05,
       -1.41681172e-04, -

In [72]:
sent_scores = dict()
for n, sentence in enumerate(sentences):
    words = sentence.split()

    sentence_vector = make_vector_representation(words, model_lookup, model)

    score = cos_sim(sentence_vector, centroid_vector)
    sent_scores[n] = [score, sentences[n], sentence_vector]

sent_scores_sort = sorted(sent_scores.items(), key = lambda item: item[1][0], reverse=True)

In [73]:
sent_scores_sort

[(0,
  [0.42333227205055174,
   'development reboot initially conceived british creative collective hub made john grace ian pearson gavin blair phil mitchell',
   array([-8.97875056e-04, -1.59399922e-03, -8.30540666e-04, -2.15693115e-04,
           6.99957833e-04,  5.57818275e-04, -9.68134787e-04,  5.63212125e-05,
          -6.31759351e-04,  7.06171268e-05,  5.09757141e-04, -1.06370739e-04,
          -1.50449236e-03,  1.12850976e-03,  5.44895243e-04,  1.75481860e-03,
           7.07819010e-04, -1.10720226e-04, -4.12370602e-04,  5.36377716e-04,
           1.61970477e-03, -4.33805806e-04,  1.24530619e-04,  7.90588849e-04,
           8.08596436e-04, -7.01658952e-04, -1.94271107e-03,  6.15618133e-04,
           9.98917851e-04, -4.30525339e-04, -1.42853369e-03,  5.82319219e-04,
           3.58103367e-04,  3.25415749e-04,  3.45172884e-04, -1.04325428e-03,
           2.56861298e-04,  1.26588822e-03, -6.15392812e-04,  7.22540892e-04,
          -2.42707043e-04,  3.75346979e-04, -8.79562722e-05,

In [101]:
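count = 0
sentences_summary = []
# Walk the sentences from the highest to the lowest centroid score.
for s in sent_scores_sort:
    # Stop once the summary exceeds the word budget (hyperparameter).
    if count > 100:
        break
    # Handle redundancy: skip a sentence that is too similar to one already chosen.
    include_flag = True
    for ps in sentences_summary:
        sim = cos_sim(s[1][2], ps[1][2])
        if sim > 0.95:
            include_flag = False
            break
    if include_flag:
        sentences_summary.append(s)
        count += len(s[1][1].split())

# Restore the original document order of the selected sentences.
sentences_summary = sorted(sentences_summary, key=lambda el: el[0], reverse=False)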

In [113]:
# Each entry is (sentence_index, [score, sentence_text, sentence_vector]).
for _, (_, sent, _) in sentences_summary:
    print(sent)

development reboot initially conceived british creative collective hub made john grace ian pearson gavin blair phil mitchell
years development pearson blair mitchell moved vancouver british columbia produce series
pearson blair time created first widely seen cgi characters dire straits music video money nothing
reboot created silicon graphics workstations using softimage creative environment software
episode talent night one scene dot giving brother enzo sisterly kiss chin cut due bsps fear promoting incest insinuation pearson described one sickest things ive heard
episodes main article list reboot episodes season installment first season selfcontained episode except twopart finale
bob frequently enters games reboots become game character fights users character save sector
user wins game sector cube fell destroyed sprites binomes caught within turned energydraining wormlike parasites called nulls
