In [1]:
# standard library
import codecs
import os
import re
from string import punctuation

# third-party
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
import numpy as np
import pandas as pd
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.metrics.pairwise import cosine_similarity

# sentence tokenizer models used by sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/austinkrause/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
#set columns to display all
pd.options.display.max_columns = 1000

<h3>Load in csv

In [3]:
df = pd.read_csv('../Data/df_with_gensim_summaries.csv')

In [4]:
#drop unnecessary columns
df = df.drop(['Unnamed: 0', 'Unnamed: 0.1.1'], axis = 1)

<h3>Preview of dataframe

In [33]:
df.head()

Unnamed: 0,title,content,category,gensim_summary,first_100,sent_tokenized
0,Agent Cooper in Twin Peaks is the audience: on...,And never more so than in Showtime’s new...,Longform,"In the second season finale, back in 1991, the...",And never more so than in Showtime’s new serie...,[' And never more so than in Showtime’s n...
1,"AI, the humanity!",AlphaGo’s victory isn’t a defeat for hum...,Longform,When speaking to DeepMind and Google developer...,AlphaGo’s victory isn’t a defeat for humans — ...,[' AlphaGo’s victory isn’t a defeat for h...
2,Massive attack,How a weapon against war became a weapon...,Longform,International visitors for the event are commo...,How a weapon against war became a weapon again...,[' How a weapon against war became a weap...
3,Brain drain,Genius quietly laid off a bunch of its e...,Longform,"In a post on the Genius blog at the time, co-f...",Genius quietly laid off a bunch of its enginee...,[' Genius quietly laid off a bunch of its...
4,Facebook takes flight,Inside the test flight of Facebook’s fir...,Longform,But if your goal is to stay in the air for a l...,Inside the test flight of Facebook’s first int...,[' Inside the test flight of Facebook’s f...


<h3>Finding Cosine Similarity Between All Sentences

In [5]:
sample = """Drivers don’t always realize that they may be overpaying for car insurance. If you haven't compared quotes
recently, even if you have a low rate, you could still be paying too much. Fortunately, millions of smart drivers have
used EverQuote™'s free service to save hundreds on their insurance bills. It’s really no wonder that with so many 
drivers saving money, EverQuote™ is gaining momentum. EverQuote™ is an efficient source that tries to give consumers
the lowest rates with tools you can trust. Just imagine what you could do with the money you save!"""

In [6]:
sample_2 = """President Donald Trump and his Polish counterpart Andrzej Duda were to announce higher US troop levels in Poland 
on Wednesday, with the main question being whether Washington will defy Russian objections to establish an American 
base in the NATO country. A senior Trump administration official said the White House meeting would see the two 
leaders make a significant announcement." Whether Trump will risk irritating Moscow with a base or take the simpler 
option of adding more troops to the current non-permanent force was unclear. Located deep in what used to be 
Soviet-dominated eastern Europe, Poland is a member of NATO but has long wanted deeper US commitment. Spooked by 
resurgent Russia's seizing control of territory in Georgia and Ukraine over the last decade, Duda has tried to charm 
the US president, even touting the idea of Poland building a "Fort Trump" to house thousands of US soldiers.
Krzysztof Szczerski, an adviser to the Polish president, said the general concept of a "Fort Trump" was on the 
agenda Wednesday and that the US presence "will increase both in quality as well as quantity." """

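This function determines which sentences to extract from an article by computing the cosine similarity between every pair of tf-idf-transformed sentences. It extracts the sentences with the highest average cosine similarity, on the assumption that a sentence resembling many of the others carries the article's central content. The average is the mean of each row of the similarity matrix, so it includes the sentence's similarity to itself. The number of sentences extracted is the square root of the total sentence count, rounded up.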
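As a rough illustration of the scoring described above, the following pure-Python sketch computes pairwise cosine similarities and ranks sentences by their average similarity. The vectors are made-up numbers standing in for tf-idf rows, and the helper name `cosine` is hypothetical. It also checks that, for this toy data, counting each sentence's similarity to itself in the average gives the same ranking as averaging over the other sentences only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors for four sentences (made-up numbers,
# standing in for rows of a tf-idf matrix).
vectors = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]

n = len(vectors)
# Full pairwise similarity matrix, diagonal included.
sim = [[cosine(u, v) for v in vectors] for u in vectors]

# Row mean including the diagonal (a plain row mean).
mean_with_self = [sum(row) / n for row in sim]
# Row mean over the other sentences only.
mean_without_self = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(sim)]

# Sentence indices ordered from lowest to highest average similarity.
rank_with_self = sorted(range(n), key=lambda i: mean_with_self[i])
rank_without_self = sorted(range(n), key=lambda i: mean_without_self[i])
print(rank_with_self, rank_without_self)
```

Sentence 3 shares terms with the most other sentences, so it ends up last in both rankings, which is the position the highest-scoring sentence takes.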

In [24]:
def find_similarities(text):
    # split the text into sentences
    sentences = sent_tokenize(text, language='english')
    # stop words: NLTK's English list plus punctuation marks
    stops = list(set(stopwords.words('english'))) + list(punctuation)

    # fit a tf-idf vectorizer on the sentences, ignoring stop words;
    # each row of the resulting matrix represents one sentence
    vectorizer = TfidfVectorizer(stop_words=stops)
    trsfm = vectorizer.fit_transform(sentences)

    # number of sentences to extract: square root of the sentence count, rounded up
    num_sentences = trsfm.shape[0]
    num_summary_sentences = int(np.ceil(num_sentences**.5))

    # cosine similarity for all sentence pairs
    similarities = cosine_similarity(trsfm, trsfm)

    # average cosine similarity of each sentence (row mean, self-similarity included)
    avgs = similarities.mean(axis=1)

    # indices of the highest-scoring sentences, in ascending order of score
    top_idx = np.argsort(avgs)[-num_summary_sentences:]

    return top_idx

Run the function on the sample text to get the indices of the sentences to extract for the summary.

In [25]:
find_similarities(sample)

array([3, 5, 2])
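The returned indices are ordered by score, lowest first, not by position in the text, which is why 2 comes last here. The sketch below mimics the selection step in plain Python to show both that ordering and the summary-length rule. The scores are made-up values chosen only to reproduce the ordering above, and the helper names `summary_length` and `top_indices` are hypothetical.

```python
import math

def summary_length(num_sentences):
    """Number of sentences to extract: square root of the count, rounded up."""
    return int(math.ceil(math.sqrt(num_sentences)))

def top_indices(scores, k):
    """Indices of the k highest scores, in ascending order of score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-k:]

# Made-up average similarities for a six-sentence text (not real output).
scores = [0.20, 0.18, 0.35, 0.25, 0.15, 0.30]

k = summary_length(len(scores))
print(k, top_indices(scores, k))      # 3 [3, 5, 2]

# Summary lengths for texts of 6, 25 and 85 sentences.
print([summary_length(n) for n in (6, 25, 85)])   # [3, 5, 10]
```

Sorting the selected indices afterwards (`sorted([3, 5, 2])` gives `[2, 3, 5]`) restores reading order.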

This function calls find_similarities() and then arranges the selected sentences in their original order, so the summary reads in the same sequence as the article.

In [2]:
def build_summary(text):
    #find sentences to extract for summary
    sents_for_sum = find_similarities(text)
    #sort the sentences
    sort = sorted(sents_for_sum)
    #display which sentences have been selected
    print(sort)
    
    sent_list = sent_tokenize(text)
    #print number of sentences in full article
    print(len(sent_list))
    
    
    #extract the selected sentences from the original text,
    #replacing each line break (and any spaces around it) with a single space
    sents = []
    for i in sort:
        sents.append(re.sub(r'\s*\n\s*', ' ', sent_list[i]))
    
    #join sentences together for final output
    summary = ' '.join(sents)
    return summary

<h1>Examples

In [27]:
build_summary(sample)

[2, 3, 5]
6


"Fortunately, millions of smart drivers have used EverQuote™'s free service to save hundreds on their insurance bills. It’s really no wonder that with so many drivers saving money, EverQuote™ is gaining momentum. Just imagine what you could do with the money you save!"

In [19]:
build_summary(sample_2)

[1, 2, 3]
6


'A senior Trump administration official said the White House meeting would see the two leaders make a significant announcement." Whether Trump will risk irritating Moscow with a base or take the simpler option of adding more troops to the current non-permanent force was unclear. Located deep in what used to be Soviet-dominated eastern Europe, Poland is a member of NATO but has long wanted deeper US commitment.'

In [13]:
build_summary(df.content[0])

[0, 1, 2, 13, 15, 32, 46, 74, 76, 84]
85


'      And never more so than in Showtime’s new series revival Some spoilers ahead through episode 4 of season 3 of Twin Peaks. On May 21st, Showtime brought back David Lynch’s groundbreaking TV series Twin Peaks, and fulfilled a prophecy in the process. In the second season finale, back in 1991, the spirit of series-defining murder victim Laura Palmer told FBI special agent and series protagonist Dale Cooper, “I’ll see you again in 25 years.” That clip plays again in the first episode of Lynch’s Twin Peaks revival, as a reminder that decades have in fact gone by, Laura’s promise has been carried out, and a series canceled mid-story is back on the air.A lot has changed in 25 years. And his development happened in parallel with the maturing of a TV audience that had to learn how to follow a new kind of story.All protagonists mediate their stories, but Dale Cooper is something else entirelyToday, viewers have more sophisticated expectations than they did in the 1990s. But Showtime’s new 

In [14]:
build_summary(df.content[15])

[2, 4, 15, 19, 20]
25


'Current systems at land borders in Hong Kong perform facial recognition through rolled-down windows, simply to avoid the confounding effect of the glass.But over the past few months, a new system has emerged to solve that problem, developed at Oak Ridge National Laboratory at the request of US Customs and Border Patrol. If effective, it could pave the way for far more aggressive deployment of facial recognition at automotive crossings.“The camera they have developed can go into a vehicle through tint and glare.”The system arose out of an initiative called biometric exit, which mandates a face or fingerprint verification of every US visitor as they exit the country. Colvin Pitts, a senior architect at Lytro, says the camera’s depth-sensing capability could be particularly useful when cleaning up an image for facial recognition. That’s consistent with Manaher’s early assessment, which described the system as “very prototypey.” “There is very little practical way to opt out of this syste