#Text rank Using Spacy

In [92]:
!pip install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.2.2)


In [93]:
!pip install -U spacy-lookups-data

Requirement already up-to-date: spacy-lookups-data in /usr/local/lib/python3.6/dist-packages (0.2.0)


In [94]:
!python -m spacy download en_core_web_sm

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


In [0]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import spacy

In [0]:
page = urlopen("https://www.whitehouse.gov/briefings-statements/remarks-president-trump-state-union-address-2/")
soup = BeautifulSoup(page , 'lxml')

In [123]:

def get_only_text(url):
    """ 
    return the title and the text of the article
    at the specified url
    """
    page = urlopen(url)
    soup = BeautifulSoup(page, "lxml")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
  
    print ("=====================")
    print (text)
    print ("=====================")
 
    return soup.title.text, text    
 
     
url="https://www.whitehouse.gov/briefings-statements/remarks-president-trump-state-union-address-2/"
text = get_only_text(url)


				Remarks			 
Issued on:
February 6, 2019
 February 5, 2019
9:07 P.M. EST THE PRESIDENT:  Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States — (applause) — and my fellow Americans: We meet tonight at a moment of unlimited potential.  As we begin a new Congress, I stand here ready to work with you to achieve historic breakthroughs for all Americans. Millions of our fellow citizens are watching us now, gathered in this great chamber, hoping that we will govern not as two parties but as one nation.  (Applause.) The agenda I will lay out this evening is not a Republican agenda or a Democrat agenda.  It’s the agenda of the American people. Many of us have campaigned on the same core promises: to defend American jobs and demand fair trade for American workers; to rebuild and revitalize our nation’s infrastructure; to reduce the price of healthcare and prescription drugs; to create an immigration system that is safe, lawful, modern, and secure; and 

In [0]:
nlp = spacy.load('en_core_web_sm')

In [0]:
doc = nlp(str(text))

In [100]:
sentences = []
for i, token in enumerate(doc.sents):
        print('-->Sentence %d: %s' % (i, token.text))
        sentences = [sent.string.strip() for sent in doc.sents]

-->Sentence 0: ('Remarks by President Trump in State of the Union Address | The White House', '\n\t\t\t\tRemarks\t\t\t \nIssued on:\nFebruary 6, 2019\n February 5,
-->Sentence 1: 2019\n9:07 P.M. EST THE PRESIDENT:\xa0
-->Sentence 2: Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States — (applause) — and my fellow Americans: We meet tonight at a moment of unlimited potential.\xa0 As we begin a new Congress, I stand here ready to work with you to achieve historic breakthroughs for all Americans.
-->Sentence 3: Millions of our fellow citizens are watching us now, gathered in this great chamber, hoping that we will govern not as two parties but as one nation.\xa0 (Applause.)
-->Sentence 4: The agenda I will lay out this evening is not a Republican agenda or a Democrat agenda.\xa0
-->Sentence 5: It’s the agenda of the American people.
-->Sentence 6: Many of us have campaigned on the same core promises: to defend American jobs and demand fair trade for 

In [0]:
def unicodeCleaner(text):
    d = text.replace(u'\\xa0', u' ')
    d = d.replace(u'\n', u'')
    d = d.replace(u'\t', u'')
    d = d.replace(u'\r', u'')
    return d

In [0]:
sentences = [unicodeCleaner(sen).strip() for sen in sentences ]

In [103]:
sentences[20:30]

['An amazing quality of life for all of our citizens is within reach.',
 'We can make our communities safer, our families stronger, our culture richer, our faith deeper, and our middle class bigger and more prosperous than ever before.',
 '(Applause.)',
 'But we must reject the politics of revenge, resistance, and retribution, and embrace the boundless potential of cooperation, compromise, and the common good.  (Applause.)',
 'Together, we can break decades of political stalemate.',
 'We can bridge old divisions, heal old wounds, build new coalitions, forge new solutions, and unlock the extraordinary promise of America’s future.',
 'The decision is ours to make.',
 'We must choose between greatness or gridlock, results or resistance, vision or vengeance, incredible progress or pointless destruction.',
 'Tonight, I ask you to choose greatness.  (Applause.)',
 'Over the last two years, my administration has moved with urgency and historic speed to confront problems neglected by leaders o

In [0]:
# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
  values = line.split()
  word = values[0]
  coefs = np.asarray(values[1:] , dtype = 'float32')
  word_embeddings[word] = coefs
f.close()

In [105]:
len(word_embeddings)

40324

#Text Cleaning

In [106]:
stop_words = []
stop_words = spacy.lang.en.stop_words.STOP_WORDS
len(stop_words)

326

In [0]:
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(filtered_sentence).str.replace("[^a-zA-Z0-9\s]", " ")
# change to lowercase
clean_sentences = [s.lower().strip() for s in clean_sentences]
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [108]:
clean_sentences

['remarks president trump state union address white house n t t t tremarks t t t nissued nfebruary 6 2019 n february 5',
 '2019 n9 07 p m est president',
 'madam speaker mr vice president members congress lady united states applause fellow americans meet tonight moment unlimited potential begin new congress stand ready work achieve historic breakthroughs americans',
 'millions fellow citizens watching gathered great chamber hoping govern parties nation applause',
 'agenda lay evening republican agenda democrat agenda',
 's agenda american people',
 'campaigned core promises defend american jobs demand fair trade american workers rebuild revitalize nation s infrastructure reduce price healthcare prescription drugs create immigration system safe lawful modern secure pursue foreign policy puts america s interests',
 'new opportunity american politics courage seize applause',
 'victory winning',
 'party victory winning country applause',
 'year america recognize important anniversaries maj

In [0]:
sentence_vectors = []
for i in sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)



In [150]:
len(sentence_vectors)
sentence_vectors[0]

array([-4.62830826e-02,  4.98595188e-03,  1.19469552e-01, -2.72067992e-02,
        3.54626921e-02,  6.58560533e-02, -6.34203130e-02,  2.16849196e-02,
       -1.45163565e-01,  5.89128622e-02,  5.84691197e-02, -6.42312297e-02,
        8.50548081e-02, -1.39393385e-02,  3.63639821e-02, -3.21300912e-02,
        4.49736210e-02, -2.24076464e-02, -6.03078429e-02,  1.78758166e-02,
        6.04165023e-02,  7.04713916e-03,  9.71213766e-02,  7.27818184e-02,
        8.47145361e-02, -6.06823476e-02,  7.81153261e-02, -7.02314177e-02,
        2.10332356e-02, -2.72015614e-02,  4.44506939e-02,  2.97472986e-02,
        1.08166352e-03,  4.29980205e-04, -7.08806425e-02,  3.04810713e-02,
        6.66825402e-02,  1.55627826e-01, -7.25515926e-02, -2.21739435e-02,
       -6.84676917e-02, -9.41963747e-02,  3.12194646e-02, -4.66801576e-02,
        1.13301296e-02, -1.68206282e-02,  5.31020401e-02, -9.27420107e-02,
       -5.50701872e-02, -9.78153417e-02, -2.12747964e-02,  5.35588783e-02,
        7.36412537e-02,  

In [0]:
# Create an empty similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [0]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i !=j :
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100),sentence_vectors[j].reshape(1,100))[0,0]

In [156]:
len(sim_mat)

361

#PageRank

In [162]:
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Extract top 15 sentences as the summary representation
for i in range(15):
    print(ranked_sentences[i][1])

In the 20th century, America saved freedom, transformed science, redefined the middle class, and, when you get down to it, there’s nothing anywhere in the world that can compete with America.  (Applause.)  Now we must step boldly and bravely into the next chapter of this great American adventure, and we must create a new standard of living for the 21st century.
We do not know whether we will achieve an agreement, but we do know that, after two decades of war, the hour has come to at least try for peace.  And the other side would like to do the same thing.
Another historic trade blunder was the catastrophe known as NAFTA.  I have met the men and women of Michigan, Ohio, Pennsylvania, Indiana, New Hampshire, and many other states whose dreams were shattered by the signing of NAFTA.  For years, politicians promised them they would renegotiate for a better deal, but no one ever tried, until now.
Millions of our fellow citizens are watching us now, gathered in this great chamber, hoping tha

#Gensim

In [0]:
from gensim.summarization import summarize
from gensim.summarization import keywords

In [131]:
#text = requests.get('https://www.whitehouse.gov/briefings-statements/remarks-president-trump-state-union-address-2/').text

print('Summary:')
print(summarize(str(text), ratio=0.02))

print('\nKeywords:')
print(keywords(str(text), ratio=0.01))

Summary:
In just over two years since the election, we have launched an unprecedented economic boom — a boom that has rarely been seen before.\xa0 There’s been nothing like it.\xa0 We have created 5.3 million new jobs and, importantly, added 600,000 new manufacturing jobs — something which almost everyone said was impossible to do.\xa0 But the fact is, we are just getting started.\xa0 (Applause.) Wages are rising at the fastest pace in decades and growing for blue-collar workers, who I promised to fight for.\xa0 They’re growing faster than anyone else thought possible.\xa0 Nearly 5 million Americans have been lifted off food stamps.\xa0 (Applause.)\xa0 The U.S. economy is growing almost twice as fast today as when I took office.\xa0 And we are considered, far and away, the hottest economy anywhere in the world.\xa0 Not even close.\xa0 (Applause.) Unemployment has reached the lowest rate in over half a century.\xa0 (Applause.)\xa0 African American, Hispanic American, and Asian American 