<h1>Text Summarization, Document Similarity, Topic Analysis</h1>

In [None]:
## FYI we can hide iPython warnings
import warnings
warnings.filterwarnings('ignore')

<h2>Prepare restaurant corpus</h2>

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
restaurants = ['community', 'le_monde', 'shakeshack', 'fiveguys']
restaurants_data = {}
for restaurant in restaurants:
    restaurants_data[restaurant] = PlaintextCorpusReader('Class 7 - Data/%s' % restaurant, '%s.*' % restaurant)

<h2>Import nltk corpora</h2>

In [None]:
from nltk.book import *

<h2>Load the inaugural address corpus</h2>

In [None]:
all_addresses = list()
for file in inaugural.fileids():
    all_addresses.append((file,inaugural.raw(file)))

<h2>Text summarization</h2>
<li>Text summarization is useful because you can generate a short summary of a large piece of text automatically
<li>Then, these summaries can serve as an input into a topic analyzer to figure out what the main topic of the text is
<li>Text summarization typically selects "important" sentences and reports these sentences as a summary

<h3>A naive form of summarization is to identify the most frequent words in a piece of text and use the occurrence of these words in sentences to rate the importance of a sentence.</h3>

<h4>First the imports</h4>

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from collections import OrderedDict
import pprint

<h4>Then prep the text. Get rid of end of line chars</h4>

In [None]:
text = restaurants_data['community'].raw()
striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')

<h4>Construct a list of words after getting rid of unimportant ones and numbers</h4>

In [None]:
words = word_tokenize(striptext)
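# Build the stopword set once (the reviews are in English) and compare lowercased words
english_stopwords = set(stopwords.words('english'))
lowercase_words = [word.lower() for word in words
                  if word.lower() not in english_stopwords and word.isalpha()]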

<h4>Construct word frequencies and choose the most common n (20)</h4>

In [None]:
word_frequencies = FreqDist(lowercase_words)
most_frequent_words = word_frequencies.most_common(20)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(most_frequent_words)

<h4>Initializations</h4>
<li>candidate_sentences is a dictionary with the original sentence as the key, and its lowercase version as the value
<li>summary_sentences is a list containing the sentences that will be included in the summary
<li>candidate_sentence_counts is a dictionary with the original sentence as the key, and the sum of the frequencies of each word in the sentence as the value


In [None]:
summary_sentences = []
candidate_sentences = {}
candidate_sentence_counts = {}

In [None]:
sentences = sent_tokenize(striptext)
for sentence in sentences:
    candidate_sentences[sentence] = sentence.lower()
candidate_sentences

In [None]:
for upper, lower in candidate_sentences.items():
    count = 0
    for freq_word, frequency_score in most_frequent_words:
        if freq_word in lower:
            count += frequency_score
            candidate_sentence_counts[upper] = count

<h4>Sort the sentences by their score in candidate_sentence_counts</h4>
<li>And pick the top ranked sentences</li>

In [None]:
candidate_sentence_counts

In [None]:
sorted_sentences = OrderedDict(sorted(
                    candidate_sentence_counts.items(),
                    key = lambda x: x[1],
                    reverse = True)[:4])
pp.pprint(sorted_sentences)

<h4>Packaging all this into a function</h4>


In [None]:
def build_naive_summary(text):
    from nltk.tokenize import word_tokenize
    from nltk.tokenize import sent_tokenize
    from nltk.probability import FreqDist
    from nltk.corpus import stopwords
    from collections import OrderedDict
    summary_sentences = []
    candidate_sentences = {}
    candidate_sentence_counts = {}
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    words = word_tokenize(striptext)
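    # Build the stopword set once and compare lowercased words against it
    english_stopwords = set(stopwords.words('english'))
    lowercase_words = [word.lower() for word in words
                      if word.lower() not in english_stopwords and word.isalpha()]
    word_frequencies = FreqDist(lowercase_words)
    most_frequent_words = word_frequencies.most_common(20)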
    sentences = sent_tokenize(striptext)
    for sentence in sentences:
        candidate_sentences[sentence] = sentence.lower()
    for upper, lower in candidate_sentences.items():
        count = 0
        for freq_word, frequency_score in most_frequent_words:
            if freq_word in lower:
                count += frequency_score
                candidate_sentence_counts[upper] = count   
    sorted_sentences = OrderedDict(sorted(
                        candidate_sentence_counts.items(),
                        key = lambda x: x[1],
                        reverse = True)[:4])
    return sorted_sentences   

In [None]:
summary = '\n'.join(build_naive_summary(restaurants_data['community'].raw()))
print(summary)

In [None]:
summary = '\n'.join(build_naive_summary(restaurants_data['le_monde'].raw()))
print(summary)

<h4>We can summarize George Washington's first inaugural speech</h4>

In [None]:
build_naive_summary(inaugural.raw('1789-Washington.txt'))

<h2>gensim: another text summarizer</h2>
<li>Gensim uses a network with sentences as nodes and 'lexical similarity' as weights on the arcs between nodes
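The graph idea can be sketched in plain Python. This is a minimal illustration of ranking sentences over a similarity network, not gensim's actual implementation: the word-overlap similarity, the damping value, and the names sentence_similarity and rank_sentences are all assumptions made for this example.

```python
import math
import re

def sentence_similarity(a, b):
    """Word-overlap similarity between two token lists, scaled by their sizes."""
    words_a, words_b = set(a), set(b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    tokens = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(sentences)
    # weights[i][j] is the arc weight between sentence i and sentence j
    weights = [[sentence_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # A sentence is important if similar sentences are important
            incoming = sum(weights[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return sorted(zip(scores, sentences), reverse=True)

toy_sentences = [
    "The burger was juicy and the fries were crisp.",
    "The fries were crisp and the shake was thick.",
    "Parking nearby is hard to find.",
    "The burger and the shake were both great.",
]
ranked = rank_sentences(toy_sentences)
for score, sentence in ranked:
    print(round(score, 3), sentence)
```

The sentence about parking shares no words with the others, so it has no arcs and ends up at the bottom of the ranking. A summary would keep only the top few sentences.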


In [None]:
# gensim.summarization was removed in gensim 4.0, so pin an older release
!pip install "gensim<4"

In [None]:
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize,word_tokenize 
from nltk.book import *
import gensim.summarization

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
restaurants = ['community', 'le_monde', 'shakeshack', 'fiveguys']
restaurants_data = {}
for restaurant in restaurants:
    restaurants_data[restaurant] = PlaintextCorpusReader('Class 7 - Data/%s' % restaurant, '%s.*' % restaurant)

In [None]:
type(restaurants_data['community'])

<h4>Initialize variables and clean data</h4>

In [None]:
text = restaurants_data['community'].raw()
summary_sentences = []
candidate_sentences = {}
candidate_sentence_counts = {}
striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')

In [None]:
summary = gensim.summarization.summarize(striptext, word_count=100) 
print(summary)

In [None]:
summary = '\n'.join(build_naive_summary(restaurants_data['community'].raw()))
print(summary)

In [None]:
print(gensim.summarization.keywords(striptext,words=10))

<h3>Comparing the two methods on Trump's inaugural speech</h3>

In [None]:
text = inaugural.raw('2017-Trump.txt')
summary_sentences = []
candidate_sentences = {}
candidate_sentence_counts = {}
striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')
summary = gensim.summarization.summarize(striptext, word_count=100) 
print(summary)
#print(gensim.summarization.keywords(striptext,words=10))

In [None]:
summary = '\n'.join(build_naive_summary(inaugural.raw('2017-Trump.txt')))
print(summary)

<h1>Topic modeling</h1>
<li>The goal of topic modeling is to identify the major concepts underlying a piece of text
<li>Topic modeling uses "Unsupervised Learning". No a-priori knowledge is necessary
<li>Though, without a-priori knowledge, your results are unlikely to be good!

<h2>LDA: Latent Dirichlet Allocation</h2>
<li>A technique for topic modeling
<li>Computes conditional probabilities for topic word sets
<li>Identifies the most likely topics
<li>Does this over multiple passes probabilistically picking topics in each pass
<li>Good intuitive explanation: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

<li>Basic assumptions:
<ol>
<li>Every document is associated with a set of topics
<li>The topics in a document follow a probability distribution
<li>Each topic is represented in the document by a set of words
<li>The words associated with a topic follow a probability distribution
</ol>
<li>Given these assumptions, LDA scans the document and tries to deduce the topic and word distributions
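One way to read these assumptions is as a recipe for generating a document, which LDA then tries to run in reverse. The sketch below generates a toy document that way. The two topics, their word probabilities, the alpha value, and the helper names are hypothetical and chosen only for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, doc_length, alpha, rng):
    """Generate a toy document following the LDA assumptions listed above."""
    num_topics = len(topic_word_probs)
    # Assumptions 1 and 2: the document has its own distribution over topics
    topic_mix = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_length):
        # Assumptions 3 and 4: pick a topic for this position, then a word
        # from that topic's distribution over the vocabulary
        topic = rng.choices(range(num_topics), weights=topic_mix)[0]
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mix, words

rng = random.Random(0)
vocabulary = ["senate", "vote", "court", "judge"]
topic_word_probs = [
    [0.45, 0.45, 0.05, 0.05],   # hypothetical "congress" topic
    [0.05, 0.05, 0.45, 0.45],   # hypothetical "judiciary" topic
]
topic_mix, words = generate_document(topic_word_probs, vocabulary, 12, 0.5, rng)
print([round(p, 2) for p in topic_mix])
print(" ".join(words))
```

LDA sees only the words and has to estimate both topic_mix for each document and topic_word_probs for each topic.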

<h4>tf-idf</h4>
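<li>tf-idf: term frequency - inverse document frequency
<li>tf-idf increases the weight of words that occur frequently within a document (tf)
<li>But reduces the weight of words that occur across many documents in the document set (idf)
<li>tf-idf is a separate weighting scheme, not part of LDA itself; the example below feeds LDA raw word counts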
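A small worked example can make the weighting concrete. This sketch uses one common tf-idf variant (relative term frequency times the log of inverse document frequency); other variants exist, and the toy documents and the tf_idf function name are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: tf-idf weight} dict per tokenised document."""
    num_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            word: (count / len(doc)) * math.log(num_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weighted

toy_docs = [
    "the senate passed the bill".split(),
    "the court blocked the bill".split(),
    "the senate confirmed the judge".split(),
]
weights = tf_idf(toy_docs)
for doc_weights in weights:
    print({word: round(w, 3) for word, w in doc_weights.items()})
```

Here "the" appears in every document, so its idf is log(3/3) = 0 and its weight vanishes even though it is the most frequent word. gensim offers this kind of weighting through models.TfidfModel.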

<h3>Example</h3>
<li>We'll look at the political news stories on slate.com
<li>See what topics they cover


<li>Generate a list of story links
<li>Get the stories and store in a document set

In [None]:
import requests
from bs4 import BeautifulSoup
url="https://www.slate.com"
page = requests.get(url)
bs_page = BeautifulSoup(page.content,'lxml')
all_links = bs_page.find_all('a')
categories = ['news_and_politics','news-and-politics']
followable_links = list()
for link in all_links:
    href = link.get('href')
    if href:
        for cat in categories:
            if cat in href:
                followable_links.append(href)
print(len(followable_links))

In [None]:
followable_links

In [None]:
story_list = list()
count=0
for link in followable_links:
    try:
        page=BeautifulSoup(requests.get(link).content,'lxml')
        text=page.find('body').find('section',class_='article__body').get_text().strip()
        story_list.append(text)
        count+=1
    except Exception:
        # Skip links that cannot be fetched or that have no article body
        continue
print(count)

<h2>imports for LDA</h2>

In [None]:
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
import pprint

<h4>prepare the text</h4>
<li>Clean it (remove numbers, end of line characters, common words)
<li>Sentence tokenize it
<li>Convert each sentence into a list of words


In [None]:
for i in range(len(story_list)):
    story = story_list[i]
    sents = sent_tokenize(story)
    for j in range(len(sents)):
        sent = sents[j]
        sent = sent.strip().replace('\n','').replace('.', '')
        sents[j] = sent
    story_list[i] = '. '.join(sents)
story_list[0]

<li>Each document is converted into a list of words

In [None]:
texts = [[word for word in story.lower().split()
        if word not in STOPWORDS and word.isalnum() and not word.lower() == 'slate']
        for story in story_list]

In [None]:
texts

<h4>Create a (word,frequency) dictionary for each word in the text</h4>
<li>dictionary: corpora.Dictionary assigns a unique integer id to each word (key = id, value = word)
<li>corpus: A list of (word id, frequency) pairs for each document. doc2bow generates this
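The two structures can be mimicked in plain Python to see what they hold. This is only a sketch of the idea: the helper names are hypothetical, and the order in which ids are assigned here is an assumption that need not match gensim's.

```python
from collections import Counter

def build_token2id(texts):
    """Assign a unique integer id to every distinct word, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bag_of_words(text, token2id):
    """Convert a list of words into sorted (word_id, frequency) pairs."""
    # Words that are not in the dictionary are silently dropped
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

toy_texts = [["senate", "vote", "senate"], ["court", "vote"]]
token2id = build_token2id(toy_texts)
toy_corpus = [to_bag_of_words(text, token2id) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Dropping unknown words matters later: a new document can only be described with words the dictionary already knows.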

In [None]:
dictionary = corpora.Dictionary(texts) # maps each word id to its word
corpus = [dictionary.doc2bow(text) for text in texts] # (word_id, freq) pairs for each document
dictionary[4]
#dictionary.keys()
#dictionary.token2id
#corpus[3]

<h2>Do the LDA</h2>

<h4>Parameters:</h4>
<li>Number of topics: The number of topics you want generated. 
<li>Passes: The number of times the LDA model goes through the corpus. More passes mean a slower analysis
<ol>
<li>LDA first randomly assigns words and word weights to each topic
<li>In each pass, it refines the weights
<li>In short, you want to stop adding passes once the gain (improved weights) becomes minimal
</ol>

In [None]:
#Set parameters
num_topics = 5 #The number of topics that should be generated
passes = 10

In [None]:
lda = LdaModel(corpus,
              id2word=dictionary,
              num_topics=num_topics,
              passes=passes)

<h4>See results</h4>
<li>We get a set of candidate topics in the form of words
<li>It is up to us to make sense of the words

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lda.print_topics(num_words=8))

In [None]:
len(corpus)

In [None]:
from operator import itemgetter
lda.get_document_topics(corpus[0],minimum_probability=0.05)
sorted(lda.get_document_topics(corpus[0],minimum_probability=0),key=itemgetter(1),reverse=True)


<h2>Using the results</h2>
<li>When a new document comes in
<li>See which topic(s) it matches


In [None]:
newdoc = """
President Trump broke with his own intelligence agencies on Friday, appearing to accept Saudi Arabia’s explanation that the journalist Jamal Khashoggi was killed by accident during a fistfight, while the United States’ spy agencies are increasingly convinced that he was assassinated on high-level orders from the Saudi royal court.

Mr. Trump, who has cultivated Crown Prince Mohammed bin Salman and made Saudi Arabia the linchpin of his Middle East strategy, has been deeply reluctant to point a finger at the prince, despite evidence linking him to Saudi operatives who entered the country’s consulate in Istanbul the same day that Mr. Khashoggi disappeared there.

Asked during a visit to an Air Force base in Arizona whether he viewed the Saudi explanation as credible, Mr. Trump said, “I do.”

[Jamal Khashoggi is dead. Here is everything we know so far.]

The president said he still had questions for Prince Mohammed, and he called the killing of Mr. Khashoggi “unacceptable.” Mr. Trump also raised the possibility of sanctions against Saudi Arabia, but said that he hoped that Congress would not try to block billions of dollars in weapons sales to the kingdom, which he has held up as proof of the fruits of the alliance.

Mr. Trump’s response sets up a clash with Congress, where Republicans and Democrats both tarred the Saudi explanation as lacking credibility. A senior lawmaker briefed on American intelligence assessments of the circumstances surrounding Mr. Khashoggi’s death, and the likely culprits, said it was not consistent with the Saudi account.

The lawmaker, Representative Adam B. Schiff of California, the senior Democrat on the House Intelligence Committee, said, “The kingdom and all involved in this brutal murder must be held accountable, and if the Trump administration will not take the lead, Congress must.”

Senator Lindsey Graham, Republican of South Carolina and a close ally of Mr. Trump’s, declared in a Twitter post, “To say that I am skeptical of the new Saudi narrative about Mr. Khashoggi is an understatement.” He added, “It’s hard to find this latest ‘explanation’ as credible.”

The growing evidence that Mr. Khashoggi, a Virginia resident and a columnist for The Washington Post, was killed on orders from the Saudi royal family has put Mr. Trump in an increasingly untenable position.

On Friday evening, the president praised the statement issued by the Saudi government, which confirmed Mr. Khashoggi’s death, as a “good first step” and a “big step.” Earlier, the prince and other senior Saudi officials had denied any role in Mr. Khashoggi’s disappearance.

Editors’ Picks

11 Takeaways From The Times’s Investigation Into Trump’s Wealth

50 Years Later, It Feels Familiar: How America Fractured in 1968

How to Buy a Gun in 15 Countries
Secretary of State Mike Pompeo spoke with Prince Mohammed by phone on Friday evening and then briefed Mr. Trump and his national security adviser, John R. Bolton, according to a White House spokesman.

“I think we’re getting close to solving a big problem,” Mr. Trump told reporters at the Luke Air Force Base, where he was shown an Apache helicopter, an F-35 fighter jet and an array of bombs.

Image
Representative Adam B. Schiff of California, the top Democrat on the House Intelligence Committee, in May on Capitol Hill. He was among the lawmakers who tarred the explanation by Saudi Arabia as lacking credibility.CreditTom Brenner/The New York Times
For the president, Saudi Arabia has become a key ally but also a troublesome partner. Saudi support is critical to his efforts to isolate Iran. But he has watched as Prince Mohammed pursued a deadly war in Yemen, carried on a feud with his neighbor Qatar, jailed female dissidents and detained hundreds of wealthy Saudis.

Mr. Trump’s son-in-law and senior adviser, Jared Kushner, cultivated a relationship with the prince, who is close to him in age and who Mr. Kushner hoped would be an advocate for his peace proposal between Israel and the Palestinians.

In internal discussions, Mr. Kushner has urged the president and his aides not to abandon Prince Mohammed. But as Turkish officials leaked details of the grisly killing of Mr. Khashoggi and of the dismemberment of his body, the White House has become increasingly isolated in its defense of Saudi Arabia.

A stream of prominent Wall Street and tech executives canceled plans to attend an investor conference convened by the prince next week in Riyadh, the Saudi capital. On Thursday, Steven Mnuchin, the Treasury secretary, pulled out of the conference, as well, though he will attend a separate meeting on counterterrorism strategy.

In an interview on Thursday with The New York Times, Mr. Trump acknowledged that the furor over Mr. Khashoggi’s death had mushroomed into one of the biggest foreign policy crises of his presidency.

“This one has caught the imagination of the world, unfortunately,” Mr. Trump said. “It’s not a positive. Not a positive.”

The president also said on Thursday that it was still “a little bit early” in the process to draw definitive conclusions about who ordered the killing. But he expressed no doubt that the truth would come out soon.

“We’re working with the intelligence from numerous countries,” he said, adding, “This is the best intelligence we could have.”

On Wednesday, The Times reported that American intelligence officials were increasingly convinced that Prince Mohammed is culpable in Mr. Khashoggi’s death, and that they were preparing an appraisal for the White House.

Saudi Arabia tried to project the idea of a housecleaning, announcing that Saud al-Qahtani, a close aide to the crown prince; Maj. Gen. Ahmed al-Assiri, the deputy director of Saudi intelligence; and other high-ranking intelligence officials had been dismissed.

For Mr. Trump, who is on a three-day swing in the West before the midterm elections, the Khashoggi affair has become a distraction during a period in which he had hoped to campaign for Republican congressional candidates on a message of economic growth and the recent confirmation of Justice Brett M. Kavanaugh to the Supreme Court.

Just after answering questions about the Saudi announcement, Mr. Trump flew to a “Make America Great Again” rally in Mesa, Ariz.


"""

In [None]:
newdoc

<li>Clean and set up the text
<li>Create corpus

In [None]:
text = newdoc
striptext = text.replace('\n\n', ' ')
striptext = striptext.replace('\n', ' ')
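# Lowercase the tokens so they match the lowercased vocabulary the model was trained on
new_text = [nltk.word_tokenize(striptext.lower())]

# Reuse the dictionary built from the training corpus so the word ids match the LDA model
corpus_new = [dictionary.doc2bow(text) for text in new_text] # (word_id, freq) pairs for the new document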

<h2>Matching topics to documents</h2>
<li>We now have a corpus with one document
<li>Get the topics using the results of the lda we ran before 
<li>And see which topic(s) are the best matches

In [None]:
from operator import itemgetter
lda.get_document_topics(corpus_new[0],minimum_probability=0.05)
sorted(lda.get_document_topics(corpus_new[0],minimum_probability=0),key=itemgetter(1),reverse=True)

In [None]:
lda.print_topic(topicno=0)

<h4>Draw wordclouds</h4>
<li>to better understand the topic we can draw wordclouds weighted by the weight of the terms in the topic

In [None]:
def draw_wordcloud(lda,topicnum,min_size=0,STOPWORDS=[]):
    word_list=[]
    prob_total = 0
    for word,prob in lda.show_topic(topicnum,topn=50):
        prob_total +=prob
    for word,prob in lda.show_topic(topicnum,topn=50):
        if word in STOPWORDS or  len(word) < min_size:
            continue
        freq = int(prob/prob_total*1000)
        alist=[word]
        word_list.extend(alist*freq)

    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt
    %matplotlib inline
    text = ' '.join(word_list)
    wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white',max_words=20, collocations=False).generate(text)

    plt.axis('off')
    plt.imshow(wordcloud)

    return None

In [None]:
draw_wordcloud(lda,4)

<h4>Roughly,</h4>
<li>lda looks for candidate topics assuming that there are many such candidates
<li>looks for words related to the candidate topics
<li>assigns probabilities to those words

<h2>Understanding topics</h2>
<li>pyLDAvis (package for visualizing the results of an LDA)
<li>Shows topic distance between topics and top words in the corpus
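One common way to measure the distance between two topics, and to my knowledge the default in pyLDAvis, is the Jensen-Shannon divergence between their word distributions. The sketch below computes it for three hypothetical topics over a four-word vocabulary; the probabilities are invented for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence of q from p, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded distance-like measure."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)

# Hypothetical word distributions of three topics over the same vocabulary
topic_a = [0.50, 0.40, 0.05, 0.05]
topic_b = [0.05, 0.05, 0.50, 0.40]
topic_c = [0.45, 0.45, 0.05, 0.05]

print(round(js_divergence(topic_a, topic_b), 3))
print(round(js_divergence(topic_a, topic_c), 3))
```

Topics a and c put their weight on the same words, so they sit close together, while a and b would be drawn far apart.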

In [None]:
!pip install pyLDAvis

In [None]:
pp.pprint(lda.print_topics(num_words=8))

In [None]:
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

<h3>Let's look at Presidential addresses to see what sorts of topics emerge from there</h3>
<li>Each document will be analyzed for topic</li>
<li>The corpus will consist of 58 documents, one per presidential address

In [None]:
inaugural.fileids()

In [None]:
texts = [[word for word in inaugural.raw(file).lower().split()
        if word not in STOPWORDS and word.isalnum() and not word.lower() == 'slate']
        for file in inaugural.fileids()]
dictionary = corpora.Dictionary(texts) # maps each word id to its word
corpus = [dictionary.doc2bow(text) for text in texts] # (word_id, freq) pairs for each address


<h2>Create the model</h2>

In [None]:
lda = LdaModel(corpus,
              id2word=dictionary,
              num_topics=10,
              passes=10)

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lda.print_topics(num_words=10))

In [None]:
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

<h2>We can now compare presidential addresses by topic</h2>

In [None]:
len(corpus)

In [None]:
from operator import itemgetter
sorted(lda.get_document_topics(corpus[0],minimum_probability=0,per_word_topics=False),key=itemgetter(1),reverse=True)

In [None]:
draw_wordcloud(lda,5)

In [None]:
print(lda.show_topic(5,topn=5))
print(lda.show_topic(4,topn=5))

<h1>Similarity</h1>
<h2>Given a corpus of documents, when a new document arrives, find the document that is the most similar</h2>
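The basic idea can be sketched with cosine similarity on raw word counts. This is a simplification, assuming plain bag-of-words vectors with no dimensionality reduction; the toy documents and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the word-count vectors of two documents."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(new_doc, corpus_docs):
    """Return the index of the corpus document closest to new_doc."""
    scores = [(cosine_similarity(new_doc, doc), index)
              for index, doc in enumerate(corpus_docs)]
    return max(scores)[1]

corpus_docs = [
    "great french toast and coffee".split(),
    "burger fries and a thick shake".split(),
]
new_doc = "coffee and french pastry".split()
best = most_similar(new_doc, corpus_docs)
print(best, [round(cosine_similarity(new_doc, d), 3) for d in corpus_docs])
```

The cells that follow use the same cosine measure, but first project the documents into a small topic space with Latent Semantic Indexing, so that documents can match even when they share few exact words.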

In [None]:
doc_list = [restaurants_data['community'],restaurants_data['le_monde'],restaurants_data['fiveguys'],restaurants_data['shakeshack']]
all_text = restaurants_data['community'].raw() + restaurants_data['le_monde'].raw() + restaurants_data['fiveguys'].raw() + restaurants_data['shakeshack'].raw()

documents = [doc.raw() for doc in doc_list]
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


In [None]:
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2) # Latent Semantic Indexing
doc = """
Many, many years ago, I used to frequent this place for their amazing french toast. 
It's been a while since then and I've been hesitant to review a place I haven't been to in 7-8 years... 
but I passed by French Roast and, feeling nostalgic, decided to go back.

It was a great decision.

Their Bloody Mary is fantastic and includes bacon (which was perfectly cooked!!), olives, 
cucumber, and celery. The Irish coffee is also excellent, even without the cream which is what I ordered.

Great food, great drinks, a great ambiance that is casual yet familiar like a tiny little French cafe. 
I highly recommend coming here, and will be back whenever I'm in the area next.

Juan, the bartender, is great!! One of the best in any brunch spot in the city, by far.
"""
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])


In [None]:
sims

In [None]:
doc="""
Came to have lunch & also watch the World Cup match. I've been here many times before and not much has changed. 

You can get half off apps and bogo drinks when you sign up for their brew club. I tried their IPA 
(was not a fan).  We also ordered the backyarder and the hot mess burgers with a 
side of disco fries to share. Both were delicious and cooked perfectly. The fries were also really good - 
the gravy and the cheese mix worked perfectly. 

Service was not the best but it could have been because of how packed the bar was for the game. Still a 
solid option in the neighborhood. Should mention that the fried Oreos are out of this world!
"""
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
sims