# Mining the Social Web

## Mining Web Pages

This Jupyter Notebook provides an interactive way to follow along with and explore the examples from the video series. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way.

## Using `dragnet` to extract the text from a web page

Example blog post:
http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html

In *Mining the Social Web, 3rd Edition*, we used a library called `boilerpipe` to extract the main content of web pages. `boilerpipe` is a sophisticated piece of software that works very well but has some software dependencies that can be very difficult to install, especially if you do not have administrative privileges on the computer you are working with. I have replaced `boilerpipe` with `Goose`, which can be easily installed using `pip`:

`pip install goose3`

You can learn more about `goose3` on its [GitHub page](https://github.com/goose3/goose3). Another example of a content extraction library for Python is `dragnet`, which you can find [here](https://github.com/dragnet-org/dragnet).

In [2]:
from goose3 import Goose

g = Goose()
URL='https://www.oreilly.com/ideas/ethics-in-data-project-design-its-about-planning'
article = g.extract(url=URL)

print(article.title)
print('-'*len(article.title))
print(article.meta_description)

content = article.cleaned_text
print()
print('{}...'.format(content[:500]))

Ethics in data project design: It's about planning
--------------------------------------------------
The destination and rules of the road are clear; the route you choose to get there makes a huge difference.

When I explain the value of ethics to students and professionals alike, I refer it as an “orientation.” As any good designer, scientist, or researcher knows, how you orient yourself toward a problem can have a big impact on the sort of solution you develop—and how you get there. As Ralph Waldo Emerson once wrote, “perception is not whimsical, but fatal.” Your particular perspective, knowledge of, and approach to a problem shapes your solution, opening up certain paths forward and forestalling ot...


## Using feedparser to extract the text (and other fields) from an RSS or Atom feed

In [3]:
import feedparser # pip install feedparser

FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom'

fp = feedparser.parse(FEED_URL)

for e in fp.entries:
    print(e.title)
    print(e.links[0].href)
    print(e.content[0].value)

Four short links: 13 May 2020
http://feedproxy.google.com/~r/oreilly/radar/atom/~3/KQ5CMAVsal8/
<ol>
<li><a href="https://www.wired.com/story/confessions-marcus-hutchins-hacker-who-saved-the-internet/">The Confessions of Marcus Hutchins, the Hacker Who Saved the Internet</a> &#8212; Story of the MalwareTech security researcher who foiled WannaCry, only to be arrested by the FBI for having sold malware as a kid. Young Marcus had terrible opsec.</li>
<li><a href="https://www.nfx.com/post/next-social-era/">The Next Social Era is Here</a> &#8212; Arguing we&#8217;re ready for another boom in social software. <i>First, the pandemic is creating a new topology of psychological and emotional needs. [&#8230;] Second, the work environment is now open game for new social products. Two reasons for this. First, we see how good communication can be with consumer products and demand the same excellence in our work lives. But second, and newer, is that in the last few months, the distance between our 

## Harvesting blog data by parsing feeds

In [4]:
import os
import sys
import json
import feedparser
from bs4 import BeautifulSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    if html == "": return ""

    return BeautifulSoup(html, 'html5lib').get_text()

fp = feedparser.parse(FEED_URL)

print("Fetched {0} entries from '{1}'".format(len(fp.entries[0].title), fp.feed.title))

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title, 'content'
                      : cleanHtml(e.content[0].value), 'link': e.links[0].href})

out_file = os.path.join('feed.json')
f = open(out_file, 'w+')
f.write(json.dumps(blog_posts, indent=1))
f.close()

print('Wrote output file to {0}'.format(f.name))

Fetched 29 entries from 'Radar'
Wrote output file to feed.json


## Starting to write a web crawler

In [6]:
#more about beautifulsoup
#https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import httplib2
import re
from bs4 import BeautifulSoup

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

soup = BeautifulSoup(response, 'html5lib')

links = []
#more about re.compile
#this is using regular expression
#https://docs.python.org/3.2/library/re.html

#more about regular expressions
#https://www.pythonsheets.com/notes/python-rexp.html#match-email

 
for link in soup.findAll('a', attrs={'href': re.compile("^http(s?)://")}):
    links.append(link.get('href'))

for link in links:
    print(link)

https://www.nytimes.com/es/
https://cn.nytimes.com
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://www.nytimes.com/section/todayspaper
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https://www.nytimes.com/section/realestate
https://www.nytimes.com/video
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.ny

```
Create an empty graph
Create an empty queue to keep track of nodes that need to be processed

Add the starting point to the graph as the root node
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from the queue 
  For each of the node's neighbors: 
    If the neighbor hasn't already been processed: 
      Add it to the queue 
      Add it to the graph 
      Create an edge in the graph that connects the node and its neighbor
```

## Using NLTK to parse web page data

**Naive sentence detection based on periods**

In [7]:
text = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
print(text.split("."))

['Mr', ' Green killed Colonel Mustard in the study with the candlestick', ' Mr', ' Green is not a very nice fellow', '']


**More sophisticated sentence detection**

In [8]:
import nltk # Installation instructions: http://www.nltk.org/install.html

# Downloading nltk packages used in this example
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Sohair
[nltk_data]     Zaki\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
#import link for nltk 
#https://www.guru99.com/nltk-tutorial.html

sentences = nltk.tokenize.sent_tokenize(text)
print(sentences)

['Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.']


In [10]:
harder_example = """My name is John Smith and my email address is j.smith@company.com.
Mostly people call Mr. Smith. But I actually have a Ph.D.!
Can you believe it? Neither can most people..."""

sentences = nltk.tokenize.sent_tokenize(harder_example)
print(sentences)

['My name is John Smith and my email address is j.smith@company.com.', 'Mostly people call Mr. Smith.', 'But I actually have a Ph.D.!', 'Can you believe it?', 'Neither can most people...']


**Word tokenization**

In [11]:
text = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
sentences = nltk.tokenize.sent_tokenize(text)

tokens = [nltk.word_tokenize(s) for s in sentences]
print(tokens)

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.'], ['Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.']]


**Part of speech tagging for tokens**

In [12]:
import nltk
nltk.download('averaged_perceptron_tagger')

# Downloading nltk packages used in this example
nltk.download('maxent_treebank_pos_tagger')

pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]
print(pos_tagged_tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Sohair Zaki\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     C:\Users\Sohair Zaki\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\maxent_treebank_pos_tagger.zip.


[[('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')]]


**Alphabetical list of part-of-speech tags used in the Penn Treebank Project**

See: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

| # | POS Tag | Meaning |
|:-:|:-------:|:--------|
| 1	| CC | Coordinating conjunction|
|2|	CD	|Cardinal number|
|3|	DT	|Determiner|
|4|	EX	|Existential there|
|5|	FW	|Foreign word|
|6|	IN	|Preposition or subordinating conjunction|
|7|	JJ	|Adjective|
|8|	JJR	|Adjective, comparative|
|9|	JJS	|Adjective, superlative|
|10|	LS	|List item marker|
|11|	MD	|Modal|
|12|	NN	|Noun, singular or mass|
|13|	NNS	|Noun, plural|
|14|	NNP	|Proper noun, singular|
|15|	NNPS	|Proper noun, plural|
|16|	PDT	|Predeterminer|
|17|	POS	|Possessive ending|
|18|	PRP	|Personal pronoun|
|19|	PRP\$	|Possessive pronoun|
|20|	RB	|Adverb|
|21|	RBR	|Adverb, comparative|
|22|	RBS	|Adverb, superlative|
|23|	RP	|Particle|
|24|	SYM	|Symbol|
|25|	TO	|to|
|26|	UH	|Interjection|
|27|	VB	|Verb, base form|
|28|	VBD	|Verb, past tense|
|29|	VBG	|Verb, gerund or present participle|
|30|	VBN	|Verb, past participle|
|31|	VBP	|Verb, non-3rd person singular present|
|32|	VBZ	|Verb, 3rd person singular present|
|33|	WDT	|Wh-determiner|
|34|	WP	|Wh-pronoun|
|35|	WP\$|Possessive wh-pronoun|
|36|	WRB	|Wh-adverb|

**Named entity extraction/chunking for tokens**

In [13]:
# Downloading nltk packages used in this example
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to C:\Users\Sohair
[nltk_data]     Zaki\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to C:\Users\Sohair
[nltk_data]     Zaki\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [14]:
jim = "Jim bought 300 shares of Acme Corp. in 2006."

tokens = nltk.word_tokenize(jim)
jim_tagged_tokens = nltk.pos_tag(tokens)

ne_chunks = nltk.chunk.ne_chunk(jim_tagged_tokens)

In [None]:
ne_chunks

In [16]:
ne_chunks = [nltk.chunk.ne_chunk(ptt) for ptt in pos_tagged_tokens]

ne_chunks[0].pprint()
ne_chunks[1].pprint()

(S
  (PERSON Mr./NNP)
  (PERSON Green/NNP)
  killed/VBD
  (ORGANIZATION Colonel/NNP Mustard/NNP)
  in/IN
  the/DT
  study/NN
  with/IN
  the/DT
  candlestick/NN
  ./.)
(S
  (PERSON Mr./NNP)
  (ORGANIZATION Green/NNP)
  is/VBZ
  not/RB
  a/DT
  very/RB
  nice/JJ
  fellow/NN
  ./.)


## Using NLTK’s NLP tools to process human language in blog data

In [18]:
import json
import nltk

BLOG_DATA = "data/feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

# Download nltk packages used in this example
nltk.download('stopwords')

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    ']',
    '[',
    '...'
    ]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\szaki5\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Remove stopwords from fdist
    for sw in stop_words:
        del fdist[sw]
   
    # Basic stats

    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once
    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = fdist.most_common(10)

    print(post['title'])
    print('\tNum Sentences:'.ljust(25), len(sentences))
    print('\tNum Words:'.ljust(25), num_words)
    print('\tNum Unique Words:'.ljust(25), num_unique_words)
    print('\tNum Hapaxes:'.ljust(25), num_hapaxes)
    print('\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
          '\n\t\t'.join(['{0} ({1})'.format(w[0], w[1]) for w in top_10_words_sans_stop_words]))
    print()

Four short links: 21 August 2017
	Num Sentences:           10
	Num Words:               140
	Num Unique Words:        113
	Num Hapaxes:             93
	Top 10 Most Frequent Words (sans stop words):
		 signals (5)
		cloud (4)
		application (3)
		drone (3)
		operations (2)
		machine (2)
		learning (2)
		radio (2)
		flying (2)
		cameras (2)

6 practical guidelines for implementing conversational AI
	Num Sentences:           69
	Num Words:               908
	Num Unique Words:        528
	Num Hapaxes:             354
	Top 10 Most Frequent Words (sans stop words):
		 ’ (21)
		“ (21)
		” (21)
		conversational (15)
		bots (7)
		says (7)
		interaction (7)
		must (7)
		user (7)
		kai (7)

Four short links: 18 August 2017
	Num Sentences:           16
	Num Words:               263
	Num Unique Words:        204
	Num Hapaxes:             173
	Top 10 Most Frequent Words (sans stop words):
		 hype (9)
		jobs (5)
		technologies (5)
		cycle (5)
		’ (5)
		bayesian (4)
		years (4)
		style (3)
		cycles (3)

The wisdom hierarchy: From signals to artificial intelligence and beyond
	Num Sentences:           73
	Num Words:               796
	Num Unique Words:        449
	Num Hapaxes:             319
	Top 10 Most Frequent Words (sans stop words):
		 data (37)
		wisdom (19)
		information (17)
		knowledge (17)
		’ (13)
		hierarchy (13)
		humans (11)
		ai (10)
		human (9)
		value (8)

Four short links: 1 August 2017
	Num Sentences:           9
	Num Words:               142
	Num Unique Words:        120
	Num Hapaxes:             102
	Top 10 Most Frequent Words (sans stop words):
		 x86 (3)
		ai (3)
		bugs (3)
		; (3)
		rna (2)
		coding (2)
		1960s (2)
		based (2)
		make (2)
		decisions (2)

How can I add simple, automated data visualizations and dashboards to Jupyter Notebooks
	Num Sentences:           1
	Num Words:               21
	Num Unique Words:        18
	Num Hapaxes:             15
	Top 10 Most Frequent Words (sans stop words):
		 jupyter (2)
		notebooks (2)
		visualizations (2)
		learn (1

## A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

In [20]:
import json
import nltk
import numpy

BLOG_DATA = "feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

N = 100  # Number of words to consider
CLUSTER_THRESHOLD = 5  # Distance between words to consider
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

In [21]:
stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    '>',
    '<',
    '...'
    ]

In [22]:
# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn
def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = 0

    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:

        word_idx = []

        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.
                word_idx.append(s.index(w))
            except ValueError: # w not in this particular sentence
                pass

        word_idx.sort()

        # It is possible that some sentences may not contain any important words at all.
        if len(word_idx)== 0: continue

        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.

        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Score each cluster. The max score for any given cluster is the score 
        # for the sentence.

        max_cluster_score = 0
        
        for c in clusters:
            significant_words_in_cluster = len(c)
            # true clusters also contain insignificant words, so we get 
            # the total cluster length by checking the indices
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster**2 / total_words_in_cluster

            if score > max_cluster_score:
                max_cluster_score = score

        scores.append((sentence_idx, max_cluster_score))
        sentence_idx += 1

    return scores

In [24]:
def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]

    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)
    
    # Remove stopwords from fdist
    for sw in stop_words:
        del fdist[sw]

    top_n_words = [w[0] for w in fdist.most_common(N)]

    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter

    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences

    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])

In [25]:
for post in blog_data: 
    post.update(summarize(post['content']))

    print(post['title'])
    print('=' * len(post['title']))
    print()
    print('Top N Summary')
    print('-------------')
    print(' '.join(post['top_n_summary']))
    print()
    print('Mean Scored Summary')
    print('-------------------')
    print(' '.join(post['mean_scored_summary']))
    print()

Four short links: 30 April 2020

Top N Summary
-------------

To Microservices and Back Again: Why Segment Went Back to a Monolith — microservices came with increased operational overhead and problems around code reuse. … If microservices are implemented incorrectly or used as a band-aid without addressing some of the root flaws in your system, you’ll be unable to do new product development because you’re drowning in the complexity. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. (via Kernel Recipes)
Blender — Facebook open sourced their open-domain (“can talk about anything!”) chatbot. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements.

Mean Scored Summary
-------------------
Not limited to editing basic entities such as bits and bytes, it pr

What you need to know about product management for AI

Top N Summary
-------------
Identifying “viable” machine learning problems
Any product manager is part of the team that determines what product to build. It’s an intuition that you can learn through experience–and it’s why understanding your failures is at least as important as understanding your successes. If the needle doesn’t move, you will undermine your team. Customers want you to solve their problems; they don’t care what kind of neural network you’re using. Scope often needs to be reduced or quality sacrificed to align with other teams or priorities.

Mean Scored Summary
-------------------
If you’re already a software product manager (PM), you have a head start on becoming a PM for artificial intelligence (AI) or machine learning (ML). A PM for AI needs to do everything a traditional PM does, but they also need an operational understanding of machine learning software development along with a realistic view of its capabilit

6 trends framing the state of AI and ML

Top N Summary
-------------
Aggregating artificial intelligence and machine learning topics accounts for nearly 5% of all usage activity on the platform, a touch less than, and growing 50% faster than, the well-established “data science” topic (see Figure 2). Among companies using AI to support production use cases, deep learning was No. The shift to “artificial intelligence”
Does the growing engagement in neural networks, reinforcement learning, unsupervised learning, and the increased focus on putting models into production augur a shift in how practitioners in the space frame what they do? We compared aggregated data for the last three years; a full year of data for 2017 and 2018, and through the end of October for 2019. Share for specific applications (e.g., deep learning) is much higher.

Mean Scored Summary
-------------------
O’Reilly online learning is a trove of information about the trends, topics, and issues tech leaders need to know 

Remembering Freeman Dyson

Top N Summary
-------------
Freeman Dyson died last week at the age of 96 after injuring himself in a fall in the cafeteria at the Institute of Advanced Studies in Princeton, where he had continued to work right up to the end. Jonson and Shakespeare were both successful playwrights. Jonson became famous in his own right as a poet and scholar, and at the end of his life he was honored with burial in Westminster Abbey. He was not interested in publishing pretty papers. Then he began to calculate numbers, using his diagrams as a guide.

Mean Scored Summary
-------------------
Freeman Dyson died last week at the age of 96 after injuring himself in a fall in the cafeteria at the Institute of Advanced Studies in Princeton, where he had continued to work right up to the end. Perhaps the most famous example is the paper he wrote in 1949 at the age of 25 making the case that the visualizations of Richard Feynman were mathematically equivalent to the calculations of th

## Visualizing document summarization results with HTML output

In [26]:
import os
from IPython.display import IFrame
from IPython.core.display import display

HTML_TEMPLATE = """<html>
    <head>
        <title>{0}</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>{1}</body>
</html>"""

for post in blog_data:
   
    # Uses previously defined summarize function.
    post.update(summarize(post['content']))

    # You could also store a version of the full post with key sentences marked up
    # for analysis with simple string replacement...

    for summary_type in ['top_n_summary', 'mean_scored_summary']:
        post[summary_type + '_marked_up'] = '<p>{0}</p>'.format(post['content'])
        
        for s in post[summary_type]:
            post[summary_type + '_marked_up'] = \
            post[summary_type + '_marked_up'].replace(s, '<strong>{0}</strong>'.format(s))

        filename = post['title'].replace("?", "") + '.summary.' + summary_type + '.html'
        
        f = open(os.path.join(filename), 'wb')
        html = HTML_TEMPLATE.format(post['title'] + ' Summary', post[summary_type + '_marked_up'])    
        f.write(html.encode('utf-8'))
        f.close()

        print("Data written to", f.name)

# Display any of these files with an inline frame. This displays the
# last file processed by using the last value of f.name...
print()
print("Displaying {0}:".format(f.name))
display(IFrame('files/{0}'.format(f.name), '100%', '600px'))

Data written to Four short links: 30 April 2020.summary.top_n_summary.html
Data written to Four short links: 30 April 2020.summary.mean_scored_summary.html
Data written to Four short links: 29 April 2020.summary.top_n_summary.html
Data written to Four short links: 29 April 2020.summary.mean_scored_summary.html
Data written to Four short links: 28 April 2020.summary.top_n_summary.html
Data written to Four short links: 28 April 2020.summary.mean_scored_summary.html
Data written to Four short links: 27 April 2020.summary.top_n_summary.html
Data written to Four short links: 27 April 2020.summary.mean_scored_summary.html
Data written to Four short links: 24 April 2020.summary.top_n_summary.html
Data written to Four short links: 24 April 2020.summary.mean_scored_summary.html
Data written to Four short links: 23 April 2020.summary.top_n_summary.html
Data written to Four short links: 23 April 2020.summary.mean_scored_summary.html
Data written to How data privacy leader Apple found itself in a 

Data written to Four short links: 5 March 2020.summary.top_n_summary.html
Data written to Four short links: 5 March 2020.summary.mean_scored_summary.html
Data written to Remembering Freeman Dyson.summary.top_n_summary.html
Data written to Remembering Freeman Dyson.summary.mean_scored_summary.html
Data written to Four short links: 4 March 2020.summary.top_n_summary.html
Data written to Four short links: 4 March 2020.summary.mean_scored_summary.html
Data written to Four short links: 3 March 2020.summary.top_n_summary.html
Data written to Four short links: 3 March 2020.summary.mean_scored_summary.html
Data written to The death of Agile.summary.top_n_summary.html
Data written to The death of Agile.summary.mean_scored_summary.html
Data written to Four short links: 2 March 2020.summary.top_n_summary.html
Data written to Four short links: 2 March 2020.summary.mean_scored_summary.html
Data written to Four short links: 28 February 2020.summary.top_n_summary.html
Data written to Four short links

## Extracting entities from a text with NLTK

In [27]:
import nltk
import json

BLOG_DATA = "feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    sentences = nltk.tokenize.sent_tokenize(post['content'])
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    # Flatten the list since we're not using sentence structure
    # and sentences are guaranteed to be separated by a special
    # POS tuple such as ('.', '.')

    pos_tagged_tokens = [token for sent in pos_tagged_tokens for token in sent]

    all_entity_chunks = []
    previous_pos = None
    current_entity_chunk = []
    for (token, pos) in pos_tagged_tokens:

        if pos == previous_pos and pos.startswith('NN'):
            current_entity_chunk.append(token)
        elif pos.startswith('NN'):
            
            if current_entity_chunk != []:
                
                # Note that current_entity_chunk could be a duplicate when appended,
                # so frequency analysis again becomes a consideration

                all_entity_chunks.append((' '.join(current_entity_chunk), pos))
            current_entity_chunk = [token]

        previous_pos = pos

    # Store the chunks as an index for the document
    # and account for frequency while we're at it...

    post['entities'] = {}
    for c in all_entity_chunks:
        post['entities'][c] = post['entities'].get(c, 0) + 1

    # For example, we could display just the title-cased entities

    print(post['title'])
    print('-' * len(post['title']))
    proper_nouns = []
    for (entity, pos) in post['entities']:
        if entity.istitle():
            print('\t{0} ({1})'.format(entity, post['entities'][(entity, pos)]))
    print()

Four short links: 30 April 2020
-------------------------------
	Microservices (1)
	Back Again (1)
	Segment Went Back (1)
	Monolith (1)
	Kernel Recipes (1)
	Blender — Facebook (1)
	Videos (1)

Four short links: 29 April 2020
-------------------------------
	Pod Paper Scissors (1)
	Experts (1)
	Ben Klemens (1)
	Liz Landau (1)
	Ben Klemens (1)
	Verification Handbook (1)
	Craig Silverman (1)
	Ransomware Groups (1)
	Microsoft (1)
	Bug Stories — (1)

Four short links: 28 April 2020
-------------------------------
	Language — (1)
	Cookbook (1)
	” Paxos (1)
	Consensus (1)
	Distributed Consensus (1)
	Paxos (1)
	Raft (2)
	Raft (1)
	Paxos (2)
	Raft ’ (1)
	Google Research Football (1)
	Reinforcement Learning (1)

Four short links: 27 April 2020
-------------------------------
	Process (1)
	Different Computer (1)
	Consistency Maps — Jepsen (1)
	Expert Twitter Only Goes (1)
	Bring Back Blogs (1)
	Wired (1)
	Syllabus (1)

Four short links: 24 April 2020
-------------------------------
	Suddenly Remo

The unreasonable importance of data preparation
-----------------------------------------------
	Monica Rogati ’ (1)
	Needs (1)
	Rogati (1)
	Jawbone (1)
	Monica Rogati (1)
	Andrei Karpathy (1)
	Tesla (1)
	Software (1)
	Data (5)
	John Myles White (1)
	Facebook (1)
	” John (1)
	Z Z (1)
	Let ’ (1)
	X Correlation (1)
	Brian Wansink (1)
	Food (1)
	Brand Lab (1)
	Cornell University (1)
	Cornell (1)
	Wansink ’ (1)
	Data (1)
	O ’ Reilly (2)
	“ Chicago ” (1)
	Look (1)
	Google ’ (1)
	Twitter (1)
	Analysts (1)
	Set (1)
	Transparent How Variations (1)
	Analytic (1)
	Choices (1)
	Affect Results (1)
	Model (1)
	Automation (1)
	Democratization (1)
	Snorkel (1)
	Christopher Ré ’ (1)
	Stanford (1)
	Chicago ” (1)
	Snorkel (3)
	Researchers (1)
	Snorkel (1)
	Discovery (1)
	Google (1)
	Microsoft (1)
	Angela Bassa (1)
	Angela Bowne (1)
	Vicki Boykis (1)
	Joyce Chung (1)
	Mike Loukides (1)
	Mikhail Popov (1)
	Emily Robinson (1)
	Mikhail Popov (1)
	Wikimedia Foundation (1)
	Analytics Datasets (1)
	Wikipedia (

Four short links: 3 March 2020
------------------------------
	Facebook ’ (1)
	Incomplete Download Your Data (1)
	Privacy International (1)
	Facebook (2)
	“ Download (1)
	Information ” (1)
	Information (1)
	Off-Facebook (1)
	Bruce Schneier (1)
	Proxy Verifier — (1)
	Yahoo (1)
	Stripe ’ (1)
	Covid-19 Company Plan — (1)
	Google ’ (1)
	Tech-Writing Course — (1)

The death of Agile?
-------------------
	Agile (8)
	Agile (6)
	Eben Hewitt (1)
	Agile (7)
	Scrum (1)
	Scrum (1)
	None (1)
	Yep (1)
	Scrum (1)
	Douglas Adams (1)
	Agile Manifesto (1)
	Neckbeards (1)
	Geeks (1)
	Agile ’ (1)
	“ Agile ” (1)
	Isaac Newton (1)
	Agile Manifesto (1)
	Development (1)
	Progress (1)
	Amazon ’ (1)
	Make (1)
	Domain Driven Design (1)
	Manifesto (1)
	Eben (2)
	Mike Loukides Radar (1)
	O ’ Reilly (2)
	Findings (1)
	Python (2)
	Software (1)
	Coincidence (1)
	Growth (1)
	Security (1)
	Aggregate (1)
	Key (1)
	Data (1)
	Data (1)
	Organizations (1)
	Radar (1)

Four short links: 2 March 2020
--------------------------

## Discovering interactions between entities

In [28]:
import nltk
import json

BLOG_DATA = "feed.json"

def extract_interactions(txt):
    sentences = nltk.tokenize.sent_tokenize(txt)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    entity_interactions = []
    for sentence in pos_tagged_tokens:

        all_entity_chunks = []
        previous_pos = None
        current_entity_chunk = []

        for (token, pos) in sentence:

            if pos == previous_pos and pos.startswith('NN'):
                current_entity_chunk.append(token)
            elif pos.startswith('NN'):
                if current_entity_chunk != []:
                    all_entity_chunks.append((' '.join(current_entity_chunk),
                            pos))
                current_entity_chunk = [token]

            previous_pos = pos

        if len(all_entity_chunks) > 1:
            entity_interactions.append(all_entity_chunks)
        else:
            entity_interactions.append([])

    assert len(entity_interactions) == len(sentences)

    return dict(entity_interactions=entity_interactions,
                sentences=sentences)

blog_data = json.loads(open(BLOG_DATA).read())

# Display selected interactions on a per-sentence basis

for post in blog_data:

    post.update(extract_interactions(post['content']))

    print(post['title'])
    print('-' * len(post['title']))
    for interactions in post['entity_interactions']:
        print('; '.join([i[0] for i in interactions]))
    print()

Four short links: 30 April 2020
-------------------------------
Microservices; Back Again; Segment Went Back; Monolith; —; microservices; overhead; problems
…; microservices; band-aid; root; flaws; system; product development; drowning
GNU; —; editor
entities; bits; bytes; procedural; programming language
Kernel Recipes; Blender — Facebook; open-domain; anything; ”
evaluations; models; approaches; dialogue; terms; engagingness; humanness
CopyLeft Conf; Videos; schedule; info

Four short links: 29 April 2020
-------------------------------
podpaperscissors; “ prisoner; ’; dilemma ”; coördination; games; Pod Paper Scissors; game theory; textbook
episode; kinds; games
Experts; variety; fields; hosts; game; applies
episodes; music; topic
podcast; game theorist; Ben Klemens; science journalist; composer
Ben Klemens; Verification Handbook; —; guide; disinformation; media; manipulation; actors; platforms
Craig Silverman; Ransomware Groups; Microsoft; analysis; ransomware; campaigns yields; re


Four short links: 9 April 2020
------------------------------
Fuzzy Edges; Character Encoding; history; politics; basics; character encoding; representations; text; Morse Code; ASCII; Unicode; emoji; text
Everest Pipkin; AutoHotkey —; automation scripting language
Electronic Nose; Applications; Survey —; summary; tech; limitations; applications; “; noses; chemical; sensors; machine
falsisign; —; PDF

Four short links: 8 April 2020
------------------------------
System Design; Advanced Beginners; explanation; systems; acknowledgement; world; tools; strengths; weaknesses; ways
reasons; choices; X; Sara; lot; X ”; “; Y; spur; moment; t; decision; time; ” Lozya — Teleconferencing; RPG
Walk; talk; folks; conversations; corner; drop

Hammerspoon —; desktop automation framework
Lua; scripts; operating system functionality; keyboard/mouse; windows; displays
CSAIL ’; Semester Potpourri; Hitchiker ’; Guide; Logical Verification; PDF; book; course; Microsoft Research ’; Lean

Four short links: 7

The unreasonable importance of data preparation
-----------------------------------------------
world; models; algorithms; importance; data; preparation; quality; models
garbage; garbage; principle; data; leads; results; algorithms; business
car ’; algorithm; data; traffic; day; t; roads
step; algorithm; environment; cars; humans; roads
example described; garbage; side; equation; example; data; data; data
executives; AI; transformation; Monica Rogati ’; s; AI Hierarchy; Needs; everything; foundation; data; Rogati; data science; AI; advisor; VP; data; Jawbone; LinkedIn; data; scientist; courtesy; Monica Rogati
high-quality; data
basing business; decisions; dashboards; results; experiments
machine; side; Andrei Karpathy; director; AI; Tesla; Software; era; paradigm; software; machine learning; AI; focus; code; configuring; inputs; data; level; models
world; data; citizen; computation; programs; thing
model; data specification
right data; approach; function
Data; purpose; use
value; data;

people; problem; people
people; t; work; value add; picture; ”; risk; hurdle; company; technology; business
Codispoti; solution; silos; comfort; zones; expertise; colleagues
“; companies; cross collaboration
“; someone; ’; s; example; business
case; HR; finance; risk organization; audit
people; silos; heads; people; experts
Talk; partner; problems; solving
People; opportunities
s; silos; collaboration
People; mindset; ‘; area; ’; ’
transformations; culture; hats; enterprise; lead; enterprise mindset.; ”; future; humanity; ’; relationship; technology; Codispoti; machines; people
“; ’; level; awareness; technology; future; lot; ways; ’; s; technology; humanity—we; ’; case

…; end; day; work; work; process; culture; people; technology
driving; enabling; time; everybody
Nobody; technology; matter; things; ways; mind; technology

Leaders need to mobilize change-ready workforces
------------------------------------------------

King; co-director; EVP; business development; Science House; ser

AI adoption in the enterprise 2020
----------------------------------
year; interest; intelligence; AI; fever pitch; survey; AI
results; AI; space; state; change; survey
survey; weeks; December
sheds; AI; adoption; hint; deployments; prototype; popularity; techniques; tools; challenges
lot; ’
Key; survey; results; majority; %; respondent; organizations; AI; production
%; anything
half; organizations; mature; ”; adopters; AI; technologies; AI; analysis
learning; ML; technique; AI; adopters; learning; technique; organizations
problem; lack; ML; AI; skills; impediment; AI
%; respondents; lack; support
organizations; governance; controls; AI
takeaway; AI; adoption
companies; AI; production
s; companies; AI; efforts
risk; factors—bias; model development; data; tendency; models; production—or; processes; data; governance; adopters; work cut; AI; production
demographics; Survey; industries; “ Software ”; ~17; %
sample; technology category—; “ Computers; Electronics; Hardware; —accounts; %
“; 

Radar trends to watch: March 2020
---------------------------------
AI; practice; book; TinyML; Pete Warden; talks; stickers; AI; communicate; radio; contain; sensors; machinery
technology; bluetooth; stickers; ambient
year; Foster Provost; causality; thing; data
trend; years; AI; technology; cause; effect
Democratization; learning; ’; idea; t
models; cost; training; time; system
issues; AI; transition; research; development; production
aspect; transition; challenge; models; OpenAI ’; GPT-2; Google ’
coincidence; language
signs; machine learning; edge; fact; way; edge
O ’ Reilly Software Architecture Conference; New York; Mary Poppendieck; absurdity; vacuum
kind; intelligence; needs; edge
COVID-19; Coronavirus; impact; economy; shutdown; factories; China; disruption; supply; chains; deal; ’
Coronavirus; opportunity; technology
China; work; AI; education; t; closing
agriculture; Autonomous; vehicles; farm; time

t; traffic; laws
Cornell ’; College; Agriculture
Schools; Netherlands; Chin

## Visualizing interactions between entities with HTML output

In [29]:
import os
import json
import nltk
from IPython.display import IFrame
from IPython.core.display import display

BLOG_DATA = "feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>{0}</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>{1}</body>
</html>"""

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    post.update(extract_interactions(post['content']))

    # Display output as markup with entities presented in bold text

    post['markup'] = []

    for sentence_idx in range(len(post['sentences'])):

        s = post['sentences'][sentence_idx]
        for (term, _) in post['entity_interactions'][sentence_idx]:
            s = s.replace(term, '<strong>{0}</strong>'.format(term))

        post['markup'] += [s] 
            
    filename = post['title'].replace("?", "") + '.entity_interactions.html'
    f = open(os.path.join(filename), 'wb')
    html = HTML_TEMPLATE.format(post['title'] + ' Interactions', ' '.join(post['markup']))
    f.write(html.encode('utf-8'))
    f.close()

    print('Data written to', f.name)
    
    # Display any of these files with an inline frame. This displays the
    # last file processed by using the last value of f.name...
    
    print('Displaying {0}:'.format(f.name))
    display(IFrame('files/{0}'.format(f.name), '100%', '600px'))

Data written to Four short links: 30 April 2020.entity_interactions.html
Displaying Four short links: 30 April 2020.entity_interactions.html:


Data written to Four short links: 29 April 2020.entity_interactions.html
Displaying Four short links: 29 April 2020.entity_interactions.html:


Data written to Four short links: 28 April 2020.entity_interactions.html
Displaying Four short links: 28 April 2020.entity_interactions.html:


Data written to Four short links: 27 April 2020.entity_interactions.html
Displaying Four short links: 27 April 2020.entity_interactions.html:


Data written to Four short links: 24 April 2020.entity_interactions.html
Displaying Four short links: 24 April 2020.entity_interactions.html:


Data written to Four short links: 23 April 2020.entity_interactions.html
Displaying Four short links: 23 April 2020.entity_interactions.html:


Data written to How data privacy leader Apple found itself in a data ethics catastrophe.entity_interactions.html
Displaying How data privacy leader Apple found itself in a data ethics catastrophe.entity_interactions.html:


Data written to Four short links: 22 April 2020.entity_interactions.html
Displaying Four short links: 22 April 2020.entity_interactions.html:


Data written to Four short links: 21 April 2020.entity_interactions.html
Displaying Four short links: 21 April 2020.entity_interactions.html:


Data written to Four short links: 20 April 2020.entity_interactions.html
Displaying Four short links: 20 April 2020.entity_interactions.html:


Data written to Four short links: 17 April 2020.entity_interactions.html
Displaying Four short links: 17 April 2020.entity_interactions.html:


Data written to Four short links: 16 April 2020.entity_interactions.html
Displaying Four short links: 16 April 2020.entity_interactions.html:


Data written to Four short links: 15 April 2020.entity_interactions.html
Displaying Four short links: 15 April 2020.entity_interactions.html:


Data written to Four short links: 14 April 2020.entity_interactions.html
Displaying Four short links: 14 April 2020.entity_interactions.html:


Data written to Radar trends to watch: April 2020.entity_interactions.html
Displaying Radar trends to watch: April 2020.entity_interactions.html:


Data written to Four short links: 13 April 2020.entity_interactions.html
Displaying Four short links: 13 April 2020.entity_interactions.html:


Data written to Four short links: 10 April 2020.entity_interactions.html
Displaying Four short links: 10 April 2020.entity_interactions.html:


Data written to Four short links: 9 April 2020.entity_interactions.html
Displaying Four short links: 9 April 2020.entity_interactions.html:


Data written to Four short links: 8 April 2020.entity_interactions.html
Displaying Four short links: 8 April 2020.entity_interactions.html:


Data written to Four short links: 7 April 2020.entity_interactions.html
Displaying Four short links: 7 April 2020.entity_interactions.html:


Data written to Governance and Discovery.entity_interactions.html
Displaying Governance and Discovery.entity_interactions.html:


Data written to Four short links: 6 April 2020.entity_interactions.html
Displaying Four short links: 6 April 2020.entity_interactions.html:


Data written to Four short links: 3 April 2020.entity_interactions.html
Displaying Four short links: 3 April 2020.entity_interactions.html:


Data written to Four short links: 2 April 2020.entity_interactions.html
Displaying Four short links: 2 April 2020.entity_interactions.html:


Data written to Four short links: 1 April 2020.entity_interactions.html
Displaying Four short links: 1 April 2020.entity_interactions.html:


Data written to Four short links: 31 March 2020.entity_interactions.html
Displaying Four short links: 31 March 2020.entity_interactions.html:


Data written to What you need to know about product management for AI.entity_interactions.html
Displaying What you need to know about product management for AI.entity_interactions.html:


Data written to The unreasonable importance of data preparation.entity_interactions.html
Displaying The unreasonable importance of data preparation.entity_interactions.html:


Data written to Four short links: 24 March 2020.entity_interactions.html
Displaying Four short links: 24 March 2020.entity_interactions.html:


Data written to 3 ways to confront modern business challenges.entity_interactions.html
Displaying 3 ways to confront modern business challenges.entity_interactions.html:


Data written to An enterprise vision is your company’s North Star.entity_interactions.html
Displaying An enterprise vision is your company’s North Star.entity_interactions.html:


Data written to Leaders need to mobilize change-ready workforces.entity_interactions.html
Displaying Leaders need to mobilize change-ready workforces.entity_interactions.html:


Data written to Great leaders inspire innovation and creativity from within their workforces.entity_interactions.html
Displaying Great leaders inspire innovation and creativity from within their workforces.entity_interactions.html:


Data written to Strong leaders forge an intersection of knowledge and experience.entity_interactions.html
Displaying Strong leaders forge an intersection of knowledge and experience.entity_interactions.html:


Data written to Four short links: 23 March 2020.entity_interactions.html
Displaying Four short links: 23 March 2020.entity_interactions.html:


Data written to Four short links: 20 March 2020.entity_interactions.html
Displaying Four short links: 20 March 2020.entity_interactions.html:


Data written to 6 trends framing the state of AI and ML.entity_interactions.html
Displaying 6 trends framing the state of AI and ML.entity_interactions.html:


Data written to Four short links: 19 March 2020.entity_interactions.html
Displaying Four short links: 19 March 2020.entity_interactions.html:


Data written to It’s an unprecedented crisis: 8 things to do right now.entity_interactions.html
Displaying It’s an unprecedented crisis: 8 things to do right now.entity_interactions.html:


Data written to AI adoption in the enterprise 2020.entity_interactions.html
Displaying AI adoption in the enterprise 2020.entity_interactions.html:


Data written to Four short links: 18 March 2020.entity_interactions.html
Displaying Four short links: 18 March 2020.entity_interactions.html:


Data written to Four short links: 17 March 2020.entity_interactions.html
Displaying Four short links: 17 March 2020.entity_interactions.html:


Data written to Four short links: 16 March 2020.entity_interactions.html
Displaying Four short links: 16 March 2020.entity_interactions.html:


Data written to Four short links: 13 March 2020.entity_interactions.html
Displaying Four short links: 13 March 2020.entity_interactions.html:


Data written to Four short links: 12 March 2020.entity_interactions.html
Displaying Four short links: 12 March 2020.entity_interactions.html:


Data written to Four short links: 11 March 2020.entity_interactions.html
Displaying Four short links: 11 March 2020.entity_interactions.html:


Data written to Four short links: 10 March 2020.entity_interactions.html
Displaying Four short links: 10 March 2020.entity_interactions.html:


Data written to Four short links: 9 March 2020.entity_interactions.html
Displaying Four short links: 9 March 2020.entity_interactions.html:


Data written to Four short links: 6 March 2020.entity_interactions.html
Displaying Four short links: 6 March 2020.entity_interactions.html:


Data written to Radar trends to watch: March 2020.entity_interactions.html
Displaying Radar trends to watch: March 2020.entity_interactions.html:


Data written to Four short links: 5 March 2020.entity_interactions.html
Displaying Four short links: 5 March 2020.entity_interactions.html:


Data written to Remembering Freeman Dyson.entity_interactions.html
Displaying Remembering Freeman Dyson.entity_interactions.html:


Data written to Four short links: 4 March 2020.entity_interactions.html
Displaying Four short links: 4 March 2020.entity_interactions.html:


Data written to Four short links: 3 March 2020.entity_interactions.html
Displaying Four short links: 3 March 2020.entity_interactions.html:


Data written to The death of Agile.entity_interactions.html
Displaying The death of Agile.entity_interactions.html:


Data written to Four short links: 2 March 2020.entity_interactions.html
Displaying Four short links: 2 March 2020.entity_interactions.html:


Data written to Four short links: 28 February 2020.entity_interactions.html
Displaying Four short links: 28 February 2020.entity_interactions.html:


Data written to Four short links: 27 February 2020.entity_interactions.html
Displaying Four short links: 27 February 2020.entity_interactions.html:


Data written to Intellectual control.entity_interactions.html
Displaying Intellectual control.entity_interactions.html:


Data written to Highlights from the O’Reilly Software Architecture Conference in New York 2020.entity_interactions.html
Displaying Highlights from the O’Reilly Software Architecture Conference in New York 2020.entity_interactions.html:
