# How to build your own content feed with Python

We crawl Reddit and Hacker News with Scrapy, build a weighted random feed from the results, and then analyze the articles with machine learning algorithms hosted on Algorithmia.

## Scraping the web

#### Step 1: Install `scrapy`

In [1]:
# Install scrapy if not already installed
#
# How to read this line: 
#   1. Show info about installed package scrapy.
#   2. If it could not find the package: install it.
!pip3 show scrapy || pip3 install --user scrapy

Name: Scrapy
Version: 1.4.0
Summary: A high-level Web Crawling and Web Scraping framework
Home-page: http://scrapy.org
Author: Pablo Hoffman
Author-email: pablo@pablohoffman.com
License: BSD
Location: /home/erb/.local/lib/python3.6/site-packages
Requires: PyDispatcher, cssselect, queuelib, lxml, service-identity, parsel, six, Twisted, w3lib, pyOpenSSL
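
The same "is it installed?" check can be done from inside Python. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of pip or Scrapy.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

# Note: the import name can differ from the package name shown by pip
# (the "Scrapy" package is imported as "scrapy").
print("scrapy installed:", is_installed("scrapy"))
```

Unlike the shell one-liner above, this only reports the result and does not install anything.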


In [2]:
# For some Very Good Reasons, we've made our crawler runnable as a script.
!python3 run_crawler.py

2017-09-17 23:03:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-17 23:03:05 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2017-09-17 23:03:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-17 23:03:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-17 23:03:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.reddit.com/r/Python/top/?sort=top&t=day> from <GET https://www.reddit.com/r/python/top/?sort=top&t=day>


In [3]:
from basespiders import load_content

all_content = load_content()
print("Loaded {} things".format(len(all_content)))

Loaded 140 things
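
`basespiders` is a local module whose source is not shown here. Judging only from the output in this notebook, each loaded item has `source`, `title`, `url` and `score` attributes. The class below is a hypothetical stand-in with those fields, which may be handy for trying the later cells without running the crawler. The real class may differ, and the 40-character title cut-off in `__repr__` is a guess based on the printed output.

```python
class Content:
    """Hypothetical stand-in for the items returned by load_content()."""

    def __init__(self, source, title, url, score=None):
        self.source = source
        self.title = title
        self.url = url
        self.score = score

    def __str__(self):
        # Matches the "[score] title (url)" lines printed in this notebook
        return "[{}] {} ({})".format(self.score, self.title, self.url)

    def __repr__(self):
        # Assumption: long titles are cut at 40 characters and marked with "..."
        title = self.title if len(self.title) <= 40 else self.title[:40] + "..."
        return "<Content source='{}' title='{}'>".format(self.source, title)

sample = Content("reddit", "Python made animation.",
                 "/r/Python/comments/70nxze/python_made_animation/", 193)
print(sample)
print(repr(sample))
```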


In [4]:
def score_key(t):
    """Useful when sorting content by score"""
    return int(t.score) if t.score else 0

def list_content(title, content, n=5):
    print("=" * (len(title) + 2))
    print(" " + title + " ")
    print("-" * (len(title) + 2))
    for c in content[:n]:
        print(c)
        
reddit_things = sorted([c for c in all_content if c.source == "reddit"], key=score_key, reverse=True)
list_content("Reddit", reddit_things)

hn_things = sorted([c for c in all_content if c.source == "hn"], key=score_key, reverse=True)
list_content("Hacker News", hn_things)

 Reddit 
--------
[389] A font for coders: "Input". I have never seen clearer curly brackets! (http://input.fontbureau.com/info/)
[193] Python made animation. (/r/Python/comments/70nxze/python_made_animation/)
[170] Chrome to force .dev domains to HTTPS via preloaded HSTS (https://ma.ttias.be/chrome-force-dev-domains-https-via-preloaded-hsts/)
[158] What’s New In Python 3.7 — Python 3.7.0a0 documentation (https://docs.python.org/3.7/whatsnew/3.7.html)
[105] Python Release Python 2.7.14 (https://www.python.org/downloads/release/python-2714/)
 Hacker News 
-------------
[826] Firefox Multi-Account Containers (https://blog.mozilla.org/firefox/introducing-firefox-multi-account-containers/)
[569] We've failed: open access is winning and we must change our approach (http://onlinelibrary.wiley.com/doi/10.1002/leap.1116/full)
[362] Buffett wins $1M decade-old bet that the S&P500 would outperform hedgefunds (http://www.aei.org/publication/warren-buffett-wins-1m-bet-made-a-decade-ago-that-the-sp
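
The two `sorted(...)` calls above repeat the same filter once per source. With more sources, grouping first may be tidier. This is a sketch using `collections.defaultdict`; `Item` and the sample data are made up so the block runs on its own, and `group_by_source` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Made-up stand-in for the crawled content objects
Item = namedtuple("Item", ["source", "title", "score"])

def score_key(t):
    """Same idea as the notebook's score_key: a missing score counts as 0."""
    return int(t.score) if t.score else 0

def group_by_source(things):
    """Return {source: [items sorted by score, highest first]}."""
    groups = defaultdict(list)
    for t in things:
        groups[t.source].append(t)
    for items in groups.values():
        items.sort(key=score_key, reverse=True)
    return dict(groups)

items = [
    Item("reddit", "a", 10),
    Item("hn", "b", None),
    Item("reddit", "c", "25"),
    Item("hn", "d", 7),
]
by_source = group_by_source(items)
for source, group in by_source.items():
    print(source, [t.title for t in group])
```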

In [5]:
import random
from random import randint

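def weighted_random(pairs):
    """Pick a value from (weight, value) pairs with probability proportional to its weight.

    Weights must be positive integers.
    """
    total = sum(weight for weight, _ in pairs)
    r = randint(1, total)
    for (weight, value) in pairs:
        r -= weight
        if r <= 0:
            return value

def pick_random_weighted(things, key=score_key):
    """Pick one thing at random, weighted by key(thing) (the score by default)."""
    return weighted_random([(key(thing), thing) for thing in things])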

c = pick_random_weighted(reddit_things)
print(repr(c))
print(c.title)
print(c.url)

<Content source='reddit' title='Pythons Positive Press Pumps Pandas'>
Pythons Positive Press Pumps Pandas
http://paddy3118.blogspot.co.uk/2017/09/pythons-positive-press-pumps-pandas.html
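
To see why subtracting weights from `r` gives a weighted pick, it helps to fix `r` by hand. The sketch below takes `r` as a parameter (a hypothetical variant named `pick_for`, not the function above) and enumerates every possible draw for weights 3 and 1. Three of the four values of `r` select the first item.

```python
def pick_for(r, pairs):
    """Deterministic core of the weighted pick: r plays the role of randint(1, total)."""
    for weight, value in pairs:
        r -= weight
        if r <= 0:
            return value
    raise ValueError("r exceeds the total weight")

pairs = [(3, "a"), (1, "b")]
total = sum(weight for weight, _ in pairs)

# Try every value randint(1, total) could return
outcomes = [pick_for(r, pairs) for r in range(1, total + 1)]
print(outcomes)  # ['a', 'a', 'a', 'b']
```

On Python 3.6 and later, `random.choices(values, weights=weights, k=1)[0]` from the standard library draws from the same distribution and also accepts non-integer weights.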


In [6]:
def recommend(things, n=20):
    """A simple recommender that picks n articles, choosing each one's source uniformly at random"""
    # Copy the input list and get rid of things that don't have a score
    things = [t for t in things if t.score and int(t.score) >= 1]
    
    good_stuff = []
    for i in range(n):
        # Pick a source uniformly at random (every source is equally likely)
        source = random.choice(["reddit", "hn"])
        
        source_things = [t for t in things if t.source == source]
        recommended = pick_random_weighted(source_things)
        
        things.remove(recommended)
        good_stuff.append(recommended)
    return good_stuff

recommend(all_content, n=5)

[<Content source='hn' title='Firefox Multi-Account Containers'>,
 <Content source='hn' title='Wind energy used to mine cryptocurrency ...'>,
 <Content source='reddit' title='Python made animation.'>,
 <Content source='reddit' title='Whats New In Python 3.7  Python 3.7.0a...'>,
 <Content source='hn' title='Chrome to force .dev domains to HTTPS vi...'>]
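
One caveat with `recommend`: if the randomly chosen source has no items left, `weighted_random` receives an empty list and `randint(1, 0)` raises a `ValueError`. A possible guard, sketched here with made-up data, is to choose only among sources that still have items. `recommend_safe` and `Item` are hypothetical names, and `random.choices` (Python 3.6+) stands in for `pick_random_weighted`.

```python
import random
from collections import namedtuple

# Made-up stand-in for the crawled content objects
Item = namedtuple("Item", ["source", "title", "score"])

def recommend_safe(things, n=20):
    """Pick up to n items; the source is drawn only from sources that still have items."""
    things = [t for t in things if t.score and int(t.score) >= 1]
    picked = []
    while things and len(picked) < n:
        # Only sources with remaining items are candidates
        sources = sorted({t.source for t in things})
        source = random.choice(sources)
        candidates = [t for t in things if t.source == source]
        weights = [int(t.score) for t in candidates]
        chosen = random.choices(candidates, weights=weights, k=1)[0]
        things.remove(chosen)
        picked.append(chosen)
    return picked

items = [
    Item("reddit", "a", 10),
    Item("reddit", "b", 3),
    Item("reddit", "c", None),  # no score, so it is filtered out
    Item("hn", "d", 7),
]
result = recommend_safe(items, n=10)
print([t.title for t in result])
```

Asking for more items than exist simply returns everything that has a score, in a random order.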

In [7]:
# These weights give us about 60% reddit content and 40% HN content
# Sources not in this list will have weight 1 by default.
source_weights = {"reddit": 6, "hn": 4}

def recommend_generator(things):
    """
    A little more advanced than the above. 
    Uses a generator so we can get however many recommendations we want.
    This simplifies the logic, which enables us to easily add some other features.
    """
    # Copy the input list and get rid of things that don't have a score
    things = [t for t in things if t.score and int(t.score) >= 1]
    
    # Create a set of all sources
    sources = {t.source for t in things}
    sources_weighted = [(source_weights.get(source, 1), source)
                        for source in sources]
    
    while len(things) > 0:
        # Pick a source according to some weights
        source = weighted_random(sources_weighted)
        
        source_things = [t for t in things if t.source == source]
        if len(source_things) <= 0:
            continue
            
        # Pick one article from the source, weighted by score
        recommended = pick_random_weighted(source_things)
        
        # Remove it so that it cannot be recommended twice
        things.remove(recommended)
        
        yield recommended
        
feed = list(recommend_generator(all_content))[:20]
for source in source_weights:
    print("{:8s}: {} out of {}".format(source, len([t for t in feed if t.source == source]), len(feed)))
feed

reddit  : 12 out of 20
hn      : 8 out of 20


[<Content source='reddit' title='A font for coders: "Input". I have never...'>,
 <Content source='hn' title='Ask HN: What's the next big advance in A...'>,
 <Content source='hn' title='Two museums having an informative fight ...'>,
 <Content source='reddit' title='Chrome to force .dev domains to HTTPS vi...'>,
 <Content source='hn' title='To treat back pain, look to the brain no...'>,
 <Content source='hn' title='Chrome to force .dev domains to HTTPS vi...'>,
 <Content source='hn' title='Buffett wins $1M decade-old bet that the...'>,
 <Content source='reddit' title='Discover the world of microcontrollers t...'>,
 <Content source='reddit' title='Do you use pandas Index?'>,
 <Content source='reddit' title='Python made animation.'>,
 <Content source='reddit' title='Electron: The Bad Parts'>,
 <Content source='reddit' title='Pythons Positive Press Pumps Pandas'>,
 <Content source='reddit' title='Python Release Python 2.7.14'>,
 <Content source='hn' title='Here I Stand, at Age 80'>,
 <Conte

In [8]:
# Run this once to create the content recommendation generator
R = recommend_generator(all_content)

In [9]:
# Every time you run this, you will get fresh content using the generator
for _ in range(10):
    t = next(R)
    print(t)

[389] A font for coders: "Input". I have never seen clearer curly brackets! (http://input.fontbureau.com/info/)
[158] What’s New In Python 3.7 — Python 3.7.0a0 documentation (https://docs.python.org/3.7/whatsnew/3.7.html)
[193] Python made animation. (/r/Python/comments/70nxze/python_made_animation/)
[8] OpenJ9 (https://github.com/eclipse/openj9)
[105] Python Release Python 2.7.14 (https://www.python.org/downloads/release/python-2714/)
[89] California Legislature Sells Out Our Data to ISPs (https://www.eff.org/deeplinks/2017/09/california-legislature-sells-out-our-data-isps)
[74] Retro Arcade Racing Game - Programming from Scratch (Quick and Simple C++) (https://www.youtube.com/watch?v=KkMZI5Jbf18)
[49] Understanding V8’s Bytecode (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775)
[97] Electron: The Bad Parts (https://hackernoon.com/electron-the-bad-parts-2b710c491547)
[46] The Ultimate List of Python Podcasts (https://dbader.org/blog/ultimate-list-of-python-podcasts)
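
Note that `next(R)` raises `StopIteration` once the generator runs out of content. Taking batches with `itertools.islice` avoids that: it simply returns a shorter (and eventually empty) batch. In this sketch the `numbers` generator is a made-up stand-in for `recommend_generator(all_content)`.

```python
from itertools import islice

def numbers(limit):
    """Made-up generator standing in for recommend_generator(all_content)."""
    for i in range(limit):
        yield i

R = numbers(5)

# Each call consumes the next batch from the same generator
first = list(islice(R, 3))
second = list(islice(R, 3))
third = list(islice(R, 3))
print(first, second, third)  # [0, 1, 2] [3, 4] []
```

Because `islice` stops pulling items after the requested count, the rest of the feed is not computed until it is asked for.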

In [10]:
# Alternatively, you can get all the content in one go.
R = recommend_generator(all_content)
feed = list(R)
print("        | Your Feed")
print("———————— — — — —  —  —  —  —")
for t in feed[:10]:
    print("@{:6} | {}".format(t.source, t.title))
    print("{:6}p | {}".format(t.score, t.url))
    print("———————— — — — —  —  —  —  —")

        | Your Feed
———————— — — — —  —  —  —  —
@reddit | What’s New In Python 3.7 — Python 3.7.0a0 documentation
   158p | https://docs.python.org/3.7/whatsnew/3.7.html
———————— — — — —  —  —  —  —
@hn     | Finding UX in the Trash
    78p | https://f2.svbtle.com/ux-in-the-trash
———————— — — — —  —  —  —  —
@reddit | Chrome to force .dev domains to HTTPS via preloaded HSTS
   170p | https://ma.ttias.be/chrome-force-dev-domains-https-via-preloaded-hsts/
———————— — — — —  —  —  —  —
@reddit | Python explosion blamed on pandas
    42p | https://www.theregister.co.uk/2017/09/14/python_explosion_blamed_on_pandas/
———————— — — — —  —  —  —  —
@reddit | A font for coders: "Input". I have never seen clearer curly brackets!
   389p | http://input.fontbureau.com/info/
———————— — — — —  —  —  —  —
@reddit | Python made animation.
   193p | /r/Python/comments/70nxze/python_made_animation/
———————— — — — —  —  —  —  —
@hn     | Discover the world of microcontrollers through Rust
   234p | https:/
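
The column layout above relies on a detail of `str.format`: with a bare width such as `{:6}`, strings are left-aligned and numbers are right-aligned. A small sketch with made-up values:

```python
# Strings are padded on the right by default
source_cell = "@{:6}".format("hn")
print(repr(source_cell))  # '@hn    '

# Integers are padded on the left by default
score_cell = "{:6}p".format(78)
print(repr(score_cell))   # '    78p'

# If the score were stored as a string, ">" forces right alignment
explicit = "{:>6}p".format("78")
print(repr(explicit))     # '    78p'
```

A score of `None` cannot be formatted with a width at all, which is one more reason the generator filters out unscored items first.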

## Analyzing articles

Now that we have some content to play with, let's analyze it.

#### Step 1: Install and import the algorithmia package

In [11]:
# Install
!pip3 show algorithmia || pip3 install --user algorithmia

Name: algorithmia
Version: 1.0.8
Summary: Algorithmia Python Client
Home-page: http://github.com/algorithmiaio/algorithmia-python
Author: Algorithmia
Author-email: support@algorithmia.com
License: MIT
Location: /home/erb/.local/lib/python3.6/site-packages
Requires: requests, six, enum34


In [12]:
# Import
import Algorithmia

#### Step 2: Get and set up an API key

Create an account on Algorithmia, log in, and put your API key below (it can be found [here](https://algorithmia.com/users/erb#credentials)).

In [13]:
API_KEY = 'simtICyX1Ng5PD33Bm479NS78Sq1'
client = Algorithmia.client(API_KEY)
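
Hard-coding a key in a notebook makes it easy to publish by accident. A common alternative is to read it from an environment variable. In this sketch the variable name `ALGORITHMIA_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration, not something the Algorithmia client looks for on its own.

```python
import os

def load_api_key(env=os.environ, name="ALGORITHMIA_API_KEY"):
    """Read the API key from an environment-like mapping, failing loudly if it is missing."""
    key = env.get(name)
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(name))
    return key

# Usage (commented out so the block runs without the variable or the client):
# client = Algorithmia.client(load_api_key())
```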

#### Step 3: Analyze it!

Now for the *really* cool part.

Here are a few useful algos you could do cool stuff with:

 - https://algorithmia.com/algorithms/nlp/SentimentByTerm
 - https://algorithmia.com/algorithms/nlp/SummarizeURL
 - https://algorithmia.com/algorithms/nlp/AutoTag
 - https://algorithmia.com/algorithms/tags/AutoTagURL
 - https://algorithmia.com/algorithms/StanfordNLP/NamedEntityRecognition

In [14]:
# See which terms tend to appear in the post titles
# Note that this doesn't work very well (for example, "Python" is tagged as a PERSON below).
# This is probably because of the capitalization used in titles, which the algorithm presumably relies upon.

import logging
from collections import Counter
from pprint import pprint

# Hide DEBUG-level log messages from the requests library
logging.getLogger("requests").setLevel(logging.INFO)


# Join all titles together into a sequence of sentences
titles = [c.title for c in all_content if c.title]
title_corpus = ". ".join(titles)

algo = client.algo('StanfordNLP/NamedEntityRecognition/0.2.0')
response = algo.pipe({"document": title_corpus})

c = Counter()
for sentence in response.result["sentences"]:
    for entity in sentence["detectedEntities"]:
        entity_type = entity["entity"]
        word = entity["word"]
        if entity_type in ["ORGANIZATION", "PERSON", "LOCATION"]:
            #print("{}: {}".format(entity_type, word))
            c[(entity_type, word)] += 1
            
pprint(c.most_common(10))

[(('PERSON', 'Python'), 3),
 (('PERSON', 'Buffett'), 2),
 (('ORGANIZATION', 'S&P'), 2),
 (('LOCATION', 'China'), 2),
 (('PERSON', 'Isaac'), 2),
 (('PERSON', 'Newton'), 2),
 (('LOCATION', 'California'), 2),
 (('PERSON', 'Larry'), 2),
 (('PERSON', 'Ellison'), 2),
 (('ORGANIZATION', 'NBA'), 1)]
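
The counting logic can be checked without calling the API by moving it into a function and feeding it a hand-written result. The dictionary below only mimics the shape the cell above reads (`sentences`, then `detectedEntities`, then `entity` and `word`). Its contents are made up, and `count_entities` is a hypothetical helper.

```python
from collections import Counter

def count_entities(result, wanted=("ORGANIZATION", "PERSON", "LOCATION")):
    """Count (entity_type, word) pairs, keeping only the wanted entity types."""
    counts = Counter()
    for sentence in result["sentences"]:
        for entity in sentence["detectedEntities"]:
            if entity["entity"] in wanted:
                counts[(entity["entity"], entity["word"])] += 1
    return counts

# Made-up data in the same shape as response.result
fake_result = {
    "sentences": [
        {"detectedEntities": [
            {"entity": "PERSON", "word": "Buffett"},
            {"entity": "MONEY", "word": "$1M"},
        ]},
        {"detectedEntities": [
            {"entity": "PERSON", "word": "Buffett"},
            {"entity": "LOCATION", "word": "China"},
        ]},
    ]
}

counts = count_entities(fake_result)
print(counts.most_common(2))
```

With the real client, the call would presumably be `count_entities(response.result)`.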


In [15]:
# Create the algorithm handle once and reuse it for every URL
algo = client.algo('tags/AutoTagURL/0.1.9')
for c in all_content[:5]:
    response = algo.pipe(c.url)
    if response.result:
        print(c.title)
        print(Counter(response.result).most_common(5))
        print()

The iPhone X’s notch is basically a Kinect
[('primesense', 5), ('based', 2), ('huge', 2), ('intel', 2), ('evidenced', 1)]

Discover the world of microcontrollers through Rust
[('book', 5), ('embedded', 4), ('digital', 3), ('loop', 2), ('make', 2)]

A Prehistory of the Ethereum Protocol
[('protocol', 17), ('contract', 16), ('limit', 10), ('transactions', 7), ('gas', 6)]

Random Write Considered Harmful in SSDs (2012) [pdf]
[('file', 131), ('systems', 57), ('system', 51), ('flash', 37), ('pages', 33)]

Here I Stand, at Age 80
[('alternative', 1), ('career', 1), ('ideals', 1)]
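
Some Reddit entries in the crawl carry site-relative URLs (for example `/r/Python/comments/70nxze/python_made_animation/`), which a URL-based algorithm such as AutoTagURL presumably cannot fetch. A sketch of making them absolute with `urllib.parse.urljoin` before calling the API; `absolute_url` is a hypothetical helper, and using `https://www.reddit.com` as the base is an assumption that only makes sense for items whose source is Reddit.

```python
from urllib.parse import urljoin

REDDIT_BASE = "https://www.reddit.com"

def absolute_url(url, base=REDDIT_BASE):
    """Resolve a relative URL against base; absolute URLs come back unchanged."""
    return urljoin(base, url)

relative = absolute_url("/r/Python/comments/70nxze/python_made_animation/")
unchanged = absolute_url("https://docs.python.org/3.7/whatsnew/3.7.html")
print(relative)
print(unchanged)
```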



In [17]:
# Create the algorithm handle once and reuse it for every URL
algo = client.algo('nlp/SummarizeURL/0.1.4')
for c in all_content[:5]:
    print(c.title)
    response = algo.pipe([c.url, 2])
    print("Summary:", response.result[:300])
    print()

The iPhone X’s notch is basically a Kinect
Summary: Sometimes it's hard to tell exactly how fast technology is moving. But Apple's iPhone X provides a nice little illustration of how sensor and processing technology has evolved in the past decade.

Discover the world of microcontrollers through Rust
Summary: Discover the world of microcontrollers through Rust. This book is an "introductory course" on microcontroller-based "embedded systems" that uses Rust as the teaching language rather than the usual C/C++.

A Prehistory of the Ethereum Protocol
Summary: While we can certainly make more blog posts talking about all of the various ideas Vlad, Gavin, myself and others came up with, and discarded, including “proof of proof of work”, hub-and-spoke chains, “hypercubes”, shadow chains (arguably a precursor to Plasma), chain fibers, and various iterations 

Random Write Considered Harmful in SSDs (2012) [pdf]
Summary: 
SFS: Random Write Considered Harmful in Solid State Drives
Changwoo Mina,