# How to build your own content feed with Python

We crawl Reddit and Hacker News, then analyze the articles with machine learning algorithms from Algorithmia.

## Scraping the web

#### Step 1: Install `scrapy`

In [196]:
# Install scrapy if not already installed
# How to read this line: 
#   1. Show info about installed package scrapy.
#   2. If it could not find the package: install it.
!pip3 show scrapy || pip3 install --user scrapy

Name: Scrapy
Version: 1.4.0
Summary: A high-level Web Crawling and Web Scraping framework
Home-page: http://scrapy.org
Author: Pablo Hoffman
Author-email: pablo@pablohoffman.com
License: BSD
Location: /home/erb/.local/lib/python3.6/site-packages
Requires: Twisted, queuelib, lxml, parsel, w3lib, PyDispatcher, service-identity, cssselect, pyOpenSSL, six
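
The same install-if-missing check can also be written in Python, which may be handy outside a notebook. This is only a minimal sketch: the helper names `is_installed` and `ensure_installed` are made up for illustration and are not part of the original notebook.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, package_name=None):
    """Install a package with pip only when its module is missing.

    Hypothetical helper. `package_name` defaults to `module_name`; the two can
    differ (for example the `algorithmia` package is imported as `Algorithmia`).
    Returns True if an install was attempted, False if nothing was needed.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--user", package_name or module_name]
    )
    return True


# `json` ships with Python, so no install is triggered here
print(is_installed("json"))
```

Calling `ensure_installed("scrapy")` would then mirror the `pip3 show scrapy || pip3 install --user scrapy` line above.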


In [197]:
# For some Very Good Reasons, we've made our crawler runnable as a script.
!python3 run_crawler.py

2017-09-17 21:57:52 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-17 21:57:52 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2017-09-17 21:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-17 21:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)


In [198]:
from basespiders import load_content

all_content = load_content()
print("Loaded {} things".format(len(all_content)))

Loaded 115 things
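
The notebook never shows what `load_content()` returns, but the later cells only rely on four attributes (`source`, `title`, `url`, `score`) and on how each item prints. A hypothetical stand-in with that shape might look like the sketch below; the class body is a guess based on the printed output, not the actual code in `basespiders`.

```python
class Content:
    """Hypothetical stand-in for the items returned by load_content()."""

    def __init__(self, source, title, url, score=None):
        self.source = source  # "reddit" or "hn" in this notebook
        self.title = title
        self.url = url
        self.score = score    # may be missing; the notebook converts it with int()

    def __str__(self):
        # Mirrors the "[score] title (url)" lines seen in the notebook output
        return "[{}] {} ({})".format(self.score, self.title, self.url)

    def __repr__(self):
        # Assumption: the title is shortened to 40 characters plus an ellipsis
        short_title = self.title[:40] + "..."
        return "<Content source={!r} title={!r}>".format(self.source, short_title)


item = Content(
    "reddit",
    "Electron: The Bad Parts",
    "https://hackernoon.com/electron-the-bad-parts-2b710c491547",
    score="92",
)
print(item)
```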


In [199]:
def score_key(t):
    """Useful when sorting content by score"""
    return int(t.score) if t.score else 0

def list_content(title, content, n=5):
    print("=" * (len(title) + 2))
    print(" " + title + " ")
    print("-" * (len(title) + 2))
    for c in content[:n]:
        print(c)
        
reddit_things = sorted([c for c in all_content if c.source == "reddit"], key=score_key, reverse=True)
list_content("Reddit", reddit_things)

hn_things = sorted([c for c in all_content if c.source == "hn"], key=score_key, reverse=True)
list_content("Hacker News", hn_things)

 Reddit 
--------
[164] A font for coders: "Input". I have never seen clearer curly brackets! (http://input.fontbureau.com/info/)
[153] Chrome to force .dev domains to HTTPS via preloaded HSTS (https://ma.ttias.be/chrome-force-dev-domains-https-via-preloaded-hsts/)
[92] Electron: The Bad Parts (https://hackernoon.com/electron-the-bad-parts-2b710c491547)
[54] Retro Arcade Racing Game - Programming from Scratch (Quick and Simple C++) (https://www.youtube.com/watch?v=KkMZI5Jbf18)
[30] Fully commented machine learning code, made with love by yours truly. (Python 3) (https://repl.it/LMlH)
 Hacker News 
-------------
[826] Firefox Multi-Account Containers (https://blog.mozilla.org/firefox/introducing-firefox-multi-account-containers/)
[569] We've failed: open access is winning and we must change our approach (http://onlinelibrary.wiley.com/doi/10.1002/leap.1116/full)
[569] We've failed: open access is winning and we must change our approach (http://onlinelibrary.wiley.com/doi/10.1002/leap.11
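
Why does `score_key` bother with `int()` and the fallback to 0? A small sketch, assuming scores arrive as strings and are sometimes missing (the `Thing` tuple is a hypothetical stand-in for the crawled items):

```python
from collections import namedtuple

# Hypothetical stand-in for a crawled item; only the fields needed here
Thing = namedtuple("Thing", ["title", "score"])


def score_key(t):
    """Sort key: numeric score, with missing or empty scores counted as 0."""
    return int(t.score) if t.score else 0


things = [Thing("a", "54"), Thing("b", None), Thing("c", "164"), Thing("d", "")]

# Comparing the raw strings would rank "54" above "164" ('5' > '1')
print(sorted(["54", "164"], reverse=True))   # ['54', '164']

# Sorting with the key compares numbers and tolerates missing scores
ranked = sorted(things, key=score_key, reverse=True)
print([t.title for t in ranked])             # ['c', 'a', 'b', 'd']
```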

In [200]:
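import random

def weighted_random(pairs):
    """Pick a value from (weight, value) pairs, with probability proportional to its weight."""
    total = sum(weight for (weight, _) in pairs)
    r = random.randint(1, total)  # raises ValueError if every weight is 0
    for (weight, value) in pairs:
        r -= weight
        if r <= 0:
            return value

def pick_random_weighted(things, key=score_key):
    # Weight each thing by key(thing), which defaults to its score
    return weighted_random([(key(thing), thing) for thing in things])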

c = pick_random_weighted(reddit_things)
print(repr(c))
print(c.title)
print(c.url)

<Content source='reddit' title='A font for coders: "Input". I have never...'>
A font for coders: "Input". I have never seen clearer curly brackets!
http://input.fontbureau.com/info/
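
The subtraction loop in `weighted_random` draws each item with probability proportional to its weight. The standard library's `random.choices` (Python 3.6+) performs the same kind of weighted draw, so an alternative sketch could look like this; `weighted_pick` and the sample pairs are made up for illustration.

```python
import random
from collections import Counter


def weighted_pick(pairs, rng=random):
    """Pick one value from (weight, value) pairs, proportionally to weight."""
    weights = [w for w, _ in pairs]
    values = [v for _, v in pairs]
    # random.choices returns a list of k items; k=1 gives a single draw
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded so the demo is repeatable
pairs = [(164, "font"), (30, "ml-code"), (0, "unscored")]

# Draw many times to see the weights at work
draws = Counter(weighted_pick(pairs, rng) for _ in range(1000))
print(draws.most_common())
```

Items with weight 0 are never drawn, and "font" should come up roughly five times as often as "ml-code".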


In [201]:
def recommend(things):
    good_stuff = []
    for source in ["reddit", "hn"]:
        source_things = [t for t in things if t.source == source]
        for _ in range(10):
            c = pick_random_weighted(source_things)
            source_things.remove(c)
            good_stuff.append(c)
    return good_stuff
        
recommend(all_content)

def recommend_generator(things):
    # Copy the input list and get rid of things that don't have a score
    things = [t for t in things if t.score and int(t.score) >= 1]
    while len(things) > 0:
        # Pick a source uniformly at random
        source = random.choice(["reddit", "hn"])
        
        source_things = [t for t in things if t.source == source]
        if len(source_things) <= 0:
            continue
            
        # Pick one article from the source, weighted by score
        coming_up = pick_random_weighted(source_things)
        things.remove(coming_up)  # Remove to prevent thing from appearing twice
        
        yield coming_up

R = recommend_generator(all_content)

In [202]:
# Run this once to create the content recommendation generator
R = recommend_generator(all_content)
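
To see what the generator does without crawling anything, here is a self-contained sketch on toy data. It is a variant rather than the notebook's exact code: it only chooses among sources that still have items left (so it never spins on an exhausted source), and it uses `random.choices` for the weighted draw. `Thing` and `feed_generator` are hypothetical names.

```python
import random
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in for a crawled item
Thing = namedtuple("Thing", ["source", "title", "score"])


def feed_generator(things, rng=random):
    """Yield every scored thing exactly once, picking the source at random."""
    remaining = [t for t in things if t.score and int(t.score) >= 1]
    while remaining:
        # Only consider sources that still have something to offer
        sources = sorted({t.source for t in remaining})
        source = rng.choice(sources)
        candidates = [t for t in remaining if t.source == source]
        weights = [int(t.score) for t in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        remaining.remove(pick)  # prevents the same thing from appearing twice
        yield pick


things = [
    Thing("reddit", "a", "10"),
    Thing("reddit", "b", "3"),
    Thing("hn", "c", "50"),
    Thing("hn", "d", None),  # no score, so it is filtered out
]

feed = list(feed_generator(things, random.Random(0)))
print([t.title for t in feed])

# islice takes "up to n" items and stops quietly if the generator runs dry
first_two = list(islice(feed_generator(things, random.Random(0)), 2))
```

Using `islice` instead of calling `next()` in a fixed-length loop avoids a `StopIteration` error when fewer items remain than requested.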

In [203]:
# Every time you run this, you will get fresh content using the generator
for _ in range(10):
    t = next(R)
    print(t)

[211] Two museums having an informative fight on Twitter (http://www.newstatesman.com/science-tech/social-media/2017/09/two-museums-are-having-fight-twitter-and-its-gloriously)
[16] Implementing the function composition operator in JavaScript (https://medium.com/@gigobyte/implementing-the-function-composition-operator-in-javascript-e2c4f1847d6a)
[164] A font for coders: "Input". I have never seen clearer curly brackets! (http://input.fontbureau.com/info/)
[12] NBA 2k18 save file is 5GB per profile on the Nintendo Switch (http://en-americas-support.nintendo.com/app/answers/detail/a_id/27434)
[92] Electron: The Bad Parts (https://hackernoon.com/electron-the-bad-parts-2b710c491547)
[153] Chrome to force .dev domains to HTTPS via preloaded HSTS (https://ma.ttias.be/chrome-force-dev-domains-https-via-preloaded-hsts/)
[54] Retro Arcade Racing Game - Programming from Scratch (Quick and Simple C++) (https://www.youtube.com/watch?v=KkMZI5Jbf18)
[3] Tutorial on visualizing convolutional neural n

In [222]:
# Or alternatively, you can get all the content in one go.
R = recommend_generator(all_content)
feed = list(R)
print("        | Your Feed")
print("———————— — — — —  —  —  —  —")
for t in feed[:10]:
    print("@{:6} | {}".format(t.source, t.title))
    print("{:6}p | {}".format(t.score, t.url))
    print("———————— — — — —  —  —  —  —")

        | Your Feed
———————— — — — —  —  —  —  —
@reddit | Chrome to force .dev domains to HTTPS via preloaded HSTS
   153p | https://ma.ttias.be/chrome-force-dev-domains-https-via-preloaded-hsts/
———————— — — — —  —  —  —  —
@reddit | Fully commented machine learning code, made with love by yours truly. (Python 3)
    30p | https://repl.it/LMlH
———————— — — — —  —  —  —  —
@hn     | Show HN: Colors – A data-driven collection of beautiful color palettes
    57p | https://klart.co/colors
———————— — — — —  —  —  —  —
@reddit | A font for coders: "Input". I have never seen clearer curly brackets!
   164p | http://input.fontbureau.com/info/
———————— — — — —  —  —  —  —
@hn     | Buffett wins $1M decade-old bet that the S&P500 would outperform hedgefunds
   291p | http://www.aei.org/publication/warren-buffett-wins-1m-bet-made-a-decade-ago-that-the-sp-500-stock-index-would-outperform-hedge-funds/
———————— — — — —  —  —  —  —
@hn     | Buffett wins $1M decade-old bet that the S&P500 would out

## Analyzing articles

Now that we have some content to play with, let's analyze it.

#### Step 1: Install and import the algorithmia package

In [224]:
# Install
!pip3 show algorithmia || pip3 install --user algorithmia

Name: algorithmia
Version: 1.0.8
Summary: Algorithmia Python Client
Home-page: http://github.com/algorithmiaio/algorithmia-python
Author: Algorithmia
Author-email: support@algorithmia.com
License: MIT
Location: /home/erb/.local/lib/python3.6/site-packages
Requires: enum34, requests, six


In [225]:
# Import
import Algorithmia

#### Step 2: Get and set up an API key

[Create an account](https://algorithmia.com/) on Algorithmia, log in, and put your API key below (it can be found [here](https://algorithmia.com/users/erb#credentials)).

In [226]:
API_KEY = 'simtICyX1Ng5PD33Bm479NS78Sq1'
client = Algorithmia.client(API_KEY)
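
If you plan to share the notebook, it may be safer to keep the key out of the code. One possible sketch reads it from an environment variable; the variable name `ALGORITHMIA_API_KEY` and the helper `get_api_key` are assumptions for illustration, not something the Algorithmia client requires.

```python
import os


def get_api_key(env_var="ALGORITHMIA_API_KEY"):
    """Read the API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(env_var))
    return key


# Usage, assuming the variable was exported before starting the notebook:
# client = Algorithmia.client(get_api_key())
```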

#### Step 3: Analyse it!

Now for the *really* cool part.

Here are a few useful algos you could do cool stuff with:

 - https://algorithmia.com/algorithms/nlp/SentimentByTerm
 - https://algorithmia.com/algorithms/nlp/SummarizeURL
 - https://algorithmia.com/algorithms/nlp/AutoTag
 - https://algorithmia.com/algorithms/tags/AutoTagURL
 - https://algorithmia.com/algorithms/StanfordNLP/NamedEntityRecognition

In [227]:
# Count which named entities appear most often in the post titles.
# Note that this doesn't work very well, probably because of the unusual
# capitalization used in titles, which the algorithm presumably relies upon.

import logging
from collections import Counter
from pprint import pprint

# Set the log level for the requests library to INFO
logging.getLogger("requests").setLevel(logging.INFO)


# Join all titles together into a sequence of sentences
titles = [c.title for c in all_content if c.title]
title_corpus = ". ".join(titles)

algo = client.algo('StanfordNLP/NamedEntityRecognition/0.2.0')
response = algo.pipe({"document": title_corpus})

c = Counter()
for sentence in response.result["sentences"]:
    for entity in sentence["detectedEntities"]:
        entity_type = entity["entity"]
        word = entity["word"]
        if entity_type in ["ORGANIZATION", "PERSON", "LOCATION"]:
            #print("{}: {}".format(entity_type, word))
            c[(entity_type, word)] += 1
            
pprint(c.most_common(10))

[(('ORGANIZATION', 'News'), 2),
 (('PERSON', 'Buffett'), 2),
 (('ORGANIZATION', 'S&P'), 2),
 (('LOCATION', 'Silicon'), 2),
 (('LOCATION', 'Valley'), 2),
 (('PERSON', 'Isaac'), 2),
 (('PERSON', 'Newton'), 2),
 (('LOCATION', 'California'), 2),
 (('LOCATION', 'China'), 2),
 (('PERSON', 'Richard'), 2)]
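
The counting logic can be tried without an API call by feeding it a hand-written response in the same shape the cell above reads (`sentences`, then `detectedEntities`, then `entity` and `word`). The sample data below is invented for illustration and is not real API output; `count_entities` is a hypothetical helper.

```python
from collections import Counter


def count_entities(result, wanted=("ORGANIZATION", "PERSON", "LOCATION")):
    """Count (entity_type, word) pairs, keeping only the wanted entity types."""
    counts = Counter()
    for sentence in result["sentences"]:
        for entity in sentence["detectedEntities"]:
            if entity["entity"] in wanted:
                counts[(entity["entity"], entity["word"])] += 1
    return counts


# Hand-written sample in the assumed response shape; not real API output
sample = {
    "sentences": [
        {"detectedEntities": [
            {"entity": "PERSON", "word": "Buffett"},
            {"entity": "MONEY", "word": "$1M"},
        ]},
        {"detectedEntities": [
            {"entity": "LOCATION", "word": "Silicon"},
            {"entity": "LOCATION", "word": "Valley"},
        ]},
        {"detectedEntities": [{"entity": "PERSON", "word": "Buffett"}]},
    ]
}

counts = count_entities(sample)
print(counts.most_common(3))
```

Note how "Silicon" and "Valley" are counted as two separate locations, just as in the real output above: entities are counted word by word.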


In [228]:
algo = client.algo('tags/AutoTagURL/0.1.9')
for c in all_content[:5]:
    print(c.title)
    response = algo.pipe(c.url)
    # The result maps each tag to its count; show the five most frequent
    pprint(Counter(response.result).most_common(5))
    print()

Chrome to force .dev domains to HTTPS via preloaded HSTS
[('reply', 9),
 ('development', 8),
 ('https', 4),
 ('practice', 3),
 ('browser', 2)]

Electron: The Bad Parts
[('electron', 22),
 ('things', 8),
 ('installer', 7),
 ('build', 6),
 ('modules', 6)]

A font for coders: "Input". I have never seen clearer curly brackets!
[('code', 19), ('bold', 8), ('indentation', 4), ('normal', 4), ('pixel', 4)]

Retro Arcade Racing Game - Programming from Scratch (Quick and Simple C++)
[('game', 3), ('enjoyment', 1), ('maths', 1)]

Implementing the function composition operator in JavaScript
[('compose', 7),
 ('javascript', 7),
 ('composition', 4),
 ('property', 3),
 ('mind', 2)]
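
The tagging cell turns each result (a mapping from tag to count) into a `Counter` to rank the tags. A small offline sketch of that step, using a few counts copied by hand from the output above (other tags omitted):

```python
from collections import Counter

# A few tag counts copied by hand from the output above; other tags omitted
tags_electron = {"electron": 22, "things": 8, "installer": 7}
tags_js = {"compose": 7, "javascript": 7, "composition": 4}

# Passing the mapping directly works for any hashable keys, whereas
# keyword unpacking (Counter(**d)) requires every key to be a string
per_article = Counter(tags_electron)
print(per_article.most_common(2))  # [('electron', 22), ('things', 8)]

# Counters can be added, which gives tag totals across several articles
combined = Counter(tags_electron) + Counter(tags_js)
print(combined["compose"])         # 7
```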



In [229]:
algo = client.algo('nlp/SummarizeURL/0.1.4')
for c in all_content[:5]:
    print(c.title)
    response = algo.pipe([c.url, 2])
    print("Summary:", response.result)
    print()

Chrome to force .dev domains to HTTPS via preloaded HSTS
Summary: tl;dr: one of the next versions of Chrome is going to force all domains ending on .dev (and .foo) to be redirected to HTTPs via a preloaded HTTP Strict Transport Security (HSTS) header. Although ultimately I’d say you just shouldn’t consider hostnames to be a secret.) Reply ↓ @Hanno I think it’s silly to pay a CA for a development certificate.

Electron: The Bad Parts
Summary: Electron probably has more than its share of the good — but also hides some dark secrets under its shining facade. While the first impression of Electron might be that it solves all the problems related to cross platform development the reality is that many things won’t work out of the box and they probably can’t.

A font for coders: "Input". I have never seen clearer curly brackets!
Summary: Input was drawn over an 11‑pixel grid. Input takes its aesthetic cues from monospaced fonts and pixel fonts designed for consoles and screens, but casts off t