# How to build your own content recommendation with Python

We crawl Reddit and Hacker News with Scrapy, then analyse what we find using machine learning algorithms hosted on Algorithmia.

## Scraping the web

#### Step 1: Install `scrapy`

In [1]:
# Install scrapy if not already installed
!pip3 show -q scrapy || pip3 install --user scrapy

In [2]:
import logging

import scrapy
logging.getLogger().setLevel(logging.WARNING)

from content import Content

# Here we import the spiders
# To understand them, read the scrapy docs.
# This is a good start: https://doc.scrapy.org/en/latest/topics/practices.html
from redditspider import RedditSpider
from hnspider import HNSpider
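The `content` module itself is not shown in this notebook. Judging only from how its objects are used further down (`.title`, `.url`, `.score`, and the `[score] title (url)` lines that get printed), a minimal stand-in might look like the sketch below. The class name `SketchContent` and its field types are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchContent:
    """Hypothetical stand-in for the Content class imported from content.py."""
    title: Optional[str]
    url: str
    score: Optional[str] = None  # assumed to be scraped as text, so it may be missing

    def __str__(self) -> str:
        # Mirrors the "[score] title (url)" lines printed later in the notebook
        return "[{}] {} ({})".format(self.score, self.title, self.url)


item = SketchContent(
    "Sublime Text 3 is out!",
    "http://www.sublimetext.com/blog/articles/sublime-text-3-point-0",
    "4835",
)
print(item)
```

Keeping the score as optional text matches the `score_key` helper used later, which converts it with `int()` and falls back to 0 when it is missing.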

In [3]:
#from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

SETTINGS = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
}

crawler = CrawlerProcess(SETTINGS)

logging.getLogger("scrapy.middleware").setLevel(logging.WARNING)
logging.getLogger("scrapy.statscollectors").setLevel(logging.WARNING)
logging.getLogger("scrapy.core.engine").setLevel(logging.WARNING)

crawler.crawl(RedditSpider)
crawler.crawl(HNSpider)
crawler.start()

# TODO: Remove duplicates
reddit_content = RedditSpider.content_found
hn_content = HNSpider.content_found

all_content = reddit_content + hn_content
print("Found {} things".format(len(all_content)))

INFO:scrapy.utils.log:Scrapy 1.4.0 started (bot: scrapybot)
2017-09-15 19:23:03 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2017-09-15 19:23:03 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


Found 110 things
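
The cell above leaves de-duplication as a TODO. One possible approach, sketched here under the assumption that two items are duplicates when they share a URL, is to keep only the highest-scoring copy of each URL. `Item` and `dedupe_by_url` are hypothetical names, with `Item` standing in for the scraped objects.

```python
from collections import namedtuple

# Hypothetical stand-in for the scraped items; only .title, .url and .score are assumed
Item = namedtuple("Item", ["title", "url", "score"])


def dedupe_by_url(items):
    """Drop items with a repeated URL, keeping the highest-scoring copy of each."""
    best = {}
    for item in items:
        score = int(item.score) if item.score else 0
        kept = best.get(item.url)
        if kept is None or score > kept[0]:
            best[item.url] = (score, item)
    # dicts preserve insertion order (Python 3.7+), so first-seen order is kept
    return [item for _, item in best.values()]


sample = [
    Item("WordPress abandoning React due to Facebook patent clause",
         "https://ma.tt/2017/09/on-react-and-wordpress/", "2518"),
    Item("WordPress abandoning React due to Facebook patent clause",
         "https://ma.tt/2017/09/on-react-and-wordpress/", "2515"),
    Item("Sublime Text 3 is out!",
         "http://www.sublimetext.com/blog/articles/sublime-text-3-point-0", "4835"),
]
unique = dedupe_by_url(sample)
print("Found {} things after de-duplication".format(len(unique)))
```

In the notebook this could be applied as `all_content = dedupe_by_url(reddit_content + hn_content)`. Matching on the exact URL string will miss links that differ only in tracking parameters or a trailing slash.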


In [4]:
reddit_things = RedditSpider.content_found

score_key = lambda t: int(t.score) if t.score else 0

print("========")
print(" Reddit ")
print("--------")
for t in sorted(reddit_things, key=score_key, reverse=True)[:10]:
    print(t)
    
hn_things = HNSpider.content_found
    
print("=============")
print(" Hacker News ")
print("-------------")
for t in sorted(hn_things, key=score_key, reverse=True)[:10]:
    print(t)

 Reddit 
--------
[4835] Sublime Text 3 is out! (http://www.sublimetext.com/blog/articles/sublime-text-3-point-0)
[2518] WordPress abandoning React due to Facebook patent clause (https://ma.tt/2017/09/on-react-and-wordpress/)
[2515] WordPress abandoning React due to Facebook patent clause (https://ma.tt/2017/09/on-react-and-wordpress/)
[2372] Build a working game of Tetris in Conway's Game of Life (https://codegolf.stackexchange.com/questions/11880/build-a-working-game-of-tetris-in-conways-game-of-life)
[1783] Main is usually a function. So then when is it not? (http://jroweboy.github.io/c/asm/2015/01/26/when-is-main-not-a-function.html)
[1506] The hackers who broke into Equifax exploited a flaw in Struts (https://qz.com/1073221/the-hackers-who-broke-into-equifax-exploited-a-nine-year-old-security-flaw/)
[1265] Atom (Github) announces atom-ide (http://blog.atom.io/2017/09/12/announcing-atom-ide.html)
[1159] Is StubHub's Website Deceiving users? (http://jordancolburn.com/2017/09/11/stub

## Analyzing articles

Now that we have some content to play with, let's analyse it.

#### Step 1: Install and import the algorithmia package

In [5]:
# Install
!pip3 show -q algorithmia || pip3 install --user algorithmia

In [6]:
# Import
import Algorithmia

#### Step 2: Get and set up an API key

Create an account on Algorithmia, log in, and put your API key below (it can be found [here](https://algorithmia.com/users/erb#credentials)).

In [7]:
API_KEY = 'simtICyX1Ng5PD33Bm479NS78Sq1'
client = Algorithmia.client(API_KEY)
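
Hard-coding a key in a notebook makes it easy to publish by accident. A common alternative, sketched here, is to read it from an environment variable. The variable name `ALGORITHMIA_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something the Algorithmia client requires.

```python
import os


def load_api_key(var_name="ALGORITHMIA_API_KEY"):
    """Read the API key from an environment variable (the variable name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(var_name))
    return key


# Usage in the notebook, replacing the hard-coded key:
# client = Algorithmia.client(load_api_key())
```

Failing loudly when the variable is missing avoids sending requests with an empty key and getting a less obvious error back later.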

#### Step 3: Analyse the content

Now for the *really* cool part.

Here are a few useful algos you could do cool stuff with:

 - https://algorithmia.com/algorithms/nlp/SentimentByTerm
 - https://algorithmia.com/algorithms/nlp/SummarizeURL
 - https://algorithmia.com/algorithms/nlp/AutoTag
 - https://algorithmia.com/algorithms/tags/AutoTagURL
 - https://algorithmia.com/algorithms/StanfordNLP/NamedEntityRecognition

In [8]:
# See which terms tend to appear in the post titles
# Note that this doesn't work very well. 
# Probably because of the weird capitalization used in titles, which the algorithm presumably relies upon.

from collections import Counter
from pprint import pprint

logging.getLogger("requests").setLevel(logging.INFO)


# Join all titles together into a sequence of sentences
titles = [c.title for c in all_content if c.title]
title_corpus = ". ".join(titles)

algo = client.algo('StanfordNLP/NamedEntityRecognition/0.2.0')
response = algo.pipe({"document": title_corpus})

c = Counter()
for sentence in response.result["sentences"]:
    for entity in sentence["detectedEntities"]:
        entity_type = entity["entity"]
        word = entity["word"]
        if entity_type in ["ORGANIZATION", "PERSON", "LOCATION"]:
            #print("{}: {}".format(entity_type, word))
            c[(entity_type, word)] += 1
            
pprint(c.most_common(10))

[(('ORGANIZATION', 'Facebook'), 4),
 (('ORGANIZATION', 'Equifax'), 3),
 (('ORGANIZATION', 'Huffman'), 2),
 (('ORGANIZATION', 'Firefox'), 2),
 (('ORGANIZATION', 'Google'), 2),
 (('PERSON', 'Python'), 2),
 (('LOCATION', 'Spain'), 1),
 (('ORGANIZATION', 'Direct'), 1),
 (('PERSON', 'Patreon'), 1),
 (('PERSON', 'Euler'), 1)]
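
The counting logic in the cell above can be pulled into a small function, which makes it easy to try out without calling the API. The sketch below assumes only the response shape the cell already reads (`sentences`, `detectedEntities`, `entity`, `word`). The sample data is made up, and the `"O"` label for non-entities is an assumption.

```python
from collections import Counter


def count_entities(result, wanted=("ORGANIZATION", "PERSON", "LOCATION")):
    """Tally (entity_type, word) pairs for the entity types we care about."""
    counts = Counter()
    for sentence in result.get("sentences", []):
        for entity in sentence.get("detectedEntities", []):
            if entity["entity"] in wanted:
                counts[(entity["entity"], entity["word"])] += 1
    return counts


# Made-up sample in the shape the cell above reads from response.result
sample_result = {"sentences": [
    {"detectedEntities": [{"entity": "ORGANIZATION", "word": "Facebook"},
                          {"entity": "O", "word": "patent"}]},
    {"detectedEntities": [{"entity": "ORGANIZATION", "word": "Facebook"},
                          {"entity": "LOCATION", "word": "Spain"}]},
]}
entity_counts = count_entities(sample_result)
print(entity_counts.most_common(2))
```

Returning the counter under a descriptive name also avoids reusing `c`, which the later cells rebind as a loop variable.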


In [9]:
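# Create the algorithm handle once and reuse it for every URL
algo = client.algo('tags/AutoTagURL/0.1.9')
for content in all_content[:5]:
    print(content.title)
    response = algo.pipe(content.url)
    # Counter accepts the tag -> count mapping directly; no need to unpack it
    pprint(Counter(response.result).most_common(5))
    print()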

Sublime Text 3 is out!
[('touch', 2), ('highlighting', 2), ('team', 2), ('license', 2), ('support', 2)]

WordPress abandoning React due to Facebook patent clause
[('vue', 53), ('community', 19), ('matt', 15), ('developers', 13), ('happy', 9)]

Build a working game of Tetris in Conway's Game of Life
[('address', 32), ('bit', 29), ('subroutine', 28), ('clock', 16), ('cells', 16)]

Main is usually a function. So then when is it not?
[('code', 17), ('find', 7), ('works', 5), ('section', 5), ('dump', 3)]

The hackers who broke into Equifax exploited a flaw in Struts
[('apache', 3), ('researchers', 3), ('company', 2), ('personal', 2), ('bug', 2)]
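
To see which tags dominate across several articles rather than one at a time, the per-article results can be merged into a single counter. This is a minimal sketch that assumes each result is a tag-to-count mapping, as the cell above implies. `merge_tag_counts` is a hypothetical helper, and the sample holds only the top five tags printed above for three of the articles.

```python
from collections import Counter


def merge_tag_counts(per_article_tags):
    """Sum tag counts over a list of tag -> count mappings."""
    total = Counter()
    for tags in per_article_tags:
        total.update(tags)  # Counter.update adds counts instead of replacing them
    return total


# Top-5 tags copied from the output above (three of the five articles)
sample_tags = [
    {"touch": 2, "highlighting": 2, "team": 2, "license": 2, "support": 2},
    {"vue": 53, "community": 19, "matt": 15, "developers": 13, "happy": 9},
    {"apache": 3, "researchers": 3, "company": 2, "personal": 2, "bug": 2},
]
merged = merge_tag_counts(sample_tags)
print(merged.most_common(3))
```

Raw counts favour long pages, so a single tag-heavy article can dominate the merged list. Normalising each article's counts first would be one way to soften that.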



In [10]:
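# Create the algorithm handle once and reuse it for every URL
algo = client.algo('nlp/SummarizeURL/0.1.4')
for content in all_content[:5]:
    print(content.title)
    # The second element (2) is presumably the number of sentences to return
    response = algo.pipe([content.url, 2])
    print("Summary:", response.result)
    print()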

Sublime Text 3 is out!
Summary: Sublime Text 3.0 is out. I wanted to highlight some of the changes from Sublime Text 2 here, however it's surprisingly hard: virtually every aspect of the editor has been improved in some way, and even a list of the major changes would be too long.

WordPress abandoning React due to Facebook patent clause
Summary: Big companies like to bury unpleasant news on Fridays: A few weeks ago, Facebook announced they have decided to dig in on their patent clause addition to the React license, even after Apache had said it’s no longer allowed for Apache.org projects. Wonder if they will look at Vue.

Build a working game of Tetris in Conway's Game of Life
Summary: Here is a theoretical question - one that doesn't afford an easy answer in any case, not even the trivial one. Your program will receive input by manually changing the state of the automaton at a specific generation to represent an interrupt (e.g.

Main is usually a function. So then when is it not?
Summ