newslynx/pageone

pageone

A module for polling URLs and stats from homepages.

Install

$ pip install pageone

Tests

Requires nose

$ nosetests

Usage

pageone does two things: it extracts article URLs from a site's homepage, and it uses selenium and phantomjs to find the relative positions of those URLs on the page.

pageone provides a single interface:

import pageone

for link in pageone.get('http://www.propublica.org/', pattern='.*article.*'):
    print(link)

Here, pattern is a regular expression used to identify which URLs are articles. If newslynx is installed and pattern is not provided, pageone defaults to newslynx.lib.url.is_article, which uses a series of heuristics to determine whether a URL is an article.
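To get a feel for what a pattern such as '.*article.*' accepts, it can be tried against a few URLs with Python's re module before handing it to pageone. This is only a sketch: filter_article_urls is a hypothetical helper, the two URLs are made up for illustration, and it assumes the pattern is applied with re.match, which this README does not confirm.

```python
import re


def filter_article_urls(urls, pattern):
    """Keep the URLs that the regex matches, anchored at the start of the string.

    Assumption: matching is done with re.match; pageone's own matching
    logic is not documented here and may differ.
    """
    regex = re.compile(pattern)
    return [url for url in urls if regex.match(url)]


# Hypothetical URLs, used only to illustrate the pattern.
urls = [
    'http://www.propublica.org/article/example-story',
    'http://www.propublica.org/about/',
]

# Only the first URL contains "article", so only it is kept.
print(filter_article_urls(urls, '.*article.*'))
```

A pattern that is too loose will let navigation or section links through, so it is worth checking it against a handful of real links from the target homepage.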

pageone.get returns a list of dictionaries that look like this:

{
 'bucket': 8,
 'bucket_size': 200,
 'datetime': datetime.datetime(2015, 10, 6, 20, 21, 22, 422478),
 'domain': 'www.propublica.org',
 'font_size': 14,
 'n_links': 1,
 'page': 'http://www.propublica.org/',
 'text': u'The Stories of Everyday Lives, Hidden in Reams of Data',
 'url': u'https://www.propublica.org/nerds/item/the-stories-of-everyday-lives-hidden-in-reams-of-data/',
 'visible': True,
 'x': 61,
 'x_bucket': 1,
 'y': 1578,
 'y_bucket': 8
}

Here, the bucket variables describe where a link falls on a grid of 200x200 pixel cells. x_bucket counts cells from left to right, y_bucket counts them from top to bottom, and bucket moves from the top-left to the bottom-right. You can customize the size of the grid cells by passing bucket_pixels to get, e.g.:

import pageone

for link in pageone.get('http://www.propublica.org/', bucket_pixels=100, pattern='.*article.*'):
    print(link)
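The grid arithmetic can be sketched as follows. The formula is an assumption inferred from the sample record above (x=61 and y=1578 with a bucket size of 200 give x_bucket=1 and y_bucket=8), not taken from pageone's source, and grid_position is a hypothetical helper. How the combined bucket value is derived is not stated in this README, so it is left out.

```python
def grid_position(x, y, bucket_pixels=200):
    """Return (x_bucket, y_bucket) for a pixel position.

    Assumption: cells are counted from 1, which is consistent with the
    sample record in this README but not confirmed by it.
    """
    x_bucket = x // bucket_pixels + 1
    y_bucket = y // bucket_pixels + 1
    return x_bucket, y_bucket


# Default 200-pixel cells, using x and y from the sample record.
print(grid_position(61, 1578))        # (1, 8)

# Smaller cells give a finer grid, so the same link lands in a higher row.
print(grid_position(61, 1578, 100))   # (1, 16)
```

Smaller values of bucket_pixels separate links that sit close together on the page, at the cost of more distinct buckets to compare.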

PhantomJS

pageone requires phantomjs to run pageone.get(). pageone defaults to looking for phantomjs in /usr/bin/phantomjs, but if you want to specify another path, pass in phantom_path to pageone.get:

import pageone

for link in pageone.get('http://www.propublica.org/', pattern='.*article.*', phantom_path="/usr/bin/phantomjs"):
    print(link)
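If phantomjs may live in different places on different machines, the path can be resolved before calling pageone.get. The sketch below is not part of pageone: find_phantomjs is a hypothetical helper, and /usr/local/bin/phantomjs is just an example of an alternative location.

```python
import os


def find_phantomjs(candidates=('/usr/bin/phantomjs', '/usr/local/bin/phantomjs')):
    """Return the first candidate that is an executable file, or None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


phantom_path = find_phantomjs()
if phantom_path is None:
    print('phantomjs not found; install it or pass phantom_path explicitly')
else:
    print('using phantomjs at ' + phantom_path)
```

The resulting value can then be passed as phantom_path to pageone.get, as in the example above.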
