Lots of people have been asking for a way to get the NZ Govt Flags out of the govt.nz site in a machine parsable format.
This is an extension of @fogonwater's getflags tool. It uses BeautifulSoup to get the images and metadata of the flag submissions, and store them in a SQLite DB.
It then serves them up over a JSON API via a tiny Flask application.
Assumes you have:
- python
- setuptools/pip
- nodejs
- memcached (default is a
memcached.sock
in your home directory)
npm install
Note that the npm
tasks are aliased to the generic version of python
in your
shell, for convenience. If you want to run it using python2.7
for instance,
you're better off running the raw python commands:
easy_install-2.7 install
python2.7 scraper scrape
python2.7 api
First, you'll need to populate the DB. The scraper has to traverse every submission on the first run. At the time of writing, there's about 850 of them, so it takes a while.
Subsequent scrapes are much faster, since it traverses the listing pages first, to harvest new submissions.
python scraper scrape # This will take a while
python api # API on port 5000 for you
You can then get all the flag submissions by visiting the API in your browser
http://localhost:5000/api/
You can also get individual submissions by passing an ID:
http://localhost:5000/api/610
Or search (it's just using LIKE
) by tags and descriptions:
/api/tag/<string:name>
/api/description/<string:name>
{
"description": "Bird accualy is kiwi. Is well known icon of our country and people. Also stars are southern cross and blue is pacific ocean. Where we're from and how we got here.",
"designer": "Aku A.",
"designer_location": "Waikato",
"id": 4338,
"image_path": "http://localhost:3000/static/submissions/4338.jpg",
"last_scraped": "2015-05-16T15:45:03.146103+00:00",
"suggested_by": null,
"suggested_by_location": null,
"tags": [
"blue",
"red",
"white",
"kiwi",
"Southern Cross",
"history",
"Kiwiana",
"Ocean",
"brown"
],
"title": "Southern Kiwi",
"url": "https://www.govt.nz/browse/engaging-with-government/the-nz-flag-your-chance-to-decide/gallery/design/4338"
}
Image of flag submissions are saved via urllib
to the static/submissions
directory. You should configure your wsgi app to serve these directly.
python api # runs the Flask API app on port 5000
python scraper scrape # scrapes the govt.nz site for flags
python scraper # List all available commands
npm start # runs the dev server on port 5000
npm run scrape # scrapes the govt.nz site for flags
npm run dump # dumps the results out as JSON `var/submissions.json`
npm run migrate # populate the DB from JSON file `var/submissions.json`
npm run drop # drops the table
npm run cron # scheduled task to scrape the govt site every hour
Example wsgi script:
import sys
import os
sys.path.insert(1, os.path.join(sys.path[0], '.'))
from api.app import app as application
drop an env.py
in the root of your project to override any configuration
options from config.py
.
The app stores the listing and the individual flag items using memcached
.
The default cache time is 60 minutes.
To make sure you've got the freshest, juiciest flags, run this task to scrape the site every couple of hours.
npm run cron
- Write better docs
- Add some caching
- Move the static assets to a CDN