This is a repo for Programming for Cultural Heritage


This is a repo for a Pratt SILS/SI/MLS/LIS class, 'Programming for Cultural Heritage'

I found a digital archive of almost 5,000 vintage circus posters hosted at the Ringling Museum site. It is amazing.

Notes on the content

Metadata:
- The date field is seriously lacking in content
- Just about zero attempt at a controlled vocabulary

Steps in Python:
- BeautifulSoup implementation to scrape each individual poster's web page
- Luckily, the site uses an incremental URL system, so the code can step through pages with a simple +1 counter
- HTML structured ....
- Structured the resulting metadata fields into a long list of ~5k dictionaries
- Print to a JSON file

The archive will break if it gets too much traffic. A full scrape of the ~5k pages takes somewhere between 4 and 8 hours.
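The steps above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the scrape: the URL pattern, the `<dt>`/`<dd>` page layout, the 3-second delay, and every function name are assumptions made for the example, since the real page structure is not described here.

```python
import json
import time
import urllib.request

# Hypothetical URL pattern; the archive's real address scheme is not given here.
URL_TEMPLATE = "https://example.org/posters/{number}"


def fetch_html(url):
    """Download one page, returning None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return None


def parse_poster(html):
    """Turn one poster page into a dictionary of metadata fields.

    Assumes (hypothetically) that each field is a <dt> label
    followed by a <dd> value.
    """
    # Imported here so the rest of the sketch runs without the
    # third-party package (pip install beautifulsoup4).
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for label in soup.find_all("dt"):
        value = label.find_next_sibling("dd")
        if value is not None:
            key = label.get_text(strip=True).lower()
            record[key] = value.get_text(strip=True)
    return record


def scrape_range(first, last, fetch=fetch_html, parse=parse_poster, delay=3.0):
    """Visit pages first..last in order, pausing between requests."""
    records = []
    for number in range(first, last + 1):
        html = fetch(URL_TEMPLATE.format(number=number))
        if html is not None:  # skip pages that could not be retrieved
            records.append(parse(html))
        time.sleep(delay)  # keep traffic low so the archive stays up
    return records


def save_records(records, path):
    """Write the list of dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2, ensure_ascii=False)
```

With an assumed 3-second pause, about 5,000 pages means roughly 4 hours of waiting alone, which is in line with the 4 to 8 hour estimate above. A run might look like `save_records(scrape_range(1, 5000), "posters.json")`, where both the range and the file name are placeholders.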

JSON file reading:
- Python to read through the list of dictionaries and pull a key value (title) from each dictionary in the list
- Python with a regular expression to pull all words from the above key value (title) list that appear before a colon
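A small sketch of these two reading steps, assuming the JSON file holds a list of dictionaries that each may have a `title` key. The function names and the choice of `\w+` as the definition of a "word" are illustrative assumptions, not the original script.

```python
import json
import re

# Everything up to (but not including) the first colon in a title.
BEFORE_COLON = re.compile(r"([^:]*):")


def load_records(path):
    """Read the list of poster dictionaries back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def titles_from_records(records):
    """Pull the title value from each dictionary that has one."""
    return [record["title"] for record in records if record.get("title")]


def words_before_colon(title):
    """Return the words that appear before the first colon, if any."""
    match = BEFORE_COLON.match(title)
    if match is None:  # no colon in this title
        return []
    return re.findall(r"\w+", match.group(1))
```

Titles without a colon yield an empty list, so they can be filtered out or counted separately. Note that `\w+` drops punctuation such as `&`, which may or may not be what is wanted for poster titles.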
