This project started at the Code the City 19 History and Data event.
Its purpose is to gather data on Aberdeen-built ships, with the permission of the site's owners, and to push that data in bulk onto Wikidata as open data, with links back to the Aberdeen Ships site via a new identifier.
So far the following has been accomplished.
- The script get_ids.py gathers all the ship IDs from Aberdeen-built ships and writes them to ids.txt.
- The script get_details.py uses the IDs from ids.txt and scrapes the full ship information from Aberdeen-built ships and writes it to the file ships.json.
- The file query.rq contains code to execute a query on Wikidata Query Service to get the QID and name of every ship on Wikidata. This has been manually downloaded as all_wd_ships.json.
- The file ship_builders.py checks ships.json and constructs a list of all ship builders and a frequency count of their appearance, writing it out to ship_builders.csv.
- The file already_in_wd.py checks the ship names in ships.json, cross-matches them against all_wd_ships.json, and generates a list of ships whose name indicates that they MAY already be in Wikidata.
- An identifier for Aberdeen Built Ships, P8260, has now been created in Wikidata.
- The list of all 904 ships that possibly existed in Wikidata has been manually checked. The resulting list of 59 positive matches gives a one-to-one mapping between Wikidata entries and ABS IDs.
- The file ship_types.py checks ships.json and constructs a list of all ship types and a frequency count of their appearance, writing it out to ship_types.csv. Currently there are 207 distinct types of ship!!
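The frequency counts described for ship_builders.py and ship_types.py could be sketched roughly as below. This is only an illustration, not the project's actual code: it assumes ships.json holds a list of objects, and the field name `"type"` and the sample records are hypothetical.

```python
import csv
import io
from collections import Counter


def count_field(ships, field):
    """Count how often each non-blank value of `field` appears in the records."""
    counts = Counter()
    for ship in ships:
        value = (ship.get(field) or "").strip()
        if value:  # ignore records where the field is missing or blank
            counts[value] += 1
    return counts


def counts_to_csv(counts):
    """Render the counts as CSV text, most frequent value first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["value", "count"])
    for value, count in counts.most_common():
        writer.writerow([value, count])
    return buffer.getvalue()


# Hypothetical records standing in for the contents of ships.json
sample = [
    {"name": "A", "type": "Schooner"},
    {"name": "B", "type": "Brig"},
    {"name": "C", "type": "Schooner "},  # trailing space is stripped
    {"name": "D", "type": ""},           # blank value is skipped
]
type_counts = count_field(sample, "type")
print(counts_to_csv(type_counts))
```

Stripping whitespace only removes the most trivial duplicates; spelling variants would still appear as separate rows and need manual rationalising.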
Some core data was imported into Wikidata for most of the ships. Some ships were excluded from the import because the name field was blank, UNKNOWN or UNNAMED.
I initially tried to use the CSV format for Wikidata QuickStatements but couldn't get it to work, so I switched to the TSV version. A Python script writes the QuickStatements file, which can then be copied into the QuickStatements batch import tool. The import had 2 errors, for ships whose Date field held a range of years and so produced invalid dates in the QuickStatements file. These (and 2 duplicates that I noticed after the import) are noted for later correction.
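A script of the kind described above might look roughly like the following sketch. It is not the script that was actually used: the record keys (`"id"`, `"name"`, `"date"`), the sample records, and the choice of inception (P571) for the date are all assumptions made for illustration. It skips ships with blank, UNKNOWN or UNNAMED names and leaves out any date that is not a single four-digit year, which is one way of avoiding the invalid dates mentioned above.

```python
import re

SKIP_NAMES = {"", "UNKNOWN", "UNNAMED"}


def year_to_qs_date(raw):
    """Return a year-precision QuickStatements date, or None if `raw` is not a single year."""
    text = (raw or "").strip()
    if re.fullmatch(r"\d{4}", text):
        return f"+{text}-00-00T00:00:00Z/9"  # /9 means year precision
    return None  # e.g. a range such as "1850-1852" needs manual handling


def ship_to_commands(ship):
    """Build the tab-separated QuickStatements V1 lines that create one ship item."""
    name = (ship.get("name") or "").strip()
    if name.upper() in SKIP_NAMES:
        return []  # excluded from the import
    lines = [
        "CREATE",
        f'LAST\tLen\t"{name}"',           # English label
        "LAST\tP31\tQ11446",              # instance of: ship
        "LAST\tP1071\tQ36405",            # location of creation: Aberdeen
        f'LAST\tP8260\t"{ship["id"]}"',   # Aberdeen Built Ships ID
    ]
    date = year_to_qs_date(ship.get("date"))
    if date:
        lines.append(f"LAST\tP571\t{date}")  # inception (assumed property choice)
    return lines


# Hypothetical records; real names containing double quotes would need escaping
sample = [
    {"id": "0001", "name": "EXAMPLE ONE", "date": "1868"},
    {"id": "0002", "name": "UNNAMED", "date": "1870"},
    {"id": "0003", "name": "EXAMPLE THREE", "date": "1850-1852"},
]
batch = [line for ship in sample for line in ship_to_commands(ship)]
print("\n".join(batch))
```

The printed text is what would be pasted into the batch import tool. Ships with an unparseable date are still created, just without a date statement, so they can be found and completed later.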
The ABS ID property (P8260) was manually added to the ships that already existed in Wikidata.
The mapping between QID and ABS ID was obtained with this SPARQL query:
SELECT ?qid ?absid
WHERE
{
?qid wdt:P8260 ?absid.
}
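If the result of this query is downloaded as JSON, it could be turned into a lookup table along the lines of the sketch below. It assumes the standard SPARQL JSON results layout (`results` / `bindings`); the function name and the sample values are placeholders, not real ship data.

```python
def bindings_to_mapping(result):
    """Map ABS ID -> QID from a SPARQL JSON result for the query above."""
    mapping = {}
    for row in result["results"]["bindings"]:
        # ?qid comes back as a full entity URI; keep only the trailing Q-number
        qid = row["qid"]["value"].rsplit("/", 1)[-1]
        mapping[row["absid"]["value"]] = qid
    return mapping


# Placeholder result in the SPARQL JSON results format (values are made up)
sample_result = {
    "head": {"vars": ["qid", "absid"]},
    "results": {
        "bindings": [
            {
                "qid": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1001"},
                "absid": {"type": "literal", "value": "0001"},
            },
            {
                "qid": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1002"},
                "absid": {"type": "literal", "value": "0002"},
            },
        ]
    },
}
abs_to_qid = bindings_to_mapping(sample_result)
print(abs_to_qid)
```

Keying the table on the ABS ID makes it easy to check, for any record in ships.json, whether an item already exists before creating a new one.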
To complete the project the following needs to be done:
- Rationalise all ship builders that exist in ship_builders.csv: deduplicate them and create Wikidata entries for each one we will use.
- Rationalise all ship types that exist in ship_types.csv: deduplicate them and create Wikidata entries for each one we will use.
- Extract/rationalise data from some of the fields, e.g. we have one dimensions field rather than separate fields for length/beam/draft/..., and what's there is inconsistent.
- Isolate ships that have no Wikidata identifier - i.e. any one not in the list of 59 positive matches. Set aside those which have entries for later processing.
- Decide on the best route for bulk upload, e.g. QuickStatements. This may be useful: Wikidata Import Guide
- Agree a core set of data for each ship that will be parsed from ships.json and added to Wikidata. See Wikidata Ship Properties below.
- Create a script to output text that can be dropped into a CSV or other file to be used by QuickStatements (assuming that to be the right tool) for bulk input, ensuring links for shipbuilder IDs and ABS identifiers are used.
- Deal with adding data to the 59 existing Wikidata entries.
- Develop a means of monitoring both the original ABS system (rescrape periodically and diff the resulting file in some way?) and Wikidata for changes to the ship records (a Wikidata query, executed periodically, generating a CSV download that is checked for differences from previous runs?), to feed back to ABS.
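For the monitoring idea in the last item, one possible approach is to compare two scrapes record by record. The sketch below is only a suggestion: it assumes each scrape is a list of objects with a unique `"id"` key, and the sample records are made up.

```python
def diff_scrapes(old, new, key="id"):
    """Compare two lists of ship records and report added, removed and changed IDs."""
    old_by_id = {ship[key]: ship for ship in old}
    new_by_id = {ship[key]: ship for ship in new}
    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    changed = sorted(
        ship_id
        for ship_id in old_by_id.keys() & new_by_id.keys()
        if old_by_id[ship_id] != new_by_id[ship_id]  # any field differs
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical earlier and later scrapes
previous = [
    {"id": "0001", "name": "EXAMPLE ONE"},
    {"id": "0002", "name": "EXAMPLE TWO"},
]
current = [
    {"id": "0001", "name": "EXAMPLE ONE (renamed)"},
    {"id": "0003", "name": "EXAMPLE THREE"},
]
report = diff_scrapes(previous, current)
print(report)
```

The same comparison would work on the rows of a periodic Wikidata query download, keyed on QID instead of ABS ID.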
The following have been identified as potential Wikidata statements that we need to consider using. Not all ships will have all data available. Core ones have (*) after them.
- Label (*)
- Description (*)
- Instance of (P31) (*)
  - Ship or, if available, a subclass such as:
    - Schooner
    - Clipper
    - Whaler
    - Brig etc.
  - Note: this can have multiple values.
- Name (P2561) - or official name (P1448)?? - (*) could have multiple values with dates for start + end
- Aberdeen Built Ships ID (P8260) (*)
- Significant event (P793). Possible values include order (Q566889), keel laying (Q14592615), ceremonial ship launching (Q596643), ship decommissioning (Q7497952) and shipwrecking (Q906512), but also sea voyage etc., each with point in time (P585). Voyages could have a destination and start and end dates. Also destruction, breaking up etc.
- Cost (P2130)
- Mass (P2067)
- Gross Tonnage (P1093)
- Length (P2043)
- Beam (P2261)
- Draft (P2262)
- Number of masts (P1099)
- Speed (P2052)
- Manufacturer (P176) - take values from table of Ship builders
- location of creation (P1071) - Aberdeen (Q36405) (*)
- Country of origin (P495) - Kingdom of Great Britain 1707-1801, UK of Great Britain and Ireland 1801-1927, UK (1927-) (*)
- Service entry (P729)
- Service Retirement (P730)
- Described at URL (P973) with a link to ABS (maybe not given we'll have specific ABS ID)
- Country of Registry (P8047) - could have numerous values / dates
- Home port (P504) - could have numerous values / dates
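The date-dependent Country of origin value above could be chosen with a small helper like the one sketched here. It rests on assumptions: the Kingdom of Great Britain is taken to start in 1707 (the Acts of Union), the boundary years 1801 and 1927 are assigned to the later state, and the function returns plain labels because the matching Wikidata items would still need to be looked up.

```python
def country_of_origin(year):
    """Return the label of the state a ship built in Aberdeen in `year` would belong to."""
    if year < 1707:
        return None  # before the Kingdom of Great Britain; not covered by the list above
    if year < 1801:
        return "Kingdom of Great Britain"
    if year < 1927:
        return "United Kingdom of Great Britain and Ireland"
    return "United Kingdom"


for build_year in (1750, 1801, 1868, 1927, 1960):
    print(build_year, country_of_origin(build_year))
```

Returning `None` for earlier years leaves those ships to be handled manually rather than guessing a value.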