Releases: DavidUnderdown/DiscoveryAPI

Resolving mismatch between code and originally distributed exe in v2.1

03 Mar 13:50

Release v2.1 was inadvertently made without committing the last revisions that should have been included, although the exe originally included with v2.1 was built from that code. This release resolves the mismatch.

The input CSV file now has regex and excel_sheet_name columns. The regex column allows users to give an arbitrary regex for descriptions where the autobuilt regex based on labels in the Discovery description won't work or is not what is wanted. If a regex is supplied, any label list also given will be ignored. The excel_sheet_name column allows specific names to be given for Excel worksheets where Excel output has been chosen (if output will be CSV, any sheet names given will be ignored). Note that characters known not to be permitted in sheet names will be removed, and that there is a hard limit on sheet name length of 30 characters: any sheet names supplied that are longer than that will be truncated. Names must also be unique within a workbook. Any non-unique name will have n appended (or, for names already at the 30 character limit, naming will revert to Sheetn), where n is the number of sheets in the workbook (including the current sheet). If three attempts to produce a unique name fail, the script will terminate with an error.
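The sheet naming rules above can be illustrated with a minimal sketch. This is not the code from the script: the function names, the exact set of stripped characters (the ones Excel is known to reject) and the reading of "at the 30 character limit" as "appending n would exceed the limit" are all assumptions made for illustration.

```python
import re

MAX_LEN = 30  # limit stated in these release notes
# Characters Excel does not allow in worksheet names: \ / ? * [ ] :
FORBIDDEN = re.compile(r"[\\/?*\[\]:]")


def clean_sheet_name(name):
    """Remove forbidden characters, then truncate to the length limit."""
    return FORBIDDEN.sub("", name)[:MAX_LEN]


def unique_sheet_name(requested, existing, max_attempts=3):
    """Return a name not already in `existing`, or raise after three tries.

    `existing` holds the names already used in the workbook (hypothetical
    helper argument, not taken from the script).
    """
    n = len(existing) + 1  # number of sheets including the current one
    candidate = clean_sheet_name(requested)
    for _ in range(max_attempts):
        if candidate not in existing:
            return candidate
        if len(candidate) + len(str(n)) > MAX_LEN:
            # No room left to append n, so fall back to Sheetn
            candidate = f"Sheet{n}"
        else:
            candidate = f"{candidate}{n}"
    raise ValueError(f"Could not produce a unique sheet name for {requested!r}")
```

For example, under these assumptions `unique_sheet_name("Data", ["Data"])` would give `Data2`, since the workbook would then contain two sheets.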

Provide facility for user to input arbitrary regex, and name sheets in Excel output

03 Mar 12:08
ccd273f

This release addresses the two issues in the 2.1 Milestone.

This release inadvertently omitted some code that was intended to be included, and which was used in building the exe originally included in this release. This release is superseded by v2.1.1.

The input CSV file now has regex and excel_sheet_name columns. The regex column allows users to give an arbitrary regex for descriptions where the autobuilt regex based on labels in the Discovery description won't work or is not what is wanted. If a regex is supplied, any label list also given will be ignored. The excel_sheet_name column allows specific names to be given for Excel worksheets where Excel output has been chosen (if output will be CSV, any sheet names given will be ignored). Note that characters known not to be permitted in sheet names will be removed, and that there is a hard limit on sheet name length of 30 characters: any sheet names supplied that are longer than that will be truncated. Names must also be unique within a workbook. Any non-unique name will have n appended (or, for names already at the 30 character limit, naming will revert to Sheetn), where n is the number of sheets in the workbook (including the current sheet). If three attempts to produce a unique name fail, the script will terminate with an error.

Enable native Excel output, additional input parameters, and choose input file, build executable version

24 Feb 13:33

This includes all Issues under the v2.0 Milestone.

The location of the input file is no longer fixed. On running the script (or executable) you will be asked for the location of the input CSV. Hitting enter without giving one will default to looking for discovery_api_SearchRecords_input_params.csv in the current working directory.
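The default behaviour described above might look roughly like the following sketch. The function name and the `cwd` parameter are hypothetical and are not taken from the script; only the default file name comes from these notes.

```python
from pathlib import Path

# Default file name given in these release notes
DEFAULT_INPUT = "discovery_api_SearchRecords_input_params.csv"


def resolve_input_path(answer, cwd=None):
    """Turn the user's answer to the prompt into a path.

    An empty answer (just hitting enter) falls back to the default
    file name in the current working directory.
    """
    answer = answer.strip()
    if answer:
        return Path(answer)
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base / DEFAULT_INPUT
```

In an interactive run this could be called as `resolve_input_path(input("Location of input CSV: "))`; the prompt wording here is illustrative only.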

Input parameters now also include specification of output file location, text encoding, and native Excel output if the output file is given a .xls or .xlsx extension, plus the ability to specify which Discovery fields should be included in the output.

A Windows 64-bit executable is included, which can run without Python (or all the required libraries) being installed. This was built with PyInstaller 3.3.1. As running the executable has to build a complete virtual Python environment, it takes quite a while to start, and it is quite a large binary.

In addition to the executable, the sample CSV file is provided along with the CSV Schema file (which uses the CSV Schema Language 1.1 developed by The National Archives). This can be used to check the structure of your own CSV input file using the CSV Validator.

Hashes for discovery_api_SearchRecords.exe are:

  • SHA256 679827e158b1b9cc5f4f922d4eb115ff3f4b1bdc56a4c3b9ecfed81e0471f913
  • MD5 caa0403d5f1d647e8a54dac736a12577
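A downloaded copy of the executable can be checked against the hashes above with a few lines of Python. This is a minimal sketch using only the standard library; the function name and chunk size are arbitrary choices.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return (sha256, md5) hex digests of a file, read in chunks."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large binary is never fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return sha256.hexdigest(), md5.hexdigest()
```

Calling `file_digests("discovery_api_SearchRecords.exe")` and comparing the two returned strings with the values listed above shows whether the download matches.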

Checking for data extraction and possible additional labels

11 Feb 23:05

Added checking that data extraction has happened, and that there are no additional labels present in the data that are not in the list supplied (Issues #5 and #3). Also switched to the regex library rather than using re, in order to get the longest leftmost match using the regex.POSIX flag (Issue #5), extended normalisation/escaping of labels in building the regex (Issues #8 and #7) and, to improve performance, the match is now only performed once (Issue #2).

Now builds regex from list of labels, and includes each in the output CSV

04 Feb 16:11

Now the regex for extracting labelled data from the description field is built up from a list of labels. Normalised versions of these are then used as column names in the output CSV and populated with the relevant data.

For SC 8 the expected labels are: "Petitioners","Name(s)","Addressees","Occupation","Nature of request","Nature of endorsement","Places mentioned","People mentioned", which give the output fields petitioners, names, addressees, occupation, nature_of_request, nature_of_endorsement, places_mentioned, people_mentioned.
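The idea of building the regex from the label list and using normalised labels as column names can be sketched as follows. This is an illustration, not the script's code: the function names, the exact pattern, and the assumption that each label is followed by a colon in the description text are all made up for the example. It uses the standard re module.

```python
import re

LABELS = ["Petitioners", "Name(s)", "Addressees", "Occupation",
          "Nature of request", "Nature of endorsement",
          "Places mentioned", "People mentioned"]


def normalise(label):
    """Lower-case, drop punctuation and join words with underscores."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", label.lower())
    return re.sub(r" +", "_", cleaned.strip())


def build_pattern(labels):
    """Build one regex matching 'Label: value' up to the next label."""
    # Longest labels first, so a label that is a prefix of another cannot win
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(label) for label in ordered)
    return re.compile(
        rf"({alternation}):\s*(.*?)\s*(?=(?:{alternation}):|$)", re.DOTALL
    )


def extract(description, labels):
    """Return a dict of normalised label -> extracted text ('' if absent)."""
    row = {normalise(label): "" for label in labels}
    for label, value in build_pattern(labels).findall(description):
        row[normalise(label)] = value
    return row
```

Every label always appears as a key in the returned row, so the output CSV would have the same columns for each record even when a description omits some labels.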

Future generalisation should make it possible to input a desired list of labels and URL parameters to allow more flexible usage.

Initial release

04 Feb 14:41

This version gets the data required for the work described in http://blog.nationalarchives.gov.uk/blog/catalogue-data-basics/ by using the API for Discovery. Only the fields needed for the plots shown are included.