Test cases and other data for training and testing address matching algorithms.
Test cases are held in tab-separated values (TSV) files with the following columns:
- test — an identifier for the test case which should be unique across all tests
- name — the addressee or name of the business (if separable)
- text — the address text to be matched, with newlines encoded as '\n' (include the name or postcode here only if it can't be stored in a separate field)
- postcode — an optional, separate postcode (if separable)
- uprns — one or more decimal UPRN values which could match the address, separated by a semicolon ';'
- notes — an explanation of the test
A test case may contain additional fields for information.
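
As a rough illustration of the format described above, a test case file could be read along the following lines. This is only a sketch, not code from this project: the function name `read_test_cases` and the sample row (including its placeholder UPRNs) are made up for the example, and it assumes each file starts with a header row naming the columns.

```python
import csv
import io


def read_test_cases(stream):
    """Yield one dict per test case from a tab-separated stream.

    Hypothetical helper: assumes the first row is a header naming the columns.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        case = dict(row)  # keeps any additional informational fields
        # Decode the literal two-character sequence '\n' into real newlines.
        case["text"] = (row.get("text") or "").replace("\\n", "\n")
        # UPRNs are decimal values separated by ';'.
        raw_uprns = row.get("uprns") or ""
        case["uprns"] = [int(u) for u in raw_uprns.split(";") if u.strip()]
        yield case


# Made-up sample data; the UPRNs 1 and 2 are placeholders, not real identifiers.
sample = (
    "test\tname\ttext\tpostcode\tuprns\tnotes\n"
    "example-1\tExample Cafe\t1 Example Street\\nExampletown\tZZ1 1ZZ\t1;2\tmade-up row\n"
)
cases = list(read_test_cases(io.StringIO(sample)))
```

Quoting is disabled here on the assumption that quote characters in addresses should be read literally; adjust this if the files turn out to use CSV-style quoting.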
The bulk directory contains addresses found in bulk in open data, to be matched.
- charity-commission — 345k addresses from Charity Commission
- democracy-club-polling-stations – 5k polling stations from Democracy Club
- edubase — 44k schools from Department for Education
- food-hygiene — 360k food establishments from Food Standards Agency
- general-medical-practices — 13k practices from HSCIC
- pharmacy — 11k pharmacies from NHS Choices
- price-paid — 21 million residential properties from Land Registry
- voa – scripts to process the 1.9 million VOA business-rates valuations (not openly available)
Few of the bulk datasets currently contain resolved UPRNs, but they can form the basis of test cases as we build registers.
The software in this project is open source, covered by the LICENSE file.
The data held in this repository is © Crown copyright and available under the terms of the Open Government Licence v3.0.
Data downloaded by the build process may be covered by different copyright and terms.