Pull external codes automatically including withdrawn (historic) codes #51

Open
wants to merge 12 commits into
from

8 participants
@Bjwebb
Contributor

Bjwebb commented Feb 4, 2015

The work done here means that the Country, Currency and FileFormat codelists can now be pulled from source programmatically.

Where IATI derives codelists from external sources we are aiming to pull those codes automatically and directly from their source. However, when those external sources delete or remove values we need to consider how we handle that. (We also need to consider the cases where IATI has agreed to add additional values not in the source e.g. XK=Kosovo on the Country codelist)

We are using the term 'withdrawn' to deal with codes that have been removed from the current source lists. We may also refer to these as historic codes.

We need to make these withdrawn, or historic, values available to data users that wish to report older data. Adding withdrawn codes is important, as it ensures that codes currently in use (and in historical data) are valid against the codelist, even though they may have been subsequently withdrawn.

In the case of the ISO Country and ISO Currency sources, there are clearly defined ways of dealing with withdrawn values. Both source lists maintain their own list of withdrawn values, which we now plan to import. As a result, this pull request adds a large number of new (withdrawn) codes to these lists: the Country codelist increases from 251 to 308 codes in total, and the Currency codelist from 167 to 300 codes in total.

When we add a withdrawn value to the IATI codelists we flag it by adding a withdrawn attribute on the codelist-item element in the XML. (This has been changed in the codelist schema - see below).

As a result of adding withdrawn codes, this pull request doesn't remove any codes from the Country and Currency lists (but some existing codes may be listed as withdrawn if that is true in the ISO source lists).

Impact for codelist users

Once this change is accepted anyone parsing the XML codelists will, by default, see all the entries in their results - e.g. a drop down selection list will contain all entries. To exclude withdrawn entries you will need to specifically request 'not withdrawn' values.
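
As a minimal sketch of what requesting 'not withdrawn' values could look like for someone parsing the XML directly, the Python below skips any codelist-item that carries the withdrawn attribute described in this pull request. The sample document, its `code` child elements and the withdrawal value are illustrative assumptions, not output copied from a generated codelist.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: the element layout and the withdrawal value
# are assumptions, not copied from a generated codelist.
SAMPLE = """
<codelist name="Country">
  <codelist-items>
    <codelist-item><code>AF</code></codelist-item>
    <codelist-item withdrawn="1993"><code>CSHH</code></codelist-item>
  </codelist-items>
</codelist>
"""


def current_codes(xml_text):
    """Return the codes of items that do not carry a 'withdrawn' attribute."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext('code')
        for item in root.iter('codelist-item')
        if 'withdrawn' not in item.attrib
    ]


print(current_codes(SAMPLE))
```

A drop down built from `current_codes` would then only offer current values, while validation of older data could still use the full list.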

As the codelist API currently stands, consumers of the JSON, CSV, CLv1 XML and CLv2 XML versions of the codelists will not yet see the withdrawn attribute, and will therefore be unable to tell which values are current. However, we plan to address this before the change goes live: IATI/IATI-Codelists#79

Altering the codelist XML Schema

The withdrawn attribute should also be added to the xsd in the main codelists repository, see IATI/IATI-Codelists#78. The withdrawn attribute contains whatever information the source has about when the code was withdrawn, so its format is not specified in the schema.

Handling IATI specific codes

Any IATI specific codes are maintained by adding them to the codelist template file, e.g.
XK - https://github.com/IATI/IATI-Codelists-NonEmbedded/blob/9-historical-codes/templates/Country.xml#L10

Country codelist note

This has had the withdrawn two letter codes added (which are only guaranteed to not be reused for 50 years), and the new 4 letter codes that these countries are assigned when withdrawn (see https://en.wikipedia.org/wiki/ISO_3166-3 for more information).

Renames on the IANA Media Types list (used for FileFormat)

audio/amr-wb -> audio/AMR-WB
video/MJ2 -> video/mj2

These are partly because IANA does not treat their codes as case sensitive. In order to maintain faithfulness to the source list, and because no-one is using these codes, I suggest that in this case we make these renames on the FileFormat list without maintaining the old codes as withdrawn.
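
One way such renames could be spotted when comparing an existing list against a freshly fetched one is to match codes case-insensitively. This is only a sketch: the function name and the sample sets are hypothetical and are not part of the script in this pull request.

```python
def case_only_renames(old_codes, new_codes):
    """Map each old code to a new code that differs from it only in letter case."""
    new_by_lower = {code.lower(): code for code in new_codes}
    renames = {}
    for old in old_codes:
        new = new_by_lower.get(old.lower())
        if new is not None and new != old:
            renames[old] = new
    return renames


# Hypothetical before/after sets based on the two renames listed above.
old = {'audio/amr-wb', 'video/MJ2', 'text/plain'}
new = {'audio/AMR-WB', 'video/mj2', 'text/plain'}
print(case_only_renames(old, new))
```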

Relevant GitHub issues

This pull request resolves the following GitHub issues:

Remaining Tasks

  • Implement the change to the codelist API - IATI/IATI-Codelists#79
  • Update the travis test after the change to the codelist schema has been merged
  • Update the README in this repository to describe how external codelists are fetched

@Bjwebb Bjwebb changed the title from Pull external codes automatically including historic codes to Pull external codes automatically including withdrawn (historic) codes Feb 5, 2015

xml/Currency.xml
@@ -1,7 +1,7 @@
 <codelist name="Currency" xml:lang="en" complete="1">
 <metadata>
 <name>
-<narrative>Currency</narrative>
+<narrative>&gt;Currency</narrative>

@caprenter

caprenter Feb 11, 2015

Contributor

typo?

@Bjwebb

Bjwebb Feb 19, 2015

Contributor

Yep, thanks. Fixed in a5ec422

[#9] Remove historic 2 letter codes from Country list
4 letter codes should be used instead.
@markbrough

markbrough Jan 4, 2016

What's the status of this pull request? Looks like it would be quite useful for keeping some of these codelists up to date…

@markbrough

markbrough Aug 18, 2016

Bumping this pull request...

@andylolz andylolz referenced this pull request Mar 21, 2017

Merged

Update Sector codelist #137

@andylolz

andylolz Mar 22, 2017

Contributor

Also bumping (because automating the codelist update process is relevant to my interests!)

Finding a way to track withdrawn codes from DAC codelists would also be great. But that shouldn’t be a blocker for this PR! (so I have (re)moved my earlier comments)

wget "https://www.iana.org/assignments/media-types/media-types.xml" -O source/media-types.xml
wget "http://www.currency-iso.org/dam/downloads/table_a1.xml" -O source/table_a1.xml

@andylolz

andylolz Mar 22, 2017

Contributor

This should now become:

wget "https://www.currency-iso.org/dam/downloads/lists/list_one.xml" -O source/table_a1.xml
wget "https://www.iana.org/assignments/media-types/media-types.xml" -O source/media-types.xml
wget "http://www.currency-iso.org/dam/downloads/table_a1.xml" -O source/table_a1.xml
wget "http://www.currency-iso.org/dam/downloads/table_a3.xml" -O source/table_a3.xml

@andylolz

andylolz Mar 22, 2017

Contributor

This should now become:

wget "https://www.currency-iso.org/dam/downloads/lists/list_three.xml" -O source/table_a3.xml
codelist_items.append(codelist_item)
countries = ET.parse('source/iso_country_codes.xml')

@andylolz

andylolz Mar 22, 2017

Contributor

Looks as though the missing iso_country_codes.xml comes from here… Presumably this can’t be pulled in a completely automated way?

@Bjwebb

Bjwebb Mar 22, 2017

Contributor

That's right. It's behind a paywall, so we can't automatically download.

codelist_item = ET.Element('codelist-item')
if withdrawn:
    codelist_item.attrib['withdrawn'] = withdrawn

@hayfield

hayfield Mar 23, 2017

Contributor

The chosen solution ended up adding a trio of attributes - status, activation-date and withdrawal-date. As such, this script and the generated Codelists should use that method of identifying withdrawn codes.
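
A rough sketch of how the script's item construction might look with the three attributes instead of a single withdrawn attribute. The attribute names come from the comment above; the status values 'active' and 'withdrawn', the date format, the helper name and the sample code are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def make_codelist_item(code, withdrawal_date=None, activation_date=None):
    """Build a codelist-item element carrying status and date attributes."""
    item = ET.Element('codelist-item')
    # Assumption: 'active' and 'withdrawn' are the status values in use.
    item.set('status', 'withdrawn' if withdrawal_date else 'active')
    if activation_date:
        item.set('activation-date', activation_date)
    if withdrawal_date:
        item.set('withdrawal-date', withdrawal_date)
    ET.SubElement(item, 'code').text = code
    return item


# 'ABC' and the date are placeholder values, not real codelist data.
example = make_codelist_item('ABC', withdrawal_date='2000-01-01')
print(ET.tostring(example, encoding='unicode'))
```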

currency_codes[currency_code] = (currency_name, country_currency.find('WthdrwlDt').text if historic else None)
# Ensure that historic codes come after current codes
for histroic_section in [False, True]:

@hayfield

hayfield Mar 23, 2017

Contributor

Typo in the spelling of the variable name - histroic vs historic

@hayfield

hayfield Mar 23, 2017

Contributor

@dalepotter @wendyrogers What is the current plan for automating updates to replicated Codelists? With a couple of fairly minor modifications, this could allow a couple of Codelists to be updated, though if the plan is to automate retrieval of a larger number of Codelists then building a separate tool on top of iati.core would lead to a more sustainable architecture.

Additionally, iatistandard.org does not appear to deal with the status and withdrawal-date attributes, so there will likely be extra work needed there for this change to have its desired impact.

@dalepotter

dalepotter Mar 24, 2017

Collaborator

I would support automation of codelist updates for codelists that we consider to have robust governance and management processes, alongside machine-readable access. This would include the Country and Currency codelists (managed by the ISO). We should better understand the processes that lead to new versions of the FileFormat list (managed by the IANA).

Alongside this, we must have good test coverage for these automated processes. If they are run headless, there should be good logging and notification of actions taken. I've added this scoping to the list of weekly maintenance jobs, so that we can determine the roadmap to implementation.

@dalepotter

dalepotter Apr 24, 2017

Collaborator

Just to update on this issue - we are in discussions with the OECD regarding the publication of machine-readable codelists and are meeting to explore this further in early May. From the outcomes of these conversations, we will have a better view on how to take this work forward and we will update here accordingly.

@andylolz

andylolz Jul 5, 2017

Contributor

The approach taken in this PR for withdrawn codes really only works for the Currency list, because the source data helpfully includes withdrawn data. It’s a good starting point, though!

A more general approach (i.e. that doesn’t rely on the source data tracking withdrawals) would be something like:

  • generate new xml from source
  • read existing xml
  • if a new code is not on the existing list, set the activation date to today
  • if an existing, non-withdrawn code is not on the new list, add it to the new list, but mark as withdrawn with withdrawal date today
  • if an existing withdrawn code is on the new list, raise a weirdness warning (i.e. because something funky like code reuse has happened)
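
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: codelists are represented as plain dicts rather than XML, the function and key names are hypothetical, and two details the list leaves open are chosen here (already-withdrawn codes missing from the source are carried over unchanged, and a reappearing withdrawn code keeps its existing metadata alongside the warning).

```python
from datetime import date


def merge_codelists(existing, fresh_codes, today=None):
    """Merge codes fetched from source into an existing codelist.

    existing:    dict mapping code -> metadata dict (e.g. {'status': 'active'})
    fresh_codes: iterable of codes currently present in the source
    Returns (merged, warnings).
    """
    today = today or date.today().isoformat()
    fresh = set(fresh_codes)
    merged = {}
    warnings = []

    for code in sorted(fresh):
        old = existing.get(code)
        if old is None:
            # New code: set the activation date to today.
            merged[code] = {'status': 'active', 'activation-date': today}
            continue
        if old.get('status') == 'withdrawn':
            # Possible code reuse: warn and keep the existing metadata.
            warnings.append('withdrawn code %s reappeared in source' % code)
        merged[code] = dict(old)

    for code, old in existing.items():
        if code in fresh:
            continue
        entry = dict(old)
        if old.get('status') != 'withdrawn':
            # Dropped from the source: keep it, but mark as withdrawn today.
            entry['status'] = 'withdrawn'
            entry['withdrawal-date'] = today
        merged[code] = entry

    return merged, warnings


# Placeholder codes and dates, for illustration only.
existing = {
    'AA': {'status': 'active'},
    'BB': {'status': 'active'},
    'CC': {'status': 'withdrawn', 'withdrawal-date': '2010-01-01'},
}
merged, warnings = merge_codelists(existing, ['AA', 'CC', 'DD'], today='2017-07-05')
print(merged)
print(warnings)
```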
@andylolz

andylolz Jul 8, 2017

Contributor

Thanks to @markbrough for flagging http://data.okfn.org/data/core/country-codes (licensing concerns hopefully allayed by the README)

Similarly:

So I think the only third party datasets that don’t have openly licensed & machine readable versions, then, are LocationType, LocationType (category) & PolicySignificance. It’s trivial to scrape the first two – in fact here’s a quick scraper to demonstrate. Output here.

The source URL for PolicySignificance is broken, and I’m afraid I wasn’t able to find a working link.

andylolz added a commit to andylolz/IATI-Codelists-NonEmbedded that referenced this pull request Aug 29, 2017

@stevieflow

stevieflow Oct 11, 2017

Contributor

@dalepotter can this PR be progressed?

@andylolz

andylolz Oct 13, 2017

Contributor

> @dalepotter can this PR be progressed?

👍 for the spirit of @stevieflow’s comment, but I’d instead suggest closing this PR and progressing #172 (if @Bjwebb agrees!) which builds on this work. Additionally, feedback/suggestions on #172 would be much appreciated.

There are several threads on IATI discuss that talk about the issue of keeping non-embedded codelists up-to-date, and I think it would be great to resolve this. Indeed, much of the work is already done thanks to @Bjwebb, @datasets and #172.
