# IATI Publisher Data Getter


| Version | Date | Description |
|-|-|-|
| 1.0 | 2019-09-23 | Initial run, investigating document link and related activity schema validation errors |
| 1.1 | 2019-09-23 | Updating the ad-hoc analysis section for clarity, expanding ruleset validation presentation |
| 2.0 | 2019-10-07 | Added non-current filter following PWYF rules, added codelist outputs to the ruleset evaluation, and added an exported .xlsx file for ruleset evaluation sheets. |
| 2.1 | 2019-11-27 | (Ben W) Use lxml's huge_tree param to support bigger files |
| 3.0 | 2019-01-17 | (Ben W) Add file upload option, run CoVE validation on current activities only, split current_dict into a function so we can test it in another notebook, and run some checks from the IATI Publishing Statistics. |
|||

Copyright (C) 2019 Open Data Services Co-operative Limited

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.


## Imports, Downloading, Merging


### Imports

In [34]:
! pip install XlsxWriter



In [0]:
import pandas as pd
import requests as rq
import lxml.etree as ET
import json
import shutil
import copy

### Setup

In [0]:
# Warning huge_tree disables security restrictions, so should only be used on trusted XML files
parser = ET.XMLParser(huge_tree=True)

#define comment removal method
def remove_comments(etree):
  
  comments = etree.xpath('//comment()')

  for c in comments:
      p = c.getparent()
      p.remove(c)

  return etree
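To see what `remove_comments` does, it can be run on a small in-memory document. The sketch below restates the helper and applies it to a made-up XML string; the identifiers `XX-1` and `XX-2` are hypothetical. Since lxml counts comment nodes among an element's children, stripping them first presumably keeps the later child counts limited to activities.

```python
import lxml.etree as ET

def remove_comments(etree):
    # Same approach as the notebook's helper: find every comment node and detach it
    for c in etree.xpath('//comment()'):
        c.getparent().remove(c)
    return etree

# Hypothetical toy document, not real publisher data
xml = b"""<iati-activities>
  <!-- a file-level comment -->
  <iati-activity><!-- inline --><iati-identifier>XX-1</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-2</iati-identifier></iati-activity>
</iati-activities>"""

parser = ET.XMLParser(huge_tree=True)  # only appropriate for trusted XML
children_before = len(ET.fromstring(xml, parser=parser))
root = remove_comments(ET.fromstring(xml, parser=parser))

print(children_before, len(root))          # direct children before and after
print(root.xpath('count(//comment())'))    # comments remaining
```

Note that this toy document has no comment outside the root element; such a comment has no parent element, so the helper would need a guard for that case.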

### Download or Upload

In [0]:
data_mode = "Download" #@param ["Download", "Upload"]

#### Download

Input the IATI Registry ID of a given IATI Publisher. 

Set `current_only` to filter the combined IATI data so that only activities which are 'current' according to the [PWYF ATI Methodology](https://github.com/pwyf/latest-index-indicator-definitions/issues/1) are considered.

Set `current_until` to provide a date which will be used to calculate the current period (12 months prior to this date).

Set `keep_non_current` if you want to download a copy of the non-current activities.

Exceptions can be added as an array of dataset_id strings such as this:

```json
exceptions: ["dataset1", "dataset2", "..."]
```

Currently only Activity files are supported.
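The effect of the file-type and `exceptions` settings can be tried on a toy table before running against the Registry. This is a minimal sketch: the column names (`registry-file-id`, `file-type`, `source-url`) are the ones the notebook reads, while the rows and URLs are hypothetical.

```python
import pandas as pd

# Hypothetical registry listing; column names match those the notebook reads
datasets = pd.DataFrame({
    "registry-file-id": ["pub-a", "pub-b", "pub-org", "pub-c"],
    "file-type": ["activity", "activity", "organisation", "activity"],
    "source-url": ["http://example.org/a.xml", "http://example.org/b.xml",
                   "http://example.org/org.xml", "http://example.org/c.xml"],
})
exceptions = ["pub-b"]

# Drop organisation files, then drop any dataset listed in `exceptions`
kept = datasets[datasets["file-type"] != "organisation"]
kept = kept[~kept["registry-file-id"].isin(exceptions)].reset_index(drop=True)

print(kept["registry-file-id"].tolist())
```

Here `reset_index(drop=True)` discards the old index; the notebook's plain `reset_index()` keeps it as an extra column, which is harmless for the merge step.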

In [0]:
if data_mode == "Download":
  registry_id = "dfid" #@param {type:"string"}
  filetype = "Activities" #@param ["Activities", "Organisations"]
  current_only = True #@param {type:"boolean"}
  if current_only:
    current_until = "2020-01-29" #@param {type:"date"}
    keep_non_current = True #@param {type:"boolean"}
  exceptions =  [] #@param {type:"raw"}

In [0]:
if data_mode == "Download":
  datasets = pd.read_csv("https://iatiregistry.org/csv/download/"+registry_id)

In [40]:
if data_mode == "Download":
  # remove unwanted datasets

  if filetype == "Activities":
    datasets = datasets[datasets['file-type'] != 'organisation']
  else: raise Exception('Currently, this notebook only supports IATI Activities, though could be easily modified to support Organisations')

  datasets = datasets[~datasets['registry-file-id'].isin(exceptions)]
  datasets = datasets.reset_index()

  print("Removed unwanted activities and setup comment-removal method")

Removed unwanted activities and setup comment-removal method


In [41]:
if data_mode == "Download":
  print("\nCombining {} IATI files \n".format(len(datasets['source-url'])))

  # Start with the first file, with comments removed
  big_iati = remove_comments(ET.fromstring(rq.get(datasets['source-url'][0]).content, parser=parser))

  # Start a dictionary to keep track of the additions
  merge_log = {datasets['source-url'][0]: len(big_iati.getchildren())}

  # Iterate through the second to last files,
  # insert their activities into the first,
  # and update the dictionary
  for url in datasets['source-url'][1:]:
      # Reuse the huge_tree parser so large files parse like the first one
      data = remove_comments(ET.fromstring(rq.get(url).content, parser=parser))
      merge_log[url] = len(data.getchildren())
      big_iati.extend(data.getchildren())

  # Print a small report on the merging
  print("Files Merged: ")
  for file, activity_count in merge_log.items():
      print("|-> {} activities from {}".format(activity_count, file))
  print("|--> {} in total".format(len(big_iati.getchildren())))

  with open("combined.xml", "wb+") as out_file:
      out_file.write(ET.tostring(big_iati, encoding='utf8', pretty_print=True))


Combining 125 IATI files 

Files Merged: 
|-> 10 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Ivory-Coast-CI.xml
|-> 10 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Niger-NE.xml
|-> 101 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Jordan-JO.xml
|-> 11 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Laos-LA.xml
|-> 11 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Mali-ML.xml
|-> 11 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Nicaragua-NI.xml
|-> 119 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Tajikistan-TJ.xml
|-> 12 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Russia-RU.xml
|-> 13 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-Chad-TD.xml
|-> 14 activities from http://iati.dfid.gov.uk/iati_files/Region/DFID-North-of-Sahara-regional-189.xml
|-> 14 activities from http://iati.dfid.gov.uk/iati_files/Country/DFID-United-King
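The merge itself can be illustrated without any downloads. The sketch below assumes two hypothetical in-memory files, each with an `<iati-activities>` root, and moves the children of the second into the first while keeping a per-file count, as the merge log does.

```python
import lxml.etree as ET

# Two hypothetical files, each with an <iati-activities> root
file_a = (b"<iati-activities><iati-activity>"
          b"<iati-identifier>XX-1</iati-identifier>"
          b"</iati-activity></iati-activities>")
file_b = b"""<iati-activities>
  <iati-activity><iati-identifier>XX-2</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-3</iati-identifier></iati-activity>
</iati-activities>"""

combined = ET.fromstring(file_a)
merge_log = {"file_a": len(combined)}

other = ET.fromstring(file_b)
merge_log["file_b"] = len(other)
# list(other) snapshots the children; extend() then moves them into `combined`
combined.extend(list(other))

print(merge_log, len(combined), len(other))
```

In lxml an element can only have one parent, so the activities are moved rather than copied and `other` ends up empty. `list(element)` and `len(element)` are used here in place of `getchildren()`, which lxml marks as deprecated.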

#### Upload

In [0]:
if data_mode == "Upload":
  filename = "upload.xml" #@param {type:"string"}
  current_only = True #@param {type:"boolean"}
  if current_only == True:
    current_until = "2020-01-29" #@param {type:"date"}
    keep_non_current = False #@param {type:"boolean"}

Go to View->"Table of contents" if the left pane isn't open, click Files at the top, then click Upload. Upload the file, then update `filename` above to match the name of the file you uploaded. Note that uploaded files are cleared when the session ends.

In [0]:
if data_mode == "Upload":
  shutil.copyfile(filename, 'combined.xml')
  big_iati = remove_comments(ET.parse('combined.xml', parser=parser).getroot())

### Filter current activities

In [0]:
import datetime as dt
from dateutil.relativedelta import relativedelta

selected_date = dt.datetime.strptime(current_until, "%Y-%m-%d")

def current_dict(activity):
  status_check = False
  planned_end_date_check = False
  actual_end_date_check = False
  transaction_date_check = False

  # print("Activity {} of {}".format(count, len(big_iati)))
  
  if activity.xpath("activity-status[@code=2]"):
    status_check = True

  if activity.xpath("activity-date[@type=3]/@iso-date"):
    date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=3]/@iso-date")[0], '%Y-%m-%d')
    if date_time_obj > (selected_date - relativedelta(years=1)):
      planned_end_date_check = True
  
  if activity.xpath("activity-date[@type=4]/@iso-date"):
    date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=4]/@iso-date")[0], '%Y-%m-%d')
    if date_time_obj > (selected_date - relativedelta(years=1)):
      actual_end_date_check = True

  if activity.xpath("transaction/transaction-type[@code=2 or @code=3 or @code=4]"):
    dates = activity.xpath("transaction[transaction-type[@code=2 or @code=3 or @code=4]]/transaction-date/@iso-date")
    date_truths = [dt.datetime.strptime(date, '%Y-%m-%d') > (selected_date - relativedelta(years=1)) for date in dates]
    if True in date_truths:
      transaction_date_check = True

  pwyf_current = status_check or planned_end_date_check or actual_end_date_check or transaction_date_check

  return {
    'iati-identifier': activity.findtext('iati-identifier'),
    'status_check': status_check, 
    'planned_end_date_check': planned_end_date_check, 
    'actual_end_date_check': actual_end_date_check, 
    'transaction_date_check': transaction_date_check,
    'pwyf_current': pwyf_current,
  }
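Because the check lives in its own function, it can be exercised on hand-written activities. The sketch below is a condensed restatement of the four checks in `current_dict`, not a replacement for it; `is_current` and the two toy activities are hypothetical names and data introduced for illustration.

```python
import datetime as dt
import lxml.etree as ET
from dateutil.relativedelta import relativedelta

def is_current(activity, selected_date):
    """Condensed restatement of the four checks in current_dict (a sketch)."""
    cutoff = selected_date - relativedelta(years=1)

    def after_cutoff(dates):
        return any(dt.datetime.strptime(d, "%Y-%m-%d") > cutoff for d in dates)

    return bool(
        activity.xpath("activity-status[@code=2]")
        # Like current_dict, only the first planned/actual end date is considered
        or after_cutoff(activity.xpath("activity-date[@type=3]/@iso-date")[:1])
        or after_cutoff(activity.xpath("activity-date[@type=4]/@iso-date")[:1])
        or after_cutoff(activity.xpath(
            "transaction[transaction-type[@code=2 or @code=3 or @code=4]]"
            "/transaction-date/@iso-date"))
    )

selected = dt.datetime(2020, 1, 29)

# Hypothetical activity: closed, with an actual end date years before the window
closed_old = ET.fromstring(
    '<iati-activity><activity-status code="4"/>'
    '<activity-date type="4" iso-date="2015-03-31"/></iati-activity>')
# Hypothetical activity: closed, but with a type-3 transaction inside the window
closed_recent_spend = ET.fromstring(
    '<iati-activity><activity-status code="4"/>'
    '<transaction><transaction-type code="3"/>'
    '<transaction-date iso-date="2019-06-30"/></transaction></iati-activity>')

print(is_current(closed_old, selected))
print(is_current(closed_recent_spend, selected))
```

The second activity counts as current only because of its transaction date, which shows that the four checks are combined with `or` rather than `and`.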
  

In [0]:
# Filter out non-current activities, if appropriate
# See https://github.com/pwyf/latest-index-indicator-definitions/issues/1

log_columns = ["iati-identifier", "status_check", "planned_end_date_check", "actual_end_date_check", "transaction_date_check", "pwyf_current"]

# Build the log in one step; DataFrame.append was removed in pandas 2.0
current_check_log = pd.DataFrame([current_dict(activity) for activity in big_iati], columns=log_columns)

current_check_log.to_csv("current_check_log.csv")

In [0]:
big_iati_archived = copy.copy(big_iati)

In [0]:
import datetime as dt

selected_date = dt.datetime.strptime(current_until, "%Y-%m-%d")

def print_non_current(activity):
    print_output = ""

    iati_identifier = activity.findtext('iati-identifier')
    activity_status = activity.xpath("activity-status")[0].values()
    print_output += "-----------------------Non-Current Activity-------------------------\n"
    print_output += "iati-identifier: {}\n".format(iati_identifier)
    print_output += "activity-status: {}\n".format(activity_status)

    if activity.xpath("activity-date[@type=3]/@iso-date"):
      date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=3]/@iso-date")[0], '%Y-%m-%d')
      print_output += "activity-date[@type=3]/@iso-date: {}\n".format(date_time_obj)
      
    if activity.xpath("activity-date[@type=4]/@iso-date"):
      date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=4]/@iso-date")[0], '%Y-%m-%d')
      print_output += "activity-date[@type=4]/@iso-date: {}\n".format(date_time_obj)


    if activity.xpath("transaction/transaction-type[@code=2]"):
      dates = activity.xpath("transaction[transaction-type[@code=2]]/transaction-date/@iso-date")
      print_output += "transaction/transaction-type[@code=2]: {}\n".format(dates)

    if activity.xpath("transaction/transaction-type[@code=3]"):
      dates = activity.xpath("transaction[transaction-type[@code=3]]/transaction-date/@iso-date")
      print_output += "transaction/transaction-type[@code=3]: {}\n".format(dates)

    if activity.xpath("transaction/transaction-type[@code=4]"):
      dates = activity.xpath("transaction[transaction-type[@code=4]]/transaction-date/@iso-date")
      print_output += "transaction/transaction-type[@code=4]: {}\n".format(dates)

    return print_output
    

In [48]:
cur_length = len(big_iati)
if keep_non_current:
  non_current_iati = copy.copy(big_iati)

if current_only:
  # Look the identifiers up once, and iterate over a snapshot of the children
  # so that removing an activity cannot disturb the iteration
  non_current_ids = set(current_check_log.loc[current_check_log['pwyf_current'] == False, 'iati-identifier'])
  for activity in list(big_iati):
    if activity.findtext('iati-identifier') in non_current_ids:
      big_iati.remove(activity)
  if keep_non_current:
    current_ids = set(current_check_log.loc[current_check_log['pwyf_current'] == True, 'iati-identifier'])
    for activity in list(non_current_iati):
      if activity.findtext('iati-identifier') in current_ids:
        non_current_iati.remove(activity)
       

  print("Removed {} non-current activities from a total of {}.".format((cur_length-len(big_iati)),cur_length))
  print("{} current activities remain.".format(len(big_iati)))
  if keep_non_current:
    print("The {} non-current activities are kept for export to combined_non_current.xml".format((len(non_current_iati))))


else:
  print("As `current_only` is set to False, all retrieved activities have been kept")

Removed 13626 non-current activities from a total of 19477.
5851 current activities remain.
The 13626 non-current activities are kept for export to combined_non_current.xml


In [0]:
with open("combined_current.xml", "wb+") as out_file:
  out_file.write(ET.tostring(big_iati, encoding='utf8', pretty_print=True))

In [0]:
if current_only and keep_non_current:
  with open("combined_non_current.xml", "wb+") as out_file:
    out_file.write(ET.tostring(non_current_iati, encoding='utf8', pretty_print=True))
  
  with open("non_current_details.txt", "w") as out_file:
    out_file.write("There are {} non-current activities.\n".format(len(non_current_iati)))
    for activity in non_current_iati:
      out_file.write(print_non_current(activity))

## Ad Hoc Analysis

This section can be used to evaluate specific aspects of the total corpus of data. For instance, `coverage_check()` reports the number of activities which include specific elements, or which satisfy certain conditions. This requires some Python and XML knowledge. Some examples have been included below.

In [0]:
def coverage_check(tree, path, manual_list_entry=False):
  if manual_list_entry:
    denominator = len(tree)
    numerator = len(path)
  else:
    denominator = len(tree.getchildren())
    numerator = len(tree.xpath(path))

  coverage = numerator / denominator
  return denominator, numerator, coverage
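Running the function on a tiny hand-made tree makes the returned tuple easy to verify by eye. The sketch below restates `coverage_check` (using `len(tree)` for the child count) and applies it to four hypothetical activities.

```python
import lxml.etree as ET

def coverage_check(tree, path, manual_list_entry=False):
    # Restated from the notebook, using len(tree) for the child count
    if manual_list_entry:
        denominator, numerator = len(tree), len(path)
    else:
        denominator, numerator = len(tree), len(tree.xpath(path))
    return denominator, numerator, numerator / denominator

# Hypothetical data: two activities with transactions, two without
toy = ET.fromstring("""<iati-activities>
  <iati-activity><transaction><transaction-type code="3"/></transaction></iati-activity>
  <iati-activity>
    <transaction><transaction-type code="2"/></transaction>
    <transaction><transaction-type code="3"/></transaction>
  </iati-activity>
  <iati-activity/>
  <iati-activity/>
</iati-activities>""")

# Share of activities with at least one transaction
print(coverage_check(toy, "iati-activity[transaction]"))

# Share of all transactions with type code 3, via two manually built lists
print(coverage_check(
    toy.xpath("iati-activity/transaction"),
    toy.xpath("iati-activity/transaction[transaction-type/@code = 3]"),
    True))
```

The first call counts activities, the second counts transactions, so the denominators differ (4 and 3 here). An empty tree or list would raise `ZeroDivisionError`, which the notebook's version does not guard against either.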

In [52]:
coverage_check(big_iati, "iati-activity[transaction]")

(5851, 3939, 0.6732182532900359)


In [53]:
coverage_check(big_iati, "iati-activity[capital-spend]")

(5851, 4570, 0.7810630661425397)


In [54]:
# activities with a disbursement
coverage_check(big_iati, "iati-activity[transaction/transaction-type/@code = 3]")

(5851, 3567, 0.6096393778841224)


In [55]:
# Manual entry of two lists to see the proportion of transactions which are disbursements
coverage_check(
    big_iati.xpath("iati-activity/transaction"), 
    big_iati.xpath("iati-activity/transaction[transaction-type/@code = 3]"), 
    True)

(54551, 37882, 0.6944327326721783)


## Batch CoVE Validation

In [23]:
json_validation_filepath = 'validation.json'

url = 'https://iati.cove.opendataservices.coop/api_test'
# Open the file in a context manager so the handle is closed after the upload
with open("combined_current.xml", 'rb') as xml_file:
    r = rq.post(url, files={'file': xml_file}, data={"name": "combined_current.xml"})

print(r)

print("CoVE validation was successful." if r.ok else "Something went wrong.")

validation_json = r.json()

with open(json_validation_filepath, "w") as out_file:
    json.dump(validation_json, out_file)

print('Validation JSON file has been written to {}.'.format(
    json_validation_filepath))

<Response [200]>
CoVE validation was successful.
Validation JSON file has been written to validation.json.


In [24]:
ruleset_table = pd.DataFrame(data=validation_json['ruleset_errors'])
schema_table = pd.DataFrame(data=validation_json['validation_errors'])
embedded_codelist_table = pd.DataFrame(data=validation_json['invalid_embedded_codelist_values'])
non_embedded_codelist_table = pd.DataFrame(data=validation_json['invalid_non_embedded_codelist_values'])

print(
    "CoVE has found: \n* {} schema errors \n* {} ruleset errors \n* {} embedded codelist errors \n* {} non-embedded codelist errors".format(
    len(schema_table), 
    len(ruleset_table), 
    len(embedded_codelist_table), 
    len(non_embedded_codelist_table)))

print("\nWriting to validation_workbook.xlsx")
writer = pd.ExcelWriter('validation_workbook.xlsx', engine='xlsxwriter')
# Write each dataframe to a different worksheet.
schema_table.to_excel(writer, sheet_name='schema_table')
ruleset_table.to_excel(writer, sheet_name='ruleset_table')
embedded_codelist_table.to_excel(writer, sheet_name='embedded_codelist_table')
non_embedded_codelist_table.to_excel(writer, sheet_name='non_embedded_codelist_table')

# Close the Pandas Excel writer and output the Excel file.
# (ExcelWriter.save() was removed in pandas 2.0; close() also writes the file.)
writer.close()


CoVE has found: 
* 0 schema errors 
* 116 ruleset errors 
* 0 embedded codelist errors 
* 3 non-embedded codelist errors

Writing to validation_workbook.xlsx


### Schema Validation

In [25]:
schema_table

#### Custom Analysis

This section gives space to investigate schema errors identified in the section above. It requires a small amount of tinkering in Python.

In [0]:
# To view offending XML element, take the '/NN/' from the path above, add one, 
# and then modify the remaining content of the path to print a section of XML.
# print(ET.tostring(big_iati.xpath("iati-activity[61]/related-activity")[0].getparent()).decode())

Note the lack of description in the XML output above
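The "add one" step in the comment above exists because XPath positions are 1-based. A minimal sketch of that lookup follows, assuming (as the comment implies) that the number taken from the error path is a zero-based activity position; `element_for_error` and the toy document are hypothetical.

```python
import lxml.etree as ET

def element_for_error(tree, zero_based_index, child_path):
    """Return the first match of child_path inside the activity at a 0-based index.

    XPath positions are 1-based, hence the +1.
    """
    matches = tree.xpath(
        "iati-activity[{}]/{}".format(zero_based_index + 1, child_path))
    return matches[0] if matches else None

# Hypothetical two-activity document
toy = ET.fromstring(
    "<iati-activities>"
    "<iati-activity><iati-identifier>XX-1</iati-identifier></iati-activity>"
    "<iati-activity><iati-identifier>XX-2</iati-identifier>"
    "<related-activity ref='XX-1' type='1'/></iati-activity>"
    "</iati-activities>")

hit = element_for_error(toy, 1, "related-activity")
# Print the whole parent activity to see the element in context
print(ET.tostring(hit.getparent()).decode())
```

Returning `None` when nothing matches avoids the `IndexError` that a bare `[0]` would raise for a path that does not exist in the filtered file.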

### Ruleset Validation

In [27]:
# Concise Summary
ruleset_table.pivot_table(index=['rule',], aggfunc='count').drop(columns=["path", "ruleset", "explanation"])

| rule | id (count) |
|-|-|
| activity-date[@type='2']/@iso-date must be today or in the past | 19 |
| either sector or transaction/sector must be present | 17 |
| recipient-country/@percentage and recipient-region/@percentage must sum to 100% | 80 |


In [28]:
# Full Table
ruleset_table.head()

| | rule | id | explanation | path | ruleset |
|-|-|-|-|-|-|
| 0 | recipient-country/@percentage and recipient-re... | GB-1-202921 | recipient-country\|recipient-region/@percentage... | /iati-activities/iati-activity[57]/recipient-c... | Percentages must sum to 100% |
| 1 | recipient-country/@percentage and recipient-re... | GB-GOV-1-300535 | recipient-country\|recipient-region/@percentage... | /iati-activities/iati-activity[74]/recipient-c... | Percentages must sum to 100% |
| 2 | recipient-country/@percentage and recipient-re... | GB-1-204516 | recipient-country\|recipient-region/@percentage... | /iati-activities/iati-activity[83]/recipient-c... | Percentages must sum to 100% |
| 3 | recipient-country/@percentage and recipient-re... | GB-1-205210 | recipient-country\|recipient-region/@percentage... | /iati-activities/iati-activity[205]/recipient-... | Percentages must sum to 100% |
| 4 | recipient-country/@percentage and recipient-re... | GB-1-205210 | recipient-country\|recipient-region/@percentage... | /iati-activities/iati-activity[296]/recipient-... | Percentages must sum to 100% |


## ATI Data Quality Testing

Download `combined.xml` and upload it to [this testing tool](http://dataqualitytester.publishwhatyoufund.org/) to check data quality in line with the Aid Transparency Index Methodology.

| Date | Link | Notes |
|-|-|-|
|YYYY-MM-DD|[Link](http://dataqualitytester.publishwhatyoufund.org/package/bb957674-6ccf-4635-a553-1d2dd0382075)| Some description of findings or link to notes / report |
||||
||||
||||
||||

## IATI Publisher Statistics

Run some code from the [OpenDataServices/iati-publishingstats-details](https://github.com/OpenDataServices/iati-publishingstats-details) repo, which produces per-activity CSVs for the [IATI Publishing Statistics](http://publishingstats.iatistandard.org/) checks.

In [29]:
import os
os.chdir('/content')
!rm -r iati-publishingstats-details
!git clone https://github.com/OpenDataServices/iati-publishingstats-details.git
os.chdir('/content/iati-publishingstats-details')
!git submodule init
!git submodule update
!mkdir logs
# Note that the Publishing Statistics code requires Python 2, so needs a virtualenv
!sudo apt install python-virtualenv > logs/apt.log
!virtualenv .ve
!source .ve/bin/activate; pip install -r requirements.txt > logs/requirements.log
!source .ve/bin/activate; ./fetch_helpers.sh > logs/fetch_helpers.log 2>&1
!source .ve/bin/activate; python forward_looking_details.py ../combined.xml > forward_looking_details.csv
!source .ve/bin/activate; python comprehensiveness_is_current_details.py ../combined.xml > comprehensiveness_is_current_details.csv
os.chdir('/content')

rm: cannot remove 'iati-publishingstats-details': No such file or directory
Cloning into 'iati-publishingstats-details'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 30 (delta 12), reused 24 (delta 6), pack-reused 0
Unpacking objects: 100% (30/30), done.
Submodule 'IATI-Publishing-Statistics' (https://github.com/IATI/IATI-Publishing-Statistics.git) registered for path 'IATI-Publishing-Statistics'
Cloning into '/content/iati-publishingstats-details/IATI-Publishing-Statistics'...
Submodule path 'IATI-Publishing-Statistics': checked out '623498143ad809a57b87d357ea0dd0f65afd11b0'


debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 6.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: 

In [30]:
os.chdir('/content')
forward_looking_details = pd.read_csv('iati-publishingstats-details/forward_looking_details.csv')
forward_looking_details

| | iati-identifier | First year to fail | End dates | Budget years |
|-|-|-|-|-|
| 0 | GB-1-204895 | 2020 | 2021-12-30 | No budgets |
| 1 | GB-1-205200 | 2020 | 2022-12-31 | No budgets |
| 2 | GB-1-205200-105 | 2020 | 2022-11-30 | 2017201820182018 |
| 3 | GB-1-205201 | 2020 | 2025-06-30 | No budgets |
| 4 | GB-1-205201-106 | 2020 | 2022-03-31 | 2018 |
| ... | ... | ... | ... | ... |
| 1900 | GB-GOV-1-300731-101 | 2021 | 2021-03-31 | 2018201920192019201920202020 |
| 1901 | GB-GOV-1-300123 | 2020 | 2023-09-30 | No budgets |
| 1902 | GB-GOV-1-300310 | 2020 | 2023-03-31 | No budgets |
| 1903 | GB-GOV-1-300310-106 | 2020 | 2023-03-31 | 2019 |


In [31]:
comprehensiveness_is_current_details = pd.read_csv('iati-publishingstats-details/comprehensiveness_is_current_details.csv')
comprehensiveness_is_current_details

| | iati-identifier | publishingstats_comprehensiveness_current |
|-|-|-|
| 0 | GB-1-202417 | False |
| 1 | GB-1-202417-101 | False |
| 2 | GB-1-202429 | False |
| 3 | GB-1-202429-101 | False |
| 4 | GB-1-202434 | False |
| ... | ... | ... |
| 19472 | GB-GOV-1-300599 | True |
| 19473 | GB-GOV-1-300599-101 | False |
| 19474 | GB-GOV-1-300686 | True |
| 19475 | GB-GOV-1-300686-101 | True |


In [32]:
big_iati_comprehensiveness_current = copy.copy(big_iati_archived)

cur_length = len(big_iati_comprehensiveness_current)

if current_only:
  for activity in big_iati_comprehensiveness_current:
    if activity.findtext('iati-identifier') in comprehensiveness_is_current_details.loc[comprehensiveness_is_current_details['publishingstats_comprehensiveness_current'] == False, 'iati-identifier'].values:
      activity.getparent().remove(activity)
  
  print("Removed {} non-current activities from a total of {}.".format((cur_length-len(big_iati_comprehensiveness_current)),cur_length))
  print("{} current activities remain.".format(len(big_iati_comprehensiveness_current)))

else:
  print("As `current_only` is set to False, all retrieved activities have been kept")

Removed 14529 non-current activities from a total of 19477.
4948 current activities remain.


In [33]:
merged_currents = pd.merge(current_check_log, comprehensiveness_is_current_details, on="iati-identifier")
merged_currents.groupby(['pwyf_current', 'publishingstats_comprehensiveness_current']).size().unstack()
# Note: these numbers will not add up properly if there are duplicate iati-identifiers

| pwyf_current | publishingstats_comprehensiveness_current = False | publishingstats_comprehensiveness_current = True |
|-|-|-|
| False | 14285 | 1 |
| True | 966 | 5859 |
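The four cells above sum to 21,111 rows, more than the 19,477 activities reported earlier, which is consistent with the note about duplicate identifiers: a merge on `iati-identifier` pairs every matching row on the left with every matching row on the right. The sketch below reproduces the effect on hypothetical data and shows one way to list the duplicated identifiers.

```python
import pandas as pd

# Hypothetical logs; "XX-2" appears twice in both, as a duplicated identifier would
pwyf = pd.DataFrame({
    "iati-identifier": ["XX-1", "XX-2", "XX-2", "XX-3"],
    "pwyf_current": [True, True, True, False],
})
stats = pd.DataFrame({
    "iati-identifier": ["XX-1", "XX-2", "XX-2", "XX-3"],
    "publishingstats_comprehensiveness_current": [True, False, False, False],
})

merged = pd.merge(pwyf, stats, on="iati-identifier")
# The duplicated pair expands to 2 x 2 rows, so the merge has more rows than either input
print(len(pwyf), len(merged))

# Identifiers that occur more than once in the PWYF log
dupes = pwyf.loc[pwyf["iati-identifier"].duplicated(keep=False),
                 "iati-identifier"].unique()
print(list(dupes))

# Same cross-tabulation as the groupby/unstack above
print(pd.crosstab(merged["pwyf_current"],
                  merged["publishingstats_comprehensiveness_current"]))
```

Dropping or investigating the identifiers in `dupes` before merging would make the cross-tabulation add up to the number of activities.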
