# IATI Publisher Data Getter


| Version | Date | Description |
|-|-|-|
| 1.0 | 2019-09-23 | Initial run, investigating document link and related activity schema validation errors |
| 1.1 | 2019-09-23 | Updating the ad-hoc analysis section for clarity, expanding ruleset validation presentation |
| 2.0 | 2019-10-07 | Added non-current filter following PWYF rules, added codelist outputs to the ruleset evaluation, and added an exported .xlsx file for ruleset evaluation sheets. |
| 2.1 | 2019-11-27 | (Ben W) Use lxml's huge_tree param to support bigger files |
| 3.0 | 2019-01-17 | (Ben W) Add file upload option, run CoVE validation on current activities only, split current_dict into a function so we can test it in another notebook, and run some checks from the IATI Publishing Statistics. |
| 4.0 | 2020-02-20 | (Jared P) Add a field to set the current date used and XML output of non-current activities |
| 5.0 | 2021-02-04 | (Jared P) Provide a filter for activity status |

Copyright (C) 2019 Open Data Services Co-operative Limited

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.


## Imports, Downloading, Merging


### Imports

In [155]:
! pip install XlsxWriter



In [0]:
import pandas as pd
import requests as rq
import lxml.etree as ET
import json
import shutil
import copy

### Setup

In [0]:
# Warning huge_tree disables security restrictions, so should only be used on trusted XML files
parser = ET.XMLParser(huge_tree=True)

#define comment removal method
def remove_comments(etree):
  
  comments = etree.xpath('//comment()')

  for c in comments:
      p = c.getparent()
      p.remove(c)

  return etree

### Download or Upload

In [0]:
data_mode = "Download" #@param ["Download", "Upload"]

#### Download

Input the IATI Registry ID of a given IATI Publisher.

`activity_status_filter` should be a list of the activity-status codes that you want to include in the output, as defined in the [IATI Standard (v2.03)](http://reference.iatistandard.org/203/codelists/ActivityStatus/). Leave it as an empty list to keep every status. For example:

```
activity_status_filter: [3,4]
```

Set `current_only` to filter the combined IATI to consider only activities which are 'current' according to the [PWYF ATI Methodology](https://github.com/pwyf/latest-index-indicator-definitions/issues/1).

Set `current_until` to provide a date which will be used to calculate the current period (12 months prior to this date).

Set `keep_non_current` if you want to download a copy of the non-current activities.

Exceptions can be added as an array of dataset_id strings such as this:

```json
exceptions: ["dataset1", "dataset2", "..."]
```

Currently only Activity files are supported.
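
As a rough illustration of the `current_until` setting described above, the sketch below works out the start of the 12-month window. It uses only the standard library rather than `dateutil`, and the helper name `current_window_start` is hypothetical, not part of this notebook.

```python
import datetime as dt


def current_window_start(current_until):
    """Return the date 12 months before `current_until` (format YYYY-MM-DD)."""
    end = dt.datetime.strptime(current_until, "%Y-%m-%d")
    try:
        return end.replace(year=end.year - 1)
    except ValueError:
        # 29 February has no counterpart in the previous year, so fall back to the 28th
        return end.replace(year=end.year - 1, day=28)


window_start = current_window_start("2020-03-31")
print(window_start.date())
```

Dates later than this value are the ones that count towards an activity being treated as current.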

In [0]:
if data_mode == "Download":
  registry_id = "gac-amc" #@param {type:"string"}
  filetype = "Activities" #@param ["Activities", "Organisations"]
  activity_status_filter = [3,4] #@param {type: "raw"}
  current_only = False #@param {type:"boolean"}
  if current_only:
    current_until = "2020-03-31" #@param {type:"date"}
    keep_non_current = False #@param {type:"boolean"}
  exceptions =  [] #@param {type:"raw"}


In [0]:
if data_mode == "Download":
  datasets = pd.read_csv("https://iatiregistry.org/csv/download/"+registry_id)

In [161]:
if data_mode == "Download":
  # remove unwanted datasets

  if filetype == "Activities":
    datasets = datasets[datasets['file-type'] != 'organisation']
  else: raise Exception('Currently, this notebook only supports IATI Activities, though could be easily modified to support Organisations')

  datasets = datasets[~datasets['registry-file-id'].isin(exceptions)]
  datasets = datasets.reset_index()

  print("Removed unwanted activities and setup comment-removal method")

Removed unwanted activities and setup comment-removal method


In [162]:
if data_mode == "Download":
  print("\nCombining {} IATI files \n".format(len(datasets['source-url'])))

  # Start with the first file, with comments removed
  big_iati = remove_comments(ET.fromstring(rq.get(datasets['source-url'][0]).content, parser=parser))

  # Start a dictionary to keep track of the additions
  merge_log = {datasets['source-url'][0]: len(big_iati.getchildren())}

  # Iterate through the second through last files,
  # insert their activities into the first,
  # and update the dictionary
  for url in datasets['source-url'][1:]:
      # Reuse the huge_tree parser so that large files parse here too
      data = remove_comments(ET.fromstring(rq.get(url).content, parser=parser))
      merge_log[url] = len(data.getchildren())
      big_iati.extend(data.getchildren())

  # Print a small report on the merging
  print("Files Merged: ")
  for file, activity_count in merge_log.items():
      print("|-> {} activities from {}".format(activity_count, file))
  print("|--> {} in total".format(len(big_iati.getchildren())))

  with open("combined.xml", "wb+") as out_file:
      out_file.write(ET.tostring(big_iati, encoding='utf8', pretty_print=True))


Combining 2 IATI files 

Files Merged: 
|-> 1572 activities from http://w05.international.gc.ca/projectbrowser-banqueprojets/iita-iati/dfatd-maecd_activit_status_2_3.xml
|-> 3396 activities from http://w05.international.gc.ca/projectbrowser-banqueprojets/iita-iati/dfatd-maecd_activit_status_4.xml
|--> 4968 in total
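
The merge step above can be tried offline with a minimal sketch like the one below. It assumes two tiny in-memory documents instead of downloaded files, and uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs. The file names and the `merge_documents` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the downloaded files
FILES = {
    "first.xml": "<iati-activities><iati-activity/><iati-activity/></iati-activities>",
    "second.xml": "<iati-activities><iati-activity/></iati-activities>",
}


def merge_documents(documents):
    """Append the activities of every later document to the first one."""
    names = list(documents)
    combined = ET.fromstring(documents[names[0]])
    merge_log = {names[0]: len(combined)}
    for name in names[1:]:
        root = ET.fromstring(documents[name])
        merge_log[name] = len(root)
        combined.extend(list(root))
    return combined, merge_log


combined, merge_log = merge_documents(FILES)
print(merge_log, len(combined))
```

Because the first document's root element is reused, its attributes (such as the IATI version) carry over to the combined file, while those of later files are dropped.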


#### Upload

In [0]:
if data_mode == "Upload":
  filename = "upload.xml" #@param {type:"string"}
  activity_status_filter = [] #@param {type: "raw"}
  current_only = True #@param {type:"boolean"}
  if current_only == True:
    current_until = "2020-01-29" #@param {type:"date"}
    keep_non_current = False #@param {type:"boolean"}

Go to View -> "Table of contents" if the left pane isn't open, click Files at the top, then click Upload. Upload the file, then update `filename` in the cell above to match the name of the file you uploaded. Note that uploaded files are cleared when the session ends.

In [0]:
if data_mode == "Upload":
  shutil.copyfile(filename, 'combined.xml')
  big_iati = remove_comments(ET.parse('combined.xml', parser=parser).getroot())

### Filter current activities

In [0]:
import datetime as dt
from dateutil.relativedelta import relativedelta

if current_only:
  selected_date = dt.datetime.strptime(current_until, "%Y-%m-%d")
else:
  selected_date = dt.datetime.now()

def current_dict(activity):
  status_check = False
  planned_end_date_check = False
  actual_end_date_check = False
  transaction_date_check = False

  # print("Activity {} of {}".format(count, len(big_iati)))
  
  if activity.xpath("activity-status[@code=2]"):
    status_check = True

  if activity.xpath("activity-date[@type=3]/@iso-date"):
    date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=3]/@iso-date")[0], '%Y-%m-%d')
    if date_time_obj > (selected_date - relativedelta(years=1)):
      planned_end_date_check = True
  
  if activity.xpath("activity-date[@type=4]/@iso-date"):
    date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=4]/@iso-date")[0], '%Y-%m-%d')
    if date_time_obj > (selected_date - relativedelta(years=1)):
      actual_end_date_check = True

  if activity.xpath("transaction/transaction-type[@code=2 or @code=3 or @code=4]"):
    dates = activity.xpath("transaction[transaction-type[@code=2 or @code=3 or @code=4]]/transaction-date/@iso-date")
    date_truths = [dt.datetime.strptime(date, '%Y-%m-%d') > (selected_date - relativedelta(years=1)) for date in dates]
    if True in date_truths:
      transaction_date_check = True

  pwyf_current = status_check or planned_end_date_check or actual_end_date_check or transaction_date_check

  return {
    'iati-identifier': activity.findtext('iati-identifier'),
    'status_check': status_check, 
    'planned_end_date_check': planned_end_date_check, 
    'actual_end_date_check': actual_end_date_check, 
    'transaction_date_check': transaction_date_check,
    'pwyf_current': pwyf_current,
  }
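
To see how the four checks combine, here is a simplified sketch run against three hand-written activities. It is not the function above: it uses the standard library's `xml.etree.ElementTree` instead of lxml, returns a single boolean, and the identifiers `XX-1` to `XX-3` and the helper `is_current` are made up for illustration.

```python
import datetime as dt
import xml.etree.ElementTree as ET

SAMPLE = """
<iati-activities>
  <iati-activity>
    <iati-identifier>XX-1</iati-identifier>
    <activity-status code="2"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-2</iati-identifier>
    <activity-status code="4"/>
    <activity-date type="4" iso-date="2015-06-30"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-3</iati-identifier>
    <activity-status code="4"/>
    <transaction>
      <transaction-type code="3"/>
      <transaction-date iso-date="2019-12-01"/>
    </transaction>
  </iati-activity>
</iati-activities>
"""


def is_current(activity, cutoff):
    """Any one passing check makes the activity current."""
    def after_cutoff(iso_date):
        return dt.datetime.strptime(iso_date, "%Y-%m-%d") > cutoff

    # Check 1: activity status is 2
    if activity.find("activity-status[@code='2']") is not None:
        return True
    # Checks 2 and 3: planned (3) or actual (4) end date after the cutoff
    for date_type in ("3", "4"):
        node = activity.find("activity-date[@type='{}']".format(date_type))
        if node is not None and after_cutoff(node.get("iso-date")):
            return True
    # Check 4: a transaction of type 2, 3 or 4 dated after the cutoff
    for transaction in activity.findall("transaction"):
        t_type = transaction.find("transaction-type")
        t_date = transaction.find("transaction-date")
        if t_type is not None and t_date is not None and t_type.get("code") in ("2", "3", "4"):
            if after_cutoff(t_date.get("iso-date")):
                return True
    return False


cutoff = dt.datetime(2019, 3, 31)  # 12 months before a current_until of 2020-03-31
root = ET.fromstring(SAMPLE)
results = {a.findtext("iati-identifier"): is_current(a, cutoff) for a in root}
print(results)
```

In this sample `XX-1` passes on status alone, `XX-3` passes on a recent disbursement, and `XX-2` fails every check.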
  

In [0]:
# Filter out non-current activities, if appropriate
# See https://github.com/pwyf/latest-index-indicator-definitions/issues/1

log_columns = ["iati-identifier", "status_check", "planned_end_date_check", "actual_end_date_check", "transaction_date_check", "pwyf_current"]

# Collect one row per activity and build the DataFrame in one go
# (DataFrame.append was removed in pandas 2.0)
current_check_log = pd.DataFrame(
    [current_dict(activity) for activity in big_iati],
    columns=log_columns)

current_check_log.to_csv("current_check_log.csv")

In [0]:
big_iati_archived = copy.copy(big_iati)

In [0]:
import datetime as dt

if current_only:
  selected_date = dt.datetime.strptime(current_until, "%Y-%m-%d")
else:
  selected_date = dt.datetime.now()

def print_non_current(activity):
    print_output = ""

    iati_identifier = activity.findtext('iati-identifier')
    activity_status = activity.xpath("activity-status")[0].values()
    print_output += "-----------------------Non-Current Activity-------------------------\n"
    print_output += "iati-identifier: {}\n".format(iati_identifier)
    print_output += "activity-status: {}\n".format(activity_status)

    if activity.xpath("activity-date[@type=3]/@iso-date"):
      date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=3]/@iso-date")[0], '%Y-%m-%d')
      print_output += "activity-date[@type=3]/@iso-date: {}\n".format(date_time_obj)
      
    if activity.xpath("activity-date[@type=4]/@iso-date"):
      date_time_obj = dt.datetime.strptime(activity.xpath("activity-date[@type=4]/@iso-date")[0], '%Y-%m-%d')
      print_output += "activity-date[@type=4]/@iso-date: {}\n".format(date_time_obj)


    if activity.xpath("transaction/transaction-type[@code=2]"):
      dates = activity.xpath("transaction[transaction-type[@code=2]]/transaction-date/@iso-date")
      print_output += "transaction/transaction-type[@code=2]: {}\n".format(dates)

    if activity.xpath("transaction/transaction-type[@code=3]"):
      dates = activity.xpath("transaction[transaction-type[@code=3]]/transaction-date/@iso-date")
      print_output += "transaction/transaction-type[@code=3]: {}\n".format(dates)

    if activity.xpath("transaction/transaction-type[@code=4]"):
      dates = activity.xpath("transaction[transaction-type[@code=4]]/transaction-date/@iso-date")
      print_output += "transaction/transaction-type[@code=4]: {}\n".format(dates)

    return print_output
    

In [169]:
cur_length = len(big_iati)
if current_only and keep_non_current:
  non_current_iati = copy.copy(big_iati)

if current_only:
  for activity in big_iati:
    if activity.findtext('iati-identifier') in current_check_log.loc[current_check_log['pwyf_current'] == False, 'iati-identifier'].values: 
      activity.getparent().remove(activity)
  if keep_non_current:
    for activity in non_current_iati:
      if activity.findtext('iati-identifier') in current_check_log.loc[current_check_log['pwyf_current'] == True, 'iati-identifier'].values:  
        activity.getparent().remove(activity)

  print("Removed {} non-current activities from a total of {}.".format((cur_length-len(big_iati)),cur_length))
  print("{} current activities remain.".format(len(big_iati)))
  if keep_non_current:
    print("The {} non-current activities will be written to combined_non_current.xml".format(len(non_current_iati)))


else:
  print("As `current_only` is set to False, all retrieved activities have been kept")

As `current_only` is set to False, all retrieved activities have been kept


In [170]:
cur_length = len(big_iati)

if len(activity_status_filter) > 0:
  all_activity_statuses = [1,2,3,4,5,6]
  invert_statuses = [filtered_status for filtered_status in all_activity_statuses if filtered_status not in activity_status_filter]
  print("Removing all activities with activity-status code = {0}".format(invert_statuses))
  for status in invert_statuses:
    for activity in big_iati:
      if activity.xpath("activity-status/@code = {0}".format(status)): 
        activity.getparent().remove(activity)

  print("Removed {} which did not have the specified activity statuses from a total of {}.".format((cur_length-len(big_iati)),cur_length))
  print("{} current activities remain.".format(len(big_iati)))

Removing all activities with activity-status code = [1, 2, 5, 6]
Removed 1221 which did not have the specified activity statuses from a total of 4968.
3747 current activities remain.
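
The status filter above can be sketched on a small hand-written document as follows. This is an illustration only: it uses the standard library's `xml.etree.ElementTree` instead of lxml, and the sample identifiers and the `filter_by_status` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

ALL_ACTIVITY_STATUSES = [1, 2, 3, 4, 5, 6]

SAMPLE = (
    "<iati-activities>"
    "<iati-activity><iati-identifier>XX-1</iati-identifier><activity-status code='2'/></iati-activity>"
    "<iati-activity><iati-identifier>XX-2</iati-identifier><activity-status code='3'/></iati-activity>"
    "<iati-activity><iati-identifier>XX-3</iati-identifier><activity-status code='4'/></iati-activity>"
    "</iati-activities>"
)


def filter_by_status(root, keep):
    """Remove activities whose status code is a known code not listed in `keep`."""
    drop = {str(status) for status in ALL_ACTIVITY_STATUSES if status not in keep}
    # Iterate over a copy of the child list so removal does not disturb the loop
    for activity in list(root):
        status = activity.find("activity-status")
        if status is not None and status.get("code") in drop:
            root.remove(activity)
    return root


filtered = filter_by_status(ET.fromstring(SAMPLE), keep=[3, 4])
remaining = [a.findtext("iati-identifier") for a in filtered]
print(remaining)
```

As in the cell above, an activity with no `activity-status` element is kept, because only the listed codes are ever removed.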


In [0]:
with open("combined_current.xml", "wb+") as out_file:
  out_file.write(ET.tostring(big_iati, encoding='utf8', pretty_print=True))

In [0]:
if current_only and keep_non_current:
  with open("combined_non_current.xml", "wb+") as out_file:
    out_file.write(ET.tostring(non_current_iati, encoding='utf8', pretty_print=True))
  
  with open("non_current_details.txt", "w") as out_file:
    out_file.write("There are {} non-current activities.\n".format(len(non_current_iati)))
    for activity in non_current_iati:
      out_file.write(print_non_current(activity))

## Ad Hoc Analysis

This section can be used to evaluate specific aspects of the total corpus of data. For instance, with `coverage_check()` you can count the activities which include specific elements or which satisfy certain conditions. This requires some Python and XPath knowledge. Some examples are included below.

In [0]:
def coverage_check(tree, path, manual_list_entry=False):
  if manual_list_entry:
    denominator = len(tree)
    numerator = len(path)
  else:
    denominator = len(tree.getchildren())
    numerator = len(tree.xpath(path))

  coverage = numerator / denominator
  return denominator, numerator, coverage
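
For a quick feel of what the three returned numbers mean, the sketch below applies the same idea to a toy document. It is an assumption-laden stand-in: `coverage` is a hypothetical helper built on the standard library's `xml.etree.ElementTree` (using `findall` rather than lxml's `xpath`), and the four sample activities are made up.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<iati-activities>"
    "<iati-activity><transaction/></iati-activity>"
    "<iati-activity><transaction/><transaction/></iati-activity>"
    "<iati-activity><transaction/></iati-activity>"
    "<iati-activity/>"
    "</iati-activities>"
)


def coverage(root, path):
    """Return (all activities, activities matching `path`, their ratio)."""
    denominator = len(root)
    numerator = len(root.findall(path))
    return denominator, numerator, numerator / denominator


result = coverage(ET.fromstring(SAMPLE), "iati-activity[transaction]")
print(result)
```

Three of the four toy activities contain at least one `transaction`, so the ratio is 0.75. Note that an empty document would raise `ZeroDivisionError`, both here and in `coverage_check()`.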

In [174]:
coverage_check(big_iati, "iati-activity[transaction]")

(3747, 3745, 0.9994662396583934)

In [175]:
coverage_check(big_iati, "iati-activity[capital-spend]")

(3747, 830, 0.22151054176674673)

In [176]:
# activities with a disbursement
coverage_check(big_iati, "iati-activity[transaction/transaction-type/@code = 3]")

(3747, 3301, 0.8809714438217241)

In [177]:
# Manual entry of two lists to see the proportion of transactions which are disbursements
coverage_check(
    big_iati.xpath("iati-activity/transaction"), 
    big_iati.xpath("iati-activity/transaction[transaction-type/@code = 3]"), 
    True)

(21044, 17284, 0.8213267439650257)

## Batch CoVE Validation

In [178]:
json_validation_filepath = 'validation.json'

url = 'https://iati.cove.opendataservices.coop/api_test'
# Open the file in a context manager so the handle is closed after the upload
with open("combined_current.xml", 'rb') as xml_file:
    r = rq.post(url, files={'file': xml_file}, data={"name": "combined_current.xml"})

print(r)

print("CoVE validation was successful." if r.ok else "Something went wrong.")

validation_json = r.json()

with open(json_validation_filepath, "w") as out_file:
    json.dump(validation_json, out_file)

print('Validation JSON file has been written to {}.'.format(
    json_validation_filepath))

<Response [200]>
CoVE validation was successful.
Validation JSON file has been written to validation.json.


In [179]:
ruleset_table = pd.DataFrame(data=validation_json['ruleset_errors'])
schema_table = pd.DataFrame(data=validation_json['validation_errors'])
embedded_codelist_table = pd.DataFrame(data=validation_json['invalid_embedded_codelist_values'])
non_embedded_codelist_table = pd.DataFrame(data=validation_json['invalid_non_embedded_codelist_values'])

print(
    "CoVE has found: \n* {} schema errors \n* {} ruleset errors \n* {} embedded codelist errors \n* {} non-embedded codelist errors".format(
    len(schema_table), 
    len(ruleset_table), 
    len(embedded_codelist_table), 
    len(non_embedded_codelist_table)))

print("\nWriting to validation_workbook.xlsx")
writer = pd.ExcelWriter('validation_workbook.xlsx', engine='xlsxwriter')
# Write each dataframe to a different worksheet.
schema_table.to_excel(writer, sheet_name='schema_table')
ruleset_table.to_excel(writer, sheet_name='ruleset_table')
embedded_codelist_table.to_excel(writer, sheet_name='embedded_codelist_table')
non_embedded_codelist_table.to_excel(writer, sheet_name='non_embedded_codelist_table')

# Close the Pandas Excel writer and output the Excel file.
# (ExcelWriter.save() was removed in pandas 2.0; close() also writes the file)
writer.close()


CoVE has found: 
* 0 schema errors 
* 21 ruleset errors 
* 0 embedded codelist errors 
* 62 non-embedded codelist errors

Writing to validation_workbook.xlsx


### Schema Validation

In [180]:
schema_table

#### Custom Analysis

This section gives space to investigate schema errors identified in the section above. It requires a small amount of tinkering in Python.

In [0]:
# To view offending XML element, take the '/NN/' from the path above, add one, 
# and then modify the remaining content of the path to print a section of XML.
# print(ET.tostring(big_iati.xpath("iati-activity[61]/related-activity")[0].getparent()).decode())

Note the lack of description in the XML output above
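
The "take the '/NN/', add one" step from the cell above can be automated with a small helper. This is a sketch under an assumption: that the reported path uses zero-based numeric segments such as `iati-activity/60/related-activity`. Both the path format and the name `cove_path_to_xpath` are illustrative, so check them against the actual `path` column before relying on this.

```python
import re


def cove_path_to_xpath(path):
    """Turn zero-based '/NN' segments into one-based XPath '[NN+1]' predicates."""
    return re.sub(
        r"/(\d+)(?=/|$)",
        lambda match: "[{}]".format(int(match.group(1)) + 1),
        path)


xpath = cove_path_to_xpath("iati-activity/60/related-activity")
print(xpath)
```

The resulting string can then be passed to `big_iati.xpath(...)` in the same way as the commented-out example.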

### Ruleset Validation

In [182]:
# Concise Summary
ruleset_table.pivot_table(index=['rule',], aggfunc='count').drop(columns=["path", "ruleset", "explanation"])

Unnamed: 0_level_0,id
rule,Unnamed: 1_level_1
"activity-date[@type=""2""]/@iso-date must be before activity-date[@type=""4""]/@iso-date",19
"activity-date[date @type=""1""] or activity-date[@type=""2""] must be present",2


In [183]:
# Full Table
ruleset_table.head()

Unnamed: 0,path,id,ruleset,explanation,rule
0,/iati-activities/iati-activity[159]/activity-d...,CA-3-M013793001,Start dates must be chronologically before end...,Start date (2013-01-02) must be before end dat...,"activity-date[@type=""2""]/@iso-date must be bef..."
1,/iati-activities/iati-activity[312]/activity-d...,CA-3-D004490001,Start dates must be chronologically before end...,Start date (2017-04-07) must be before end dat...,"activity-date[@type=""2""]/@iso-date must be bef..."
2,/iati-activities/iati-activity[825]/activity-d...,CA-3-A033637001,Start dates must be chronologically before end...,Start date (2007-03-27) must be before end dat...,"activity-date[@type=""2""]/@iso-date must be bef..."
3,/iati-activities/iati-activity[1149]/activity-...,CA-3-A035214001,Start dates must be chronologically before end...,Start date (2011-03-30) must be before end dat...,"activity-date[@type=""2""]/@iso-date must be bef..."
4,/iati-activities/iati-activity[1342]/activity-...,CA-3-D000578001,Start dates must be chronologically before end...,Start date (2014-03-28) must be before end dat...,"activity-date[@type=""2""]/@iso-date must be bef..."


## ATI Data Quality Testing

Download `combined.xml` and upload it to [this testing tool](http://dataqualitytester.publishwhatyoufund.org/) to check data quality in line with the Aid Transparency Index Methodology.

| Date | Link | Notes |
|-|-|-|
|YYYY-MM-DD|[Link](http://dataqualitytester.publishwhatyoufund.org/package/bb957674-6ccf-4635-a553-1d2dd0382075)| Some description of findings or link to notes / report |
||||
||||
||||
||||

## IATI Publisher Statistics

Run some code from the [OpenDataServices/iati-publishingstats-details](https://github.com/OpenDataServices/iati-publishingstats-details) repo, which produces per activity CSVs for the [IATI Publishing Statistics](http://publishingstats.iatistandard.org/) checks.

In [184]:
import os
os.chdir('/content')
!rm -r iati-publishingstats-details
!git clone https://github.com/OpenDataServices/iati-publishingstats-details.git
os.chdir('/content/iati-publishingstats-details')
!git submodule init
!git submodule update
!mkdir logs
# Note that the Publishing Statistics code requires Python 2, so needs a virtualenv
!sudo apt install python-virtualenv > logs/apt.log
!virtualenv .ve
!source .ve/bin/activate; pip install -r requirements.txt > logs/requirements.log
!source .ve/bin/activate; ./fetch_helpers.sh > logs/fetch_helpers.log 2>&1
!source .ve/bin/activate; python forward_looking_details.py ../combined.xml > forward_looking_details.csv
!source .ve/bin/activate; python comprehensiveness_is_current_details.py ../combined.xml > comprehensiveness_is_current_details.csv
os.chdir('/content')

Cloning into 'iati-publishingstats-details'...
remote: Enumerating objects: 30, done.

In [185]:
os.chdir('/content')
forward_looking_details = pd.read_csv('iati-publishingstats-details/forward_looking_details.csv')
forward_looking_details

Unnamed: 0,iati-identifier,First year to fail,End dates,Budget years
0,CA-3-A021378002,2020,2020-09-30,20172019
1,CA-3-A032220001,2020,2020-12-31,200720082009201020172018
2,CA-3-A032561001,2020,2020-12-31,200820092010201120122017
3,CA-3-A032615001,2020,2020-09-30,"2007,2008,2009,2010,2011,2012,2013,2014,2017,2..."
4,CA-3-A032615005,2020,2020-09-30,2016201720182019
...,...,...,...,...
322,CA-3-P008053001,2021,2021-03-31,20192020
323,CA-3-P008097001,2020,2021-07-31,2019
324,CA-3-P008162001,2021,2021-11-30,201820192020
325,CA-3-P008241001,2020,2020-12-31,2019


In [186]:
comprehensiveness_is_current_details = pd.read_csv('iati-publishingstats-details/comprehensiveness_is_current_details.csv')
comprehensiveness_is_current_details

Unnamed: 0,iati-identifier,publishingstats_comprehensiveness_current
0,CA-3-A031268001,False
1,CA-3-A031470001,False
2,CA-3-A031708001,False
3,CA-3-A031708003,False
4,CA-3-A031717001,False
...,...,...
4963,CA-3-Z021047001,False
4964,CA-3-Z021048001,False
4965,CA-3-Z021059001,False
4966,CA-3-Z021065001,False


In [187]:
big_iati_comprehensiveness_current = copy.copy(big_iati_archived)

cur_length = len(big_iati_comprehensiveness_current)

if current_only:
  for activity in big_iati_comprehensiveness_current:
    if activity.findtext('iati-identifier') in comprehensiveness_is_current_details.loc[comprehensiveness_is_current_details['publishingstats_comprehensiveness_current'] == False, 'iati-identifier'].values:
      activity.getparent().remove(activity)
  
  print("Removed {} non-current activities from a total of {}.".format((cur_length-len(big_iati_comprehensiveness_current)),cur_length))
  print("{} current activities remain.".format(len(big_iati_comprehensiveness_current)))

else:
  print("As `current_only` is set to False, all retrieved activities have been kept")

As `current_only` is set to False, all retrieved activities have been kept


In [188]:
merged_currents = pd.merge(current_check_log, comprehensiveness_is_current_details, on="iati-identifier")
merged_currents.groupby(['pwyf_current', 'publishingstats_comprehensiveness_current']).size().unstack()
# Note: these numbers will not add up properly if there are duplicate iati-identifiers

publishingstats_comprehensiveness_current,False,True
pwyf_current,Unnamed: 1_level_1,Unnamed: 2_level_1
False,3480,134
True,508,846
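
One way to read the table above is as an agreement rate between the two definitions of "current". The sketch below copies the four counts from that output into a plain dictionary, so it does not need pandas. The variable names are illustrative.

```python
# Counts copied from the cross-tabulation above, keyed by
# (pwyf_current, publishingstats_comprehensiveness_current)
crosstab = {
    (False, False): 3480,
    (False, True): 134,
    (True, False): 508,
    (True, True): 846,
}

total = sum(crosstab.values())
# The two checks agree when both flags have the same value
agree = sum(n for (pwyf, stats), n in crosstab.items() if pwyf == stats)
agreement_rate = agree / total

print("{} of {} activities ({:.1%}) get the same answer from both checks".format(
    agree, total, agreement_rate))
```

The total matches the 4968 merged activities, which suggests there were no duplicate identifiers inflating the merge in this run.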
