# Project to explore document-links in IATI data

# Description

IATI data contains a document-links element in both the organisation and activity schemas. Reporting organisations can populate these with links to different categories of supporting documents published online in a variety of formats and languages.

This project explores the publication of document-links in the IATI corpus and checks what lies at the end of the links. It also provides an interactive section where the 15 organisations that report the most document-links can be summarised.


This project consists of the following sections:

- Preparation: where a complete corpus of document-links data is downloaded from IATI Tables and merged
- Analysis: where a subset of unique (by reporting organisation and URL) document-links is created and analysed
- Examine: where one of the fifteen most prolific publishers of document-links can be selected and their publication summarised
- Availability: where the URL of each document is sent a HEAD request and the response is recorded. Redirections are followed.


# Preparation

## Set up

Import the libraries used in the notebook and connect to https://iati-tables.codeforiati.org/ to pull data.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import requests
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import Counter
import aiohttp
import asyncio
import nest_asyncio
from tqdm.notebook import tqdm
import os
import concurrent.futures
import re
import random
import time

sns.set_context('notebook')

#start noteql session
import noteql
# Restart postgres to make sure any existing connections get dropped
!sudo service postgresql restart
session = noteql.Session(datasette_url='https://datasette.tables.iatistandard.org/iati.json', connect_args={'connect_timeout': 1000})

In [2]:
# Record the datetime at which this run of the notebook started, as an approximation of when the data was accessed
last_run_start_time = datetime.today().strftime('%Y-%m-%d %H:%M:%S')
last_run_start_time

## Downloads

There are seven places in the IATI Standard where document-links can be declared: at activity level, at organisation level and in five different places within the results element. Each of these has its own table in IATI Tables. These tables are downloaded in their entirety. An additional 'source' column is added to hold the name of the table the link was accessed from, for later analysis.

Organisation data downloaded from the Registry is also loaded into a DataFrame to enhance the organisation documents DataFrame and make its structure consistent with the other document-links DataFrames.

### Download the documentlink table in its entirety
This block paginates through the table, fetching chunks in parallel, to retrieve all document links, including duplicated URLs.

In [141]:
datasette_url = 'https://datasette.tables.iatistandard.org'
count_url = f"{datasette_url}/iati.json?sql=SELECT+Count(*)+AS+TOTAL+FROM+documentlink"
response = requests.get(count_url, timeout=120)
response.raise_for_status()  # stop here rather than continue with an unknown row count
total_count = response.json()['rows'][0][0]
print(f"Total rows to fetch: {total_count}")


def fetch_chunk(start_offset, size):
    query_url = f"{datasette_url}/iati.json?sql=SELECT+*+FROM+documentlink+LIMIT+{size}+OFFSET+{start_offset}"
    response = requests.get(query_url, timeout=120)
    if response.status_code == 200:
        return response.json()['rows']
    return []

# Set up chunks
chunk_size = 20000
offsets = list(range(0, total_count, chunk_size))

# Process in parallel
all_data = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_offset = {executor.submit(fetch_chunk, offset, chunk_size): offset for offset in offsets}
    for future in tqdm(concurrent.futures.as_completed(future_to_offset), total=len(offsets)):
        offset = future_to_offset[future]
        try:
            data = future.result()
            all_data.extend(data)
        except Exception as e:
            print(f"Error with offset {offset}: {e}")
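The same count, chunk and fetch pattern is repeated later for other large tables, so it could be wrapped in one helper. The following is a minimal sketch under a few assumptions, not the notebook's own method: `run_sql` and `fetch_table` are hypothetical names, the standard library's `urllib` stands in for `requests` only to keep the sketch dependency-free, and `executor.map` is used so that chunks come back in offset order. Unlike the cell above, a failed request raises instead of silently contributing an empty chunk. SQL does not guarantee a stable row order without an `ORDER BY`, so adding one on a unique column may be worth considering.

```python
import concurrent.futures
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATASETTE_URL = "https://datasette.tables.iatistandard.org"  # same base URL as above


def run_sql(sql, timeout=120):
    """Run one SQL query against the Datasette JSON endpoint and return its rows."""
    query = urlencode({"sql": sql})
    with urlopen(f"{DATASETTE_URL}/iati.json?{query}", timeout=timeout) as response:
        return json.load(response)["rows"]


def fetch_table(table, total_count, chunk_size=20000, max_workers=4, run=run_sql):
    """Fetch all rows of `table` in parallel LIMIT/OFFSET chunks (hypothetical helper)."""
    offsets = range(0, total_count, chunk_size)
    queries = [
        f"SELECT * FROM {table} LIMIT {chunk_size} OFFSET {offset}"
        for offset in offsets
    ]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of `queries`, not completion order
        chunks = list(executor.map(run, queries))
    # Flatten the list of chunks into one list of rows
    return [row for chunk in chunks for row in chunk]
```

With this in place, a call such as `fetch_table('documentlink', total_count)` would return the same kind of row list as `all_data`. The `run` parameter exists so the chunking logic can be exercised without network access.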

In [144]:
if all_data:
    column_url = f"{datasette_url}/iati.json?sql=SELECT+*+FROM+documentlink+LIMIT+1"
    column_response = requests.get(column_url)
    columns = column_response.json()['columns']

In [147]:
df_activity_documents = pd.DataFrame(all_data, columns=columns)
df_activity_documents

In [150]:
## Add a new column, 'source' with a default value of 'activity'.
df_activity_documents = df_activity_documents.assign(source='activity')
df_activity_documents

In [151]:
## Which document-links are published the most in activity files?
df_activity_documents.groupby(['url']).count()

### Download the organisation_documentlink table in its entirety

In [152]:
%%nql SHOW df_org_documents=DF
select * from organisation_documentlink

In [153]:
## Add a new column, 'source' with a default value of 'organisation'.
df_org_documents = df_org_documents.assign(source='organisation')
df_org_documents

In [154]:
## Which document-links are published the most?
df_org_documents.groupby(['url']).count()

### Download the result_documentlink table

In [155]:
%%nql SHOW df_result_documentlink=DF
select * from result_documentlink

In [156]:
## Add a new column, 'source' with a default value of 'result_documentlink'.
df_result_documentlink = df_result_documentlink.assign(source='result_documentlink')
df_result_documentlink

In [157]:
## Which document-links are published the most in result_documentlink files?
df_result_documentlink.groupby(['url']).count()

### Download the result_indicator_documentlink table

In [158]:
%%nql SHOW df_result_indicator_documentlink=DF
select * from result_indicator_documentlink

In [159]:
## Add a new column, 'source' with a default value of 'result_indicator_documentlink'.
df_result_indicator_documentlink = df_result_indicator_documentlink.assign(source='result_indicator_documentlink')
df_result_indicator_documentlink

In [160]:
## Which document-links are published the most in result_indicator_documentlink files?
df_result_indicator_documentlink.groupby(['url']).count()

### Download the result_indicator_baseline_documentlink table

In [161]:
%%nql SHOW df_result_indicator_baseline_documentlink=DF
select * from result_indicator_baseline_documentlink


In [162]:
## Add a new column, 'source' with a default value of 'result_indicator_baseline_documentlink'.
df_result_indicator_baseline_documentlink = df_result_indicator_baseline_documentlink.assign(source='result_indicator_baseline_documentlink')
df_result_indicator_baseline_documentlink

In [163]:
## Which document-links are published the most in result_indicator_baseline_documentlink files?
df_result_indicator_baseline_documentlink.groupby(['url']).count()

### Download the result_indicator_period_actual_documentlink table

In [164]:
%%nql SHOW df_result_indicator_period_actual_documentlink=DF
select * from result_indicator_period_actual_documentlink

In [165]:
## Add a new column, 'source' with a default value of 'result_indicator_period_actual_documentlink'.
df_result_indicator_period_actual_documentlink = df_result_indicator_period_actual_documentlink.assign(source='result_indicator_period_actual_documentlink')
df_result_indicator_period_actual_documentlink

In [166]:
## Which document-links are published the most in result_indicator_period_actual_documentlink files?
df_result_indicator_period_actual_documentlink.groupby(['url']).count()

### Download the result_indicator_period_target_documentlink table

In [167]:
%%nql SHOW df_result_indicator_period_target_documentlink=DF
select * from result_indicator_period_target_documentlink

In [168]:
## Add a new column, 'source' with a default value of 'result_indicator_period_target_documentlink'.
df_result_indicator_period_target_documentlink = df_result_indicator_period_target_documentlink.assign(source='result_indicator_period_target_documentlink')
df_result_indicator_period_target_documentlink

In [169]:
## Which document-links are published the most in result_indicator_period_target_documentlink files?
df_result_indicator_period_target_documentlink.groupby(['url']).count()

### Download the organisation data from the IATI Registry 
Create a DataFrame from iati_publishers_list2025-04-11.csv, a list of publishers downloaded from the Registry. Once the Registry is updated, this step should be rewritten to pull the data live from the API, which does not currently return all the data we want.

In [170]:
df_registry_organisations = _dntk.execute_sql(
  'SELECT *\nFROM \'iati_publishers_list2025-04-11.csv\'',
  'SQL_DEEPNOTE_DATAFRAME_SQL',
  audit_sql_comment='',
  sql_cache_mode='cache_disabled',
  return_variable_type='dataframe'
)
df_registry_organisations

In [171]:
## Rename the 'IATI Organisation Identifier' to 'reportingorg_ref' to match IATI Tables based DataFrames
df_registry_organisations = df_registry_organisations.rename(columns={'IATI Organisation Identifier': 'reportingorg_ref'})

## Adding and Merging
Manipulate the DataFrames so they have a consistent structure.

Enhance the df_org_documents DataFrame, adding a reportingorg_ref column by matching on the organisation data from the Registry. 

Concatenate this with df_activity_documents. Concatenate the results-based DataFrames, dropping extraneous columns from the resulting DataFrame. Concatenate the two aggregate DataFrames.


In [172]:
## Delete the columns from df_activity_documents we won't be using. 'language' and 'category' are normalised so we drop them too
df_activity_documents = df_activity_documents.drop(columns=['akvo:photocredit', 'akvo:photoid', 'dataset', 'formatname', 'title', 'title_narrative','description_narrative', 'language', 'category'])
df_activity_documents.info()

In [173]:
## Delete the columns we won't be using from df_org_documents
df_org_documents = df_org_documents.drop(columns=['dataset', 'formatname', 'title_narrative','description_narrative', 'language'])
df_org_documents.info()

In [174]:
## Note: we need to add the reportingorg_ref to df_org_documents
df_activity_documents.columns.difference(df_org_documents.columns)

In [175]:
df_registry_organisations.info()

The organisations downloaded from the Registry don't have a 'prefix' column, which we need in order to match against the organisation documents DataFrame and add the reportingorg_ref column to df_org_documents.

The 'Datasets Link' column in the df_registry_organisations DataFrame contains the prefix after the base link of 'https://iatiregistry.org/publisher/'.
We can create a new column, 'prefix', by dropping this base link from the Datasets Link column.

In [176]:
# Take the text after the last '/' in each link; links that are missing or end in '/' give an empty prefix
df_registry_organisations['prefix'] = (
    df_registry_organisations['Datasets Link']
    .str.extract(r'/([^/]+)$', expand=False)
    .fillna('')
)

In [177]:
df_registry_organisations

Update the organisation documents DataFrame to include the reportingorg_refs, matching on 'prefix'

In [178]:
df_org_documents = pd.merge(df_org_documents, df_registry_organisations[['prefix','reportingorg_ref']], on='prefix', how='left')
df_org_documents
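Because this is a left merge, any organisation document whose prefix is not found in the Registry list keeps its row but gets an empty reportingorg_ref. A minimal sketch of how such rows could be surfaced, using the `indicator` option of pandas' `merge`; the data and the helper name `unmatched_prefixes` are invented for illustration and are not part of the notebook:

```python
import pandas as pd

# Toy data, invented for illustration only
org_docs = pd.DataFrame({
    "prefix": ["org-a", "org-b", "org-b"],
    "url": [
        "https://example.org/a.pdf",
        "https://example.org/b.pdf",
        "https://example.org/c.pdf",
    ],
})
registry = pd.DataFrame({"prefix": ["org-a"], "reportingorg_ref": ["XX-EXAMPLE-1"]})


def unmatched_prefixes(documents, registry):
    """Return the prefixes in `documents` that have no match in `registry`."""
    merged = documents.merge(
        registry[["prefix", "reportingorg_ref"]],
        on="prefix",
        how="left",
        indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
    )
    return sorted(set(merged.loc[merged["_merge"] == "left_only", "prefix"]))


print(unmatched_prefixes(org_docs, registry))
```

Running the same check on df_org_documents and df_registry_organisations would list the publishers whose organisation documents cannot be attributed to a reportingorg_ref.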

Now we have updated the organisation documents DataFrame to include the reportingorg_refs we can concatenate it with the activity documents DataFrame to create the all documents DataFrame

In [179]:
df_all_documents = pd.concat([df_org_documents, df_activity_documents], axis=0, join='inner', ignore_index=False)
df_all_documents

Now we need to go through the same process for the results-based document link DFs

In [180]:
# Concatenate all of the result-based DataFrames
df_all_result_documentlink = pd.concat([
    df_result_documentlink, 
    df_result_indicator_documentlink, 
    df_result_indicator_baseline_documentlink, 
    df_result_indicator_period_actual_documentlink, 
    df_result_indicator_period_target_documentlink
], axis=0, join='inner', ignore_index=False)
df_all_result_documentlink

In [181]:
## Drop extraneous columns
to_drop = df_all_result_documentlink.columns.difference(df_all_documents.columns)
to_drop

In [182]:
df_all_result_documentlink = df_all_result_documentlink.drop(columns=to_drop)
df_all_result_documentlink.info()

In [183]:
df_all_documents = pd.concat([df_all_documents, df_all_result_documentlink], axis=0, join='inner', ignore_index=False)
df_all_documents.info()

Linking on reportingorg_ref we can add the Publisher and Organisation Type from df_registry_organisations.

In [184]:
df_all_documents = pd.DataFrame(pd.merge(df_all_documents, df_registry_organisations[['reportingorg_ref', 'Publisher', 'Organization Type']], on='reportingorg_ref', how='left'))
df_all_documents

In [185]:
total_document_links = df_all_documents.shape[0]
total_document_links

# Analysis

Most, but not all, of the analysis below is either used or duplicated in the Streamlit application. It was developed here before being transferred to the application. If run, it will produce different results from those in the application, which is a snapshot in time and is not updated by running this notebook.

The next two blocks calculate the total number of activities and the number of activities that contain document-links. This is not included in the application.

In [186]:
%%nql 
SELECT COUNT(DISTINCT iatiidentifier) FROM activity

In [187]:
%%nql 
SELECT COUNT(DISTINCT iatiidentifier) FROM documentlink

## Who are publishing document links?


First look at all document links, then create a de-duplicated DataFrame that only contains rows with unique Publisher and url

In [188]:
## A group and count on publisher name across all document links, including duplicates.
df_org_counts = pd.DataFrame(df_all_documents['Publisher'].value_counts())
df_org_counts

In [189]:
total_reporting_orgs = df_org_counts.shape[0]
total_reporting_orgs

In [190]:
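## Before de-duplicating, sort the values so that rows with empty dates and formats come first and are
## more likely to be dropped by keep='last'. sort_values returns a new DataFrame, so the result is assigned back.
df_all_documents = df_all_documents.sort_values(['documentdate_isodate', 'format'], ascending=[True, True], na_position='first')
df_all_documents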

In [191]:
## Create a de-duplicated DataFrame, based on a combination of Publisher and url. 
df_unique_documents = df_all_documents.drop_duplicates(subset = ['Publisher', 'url'], keep = 'last').reset_index(drop=True)
df_unique_documents
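To illustrate how sort order interacts with `keep='last'`, here is a small sketch on invented data. The helper name `dedupe_prefer_complete` is hypothetical, and replacing missing values with empty strings before sorting is an assumption made to keep the ordering simple:

```python
import pandas as pd


def dedupe_prefer_complete(df, keys=("Publisher", "url"),
                           prefer=("documentdate_isodate", "format")):
    """De-duplicate on `keys`, preferring rows whose `prefer` columns are filled in."""
    ordered = df.copy()
    # Treat missing values as empty strings so they sort before real values
    ordered[list(prefer)] = ordered[list(prefer)].fillna("")
    ordered = ordered.sort_values(list(prefer))
    # After an ascending sort, the most complete row of each group is the last one
    return ordered.drop_duplicates(subset=list(keys), keep="last").reset_index(drop=True)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Publisher": ["A", "A", "A", "B"],
    "url": ["https://example.org/1", "https://example.org/1",
            "https://example.org/1", "https://example.org/2"],
    "documentdate_isodate": ["", "2020-01-01", None, ""],
    "format": ["", "application/pdf", None, "text/html"],
})
print(dedupe_prefer_complete(toy))
```

In the toy data, publisher A reports the same url three times and only the dated row survives.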

In [192]:
# Create total_unique_document_links, a variable to hold the total unique document links.
total_unique_document_links = df_unique_documents.shape[0]
total_unique_document_links

In [193]:
# Calculate the number of reporting organisations in each Organisation Type.
df_reportingorg_counts = df_unique_documents.groupby(['Publisher', 'Organization Type']).size().reset_index(name='count')
org_type_counts = df_reportingorg_counts.groupby('Organization Type')['Publisher'].nunique().reset_index(name='Unique Reporting Organisations')
org_type_counts = org_type_counts.sort_values('Unique Reporting Organisations', ascending=False).reset_index()
org_type_counts

In [194]:
# Calculate the number of document links per Organisation Type
df_links_by_type = df_unique_documents.groupby('Organization Type').size().reset_index(name='Total Document Links')
df_links_by_type = df_links_by_type.sort_values('Total Document Links', ascending=False).reset_index()
df_links_by_type

In [195]:
## Calculate the number of document-links published per reporting organisation
df_org_counts = pd.DataFrame(df_unique_documents['Publisher'].value_counts().reset_index())
df_org_counts

In [196]:
## Create a Series, a_top_orgs, holding the 15 publishers with the most unique document-links, for use later in the Examine section.
a_top_orgs = df_org_counts.head(15)['Publisher']
a_top_orgs

## Where in the standard are document-links published?

In [197]:
## Where are document links published?
df_org_source_counts = pd.DataFrame(df_unique_documents['source'].value_counts().reset_index())
df_org_source_counts

## When were documents published
The definition of 'document-date' is "The date of publication of the document that is being linked to."

First, we want to know how many dates are published. Next we update the 'documentdate_isodate' element to yyyy-mm format and visualise the results.

Next steps: investigate how many blanks there actually are. We need to fix the first query below and then test that the following block converts all of the valid dates to YYYY-MM format

In [198]:
## How many document-links do not contain a document date?
empty_date_count = (df_unique_documents['documentdate_isodate'].str.strip() == '').sum()
empty_date_count
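The comparison with an empty string above does not count values that are missing altogether (None or NaN). One possible way to count both kinds of blank is sketched below; `count_blank` is a hypothetical helper name, and whether missing values actually occur in this column is an assumption that would need checking against the data.

```python
import pandas as pd


def count_blank(series):
    """Count values that are missing (None/NaN), empty, or whitespace only."""
    as_text = series.fillna("").astype(str).str.strip()
    return int(as_text.eq("").sum())


# Toy data, invented for illustration only
example_dates = pd.Series(["2020-01-01", "", "   ", None, float("nan")])
print(count_blank(example_dates))
```

The same helper could be applied to df_unique_documents['documentdate_isodate'] and to the 'format' column.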

In [199]:
## Truncate 'documentdate_isodate' to the date part, parse it as a datetime and convert it to a monthly period (YYYY-MM)
df_unique_documents['documentdate_isodate'] = df_unique_documents['documentdate_isodate'].str[:10]
df_unique_documents['documentdate_isodate'] = pd.to_datetime(df_unique_documents['documentdate_isodate'])
df_unique_documents['documentdate_isodate'] = df_unique_documents['documentdate_isodate'].dt.to_period('M')
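As a way of testing the conversion on awkward values, the sketch below applies the same three steps to invented data but passes `errors='coerce'` so that unparseable dates become NaT instead of raising an error. The helper name `to_month` and the explicit `%Y-%m-%d` format are assumptions, not taken from the notebook.

```python
import pandas as pd


def to_month(series):
    """Convert ISO date strings to monthly periods; invalid or empty values become NaT."""
    parsed = pd.to_datetime(series.str[:10], format="%Y-%m-%d", errors="coerce")
    return parsed.dt.to_period("M")


# Toy data, invented for illustration only
example = pd.Series(["2021-03-15", "2021-03-15T00:00:00", "", "not-a-date", None])
months = to_month(example)
print(months)
print("Valid dates:", int(months.notna().sum()))
```

Comparing the count of valid dates before and after conversion gives a quick check that no valid date is lost.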

In [200]:
## Create a DataFrame, 'df_date', that groups publication dates by month and counts the instances per month
df_date = pd.DataFrame(df_unique_documents.groupby(['documentdate_isodate']).size().reset_index(name='count') \
                             .sort_values(['documentdate_isodate'], ascending=False))
df_date

## What format are documents declared as published in?

In [201]:
## How many document-links do not contain a format?
empty_format_count = (df_unique_documents['format'].str.strip() == '').sum()
empty_format_count

In [202]:
## create a DataFrame, 'df_format' that groups publication formats and counts them
df_format = pd.DataFrame(df_unique_documents.groupby(['format']).size().reset_index(name='count') \
                             .sort_values(['count'], ascending=False))
df_format

In [203]:
## How many rows declare a document format?
df_format['count'].sum()

## What language are documents declared as being written in?

Language has a one-to-many relationship with the document-links element, so we first analyse the documentlink_language table before merging it with df_all_documents for further analysis.

How many rows are there in the documentlink_language table in IATI Tables?

In [204]:
%%nql
select count(*) from documentlink_language

How many distinct document links are there in the documentlink_language table in IATI Tables?

In [205]:
%%nql
SELECT COUNT(DISTINCT _link_documentlink) 
FROM documentlink_language

In [206]:
## Given the large number of rows in the documentlink_language table we will 
## use the parallel processing/batch method developed by Leila.
datasette_url = 'https://datasette.tables.iatistandard.org'
count_url = f"{datasette_url}/iati.json?sql=SELECT+Count(*)+AS+TOTAL+FROM+documentlink_language"
response = requests.get(count_url, timeout=120)
response.raise_for_status()  # stop here rather than continue with an unknown row count
total_count = response.json()['rows'][0][0]
print(f"Total rows to fetch: {total_count}")


def fetch_chunk(start_offset, size):
    query_url = f"{datasette_url}/iati.json?sql=SELECT+*+FROM+documentlink_language+LIMIT+{size}+OFFSET+{start_offset}"
    response = requests.get(query_url, timeout=120)
    if response.status_code == 200:
        return response.json()['rows']
    return []

# Set up chunks
chunk_size = 20000
offsets = list(range(0, total_count, chunk_size))

# Process in parallel
all_data = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_offset = {executor.submit(fetch_chunk, offset, chunk_size): offset for offset in offsets}
    for future in tqdm(concurrent.futures.as_completed(future_to_offset), total=len(offsets)):
        offset = future_to_offset[future]
        try:
            data = future.result()
            all_data.extend(data)
        except Exception as e:
            print(f"Error with offset {offset}: {e}")

In [207]:
## Get the column names
if all_data:
    column_url = f"{datasette_url}/iati.json?sql=SELECT+*+FROM+documentlink_language+LIMIT+1"
    column_response = requests.get(column_url)
    columns = column_response.json()['columns']

In [208]:
## Create a DataFrame from the rows in the documentlink_language table in IATI Tables
df_document_language = pd.DataFrame(all_data, columns=columns)
df_document_language

In [209]:
## Check the composition of the df_document_language DataFrame
df_document_language.info()

In [210]:
# Count the number of activities that have at least one language declared
n = len(pd.unique(df_document_language['_link_activity']))
print("Number of unique values in '_link_activity':", n)

In [211]:
## Merge df_document_language into df_all_documents, restricting the columns merged to code, codename and _link_documentlink.
df_all_documents_lang = pd.DataFrame(pd.merge(df_all_documents.set_index('_link'), df_document_language[['code','codename','_link_documentlink']], right_on='_link_documentlink', left_index=True))
df_all_documents_lang

In [212]:
## Inspect the composition of the df_all_documents_lang DataFrame
df_all_documents_lang.info()

In [213]:
## De-duplicate df_all_documents_lang based on url and code and create a new DataFrame df_unique_languages
df_unique_languages = df_all_documents_lang.drop_duplicates(subset = ['url', 'code'], keep = 'last').reset_index(drop=True)
df_unique_languages

In [214]:
## Create a new DataFrame, df_language_count, based on a group count on the 'codename' column in the df_unique_languages DataFrame
df_language_count = pd.DataFrame(df_unique_languages.groupby(['codename']).size().reset_index(name='count') \
                             .sort_values(['count'], ascending=False))
df_language_count

In [215]:
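## Group count on count: how many urls declare one language, two languages, and so on.
## Note: df_language_url_count is not defined earlier in this notebook. It is assumed here
## to hold the number of distinct language codes declared per url.
df_language_url_count = df_unique_languages.groupby(['url']).size().reset_index(name='count')
df_language_url_count_meta = pd.DataFrame(df_language_url_count.groupby(['count']).size().reset_index(name='meta_count') \
                             .sort_values(['meta_count'], ascending=False))
df_language_url_count_meta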

## What categories of documents are being linked to?

Category has a one-to-many relationship with the document-links element, so we first analyse the documentlink_category table before merging it with df_all_documents for further analysis.

How many rows are there in the documentlink_category table in IATI Tables?

In [216]:
%%nql
select count(*) from documentlink_category

In [217]:
%%nql
SELECT COUNT(DISTINCT _link_documentlink) 
FROM documentlink_category

In [218]:
## Given the large number of rows in the documentlink_category table we will 
## use the parallel processing/batch method developed by Leila.
datasette_url = 'https://datasette.tables.iatistandard.org'
count_url = f"{datasette_url}/iati.json?sql=SELECT+Count(*)+AS+TOTAL+FROM+documentlink_category"
response = requests.get(count_url, timeout=120)
response.raise_for_status()  # stop here rather than continue with an unknown row count
total_count = response.json()['rows'][0][0]
print(f"Total rows to fetch: {total_count}")


def fetch_chunk(start_offset, size):
    query_url = f"{datasette_url}/iati.json?sql=SELECT+*+FROM+documentlink_category+LIMIT+{size}+OFFSET+{start_offset}"
    response = requests.get(query_url, timeout=120)
    if response.status_code == 200:
        return response.json()['rows']
    return []

# Set up chunks
chunk_size = 20000
offsets = list(range(0, total_count, chunk_size))

# Process in parallel
all_data = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_offset = {executor.submit(fetch_chunk, offset, chunk_size): offset for offset in offsets}
    for future in tqdm(concurrent.futures.as_completed(future_to_offset), total=len(offsets)):
        offset = future_to_offset[future]
        try:
            data = future.result()
            all_data.extend(data)
        except Exception as e:
            print(f"Error with offset {offset}: {e}")

In [219]:
## Get the column names
if all_data:
    column_url = f"{datasette_url}/iati.json?sql=SELECT+*+FROM+documentlink_category+LIMIT+1"
    column_response = requests.get(column_url)
    columns = column_response.json()['columns']

In [220]:
## Create a DataFrame from the rows in the documentlink_category table in IATI Tables
df_document_category = pd.DataFrame(all_data, columns=columns)
df_document_category

In [221]:
## Check the composition of the df_document_category DataFrame
df_document_category.info()

In [222]:
# Count the number of activities that have at least one category declared
n = len(pd.unique(df_document_category['_link_activity']))
print("Number of unique values in '_link_activity':", n)

In [223]:
## Merge df_document_category into df_all_documents, restricting the columns merged to code, codename and _link_documentlink.
df_all_documents_cat = pd.DataFrame(pd.merge(df_all_documents.set_index('_link'), df_document_category[['code','codename','_link_documentlink']], right_on='_link_documentlink', left_index=True))
df_all_documents_cat

In [224]:
## Inspect the composition of the df_all_documents_cat DataFrame
df_all_documents_cat.info()

In [225]:
## De-duplicate df_all_documents_cat based on url and code and create a new DataFrame df_unique_categories
df_unique_categories = df_all_documents_cat.drop_duplicates(subset = ['url', 'code'], keep = 'last').reset_index(drop=True)
df_unique_categories

Create a new DataFrame, df_category, based on a group count on the 'codename' column in the df_unique_categories DataFrame

In [226]:
df_category = pd.DataFrame(df_unique_categories.groupby(['codename']).size().reset_index(name='count') \
                             .sort_values(['count'], ascending=False))
df_category

In [227]:
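## Group count on count: how many urls declare one category, two categories, and so on.
## Note: df_category_url_count is not defined earlier in this notebook. It is assumed here
## to hold the number of distinct category codes declared per url.
df_category_url_count = df_unique_categories.groupby(['url']).size().reset_index(name='count')
df_category_url_count_meta = pd.DataFrame(df_category_url_count.groupby(['count']).size().reset_index(name='meta_count') \
                             .sort_values(['meta_count'], ascending=False))
df_category_url_count_meta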

# Examine

Look up an organisation's statistics

The top 15 organisations who report document links are in the dropdown below. Choose one to generate statistics for that organisation.

In [228]:
reporting_organisation = 'United Nations Development Programme (UNDP)'

In [229]:
reporting_organisation 

In [230]:
reporting_organisation_reference = df_registry_organisations.loc[df_registry_organisations['Publisher'] == reporting_organisation,'reportingorg_ref'].item()
reporting_organisation_reference

In [231]:
df_one_org = df_unique_documents.loc[df_unique_documents['Publisher'] == reporting_organisation]
df_one_org

In [232]:
print("Where do {} declare their document-links?".format(reporting_organisation))

In [233]:
df_one_org_source = pd.DataFrame(df_one_org['source'].value_counts().reset_index())
df_one_org_source

In [234]:
print("When do {} declare they published their document-links?".format(reporting_organisation))

In [235]:
df_date = pd.DataFrame(df_one_org.groupby(['documentdate_isodate']).size().reset_index(name='count') \
                             .sort_values(['documentdate_isodate'], ascending=False))
df_date

In [236]:
print("In what formats do {} publish their document-links?".format(reporting_organisation))

In [237]:
df_format = df_one_org.groupby(['format']).size().reset_index(name='count').sort_values(['count'], ascending=False)
df_format

In [238]:
print("In what languages do {} publish their document-links?".format(reporting_organisation))

In [239]:
df_top_org_language = pd.DataFrame(df_unique_languages.loc[df_unique_languages['reportingorg_ref'] == reporting_organisation_reference])
df_top_org_language

In [240]:
df_org_language = df_top_org_language.groupby(['codename']).size().reset_index(name='count').sort_values(['count'], ascending=False)
df_org_language

In [241]:
print("What category of documents do {} publish?".format(reporting_organisation))

In [242]:
df_top_org_cat = pd.DataFrame(df_unique_categories.loc[df_unique_categories['reportingorg_ref'] == reporting_organisation_reference])
df_top_org_cat

In [243]:
df_top_org_cat_group = df_top_org_cat.groupby(['codename']).size().reset_index(name='count').sort_values(['count'], ascending=False)
df_top_org_cat_group

# Availability

### Asynchronous URL Status Checker with Redirect Tracking

Description:
This script validates large batches of URLs asynchronously using aiohttp and asyncio. It sends non-blocking HTTP HEAD requests to check URL statuses, and is intended to retry on transient errors and rate limits (e.g., HTTP 429). The check_urls() function takes a DataFrame of URLs and returns, for each one, the HTTP status code, the final resolved URL and whether a redirect occurred. It also reports domain changes and upgrades from HTTP to HTTPS.
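The redirect checks described above (redirect or not, domain change, HTTP to HTTPS upgrade) can be expressed with the standard library's URL parser instead of splitting on '/'. This is a minimal sketch, and `classify_redirect` is a hypothetical helper name that does not appear in the notebook:

```python
from urllib.parse import urlsplit


def classify_redirect(original_url, final_url):
    """Describe how a final URL differs from the URL that was requested."""
    original, final = urlsplit(original_url), urlsplit(final_url)
    return {
        "is_redirect": original_url != final_url,
        # hostname is lower-cased by urlsplit and excludes any port or credentials
        "domain_changed": original.hostname != final.hostname,
        "http_to_https": original.scheme == "http" and final.scheme == "https",
    }


print(classify_redirect("http://example.org/a.pdf", "https://www.example.org/a.pdf"))
```

Parsing the URL avoids treating two spellings of the same host (for example, different letter case) as a domain change.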

Head only without the delay

In [244]:
# nest_asyncio.apply()

# async def fetch_status_head_only(session, url, follow_redirects=True, max_retries=2):
#     for attempt in range(max_retries + 1):  # Allow retries
#         try:
#             async with session.head(url, timeout=10, allow_redirects=follow_redirects) as response:
#                 return url, response.status, str(response.url)

#         except aiohttp.ClientResponseError as e:
#             if e.status == 429:
#                 retry_after = int(e.headers.get("Retry-After", "5"))
#                 await asyncio.sleep(retry_after)
#                 continue  # Retry
#             else:
#                 return url, e.status, f"HTTP {e.status}: {str(e)[:80]}"

#         except Exception as e:
#             return url, None, str(e)[:100]

#     return url, 429, "Too Many Requests after retries"

# async def fetch_all_statuses(urls, batch_size=50, follow_redirects=True):
#     results = []
#     connector = aiohttp.TCPConnector(limit_per_host=batch_size)
#     async with aiohttp.ClientSession(connector=connector) as session:
#         tasks = []
#         for url in urls:
#             task = fetch_status_head_only(session, url, follow_redirects)
#             tasks.append(task)
#             if len(tasks) == batch_size:
#                 batch_results = await asyncio.gather(*tasks)
#                 results.extend(batch_results)
#                 tasks = []
#         if tasks:
#             batch_results = await asyncio.gather(*tasks)
#             results.extend(batch_results)
#     return results

# def check_urls(df, url_column='url', follow_redirects=True, batch_size=50):
#     urls = df[url_column].tolist()

#     print(f"Checking {len(urls)} URLs (HEAD requests only, follow_redirects={follow_redirects})...")
#     start_time = time.time()
#     results = asyncio.run(fetch_all_statuses(urls, batch_size, follow_redirects))
#     elapsed = time.time() - start_time

#     status_df = pd.DataFrame(results, columns=['url', 'HTTP_Response', 'Final_URL'])
#     status_df['url'] = status_df['url'].astype(str)
#     status_df['Final_URL'] = status_df['Final_URL'].astype(str)

#     result_df = df.merge(status_df, on=url_column, how="left")
#     result_df['Is_Redirect'] = result_df['url'] != result_df['Final_URL']

#     print(f"Completed in {elapsed:.2f} seconds")
#     print(f"Status code distribution:")
#     print(result_df['HTTP_Response'].value_counts().sort_index())

#     redirect_count = result_df['Is_Redirect'].sum()
#     print(f"Redirects: {redirect_count} / {len(result_df)}")

#     if redirect_count > 0:
#         redirected = result_df[result_df['Is_Redirect']].copy()

#         try:
#             redirected['original_domain'] = redirected['url'].apply(
#                 lambda x: x.split('/')[2] if isinstance(x, str) and len(x.split('/')) > 2 else '')
#             redirected['final_domain'] = redirected['Final_URL'].apply(
#                 lambda x: x.split('/')[2] if isinstance(x, str) and len(x.split('/')) > 2 else '')
#             redirected['domain_changed'] = redirected['original_domain'] != redirected['final_domain']
#             domain_changes = redirected['domain_changed'].sum()

#             print(f"Redirects to different domains: {domain_changes}")
#             print(f"Redirects within same domain: {len(redirected) - domain_changes}")

#             http_to_https = sum(
#                 (redirected['url'].str.startswith('http://')) &
#                 (redirected['Final_URL'].str.startswith('https://'))
#             )
#             print(f"HTTP to HTTPS upgrades: {http_to_https}")

#         except Exception as e:
#             print(f"Could not analyse domain changes: {str(e)}")

#     return result_df

(HEAD requests only, without the delay, with some modifications)

This script is a simplified version of the original async URL checker in the block above. It sends HEAD requests without the delay, deduplicates URLs before checking them, and reports progress more clearly, including a domain-level breakdown of redirects. The status checks run on the unique URLs only, and the results are then merged back into the original DataFrame.
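
The "check unique URLs, then merge back" step can be tried offline with a small sketch. This is only an illustration under assumptions: `check_unique_then_merge`, `fake_check` and the sample data are hypothetical names that are not part of the notebook, and the stub stands in for the real HEAD-request checker so that nothing is fetched.

```python
import pandas as pd


def check_unique_then_merge(df, check_fn, url_column="url"):
    """Run check_fn once per distinct URL, then attach the result to every row."""
    unique_urls = df[url_column].drop_duplicates().tolist()
    # check_fn is a hypothetical stand-in for the real checker;
    # it returns (status, final_url) for a single URL.
    rows = [(url, *check_fn(url)) for url in unique_urls]
    status_df = pd.DataFrame(rows, columns=[url_column, "HTTP_Response", "Final_URL"])
    # how="left" keeps every original row; validate="many_to_one" raises if
    # status_df ever held the same URL twice, which would duplicate rows.
    merged = df.merge(status_df, on=url_column, how="left", validate="many_to_one")
    merged["Is_Redirect"] = merged[url_column] != merged["Final_URL"]
    return merged


calls = []


def fake_check(url):
    """Offline stub: pretend every http:// URL redirects to https://."""
    calls.append(url)
    if url.startswith("http://"):
        return 200, "https://" + url[len("http://"):]
    return 200, url


sample = pd.DataFrame({
    "reporting_org": ["A", "B", "C"],
    "url": [
        "http://example.org/a.pdf",
        "http://example.org/a.pdf",
        "https://example.org/b.pdf",
    ],
})
result = check_unique_then_merge(sample, fake_check)
print(result[["reporting_org", "HTTP_Response", "Is_Redirect"]])
```

The row count of `result` matches `sample`, while the stub is called only once per distinct URL. That is the saving the deduplication is meant to provide.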

In [245]:
# nest_asyncio.apply()
# async def fetch_status_head_only(session, url, follow_redirects=True, max_retries=2):
#     # session.head() does not raise on 4xx/5xx by default, so 429 is detected from response.status
#     timeout = aiohttp.ClientTimeout(total=10)
#     for attempt in range(max_retries + 1):
#         try:
#             async with session.head(url, timeout=timeout, allow_redirects=follow_redirects) as response:
#                 if response.status != 429:
#                     return url, response.status, str(response.url)
#                 retry_after = response.headers.get("Retry-After", "5")
#         except Exception as e:
#             return url, None, str(e)[:100]
#         # Retry-After may be an HTTP date rather than a number of seconds; fall back to 5 seconds
#         if attempt < max_retries:
#             await asyncio.sleep(int(retry_after) if retry_after.isdigit() else 5)
#     return url, 429, "Too Many Requests after retries"

# async def fetch_all_statuses(urls, batch_size=50, follow_redirects=True):
#     results = []
#     connector = aiohttp.TCPConnector(limit_per_host=batch_size)
#     async with aiohttp.ClientSession(connector=connector) as session:
#         tasks = []
#         for url in urls:
#             task = fetch_status_head_only(session, url, follow_redirects)
#             tasks.append(task)
#             if len(tasks) == batch_size:
#                 batch_results = await asyncio.gather(*tasks)
#                 results.extend(batch_results)
#                 tasks = []
#         if tasks:
#             batch_results = await asyncio.gather(*tasks)
#             results.extend(batch_results)
#     return results

# def check_urls(df, url_column='url', follow_redirects=True, batch_size=50):
#     print(f"\nInitial DataFrame size: {len(df):,}")

#     # 1. Deduplicate URLs before checking
#     unique_urls_df = df[[url_column]].drop_duplicates().copy()
#     urls = unique_urls_df[url_column].tolist()
#     print(f"Unique URLs to check: {len(urls):,}")

#     # 2. Fetch status codes
#     print(f"Checking {len(urls):,} URLs (HEAD requests only, follow_redirects={follow_redirects})...")
#     start_time = time.time()
#     results = asyncio.run(fetch_all_statuses(urls, batch_size, follow_redirects))
#     elapsed = time.time() - start_time
#     print(f"Completed in {elapsed:.2f} seconds")

#     # 3. Create status DataFrame
#     status_df = pd.DataFrame(results, columns=['url', 'HTTP_Response', 'Final_URL'])
#     status_df['url'] = status_df['url'].astype(str)
#     status_df['Final_URL'] = status_df['Final_URL'].astype(str)
#     print(f"Status DataFrame created: {len(status_df):,} rows")

#     # 4. Merge back to original
#     merged_df = df.merge(status_df, on=url_column, how="left")
#     print(f"Merged result shape: {len(merged_df):,} rows (should match original df)")
#     merged_df['Is_Redirect'] = merged_df['url'] != merged_df['Final_URL']

#     # 5. Status breakdown
#     print("\nStatus code distribution:")
#     print(merged_df['HTTP_Response'].value_counts().sort_index())

#     redirect_count = merged_df['Is_Redirect'].sum()
#     print(f"\nRedirects found: {redirect_count:,} / {len(merged_df):,}")

#     # 6. redirect domain analysis
#     if redirect_count > 0:
#         redirected = merged_df[merged_df['Is_Redirect']].copy()

#         try:
#             redirected['original_domain'] = redirected['url'].apply(
#                 lambda x: x.split('/')[2] if isinstance(x, str) and len(x.split('/')) > 2 else '')
#             redirected['final_domain'] = redirected['Final_URL'].apply(
#                 lambda x: x.split('/')[2] if isinstance(x, str) and len(x.split('/')) > 2 else '')
#             redirected['domain_changed'] = redirected['original_domain'] != redirected['final_domain']
#             domain_changes = redirected['domain_changed'].sum()

#             print(f"Redirects to different domains: {domain_changes:,}")
#             print(f"Redirects within same domain: {len(redirected) - domain_changes:,}")

#             http_to_https = sum(
#                 (redirected['url'].str.startswith('http://')) &
#                 (redirected['Final_URL'].str.startswith('https://'))
#             )
#             print(f"HTTP to HTTPS upgrades: {http_to_https:,}")

#         except Exception as e:
#             print(f"Could not analyse domain changes: {str(e)}")

#     return merged_df
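
The domain comparison in the commented-out cell above takes the third `/`-separated piece of each URL. As a hedged alternative, the same three questions (was there a redirect, did the domain change, was it an HTTP to HTTPS upgrade) can be answered with the standard library's `urllib.parse`. The helper name `describe_redirect` is hypothetical and not part of the notebook. Unlike the string split, `hostname` is lower-cased and excludes any port, so the counts could differ slightly from the cell above.

```python
from urllib.parse import urlsplit


def describe_redirect(original_url, final_url):
    """Return flags describing how final_url differs from original_url."""
    orig, final = urlsplit(original_url), urlsplit(final_url)
    return {
        "is_redirect": original_url != final_url,
        # hostname is lower-cased and has no port or credentials attached
        "domain_changed": orig.hostname != final.hostname,
        "https_upgrade": orig.scheme == "http" and final.scheme == "https",
    }


print(describe_redirect("http://example.org/a.pdf", "https://example.org/a.pdf"))
print(describe_redirect("https://Example.org/a", "https://cdn.example.net/a"))
```

The flags are independent, as in the cell above, so an HTTP to HTTPS upgrade that also changes domain sets both `domain_changed` and `https_upgrade`.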

In [246]:
# df_status_all = check_urls(df_unique_documents, follow_redirects=True, batch_size=50)

Storing results

In [247]:
# Creates a variable to include in the app telling people when the last run of the notebook finished. Similar to the datetime when the URLs were checked.
last_run_finish_time = datetime.today().strftime('%Y-%m-%d %H:%M:%S')
last_run_finish_time

In [248]:
## Create a DataFrame to hold the Big Numbers we will use at the top of the app
df_big_numbers = pd.DataFrame({
    'total': ['total_document_links', 'total_reporting_orgs', 'total_unique_document_links', 'empty_date_count', 'last_run_start_time', 'last_run_finish_time'],
    'count': [total_document_links, total_reporting_orgs, total_unique_document_links, empty_date_count, last_run_start_time, last_run_finish_time]
})
df_big_numbers

In [249]:

def __deepnote_big_number__():
    import json
    import jinja2
    from jinja2 import meta

    def render_template(template):
        parsed_content = jinja2.Environment().parse(template)

        required_variables = meta.find_undeclared_variables(parsed_content)

        context = {
            variable_name: globals().get(variable_name)
            for variable_name in required_variables
        }

        result = jinja2.Environment().from_string(template).render(context)

        return result

    rendered_title = render_template("Total number of document links published")
    rendered_comparison_title = render_template("")

    return json.dumps({
        "comparisonTitle": rendered_comparison_title,
        "comparisonValue": "",
        "title": rendered_title,
        "value": f"{total_document_links}"
    })

__deepnote_big_number__()


In [250]:

def __deepnote_big_number__():
    import json
    import jinja2
    from jinja2 import meta

    def render_template(template):
        parsed_content = jinja2.Environment().parse(template)

        required_variables = meta.find_undeclared_variables(parsed_content)

        context = {
            variable_name: globals().get(variable_name)
            for variable_name in required_variables
        }

        result = jinja2.Environment().from_string(template).render(context)

        return result

    rendered_title = render_template("Total number of reporting organisations who publish document links")
    rendered_comparison_title = render_template("")

    return json.dumps({
        "comparisonTitle": rendered_comparison_title,
        "comparisonValue": "",
        "title": rendered_title,
        "value": f"{total_reporting_orgs}"
    })

__deepnote_big_number__()


In [251]:

def __deepnote_big_number__():
    import json
    import jinja2
    from jinja2 import meta

    def render_template(template):
        parsed_content = jinja2.Environment().parse(template)

        required_variables = meta.find_undeclared_variables(parsed_content)

        context = {
            variable_name: globals().get(variable_name)
            for variable_name in required_variables
        }

        result = jinja2.Environment().from_string(template).render(context)

        return result

    rendered_title = render_template("Total number of unique document links published")
    rendered_comparison_title = render_template("")

    return json.dumps({
        "comparisonTitle": rendered_comparison_title,
        "comparisonValue": "",
        "title": rendered_title,
        "value": f"{total_unique_document_links}"
    })

__deepnote_big_number__()


In [252]:
# ## Write data to file
# df_unique_documents.to_csv('/work/data/unique.csv')
# df_big_numbers.to_csv('/work/data/big_numbers.csv')
# a_top_orgs.to_csv('/work/data/top_orgs.csv')
# df_status_all.to_csv('/work/data/unique_status.csv')
# df_unique_languages.to_csv('/work/data/unique_languages.csv')
# df_unique_categories.to_csv('/work/data/unique_categories.csv')
