ICPAC Countries Data Fetching Script (HDX)
Overview
This script downloads CSV and Excel (XLSX, XLS) files for the ICPAC countries from the Humanitarian Data Exchange (HDX) and organizes them into a separate folder for each country. Only datasets tagged with at least one entry from a fixed list of hazard and humanitarian tags are downloaded. The ICPAC countries included are Djibouti, Eritrea, Ethiopia, Kenya, Somalia, South Sudan, Sudan, Uganda, Burundi, Rwanda, and Tanzania.

Script Enhancements
The script has been enhanced with several features to ensure robust data fetching and organization:

Enhanced Logging: The script now includes print statements that log the process of fetching data for each country. This helps in tracking which country's data is being processed and whether datasets are found.

Error Handling: Added try-except blocks around the requests.get calls to handle potential HTTP errors and other exceptions. This ensures that the script can handle issues gracefully and provides meaningful error messages.

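Modified Search Parameters: The search request passes the country name as a simple query (q) and pages through the results with start and rows. The request in this notebook does not send a filter query (fq); instead, the script restricts downloads to CSV, XLSX, and XLS resources by checking each resource's format after the results are returned.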

Directory Structure: For each ICPAC country, a directory is created to store the corresponding files. Within it, files are grouped into subfolders named after the part of the file name before the first underscore, which keeps related files together.

Download Function: The function download_file is used to handle the downloading of files. It checks the response status and writes the content to a file in chunks to handle large files efficiently.
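The search parameters described above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: build_search_params is a hypothetical name, and the optional fq value assumes the server indexes resource formats under res_format, as CKAN's default search schema does.

```python
def build_search_params(country, start=0, rows=100, csv_only=False):
    """Build query parameters for a CKAN package_search request."""
    # q is the free-text query; start and rows control pagination.
    params = {"q": country, "start": start, "rows": rows}
    if csv_only:
        # Assumption: resource formats are searchable via `res_format`.
        params["fq"] = "res_format:CSV"
    return params


# Parameters for the second page of results for one country.
print(build_search_params("Kenya", start=100, csv_only=True))
```

Filtering on the server would reduce the number of datasets returned, but it would also exclude datasets that only offer XLSX or XLS resources, which the fetch loop in this notebook currently accepts.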

Usage
The script automatically creates a directory named hdx_files with subdirectories for each ICPAC country, where the downloaded files will be stored.

Import Libraries and Define Constants

In [1]:
import requests
import os
from datetime import datetime

# HDX endpoint URL
hdx_url = "https://data.humdata.org/api/3/action/package_search"

# List of ICPAC countries
icpac_countries = [
    "Djibouti", "Eritrea", "Ethiopia", "Kenya", "Somalia", "South Sudan",
    "Sudan", "Uganda", "Burundi", "Rwanda", "Tanzania"
]

# List of tags to capture
tags_to_capture = {
    "affected area", "climate hazards", "climate-weather", "conflict-violence",
    "covid-19", "cyclones-hurricanes-typhoons", "damage assessment", "disability",
    "disaster risk reduction-drr", "disease", "drought", "earthquake-tsunami",
    "el nino-el nina", "environment", "epidemics-outbreaks", "facilities-infrastructure",
    "fatalities", "flooding-storm surge", "forced displacement", "funding",
    "hazards and risk", "horn of africa", "humanitarian access", "humanitarian needs overview-hno",
    "humanitarian response plan-hrp", "hydrology", "internally displaced persons-idp",
    "languages", "malaria", "natural disasters", "severity"
}

Define Helper Functions

In [2]:
# Function to download a file from a URL
def download_file(url, save_path):
    try:
        # A timeout keeps the script from hanging on a stalled connection
        response = requests.get(url, stream=True, timeout=60)
        if response.status_code == 200:
            with open(save_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=1024):
                    f.write(chunk)
            print(f"Downloaded: {save_path}")
        else:
            print(f"Failed to download {url}: {response.status_code}")
    except Exception as e:
        print(f"Error downloading {url}: {str(e)}")

# Function to create subfolders for similar titles
def create_subfolder(file_name, country_dir):
    file_title = file_name.split('_')[0]
    subfolder_dir = os.path.join(country_dir, file_title)
    os.makedirs(subfolder_dir, exist_ok=True)
    return subfolder_dir

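# Function to check if the dataset is within the date range (1960 to 2024/07/29)
# Note: this helper is defined for optional filtering; the fetch loop in this notebook does not call it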
def is_within_date_range(dataset_date_str, start_year=1960, end_date_str='2024-07-29T23:59:59'):
    try:
        dataset_date = datetime.strptime(dataset_date_str, '%Y-%m-%dT%H:%M:%S')
        start_date = datetime(year=start_year, month=1, day=1)
        end_date = datetime.strptime(end_date_str, '%Y-%m-%dT%H:%M:%S')
        return start_date <= dataset_date <= end_date
    except ValueError:
        return False

# Function to check if any of the dataset's tags match the tags to capture
def has_matching_tags(dataset_tags):
    if not isinstance(dataset_tags, list):
        return False
    for tag in dataset_tags:
        if isinstance(tag, dict):
            tag_name = tag.get('name', '').lower()
            if tag_name in tags_to_capture:
                return True
    return False
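One caveat about the date check above: strptime with '%Y-%m-%dT%H:%M:%S' rejects timestamps that carry fractional seconds, so such values would be treated as out of range. The sketch below is a more tolerant variant, assuming the timestamp passed in may look like the metadata timestamps CKAN typically returns (for example with microseconds). The name is_within_date_range_iso is hypothetical.

```python
from datetime import datetime


def is_within_date_range_iso(date_str, start_year=1960,
                             end_date_str="2024-07-29T23:59:59"):
    """Return True if an ISO 8601 timestamp falls inside the date range."""
    try:
        # fromisoformat accepts optional fractional seconds.
        # Any UTC offset is dropped (not converted) so the comparison
        # below is always between naive datetimes.
        dataset_date = datetime.fromisoformat(date_str).replace(tzinfo=None)
        end_date = datetime.fromisoformat(end_date_str)
    except (TypeError, ValueError):
        # Missing or malformed timestamps count as out of range.
        return False
    return datetime(start_year, 1, 1) <= dataset_date <= end_date


print(is_within_date_range_iso("2024-07-29T12:00:00.123456"))
print(is_within_date_range_iso("1959-12-31T23:59:59"))
```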


Fetch and Process Data

In [3]:
# Loop through each ICPAC country and retrieve data
for country in icpac_countries:
    print(f"Fetching data for {country}...")
    start = 0
    rows = 100  # Number of results to fetch per request

    # Create a directory for each country to save the files
    country_dir = os.path.join("hdx_files", country)
    os.makedirs(country_dir, exist_ok=True)

    while True:
        params = {
            "q": country,
            "start": start,
            "rows": rows
        }

        # Fetch datasets from HDX
        try:
            response = requests.get(hdx_url, params=params, timeout=30)
            response.raise_for_status()  # Raises an HTTPError for bad responses
            datasets = response.json().get('result', {}).get('results', [])
            if not datasets:
                print(f"No more datasets found for {country}")
                break
            for dataset in datasets:
                # Check if the dataset has any of the specified tags
                if has_matching_tags(dataset.get('tags', [])):
                    resources = dataset.get('resources', [])
                    for resource in resources:
                        file_url = resource.get('url')
                        file_format = resource.get('format', 'unknown').lower()
                        if file_format in ['csv', 'xlsx', 'xls']:
                            file_name = resource.get('name', 'downloaded_file').replace('/', '_') + '.' + file_format
                            subfolder_dir = create_subfolder(file_name, country_dir)
                            save_path = os.path.join(subfolder_dir, file_name)
                            download_file(file_url, save_path)
            start += rows  # Move to the next set of results
        except requests.exceptions.HTTPError as http_err:
            print(f"HTTP error occurred for {country}: {http_err}")
            break
        except Exception as err:
            print(f"An error occurred for {country}: {err}")
            break

print("All files have been processed and organized by country and similar titles.")


Fetching data for Djibouti...
Downloaded: hdx_files/Djibouti/Infrastructure Indicators for Djibouti.csv/Infrastructure Indicators for Djibouti.csv
Downloaded: hdx_files/Djibouti/QuickCharts-Infrastructure Indicators for Djibouti.csv/QuickCharts-Infrastructure Indicators for Djibouti.csv
Downloaded: hdx_files/Djibouti/Environment Indicators for Djibouti.csv/Environment Indicators for Djibouti.csv
Downloaded: hdx_files/Djibouti/QuickCharts-Environment Indicators for Djibouti.csv/QuickCharts-Environment Indicators for Djibouti.csv
Downloaded: hdx_files/Djibouti/List of airports in Djibouti (HXL tags).csv/List of airports in Djibouti (HXL tags).csv
Downloaded: hdx_files/Djibouti/List of airports in Djibouti (no HXL tags).csv/List of airports in Djibouti (no HXL tags).csv
Downloaded: hdx_files/Djibouti/djibouti/djibouti_political_violence_events_and_fatalities_by_month-year.xlsx
Downloaded: hdx_files/Djibouti/djibouti/djibouti_civilian_targeting_events_and_fatalities_by_month-year.xlsx
...
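The log above shows subfolders whose names end in ".csv", and the archive listing later in this notebook shows files ending in ".xlsx.xlsx". Both follow from appending the format to a resource name that already carries an extension and from splitting the full file name on underscores. The helpers below are one possible way to avoid this; build_file_name and subfolder_key are hypothetical names, not part of the original script.

```python
import os


def build_file_name(resource_name, file_format):
    """Return a safe file name with exactly one extension."""
    safe_name = resource_name.replace("/", "_")
    extension = "." + file_format.lower()
    # Append the extension only when the name does not already end with it.
    if safe_name.lower().endswith(extension):
        return safe_name
    return safe_name + extension


def subfolder_key(file_name):
    """Return the grouping prefix: the stem before the first underscore."""
    stem, _ = os.path.splitext(file_name)
    return stem.split("_")[0]


print(build_file_name("sogeh_aggregiondata_2020.xlsx", "xlsx"))
print(subfolder_key("Infrastructure Indicators for Djibouti.csv"))
```

Using these in the fetch loop would change the resulting folder and file names, so an existing hdx_files directory would no longer match the new layout.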

Zip the Output Directory

In [4]:
!zip -r hdx_files.zip hdx_files
from google.colab import files
files.download('hdx_files.zip')


Streaming output truncated to the last 5000 lines.
  adding: hdx_files/Eritrea/Sub-Saharan/Sub-Saharan_health_facilities.xlsx.xlsx (deflated 3%)
  adding: hdx_files/Eritrea/Demographics and locations of forcibly displaced and stateless persons residing in Eritrea.csv/ (stored 0%)
  adding: hdx_files/Eritrea/Demographics and locations of forcibly displaced and stateless persons residing in Eritrea.csv/Demographics and locations of forcibly displaced and stateless persons residing in Eritrea.csv (deflated 80%)
  adding: hdx_files/Eritrea/sogeh/ (stored 0%)
  adding: hdx_files/Eritrea/sogeh/sogeh_aggregiondata_2020.xlsx.xlsx (deflated 22%)
  adding: hdx_files/Eritrea/sogeh/sogeh_aggregiondata_allyears.xlsx.xlsx (deflated 64%)
  adding: hdx_files/Eritrea/sogeh/sogeh_aggcountrydata_2020.xlsx.xlsx (deflated 5%)
  adding: hdx_files/Eritrea/sogeh/sogeh_aggregiondata_2021.xlsx.xlsx (deflated 81%)
  adding: hdx_files/Eritrea/sogeh/sogeh_aggcountrydata_2021.xlsx.xlsx (deflated 35%)
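The cell above relies on the zip shell command and on google.colab, so it only works in Colab-like environments. As a hedged alternative for running the notebook elsewhere, the standard library's shutil.make_archive can produce a comparable archive. zip_output is a hypothetical helper name.

```python
import os
import shutil


def zip_output(directory="hdx_files", archive_base="hdx_files"):
    """Zip `directory` into `<archive_base>.zip` and return the archive path."""
    directory = os.path.abspath(directory)
    # root_dir/base_dir keep the top-level folder name inside the archive,
    # so entries look like hdx_files/<country>/... as with `zip -r`.
    return shutil.make_archive(
        archive_base,
        "zip",
        root_dir=os.path.dirname(directory),
        base_dir=os.path.basename(directory),
    )
```

Calling zip_output() after the fetch loop has finished would write hdx_files.zip to the current working directory.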

