# Search procedure sheet for HDX using Python

A set of scripts to install, configure, and use Python modules to search the Humanitarian Data Exchange (HDX) database.

In [22]:
import pandas as pd
import numpy as np

Load the package for regular expressions:  
https://docs.python.org/3/library/re.html

In [None]:
import re

Load the packages that manage the connection to the Humanitarian Data Exchange platform (see the wiki for links to sources and tutorials):

In [None]:
from hdx.utilities.easy_logging import setup_logging
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

## Setup

Setup has to be performed only once :

In [23]:
setup_logging()

In [25]:
# The configuration has to be created only once
Configuration.create(hdx_site="prod", user_agent="BDAUTRIF_HDX-client-Proj", hdx_read_only=True)

## Search 

A common example of a search in the HDX database (see the wiki for links to the client website and tutorials):

In [None]:
datasets = Dataset.search_in_hdx("WHO", rows=1000)

More parameters are available through the `get_*` methods of the `Dataset` object:  
See https://github.com/Ben-zie/HDX_Proj-1/blob/main/Python_HDX-Proj-1.ipynb

Store the results in a dataframe:

In [None]:
results = pd.DataFrame(datasets)

### Variables in the 'results' dataframe :

Several variables are available once the results are loaded into a dataframe:

|FIELD|TYPE|FIELD|TYPE|
|--|--|--|--|
|archived                              |bool|maintainer                          |object|
|batch                               |object|metadata_created                    |object|
|caveats                             |object|metadata_modified                   |object|
|cod_level                           |object|methodology                         |object|
|creator_user_id                     |object|methodology_other                   |object|
|customviz                           |object|name                                |object|
|data_update_frequency               |object|notes                               |object|
|dataseries_name                     |object|num_resources                        |int64|
|dataset_date                        |object|num_tags                             |int64|
|dataset_preview                     |object|organization                        |object|
|dataset_source                      |object|overdue_date                        |object|
|due_date                            |object|owner_org                           |object|   
|groups                              |object|package_creator                     |object|
|has_geodata                           |bool|pageviews_last_14_days               |int64|
|has_quickcharts                       |bool|private                               |bool|
|has_showcases                         |bool|qa_checklist                        |object|
|id                                  |object|qa_completed                          |bool|
|indicator                           |object|quality                             |object|
|is_requestdata_type                   |bool|relationships_as_object             |object|
|isopen                                |bool|relationships_as_subject            |object| 
|last_modified                       |object|review_date                         |object|
|license_id                          |object|solr_additions                      |object| 
|license_other                       |object|state                               |object| 
|license_title                       |object|subnational                         |object| 
|license_url                         |object|tags                                |object| 
|title                               |object|updated_by_script                   |object|
|total_res_downloads                  |int64|url                                 |object|
|type                                |object|version                             |object|
|date_var                        |datetime64|                        || 
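A field/type listing like the table above can be rebuilt from the dataframe itself. This is a minimal sketch, assuming a small hand-made stand-in for `results`: the rows and the names `sample` and `field_types` are hypothetical and are not real HDX records.

```python
import pandas as pd

# Hypothetical stand-in for `results`; real rows come from Dataset.search_in_hdx.
sample = pd.DataFrame(
    {
        "title": ["Dataset A", "Dataset B"],
        "has_geodata": [True, False],
        "num_resources": [3, 1],
    }
)

# One row per column of the dataframe, with the pandas dtype written as text.
field_types = (
    sample.dtypes.astype(str)
    .rename_axis("FIELD")
    .reset_index(name="TYPE")
)
print(field_types)
```

The dtype reported for text columns depends on the pandas version, so the listing may differ slightly from the table.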

Methods exist to get information and parameters from a specific dataset (see further in this article). The following scripts extract information and selections directly from the list of results returned by a search.

# Sort results 

## By keyword :

In [None]:
def search_keyword(x, field='title', key=None):
    """
    Select the rows whose `field` contains a keyword (case-insensitive).

    Parameters
    ----------
    x : dataframe
        Dataframe containing the results of a search on the HDX database.
    field : str, optional
        Field in which to look for the keyword; default is 'title'.
    key : str, optional
        Keyword to look for; default is None, which returns x unchanged.

    Returns
    -------
    dat : dataframe
        Items matching the search.
    """
    # Without a keyword there is nothing to filter on
    if key is None:
        return x
    # na=False treats missing values as non-matches
    dat = x[x[field].str.contains(key, case=False, na=False)]
    return dat
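A short usage sketch of the keyword filter, run on made-up titles so that no HDX connection is needed. The function is restated in compact form so the cell runs on its own, and the `sample` rows are hypothetical.

```python
import pandas as pd


def search_keyword(x, field="title", key=None):
    # Compact restatement of the function above, so this cell is self-contained.
    return x[x[field].str.contains(key, case=False, na=False)]


# Hypothetical results; real ones come from Dataset.search_in_hdx.
sample = pd.DataFrame(
    {"title": ["Cholera cases in Yemen", "Road network", "CHOLERA outbreaks"]}
)

matches = search_keyword(sample, key="cholera")
print(matches["title"].tolist())
```

Note that `str.contains` interprets the keyword as a regular expression by default, so characters such as `(` or `+` in a keyword need escaping (for instance with `re.escape`).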

## By type :

In [None]:
def search_geodata(x, y):
    """
    Select the items that are (or are not) geodata.

    Parameters
    ----------
    x : dataframe (pandas)
        Source dataframe built from the search results.
    y : bool
        True to keep the items that contain geodata, False to keep the others.

    Returns
    -------
    x : dataframe
        Items from the results passed as argument that match the choice made in y.

    """
    x = x[x['has_geodata'] == y]
    return x
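As a quick illustration, the boolean selection can be tried on a toy dataframe. The rows below are hypothetical, and the function is restated so the cell runs by itself.

```python
import pandas as pd


def search_geodata(x, y):
    # Restated from the cell above so this example is self-contained.
    return x[x["has_geodata"] == y]


# Hypothetical results with the 'has_geodata' flag found in HDX results.
sample = pd.DataFrame(
    {
        "title": ["Admin boundaries", "Price survey", "Health sites"],
        "has_geodata": [True, False, True],
    }
)

with_geo = search_geodata(sample, True)
without_geo = search_geodata(sample, False)
print(len(with_geo), len(without_geo))
```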

## By source :

In [None]:
def search_sources(x):
    """
    List the distinct sources found in the results.

    Parameters
    ----------
    x : dataframe
        Dataframe built from the results of a search in the HDX database.

    Returns
    -------
    tmp : list
        List of the distinct sources present in the results.

    """
    tmp = set()
    for i in x['dataset_source'].dropna():
        # Sources are comma-separated; strip the whitespace around each name
        tmp.update(s.strip() for s in re.findall(r'[^,]+(?=,|$)', i))
    # Drop the empty name left by entries such as "WHO, "
    tmp.discard('')
    return list(tmp)


def select_sources(x, y):
    """
    Select the items whose sources match a pattern.

    Parameters
    ----------
    x : dataframe
        Dataframe built from the results of a search in the HDX database.
    y : str
        Pattern (regular expression) to look for in the sources.

    Returns
    -------
    dataframe
        Selection of the items containing y in their sources.

    """
    # A boolean mask works whatever the index labels are (e.g. after a filter),
    # unlike positional lookups with x.loc[i]
    mask = x['dataset_source'].str.contains(y, regex=True, na=False)
    return x[mask]
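The two source helpers can be illustrated on made-up data. The sketch below is an alternative formulation rather than the functions above: it splits entries with `str.split` instead of a regular expression, and the `sample` rows are hypothetical.

```python
import pandas as pd

# Hypothetical 'dataset_source' values, comma-separated as in HDX results.
sample = pd.DataFrame({"dataset_source": ["WHO, UNICEF", "WHO", "OCHA,WFP"]})

# Distinct source names: split on commas, trim whitespace, drop empty pieces.
sources = sorted(
    {
        name.strip()
        for entry in sample["dataset_source"]
        for name in entry.split(",")
        if name.strip()
    }
)

# Rows whose source field matches a pattern, independent of index labels.
who_rows = sample[sample["dataset_source"].str.contains("WHO", regex=True)]

print(sources)
print(who_rows)
```

Sorting the names makes the output reproducible, which a plain `set` does not guarantee.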


## By date :

Define a function that extracts the dataset date from ['dataset_date'] (for datasets covering an interval, the end date is used) and adds it to the dataframe as a 'date_var' column:

In [None]:
def get_date(x):
    """
    Parameters : a dataframe containing search results from the HDX database
    Returns : a copy of the dataframe with a 'date_var' datetime column
    """
    # Keep only the last YYYY-MM-DD date found in each 'dataset_date' value
    pattern = r'(\d{4}-\d{2}-\d{2})(?!.*\d{4}-\d{2}-\d{2}T)'
    x = x.copy()  # avoid modifying the caller's dataframe
    # Values without a date become NaT instead of breaking the column length
    dates = x['dataset_date'].str.extract(pattern, expand=False)
    x['date_var'] = pd.to_datetime(dates)
    return x

Define a function that adds the date variable to the dataframe and sorts the dataframe by this column:

In [None]:
def sort_by_date(x):
    """
    Parameters : a dataframe containing search results from the HDX database
    Returns : the given dataframe sorted by dataset date, most recent first
    """
    x = get_date(x)
    x = x.sort_values(by='date_var', ascending=False)
    return x
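To see what the date pattern keeps, here is a minimal sketch on made-up rows. It assumes that `dataset_date` holds a range written as `[start TO end]`; that format and the `sample` values are assumptions and are not taken from this notebook's outputs.

```python
import pandas as pd

# Hypothetical rows; the "[start TO end]" layout is an assumed format.
sample = pd.DataFrame(
    {
        "title": ["Old", "Recent"],
        "dataset_date": [
            "[2019-05-01T00:00:00 TO 2019-12-31T23:59:59]",
            "[2020-01-22T00:00:00 TO 2023-03-09T23:59:59]",
        ],
    }
)

# Same pattern as in get_date: a date not followed by another "date + T".
pattern = r"(\d{4}-\d{2}-\d{2})(?!.*\d{4}-\d{2}-\d{2}T)"
last_dates = sample["dataset_date"].str.extract(pattern, expand=False)
sample["date_var"] = pd.to_datetime(last_dates)

# Most recent end date first.
ordered = sample.sort_values(by="date_var", ascending=False)
print(ordered[["title", "date_var"]])
```

The negative lookahead rejects the start date because another timestamp follows it, so only the end of the interval is captured.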

## Read module :

Read a dataset and get its resources:

In [26]:
dataset = Dataset.read_from_hdx("novel-coronavirus-2019-ncov-cases")
print(dataset.get_reference_period())

{'startdate': datetime.datetime(2020, 1, 22, 0, 0, tzinfo=datetime.timezone.utc), 'enddate': datetime.datetime(2023, 3, 9, 23, 59, 59, tzinfo=datetime.timezone.utc), 'startdate_str': '2020-01-22T00:00:00+00:00', 'enddate_str': '2023-03-09T23:59:59+00:00', 'ongoing': False}


In [29]:
period = dataset.get_reference_period()
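The reference period is a plain dictionary, so its values can be used directly. A small sketch, with the dictionary copied by hand from the printed output above so that it runs without an HDX connection (the name `duration` is introduced here for illustration):

```python
from datetime import datetime, timezone

# Copied from the output printed above; the string fields are left out.
period = {
    "startdate": datetime(2020, 1, 22, 0, 0, tzinfo=timezone.utc),
    "enddate": datetime(2023, 3, 9, 23, 59, 59, tzinfo=timezone.utc),
    "ongoing": False,
}

# Length of the period covered by the dataset.
duration = period["enddate"] - period["startdate"]
print(duration.days, "days, ongoing:", period["ongoing"])
```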

In [31]:
location_iso3s = dataset.get_location_iso3s()

In [33]:
location_names = dataset.get_location_names()

In [34]:
tags = dataset.get_tags()