# Procedure sheet for searching the HDX database using Python

A set of scripts to install, configure and use the Python modules needed to search the Humanitarian Data Exchange (HDX) database.

In [22]:
import pandas as pd
import numpy as np

Load the package for regular expression handling:
https://docs.python.org/3/library/re.html

In [None]:
import re

Load the packages for connecting to the Humanitarian Data Exchange platform (see the wiki for links to sources and tutorials):

In [None]:
from hdx.utilities.easy_logging import setup_logging
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

## Setup

Setup has to be performed only once:

In [23]:
setup_logging()

In [25]:
# Configuration has to be made only once
Configuration.create(hdx_site="prod", user_agent="BDAUTRIF_HDX-client-Proj", hdx_read_only=True)

## Search module

Basic search in HDX using the client method:

In [None]:
datasets = Dataset.search_in_hdx("WHO", rows=1000)

More parameters are available through the "get_*" methods of the Dataset object:
See https://github.com/Ben-zie/HDX_Proj-1/blob/main/Python_HDX-Proj-1.ipynb

Store the results in a dataframe:

In [None]:
results = pd.DataFrame(datasets)

### Variables in the result dataframe


['archived', 'batch', 'caveats', 'cod_level', 'creator_user_id', 'customviz', 'data_update_frequency', 'dataseries_name', 'dataset_date', 'dataset_preview', 'dataset_source', 'due_date', 'groups', 'has_geodata', 'has_quickcharts', 'has_showcases', 'id', 'is_requestdata_type', 'isopen', 'last_modified', 'license_id', 'license_other', 'license_title', 'license_url', 'maintainer', 'metadata_created', 'metadata_modified', 'methodology', 'methodology_other', 'name', 'notes', 'num_resources', 'num_tags', 'organization', 'overdue_date', 'owner_org', 'package_creator', 'pageviews_last_14_days', 'private', 'qa_checklist', 'qa_completed', 'relationships_as_object', 'relationships_as_subject', 'review_date', 'solr_additions', 'state', 'subnational', 'tags', 'title', 'total_res_downloads', 'type', 'updated_by_script', 'url', 'version']

## Sorting results by date

Using the "re" package for regular expressions:

Define a function that extracts the dataset date from ['dataset_date'] (for datasets referenced by an interval, the end date is chosen) and stores it in a new 'date_var' column of the dataframe:

In [None]:
def get_date(x):
    """
    Parameters : a dataframe containing search results from the HDX database
    Returns : the same dataframe with an added datetime column 'date_var'
    """
    dates = []
    for i in x['dataset_date']:
        # Keep only the last date of the string (end of the interval, if any)
        found = re.findall(r'(\d{4}-\d{2}-\d{2})(?!.*\d{4}-\d{2}-\d{2}T)', str(i))
        # Use None when no date is found, so the column length always matches
        dates.append(found[-1] if found else None)
    x['date_var'] = dates
    x['date_var'] = pd.to_datetime(x['date_var'])
    return x
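
To see what the regular expression in get_date keeps, here is a minimal sketch that applies the same pattern to a few strings. The sample values are hypothetical and only assumed to resemble the 'dataset_date' field (a start and an end timestamp, or an open-ended interval); they are not taken from an actual search.

```python
import re

# Same pattern as in get_date: a date that is NOT followed, later in the
# string, by another "YYYY-MM-DDT" timestamp (i.e. the last date present)
LAST_DATE = re.compile(r'(\d{4}-\d{2}-\d{2})(?!.*\d{4}-\d{2}-\d{2}T)')

# Hypothetical sample values, assumed to mimic the 'dataset_date' format
samples = [
    "[2020-01-22T00:00:00 TO 2023-03-09T23:59:59]",  # interval
    "[2021-06-01T00:00:00 TO 2021-06-01T23:59:59]",  # single day
    "[2020-01-22T00:00:00 TO *]",                    # open-ended interval
]

last_dates = [LAST_DATE.findall(s) for s in samples]
for s, d in zip(samples, last_dates):
    print(s, "->", d)
```

For an interval the end date is kept; for an open-ended interval only the start date is present, so that one is kept.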

Define a function that adds the date variable to the dataframe and sorts the dataframe by this column:

In [None]:
def sort_by_date(x):
    """
    Parameters : a dataframe containing search results from the HDX database
    Returns : the given dataframe sorted by the last date of each dataset, most recent first
    """
    x = get_date(x)
    x = x.sort_values(by='date_var', ascending=False)
    return x
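
The ordering produced by sort_by_date can be illustrated without pandas. The following is a plain-Python sketch of the same idea (most recent date first), not the dataframe-based method above; the records and the "last_date" key are hypothetical stand-ins for rows of the search results.

```python
from datetime import date

# Hypothetical records standing in for rows of the search results
records = [
    {"title": "Dataset A", "last_date": "2021-06-01"},
    {"title": "Dataset B", "last_date": "2023-03-09"},
    {"title": "Dataset C", "last_date": "2020-01-22"},
]

# Most recent first, mirroring sort_values(by='date_var', ascending=False)
ordered = sorted(
    records,
    key=lambda r: date.fromisoformat(r["last_date"]),
    reverse=True,
)
titles = [r["title"] for r in ordered]
print(titles)
```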

## Read module

Read a dataset and get its metadata (reference period, locations, tags):

In [26]:
dataset = Dataset.read_from_hdx("novel-coronavirus-2019-ncov-cases")
print(dataset.get_reference_period())

{'startdate': datetime.datetime(2020, 1, 22, 0, 0, tzinfo=datetime.timezone.utc), 'enddate': datetime.datetime(2023, 3, 9, 23, 59, 59, tzinfo=datetime.timezone.utc), 'startdate_str': '2020-01-22T00:00:00+00:00', 'enddate_str': '2023-03-09T23:59:59+00:00', 'ongoing': False}


In [29]:
period = dataset.get_reference_period()
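
The reference period is an ordinary dictionary, so its values can be used directly. As a minimal sketch, the block below copies the dictionary from the output printed above (rather than querying HDX) and derives the length of the period; the variable names "duration" and "parsed_start" are only illustrative.

```python
import datetime

# Copy of the dictionary printed above, so the example runs without HDX access
period = {
    'startdate': datetime.datetime(2020, 1, 22, 0, 0, tzinfo=datetime.timezone.utc),
    'enddate': datetime.datetime(2023, 3, 9, 23, 59, 59, tzinfo=datetime.timezone.utc),
    'startdate_str': '2020-01-22T00:00:00+00:00',
    'enddate_str': '2023-03-09T23:59:59+00:00',
    'ongoing': False,
}

# Length of the reference period as a timedelta
duration = period['enddate'] - period['startdate']
print(duration.days)

# The "*_str" entries are ISO 8601 strings and parse back to the same datetimes
parsed_start = datetime.datetime.fromisoformat(period['startdate_str'])
print(parsed_start == period['startdate'])
```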

In [31]:
location_iso3s = dataset.get_location_iso3s()

In [33]:
location_names = dataset.get_location_names()

In [34]:
tags = dataset.get_tags()