# Search procedure sheet for HDX using Python

A set of scripts to install, configure and use Python modules to search the Humanitarian Data Exchange (HDX) database.

In [22]:
import pandas as pd
import numpy as np

Load the package for regular expressions:
https://docs.python.org/3/library/re.html

In [None]:
import re

Load the packages that manage the connection to the Humanitarian Data Exchange platform (see the wiki for links to sources and tutorials):

In [None]:
from hdx.utilities.easy_logging import setup_logging
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

## Setup

Setup has to be performed only once :

In [23]:
setup_logging()

In [25]:
# Configuration has to be made only once
Configuration.create(hdx_site="prod", user_agent="BDAUTRIF_HDX-client-Proj", hdx_read_only=True)

## Search 

Functions for searching in the HDX database (client web-site and tutorials, see wiki). 

**First, run a search** (the configuration from the Setup section must already be in place):

In [None]:
datasets = Dataset.search_in_hdx("WHO", rows=1000)

More parameters are available through the "get_*" methods of the Dataset object.
See:
https://github.com/Ben-zie/HDX_Proj-1/blob/main/Python_HDX-Proj-1.ipynb

**Store the search results in a dataframe**:

In [None]:
results = pd.DataFrame(datasets)

### Variables in the 'results' dataframe :

Several variables are available once the results are loaded into a dataframe:

|FIELD|TYPE|FIELD|TYPE|
|--|--|--|--|
|archived                              |bool|maintainer                          |object|
|batch                               |object|metadata_created                    |object|
|caveats                             |object|metadata_modified                   |object|
|cod_level                           |object|methodology                         |object|
|creator_user_id                     |object|methodology_other                   |object|
|customviz                           |object|name                                |object|
|data_update_frequency               |object|notes                               |object|
|dataseries_name                     |object|num_resources                        |int64|
|dataset_date                        |object|num_tags                             |int64|
|dataset_preview                     |object|organization                        |object|
|dataset_source                      |object|overdue_date                        |object|
|due_date                            |object|owner_org                           |object|   
|groups                              |object|package_creator                     |object|
|has_geodata                           |bool|pageviews_last_14_days               |int64|
|has_quickcharts                       |bool|private                               |bool|
|has_showcases                         |bool|qa_checklist                        |object|
|id                                  |object|qa_completed                          |bool|
|indicator                           |object|quality                             |object|
|is_requestdata_type                   |bool|relationships_as_object             |object|
|isopen                                |bool|relationships_as_subject            |object| 
|last_modified                       |object|review_date                         |object|
|license_id                          |object|solr_additions                      |object| 
|license_other                       |object|state                               |object| 
|license_title                       |object|subnational                         |object| 
|license_url                         |object|tags                                |object| 
|title                               |object|updated_by_script                   |object|
|total_res_downloads                  |int64|url                                 |object|
|type                                |object|version                             |object|
|date_var                        |datetime64|                        || 
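
The types in the table can be checked on any results dataframe with `dtypes`. The sketch below uses two made-up records (hypothetical values, not real HDX output) that carry only a handful of the fields listed above.

```python
import pandas as pd

# Two made-up records with a few of the fields listed above
# (hypothetical values, not real HDX results).
sample = pd.DataFrame([
    {"name": "dataset-a", "has_geodata": True, "num_resources": 3},
    {"name": "dataset-b", "has_geodata": False, "num_resources": 1},
])

# One type per column.
print(sample.dtypes)

# Keep only the columns of interest for a quick overview.
overview = sample[["name", "num_resources"]]
print(overview)
```

Selecting a few columns this way keeps the output readable, since a full results dataframe has several dozen fields.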

Methods exist to get information and parameters from a specific dataset (see further in this article). The following scripts aim to extract information and selections directly from the list of results returned by a search.

# Sort results 

## By keyword :

**Source: online Python course on the Kaggle platform**:

In [None]:
def word_search(datasets_list, patern, field="notes"):
    """
    Desc : function to search a (unique) KEYWORD in a field
    Input :  a DATAFRAME of results, a KEYWORD to look for, and the text FIELD to search in (default : 'notes')
    Output : a LIST of indexes of the documents matching KEYWORD
    """
    # List to hold the indices of matching documents
    indices = []
    datasets_list_field = datasets_list[field]
    # Iterate through the indices (i) and elements (dataset) of the chosen field
    for i, dataset in enumerate(datasets_list_field):
        # Split the string into a list of words (according to whitespace)
        tokens = dataset.split()
        # Make a transformed list where we 'normalize' each word to facilitate matching.
        # Periods and commas are removed from the end of each word, and it's set to all lowercase.
        normalized = [token.rstrip('.,').lower() for token in tokens]
        # Is there a match ? If so, update the list of matching indices.
        if patern.lower() in normalized:
            indices.append(i)
    return indices
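
To see what the normalization step does, here is a minimal standalone sketch on plain strings. It mirrors the logic of `word_search` but works on a Python list instead of a dataframe column; the helper name `match_indices` and the example texts are made up for illustration.

```python
def match_indices(documents, keyword):
    """Return the positions of the documents that contain keyword as a whole word."""
    indices = []
    for i, doc in enumerate(documents):
        # Split on whitespace, strip trailing periods/commas, lowercase.
        normalized = [token.rstrip('.,').lower() for token in doc.split()]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices


# Made-up notes, for illustration only.
notes = [
    "Weekly cholera cases, by district.",
    "Population estimates for 2020",
    "Cholera. Deaths and recoveries",
]
print(match_indices(notes, "cholera"))  # [0, 2]
```

Because the comparison is made word by word, a partial keyword such as "chol" matches nothing.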

## By date :
*Using package "re" for regular expressions :*

Define a function that extracts the dataset dates from ['dataset_date'] and adds them to the dataframe as two columns (for datasets referenced by an interval, the first date goes to 'date_before_var' and the last one to 'date_after_var'):

In [None]:
def set_date (x):
    """
    Function : adds DATE_BEFORE_VAR and DATE_AFTER_VAR columns to the dataframe
    Those are date-formatted : %Y-%m-%d
    Parameters : a dataframe containing search results from the HDX database
    Returns : the same dataframe, with the two date columns added
    """
    # Extract the date column from the results dataframe :
    x_dates = x.get('dataset_date')
    # Get each date of the pattern (FROM / TO) :
    # TODO : verify compatibility with the database regarding the position of the date in single-date entries
    x_dates = x_dates.str.findall(r'(\d{4}-\d{2}-\d{2})T')
    # 'Serialize' the dates into columns 0 and 1 (column 1 is created empty if no row has a second date) :
    x_dates = x_dates.apply(pd.Series).reindex(columns=[0, 1])
    # Insert them in the original dataframe :
    x['date_before_var'] = pd.to_datetime(x_dates[0])
    x['date_after_var']  = pd.to_datetime(x_dates[1])
    # Which we return :
    print("set_date all done")
    return x
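
The regular expression used in `set_date` can be tried on a single string. The sample values below assume that `dataset_date` looks like `[start TO end]` with ISO timestamps, and that an open-ended period is written with `*`; both formats are assumptions here and should be checked against real results.

```python
import re

# Assumed shape of a 'dataset_date' value (to be verified against real data).
sample = "[2020-01-22T00:00:00 TO 2023-03-09T23:59:59]"

# Capture each YYYY-MM-DD that is immediately followed by the 'T' time separator.
dates = re.findall(r'(\d{4}-\d{2}-\d{2})T', sample)
print(dates)  # ['2020-01-22', '2023-03-09']

# An open-ended value (assumed notation) yields a single date,
# so the second date may be missing for some rows.
open_ended = re.findall(r'(\d{4}-\d{2}-\d{2})T', "[2020-01-22T00:00:00 TO *]")
print(open_ended)  # ['2020-01-22']
```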

**Define a function that adds the date columns to the dataframe and sorts it by date** :

In [None]:
def sort_by_date (x, ascending=True, inplace = False) :
    """
    Function : Sorts a dataframe by dates. Mainly a helper for the 'select_by_date' function.
    Parameters : a dataframe containing search results from the HDX database
    Returns : the given dataframe sorted by 'date_before_var' (first date of the period)
    """
    # Call the 'set_date' function to have the right columns to look in :
    x = set_date(x)
    # Use the formatted column to sort the dataframe.
    # sort_values returns None when inplace=True, so both cases are handled separately :
    if inplace :
        x.sort_values(by='date_before_var', ascending = ascending, inplace = True)
        return x
    return x.sort_values(by='date_before_var', ascending = ascending)
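
A note on `inplace`: `DataFrame.sort_values` returns `None` when called with `inplace=True`, so its result should only be reassigned when `inplace=False`. A small sketch on a toy dataframe with made-up dates:

```python
import pandas as pd

# Toy dataframe with made-up dates.
df = pd.DataFrame({"date_before_var": pd.to_datetime(["2021-05-01", "2020-01-22"])})

# inplace=False (the default) returns a new, sorted dataframe.
sorted_copy = df.sort_values(by="date_before_var")

# inplace=True sorts df itself and returns None.
returned = df.sort_values(by="date_before_var", inplace=True)

print(returned is None)        # True
print(sorted_copy.equals(df))  # True: both are now sorted the same way
```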

**Aggregated time-searching functions** :

In [None]:
def select_by_date (results,
                    start_date,
                    end_date,
                    ascending = False,
                    inplace = False) :
    """
    Function : selects the datasets whose last date ('date_after_var') is after START_DATE, before END_DATE, or between both.
    Pass None for a bound that should not apply. 'inplace' is accepted but currently unused.
    """
    # From the 'results' dataframe, set and sort the dates in formatted columns :
    x = sort_by_date(results, ascending = ascending)
    # Cases for search FROM, TO or IN BETWEEN dates :
    if start_date is not None and end_date is not None :
        # Select DataFrame rows between two dates using DataFrame.isin()
        y = x[x["date_after_var"].isin(pd.date_range(start_date, end_date))]
        print('option 1 : find items in interval')
        # Source : https://sparkbyexamples.com/pandas/pandas-select-dataframe-rows-between-two-dates/
    elif start_date is not None :
        # Selecting dates later than the limit-date :
        mask = (x['date_after_var'] > start_date)
        y = x[mask]
        print('option 2 : find items after the limit-date')
    elif end_date is not None :
        # Selecting dates earlier than the limit-date :
        mask = (x['date_after_var'] < end_date)
        y = x[mask]
        print('option 3 : find items before the limit-date')
    else :
        # No limit-date to search for
        print('no limits given for selection')
        y = None
    return y
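
The three cases above boil down to boolean masks on a datetime column. A minimal sketch on a toy dataframe with made-up dates; `between` is used here as an alternative to `isin(pd.date_range(...))` and includes both bounds:

```python
import pandas as pd

# Toy results with made-up end dates.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "date_after_var": pd.to_datetime(["2019-12-31", "2021-06-15", "2023-03-09"]),
})

start, end = pd.Timestamp("2020-01-01"), pd.Timestamp("2022-12-31")

# Inside the interval (both bounds included).
inside = toy[toy["date_after_var"].between(start, end)]
# Strictly after the start date.
after = toy[toy["date_after_var"] > start]
# Strictly before the end date.
before = toy[toy["date_after_var"] < end]

print(inside["name"].tolist())  # ['b']
print(after["name"].tolist())   # ['b', 'c']
print(before["name"].tolist())  # ['a', 'b']
```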

## By type :

In [None]:
def search_geodata (x, y) :
    """

    Parameters
    ----------
    x : dataframe (pandas)
        Source dataframe generated with results elements
    y : bool
        A boolean telling whether to select the elements that have geodata (True) or those that do not (False)

    Returns
    -------
    x : dataframe
        Items found in HDX (from the results passed as argument) that have / do not have geodata.

    """
    x = x[x['has_geodata'] == y]
    return x

## By source :

In [None]:
def search_sources (x) :
    """

    Parameters
    ----------
    x : dataframe
        dataframe generated with the results of a search in the HDX database.

    Returns
    -------
    tmp : list
        list of the distinct sources present in the results.

    """
    tmp = []
    pool_sources = x.dataset_source
    # Split each comma-separated source string and collect every name :
    for i in pool_sources :
        tmp.extend(re.findall(r'[^,]+(?=,|$)', i))
    # Remove duplicates :
    tmp = list(set(tmp))
    return tmp


def select_sources (x, y) :
    """

    Parameters
    ----------
    x : dataframe
        dataframe generated with the results of a search in the HDX database.
    y : str
        pattern to look for in the sources.

    Returns
    -------
    dataframe
        selection of items containing y in their sources.

    """
    res = []
    # Positional access (iloc) keeps this working after the dataframe has been filtered or sorted :
    for i in range(0,len(x)) :
        tmp = re.search(y, x.iloc[i].dataset_source)
        if tmp :
            res.append(i)
    return x.iloc[res]
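
Both functions rest on two regular-expression steps: splitting a comma-separated source string, and testing a pattern against it. A standalone sketch with made-up source strings:

```python
import re

# Made-up 'dataset_source' values.
sources = ["WHO,UNICEF", "OCHA", "UNICEF,WFP"]

# Step 1: split each string on commas and collect the distinct names.
names = set()
for s in sources:
    names.update(re.findall(r'[^,]+(?=,|$)', s))
print(sorted(names))  # ['OCHA', 'UNICEF', 'WFP', 'WHO']

# Step 2: positions of the strings that match a pattern.
hits = [i for i, s in enumerate(sources) if re.search("UNICEF", s)]
print(hits)  # [0, 2]
```

If the source strings contain spaces after the commas, calling `.strip()` on each name avoids near-duplicates such as `'WHO'` and `' WHO'`.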


## Read module :

Read a dataset and get its metadata (reference period, locations, tags):

In [26]:
dataset = Dataset.read_from_hdx("novel-coronavirus-2019-ncov-cases")
print(dataset.get_reference_period())

{'startdate': datetime.datetime(2020, 1, 22, 0, 0, tzinfo=datetime.timezone.utc), 'enddate': datetime.datetime(2023, 3, 9, 23, 59, 59, tzinfo=datetime.timezone.utc), 'startdate_str': '2020-01-22T00:00:00+00:00', 'enddate_str': '2023-03-09T23:59:59+00:00', 'ongoing': False}
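
The returned dictionary holds timezone-aware `datetime` objects, so the length of the reference period can be computed directly. The sketch below rebuilds the two dates from the output above instead of calling HDX:

```python
from datetime import datetime, timezone

# Dates copied from the output printed above.
period = {
    "startdate": datetime(2020, 1, 22, 0, 0, tzinfo=timezone.utc),
    "enddate": datetime(2023, 3, 9, 23, 59, 59, tzinfo=timezone.utc),
    "ongoing": False,
}

# Subtracting two datetimes gives a timedelta.
duration = period["enddate"] - period["startdate"]
print(duration.days)  # 1142
```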


In [29]:
period = dataset.get_reference_period()

In [31]:
location_iso3s = dataset.get_location_iso3s()

In [33]:
location_names = dataset.get_location_names()

In [34]:
tags = dataset.get_tags()