In [None]:
from nbdev import *

%nbdev_default_export discovery

Cells will be exported to istatapi.discovery,
unless a different module is specified after an export flag: `%nbdev_export special.module`


In [None]:
%nbdev_hide
%load_ext autoreload
%autoreload 2



In [None]:
%nbdev_export
from istatapi.base import ISTAT
from istatapi.utils import make_tree, strip_ns
import pandas as pd
import pprint
from tqdm import tqdm

# Discovery

> Functions used to discover and explore the data exposed by ISTAT webservice

This module implements functions to discover the data exposed by ISTAT. To do so, `istatapi` makes metadata requests to the API endpoints. The `Discovery` module provides methods to parse and analyze the API responses. It relies on `pandas` and returns data as `DataFrame` objects, which is convenient for interactive, exploratory analysis in Jupyter Notebooks.

The `Discovery` module exposes two groups of tools: the functions `all_available` and `search_dataset`, which list and search the available dataflows, and the `DataSet` class, which collects methods to explore a single dataflow. Their usage is documented below.

## Explore available datasets

To understand what data we can retrieve from the ISTAT database, the _Data Discovery_ service lets us look into the list of datasets provided by ISTAT. To do so, we set the ```resource``` path parameter to ```dataflow```. The API's text response is then parsed in Python to display the full list of dataflows available for exploration. These steps are wrapped by the functions `parse_dataflows` and `all_available`.
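
As a rough illustration of the request described above, the sketch below assembles the path of a metadata query. `BASE_URL` is a placeholder and `build_metadata_path` is a hypothetical helper, not part of `istatapi`; only the resulting `dataflow/IT1` path is taken from this notebook.

```python
# Placeholder only: substitute the real ISTAT SDMX REST endpoint.
BASE_URL = "https://example.org/rest"


def build_metadata_path(resource, agency_id="IT1", resource_id=None):
    """Join the path parameters of a metadata request with slashes."""
    parts = [resource, agency_id]
    if resource_id is not None:
        parts.append(resource_id)
    return "/".join(parts)


# Listing every dataflow only needs the resource and the agency ID.
dataflow_path = build_metadata_path("dataflow")
print(f"{BASE_URL}/{dataflow_path}")
```

The same helper also covers the `datastructure` requests used later in this notebook, where the structure ID is appended as a third path segment.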

In [None]:
%nbdev_export
def parse_dataflows(response):
    """parse the `response` containing all the available datasets and return a list of dataflows."""
    tree = make_tree(response)
    strip_ns(tree)
    root = tree.root

    dataflows_l = []
    for dataflow in root.iter("Dataflow"):
        df_id = dataflow.get("id")  # `df_id` avoids shadowing the builtin `id`
        version = dataflow.get("version")
        structure_id = [ref.get("id") for ref in dataflow.iter("Ref")][0]

        # reset for every dataflow, so a missing English name is not silently
        # replaced by the description of the previous dataflow
        description_en = None
        for name in dataflow.findall("Name"):
            lang = name.get("{http://www.w3.org/XML/1998/namespace}lang")
            if lang == "en":
                description_en = name.text
            # if lang == 'it':
            # description_it = name.text

        dataflow_dict = {
            "df_id": df_id,
            "version": version,
            "description": description_en,
            # "description_it": description_it,
            "df_structure_id": structure_id,
        }

        dataflows_l.append(dataflow_dict)

    return dataflows_l


def all_available(dataframe=True):
    """Return all available dataflows"""
    path = "dataflow/IT1"
    client = ISTAT()
    response = client._request(path=path)
    dataflows = parse_dataflows(response)

    if dataframe:
        dataflows = pd.DataFrame(dataflows)

    return dataflows


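def search_dataset(keyword):
    """Search available dataflows that contain `keyword`. Return these dataflows in a DataFrame"""
    # fetch the list once: every call to `all_available` sends a request to the API
    available = all_available()
    # `na=False` skips dataflows without a description instead of failing on them
    mask = available["description"].str.contains(keyword, case=False, na=False)
    dataflows = available[mask]

    return dataflows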
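
To make the parsing step more concrete, here is a minimal sketch that uses only the standard library. The XML fragment is hand-written and heavily simplified (real responses are namespaced and much larger), and `parse_dataflow_sample` is an illustrative stand-in for `parse_dataflows`, which relies on the package's own `make_tree` and `strip_ns` helpers.

```python
import xml.etree.ElementTree as ET

# The predefined `xml:` prefix expands to this namespace in attribute names.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written, simplified fragment: an assumption, not a real API response.
SAMPLE_XML = """
<Structures>
  <Dataflow id="151_914" version="1.1">
    <Name xml:lang="it">Tasso di disoccupazione</Name>
    <Name xml:lang="en">Unemployment rate</Name>
    <Structure><Ref id="DCCV_TAXDISOCCU1"/></Structure>
  </Dataflow>
</Structures>
"""


def parse_dataflow_sample(xml_text):
    """Return one dict per <Dataflow> element found in `xml_text`."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for flow in root.iter("Dataflow"):
        # Index the names by language, so the English one is picked explicitly.
        names = {name.get(XML_LANG): name.text for name in flow.findall("Name")}
        ref = flow.find(".//Ref")
        rows.append(
            {
                "df_id": flow.get("id"),
                "version": flow.get("version"),
                "description": names.get("en"),
                "df_structure_id": ref.get("id") if ref is not None else None,
            }
        )
    return rows


rows = parse_dataflow_sample(SAMPLE_XML)
print(rows)
```

Selecting the name by its `xml:lang` attribute, rather than by position, keeps the result stable even if the order of the `Name` elements changes.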

The simplest way to get the full list of dataflows provided by ISTAT is to call `all_available()`. By default it returns a DataFrame listing all the explorable dataflows, together with their IDs, versions, descriptions and structure IDs.

In [None]:
available_datasets = all_available()
available_datasets.head()

Unnamed: 0,df_id,version,description,df_structure_id
0,101_1015,1.2,Crops,DCSP_COLTIVAZIONI
1,101_1030,1.0,"PDO, PGI and TSG quality products",DCSP_DOPIGP
2,101_1033,1.0,slaughtering,DCSP_MACELLAZIONI
3,101_1039,1.0,Agritourism - municipalities,DCSP_AGRITURISMO_COM
4,101_1077,1.0,"PDO, PGI and TSG products: operators - munici...",DCSP_DOPIGP_COM


In [None]:
show_doc(search_dataset)

<h4 id="search_dataset" class="doc_header"><code>search_dataset</code><a href="__main__.py#L48" class="source_link" style="float:right">[source]</a></h4>

> <code>search_dataset</code>(**`keyword`**)

Search available dataflows that contain `keyword`. Return these dataflows in a DataFrame

This function looks for `keyword` inside the descriptions of all datasets, ignoring case. Since only the English descriptions are parsed, the `keyword` needs to be an English word.
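
The matching logic can be sketched without `pandas`. The example below is an approximation: it uses a plain substring test, whereas `Series.str.contains` interprets the keyword as a regular expression by default. `search_rows` is a hypothetical helper, and the rows are a handful of entries copied from the tables in this notebook.

```python
# A few rows copied from the output tables in this notebook.
DATAFLOWS = [
    {"df_id": "101_1015", "description": "Crops"},
    {"df_id": "151_874", "description": "Unemployment rate - monthly data"},
    {"df_id": "151_929", "description": "Unemployment"},
    {"df_id": "172_926", "description": "Supplementary indicators to unemployment"},
]


def search_rows(rows, keyword):
    """Keep the rows whose description contains `keyword`, ignoring case."""
    needle = keyword.casefold()
    return [row for row in rows if needle in row["description"].casefold()]


matches = search_rows(DATAFLOWS, "Unemployment")
print([row["df_id"] for row in matches])
```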

In [None]:
search_dataset(keyword="Unemployment")

Unnamed: 0,df_id,version,description,df_structure_id
79,151_874,1.0,Unemployment rate - monthly data,DCCV_TAXDISOCCUMENS1
81,151_884,1.0,Unemployment rate - quarterly seasonally adjus...,DCCV_TAXDISOCCUDE1
83,151_914,1.1,Unemployment rate,DCCV_TAXDISOCCU1
84,151_929,1.1,Unemployment,DCCV_DISOCCUPT1
159,172_926,1.1,Supplementary indicators to unemployment,DCCV_COMPL1


## Data Structures and Information about available Datasets

The API provides a lot of information about its datasets, and the `DataSet` class implements methods to retrieve it. Its main functionalities are:

1. Find the ID of the ```dataflow``` (dataset) containing the data we want to retrieve.
2. Display the structure of the ```dataflow``` we want to query, that is, its dimensions. A dimension is another name for a column or a variable.
3. Explain the names of the dimensions.
4. Display the possible values of the dimensions.

In [None]:
%nbdev_export
class DataSet(ISTAT):
    """Class that implements methods to retrieve information (metadata) about a Dataset"""

    def __init__(self, dataflow_identifier):
        super().__init__()
        self.resource = "datastructure"
        self.all_available = all_available()  # df with all the available dataflows
        self.identifiers = self.set_identifiers(dataflow_identifier)
        self.dimensions = self.get_dimensions(description=False).dimension.tolist()
        self.values = self.available_dimensions_values()
        self.filters = self.default_filters()
        # self.dimensions_values = self.available_dimensions_values()

        # TODO: returning all metadata related to the dataflow contained in 'Header'

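    def set_identifiers(self, dataflow_identifier):
        """Look up the dataflow from its ID, its structure ID or its description."""
        if not isinstance(dataflow_identifier, str):
            raise ValueError(dataflow_identifier)
        # a df_id looks like "151_914", a df_structure_id like "DCCV_TAXDISOCCU1";
        # slicing avoids an IndexError on identifiers shorter than 5 characters
        if dataflow_identifier[3:4] == "_":
            return self.set_from_id(dataflow_identifier)
        elif dataflow_identifier[4:5] == "_":
            return self.set_from_structure_id(dataflow_identifier)
        else:
            return self.set_from_description(dataflow_identifier)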

    def set_from_id(self, df_id):
        mask = self.all_available["df_id"] == df_id
        df = self.all_available[mask]
        return df.to_dict(orient="records")[0]

    def set_from_structure_id(self, df_structure_id):
        mask = self.all_available["df_structure_id"] == df_structure_id
        df = self.all_available[mask]
        return df.to_dict(orient="records")[0]

    def set_from_description(self, description):
        mask = self.all_available["description"] == description
        df = self.all_available[mask]
        return df.to_dict(orient="records")[0]

    def parse_dimensions(self, response):
        """Parse the `response` containing a dataflow's dimensions and return them in a list"""
        tree = make_tree(response)
        strip_ns(tree)
        root = tree.root

        dimensions_l = []
        for dimension in root.iter("Dimension"):
            dimension_name = dimension.attrib["id"]

            dimension_id = [
                enumeration.find("Ref").get("id")
                for enumeration in dimension.iter("Enumeration")
            ][0]

            dimension_dict = {"dimension": dimension_name, "dimension_ID": dimension_id}

            dimensions_l.append(dimension_dict)

        return dimensions_l

    def get_dimensions(self, dataframe=True, description=True):
        """Return the dimensions and their descriptions of a specific dataflow."""
        df_structure_id = self.identifiers["df_structure_id"]

        path_parts = [self.resource, self.agencyID, df_structure_id]
        path = "/".join(path_parts)
        response = self._request(path=path)
        dimensions = self.parse_dimensions(response)

        # always build the DataFrame: the description merge below needs one
        dimensions = pd.DataFrame(dimensions)

        if description:
            dimensions_description = self.dimensions_description(dimensions)
            dimensions = dimensions.merge(dimensions_description, on="dimension_ID")

        return dimensions if dataframe else dimensions.to_dict(orient="records")

    def dimensions_description(self, dimensions):
        """Return a dataframe with the descriptions of `dimensions`"""
        resource = "codelist"
        dimensions_l = dimensions.dimension_ID.tolist()
        descriptions_l = []

        for dimension_id in dimensions_l:
            path_parts = [resource, self.agencyID, dimension_id]
            path = "/".join(path_parts)
            response = self._request(path=path)
            tree = make_tree(response)
            strip_ns(tree)
            root = tree.root

            description = [x for x in root.iter("Codelist")][0]
            # description_it = description.findall('Name')[0].text
            description = description.findall("Name")[1].text

            description_dict = {
                "dimension_ID": dimension_id,
                "description": description,
            }
            descriptions_l.append(description_dict)

        dimensions_descriptions = pd.DataFrame(descriptions_l)

        return dimensions_descriptions

    def available_dimensions_values(self):
        """Return a dictionary with available values for each dimension in the dataset"""
        resource = "availableconstraint"
        df_id = self.identifiers["df_id"]
        path_parts = [
            resource,
            df_id,
            "?references=all&detail=full",
        ]  # TODO: pass them as parameters
        path = "/".join(path_parts)
        response = self._request(path=path)
        tree = make_tree(response)
        strip_ns(tree)
        root = tree.root

        dimensions_values = {}

        for dimension in root.iter("Codelist"):
            dimension_id = dimension.get("id")

            values = {}
            value_id_l, value_descr_l = [], []

            for value in dimension.iter("Code"):
                value_id = value.get("id")
                value_descr = [name.text for name in value.findall("Name")][1]
                value_id_l.append(value_id)
                value_descr_l.append(value_descr)

            values["values_ids"] = value_id_l
            values["values_description"] = value_descr_l
            dimensions_values[dimension_id] = values

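        # map codelist IDs to dimension names with a single metadata request
        # (calling `get_dimension_name` per key re-fetches the structure every time)
        dimensions_df = self.get_dimensions(description=False)
        id_to_name = {}
        for dim_id, name in zip(dimensions_df["dimension_ID"], dimensions_df["dimension"]):
            id_to_name.setdefault(dim_id, name)  # keep the first match, as before
        dimensions_values = {
            id_to_name[dim_id]: values for dim_id, values in dimensions_values.items()
        }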

        return dimensions_values

    def get_dimension_values(self, dimension, dataframe=True):
        """Return the available values of a single dimension in the dataset"""
        dimension_dict = self.values[dimension]
        dimension_df = pd.DataFrame.from_dict(dimension_dict)
        return dimension_df if dataframe else dimension_dict

    def get_dimension_id(self, dimension):
        """Convert `dimension` to `dimension_id`"""
        # `DataSet` has no `dataset` attribute: call the method on the instance itself
        dimensions_df = self.get_dimensions(description=False)
        mask = dimensions_df["dimension"] == dimension
        dimension_ID = dimensions_df[mask]["dimension_ID"]
        return dimension_ID.values[0]

    def get_dimension_name(self, dimension_id):
        """Convert `dimension_id` to `dimension`"""
        dimensions_df = self.get_dimensions(description=False)
        mask = dimensions_df["dimension_ID"] == dimension_id
        dimension = dimensions_df[mask]["dimension"]
        return dimension.values[0]

    def dimension_values_all(self, dimension, dataframe=True):
        """Return ALL the possible values of a dimension in the dataset"""
        # TODO: convert dimension_ID to dimension name to be consistent. Remove? not really useful
        resource = "codelist"
        dimension_ID = self.get_dimension_id(dimension)
        path_parts = [resource, self.agencyID, dimension_ID]
        path = "/".join(path_parts)
        response = self._request(path=path)
        tree = make_tree(response)
        strip_ns(tree)
        root = tree.root

        values = []
        for value in root.iter("Code"):
            value_id = value.get("id")
            # value_it = value.findall('Name')[0].text
            value = value.findall("Name")[1].text

            value_dict = {"value_ID": value_id, "description": value}

            values.append(value_dict)

        if dataframe:
            values = pd.DataFrame(values)

        return values

    def default_filters(self):
        default_filters = {}  # initiate filter with empty dict
        # no filter equals all values (default)
        for dimension in self.dimensions:
            default_filters[dimension] = "."
        return default_filters

    def set_filters(self, **kwargs):
        """Update the filters from keyword arguments (dimension name, in any case, to value)."""
        for arg, arg_value in kwargs.items():
            self.filters[arg.upper()] = arg_value
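
The filter handling at the end of the class can be pictured with plain dictionaries. The sketch below uses standalone functions instead of the methods and a hard-coded list of dimensions (those of dataflow `151_914`), so it runs without calling the API. How the resulting dictionary is turned into a query is outside this module, and nothing is assumed about it here.

```python
# Dimension names of dataflow 151_914, as exposed by the `dimensions` attribute.
DIMENSIONS = [
    "FREQ",
    "CITTADINANZA",
    "DURATA_DISOCCUPAZ",
    "CLASSE_ETA",
    "ITTER107",
    "SESSO",
    "TIPO_DATO",
    "TITOLO_STUDIO",
]


def default_filters(dimensions):
    """Start with "." for every dimension, which stands for 'all values'."""
    return {dimension: "." for dimension in dimensions}


def set_filters(filters, **kwargs):
    """Return a copy of `filters` updated with upper-cased keyword names."""
    updated = dict(filters)
    for name, value in kwargs.items():
        updated[name.upper()] = value
    return updated


# "ITL" is one of the available CITTADINANZA values for this dataflow.
filters = set_filters(default_filters(DIMENSIONS), cittadinanza="ITL")
print(filters["CITTADINANZA"], filters["FREQ"])
```

Upper-casing the keyword names lets callers write `cittadinanza="ITL"` while the dictionary keys keep matching the dimension names.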

In [None]:
show_doc(DataSet.get_dimensions)

<h4 id="DataSet.get_dimensions" class="doc_header"><code>DataSet.get_dimensions</code><a href="__main__.py#L65" class="source_link" style="float:right">[source]</a></h4>

> <code>DataSet.get_dimensions</code>(**`dataframe`**=*`True`*, **`description`**=*`True`*)

Return the dimensions and their descriptions of a specific dataflow.

The class takes a `df_id`, a `df_structure_id` or a dataflow description as input.
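
The three kinds of identifier are told apart by the position of the first underscore. The standalone sketch below mirrors that heuristic; `classify_identifier` is a hypothetical name used only for illustration. Note that a description with an underscore as its fourth or fifth character would be misclassified.

```python
def classify_identifier(identifier):
    """Guess the kind of identifier from the position of the underscore."""
    if not isinstance(identifier, str):
        raise ValueError(identifier)
    if identifier[3:4] == "_":  # e.g. "151_914"
        return "df_id"
    if identifier[4:5] == "_":  # e.g. "DCCV_TAXDISOCCU1"
        return "df_structure_id"
    return "description"  # anything else is matched against descriptions


for example in ["151_914", "DCCV_TAXDISOCCU1", "slaughtering"]:
    print(example, "->", classify_identifier(example))
```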

In [None]:
ds = DataSet(dataflow_identifier="151_914")
dimensions = ds.get_dimensions()
dimensions.head()

Unnamed: 0,dimension,dimension_ID,description
0,FREQ,CL_FREQ,Frequency
1,CITTADINANZA,CL_CITTADINANZA,Citizenship
2,DURATA_DISOCCUPAZ,CL_DURATA,Duration
3,CLASSE_ETA,CL_ETA1,Age class
4,ITTER107,CL_ITTER107,Territory


We can also look at the dimensions of a dataflow by simply accessing its `dimensions` attribute. However, this list does not include the dimensions' descriptions.

In [None]:
ds = DataSet(dataflow_identifier="151_914")
ds.dimensions

['FREQ',
 'CITTADINANZA',
 'DURATA_DISOCCUPAZ',
 'CLASSE_ETA',
 'ITTER107',
 'SESSO',
 'TIPO_DATO',
 'TITOLO_STUDIO']

In [None]:
# show_doc(DataSet.dimension_values)

**Please note**: `dimension_values_all` returns ALL the possible values of a dimension. In most cases, many of these values will not be present in your dataset.

In [None]:
# dz = DataSet(dataflow_identifier="slaughtering")
# dimensions = dz.get_dimensions()
# dimensions

In [None]:
# dz.dimension_values(dimension="ITTER107").head()

In [None]:
show_doc(DataSet.available_dimensions_values)

<h4 id="DataSet.available_dimensions_values" class="doc_header"><code>DataSet.available_dimensions_values</code><a href="__main__.py#L111" class="source_link" style="float:right">[source]</a></h4>

> <code>DataSet.available_dimensions_values</code>()

Return a dictionary with available values for each dimension in the dataset

In [None]:
dz = DataSet(dataflow_identifier="151_914")
values_dict = dz.available_dimensions_values()

# for dimension, available_values in list(values_dict.items())[:3]:
#     print(dimension, available_values)

pprint.pprint(values_dict)

{'CITTADINANZA': {'values_description': ['italiano-a', 'straniero-a', 'totale'],
                  'values_ids': ['ITL', 'FRG', 'TOTAL']},
 'CLASSE_ETA': {'values_description': ['18-29 years',
                                       '15 years and over',
                                       '15-74 years',
                                       '55 years and over',
                                       '25-34 years',
                                       '20-24 years',
                                       '15-29 years',
                                       '45-54 years',
                                       '35-54 years',
                                       '35 years and over',
                                       '15-64 years',
                                       '55-64 years',
                                       '15-24 anni',
                                       '35-44 anni'],
                'values_ids': ['Y18-29',
                               'Y_GE15',
      

In [None]:
show_doc(DataSet.get_dimension_values)

<h4 id="DataSet.get_dimension_values" class="doc_header"><code>DataSet.get_dimension_values</code><a href="__main__.py#L150" class="source_link" style="float:right">[source]</a></h4>

> <code>DataSet.get_dimension_values</code>(**`dimension`**, **`dataframe`**=*`True`*)

Return the available values of a single dimension in the dataset

In [None]:
dz = DataSet(dataflow_identifier="151_914")
dz.get_dimension_values('CLASSE_ETA').head()

Unnamed: 0,values_ids,values_description
0,Y18-29,18-29 years
1,Y_GE15,15 years and over
2,Y15-74,15-74 years
3,Y_GE55,55 years and over
4,Y25-34,25-34 years


## Full example

In [None]:
ds = DataSet(dataflow_identifier="151_914")
ds.identifiers

{'df_id': '151_914',
 'version': '1.1',
 'description': 'Unemployment  rate',
 'df_structure_id': 'DCCV_TAXDISOCCU1'}

In [None]:
dimensions = ds.get_dimensions()
dimensions

Unnamed: 0,dimension,dimension_ID,description
0,FREQ,CL_FREQ,Frequency
1,CITTADINANZA,CL_CITTADINANZA,Citizenship
2,DURATA_DISOCCUPAZ,CL_DURATA,Duration
3,CLASSE_ETA,CL_ETA1,Age class
4,ITTER107,CL_ITTER107,Territory
5,SESSO,CL_SEXISTAT1,Gender
6,TIPO_DATO,CL_TIPO_DATO_FOL,Data type FOL
7,TITOLO_STUDIO,CL_TITOLO_STUDIO,Level of education


In [None]:
# dimension_value = ds.dimension_values(dimension="ITTER107")
# len(dimension_value)

In [None]:
values_dict = ds.available_dimensions_values()
values_dict["DURATA_DISOCCUPAZ"]

{'values_ids': ['TOTAL', 'M_GE12'],
 'values_description': ['total', '12 mesi e più']}
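
Since each entry holds two parallel lists, it is often handy to turn it into an ID-to-description lookup. The sketch below does this for the output shown above; `to_lookup` is a hypothetical helper, not a method of `DataSet`.

```python
# Entry copied from the output above.
duration_values = {
    "values_ids": ["TOTAL", "M_GE12"],
    "values_description": ["total", "12 mesi e più"],
}


def to_lookup(dimension_values):
    """Pair each value ID with its description."""
    ids = dimension_values["values_ids"]
    descriptions = dimension_values["values_description"]
    if len(ids) != len(descriptions):
        raise ValueError("IDs and descriptions must have the same length")
    return dict(zip(ids, descriptions))


lookup = to_lookup(duration_values)
print(lookup["M_GE12"])
```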

In [None]:
%nbdev_hide
export.notebook2script()

