In [None]:
#hide
%load_ext autoreload
%autoreload 2



In [None]:
from nbdev import *
# default_exp discovery

In [None]:
#export
from istatapi.base import ISTAT
from istatapi.utils import make_tree, strip_ns
import pandas as pd

# Discovery

> Functions used to discover and explore the data exposed by ISTAT webservice

This module implements functions to discover the data exposed by ISTAT. To do so, `istatapi` makes metadata requests to the API endpoints. The `Discovery` module provides methods to parse and analyze the API responses. It relies on `pandas` and returns data as `DataFrame` objects, which is convenient for interactive, exploratory analysis in Jupyter notebooks.

The URL structure to formulate a metadata REST request is:

```http://sdmx.istat.it/SDMXWS/rest/resource/agencyID/resourceID/version/itemID?queryStringParameters```

Here is a description of the request parameters:

| Parameter                          | Description | Default |
|------------------------------------|-------------|---------|
| resource (mandatory)               | The type of metadata to be returned. Values: datastructure, metadatastructure, categoryscheme, conceptscheme, codelist, hierarchicalcodelist, organisationscheme, agencyscheme, dataproviderscheme, dataconsumerscheme, organisationunitscheme, dataflow, metadataflow, reportingtaxonomy, provisionagreement, structureset, process, categorisation, contentconstraint, attachmentconstraint, actualconstraint, allowedconstraint, structure |         |
| agencyID                           | Agency maintaining the artefact (e.g.: SDMX) | all     |
| resourceID                         | Artefact ID (e.g.: CL_FREQ) | all     |
| version                            | Artefact version (e.g.: 1.0) | latest  |
| itemID                             | ID of the item (for item schemes) or hierarchy (for hierarchical codelists) to be returned. | all     |
| queryStringParameters (detail)     | Desired amount of information. Values: allstubs, referencestubs, allcompletestubs, referencecompletestubs, referencepartial, full. | full    |
| queryStringParameters (references) | References to be returned with the artefact. Values: none, parents, parentsandsiblings, children, descendants, all, any type of resource. | none    |
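
As a rough illustration of how the path and query-string parameters combine, the sketch below assembles a metadata URL using only the standard library. The helper name `build_metadata_url` is hypothetical and not part of `istatapi`. The defaults (`all`, `latest`) are written out explicitly in the path, and whether the ISTAT endpoint accepts every combination is not verified here.

```python
from urllib.parse import urlencode

# Base taken from the URL template above
BASE_URL = "http://sdmx.istat.it/SDMXWS/rest"

def build_metadata_url(resource, agency_id="all", resource_id="all",
                       version="latest", item_id=None,
                       detail=None, references=None):
    """Assemble a metadata request URL (hypothetical helper, not part of istatapi)."""
    # Path parameters follow the order of the template
    path_parts = [resource, agency_id, resource_id, version]
    if item_id is not None:
        path_parts.append(item_id)
    url = "/".join([BASE_URL, *path_parts])

    # Query-string parameters are appended only when they are set
    query = {key: value
             for key, value in (("detail", detail), ("references", references))
             if value is not None}
    if query:
        url += "?" + urlencode(query)
    return url

codelist_url = build_metadata_url("codelist", "IT1", "CL_FREQ")
stub_url = build_metadata_url("dataflow", "IT1", detail="allstubs")
print(codelist_url)
print(stub_url)
```

The agency `IT1` is the one used in the `dataflow/IT1` request path later in this module.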

Two main classes are implemented in the `Discovery` module: `DataFlows` and `DataStructure`. Both collect methods to explore ISTAT data; their usage is documented below.

## Explore available datasets

To understand what data can be retrieved from the ISTAT database, the _Data Discovery_ service lets us look at the list of datasets that ISTAT provides. To do so, we set the ```resource``` path parameter to ```dataflow```. The response of the API is then parsed to build the full list of dataflows available for exploration. These steps are handled by the `DataFlows` class.
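
The parsing step can be sketched on a small, hand-written document using only the standard library. The XML below is a simplified stand-in for a namespace-stripped dataflow response, not an actual ISTAT payload, and `parse_dataflows_sketch` is a hypothetical name. The names are looked up by their `xml:lang` attribute, so a missing translation yields `None`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written, simplified stand-in for a namespace-stripped dataflow response
SAMPLE_RESPONSE = """
<Structure>
  <Dataflow id="101_1015" version="1.2">
    <Name xml:lang="it">Coltivazioni</Name>
    <Name xml:lang="en">Crops</Name>
    <Structure><Ref id="DCSP_COLTIVAZIONI"/></Structure>
  </Dataflow>
  <Dataflow id="101_1033" version="1.0">
    <Name xml:lang="it">Macellazioni</Name>
    <Structure><Ref id="DCSP_MACELLAZIONI"/></Structure>
  </Dataflow>
</Structure>
"""

def parse_dataflows_sketch(xml_text):
    """Return one dictionary per Dataflow element found in `xml_text`."""
    root = ET.fromstring(xml_text)
    rows = []
    for dataflow in root.iter("Dataflow"):
        # Map language code -> name for this dataflow only
        names = {name.get(XML_LANG): name.text for name in dataflow.findall("Name")}
        ref = dataflow.find(".//Ref")
        rows.append({
            "df_id": dataflow.get("id"),
            "version": dataflow.get("version"),
            "description_en": names.get("en"),
            "description_it": names.get("it"),
            "df_structure_id": ref.get("id") if ref is not None else None,
        })
    return rows

rows = parse_dataflows_sketch(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```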

In [None]:
#export
class DataFlows(ISTAT):
    """Base class to explore available Dataflows"""
    def __init__(self):
        super().__init__()
        self.resource = "dataflow"
        self.available = self.all_available()

    def parse_dataflows(self, response):
        """parse the `response` containing all the available datasets and return a list of dataflows."""
        tree = make_tree(response)
        strip_ns(tree)
        root = tree.root
        
        dataflows_l = []
        for dataflow in root.iter('Dataflow'):
            df_id = dataflow.get('id')
            version = dataflow.get('version')
            structure_id = [ref.get('id') for ref in dataflow.iter('Ref')][0]

            # reset the descriptions so a missing translation is not inherited from the previous dataflow
            description_en = description_it = None
            for name in dataflow.findall('Name'):
                lang = name.get("{http://www.w3.org/XML/1998/namespace}lang")
                if lang == 'en':
                    description_en = name.text
                elif lang == 'it':
                    description_it = name.text

            dataflow_dict = {
                "df_id": df_id,
                "version": version,
                "description_en": description_en,
                "description_it": description_it,
                "df_structure_id": structure_id,
            }

            dataflows_l.append(dataflow_dict)

        return dataflows_l

    def all_available(self, dataframe = True):
        """Return all available dataflows"""
        path = 'dataflow/IT1'
        response = self._request(path = path)
        dataflows = self.parse_dataflows(response)

        if dataframe: dataflows = pd.DataFrame(dataflows)

        return dataflows

    def search(self, keyword, lang = 'en'):
        """Search available dataflows that contain `keyword`. Return these dataflows in a DataFrame"""
        if lang not in ['en', 'it']:
            raise ValueError("'lang' must be 'en' (English) or 'it' (Italian)")

        # `na=False` treats missing descriptions as non-matching instead of raising
        column = 'description_' + lang
        mask = self.available[column].str.contains(keyword, case=False, na=False)

        return self.available[mask]

The simplest way to get the full list of dataflows provided by ISTAT is to read the `available` attribute of a `DataFlows` instance. It is populated by `DataFlows.all_available()` when the instance is created and holds a DataFrame of all the explorable dataflows, together with their IDs and descriptions.

In [None]:
datasets = DataFlows()
datasets.available.head()

|   | df_id    | version | description_en | description_it | df_structure_id |
|---|----------|---------|----------------|----------------|-----------------|
| 0 | 101_1015 | 1.2     | Crops | Coltivazioni | DCSP_COLTIVAZIONI |
| 1 | 101_1030 | 1.0     | PDO, PGI and TSG quality products | Prodotti agroalimentari di qualità DOP/IGP e STG | DCSP_DOPIGP |
| 2 | 101_1033 | 1.0     | slaughtering | Macellazioni | DCSP_MACELLAZIONI |
| 3 | 101_1039 | 1.0     | Agritourism - municipalities | Agriturismo - comuni | DCSP_AGRITURISMO_COM |
| 4 | 101_1077 | 1.0     | PDO, PGI and TSG products: operators - munici... | Prodotti di qualità DOP/IGP e STG: operatori -... | DCSP_DOPIGP_COM |


In [None]:
show_doc(DataFlows.search)

<h4 id="DataFlows.search" class="doc_header"><code>DataFlows.search</code><a href="__main__.py#L51" class="source_link" style="float:right">[source]</a></h4>

> <code>DataFlows.search</code>(**`keyword`**, **`lang`**=*`'en'`*)

Search available dataflows that contain `keyword`. Return these dataflows in a DataFrame

This method looks for `keyword` inside the descriptions of all datasets, ignoring case. By default, `keyword` is matched against the English descriptions (```lang = 'en'```); Italian descriptions are searched by setting `lang` to `'it'`. Any other value of `lang` raises a `ValueError`.
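
`DataFlows.all_available(dataframe=False)` returns the dataflows as a list of dictionaries rather than a DataFrame. As a rough illustration of what the search does, the same keyword filter can be written over such a list in plain Python. This is a sketch, not the library's implementation: `search_dataflows` is a hypothetical name, the two sample rows are copied from the outputs in this notebook, and the match is a literal substring test, whereas pandas' `str.contains` interprets `keyword` as a regular expression by default.

```python
def search_dataflows(dataflows, keyword, lang="en"):
    """Case-insensitive substring search over a list of dataflow dictionaries."""
    if lang not in ("en", "it"):
        raise ValueError("'lang' must be 'en' (English) or 'it' (Italian)")
    column = f"description_{lang}"
    needle = keyword.lower()
    # `or ""` guards against entries whose description is missing (None)
    return [df for df in dataflows if needle in (df.get(column) or "").lower()]

# Two rows copied from the outputs shown in this notebook
dataflows = [
    {"df_id": "101_1015", "description_en": "Crops", "description_it": "Coltivazioni"},
    {"df_id": "151_929", "description_en": "Unemployment", "description_it": "Disoccupati"},
]

english_hits = search_dataflows(dataflows, "unemploy")
italian_hits = search_dataflows(dataflows, "COLT", lang="it")
print([df["df_id"] for df in english_hits])
print([df["df_id"] for df in italian_hits])
```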

In [None]:
datasets.search('Unemployment', lang = 'en')

|     | df_id   | version | description_en | description_it | df_structure_id |
|-----|---------|---------|----------------|----------------|-----------------|
| 79  | 151_874 | 1.0     | Unemployment rate - monthly data | Tasso di disoccupazione - dati mensili | DCCV_TAXDISOCCUMENS1 |
| 81  | 151_884 | 1.0     | Unemployment rate - quarterly seasonally adjus... | Tasso di disoccupazione - dati trimestrali des... | DCCV_TAXDISOCCUDE1 |
| 83  | 151_914 | 1.1     | Unemployment rate | Tasso di disoccupazione | DCCV_TAXDISOCCU1 |
| 84  | 151_929 | 1.1     | Unemployment | Disoccupati | DCCV_DISOCCUPT1 |
| 159 | 172_926 | 1.1     | Supplementary indicators to unemployment | Indicatori complementari | DCCV_COMPL1 |


In [None]:
datasets = DataFlows()
datasets.search('Disoccupazione', lang = 'it')

|    | df_id   | version | description_en | description_it | df_structure_id |
|----|---------|---------|----------------|----------------|-----------------|
| 79 | 151_874 | 1.0     | Unemployment rate - monthly data | Tasso di disoccupazione - dati mensili | DCCV_TAXDISOCCUMENS1 |
| 81 | 151_884 | 1.0     | Unemployment rate - quarterly seasonally adjus... | Tasso di disoccupazione - dati trimestrali des... | DCCV_TAXDISOCCUDE1 |
| 83 | 151_914 | 1.1     | Unemployment rate | Tasso di disoccupazione | DCCV_TAXDISOCCU1 |


## Data Structures and Information about available Datasets

The API provides a lot of information on its datasets. The `DataStructure` class implements methods to retrieve it. Its main functionalities are:

1. Find the ID of the ```dataflow``` (dataset) containing the data to be retrieved.
2. Display the ```datastructure``` of the ```dataflow``` we want to query, i.e. its dimensions. A dimension is another name for a column or variable.
3. Explain the names of the dimensions.
4. Display the possible values of the dimensions.

In [None]:
#export
class DataStructure(ISTAT):
        """Class that implements methods to retrieve information about a dataset"""
        def __init__(self):
            super().__init__()
            self.resource = "datastructure"
            self.available = DataFlows().available #df with all the available dataflows
            #TODO: Initialise the class with a specific dataset and retrieve information only on it (maybe use a dataset loader)
            
        def get_df_structure_id(self, lang = 'en', **kwargs):
            """Return the `df_structure_id` of a dataflow from its `df_id` or `df_description`"""
            valid_args = ['df_description', 'df_id']
        
            arg = [*kwargs][0]
            arg_value = [x for x in kwargs.values()][0]
            
            #arguments errors
            if arg not in valid_args: raise ValueError(f"{arg} is not a valid argument. Use one of: {', '.join(valid_args)}")
            elif lang not in ['en', 'it']: raise ValueError("'lang' must be 'en' (English) or 'it' (Italian)")
            
            if arg == "df_description":
                if lang == 'en' :  mask = self.available["description_en"] == arg_value
                else: mask = self.available["description_it"] == arg_value
            
            else:
                mask = self.available[arg] == arg_value
                
            df_structure_id = self.available[mask].df_structure_id.values[0]
            return df_structure_id
        
        def get_df_id(self, lang = 'en', **kwargs):
            """Return the `df_id` of a dataflow from its `df_description` or `df_structure_id`"""
            valid_args = ['df_description', 'df_structure_id']
            
            arg = [*kwargs][0]
            arg_value = [x for x in kwargs.values()][0]  
            
            #arguments errors
            if arg not in valid_args: raise ValueError(f"{arg} is not a valid argument. Use one of: {', '.join(valid_args)}")
            elif lang not in ['en', 'it']: raise ValueError("'lang' must be 'en' (English) or 'it' (Italian)")
            
            if arg == "df_description":
                if lang == 'en' :  mask = self.available["description_en"] == arg_value
                else: mask = self.available["description_it"] == arg_value
            
            else:
                mask = self.available[arg] == arg_value
                
            df_id = self.available[mask].df_id.values[0]
            return df_id
        
        def get_df_description(self, lang = 'en', **kwargs):
            """Return the `df_description` of a dataflow from its `df_id` or `df_structure_id`"""
            valid_args = ['df_id', 'df_structure_id']
            
            arg = [*kwargs][0]
            arg_value = [x for x in kwargs.values()][0]  
            
            #arguments errors
            if arg not in valid_args: raise ValueError(f"{arg} is not a valid argument. Use one of: {', '.join(valid_args)}")
            elif lang not in ['en', 'it']: raise ValueError("'lang' must be 'en' (English) or 'it' (Italian)")
            
            mask = self.available[arg] == arg_value
                
            if lang == 'en' : df_description = self.available[mask].description_en.values[0]
            else: df_description = self.available[mask].description_it.values[0]
            return df_description      
        
        def parse_dimensions(self, response):
            """Parse the `response` containing a dataflow's dimensions and return them in a list"""
            tree = make_tree(response)
            strip_ns(tree)
            root = tree.root
            
            dimensions_l = []
            for dimension in root.iter('Dimension'):
                dimension_name = dimension.attrib['id']
                
                dimension_id = [enumeration.find('Ref').get('id') for enumeration in dimension.iter('Enumeration')][0]
                
                dimension_dict = {'dimension' : dimension_name, 
                                  'dimension_ID' : dimension_id}
                
                dimensions_l.append(dimension_dict)
            
            return(dimensions_l)
            
        def dimensions_description(self, dimensions):
            """Return a dataframe with the descriptions of `dimensions`"""
            resource = "codelist"
            dimensions_l = dimensions.dimension_ID.tolist()
            descriptions_l = []
            
            for dimension_id in dimensions_l:
                path_parts = [resource, self.agencyID, dimension_id]
                path = "/".join(path_parts)
                response = self._request(path = path)
                tree = make_tree(response)
                strip_ns(tree)
                root = tree.root
                
                description = [x for x in root.iter('Codelist')][0]
                description_it = description.findall('Name')[0].text
                description_en = description.findall('Name')[1].text

                description_dict = {'dimension_ID' : dimension_id, 
                                   'description_en' : description_en,
                                   'description_it' : description_it}
                descriptions_l.append(description_dict)
            
            dimensions_descriptions = pd.DataFrame(descriptions_l)
            
            return dimensions_descriptions      
        
        def get_dimensions(self, dataframe = True, **kwargs):
            """Return a Dataframe containing the dimensions (and their descriptions) of a specific dataflow."""
            valid_args = ['df_id', 'df_structure_id', 'df_description']
            
            arg = [*kwargs][0]
            arg_value = [x for x in kwargs.values()][0]
            
            #arguments errors
            if arg not in valid_args: raise ValueError(f"{arg} is not a valid argument. Use one of: {', '.join(valid_args)}")
                
            if arg != 'df_structure_id': df_structure_id = self.get_df_structure_id(**{arg : arg_value})
            else: df_structure_id = arg_value
            
            path_parts = [self.resource, self.agencyID, df_structure_id]
            path = "/".join(path_parts)
            response = self._request(path = path)  
            dimensions = self.parse_dimensions(response)
        
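            # the descriptions are merged on a DataFrame, so build it regardless of `dataframe`
            dimensions = pd.DataFrame(dimensions)
            dimensions_description = self.dimensions_description(dimensions)
            dimensions = dimensions.merge(dimensions_description, on='dimension_ID')

            # return a list of dictionaries when a DataFrame is not requested
            if not dataframe: dimensions = dimensions.to_dict('records')

            return dimensions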
        
        def get_dimension_values(self, dimension_ID, dataframe = True):
            """Return the possible values of a dimension"""
            resource = "codelist"
            path_parts = [resource, self.agencyID, dimension_ID]
            path = "/".join(path_parts)
            response = self._request(path = path)
            tree = make_tree(response)
            strip_ns(tree)
            root = tree.root
            
            values = []
            for value in root.iter('Code'):
                value_id = value.get('id')
                value_it = value.findall('Name')[0].text
                value_en = value.findall('Name')[1].text
                
                value_dict = {'value_ID' : value_id, 
                           'description_en' : value_en,
                           'description_it' : value_it}
        
                values.append(value_dict)
            
            if dataframe == True : values = pd.DataFrame(values)
            return values          
                      

In [None]:
show_doc(DataStructure.get_df_structure_id)

<h4 id="DataStructure.get_df_structure_id" class="doc_header"><code>DataStructure.get_df_structure_id</code><a href="__main__.py#L10" class="source_link" style="float:right">[source]</a></h4>

> <code>DataStructure.get_df_structure_id</code>(**`lang`**=*`'en'`*, **\*\*`kwargs`**)

Return the `df_structure_id` of a dataflow from its `df_id` or `df_description`

In [None]:
structure = DataStructure()
#structure.get_df_structure_id(df_description = "Unemployment  rate") == structure.get_df_structure_id(df_id = "151_914")
structure.get_df_structure_id(df_description = "Unemployment  rate")

'DCCV_TAXDISOCCU1'

In [None]:
show_doc(DataStructure.get_df_id)

<h4 id="DataStructure.get_df_id" class="doc_header"><code>DataStructure.get_df_id</code><a href="__main__.py#L31" class="source_link" style="float:right">[source]</a></h4>

> <code>DataStructure.get_df_id</code>(**`lang`**=*`'en'`*, **\*\*`kwargs`**)

Return the `df_id` of a dataflow from its `df_description` or `df_structure_id`

In [None]:
structure.get_df_id(df_description = "Unemployment  rate")

'151_914'

In [None]:
show_doc(DataStructure.get_df_description)

<h4 id="DataStructure.get_df_description" class="doc_header"><code>DataStructure.get_df_description</code><a href="__main__.py#L52" class="source_link" style="float:right">[source]</a></h4>

> <code>DataStructure.get_df_description</code>(**`lang`**=*`'en'`*, **\*\*`kwargs`**)

Return the `df_description` of a dataflow from its `df_id` or `df_structure_id`

In [None]:
structure.get_df_description(df_id = '151_914')

'Unemployment  rate'

In [None]:
show_doc(DataStructure.get_dimensions)

<h4 id="DataStructure.get_dimensions" class="doc_header"><code>DataStructure.get_dimensions</code><a href="__main__.py#L116" class="source_link" style="float:right">[source]</a></h4>

> <code>DataStructure.get_dimensions</code>(**`dataframe`**=*`True`*, **\*\*`kwargs`**)

Return a Dataframe containing the dimensions (and their descriptions) of a specific dataflow.

The method takes `df_id`, `df_structure_id` or `df_description` as input. The value passed identifies the dataset whose dimensions we want to retrieve.
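
The lookup methods of `DataStructure` all receive the identifier as a single keyword argument and check its name against a list of accepted keys. A minimal standalone sketch of that pattern follows. `select_identifier` is a hypothetical helper, and unlike the methods in this module it also rejects calls with zero or several keywords.

```python
def select_identifier(valid_args, **kwargs):
    """Return the single (name, value) pair passed as a keyword argument."""
    if len(kwargs) != 1:
        raise ValueError(f"Pass exactly one of: {', '.join(valid_args)}")
    # Unpack the only item of the keyword dictionary
    (arg, arg_value), = kwargs.items()
    if arg not in valid_args:
        raise ValueError(f"{arg} is not a valid argument. Use one of: {', '.join(valid_args)}")
    return arg, arg_value

VALID_ARGS = ["df_id", "df_structure_id", "df_description"]
selected = select_identifier(VALID_ARGS, df_structure_id="DCCV_TAXDISOCCU1")
print(selected)
```

Rejecting an empty call up front gives a clearer message than the `IndexError` that indexing an empty keyword list would produce.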

In [None]:
structure = DataStructure()
structure.get_dimensions(df_structure_id = "DCCV_TAXDISOCCU1")



|   | dimension         | dimension_ID     | description_en     | description_it   |
|---|-------------------|------------------|--------------------|------------------|
| 0 | FREQ              | CL_FREQ          | Frequency          | Frequenza        |
| 1 | CITTADINANZA      | CL_CITTADINANZA  | Citizenship        | Cittadinanza     |
| 2 | DURATA_DISOCCUPAZ | CL_DURATA        | Duration           | Durata           |
| 3 | CLASSE_ETA        | CL_ETA1          | Age class          | Classe di età    |
| 4 | ITTER107          | CL_ITTER107      | Territory          | Territorio       |
| 5 | SESSO             | CL_SEXISTAT1     | Gender             | Sesso            |
| 6 | TIPO_DATO         | CL_TIPO_DATO_FOL | Data type FOL      | Tipo dato FOL    |
| 7 | TITOLO_STUDIO     | CL_TITOLO_STUDIO | Level of education | Titolo di studio |


In [None]:
show_doc(DataStructure.get_dimension_values)

<h4 id="DataStructure.get_dimension_values" class="doc_header"><code>DataStructure.get_dimension_values</code><a href="__main__.py#L140" class="source_link" style="float:right">[source]</a></h4>

> <code>DataStructure.get_dimension_values</code>(**`dimension_ID`**, **`dataframe`**=*`True`*)

Return the possible values of a dimension

In [None]:
structure = DataStructure()
structure.get_dimension_values(dimension_ID = "CL_SEXISTAT1")

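|   | value_ID | description_en | description_it |
|---|----------|----------------|----------------|
| 0 | T        | total          | totale         |
| 1 | F        | females        | femmine        |
| 2 | M        | males          | maschi         |
| 3 | 1        | males          | maschi         |
| 4 | NRP      | no responce    | non indicato   |
| 5 | 3        | n.a.           | non indicato   |
| 6 | 9        | total          | totale         |
| 7 | 2        | females        | femmine        |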
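
`get_dimension_values` reads the two `Name` elements of each code by position, taking the first as Italian and the second as English. If the order of the translations cannot be relied on, the names can be keyed by their `xml:lang` attribute instead. The sketch below shows this on a hand-written, simplified codelist (not an actual ISTAT response) in which the second code lists English first. `parse_codes_sketch` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written, simplified stand-in for a namespace-stripped codelist response
SAMPLE_CODELIST = """
<Structure>
  <Codelist id="CL_SEXISTAT1">
    <Name xml:lang="it">Sesso</Name>
    <Name xml:lang="en">Gender</Name>
    <Code id="T">
      <Name xml:lang="it">totale</Name>
      <Name xml:lang="en">total</Name>
    </Code>
    <Code id="F">
      <Name xml:lang="en">females</Name>
      <Name xml:lang="it">femmine</Name>
    </Code>
  </Codelist>
</Structure>
"""

def parse_codes_sketch(xml_text):
    """Return one dictionary per Code element, with names selected by language."""
    root = ET.fromstring(xml_text)
    values = []
    for code in root.iter("Code"):
        # findall("Name") only sees the direct children of this code
        names = {name.get(XML_LANG): name.text for name in code.findall("Name")}
        values.append({
            "value_ID": code.get("id"),
            "description_en": names.get("en"),
            "description_it": names.get("it"),
        })
    return values

values = parse_codes_sketch(SAMPLE_CODELIST)
for value in values:
    print(value)
```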
