# Retrieving data

This notebook shows the Python client library equivalents of the raw HTTP calls described in the [retrieving data](https://edelweissdata.com/docs/retrieving-data) walkthrough on the official EdelweissData documentation website.

This walkthrough shows how to retrieve a dataset and its data from EdelweissData. One of EdelweissData's principles is that most interactions should be possible both through a graphical user interface and through a well-designed REST-like API. In this notebook we look at the API requests equivalent to browsing a single dataset in the EdelweissData™ DataExplorer. We'll retrieve information about the "[COVID-19 timeseries data for Germany by state (RKI data)](https://edelweissdata.com/dataset/8dde2785-8a2a-4847-80b8-982a691954d6:106)" dataset. If you follow the link, you can explore it interactively in the DataExplorer to get a feel for the data.

In this walkthrough we'll concentrate on public datasets that can be accessed without authorization. If you want to access private datasets, refer to the [authentication notebook](/docs/authentication) for details.

## Datasets in EdelweissData

You can think of datasets in EdelweissData™ as having 4 components:
* The **data table** itself
* A **dataset description** that explains the data in human-readable form, much like a README.md in a GitHub repository
* **Metadata** formatted as a JSON document. This can be used to store information about the author, equipment used to collect the data or whatever else you think may be useful.
* The **schema** of the data that stores information about the data columns. This includes the data type, an optional description and statistics for numeric values or aggregations for categorical values.

Each of these 4 components gets its own tab in the DataExplorer:
![Data Explorer image](https://edelweissdata.com/images/data-explorer-tabs.png)

In the Python API, datasets are modelled as two classes, `InProgressDataset` and `PublishedDataset`. Both give access to the dataset's metadata, description, schema, name, etc. They are retrieved by calling `API.get_in_progress_dataset()` and `API.get_published_dataset()` respectively. Both calls require a datasetId; `get_published_dataset()` also requires a version identifier (see below). Only a `PublishedDataset` can return the **data table**, so when retrieving data you will always work with a `PublishedDataset`. To retrieve the data, call its `get_data()` method, which returns the data table as a Pandas dataframe.

The **datasetId** is a UUID. The **version** is either the integer that identifies the version you want to retrieve (starting at 1 and incremented every time a new version is published) or the special **latest** identifier, which resolves to the most recent published version of the dataset. Versions are immutable: the system guarantees that they do not change once published. You can therefore reference a concrete version for exact reproducibility of an analysis, or use **latest** if you prefer to always work with the most up-to-date data.
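
The DataExplorer link near the top of this notebook writes the dataset reference as `<datasetId>:<version>`. As a minimal sketch, assuming that colon-separated form, the helper below splits such a reference into a validated UUID and a version that is either a positive integer or the string `latest`. The function `parse_dataset_reference` is hypothetical and not part of `edelweiss_data`; treating a missing version as `latest` is also an assumption of this sketch.

```python
import uuid
from typing import Tuple, Union


def parse_dataset_reference(reference: str) -> Tuple[str, Union[int, str]]:
    """Split '<datasetId>:<version>' into (datasetId, version).

    Hypothetical helper for illustration; not part of edelweiss_data.
    A missing version is treated as 'latest' (an assumption of this sketch).
    """
    dataset_id, _, raw_version = reference.partition(":")
    # uuid.UUID raises ValueError if the id is not a well-formed UUID
    dataset_id = str(uuid.UUID(dataset_id))
    if raw_version in ("", "latest"):
        return dataset_id, "latest"
    version = int(raw_version)  # raises ValueError for non-numeric input
    if version < 1:
        raise ValueError("versions start at 1")
    return dataset_id, version


dataset_id, version = parse_dataset_reference(
    "8dde2785-8a2a-4847-80b8-982a691954d6:106"
)
print(dataset_id, version)
```

Pinning an integer version in this way is what gives you a reproducible analysis; a reference that resolves to `latest` may return different data once a new version is published.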

## API initialization

(See the [setup notebook](setup.ipynb) for details on how to install, initialize and authorize the library)

In [1]:
from edelweiss_data import API, QueryExpression as Q

# Set this to the url of the Edelweiss Data server you want to interact with
edelweiss_api_url = 'https://api.edelweissdata.com'

api = API(edelweiss_api_url)

## Retrieving the dataset

To retrieve a dataset (containing the description, schema and metadata, among other pieces of information), you request it by ID and version. Here we want to inspect the "COVID-19 timeseries data for Germany by state (RKI data)" dataset with id `8dde2785-8a2a-4847-80b8-982a691954d6` at version 106:

In [2]:
# Use a descriptive name instead of `id`, which would shadow the Python built-in
dataset_id = "8dde2785-8a2a-4847-80b8-982a691954d6"
version = 106
covid19_germany_dataset = api.get_published_dataset(dataset_id, version)
covid19_germany_dataset

<PublishedDataset '8dde2785-8a2a-4847-80b8-982a691954d6':106 - COVID-19 timeseries data for Germany by state (RKI data)>

In [4]:
covid19_germany_dataset.name

'COVID-19 timeseries data for Germany by state (RKI data)'

In [5]:
covid19_germany_dataset.metadata

{'category': 'covid-19',
 'columnNames': {'daily-cases': 'Cases',
  'daily-deaths': 'Deaths',
  'date': 'Date',
  'region': 'State',
  'total-cases': 'TotalCases',
  'total-deaths': 'TotalDeaths'},
 'dataBackgroundInformation': 'https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0',
 'datetimeRetrieved': '2020-08-26 09:35:47.668202+00:00',
 'estimatedReportingCutoff': '2020-08-25 22:00:00+00:00',
 'keywords': ['covid-19', 'cases', 'deaths', 'Germany'],
 'license': 'https://creativecommons.org/licenses/by-nc-sa/4.0/',
 'originalDataCollectionAgency': 'https://www.rki.de',
 'regions': ['Baden-Württemberg',
  'Bayern',
  'Berlin',
  'Brandenburg',
  'Bremen',
  'Hamburg',
  'Hessen',
  'Mecklenburg-Vorpommern',
  'Niedersachsen',
  'Nordrhein-Westfalen',
  'Rheinland-Pfalz',
  'Saarland',
  'Sachsen',
  'Sachsen-Anhalt',
  'Schleswig-Holstein',
  'Thüringen'],
 'upstreamSource': 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.cs
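
Because the metadata is a free-form JSON document, it can be explored with ordinary dictionary operations. The sketch below works on a trimmed copy of the output above rather than on a live dataset, and it assumes that `metadata` behaves like the plain `dict` it is displayed as. It inverts the `columnNames` mapping so that a table column name can be looked up, and reads an optional key defensively.

```python
# A trimmed copy of the metadata shown above, treated as a plain dict
# (an assumption of this sketch about the shape of `dataset.metadata`).
metadata = {
    "category": "covid-19",
    "columnNames": {
        "daily-cases": "Cases",
        "daily-deaths": "Deaths",
        "date": "Date",
        "region": "State",
        "total-cases": "TotalCases",
        "total-deaths": "TotalDeaths",
    },
    "keywords": ["covid-19", "cases", "deaths", "Germany"],
    "license": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
}

# Invert the mapping: table column name -> key it is listed under in columnNames.
key_for_column = {column: key for key, column in metadata["columnNames"].items()}

# Use .get() for keys that may be absent, since metadata has no fixed structure.
license_url = metadata.get("license", "unknown")

print(key_for_column["TotalCases"])
print(license_url)
```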

In [7]:
covid19_germany_dataset.schema.columns

[<Column State:xsd:string>,
 <Column Date:xsd:string>,
 <Column Cases:xsd:integer>,
 <Column Deaths:xsd:integer>,
 <Column TotalCases:xsd:integer>,
 <Column TotalDeaths:xsd:integer>]

## Retrieving the data

The tabular data component of the dataset offers a few more options, because retrieving it can transfer a lot of data.

The basic mechanism to retrieve data is the `PublishedDataset.get_data()` method. Its parameters let you retrieve only a limited number of rows, filter rows with a powerful query language (more on this in a later notebook) or ask for only a subset of columns to be returned. The data is returned as a Pandas dataframe. With raw HTTP requests you would have to page through bigger datasets using multiple requests, but the `edelweiss_data` Python client library takes care of these details for you.

In [8]:
covid19_germany_data = covid19_germany_dataset.get_data()
covid19_germany_data

|  | State | Date | Cases | Deaths | TotalCases | TotalDeaths |
|---|---|---|---|---|---|---|
| 1 | Baden-Württemberg | 2020/02/24 00:00:00 | 1 | 0 | 1 | 0 |
| 2 | Baden-Württemberg | 2020/02/25 00:00:00 | 2 | 0 | 3 | 0 |
| 3 | Baden-Württemberg | 2020/02/26 00:00:00 | 4 | 0 | 7 | 0 |
| 4 | Baden-Württemberg | 2020/02/27 00:00:00 | 3 | 0 | 10 | 0 |
| 5 | Baden-Württemberg | 2020/02/28 00:00:00 | 7 | 0 | 17 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2708 | Thüringen | 2020/08/21 00:00:00 | 9 | 0 | 3547 | 186 |
| 2709 | Thüringen | 2020/08/22 00:00:00 | 5 | 0 | 3552 | 186 |
| 2710 | Thüringen | 2020/08/23 00:00:00 | 4 | 0 | 3556 | 186 |
| 2711 | Thüringen | 2020/08/24 00:00:00 | 10 | 0 | 3566 | 186 |


The data retrieval endpoint accepts a DataQuery that allows you to order the data differently, filter the rows or return only selected columns (for details about the Query Language refer to the [Query Language walkthrough](/docs/query-language)).

Here is an example that filters to only the rows where the State column is "Berlin", orders by the number of new cases descending and returns only the first 10 rows:

In [11]:
# Keep only the rows whose State column is "Berlin", sorted by new cases (highest first)
state_is_berlin = Q.exact_search(Q.column("State"), "Berlin")
cases_column = Q.column("Cases")
worst_10_days_berlin = covid19_germany_dataset.get_data(
    condition=state_is_berlin, order_by=[cases_column], ascending=[False], limit=10
)
worst_10_days_berlin

|  | State | Date | Cases | Deaths | TotalCases | TotalDeaths |
|---|---|---|---|---|---|---|
| 398 | Berlin | 2020/03/26 00:00:00 | 286 | 4 | 2032 | 23 |
| 397 | Berlin | 2020/03/25 00:00:00 | 245 | 2 | 1746 | 19 |
| 406 | Berlin | 2020/04/03 00:00:00 | 239 | 17 | 3564 | 74 |
| 405 | Berlin | 2020/04/02 00:00:00 | 237 | 7 | 3325 | 57 |
| 403 | Berlin | 2020/03/31 00:00:00 | 213 | 3 | 2901 | 42 |
| 399 | Berlin | 2020/03/27 00:00:00 | 205 | 5 | 2237 | 28 |
| 410 | Berlin | 2020/04/07 00:00:00 | 202 | 15 | 4096 | 105 |
| 396 | Berlin | 2020/03/24 00:00:00 | 199 | 5 | 1501 | 17 |
| 400 | Berlin | 2020/03/28 00:00:00 | 197 | 3 | 2434 | 31 |
| 404 | Berlin | 2020/04/01 00:00:00 | 187 | 8 | 3088 | 50 |
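
To make the three steps of this query concrete (filter, order, limit), the sketch below applies the same logic in plain Python to a handful of rows copied from the outputs above. It only illustrates the semantics. The real query is evaluated by the server, and the matching rules of `Q.exact_search` may differ in detail from the simple equality test assumed here. The function `top_days` is a hypothetical name used for this example.

```python
# Rows copied from the outputs above: (State, Date, Cases)
rows = [
    ("Baden-Württemberg", "2020/02/24 00:00:00", 1),
    ("Baden-Württemberg", "2020/02/25 00:00:00", 2),
    ("Berlin", "2020/03/24 00:00:00", 199),
    ("Berlin", "2020/03/25 00:00:00", 245),
    ("Berlin", "2020/03/26 00:00:00", 286),
    ("Berlin", "2020/04/03 00:00:00", 239),
    ("Thüringen", "2020/08/24 00:00:00", 10),
]


def top_days(rows, state, limit):
    """Filter to one state, order by Cases descending, keep the first `limit` rows."""
    matching = [row for row in rows if row[0] == state]   # condition
    matching.sort(key=lambda row: row[2], reverse=True)   # order_by with ascending=[False]
    return matching[:limit]                               # limit


for state, date, cases in top_days(rows, "Berlin", limit=3):
    print(state, date, cases)
```

Doing the filtering on the server, as `get_data()` does, avoids downloading all rows only to discard most of them on the client.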
