# Searching for data: the databrowser method

All files available in the project are scanned and indexed by a data search server, so queries return almost immediately. To search for data, you can use the `databrowser` method of the `freva` Python module. Let's import the `freva` module first:

In [1]:
import freva

Now inspect the help menu:

In [2]:
help(freva.databrowser)

Help on function databrowser in module freva._databrowser:

databrowser(*, attributes: 'bool' = False, all_facets: 'bool' = False, facet: 'Optional[Union[str, list[str]]]' = None, multiversion: 'bool' = False, relevant_only: 'bool' = False, batch_size: 'int' = 5000, count: 'bool' = False, time: 'str' = '', time_select: 'str' = 'flexible', **search_facets: 'Union[str, Path, int, list[str]]') -> 'Union[dict[Any, dict[Any, Any]], Iterator[str], int]'
    Find data in the system.
    
    You can either search for files or data facets (variable, model, ...)
    that are available. The query is of the form key=value. <value> might
    use *, ? as wildcards or any regular expression.
    
    Parameters
    ----------
    **search_facets: Union[str, Path, in, list[str]]
        The facets to be applied in the data search. If not given
        the whole dataset will be queried.
    time: str
        Special search facet to refine/subset search results by time.
        This can be a string rep

The databrowser expects a list of `key=value` pairs. The order of the pairs doesn't matter. Most importantly, you don't need to split the search by the type of data you are looking for: you can search for observations, reanalysis, and model data all at the same time. All searches are case insensitive. You can also search for attributes themselves instead of file paths. For example, you can search for the list of available variables that satisfy a certain constraint (e.g. sampled 6hr, from a certain model, etc.).

In [3]:
files = freva.databrowser(project="observations", variable="pr", model="cp*")
files

<generator object SolrFindFiles._search at 0x7fc9e8185380>

This returns a so-called iterator. The advantage of an iterator is that nothing is preloaded: results are only loaded into memory when they are needed. To access the files, you can either loop through the iterator or convert it to a list:

In [4]:
list(files)

['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022100-201609022130.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022000-201609022030.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609021900-201609021930.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_c
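If you only want to peek at a few results, you don't have to convert the whole iterator to a list. The following is a minimal sketch using `itertools.islice` from the standard library. The helper `first_n` and the generator `fake_search` are hypothetical names used for illustration only; `fake_search` merely stands in for the iterator returned by `freva.databrowser` so that the snippet runs without a data server.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def first_n(paths: Iterable[str], n: int) -> List[str]:
    """Return at most ``n`` items from ``paths`` without reading the rest."""
    return list(islice(paths, n))


def fake_search() -> Iterator[str]:
    """Hypothetical stand-in for the iterator returned by freva.databrowser."""
    for hour in range(24):
        yield f"/data/pr_{hour:02d}00.nc"


files = fake_search()
preview = first_n(files, 3)
print(preview)

# With freva available, the helper could be used in the same way, e.g.:
# preview = first_n(freva.databrowser(project="observations", variable="pr"), 3)
```

Keep in mind that a Python iterator can only be consumed once: after taking the first three items, `files` continues with the fourth.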

In some cases it might be useful to know how many files the `databrowser` finds for certain search constraints. In such cases you can use the `count` flag to get the number of files found instead of the files themselves.

In [5]:
freva.databrowser(project="observations", variable="pr", model="cp*", count=True)

24

Sometimes it might be useful to subset the data you're interested in by time. To do so, you can use the `time` search key to select single time steps or whole time ranges. For example, let's get the files for a certain time range:

In [6]:
list(freva.databrowser(project="observations", time="2016-09-02T22:15 to 2016-10"))

['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc',
 '/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022200-201609022230.nc']

The default method for selecting time periods is `flexible`, which means that all files are selected that cover at least part of the search time period. With the `strict` method, only files that lie entirely within the search time period are selected. Using the `strict` method in the example above would only yield one file, because the file starting at 22:00 contains time steps prior to the start of the time period:

In [7]:
list(freva.databrowser(project="observations", time="2016-09-02T22:15 to 2016-10", time_select="strict"))

['/home/runner/work/freva/freva/.docker/data/observations/grid/CPC/CPC/cmorph/30min/atmos/30min/r1i1p1/v20210618/pr/pr_30min_CPC_cmorph_r1i1p1_201609022300-201609022330.nc']
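The difference between the two selection methods can be sketched with plain interval comparisons. This is only one reading that is consistent with the two outputs above, not the search server's actual implementation. The function names `overlaps` and `within` are hypothetical, and the end of the search period is assumed to be the end of October 2016.

```python
from datetime import datetime


def overlaps(file_start, file_end, start, end):
    """'flexible' reading: the file shares at least one instant with the period."""
    return file_start <= end and file_end >= start


def within(file_start, file_end, start, end):
    """'strict' reading: the whole file lies inside the period."""
    return start <= file_start and file_end <= end


# Search period from the example: "2016-09-02T22:15 to 2016-10"
start = datetime(2016, 9, 2, 22, 15)
end = datetime(2016, 10, 31, 23, 59)  # assumption: end of October

# Time ranges taken from the two file names in the output above
file_2200 = (datetime(2016, 9, 2, 22, 0), datetime(2016, 9, 2, 22, 30))
file_2300 = (datetime(2016, 9, 2, 23, 0), datetime(2016, 9, 2, 23, 30))

flexible = [overlaps(*f, start, end) for f in (file_2200, file_2300)]
strict = [within(*f, start, end) for f in (file_2200, file_2300)]
print(flexible)  # both files overlap the period
print(strict)    # only the 23:00 file lies fully inside it
```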

The time format has to follow the ISO-8601 standard. Time ranges are indicated by the `to` keyword, such as `2000 to 2100` or `2000-01 to 2100-12`. Single time steps are given without the `to` keyword.
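When the start and end of a period are already available as `datetime` objects, the search string can be assembled programmatically. The helper below is a small sketch; `time_range` is a hypothetical name, and the minute-resolution format is an assumption modelled on the `2016-09-02T22:15` time stamp used in the example above.

```python
from datetime import datetime


def time_range(start: datetime, end: datetime) -> str:
    """Build a '<start> to <end>' search string with minute resolution."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    fmt = "%Y-%m-%dT%H:%M"  # ISO-8601 date and time, truncated to minutes
    return f"{start.strftime(fmt)} to {end.strftime(fmt)}"


period = time_range(datetime(2016, 9, 2, 22, 15), datetime(2016, 10, 1))
print(period)

# The resulting string would then be passed as the time search key, e.g.:
# freva.databrowser(project="observations", time=period)
```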

You might also want to know which values an attribute can take after a certain search is done. For this you can use the `facet` flag (facets are the attributes used to search for and subset the data) or the `all_facets` flag. For example, to see all facets that are available in the observations project:

In [8]:
freva.databrowser(project="observations", all_facets=True)

{'variable': ['pr'],
 'time_frequency': ['30min'],
 'cmor_table': ['30min'],
 'realm': ['atmos'],
 'institute': ['cpc'],
 'ensemble': ['r1i1p1'],
 'experiment': ['cmorph'],
 'dataset': ['testdata'],
 'product': ['grid'],
 'model': ['cpc'],
 'project': ['observations']}

Likewise, you can list all values of the `model` facet in the databrowser:

In [9]:
freva.databrowser(facet="model")

{'model': ['access-cm2',
  'cpc',
  'mpi-esm1-2-lr',
  'mpi-m-mpi-esm-lr-clmcom-cclm4-8-17-v1',
  'ncc-noresm1-m-gerics-remo2015-v1',
  'nodc',
  'um-ra2t']}

__Note__: If you don't give any search constraints, as in the case above, the command will query the whole data server.

You can also retrieve the number of files found for each facet value by giving the `count` flag:

In [10]:
freva.databrowser(facet="model", count=True)

{'model': {'access-cm2': 1,
  'cpc': 24,
  'mpi-esm1-2-lr': 1,
  'mpi-m-mpi-esm-lr-clmcom-cclm4-8-17-v1': 10,
  'ncc-noresm1-m-gerics-remo2015-v1': 2,
  'nodc': 1,
  'um-ra2t': 10}}
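The returned object is a plain nested dictionary, so it can be processed with standard Python tools. As a small sketch, the snippet below ranks the models by their counts. The dictionary is copied from the output above rather than fetched from a server, and `rank_facet` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Copied from the output above; with freva available this would come from
# freva.databrowser(facet="model", count=True)
counts = {
    "model": {
        "access-cm2": 1,
        "cpc": 24,
        "mpi-esm1-2-lr": 1,
        "mpi-m-mpi-esm-lr-clmcom-cclm4-8-17-v1": 10,
        "ncc-noresm1-m-gerics-remo2015-v1": 2,
        "nodc": 1,
        "um-ra2t": 10,
    }
}


def rank_facet(result: Dict[str, Dict[str, int]], facet: str) -> List[Tuple[str, int]]:
    """Sort the values of one facet by count (descending), then by name."""
    return sorted(result[facet].items(), key=lambda item: (-item[1], item[0]))


ranked = rank_facet(counts, "model")
for name, number in ranked:
    print(f"{number:>3}  {name}")
```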

Reverse search is also possible. You can query the metadata of a given file:

In [11]:
file_to_query = next(freva.databrowser()) # Get a file
file_to_query

'/tmp/user_data/user-runner/eur-11b/clex/UM-RA2T/Bias-correct/hr/user_data/hr/r0i0p0/v20221207/tas/tas_hr_UM-RA2T_Bias-correct_r0i0p0_197001041800-197001050300.nc'

In [12]:
freva.databrowser(file=file_to_query, all_facets=True)

{'variable': ['tas'],
 'time_frequency': ['hr'],
 'cmor_table': ['hr'],
 'realm': ['user_data'],
 'institute': ['clex'],
 'ensemble': ['r0i0p0'],
 'experiment': ['bias-correct'],
 'dataset': ['crawl_my_data'],
 'product': ['eur-11b'],
 'model': ['um-ra2t'],
 'project': ['user-runner']}
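Since a single file has exactly one value per facet in the output above, the one-element lists can be unpacked for easier reading. The following sketch uses a hypothetical helper `flatten_facets` and a dictionary copied from the output above; facets with more than one value are left as lists.

```python
from typing import Dict, List, Union

# Copied from the output above; with freva available this would come from
# freva.databrowser(file=file_to_query, all_facets=True)
metadata = {
    "variable": ["tas"],
    "time_frequency": ["hr"],
    "cmor_table": ["hr"],
    "realm": ["user_data"],
    "institute": ["clex"],
    "ensemble": ["r0i0p0"],
    "experiment": ["bias-correct"],
    "dataset": ["crawl_my_data"],
    "product": ["eur-11b"],
    "model": ["um-ra2t"],
    "project": ["user-runner"],
}


def flatten_facets(facets: Dict[str, List[str]]) -> Dict[str, Union[str, List[str]]]:
    """Replace one-element lists by their single value, keep longer lists."""
    return {key: vals[0] if len(vals) == 1 else vals for key, vals in facets.items()}


flat = flatten_facets(metadata)
print(flat["variable"], flat["model"], flat["project"])
```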

## Example: Using the databrowser to open datasets with xarray

In [13]:
import xarray as xr
dset = xr.open_mfdataset(freva.databrowser(variable="pr", project="observations"), combine="by_coords")
dset

| | Array | Chunk |
|---|---|---|
| Bytes | 51.83 MiB | 2.16 MiB |
| Shape | (48, 412, 687) | (2, 412, 687) |
| Dask graph | 24 chunks in 49 graph layers | 24 chunks in 49 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |
