# Data structures
GeoST uses standardized internal data structures and data validation so that its
functionality can be applied reliably. This section of the user guide takes a closer look
at those data structures.

## Collection objects
As shown in the first [introduction](../getting_started/introduction.ipynb#concept) to GeoST,
data is held in so-called `Collection` objects, the core objects of GeoST, which contain header
and data tables. Basically, the two can be described as:

* the *header table* describes metadata and spatial information.
* the *data table* contains the logged data.

The header and data tables have a one-to-many relationship: one survey (e.g. borehole) is
one row in the header and multiple rows in the data. 
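
The one-to-many relationship can be pictured with two small, hand-made pandas tables. This is only an illustrative sketch: the survey numbers and values are made up, and real GeoST tables carry more columns than shown here.

```python
import pandas as pd

# Hypothetical header: one row per survey.
header = pd.DataFrame(
    {"nr": ["B01", "B02"], "surface": [1.2, 0.8], "end": [-1.8, -1.2]}
)

# Hypothetical data: one row per described layer, linked to the header by "nr".
data = pd.DataFrame(
    {
        "nr": ["B01", "B01", "B01", "B02", "B02"],
        "top": [0.0, 1.0, 2.5, 0.0, 0.5],
        "bottom": [1.0, 2.5, 3.0, 0.5, 2.0],
        "lith": ["K", "Z", "K", "V", "Z"],
    }
)

# Number of data rows that belong to each survey.
layers_per_survey = data.groupby("nr").size()
print(layers_per_survey)

# validate="one_to_many" makes pandas raise if "nr" is not unique in the header.
merged = header.merge(data, on="nr", validate="one_to_many")
print(len(merged))
```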

Commonly available types of subsurface data include point-like data, such as boreholes,
CPTs and well logs, and line-like data, such as seismics, GPR and EM. Each data source has
its own Collection type. For example, borehole data is held in a
[`BoreholeCollection`](../api_reference/borehole_collection.rst) and CPT data in a
[`CptCollection`](../api_reference/cpt_collection.rst) (see figure below).

<p align="left">
    <img src="../_static/object_hierarchy.png" alt="GeoST object hierarchy" title="GeoST object hierarchy" width="1000" />
</p>

Making selections on a Collection may alter the header and data tables, but Collections
automatically maintain alignment between the two. Users can therefore safely make
selections and analyse the data while being sure of consistency. It is recommended to
work with Collections by default, unless you specifically need only the header or the
data table. By default, read functions for different types of data return a Collection
(see: [Reading data](./reading_data.ipynb)). For example, reading the borehole sample data
available in GeoST shows that the resulting object is a `BoreholeCollection`. The example
also shows that a Collection holds horizontal and vertical spatial references.

In [None]:
import geost

# Load the Utrecht Science Park example borehole data
boreholes_collection = geost.data.boreholes_usp()

# boreholes_collection is an instance of BoreholeCollection and contains 67 boreholes
print(boreholes_collection)

# Print data types of header and data attributes
print(f"Data type header: {type(boreholes_collection.header)}")
print(f"Data type data: {type(boreholes_collection.data)}")

# Print the horizontal and vertical reference systems
print(boreholes_collection.horizontal_reference)
print(boreholes_collection.vertical_reference)
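
To get a feel for what "maintaining alignment" means, the sketch below mimics the idea with plain pandas. It is not GeoST's implementation, and the tables and values are hypothetical: after a selection on the data, surveys that no longer have any data rows are dropped from the header.

```python
import pandas as pd

# Hypothetical header and data tables linked by the survey number "nr".
header = pd.DataFrame({"nr": ["B01", "B02", "B03"], "end": [-3.0, -2.0, -5.0]})
data = pd.DataFrame(
    {
        "nr": ["B01", "B01", "B02", "B03", "B03"],
        "lith": ["K", "Z", "V", "K", "K"],
    }
)

# A selection on the data table: keep only the rows with lithology "Z".
data_selected = data[data["lith"] == "Z"]

# Re-align the header: keep only surveys that still occur in the data.
header_aligned = header[header["nr"].isin(data_selected["nr"])]

print(header_aligned["nr"].tolist())
```

A Collection performs this kind of bookkeeping for you, which is why working with Collections is the recommended default.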

### Header table
The header table is a GeoPandas [`GeoDataFrame`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html) instance that holds spatial information, in a "geometry" column, and
metadata such as the surface level and end depth. The geometry column of the header
contains point geometries in the case of boreholes and CPTs, and linestring geometries for
line data such as seismics. Each entry (row in the GeoDataFrame) corresponds to one specific
survey, e.g. one borehole or one seismic line.

A header table requires a bare minimum of data columns to be present to ensure that all
built-in methods of a Collection can be used:

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| nr | Must be interpretable as string | Identification name/number/code of the point survey |
| x | Must be of numeric type (int or float) | X-coordinate |
| y | Must be of numeric type (int or float) | Y-coordinate |
| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m +NAP |
| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m +NAP |
| geometry | `shapely.geometry.Point` in case of point data | Geometry object of the survey location |
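
As a rough illustration of the criteria in the table, the sketch below checks a header-like table with plain pandas. The function `check_header` is hypothetical and not part of GeoST, and a plain `DataFrame` with WKT strings stands in for a real GeoDataFrame with `shapely` geometries.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

REQUIRED_HEADER_COLUMNS = ["nr", "x", "y", "surface", "end", "geometry"]
NUMERIC_COLUMNS = ["x", "y", "surface", "end"]


def check_header(header: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means all checks pass."""
    problems = [
        f"missing column: {column}"
        for column in REQUIRED_HEADER_COLUMNS
        if column not in header.columns
    ]
    if problems:
        return problems
    for column in NUMERIC_COLUMNS:
        if not is_numeric_dtype(header[column]):
            problems.append(f"column is not numeric: {column}")
    if not problems and not (header["surface"] > header["end"]).all():
        problems.append("surface must be higher than end")
    return problems


# Hypothetical header; the second survey deliberately violates surface > end.
header = pd.DataFrame(
    {
        "nr": ["B01", "B02"],
        "x": [139_600.0, 139_900.0],
        "y": [455_100.0, 455_400.0],
        "surface": [1.2, 0.8],
        "end": [-1.8, 2.0],
        "geometry": ["POINT (139600 455100)", "POINT (139900 455400)"],
    }
)
print(check_header(header))
```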

The header is not limited to just these columns. Any number of columns can be added to give
additional information on surveys. Some analysis methods may also add information to the header. For instance, the method [`BoreholeCollection.get_area_labels`](../api_reference/generated/geost.base.BoreholeCollection.get_area_labels.rst) has an argument `include_in_header` which, if set
to `True`, adds a column with the results to the header GeoDataFrame. Otherwise, it returns a separate DataFrame.

If you're only interested in survey locations and/or metadata, it is advised to work
directly with the header object. This avoids some overhead caused by a parent Collection
object (the header is checked against the data after every operation to ensure
header/data alignment). Read functions for point and line data (see: [Reading data](./reading_data.ipynb)) return a corresponding Collection object by default, but you can assign
only the header to a variable in order to continue with just the header data. See the example below.

In [None]:
# Load the Utrecht Science Park example borehole data and only assign the header data.
boreholes_header = geost.data.boreholes_usp().header

# Print the first rows of the header data.
boreholes_header.head()

### Data table
Data tables are a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) instance and hold all the logged data of any survey. In GeoST we mainly distinguish between **"layered"** and **"discrete"** data:

* *Layered* data contains data that is logged in terms of layers (i.e. depth intervals over which properties are the same) with **"top"** and **"bottom"** information for each layer.
* *Discrete* data contains data that is logged at discrete intervals (e.g. every 20 cm) with **"depth"** information for each measurement.

One point or line survey (i.e. one row in the header) can be associated with multiple rows of data. For example, a single borehole with 10 described layers is represented by one row in the header GeoDataFrame and ten rows in the data DataFrame.

Just like the header, a data table also requires a bare minimum of columns to be present to ensure
that all built-in methods of a Collection can be applied. In case of "layered" data:

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| nr | Must be interpretable as string | Identification name/number/code of the point survey |
| x | Must be of numeric type (int or float) | X-coordinate |
| y | Must be of numeric type (int or float) | Y-coordinate |
| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m |
| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m |
| top | Must be of numeric type (int or float); starts at 0; is increasing | Depth of the layer top below the surface. The first layer always starts at 0 and values increase downwards |
| bottom | Must be of numeric type (int or float); is larger than top; is increasing | Depth of the layer bottom below the surface |
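
The `top` and `bottom` criteria can be illustrated with a few pandas checks. This is a sketch on a hypothetical table, assuming `top` and `bottom` increase downwards from 0 at the surface; it does not reproduce GeoST's own validation.

```python
import pandas as pd

# Hypothetical layered data for two surveys (only the depth columns are shown).
layers = pd.DataFrame(
    {
        "nr": ["B01", "B01", "B01", "B02", "B02"],
        "top": [0.0, 1.0, 2.5, 0.0, 0.5],
        "bottom": [1.0, 2.5, 3.0, 0.5, 2.0],
    }
)

grouped = layers.groupby("nr")

# The first layer of every survey starts at 0.
starts_at_zero = bool((grouped["top"].first() == 0).all())

# Every layer bottom is larger than its top.
bottom_below_top = bool((layers["bottom"] > layers["top"]).all())

# Layer tops increase downwards within each survey.
tops_increasing = bool(
    grouped["top"].apply(lambda tops: tops.is_monotonic_increasing).all()
)

print(starts_at_zero, bottom_below_top, tops_increasing)
```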

If the table contains inclined data, such as boreholes drilled at an angle so that the x,y-coordinates of the top and bottom of a layer differ, the columns below must also be present:

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| x_bot | Must be of numeric type (int or float) | X-coordinate of layer bottom (only required if survey does not point straight down) |
| y_bot | Must be of numeric type (int or float) | Y-coordinate of layer bottom (only required if survey does not point straight down) |

In case the data table holds "discrete" data the columns below must be present to ensure that all built-in methods work. Note that the only difference is the "depth" column instead of the "top" and "bottom" columns.

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| nr | Must be interpretable as string | Identification name/number/code of the point survey |
| x | Must be of numeric type (int or float) | X-coordinate |
| y | Must be of numeric type (int or float) | Y-coordinate |
| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m |
| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m |
| depth | Must be of numeric type (int or float); is increasing | Depth where the measurement was taken |
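
Because the two layouts differ only in their depth columns, the kind of table can be inferred from the column names. The helper `table_kind` below is hypothetical and not part of GeoST; it only illustrates the distinction.

```python
def table_kind(columns) -> str:
    """Classify a data table as 'layered' or 'discrete' from its column names."""
    names = set(columns)
    if {"top", "bottom"} <= names:
        return "layered"
    if "depth" in names:
        return "discrete"
    raise ValueError("expected 'top' and 'bottom' columns, or a 'depth' column")


print(table_kind(["nr", "x", "y", "surface", "end", "top", "bottom", "lith"]))
print(table_kind(["nr", "x", "y", "surface", "end", "depth", "qc"]))
```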

Like the header, the data table is not limited to the columns above: all additional columns contain the actual data, with measurements for each layer or at each depth.

If you're only interested in the measurements and don't need to work with geometries or
any other header data, it is advised to work directly with the data table. This avoids
some overhead caused by a Collection object (the header is checked against the data
after every operation to ensure header/data alignment).
The different read functions for data (see: [Reading data](./reading_data.ipynb))
return a corresponding Collection object by default, but you can assign only the pandas `DataFrame` of the data table to a variable to continue with just the data. See the example below. Some
read functions, such as [`read_borehole_table`](../api_reference/generated/geost.read_borehole_table.rst), provide the argument `as_collection`, which defaults to `True` but can be set to `False` to
return only the data table.

In [None]:
# Load the Utrecht Science Park example borehole data and only assign the data.
boreholes_data = geost.data.boreholes_usp().data

# Print the first few rows of boreholes data.
boreholes_data.head()

## GeoST Accessors
When you only need to work with either the header or the data table, all the functionality
available in Collections is also available to the header [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/geodataframe.html) and data [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) tables. This is achieved by so-called "accessors". Under the hood, every Collection method uses these accessors: a method operates specifically on the header or on the data table, and the Collection then resolves the alignment between the two.
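
Accessors are a general pandas extension mechanism. The sketch below registers a small accessor under the made-up name `demo` to show how such a namespace becomes available on every DataFrame; the accessor name and the `rows_where` method are hypothetical and unrelated to GeoST's own accessors.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo")  # hypothetical name
class DemoAccessor:
    """Adds a `.demo` namespace with custom methods to every DataFrame."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._obj = pandas_obj

    def rows_where(self, column: str, value) -> pd.DataFrame:
        """Return the rows in which `column` equals `value`."""
        return self._obj[self._obj[column] == value]


data = pd.DataFrame({"nr": ["B01", "B01", "B02"], "lith": ["K", "Z", "Z"]})

# The result is a regular DataFrame, so the accessor can be chained.
sand = data.demo.rows_where("lith", "Z")
sand_b02 = sand.demo.rows_where("nr", "B02")
print(type(sand))
print(sand_b02)
```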

For the header table and its associated methods, the [`.gsthd`](../api_reference/header_accessors.rst) accessor is available; for the data table, the [`.gstda`](../api_reference/data_accessors.rst) accessor is available. Below we briefly demonstrate how these work by comparing the usage of Collection methods with those from the accessors.

In [None]:
# Create separate `collection`, `header` and `data` variables for the demonstration.
collection = geost.data.boreholes_usp()
header = collection.header
data = collection.data

Let's first compare the usage of [`select_within_bbox`](../api_reference/generated/geost.base.BoreholeCollection.select_within_bbox.rst), a method that operates on the header table. After selecting from the header table, the [`.gsthd`](../api_reference/header_accessors.rst) accessor remains available, for example for making further selections or chaining selections.

In [None]:
collection_select = collection.select_within_bbox(139_500, 455_000, 140_000, 455_500)
header_select = header.gsthd.select_within_bbox(139_500, 455_000, 140_000, 455_500)

print(collection_select)  # Selection result is a BoreholeCollection
print(type(header_select))  # Selection result is a GeoDataFrame

header_select.gsthd  # Selection result also has the gsthd accessor and methods available

The data accessor works in exactly the same way. We demonstrate this with the [`slice_by_values`](../api_reference/generated/geost.base.BoreholeCollection.slice_by_values.rst) method, which operates on the data table. After selecting from the data table, the [`.gstda`](../api_reference/data_accessors.rst) accessor remains available, for example for making further selections or chaining selections.

In [None]:
# Slice the data to the layers that have sand ("Z") as the main lithology.
collection_select = collection.slice_by_values("lith", "Z")
data_select = data.gstda.slice_by_values("lith", "Z")

print(collection_select)  # Selection result is a BoreholeCollection
print(type(data_select))  # Selection result is a DataFrame

data_select.gstda  # Selection result also has the gstda accessor and methods available

## Model data
GeoST also supports working with model data and offers methods to combine these data with
point and line data. Model data does not follow the same header/data approach as point
and line data. Instead there are generic model classes, of which some have an
implementation that adds specific functionality for that model. An example of this is
the [`VoxelModel`](../api_reference/voxelmodel.rst) as a generic model class and [`GeoTOP`](../api_reference/bro_geotop.rst)
being a specific implementation of a voxel model. GeoST currently supports the following 
generic models and implementations:

**Generic models and implementations**
* *[`VoxelModel`](../api_reference/voxelmodel.rst)*: Class for voxel models, with data 
stored in the `ds` attribute, an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).
    * Implementations: [`GeoTOP`](../api_reference/bro_geotop.rst)
* *`LayerModel`*: Class for layer models, not yet implemented
    * Implementations: None

<p align="left">
    <img src="../_static/object_hierarchy_models.png" alt="GeoST model object hierarchy" title="GeoST model object hierarchy" width="1000" />
</p>

### Voxel models
The [`VoxelModel`](../api_reference/voxelmodel.rst) class stores data in the `ds`
attribute, which is an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).
A custom voxel model can be instantiated from a NetCDF file. For this, see the documentation of the 
[`VoxelModel.from_netcdf`](../api_reference/generated/geost.models.VoxelModel.from_netcdf.rst) class constructor.
An instance of [`VoxelModel`](../api_reference/voxelmodel.rst) offers basic methods for 
selecting, slicing and exporting models.

For more guidance on using a voxel model within GeoST, see the [BRO GeoTOP](../user_guide/bro_geotop.ipynb)
section in the user guide.


