# Data structures
GeoST uses standardized internal data structures and data validation to ensure that the
functionality that GeoST offers can always reliably be applied. This user guide section 
dives deeper into GeoST data structures.

## Collection objects
As shown in the [introduction](../getting_started/introduction.ipynb#concept) to GeoST,
data is held in so-called `Collection` objects, the core objects of GeoST, which contain a header
table and a data table. In short:

* the header table describes metadata and spatial information.
* the data table contains the logged data.

The header and data tables have a one-to-many relationship: one borehole is one row in the
header and multiple rows in the data. 
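
The one-to-many relationship can be illustrated with plain pandas. The sketch below uses two small hand-made tables (the borehole numbers and values are invented for illustration and are not GeoST example data) and counts the data rows that belong to each header row.

```python
import pandas as pd

# Hypothetical header: one row per borehole.
header = pd.DataFrame(
    {"nr": ["B01", "B02"], "x": [100.0, 200.0], "y": [50.0, 60.0]}
)

# Hypothetical data: several layer rows per borehole, linked through "nr".
data = pd.DataFrame(
    {
        "nr": ["B01", "B01", "B01", "B02", "B02"],
        "top": [0.0, 0.5, 1.2, 0.0, 0.8],
        "bottom": [0.5, 1.2, 3.0, 0.8, 2.0],
        "lith": ["K", "V", "Z", "K", "Z"],
    }
)

# Number of data rows per header row.
rows_per_borehole = data.groupby("nr").size()
print(rows_per_borehole)

# validate="one_to_many" makes pandas raise if "nr" is not unique in the header.
merged = header.merge(data, on="nr", validate="one_to_many")
print(len(merged))
```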

Commonly available types of subsurface data comprise point-like data such as boreholes,
CPTs and well logs, and line-like data such as seismics, GPR and EM. Different data sources are
related to specific Collection objects. For example, borehole data is held in a
[`BoreholeCollection`](../api_reference/borehole_collection.rst) and CPT data in a
[`CptCollection`](../api_reference/cpt_collection.rst). Making selections in a Collection
may alter both the header and the data table, but Collections automatically
maintain alignment between the two. Users can therefore safely make selections and analyse
the data while being sure of consistency. It is recommended to work with Collections by
default, unless you specifically need only the header or the data table. By default,
read functions for different types of data return a Collection (see: [Reading data](./reading_data.ipynb)).
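
What "alignment" means can be sketched with plain pandas: after a selection in the header, the data table is restricted to the surveys that remain. This is only an illustration of the idea on made-up tables. It is not GeoST's internal implementation, and the helper `align_data` is hypothetical.

```python
import pandas as pd


def align_data(header: pd.DataFrame, data: pd.DataFrame) -> pd.DataFrame:
    """Keep only the data rows whose "nr" still occurs in the header."""
    return data[data["nr"].isin(header["nr"])].reset_index(drop=True)


# Made-up header and data tables, linked through "nr".
header = pd.DataFrame({"nr": ["B01", "B02", "B03"], "end": [-3.0, -12.0, -25.0]})
data = pd.DataFrame(
    {"nr": ["B01", "B01", "B02", "B03", "B03"], "top": [0.0, 1.0, 0.0, 0.0, 4.0]}
)

# Select the boreholes that reach below -10 m, then bring the data in line.
selected_header = header[header["end"] < -10.0]
selected_data = align_data(selected_header, data)

print(sorted(selected_data["nr"].unique()))
```

A Collection performs this kind of bookkeeping for you, which is why working with the Collection is the recommended default.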

In [1]:
import geost

# Load the Utrecht Science Park example borehole data
boreholes_collection = geost.data.boreholes_usp()

# boreholes_collection is an instance of BoreholeCollection and contains 67 boreholes
print(boreholes_collection)
print(boreholes_collection.horizontal_reference)
print(boreholes_collection.vertical_reference)

BoreholeCollection:
# header = 67
EPSG:28992
5709


### Header table
Header tables are GeoPandas [`GeoDataFrame`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html) instances and hold spatial information, in a "geometry" column, and other
metadata such as the surface level and end depth. The geometry column of the header
contains point geometries in the case of boreholes and CPTs, and linestring geometries in the case of, for instance,
seismic data. Each entry (row in the GeoDataFrame) corresponds to one point or line survey:
e.g. one borehole or one seismic line.

A header table requires a bare minimum of data columns to be present to ensure that all
built-in methods of a Collection can be used:

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| nr | Must be interpretable as string | Identification name/number/code of the point survey |
| x | Must be of numeric type (int or float) | X-coordinate |
| y | Must be of numeric type (int or float) | Y-coordinate |
| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m +NAP |
| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m +NAP |
| geometry | `shapely.geometry.Point` in case of point data | Point geometry of the survey location |
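
The tabular criteria above can be sketched as a few pandas checks. This is a minimal illustration under stated assumptions, not GeoST's own validation: the function `check_header` is hypothetical, and the geometry column is left out so that the sketch needs only pandas.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

REQUIRED_COLUMNS = ["nr", "x", "y", "surface", "end"]
NUMERIC_COLUMNS = ["x", "y", "surface", "end"]


def check_header(header: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means all checks pass."""
    problems = []

    missing = [c for c in REQUIRED_COLUMNS if c not in header.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    for column in NUMERIC_COLUMNS:
        if column in header.columns and not is_numeric_dtype(header[column]):
            problems.append(f"column '{column}' is not numeric")

    # Compare surface and end only when both are present and numeric.
    comparable = all(
        c in header.columns and is_numeric_dtype(header[c])
        for c in ("surface", "end")
    )
    if comparable and (header["surface"] <= header["end"]).any():
        problems.append("surface must be higher than end")

    return problems


good = pd.DataFrame(
    {
        "nr": ["B31H0541", "B31H0611"],
        "x": [139585.0, 139600.0],
        "y": [456000.0, 455060.0],
        "surface": [1.2, 1.2],
        "end": [-9.9, -23.0],
    }
)
bad = good.assign(end=[5.0, -23.0])  # first end depth lies above the surface

print(check_header(good))
print(check_header(bad))
```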

The header is not limited to just these columns. Any number of columns can be added to give
additional information on surveys. Some analysis methods may add information to the header. For instance, the method [`BoreholeCollection.get_area_labels`](../api_reference/generated/geost.base.BoreholeCollection.get_area_labels.rst) has an argument `include_in_header` which, if set
to `True`, adds a column with results to the header GeoDataFrame.

If you're only interested in survey locations and/or metadata, it is advised to work directly
with the header object to avoid the overhead of a parent Collection
object (caused by checks of the header against the data after every operation to
ensure header/data alignment). Read functions for point and line data (see: [Reading data](./reading_data.ipynb)) return a corresponding Collection object by default, but you can assign
only the header to a variable in order to continue with just the header data. See the example below.

In [2]:
# Load the Utrecht Science Park example borehole data and only assign the header data.
boreholes_header = geost.data.boreholes_usp().header

# Print the first rows of the header data.
boreholes_header.head()

| | nr | x | y | surface | end | geometry |
| - | -- | - | - | ------- | --- | -------- |
| 0 | B31H0541 | 139585.0 | 456000.0 | 1.2 | -9.9 | POINT (139585 456000) |
| 1 | B31H0611 | 139600.0 | 455060.0 | 1.2 | -23.0 | POINT (139600 455060) |
| 2 | B31H0718 | 139950.0 | 455200.0 | 1.3 | -271.2 | POINT (139950 455200) |
| 3 | B31H0803 | 139675.0 | 455087.0 | 2.16 | -4.84 | POINT (139675 455087) |
| 4 | B31H0806 | 139684.0 | 455384.0 | 1.0 | -49.5 | POINT (139684 455384) |


### Data table
Data tables are Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) instances and hold all the logged data of any survey. In GeoST we mainly distinguish between **"layered"** and **"discrete"** data:

* *Layered* data contains data that is logged in terms of layers (i.e. depth intervals over which properties are the same) with **"top"** and **"bottom"** information for each layer.
* *Discrete* data contains data that is logged over discrete intervals (e.g. every 20 cm) with **"depth"** information for each measurement.

One point or line survey (i.e. one row in the header) can be associated with multiple rows of data. E.g. a single borehole with 10 described layers is represented by one row in the header GeoDataFrame and ten rows in the data DataFrame.

Just like the header, a data table also requires a bare minimum of columns to be present to ensure
that all built-in methods of a Collection can be applied. In case of "layered" data:

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| nr | Must be interpretable as string | Identification name/number/code of the point survey |
| x | Must be of numeric type (int or float) | X-coordinate |
| y | Must be of numeric type (int or float) | Y-coordinate |
| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m |
| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m |
| top | Must be of numeric type (int or float); starts at 0; is increasing | Depth of layer top. The first layer always starts at 0 and depth increases downwards |
| bottom | Must be of numeric type (int or float); is larger than top; is increasing | Depth of layer bottom |
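
The `top` and `bottom` criteria above can be sketched per survey with pandas. The function `check_layers` is hypothetical and only illustrates the rules in the table. It is not the validation GeoST runs internally.

```python
import pandas as pd


def survey_ok(layers: pd.DataFrame) -> bool:
    """Check the top/bottom rules for the layers of a single survey."""
    starts_at_zero = layers["top"].iloc[0] == 0
    bottom_below_top = (layers["bottom"] > layers["top"]).all()
    tops_increase = layers["top"].is_monotonic_increasing
    bottoms_increase = layers["bottom"].is_monotonic_increasing
    return bool(
        starts_at_zero and bottom_below_top and tops_increase and bottoms_increase
    )


def check_layers(data: pd.DataFrame) -> pd.Series:
    """Return one boolean per survey number: True if its layers pass."""
    return pd.Series({nr: survey_ok(group) for nr, group in data.groupby("nr")})


# Made-up layered data: B01 is valid, B02 does not start at 0.
data = pd.DataFrame(
    {
        "nr": ["B01", "B01", "B01", "B02", "B02"],
        "top": [0.0, 0.2, 0.6, 0.5, 1.0],
        "bottom": [0.2, 0.6, 0.95, 1.0, 2.0],
    }
)

result = check_layers(data)
print(result)
```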

If the table contains inclined data, such as boreholes drilled at an angle, the x,y-coordinates of the top of a layer are not exactly the same as those of its bottom. In that case the columns below must additionally be present:

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| x_bot | Must be of numeric type (int or float) | X-coordinate of layer bottom (only required if survey does not point straight down) |
| y_bot | Must be of numeric type (int or float) | Y-coordinate of layer bottom (only required if survey does not point straight down) |

In case the data table holds "discrete" data the columns below must be present to ensure that all built-in methods work. Note that the only difference is the "depth" column instead of the "top" and "bottom" columns.

| Column name | Validation criteria | Description |
| ----------- | ------------------- | ----------- |
| nr | Must be interpretable as string | Identification name/number/code of the point survey |
| x | Must be of numeric type (int or float) | X-coordinate |
| y | Must be of numeric type (int or float) | Y-coordinate |
| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m |
| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m |
| depth | Must be of numeric type (int or float); is increasing | Depth where the measurement was taken |
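
For discrete data, the `depth` criterion and its relation to `surface` can be sketched as follows. The table is made up, and the conversion assumes that `depth` is measured downwards from the surface level, which the table above does not state explicitly.

```python
import pandas as pd

# Made-up discrete data for two surveys.
cpt = pd.DataFrame(
    {
        "nr": ["CPT01"] * 3 + ["CPT02"] * 2,
        "surface": [8.3, 8.3, 8.3, 1.5, 1.5],
        "depth": [0.01, 0.03, 0.05, 0.02, 0.04],
    }
)

# "is increasing": depth must not decrease within a survey.
depth_increases = cpt.groupby("nr")["depth"].apply(
    lambda depth: depth.is_monotonic_increasing
)
print(depth_increases)

# Assumed convention: depth is measured downwards from the surface level,
# so the elevation of a measurement is surface minus depth.
cpt["elevation"] = cpt["surface"] - cpt["depth"]
print(cpt)
```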

The data table is not limited to the columns above either. All additional columns contain the actual measurements for each layer or at each depth.

If you're only interested in the measurements and don't need to work with geometries or
any other header data, it is advised to work directly with the data table to
avoid the overhead of a Collection object (caused by
checks of the header against the data after every operation to ensure header/data alignment).
The different read functions for data (see: [Reading data](./reading_data.ipynb))
return a corresponding Collection object by default, but you can assign only the Pandas `DataFrame` of the data table to a variable to continue with just the data. See the example below. Some
read functions, such as [`read_borehole_table`](../api_reference/generated/geost.read_borehole_table.rst), provide the argument `as_collection`, which defaults to `True` but can be set to `False` to
return only the data table.

In [3]:
# Load the Utrecht Science Park example borehole data and only assign the data.
boreholes_data = geost.data.boreholes_usp().data

# Print the first few rows of boreholes data.
boreholes_data.head()

Unnamed: 0,nr,x,y,surface,end,top,bottom,lith,zm,zmk,...,cons,color,lutum_pct,plants,shells,kleibrokjes,strat_1975,strat_2003,strat_inter,desc
0,B31H0541,139585.0,456000.0,1.2,-9.9,0.0,0.2,K,,,...,,ON,,0,0,0,,EC,,[TEELAARDE#***#****#*] ..........................
1,B31H0541,139585.0,456000.0,1.2,-9.9,0.2,0.6,K,,,...,,BR,,0,0,0,,EC,,[KLEI#***#****#*] grysbruin.
2,B31H0541,139585.0,456000.0,1.2,-9.9,0.6,0.95,V,,,...,,BR,,0,0,0,,NI,,[VEEN#***#****#*] donkerbruin.
3,B31H0541,139585.0,456000.0,1.2,-9.9,0.95,2.8,Z,,ZMFO,...,,GR,,0,0,0,,EC,,[ZAND#***#****#*] FYN TOT matig fyn# iets slib...
4,B31H0541,139585.0,456000.0,1.2,-9.9,2.8,4.2,Z,,ZFC,...,,BR,,0,0,0,,BXWI,,[ZAND#***#****#*] fyn# grysbruin.


In [4]:
# Load the Utrecht Science Park example CPT data and only assign the data.
cpt_data = geost.data.cpts_usp().data

# Print the first few rows of CPT data.
cpt_data.head()


NOTE: Header has been reset to align with data because AUTO_ALIGN is enabled in the GeoST configuration.


Validation dropped 19916 row(s) for schema 'Discrete data non-inclined'.
Dropped indices: [0, 1, 2, ..., 202, 203, ...

Unnamed: 0,nr,x,y,vertical_datum,surface,cone_penetration_test_fk,cone_penetration_test_result_pk,penetration_length,depth,elapsed_time,...,magnetic_inclination,magnetic_declination,local_friction,pore_ratio,temperature,pore_pressure_u1,pore_pressure_u2,pore_pressure_u3,friction_ratio,end
14841,CPT000000074699,139645.149045,455662.028026,NAP,8.3,71493,98496641,0.01,0.01,,...,,,0.002,,,,,,2.2,0.0
14842,CPT000000074699,139645.149045,455662.028026,NAP,8.3,71493,98496642,0.03,0.03,,...,,,0.001,,,,,,1.0,0.0
14843,CPT000000074699,139645.149045,455662.028026,NAP,8.3,71493,98496643,0.05,0.05,,...,,,0.0,,,,,,,0.0
14844,CPT000000074699,139645.149045,455662.028026,NAP,8.3,71493,98496644,0.07,0.07,,...,,,0.002,,,,,,1.5,0.0
14845,CPT000000074699,139645.149045,455662.028026,NAP,8.3,71493,98496645,0.09,0.09,,...,,,0.005,,,,,,2.6,0.0


## GeoST Accessors
When you only need to work with either the header or the data table, all the functionality
available in Collections is also available to the header [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/geodataframe.html) and data [DataFrame](https://pandas.pydata.org/docs/reference/frame.html). This is achieved by so-called "accessors".
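
The accessor mechanism itself comes from pandas. The sketch below registers a small accessor with `pandas.api.extensions.register_dataframe_accessor` to show the pattern. The accessor name `demo` and the method `select_by_nr` are hypothetical and are not part of GeoST, so consult the API reference for the accessors GeoST actually provides.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Hypothetical accessor; the name "demo" is not part of GeoST."""

    def __init__(self, pandas_obj: pd.DataFrame):
        # Refuse DataFrames that lack the column the methods rely on.
        if "nr" not in pandas_obj.columns:
            raise AttributeError("DataFrame must have an 'nr' column")
        self._obj = pandas_obj

    def select_by_nr(self, numbers: list[str]) -> pd.DataFrame:
        """Return the rows that belong to the given survey numbers."""
        return self._obj[self._obj["nr"].isin(numbers)]


# Any DataFrame with an "nr" column now exposes the methods via `.demo`.
data = pd.DataFrame({"nr": ["B01", "B01", "B02"], "top": [0.0, 0.5, 0.0]})
subset = data.demo.select_by_nr(["B01"])
print(len(subset))
```

The benefit of this design is that the table stays an ordinary DataFrame, so all regular pandas operations keep working alongside the added methods.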

## Point and line data
To describe point data (e.g. boreholes, well logs, CPTs) and line data (e.g. seismics,
GPR, EM) you need a minimal amount of information on the identification and position of each
point/line (`Header`). For each point/line there are measurements or descriptions available
of the subsurface (`Data`). Dedicated header and data objects are used to
describe point and line data.


These basic objects are used to build `Collections`. E.g. a [`BoreholeCollection`](../api_reference/borehole_collection.rst)
is built from the combination of a [`PointHeader`](../api_reference/point_header.rst)
object and a [`LayeredData`](../api_reference/layered_data.rst) object. The figure
below gives a complete overview of the object hierarchy in GeoST for point and line data.

<p align="left">
    <img src="../_static/object_hierarchy.png" alt="GeoST object hierarchy" title="GeoST object hierarchy" width="1000" />
</p>

## Model data
GeoST supports working with model data and offers methods to combine these data with
point and line data. Model data does not follow the same header/data approach as point
and line data. Instead there are generic model classes, of which some have an
implementation that adds specific functionality for that model. An example of this is
the [`VoxelModel`](../api_reference/voxelmodel.rst) as a generic model class and [`GeoTOP`](../api_reference/bro_geotop.rst)
being a specific implementation of a voxel model. GeoST currently supports the following
generic models and implementations:

**Generic models and implementations**
* *[`VoxelModel`](../api_reference/voxelmodel.rst)*: Class for voxel models, with data
stored in the `ds` attribute, an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).
    * Implementations: [`GeoTOP`](../api_reference/bro_geotop.rst)
* *`LayerModel`*: Class for layer models, not yet implemented
    * Implementations: None

<p align="left">
    <img src="../_static/object_hierarchy_models.png" alt="GeoST model object hierarchy" title="GeoST model object hierarchy" width="1000" />
</p>

### Voxel models
The [`VoxelModel`](../api_reference/voxelmodel.rst) class stores data in the `ds`
attribute, which is an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).
A custom voxel model can be instantiated from a NetCDF file. For this, see the documentation of the 
[`VoxelModel.from_netcdf`](../api_reference/generated/geost.models.VoxelModel.from_netcdf.rst) class constructor.
An instance of [`VoxelModel`](../api_reference/voxelmodel.rst) offers basic methods for 
selecting, slicing and exporting models.
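
To give an idea of what an `xarray.Dataset` holding voxel data looks like, the sketch below builds a tiny dataset directly with xarray and selects from it. The variable name `lith`, the dimension order `("y", "x", "z")` and all values are assumptions made for illustration. The sketch does not use the `VoxelModel` class, whose methods are documented in the API reference.

```python
import numpy as np
import xarray as xr

# Made-up voxel centre coordinates.
x = np.array([0.0, 100.0, 200.0])
y = np.array([0.0, 100.0])
z = np.array([-1.5, -1.0, -0.5, 0.0])

# One made-up integer value per voxel.
values = np.arange(y.size * x.size * z.size).reshape(y.size, x.size, z.size)

ds = xr.Dataset(
    data_vars={"lith": (("y", "x", "z"), values)},
    coords={"x": x, "y": y, "z": z},
)

# Horizontal slice at the voxel layer closest to z = -0.8.
layer = ds["lith"].sel(z=-0.8, method="nearest")

# Spatial subset using label-based slicing on the x coordinate.
subset = ds.sel(x=slice(0.0, 100.0))

print(layer.shape, dict(subset.sizes))
```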

For more guidance on using a voxel model within GeoST, see the [BRO GeoTOP](../user_guide/bro_geotop.ipynb)
section in the user guide.


