# SDK Reference Table - `table()` - read

Ocean Data Platform offers both API and Python SDK interfaces. This notebook highlights the Python SDK.

## Installation

If you are not working in the [ODP Workspace](https://workspace.hubocean.earth/), you need to first install the Python SDK package.

```bash
pip install -U odp-sdk
```

## Client Initialization

In [1]:
# Import the Client class from the odp.client module
from odp.client import Client

# Create an instance of the Client class
client = Client()



If you are using [ODP Workspaces](https://workspace.hubocean.earth/), you are authenticated automatically. If you are working outside the Workspace, initializing the `Client` opens a browser window to complete the authentication process.
If you are working outside ODP Workspaces and do not want to authenticate through the browser, you can use API key authentication instead.
You can generate an API key in the Ocean Data Platform web interface, under your user profile.
```python
client = Client(api_key="your-api-key")
```
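
To avoid hardcoding the key in a notebook, one option is to read it from an environment variable. The sketch below is an assumption rather than an SDK feature: the variable name `ODP_API_KEY` and the helper `client_kwargs_from_env` are hypothetical, and only the `api_key` keyword comes from the example above.

```python
import os


def client_kwargs_from_env(env=None, var_name="ODP_API_KEY"):
    """Return keyword arguments for Client() based on an environment variable.

    `ODP_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get(var_name, "").strip()
    # With no key set, return no arguments so Client() uses its default authentication
    return {"api_key": api_key} if api_key else {}


kwargs = client_kwargs_from_env()
# With the SDK installed: client = Client(**kwargs)
```

When the variable is unset, the helper returns an empty dictionary, so `Client(**kwargs)` behaves like the plain `Client()` call shown earlier.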

## Dataset Access

With an initialized `Client`, you can access a dataset by its UUID. The easiest way to find the UUID is to search for the dataset in the [ODP Catalog](https://app.hubocean.earth/catalog) and click API.

For the Table examples, we use a dataset from Brazil provided by one of our partners:

**Example Dataset**: PGS Biota Data - Mammal and Turtle Observations (Darwin Core Format)  
**Dataset ID (UUID)**: `1d801817-742b-4867-82cf-5597673524eb`  

**Columns in Table**: `occurrenceID`, `verbatimIdentification`, `scientificName`, `scientificNameID`, `lifeStage`, `individualCount`, `basisOfRecord`, `minimumDepthInMeters`, `eventDate`, `occurrenceRemarks`, `decimalLongitude`, `decimalLatitude`, `footprintWKT`, `license`, `occurrenceStatus`, `geodeticDatum`, `datasetName`, `institutionCode`, `otherCatalogNumbers`
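
Because dataset IDs are UUIDs, a copy-paste mistake can be caught locally before any request is made. The following is a minimal sketch using only the Python standard library; the helper name `is_canonical_uuid` is hypothetical and not part of the SDK.

```python
import uuid


def is_canonical_uuid(value):
    """Return True if `value` is a UUID written in the hyphenated 8-4-4-4-12 form."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces and missing hyphens, so compare the canonical text
    return str(parsed) == value.lower()


dataset_id = "1d801817-742b-4867-82cf-5597673524eb"
print(is_canonical_uuid(dataset_id))
print(is_canonical_uuid("pgs-biota-data"))
```

The check only confirms the format of the string. Whether a dataset with that ID exists can only be determined by the platform.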

## Get Dataset

In [2]:
# Get dataset
dataset = client.dataset("1d801817-742b-4867-82cf-5597673524eb")

The `dataset` from this UUID will be used in the examples below.

## Get Dataset Schema and Statistics

### schema()

In [None]:
# Get table schema
schema = dataset.table.schema()  # Returns pyarrow.Schema or None
print(f"Available columns: {schema}")

### stats()

In [None]:
# Get table statistics  
stats = dataset.table.stats()
print(f"Total observations: {stats.num_rows:,}")  
print(f"Dataset size: {stats.size:,} bytes")

## Query Table Data

Query the table data with `table.select()`, followed by a method that determines how the results are returned.

There are three different ways of receiving the results from the query:

Single batch
- A GeoPandas GeoDataFrame containing all the data (for smaller datasets or a quick view of the data): `dataset.table.select().all().dataframe()`

Streaming batches
- A stream of GeoPandas GeoDataFrames (if the dataset is too large to load in one go): `dataset.table.select().dataframes()`
- A stream of PyArrow RecordBatches (more performant and the recommended option): `dataset.table.select().batches()`

### Single batch (for smaller datasets)
This is a convenient way to load the data directly into a single Pandas DataFrame when you are dealing with small datasets.

In [None]:
# Get all marine observations as a single pandas DataFrame
result_dataframe = dataset.table.select().all().dataframe()
print(f"Complete dataset: {len(result_dataframe)} marine observations") 

You can protect against memory overflow and long-running queries by setting:
* `max_rows` (the maximum number of rows to be returned in the DataFrame)
* `max_time` (the timeout threshold)

In [None]:
# Get all marine observations as a single pandas DataFrame, with row and time limits
result_dataframe = dataset.table.select().all(max_rows=10_000_000_000, max_time=30.0).dataframe()
print(f"Complete dataset: {len(result_dataframe)} marine observations") 

### Streaming batches query
Ocean datasets are usually quite large, so it is often better to stream the data.

There are two ways of streaming the results: Pandas DataFrames (`dataframes()`) and PyArrow RecordBatches (`batches()`). Working with PyArrow has a performance advantage (it is more memory efficient) and is recommended if you are familiar with [PyArrow](https://arrow.apache.org/docs/python/index.html). Pandas, however, is what most users are more familiar with, and it is easy to convert a PyArrow RecordBatch to a Pandas DataFrame.

Streaming is designed so that you can stop at any point once you are done with your operations.
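
The early-stop pattern can be sketched without the SDK. In the example below, `fake_batches` is a hypothetical stand-in for a streaming query (it yields plain lists of row dictionaries rather than DataFrames or RecordBatches), and `take_rows` is an illustrative helper. The point is only that leaving the loop ends consumption of the stream.

```python
def fake_batches(num_batches=5, rows_per_batch=3):
    """Stand-in for a streaming query: yields one list of row dicts per batch."""
    for b in range(num_batches):
        yield [{"occurrenceID": f"obs-{b}-{i}"} for i in range(rows_per_batch)]


def take_rows(batches, limit):
    """Collect at most `limit` rows, then stop consuming the stream."""
    rows = []
    for batch in batches:
        # Only take as many rows from this batch as are still needed
        rows.extend(batch[: limit - len(rows)])
        if len(rows) >= limit:
            break  # leaving the loop means later batches are never requested
    return rows


sample = take_rows(fake_batches(), limit=4)
print(len(sample))
```

With the SDK, the same loop shape should apply to `dataset.table.select().dataframes()` or `.batches()`, although how the remaining server-side work is cancelled is not covered by this sketch.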

In [None]:
# Iterate by Pandas DataFrames - convenient for analysis
for dataframe_batch in dataset.table.select().dataframes():
    print(f"Analyzing DataFrame batch: {len(dataframe_batch)} observations")
    # Marine biology analysis on chunk
    depth_stats = dataframe_batch['minimumDepthInMeters'].describe()
    print(f"Depth statistics: {depth_stats}")

In [None]:
# Iterate by batches (PyArrow RecordBatch) - memory efficient for large datasets
import pyarrow as pa

for batch in dataset.table.select().batches():
    print(f"Processing batch with {batch.num_rows} observations")
    # Convert to Pandas
    df_batch = batch.to_pandas()
    # Process marine species in this batch
    unique_species = df_batch['scientificName'].nunique()
    print(f"Found {unique_species} unique species in this batch")
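
Note that a per-batch unique count cannot simply be summed, because the same species may appear in several batches. A minimal sketch of accumulating a global count across batches follows. It uses plain lists of dictionaries as a stand-in for the streamed batches, the sample rows are made up for illustration, and `count_unique_across_batches` is a hypothetical helper.

```python
def count_unique_across_batches(batches, column):
    """Count distinct non-null values of `column` over all batches."""
    seen = set()
    for batch in batches:
        for row in batch:
            value = row.get(column)
            if value is not None:
                seen.add(value)
    return len(seen)


# Made-up sample rows; the first species appears in both batches
batches = [
    [{"scientificName": "Balaenoptera"}, {"scientificName": "Chelonia mydas"}],
    [{"scientificName": "Balaenoptera"}, {"scientificName": None}],
]
print(count_unique_across_batches(batches, "scientificName"))
# With pandas batches, the equivalent update would be:
# seen.update(df_batch["scientificName"].dropna().unique())
```

Only the set of values seen so far is kept in memory, so the approach remains usable when the full table does not fit in memory.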

### Basic select() Operations

Within the `select()` method you can pass a filter expression to narrow down the results. The following operators are available:
* Logical: `AND`, `OR`, `NOT`
* Comparison: `>`, `<`, `>=`, `<=`, `==`, `!=` as well as `IS NULL`, `IS NOT NULL`
* Geospatial: `within`, `intersects`, `contains`

The examples use the single-batch method, but filters work the same way with the streaming methods.

In [None]:
# Select rows that match a specific value
dataframe = dataset.table.select("scientificName == 'Balaenoptera'").all().dataframe()
print(f"Number of rows: {dataframe.shape[0]}")

In [None]:
# Same query, passing the filter as an explicit keyword argument
dataframe = dataset.table.select(filter="scientificName == 'Balaenoptera'").all().dataframe()
print(f"Number of rows: {dataframe.shape[0]}")

In [None]:
# Select specific columns; this is more efficient than fetching all columns and dropping the unused ones in Python
dataframe = dataset.table.select(
    "minimumDepthInMeters > 100", 
    cols=["scientificName", "lifeStage", "minimumDepthInMeters", "eventDate"]
).all().dataframe()
print(f"Number of rows: {dataframe.shape[0]}")

In [None]:
# Select with multiple conditions
dataframe = dataset.table.select(
    "scientificName == 'Balaenoptera' AND minimumDepthInMeters > 100"
).all().dataframe()
print(f"Number of rows: {dataframe.shape[0]}")

In [None]:
# Select with named bind variables for safe, efficient queries:
dataframe = dataset.table.select(
    "scientificName == $species",
    vars={
        "species": "Balaenoptera"
    }
).all().dataframe()
print(f"Number of rows: {dataframe.shape[0]}")
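
When filter values come from user input, the filter string and the `vars` dictionary can be built together so that values never end up inside the query text. The helper below is a hypothetical sketch, not part of the SDK. It assumes that several bind variables can be combined with `AND` in one filter and that numeric values are accepted in `vars`; neither is confirmed by the examples above.

```python
def build_filter(conditions):
    """Turn (column, operator, value) triples into a filter string plus a vars dict."""
    allowed_ops = {">", "<", ">=", "<=", "==", "!="}
    clauses, variables = [], {}
    for index, (column, op, value) in enumerate(conditions):
        if op not in allowed_ops:
            raise ValueError(f"Unsupported operator: {op!r}")
        name = f"v{index}"  # generated bind variable name
        clauses.append(f"{column} {op} ${name}")
        variables[name] = value
    return " AND ".join(clauses), variables


filter_text, variables = build_filter([
    ("scientificName", "==", "Balaenoptera"),
    ("minimumDepthInMeters", ">", 100),
])
print(filter_text)
print(variables)
# With a dataset: dataset.table.select(filter_text, vars=variables).all().dataframe()
```

Column names and operators are still inserted as text, so they should come from a fixed list rather than from user input.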

In [None]:
# Geospatial filter: select observations whose footprint lies within a WKT polygon
dataframe = dataset.table.select(
    'footprintWKT within $area', 
    vars={"area": "POLYGON((-37 -12, -45 -26, -40 -28, -33 -13, -37 -12))"},
).all().dataframe()
print(f"Number of rows: {dataframe.shape[0]}")
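
The polygon in the example is written as WKT by hand. For rectangular areas, the string can be generated from a bounding box. The sketch below uses only the standard library; `bbox_to_wkt` is a hypothetical helper, and it assumes coordinates in longitude/latitude order, as in the polygon above.

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT polygon for a longitude/latitude bounding box."""
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("Minimum values must not exceed maximum values")
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first point to close the ring
    ]
    ring = ", ".join(f"{lon} {lat}" for lon, lat in corners)
    return f"POLYGON(({ring}))"


area = bbox_to_wkt(-45, -28, -33, -12)
print(area)
# With a dataset: dataset.table.select("footprintWKT within $area", vars={"area": area})
```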

### Aggregations
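The `aggregate()` method groups rows with `group_by` and computes a summary value per group with `aggr`. The following aggregation functions are available:
- max
- min
- sum
- count
- mean (average)

Geo aggregations
- h3 (read more about it here: https://h3geo.org/)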

In [None]:
# Aggregate by a column
dataframe = dataset.table.aggregate(
    group_by="lifeStage",
    aggr={"minimumDepthInMeters": "mean"}
)
print(dataframe)

In [None]:
# Aggregate by a column combined with a query
dataframe = dataset.table.aggregate(
    group_by="scientificName",
    filter="scientificName IS NOT NULL AND minimumDepthInMeters IS NOT NULL",
    aggr={
        "minimumDepthInMeters": "mean"
    }
)
print(dataframe.iloc[0:5])

In [None]:
# Aggregate without grouping
dataframe = dataset.table.aggregate(
    group_by='"TOTAL"',  # Special value
    aggr={ 
        "minimumDepthInMeters": "max"
    }
)
print(dataframe)

In [None]:
# Aggregate by h3 hexagons
dataframe = dataset.table.aggregate(
    group_by="h3(footprintWKT, 5)", # Arguments: Column containing the geometry, and resolution between 0 and 15 https://h3geo.org/docs/core-library/restable/
    filter="footprintWKT IS NOT NULL",
    aggr={
        "minimumDepthInMeters": "mean"
    }
)
print(dataframe.iloc[0:])
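
Conceptually, a grouped mean such as the `lifeStage` example above corresponds to the plain-Python computation below. This is an illustrative sketch on made-up rows, not the SDK's implementation: `mean_by_group` is a hypothetical helper, and skipping null values is an assumption that may differ from how `aggregate()` treats them.

```python
from collections import defaultdict


def mean_by_group(rows, group_col, value_col):
    """Compute the mean of `value_col` per distinct value of `group_col`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, row count]
    for row in rows:
        value = row.get(value_col)
        if value is None:
            continue  # assumption: nulls do not contribute to the mean
        entry = totals[row.get(group_col)]
        entry[0] += value
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up sample rows
rows = [
    {"lifeStage": "adult", "minimumDepthInMeters": 100.0},
    {"lifeStage": "adult", "minimumDepthInMeters": 300.0},
    {"lifeStage": "juvenile", "minimumDepthInMeters": 50.0},
    {"lifeStage": "juvenile", "minimumDepthInMeters": None},
]
print(mean_by_group(rows, "lifeStage", "minimumDepthInMeters"))
```

Doing this client-side requires transferring every row first, which is the cost that `aggregate()` avoids.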

### Performance tips

1. Use column selection: Only select the columns you need.
2. Use bind variables: They are more efficient and safer than string concatenation.
3. Filter early: Apply filters in the query rather than in Python.
4. Consider aggregation: Use `aggregate()` instead of selecting all data and aggregating in Python.
5. Use streaming: Handle large datasets by streaming them and iterating over the batches.

### Error handling 

In [None]:
# Catch the error raised by an invalid query (here: an unknown column and a malformed operator)
try:
    result = dataset.table.select("invalid_column =! 5").all().dataframe()
except ValueError as e:
    print(f"Query error: {e}")
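
Some mistakes can be caught before a query is sent, for example a misspelled column name in `cols`. The sketch below checks requested names against the column list given in the Dataset Access section; `unknown_columns` is a hypothetical helper, not an SDK function.

```python
# Column names as listed for the example dataset
KNOWN_COLUMNS = [
    "occurrenceID", "verbatimIdentification", "scientificName", "scientificNameID",
    "lifeStage", "individualCount", "basisOfRecord", "minimumDepthInMeters",
    "eventDate", "occurrenceRemarks", "decimalLongitude", "decimalLatitude",
    "footprintWKT", "license", "occurrenceStatus", "geodeticDatum",
    "datasetName", "institutionCode", "otherCatalogNumbers",
]


def unknown_columns(requested, known=KNOWN_COLUMNS):
    """Return the requested column names that are not in the known list."""
    known_set = set(known)
    return [name for name in requested if name not in known_set]


missing = unknown_columns(["scientificName", "invalid_column"])
if missing:
    print(f"Unknown columns, query not sent: {missing}")
# Since schema() returns a pyarrow.Schema, the known list could presumably be
# taken from dataset.table.schema().names instead of being hardcoded.
```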