# Accessing the Clean Air Framework Datastore
Data is stored on an object store on the JASMIN cloud as a "dataset".
A dataset is composed of all the datafiles that hold the data (e.g. NetCDF or CSV files) plus metadata about the dataset.
Metadata can be accessed independently of the data, so that you don't have to download the entire dataset if you don't need it.

Data objects are provided to represent the stored data.
Datasets are encapsulated by the `DataSet` class, whilst metadata is encapsulated by the `Metadata` class.
Access to stored datasets and metadata is via the `S3FSDataSetStore` and `S3FSMetadataStore` classes respectively.
(`s3fs` is the underlying library used to access the object store.)

Both data access classes provide a similar interface, working with their respective data objects.

## Credentials for Object Store Access
Anonymous access is enabled for the data bucket, but uploading data requires credentials with write permissions.

Documentation on `s3fs` credentials is here: https://s3fs.readthedocs.io/en/latest/#credentials
As it uses Amazon AWS's `botocore` library, the credential config is the same as for `botocore` & `boto3`.

In summary, the options are (in order of increasing precedence):

* `~/.aws/credentials` file (refer to [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)) 
with contents such as:
```
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
* Setting the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables
* Explicitly creating an `s3fs.S3FileSystem` instance and passing them as arguments (refer to next section for details)
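
The precedence order above can be illustrated with a small stand-alone sketch. It does not reproduce how `botocore` resolves credentials internally; `read_credentials_file` and `resolve_credentials` are hypothetical helpers written only for this illustration, and only the three sources listed above are modelled.

```python
import configparser
import os
import tempfile
from pathlib import Path


def read_credentials_file(path, profile="default"):
    """Return (key, secret) from an AWS-style credentials file, or None."""
    # interpolation=None so that a '%' in a secret is treated literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        return None
    key = parser[profile].get("aws_access_key_id")
    secret = parser[profile].get("aws_secret_access_key")
    return (key, secret) if key and secret else None


def resolve_credentials(explicit=None, environ=None, credentials_path=None):
    """Hypothetical helper: pick the highest-precedence source that is set."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return "explicit", explicit
    if "AWS_ACCESS_KEY_ID" in environ and "AWS_SECRET_ACCESS_KEY" in environ:
        return "environment", (
            environ["AWS_ACCESS_KEY_ID"],
            environ["AWS_SECRET_ACCESS_KEY"],
        )
    if credentials_path is not None:
        from_file = read_credentials_file(credentials_path)
        if from_file is not None:
            return "file", from_file
    return "anonymous", None


with tempfile.TemporaryDirectory() as tmp_dir:
    credentials_path = Path(tmp_dir) / "credentials"
    credentials_path.write_text(
        "[default]\n"
        "aws_access_key_id=AKIAIOSFODNN7EXAMPLE\n"
        "aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\n"
    )
    # Only the file is present, so the file is used
    source_file_only, _ = resolve_credentials(
        environ={}, credentials_path=credentials_path
    )
    # Environment variables take precedence over the file
    source_with_env, _ = resolve_credentials(
        environ={"AWS_ACCESS_KEY_ID": "env-key", "AWS_SECRET_ACCESS_KEY": "env-secret"},
        credentials_path=credentials_path,
    )
```

Here `source_file_only` is `"file"` and `source_with_env` is `"environment"`, mirroring the order in which the options are listed.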

## Creating a DataSetStore or MetadataStore
### Helper Functions
Helper functions are available that create instances of both `S3FSDataSetStore` and `S3FSMetadataStore`:
* `create_dataset_store`
* `create_metadata_store`

In [6]:
from clean_air.data.storage import create_dataset_store, create_metadata_store

TEST_BUCKET_NAME = "test-data"
dataset_store = create_dataset_store(storage_bucket_name=TEST_BUCKET_NAME)
metadata_store = create_metadata_store(storage_bucket_name=TEST_BUCKET_NAME)

Both functions accept the same parameters, which customise the resulting instance:

| Parameter Name        | Description | Default |
| --------------------- | ----------- | ------- |
| `storage_bucket_name` | Name of the bucket where datasets are stored | `caf-data` |
| `local_storage_path`  | Path to a writeable directory to store local copies of dataset files | A temporary directory in the system's `tmp` folder |
| `endpoint_url`        | The object store service endpoint URL. Changes depending on whether you are accessing data from inside or outside JASMIN, or using data stored on another AWS S3-compatible object store | `clean_air.data.storage.JasminEndpointUrls.EXTERNAL` |
| `anon`                | Whether to use anonymous access or credentials. `anon=False` is required for write access | `True` |
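
As a minimal sketch of overriding these defaults, the settings can be collected once and passed to both helper functions. The parameter names come from the table above; the `make_stores` helper, the `caf-cache` directory name, and passing `local_storage_path` as a string are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# A fixed, reusable directory for local copies instead of the default
# temporary one. The name "caf-cache" is purely illustrative.
cache_dir = Path(tempfile.gettempdir()) / "caf-cache"
cache_dir.mkdir(parents=True, exist_ok=True)

store_kwargs = {
    "storage_bucket_name": "test-data",
    "local_storage_path": str(cache_dir),  # assumption: a string path is accepted
    "anon": True,  # read-only access; anon=False is required for uploads
}


def make_stores(**kwargs):
    """Hypothetical helper: create both stores with identical settings."""
    # Imported lazily so the configuration above can be built without clean_air
    from clean_air.data.storage import create_dataset_store, create_metadata_store

    return create_dataset_store(**kwargs), create_metadata_store(**kwargs)
```

With `clean_air` installed, `make_stores(**store_kwargs)` should return a dataset store and a metadata store that share the same bucket and local storage directory.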

### Advanced options
The helper functions create stores in their most common configurations, but for more control you can create an 
instance by calling the constructor directly and providing the required arguments.

Here's an example of how to pass in credentials programmatically:

In [7]:
import s3fs

from clean_air.data.storage import JasminEndpointUrls
from clean_air.data.storage import S3FSDataSetStore, S3FSMetadataStore

fs = s3fs.S3FileSystem(
    key="AKIAIOSFODNN7EXAMPLE",
    secret="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    client_kwargs={"endpoint_url": JasminEndpointUrls.EXTERNAL}
)

metadata_store_with_custom_credentials = S3FSMetadataStore(fs, storage_bucket_name=TEST_BUCKET_NAME)
dataset_store_with_custom_credentials = S3FSDataSetStore(fs, metadata_store=metadata_store_with_custom_credentials,
                                                         storage_bucket_name=TEST_BUCKET_NAME)

## Uploading Data

In [8]:
from shapely.geometry import Polygon
import tempfile
from pathlib import Path

from clean_air.models import Metadata, DataSet

# Credentials will need to be configured correctly for this to work
dataset_store_with_write_access = create_dataset_store(TEST_BUCKET_NAME, anon=False)

# Create new dataset
with tempfile.TemporaryDirectory() as data_dir_path:
    # Create some test data
    test_datafile = Path(data_dir_path) / "testfile.txt"
    test_datafile.touch()
    metadata = Metadata(dataset_name="TestDataSet",
                        extent=Polygon([[-1, -1], [1, -1], [1, 1], [-1, 1]]))
    test_dataset = DataSet(files=[test_datafile], metadata=metadata)

    # Upload it
    dataset_store_with_write_access.put(test_dataset)

metadata_store_with_write_access = create_metadata_store(TEST_BUCKET_NAME, anon=False)
# Update the metadata...
metadata.description = "This is a test data set"
# ...and upload it
metadata_store_with_write_access.put(metadata)

## Discovering Data
`S3FSDataSetStore` and `S3FSMetadataStore` both have an `available_datasets` method to discover what datasets exist:

In [9]:
print(f"dataset_store.available_datasets = {dataset_store.available_datasets()}")
print(f"metadata_store.available_datasets = {metadata_store.available_datasets()}")


dataset_store.available_datasets = ['TestDataSet', 'testdataset']
metadata_store.available_datasets = ['TestDataSet', 'testdataset']
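
Because `available_datasets` returns the list of dataset IDs, it can be used to check for a dataset before requesting it. The sketch below is an illustration only: `get_if_available` is a hypothetical helper, and `FakeStore` is an in-memory stand-in so the example runs without access to the object store. How the real stores behave when asked for a missing dataset is not covered here.

```python
def get_if_available(store, dataset_id):
    """Hypothetical helper: return store.get(dataset_id), or None if unlisted."""
    if dataset_id not in store.available_datasets():
        return None
    return store.get(dataset_id)


class FakeStore:
    """In-memory stand-in for a store, used only to demonstrate the helper."""

    def __init__(self, items):
        self._items = dict(items)

    def available_datasets(self):
        return list(self._items)

    def get(self, dataset_id):
        return self._items[dataset_id]


fake_store = FakeStore({"testdataset": "dataset object"})
found = get_if_available(fake_store, "testdataset")
missing = get_if_available(fake_store, "no-such-dataset")
```

The same helper should work with either a dataset store or a metadata store, since it only relies on the `available_datasets` and `get` methods that both provide.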


## Downloading Data


In [10]:
test_dataset_id = "testdataset"

dataset = dataset_store.get(test_dataset_id)
print(str(dataset))

metadata = metadata_store.get(test_dataset_id)
print(str(metadata))

DataSet(files=[PosixPath('/var/tmp/tmpe14wm5eo/testdataset/testfile.txt')], metadata=Metadata(dataset_name='TestDataSet', extent=<shapely.geometry.polygon.Polygon object at 0x7f4f032ad910>, crs=<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
, description='This is a test data set', data_type=<DataType.OTHER: 'other'>, contacts=[]))
Metadata(dataset_name='TestDataSet', extent=<shapely.geometry.polygon.Polygon object at 0x7f4f032ad400>, crs=<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridia
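
The printed `DataSet` shows that its files are local paths under the local storage directory, so they can be inspected with ordinary file tools. The following sketch assumes the files are exposed through a `files` attribute (as the printed representation suggests); `summarise_files` is a hypothetical helper, and a `SimpleNamespace` stands in for a downloaded dataset so the example runs on its own.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def summarise_files(dataset):
    """Hypothetical helper: map each local file name to its size in bytes."""
    return {Path(f).name: Path(f).stat().st_size for f in dataset.files}


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "testfile.txt"
    sample_file.write_text("hello")
    # Stand-in for the object returned by dataset_store.get(...)
    stand_in = SimpleNamespace(files=[sample_file])
    summary = summarise_files(stand_in)
```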

## Deployment
For read-only/anonymous access to work correctly, the data storage bucket must have this policy applied:
```json
{"Version": "2008-10-17",
 "Id": "Read Access For Anonymous Users",
 "Statement": [
    {
      "Sid": "Read-only and list bucket access for Everyone",
      "Effect": "Allow",
      "Principal": {"anonymous": ["*"]},
      "Resource": "*",
      "Action": ["GetObject", "ListBucket"]
    }
]
}
```
Note that this policy was written for and tested with the JASMIN object store, so it may need tweaking for use with
other object stores.
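
One possible way to keep the policy under version control is to build it in Python and serialise it with `json`. The sketch below also includes a hypothetical `apply_policy` function that uses `boto3`'s `put_bucket_policy` call; whether a given object store (including JASMIN) accepts policies through this S3 API call is an assumption, and the store's own tooling may be needed instead.

```python
import json

ANONYMOUS_READ_POLICY = {
    "Version": "2008-10-17",
    "Id": "Read Access For Anonymous Users",
    "Statement": [
        {
            "Sid": "Read-only and list bucket access for Everyone",
            "Effect": "Allow",
            "Principal": {"anonymous": ["*"]},
            "Resource": "*",
            "Action": ["GetObject", "ListBucket"],
        }
    ],
}


def apply_policy(bucket_name, endpoint_url, policy=ANONYMOUS_READ_POLICY):
    """Hypothetical helper: attach the policy to a bucket.

    Needs boto3 and credentials with permission to change bucket policies.
    """
    import boto3  # imported lazily so the policy can be built without it

    client = boto3.client("s3", endpoint_url=endpoint_url)
    client.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))


# Serialised form, e.g. for saving to a file or pasting into a web console
policy_json = json.dumps(ANONYMOUS_READ_POLICY, indent=2)
```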

## Developer Note: s3fs vs s3fs-fuse

[s3fs](https://pypi.org/project/s3fs/) is a Python library that provides a filesystem-like interface,
but it uses the S3 APIs under the hood and doesn't interact with the local filesystem.

[s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse) mounts an S3-compatible bucket as a virtual filesystem,
which can be accessed using any method that works for local files.
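
The difference can be sketched as two ways of reading the same object. This is an illustration under assumptions: `read_via_s3fs` and `read_via_mount` are hypothetical helpers, the mount point is whatever directory `s3fs-fuse` was mounted on, and a temporary directory stands in for that mount so the example runs without a real bucket.

```python
import tempfile
from pathlib import Path


def read_via_s3fs(bucket, key, endpoint_url):
    """Read an object through the S3 API using the s3fs Python library."""
    import s3fs  # imported lazily so the rest of the sketch runs without it

    fs = s3fs.S3FileSystem(anon=True, client_kwargs={"endpoint_url": endpoint_url})
    with fs.open(f"{bucket}/{key}", "rb") as remote_file:
        return remote_file.read()


def read_via_mount(mount_point, key):
    """Read the same object as an ordinary file under an s3fs-fuse mount."""
    return (Path(mount_point) / key).read_bytes()


# A temporary directory stands in for a real s3fs-fuse mount point
with tempfile.TemporaryDirectory() as fake_mount:
    (Path(fake_mount) / "testfile.txt").write_bytes(b"example")
    content = read_via_mount(fake_mount, "testfile.txt")
```

With `s3fs` the bucket is addressed explicitly through a filesystem object, whereas with `s3fs-fuse` any code that reads local files works unchanged once the bucket is mounted.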