# Accessing the Clean Air Framework Datastore

Data is stored in an object store on the JASMIN cloud.
Access is encapsulated by the `S3FSDataSetStore` class
(`s3fs` is the underlying library used to access the object store).

<details>
<summary>
    Developer Note: s3fs vs s3fs-fuse
</summary>

[s3fs](https://pypi.org/project/s3fs/) is a Python library that provides a filesystem-like interface,
but it only calls the S3 APIs under the hood and does not interact with the local filesystem.

[s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse) mounts an S3-compatible bucket as a virtual filesystem,
which can then be accessed using any method that works for local files.
</details>

## Credentials
Anonymous access is enabled for the data bucket, but uploading data requires credentials with write permissions.

The `s3fs` documentation on credentials is here: https://s3fs.readthedocs.io/en/latest/#credentials
Because `s3fs` uses AWS's `botocore` library, the credential configuration is the same as for `botocore` and `boto3`.

In summary, the options are (in order of increasing precedence):

* `~/.aws/credentials` file (refer to [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)) 
with contents such as:
```
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
* Setting the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables
* Explicitly creating an `s3fs.S3FileSystem` instance and passing the key and secret as arguments (see "Advanced options" in the next section for details)
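
The precedence order above can be illustrated with a small sketch. The `resolve_credentials` helper below is hypothetical and only mimics the ordering for explanation; it is not how `botocore` or `s3fs` are implemented, and it only looks at the `[default]` profile.

```python
import configparser
import os
from pathlib import Path


def resolve_credentials(key=None, secret=None, environ=None, credentials_file=None):
    """Illustrative only: return (key, secret, source) using the precedence listed above."""
    environ = os.environ if environ is None else environ

    # Highest precedence: values passed explicitly as arguments
    if key and secret:
        return key, secret, "explicit arguments"

    # Next: the two environment variables
    env_key = environ.get("AWS_ACCESS_KEY_ID")
    env_secret = environ.get("AWS_SECRET_ACCESS_KEY")
    if env_key and env_secret:
        return env_key, env_secret, "environment variables"

    # Lowest precedence: the [default] profile of the credentials file
    path = Path(credentials_file) if credentials_file else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section("default"):
        file_key = parser["default"].get("aws_access_key_id")
        file_secret = parser["default"].get("aws_secret_access_key")
        if file_key and file_secret:
            return file_key, file_secret, "credentials file"

    return None, None, "none found"
```

Calling `resolve_credentials()` with no arguments reports which of the three sources would supply credentials in the current environment, which can help when an upload fails with a permissions error.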

## Creating a DataSetStore
The simplest way to get a valid `DataSetStore` instance is to use the convenience function `create_dataset_store`:

```python
from clean_air.data.storage import create_dataset_store

dataset_store = create_dataset_store(storage_bucket_name="test-data")
```

This function supports several parameters that are used to customise the resulting `DataSetStore`:

| Parameter Name        | Description | Default |
| --------------------- | ----------- | ------- |
| `storage_bucket_name` | Name of the bucket where datasets are stored | `caf-data` |
| `local_storage_path`  | Path to a writeable directory to store local copies of dataset files | A temporary directory in the system's `tmp` folder |
| `endpoint_url`        | The object store service endpoint URL. Changes depending on whether data is accessed from inside or outside JASMIN, or is stored on another AWS S3-compatible object store | `clean_air.data.storage.JasminEndpointUrls.EXTERNAL` |
| `anon`                | Whether to use anonymous access or credentials. `anon=False` is required for write access | `True` |
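
As a sketch of how the parameters in the table combine, the hypothetical helper `open_read_only_store` below passes all four by keyword. It assumes that `local_storage_path` accepts a plain path string, which the table does not state explicitly, and it imports `clean_air` lazily so the helper can be defined before the package is installed.

```python
import tempfile


def open_read_only_store(bucket_name="caf-data", cache_dir=None):
    """Create an anonymous, read-only store that keeps local copies in `cache_dir`."""
    # Lazy import: clean_air is only needed when the helper is actually called
    from clean_air.data.storage import JasminEndpointUrls, create_dataset_store

    if cache_dir is None:
        # Use a fresh directory if the caller did not choose one
        cache_dir = tempfile.mkdtemp(prefix="caf-cache-")

    return create_dataset_store(
        storage_bucket_name=bucket_name,
        local_storage_path=cache_dir,  # assumption: a path string is accepted here
        endpoint_url=JasminEndpointUrls.EXTERNAL,
        anon=True,  # read-only access; anon=False is required for uploads
    )
```

Choosing an explicit `cache_dir` makes it easier to find downloaded files afterwards than relying on the default temporary directory.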

### Advanced options
The helper function creates a `DataSetStore` in its most common configurations, but for more control you can create an 
instance by using the constructor directly and providing the required arguments.

Here's an example of how to pass in credentials programmatically:

```python
import s3fs

from clean_air.data.storage import JasminEndpointUrls
from clean_air.data.storage import S3FSDataSetStore

# The key and secret below are placeholder example values; substitute your own
fs = s3fs.S3FileSystem(
    key="AKIAIOSFODNN7EXAMPLE",
    secret="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    client_kwargs={"endpoint_url": JasminEndpointUrls.EXTERNAL})
dataset_store_with_custom_credentials = S3FSDataSetStore(fs)
```

## Uploading Data

```python
import tempfile
from pathlib import Path

from clean_air.data.storage import create_dataset_store
from clean_air.models import MetaData, DataSet

# Credentials will need to be configured correctly for this to work
dataset_store_with_write_access = create_dataset_store("test-data", anon=False)
with tempfile.TemporaryDirectory() as data_dir_path:
    # Create some test data: an empty file wrapped in a DataSet
    test_datafile = Path(data_dir_path) / "testfile.txt"
    test_datafile.touch()
    metadata = MetaData(dataset_name="TestDataSet")
    test_dataset = DataSet(files=[test_datafile], metadata=metadata)

    # Upload it
    dataset_store_with_write_access.put(test_dataset)
```

## Discovering Data
List the names of available datasets:

```python
dataset_store.available_datasets()
```
```
['TestDataSet']
```

## Downloading Data


```python
dataset = dataset_store.get("TestDataSet")
dataset
```
```
DataSet(files=[PosixPath('/var/tmp/tmp60yvm1n0/TestDataSet/testfile.txt')], metadata=MetaData(dataset_name='TestDataSet'))
```
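
Discovery and download can be combined so that a misspelt dataset name fails with a clear message. The `fetch_files` helper below is a hypothetical sketch: it assumes the returned `DataSet` exposes its paths through a `files` attribute, as the output above suggests, and it makes no assumption about what `get` does for an unknown name.

```python
from pathlib import Path


def fetch_files(store, dataset_name):
    """Return the local file paths of `dataset_name`, or raise KeyError if it is not in the store."""
    # Check availability first rather than relying on how get() handles unknown names
    if dataset_name not in store.available_datasets():
        raise KeyError(f"No dataset named {dataset_name!r} in this store")

    dataset = store.get(dataset_name)
    # assumption: DataSet.files is a list of local paths
    return [Path(f) for f in dataset.files]
```

For example, `fetch_files(dataset_store, "TestDataSet")` would return the downloaded paths, which can then be opened like any other local file.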

## Deployment
For read-only/anonymous access to work correctly, the data storage bucket must have this policy applied:
```json
{
  "Version": "2008-10-17",
  "Id": "Read Access For Anonymous Users",
  "Statement": [
    {
      "Sid": "Read-only and list bucket access for Everyone",
      "Effect": "Allow",
      "Principal": {"anonymous": ["*"]},
      "Resource": "*",
      "Action": ["GetObject", "ListBucket"]
    }
  ]
}
```
Note: this policy was written for and tested with the JASMIN object store, so it may need adjusting for use with
other object stores.
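
Before applying an adjusted policy, it can be worth checking that it still grants only read access. The sketch below is a hypothetical helper, not part of `clean_air`; it assumes the policy keeps the same `Statement`, `Effect` and `Action` layout as the one above, and it does not apply the policy to any bucket.

```python
import json

# The only actions the anonymous policy above is meant to allow
READ_ONLY_ACTIONS = {"GetObject", "ListBucket"}


def is_read_only(policy_text):
    """Return True if every "Allow" statement in the policy grants only read actions."""
    policy = json.loads(policy_text)
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            # A single action may be written as a bare string
            actions = [actions]
        if statement["Effect"] == "Allow" and not set(actions) <= READ_ONLY_ACTIONS:
            return False
    return True
```

A policy that adds an upload action to the anonymous statement would make `is_read_only` return `False`, flagging it before it reaches the bucket.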