
Commit

Merge pull request #121 from Cornerstone-OnDemand/assets-azure
Assets azure
antoinejeannot committed Dec 9, 2021
2 parents 66844bc + d40f61a commit 595a20a
Showing 21 changed files with 288 additions and 23 deletions.
2 changes: 1 addition & 1 deletion docs/assets/environment.md
@@ -10,7 +10,7 @@ The parameters necessary to instantiate an `AssetsManager` can all be read from

| Environment variable | Default value | Parameter | Notes |
| --- | --- | --- | --- |
| `MODELKIT_STORAGE_PROVIDER` | `gcs` | `storage_provider` | `gcs` (default), `s3` or `local` |
| `MODELKIT_STORAGE_PROVIDER` | `gcs` | `storage_provider` | `gcs` (default), `s3`, `az` or `local` |
| `MODELKIT_STORAGE_BUCKET` | None | `bucket` | Bucket in which data is stored |
| `MODELKIT_STORAGE_PREFIX` | `modelkit-assets` | `prefix` | Objects prefix |
| `MODELKIT_STORAGE_TIMEOUT_S` | `300` | `timeout_s` | max time when retrying storage downloads |
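The table above can be sketched in code. This is a hedged illustration of how the variables and defaults resolve, not the actual `AssetsManager` code; the helper name is hypothetical:

```python
import os


def storage_settings_from_env():
    # Sketch: resolve storage parameters from the environment,
    # falling back to the documented defaults.
    return {
        "storage_provider": os.environ.get("MODELKIT_STORAGE_PROVIDER", "gcs"),
        "bucket": os.environ.get("MODELKIT_STORAGE_BUCKET"),
        "prefix": os.environ.get("MODELKIT_STORAGE_PREFIX", "modelkit-assets"),
        "timeout_s": int(os.environ.get("MODELKIT_STORAGE_TIMEOUT_S", "300")),
    }
```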
20 changes: 16 additions & 4 deletions docs/assets/storage_provider.md
@@ -28,11 +28,11 @@ Developers may additionally need to be able to push new assets and/or update exi

## Using different providers

The flavor of the remote store that is used depends on the `STORAGE_PROVIDER` environment variables
The flavor of the remote store that is used depends on the `MODELKIT_STORAGE_PROVIDER` environment variable.

### Using AWS S3 storage

Use `STORAGE_PROVIDER=s3` to connect to S3 storage.
Use `MODELKIT_STORAGE_PROVIDER=s3` to connect to S3 storage.

We use [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) under the hood.

@@ -53,7 +53,7 @@ Use `AWS_KMS_KEY_ID` environment variable to set your key and be able to upload 

### GCS storage

Use `STORAGE_PROVIDER=gcs` to connect to GCS storage.
Use `MODELKIT_STORAGE_PROVIDER=gcs` to connect to GCS storage.

We use [google-cloud-storage](https://googleapis.dev/python/storage/latest/index.html).

@@ -65,10 +65,22 @@ By default, the GCS client uses the credentials set up on the machine.

If `GOOGLE_APPLICATION_CREDENTIALS` is provided, it should point to a local JSON service account file, which we use to instantiate the client with `google.cloud.storage.Client.from_service_account_json`

### Using Azure blob storage

Use `MODELKIT_STORAGE_PROVIDER=az` to connect to Azure blob storage.

We use [azure-storage-blob](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python) under the hood.

The client is created by passing the authentication information to `BlobServiceClient.from_connection_string`:

| Environment variable | Note |
| ----------------------- | ----------------------- |
| `AZURE_STORAGE_CONNECTION_STRING` | Azure connection string |
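An Azure connection string is a `;`-separated list of `key=value` pairs. As a hedged sketch (placeholder values; `from_connection_string` does its own parsing, this only illustrates the shape):

```python
def parse_connection_string(conn_str):
    # Sketch: split a connection string into its key/value parts.
    # split("=", 1) keeps "=" characters inside values (e.g. base64 keys) intact.
    return dict(part.split("=", 1) for part in conn_str.split(";") if part)


# Placeholder values, for illustration only
conn = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=myaccount;"
    "AccountKey=aBc123==;"
    "EndpointSuffix=core.windows.net"
)
parts = parse_connection_string(conn)
```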


### `local` mode

Use `STORAGE_PROVIDER=local` to connect to GCS storage.
Use `MODELKIT_STORAGE_PROVIDER=local` to use a local directory as storage.

This is mostly used internally for development, but you can also use another folder on your file system as a storage provider.

2 changes: 1 addition & 1 deletion docs/assets/store_organization.md
@@ -12,7 +12,7 @@ Remote assets are stored in object stores, referenced as:

In this "path":

- `provider` is `s3` or `gcs` or `file` depending on the storage driver (value of `MODELKIT_STORAGE_PROVIDER`)
- `provider` is `s3`, `azfs`, `gcs` or `file` depending on the storage driver (value of `MODELKIT_STORAGE_PROVIDER`)
- `bucket` is the remote container name (`MODELKIT_STORAGE_BUCKET`)

The rest of the "path" is the remote object's name and consists of
1 change: 1 addition & 0 deletions docs/configuration.md
@@ -31,6 +31,7 @@ These variables are necessary to set a remote storage from which to retrieve ass
pointing to a service account credentials JSON file (this is not necessary on dev
machines)
- for `MODELKIT_STORAGE_PROVIDER=s3`, you need to set `AWS_PROFILE`
- for `MODELKIT_STORAGE_PROVIDER=az`, you need to set `AZURE_STORAGE_CONNECTION_STRING` to a connection string

### Assets versioning related environment variable

87 changes: 87 additions & 0 deletions modelkit/assets/drivers/azure.py
@@ -0,0 +1,87 @@
import os
from typing import Optional

from azure.storage.blob import BlobServiceClient
from structlog import get_logger
from tenacity import retry

from modelkit.assets import errors
from modelkit.assets.drivers.abc import StorageDriver
from modelkit.assets.drivers.retry import RETRY_POLICY

logger = get_logger(__name__)


class AzureStorageDriver(StorageDriver):
bucket: str

def __init__(
self,
bucket: Optional[str] = None,
connection_string: Optional[str] = None,
client: Optional[BlobServiceClient] = None,
):
self.bucket = bucket or os.environ.get("MODELKIT_STORAGE_BUCKET") or ""
if not self.bucket:
raise ValueError("Bucket needs to be set for Azure storage driver")

if client:
self.client = client
elif connection_string: # pragma: no cover
self.client = BlobServiceClient.from_connection_string(connection_string)
else:
self.client = BlobServiceClient.from_connection_string(
os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

@retry(**RETRY_POLICY)
def iterate_objects(self, prefix=None):
container = self.client.get_container_client(self.bucket)
for blob in container.list_blobs(prefix=prefix):
yield blob["name"]

@retry(**RETRY_POLICY)
def upload_object(self, file_path, object_name):
blob_client = self.client.get_blob_client(
container=self.bucket, blob=object_name
)
if blob_client.exists():
self.delete_object(object_name)
with open(file_path, "rb") as f:
blob_client.upload_blob(f)

@retry(**RETRY_POLICY)
def download_object(self, object_name, destination_path):
blob_client = self.client.get_blob_client(
container=self.bucket, blob=object_name
)
if not blob_client.exists():
logger.error(
"Object not found.", bucket=self.bucket, object_name=object_name
)
if os.path.exists(destination_path):
os.remove(destination_path)
raise errors.ObjectDoesNotExistError(
driver=self, bucket=self.bucket, object_name=object_name
)
with open(destination_path, "wb") as f:
f.write(blob_client.download_blob().readall())

@retry(**RETRY_POLICY)
def delete_object(self, object_name):
blob_client = self.client.get_blob_client(
container=self.bucket, blob=object_name
)
blob_client.delete_blob()

@retry(**RETRY_POLICY)
def exists(self, object_name):
blob_client = self.client.get_blob_client(
container=self.bucket, blob=object_name
)
return blob_client.exists()

def get_object_uri(self, object_name, sub_part=None):
return "azfs://" + "/".join(
(self.bucket, object_name, *(sub_part or "").split("/"))
)
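The `get_object_uri` join above can be sketched as a standalone function (hypothetical helper, mirroring the driver's logic):

```python
def azure_object_uri(bucket, object_name, sub_part=None):
    # Sketch of the azfs:// URI built by AzureStorageDriver.get_object_uri.
    # When sub_part is None, "".split("/") yields [""], so the URI gains
    # a trailing slash.
    return "azfs://" + "/".join((bucket, object_name, *(sub_part or "").split("/")))
```

For example, `azure_object_uri("assets", "sentiment/1.0.0", "meta/info")` yields `azfs://assets/sentiment/1.0.0/meta/info`.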
2 changes: 1 addition & 1 deletion modelkit/assets/drivers/gcs.py
@@ -80,5 +80,5 @@ def exists(self, object_name):

def get_object_uri(self, object_name, sub_part=None):
return "gs://" + "/".join(
self.bucket, object_name, *(sub_part or "").split("/")
(self.bucket, object_name, *(sub_part or "").split("/"))
)
2 changes: 1 addition & 1 deletion modelkit/assets/drivers/s3.py
@@ -98,5 +98,5 @@ def __repr__(self):

def get_object_uri(self, object_name, sub_part=None):
return "s3://" + "/".join(
self.bucket, object_name, *(sub_part or "").split("/")
(self.bucket, object_name, *(sub_part or "").split("/"))
)
3 changes: 3 additions & 0 deletions modelkit/assets/remote.py
@@ -12,6 +12,7 @@

from modelkit.assets import errors
from modelkit.assets.drivers.abc import StorageDriver
from modelkit.assets.drivers.azure import AzureStorageDriver
from modelkit.assets.drivers.gcs import GCSStorageDriver
from modelkit.assets.drivers.local import LocalStorageDriver
from modelkit.assets.drivers.s3 import S3StorageDriver
@@ -73,6 +74,8 @@ def __init__(
self.driver = S3StorageDriver(**driver_settings)
elif provider == "local":
self.driver = LocalStorageDriver(**driver_settings)
elif provider == "az":
self.driver = AzureStorageDriver(**driver_settings)
else:
raise UnknownDriverError()

1 change: 1 addition & 0 deletions requirements-dev.in
@@ -12,6 +12,7 @@ isort
nox
google-api-core
google-cloud-storage
azure-storage-blob
tenacity
# cli
memory-profiler
51 changes: 47 additions & 4 deletions requirements-dev.txt
@@ -32,15 +32,24 @@ attrs==21.2.0
# -c requirements.txt
# aiohttp
# pytest
azure-core==1.21.1
# via
# -c requirements.txt
# azure-storage-blob
azure-storage-blob==12.9.0
# via
# -c requirements.txt
# -r requirements-dev.in
# -r requirements.in
backports.entry-points-selectable==1.1.1
# via virtualenv
black==21.10b0
# via -r requirements-dev.in
boto3==1.20.21
boto3==1.20.22
# via
# -c requirements.txt
# -r requirements.in
botocore==1.23.21
botocore==1.23.22
# via
# -c requirements.txt
# boto3
@@ -55,7 +64,12 @@ cachetools==4.2.4
certifi==2021.10.8
# via
# -c requirements.txt
# msrest
# requests
cffi==1.15.0
# via
# -c requirements.txt
# cryptography
charset-normalizer==2.0.9
# via
# -c requirements.txt
@@ -81,6 +95,10 @@ commonmark==0.9.1
# rich
coverage==6.1.1
# via -r requirements-dev.in
cryptography==36.0.0
# via
# -c requirements.txt
# azure-storage-blob
deprecated==1.2.13
# via
# -c requirements.txt
@@ -103,7 +121,7 @@ frozenlist==1.2.0
# aiosignal
ghp-import==2.0.2
# via mkdocs
google-api-core==2.2.2
google-api-core==2.3.0
# via
# -c requirements.txt
# -r requirements-dev.in
@@ -132,7 +150,7 @@ google-resumable-media==2.1.0
# via
# -c requirements.txt
# google-cloud-storage
googleapis-common-protos==1.53.0
googleapis-common-protos==1.54.0
# via
# -c requirements.txt
# google-api-core
@@ -152,6 +170,10 @@ importlib-metadata==4.8.2
# via mkdocs
iniconfig==1.1.1
# via pytest
isodate==0.6.0
# via
# -c requirements.txt
# msrest
isort==5.10.1
# via -r requirements-dev.in
jinja2==3.0.2
Expand Down Expand Up @@ -182,6 +204,10 @@ mkdocs-material==7.3.6
# via -r requirements-dev.in
mkdocs-material-extensions==1.0.3
# via mkdocs-material
msrest==0.6.21
# via
# -c requirements.txt
# azure-storage-blob
multidict==5.2.0
# via
# -c requirements.txt
@@ -197,6 +223,10 @@ networkx==2.6.3
# via -r requirements-dev.in
nox==2021.10.1
# via -r requirements-dev.in
oauthlib==3.1.1
# via
# -c requirements.txt
# requests-oauthlib
packaging==21.2
# via
# mkdocs
Expand Down Expand Up @@ -237,6 +267,10 @@ pyasn1-modules==0.2.8
# google-auth
pycodestyle==2.7.0
# via flake8
pycparser==2.21
# via
# -c requirements.txt
# cffi
pydantic==1.8.2
# via
# -c requirements.txt
Expand Down Expand Up @@ -282,8 +316,15 @@ regex==2021.11.2
requests==2.26.0
# via
# -c requirements.txt
# azure-core
# google-api-core
# google-cloud-storage
# msrest
# requests-oauthlib
requests-oauthlib==1.3.0
# via
# -c requirements.txt
# msrest
rich==10.15.2
# via
# -c requirements.txt
@@ -299,8 +340,10 @@ s3transfer==0.5.0
six==1.16.0
# via
# -c requirements.txt
# azure-core
# google-auth
# google-cloud-storage
# isodate
# python-dateutil
# virtualenv
sniffio==1.2.0
1 change: 1 addition & 0 deletions requirements.in
@@ -5,6 +5,7 @@ cachetools
click
filelock
google-cloud-storage
azure-storage-blob
humanize
pydantic
python-dateutil
