Assets azure #121

Merged
merged 9 commits on Dec 9, 2021
2 changes: 1 addition & 1 deletion docs/assets/environment.md
@@ -10,7 +10,7 @@ The parameters necessary to instantiate an `AssetsManager` can all be read from

| Environment variable | Default value | Parameter | Notes |
| --- | --- | --- | --- |
| `MODELKIT_STORAGE_PROVIDER` | `gcs` | `storage_provider` | `gcs` (default), `s3` or `local` |
| `MODELKIT_STORAGE_PROVIDER` | `gcs` | `storage_provider` | `gcs` (default), `s3`, `az` or `local` |
| `MODELKIT_STORAGE_BUCKET` | None | `bucket` | Bucket in which data is stored |
| `MODELKIT_STORAGE_PREFIX` | `modelkit-assets` | `prefix` | Objects prefix |
| `MODELKIT_STORAGE_TIMEOUT_S` | `300` | `timeout_s` | max time when retrying storage downloads |
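
For example, a minimal sketch of configuring these variables before instantiating an `AssetsManager` (all values here are placeholders):

```python
import os

# Placeholder values; substitute your own provider, bucket and prefix.
os.environ["MODELKIT_STORAGE_PROVIDER"] = "az"
os.environ["MODELKIT_STORAGE_BUCKET"] = "my-assets-container"
os.environ["MODELKIT_STORAGE_PREFIX"] = "modelkit-assets"
os.environ["MODELKIT_STORAGE_TIMEOUT_S"] = "300"
```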
20 changes: 16 additions & 4 deletions docs/assets/storage_provider.md
@@ -28,11 +28,11 @@ Developers may additionally need to be able to push new assets and/or update exi

## Using different providers

The flavor of the remote store that is used depends on the `STORAGE_PROVIDER` environment variables
The flavor of the remote store that is used depends on the `MODELKIT_STORAGE_PROVIDER` environment variable.

### Using AWS S3 storage

Use `STORAGE_PROVIDER=s3` to connect to S3 storage.
Use `MODELKIT_STORAGE_PROVIDER=s3` to connect to S3 storage.

We use [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) under the hood.

@@ -53,7 +53,7 @@ Use the `AWS_KMS_KEY_ID` environment variable to set your key and be able to upload

### GCS storage

Use `STORAGE_PROVIDER=gcs` to connect to GCS storage.
Use `MODELKIT_STORAGE_PROVIDER=gcs` to connect to GCS storage.

We use [google-cloud-storage](https://googleapis.dev/python/storage/latest/index.html).

@@ -65,10 +65,22 @@ By default, the GCS client uses the credentials set up on the machine.

If `GOOGLE_APPLICATION_CREDENTIALS` is provided, it should point to a local JSON service account file, which we use to instantiate the client with `google.cloud.storage.Client.from_service_account_json`
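
For example, a sketch assuming a service account file at a placeholder path:

```python
import os

# Placeholder path; when this variable is unset, the client falls back
# to the credentials configured on the machine.
os.environ["MODELKIT_STORAGE_PROVIDER"] = "gcs"
os.environ["MODELKIT_STORAGE_BUCKET"] = "my-bucket"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```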

### Using Azure blob storage

Use `MODELKIT_STORAGE_PROVIDER=az` to connect to Azure blob storage.

We use [azure-storage-blob](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python) under the hood.

The client is created by passing the authentication information to `BlobServiceClient.from_connection_string`:

| Environment variable | Note |
| ----------------------- | ----------------------- |
| `AZURE_STORAGE_CONNECTION_STRING` | Azure connection string |
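
Based on the driver added in this PR, a minimal connection sketch (the container name and connection string are placeholders):

```python
import os

from modelkit.assets.drivers.azure import AzureStorageDriver

# Placeholder connection string, normally taken from the Azure portal.
os.environ["AZURE_STORAGE_CONNECTION_STRING"] = (
    "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."
)

driver = AzureStorageDriver(bucket="my-container")
```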


### `local` mode

Use `STORAGE_PROVIDER=local` to connect to GCS storage.
Use `MODELKIT_STORAGE_PROVIDER=local` to use local storage.

This is mostly used internally for development, but you can also use another folder on your file system as a storage provider.
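
A sketch, assuming (this is not stated in this PR) that the local driver reads the target directory from `MODELKIT_STORAGE_BUCKET`:

```python
import os

os.environ["MODELKIT_STORAGE_PROVIDER"] = "local"
# Assumed semantics: the "bucket" is a directory on the local file system.
os.environ["MODELKIT_STORAGE_BUCKET"] = "/tmp/modelkit-assets"
```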

2 changes: 1 addition & 1 deletion docs/assets/store_organization.md
@@ -12,7 +12,7 @@ Remote assets are stored in object stores, referenced as:

In this "path":

- `provider` is `s3` or `gcs` or `file` depending on the storage driver (value of `MODELKIT_STORAGE_PROVIDER`)
- `provider` is `s3`, `azfs`, `gcs`, or `file`, depending on the storage driver (value of `MODELKIT_STORAGE_PROVIDER`)
- `bucket` is the remote container name (`MODELKIT_STORAGE_BUCKET`)

The rest of the "path" is the remote object's name and consists of
1 change: 1 addition & 0 deletions docs/configuration.md
@@ -31,6 +31,7 @@ These variables are necessary to set a remote storage from which to retrieve ass
pointing to a service account credentials JSON file (this is not necessary on dev
machines)
- for `MODELKIT_STORAGE_PROVIDER=s3`, you need to set `AWS_PROFILE`
- for `MODELKIT_STORAGE_PROVIDER=az`, you need to set `AZURE_STORAGE_CONNECTION_STRING` to a connection string

### Assets versioning related environment variable

87 changes: 87 additions & 0 deletions modelkit/assets/drivers/azure.py
@@ -0,0 +1,87 @@
import os
from typing import Optional

from azure.storage.blob import BlobServiceClient
from structlog import get_logger
from tenacity import retry

from modelkit.assets import errors
from modelkit.assets.drivers.abc import StorageDriver
from modelkit.assets.drivers.retry import RETRY_POLICY

logger = get_logger(__name__)


class AzureStorageDriver(StorageDriver):
    bucket: str

    def __init__(
        self,
        bucket: Optional[str] = None,
        connection_string: Optional[str] = None,
        client: Optional[BlobServiceClient] = None,
    ):
        self.bucket = bucket or os.environ.get("MODELKIT_STORAGE_BUCKET") or ""
        if not self.bucket:
            raise ValueError("Bucket needs to be set for Azure storage driver")

        # Prefer an injected client (useful for tests), then an explicit
        # connection string, then the AZURE_STORAGE_CONNECTION_STRING variable.
        if client:
            self.client = client
        elif connection_string:  # pragma: no cover
            self.client = BlobServiceClient.from_connection_string(connection_string)
        else:
            self.client = BlobServiceClient.from_connection_string(
                os.environ["AZURE_STORAGE_CONNECTION_STRING"]
            )

    @retry(**RETRY_POLICY)
    def iterate_objects(self, prefix=None):
        container = self.client.get_container_client(self.bucket)
        for blob in container.list_blobs(prefix=prefix):
            yield blob["name"]

    @retry(**RETRY_POLICY)
    def upload_object(self, file_path, object_name):
        blob_client = self.client.get_blob_client(
            container=self.bucket, blob=object_name
        )
        # Azure refuses to overwrite an existing blob by default,
        # so delete any previous version first.
        if blob_client.exists():
            self.delete_object(object_name)
        with open(file_path, "rb") as f:
            blob_client.upload_blob(f)

    @retry(**RETRY_POLICY)
    def download_object(self, object_name, destination_path):
        blob_client = self.client.get_blob_client(
            container=self.bucket, blob=object_name
        )
        if not blob_client.exists():
            logger.error(
                "Object not found.", bucket=self.bucket, object_name=object_name
            )
            # Clean up any partial or stale local copy before failing.
            if os.path.exists(destination_path):
                os.remove(destination_path)
            raise errors.ObjectDoesNotExistError(
                driver=self, bucket=self.bucket, object_name=object_name
            )
        with open(destination_path, "wb") as f:
            f.write(blob_client.download_blob().readall())

    @retry(**RETRY_POLICY)
    def delete_object(self, object_name):
        blob_client = self.client.get_blob_client(
            container=self.bucket, blob=object_name
        )
        blob_client.delete_blob()

    @retry(**RETRY_POLICY)
    def exists(self, object_name):
        blob_client = self.client.get_blob_client(
            container=self.bucket, blob=object_name
        )
        return blob_client.exists()

    def get_object_uri(self, object_name, sub_part=None):
        return "azfs://" + "/".join(
            (self.bucket, object_name, *(sub_part or "").split("/"))
        )
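
A hedged round-trip sketch of the new driver, using the methods defined above (the container name, object names and file paths are placeholders, and `AZURE_STORAGE_CONNECTION_STRING` must be set):

```python
from modelkit.assets.drivers.azure import AzureStorageDriver

driver = AzureStorageDriver(bucket="my-container")  # placeholder container

driver.upload_object("local/file.bin", "my-asset/0.0")
assert driver.exists("my-asset/0.0")
print(list(driver.iterate_objects(prefix="my-asset")))
print(driver.get_object_uri("my-asset/0.0"))  # 'azfs://my-container/my-asset/0.0/'
driver.download_object("my-asset/0.0", "downloaded.bin")
driver.delete_object("my-asset/0.0")
```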
2 changes: 1 addition & 1 deletion modelkit/assets/drivers/gcs.py
@@ -80,5 +80,5 @@ def exists(self, object_name):

    def get_object_uri(self, object_name, sub_part=None):
        return "gs://" + "/".join(
            self.bucket, object_name, *(sub_part or "").split("/")
            (self.bucket, object_name, *(sub_part or "").split("/"))
        )
2 changes: 1 addition & 1 deletion modelkit/assets/drivers/s3.py
@@ -98,5 +98,5 @@ def __repr__(self):

    def get_object_uri(self, object_name, sub_part=None):
        return "s3://" + "/".join(
            self.bucket, object_name, *(sub_part or "").split("/")
            (self.bucket, object_name, *(sub_part or "").split("/"))
        )
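
Both of these one-line fixes address the same bug: `str.join` takes a single iterable argument, so passing the path components as separate positional arguments raised a `TypeError`. A quick illustration:

```python
parts = ("my-bucket", "asset-name", "sub/part")

# Before the fix, the components were passed as separate arguments:
# "/".join(parts[0], parts[1], parts[2])  # TypeError: join() takes exactly one argument
# After the fix, they are wrapped in a single tuple:
uri = "s3://" + "/".join(parts)  # 's3://my-bucket/asset-name/sub/part'
```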
3 changes: 3 additions & 0 deletions modelkit/assets/remote.py
@@ -12,6 +12,7 @@

from modelkit.assets import errors
from modelkit.assets.drivers.abc import StorageDriver
from modelkit.assets.drivers.azure import AzureStorageDriver
from modelkit.assets.drivers.gcs import GCSStorageDriver
from modelkit.assets.drivers.local import LocalStorageDriver
from modelkit.assets.drivers.s3 import S3StorageDriver
@@ -73,6 +74,8 @@ def __init__(
            self.driver = S3StorageDriver(**driver_settings)
        elif provider == "local":
            self.driver = LocalStorageDriver(**driver_settings)
        elif provider == "az":
            self.driver = AzureStorageDriver(**driver_settings)
        else:
            raise UnknownDriverError()

1 change: 1 addition & 0 deletions requirements-dev.in
@@ -12,6 +12,7 @@ isort
nox
google-api-core
google-cloud-storage
azure-storage-blob
tenacity
# cli
memory-profiler
51 changes: 47 additions & 4 deletions requirements-dev.txt
@@ -32,15 +32,24 @@ attrs==21.2.0
# -c requirements.txt
# aiohttp
# pytest
azure-core==1.21.1
# via
# -c requirements.txt
# azure-storage-blob
azure-storage-blob==12.9.0
# via
# -c requirements.txt
# -r requirements-dev.in
# -r requirements.in
backports.entry-points-selectable==1.1.1
# via virtualenv
black==21.10b0
# via -r requirements-dev.in
boto3==1.20.21
boto3==1.20.22
# via
# -c requirements.txt
# -r requirements.in
botocore==1.23.21
botocore==1.23.22
# via
# -c requirements.txt
# boto3
@@ -55,7 +64,12 @@ cachetools==4.2.4
certifi==2021.10.8
# via
# -c requirements.txt
# msrest
# requests
cffi==1.15.0
# via
# -c requirements.txt
# cryptography
charset-normalizer==2.0.9
# via
# -c requirements.txt
@@ -81,6 +95,10 @@ commonmark==0.9.1
# rich
coverage==6.1.1
# via -r requirements-dev.in
cryptography==36.0.0
# via
# -c requirements.txt
# azure-storage-blob
deprecated==1.2.13
# via
# -c requirements.txt
@@ -103,7 +121,7 @@ frozenlist==1.2.0
# aiosignal
ghp-import==2.0.2
# via mkdocs
google-api-core==2.2.2
google-api-core==2.3.0
# via
# -c requirements.txt
# -r requirements-dev.in
@@ -132,7 +150,7 @@ google-resumable-media==2.1.0
# via
# -c requirements.txt
# google-cloud-storage
googleapis-common-protos==1.53.0
googleapis-common-protos==1.54.0
# via
# -c requirements.txt
# google-api-core
@@ -152,6 +170,10 @@ importlib-metadata==4.8.2
# via mkdocs
iniconfig==1.1.1
# via pytest
isodate==0.6.0
# via
# -c requirements.txt
# msrest
isort==5.10.1
# via -r requirements-dev.in
jinja2==3.0.2
@@ -182,6 +204,10 @@ mkdocs-material==7.3.6
# via -r requirements-dev.in
mkdocs-material-extensions==1.0.3
# via mkdocs-material
msrest==0.6.21
# via
# -c requirements.txt
# azure-storage-blob
multidict==5.2.0
# via
# -c requirements.txt
@@ -197,6 +223,10 @@ networkx==2.6.3
# via -r requirements-dev.in
nox==2021.10.1
# via -r requirements-dev.in
oauthlib==3.1.1
# via
# -c requirements.txt
# requests-oauthlib
packaging==21.2
# via
# mkdocs
@@ -237,6 +267,10 @@ pyasn1-modules==0.2.8
# google-auth
pycodestyle==2.7.0
# via flake8
pycparser==2.21
# via
# -c requirements.txt
# cffi
pydantic==1.8.2
# via
# -c requirements.txt
@@ -282,8 +316,15 @@ regex==2021.11.2
requests==2.26.0
# via
# -c requirements.txt
# azure-core
# google-api-core
# google-cloud-storage
# msrest
# requests-oauthlib
requests-oauthlib==1.3.0
# via
# -c requirements.txt
# msrest
rich==10.15.2
# via
# -c requirements.txt
@@ -299,8 +340,10 @@ s3transfer==0.5.0
six==1.16.0
# via
# -c requirements.txt
# azure-core
# google-auth
# google-cloud-storage
# isodate
# python-dateutil
# virtualenv
sniffio==1.2.0
1 change: 1 addition & 0 deletions requirements.in
@@ -5,6 +5,7 @@ cachetools
click
filelock
google-cloud-storage
azure-storage-blob
humanize
pydantic
python-dateutil