Blob storage #236

Merged
merged 9 commits into from
Nov 19, 2020

@j-mie j-mie commented Nov 10, 2020

Your checklist for this pull request

  • I've read the contributing guideline.
  • I've tested my changes by building and running the project, and testing changed functionality (if applicable)
  • I've added automated tests for my change (if applicable, optional)
  • I've updated documentation to reflect my change (if applicable)

What is the current behaviour?
Storage only supports local storage

What is the new behaviour?
Storage now supports S3 compatible blob storage (Amazon S3, MinIO etc)

Test plan
Set up MinIO or Amazon S3, then set:

MWDB_STORAGE_PROVIDER=blob
MWDB_UPLOADS_FOLDER=""
MWDB_HASH_PATHING=0
MWDB_BLOB_STORAGE_ENDPOINT=storageserver.example.com
MWDB_BLOB_STORAGE_ACCESS_KEY=key
MWDB_BLOB_STORAGE_SECRET_KEY=key
MWDB_BLOB_STORAGE_BUCKET_NAME=mwdb
MWDB_BLOB_STORAGE_SECURE=1
MWDB_BLOB_STORAGE_REGION_NAME=us-east-1
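
As a rough illustration of how these settings might be consumed, the sketch below reads them from an environment mapping into a small settings object. `BlobStorageSettings` and `load_blob_settings` are hypothetical names and not part of mwdb; only the variable names come from the test plan above, and the way `MWDB_BLOB_STORAGE_SECURE` is parsed is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass
class BlobStorageSettings:
    # Hypothetical container; fields mirror the MWDB_BLOB_STORAGE_* variables.
    endpoint: str
    access_key: str
    secret_key: str
    bucket_name: str
    secure: bool
    region_name: str


def load_blob_settings(env=os.environ):
    """Read the blob storage variables from a mapping (os.environ by default)."""
    prefix = "MWDB_BLOB_STORAGE_"
    return BlobStorageSettings(
        endpoint=env[prefix + "ENDPOINT"],
        access_key=env[prefix + "ACCESS_KEY"],
        secret_key=env[prefix + "SECRET_KEY"],
        bucket_name=env[prefix + "BUCKET_NAME"],
        # Assumption: "1" enables TLS, any other value disables it.
        secure=env.get(prefix + "SECURE", "0") == "1",
        region_name=env.get(prefix + "REGION_NAME", ""),
    )
```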

Closing issues

closes #235


I wrote this before reading #235 (oops...), so there are a few changes I need to make:

  • Avoid receiving whole file into memory and then sending the contents to the MinIO. We should work with streams and pass the chunks on-the-fly.
  • We should avoid using File.get_path and provide more universal interface not related to the file storage. Keep in mind that file contents can be used by plugins e.g. https://github.com/CERT-Polska/mwdb-plugin-drakvuf/blob/master/mwdb_plugin_drakvuf/plugin.py#L24 and that can be challenging.
  • Additionally I'm happy to swap out boto3 for the MinIO client library if that's something that's desired
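
To make the streaming point concrete, here is a minimal sketch of copying a file between two streams in fixed-size chunks, so that only one chunk is held in memory at a time. `copy_in_chunks` is a hypothetical helper, not code from this pull request, and the in-memory `BytesIO` objects stand in for the upload stream and the storage backend.

```python
import hashlib
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk; return (bytes_copied, sha256_hex)."""
    digest = hashlib.sha256()
    total = 0
    # iter() with a sentinel keeps calling src.read until it returns b"".
    for chunk in iter(lambda: src.read(chunk_size), b""):
        digest.update(chunk)
        dst.write(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


src = io.BytesIO(b"A" * 200_000)  # stand-in for an incoming upload
dst = io.BytesIO()                # stand-in for the storage backend
copied, sha256 = copy_in_chunks(src, dst)
```

Hashing while copying is optional here; it is included because a sample's hash can then be computed without a second pass over the data.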

@msm-code msm-code left a comment

First round of comments, focusing on code quality


msm-code commented Nov 12, 2020

Additionally, as you've said:

Additionally I'm happy to swap out boto3 for the MinIO client library if that's something that's desired

We really prefer the MinIO client library here, mostly for consistency, as we use it everywhere else, but also because it's much lighter and works better for us in some use cases. It should be 100% compatible with other S3 implementations; at least we never had any problems.

Avoid receiving whole file into memory and then sending the contents to the MinIO. We should work with streams and pass the chunks on-the-fly.

It's also quite important, especially for large files. OTOH mwdb usually works with small-to-medium files only, so this may not be a P1 issue right now. I can't find a working Flask example anywhere, but this may help (it uses the FastAPI library instead of Flask, though).

We should avoid using File.get_path and provide more universal interface not related to the file storage. Keep in mind that file contents can be used by plugins e.g. https://github.com/CERT-Polska/mwdb-plugin-drakvuf/blob/master/mwdb_plugin_drakvuf/plugin.py#L24 and that can be challenging

I'll contest psrok1 here a bit. I think what you did here is OK. We should just add another function called get_local_path to the file object, and use it in plugins. It should return a path for local files, or download a file to a temporary location for remote files. It should be relatively straightforward to implement (just need to remember to remove temporary files when they leave the request scope).

Co-authored-by: msm-code <msm2e4d534d@gmail.com>

j-mie commented Nov 13, 2020

Additionally, as you've said:

Additionally I'm happy to swap out boto3 for the MinIO client library if that's something that's desired

We really prefer the MinIO client library here, mostly for consistency, as we use it everywhere else, but also because it's much lighter and works better for us in some use cases. It should be 100% compatible with other S3 implementations; at least we never had any problems.

I'll switch over to MinIO, it looks like I can do streaming responses.

Avoid receiving whole file into memory and then sending the contents to the MinIO. We should work with streams and pass the chunks on-the-fly.

It's also quite important, especially for large files. OTOH mwdb usually works with small-to-medium files only, so this may not be a P1 issue right now. I can't find a working Flask example anywhere, but this may help (it uses the FastAPI library instead of Flask, though).

I think we can use Flask's stream_with_context for downloads. It might make sense to look at the upload functionality in the future too, if it doesn't use chunking, but as you said, it's not very often that you're going to be analyzing binaries bigger than a few MB.
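
For the download side, a sketch of the kind of generator a streaming response could consume is shown below. `iter_chunks` is a hypothetical helper that works with any object exposing `read()`; it is not the code that ended up in this pull request.

```python
import io


def iter_chunks(stream, chunk_size=8192):
    """Yield successive chunks from a readable binary stream, then close it."""
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        # Runs when the stream is exhausted or the consumer stops early.
        stream.close()


body = io.BytesIO(b"x" * 20000)  # stand-in for an object fetched from storage
chunks = list(iter_chunks(body))
```

In a Flask view, a generator like this could be wrapped as `Response(stream_with_context(iter_chunks(obj)), mimetype="application/octet-stream")`, so that chunks are sent as they are read rather than after the whole object has been buffered.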

We should avoid using File.get_path and provide more universal interface not related to the file storage. Keep in mind that file contents can be used by plugins e.g. https://github.com/CERT-Polska/mwdb-plugin-drakvuf/blob/master/mwdb_plugin_drakvuf/plugin.py#L24 and that can be challenging

I'll contest psrok1 here a bit. I think what you did here is OK. We should just add another function called get_local_path to the file object, and use it in plugins. It should return a path for local files, or download a file to a temporary location for remote files. It should be relatively straightforward to implement (just need to remember to remove temporary files when they leave the request scope).

I honestly wasn't a fan of the presumption that everything was going to be on disk. I think it makes most sense to expose an API that lets you get the binary in memory (avoiding disk writes for operations that could be done in memory); if you need it on disk, you can write it out yourself. That said, if it makes more sense from a usage perspective to just expose an API that writes it to disk, I'm happy to do that too. I don't want to spend too much time on a fancy API for different file providers, but I'm happy to implement something basic so that plugins continue to work.


j-mie commented Nov 17, 2020

I got upload working but seem to be having some issues with the download: I just get empty files :/ Any ideas?


j-mie commented Nov 18, 2020

I got streaming to work! I've implemented some methods on the file model to allow reading of files regardless of blob provider. It's a little bit opinionated in that it doesn't expose a way to write a file to disk locally, but I figure if you need that you could just read the file and write it to disk yourself.

For the drakvuf plugin you'd be able to do this:

--- a	2020-11-18 12:14:34.000000000 +0000
+++ b	2020-11-18 12:14:50.000000000 +0000
@@ -1,8 +1,6 @@
-        # Get contents path from "uploads" directory
-        contents_path = file.get_path()
         # Send request to Drakvuf Sandbox
         req = requests.post(f"{config.drakvuf.drakvuf_url}/upload", files={
-            "file": (file.sha256 + ".exe", open(contents_path, "rb")),
+            "file": (file.sha256 + ".exe", file),
         }, data={
             "timeout": config.drakvuf.timeout
         })

See:
https://github.com/psf/requests/blob/8149e9fe54c36951290f198e90d83c8a0498289c/requests/models.py#L158-L159
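
The diff above relies on requests accepting any object with a `read()` method in `files`. A minimal sketch of a provider-independent, file-like object follows; `DiskBackend`, `MemoryBackend` and `StoredFile` are hypothetical names used for illustration and do not reflect the actual mwdb model.

```python
import io
import os
import tempfile


class DiskBackend:
    """Hypothetical backend that reads objects from a directory."""

    def __init__(self, root):
        self.root = root

    def open(self, name):
        return open(os.path.join(self.root, name), "rb")


class MemoryBackend:
    """Hypothetical stand-in for a remote bucket, backed by a dict."""

    def __init__(self, objects):
        self.objects = objects

    def open(self, name):
        return io.BytesIO(self.objects[name])


class StoredFile:
    """File-like view of a stored object, independent of the backend."""

    def __init__(self, backend, name):
        self.backend = backend
        self.name = name

    def read(self):
        # Open, read and close on every call so no handle is left dangling.
        with self.backend.open(self.name) as fh:
            return fh.read()


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample"), "wb") as fh:
        fh.write(b"MZ\x90\x00")
    on_disk = StoredFile(DiskBackend(root), "sample").read()

in_bucket = StoredFile(MemoryBackend({"sample": b"MZ\x90\x00"}), "sample").read()
```

Callers only see `read()`, so a plugin written against this interface would not need to know where the bytes live.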


msm-code commented Nov 18, 2020

Thank you for all your work so far!

Almost there! But we really need an alternative to get_path, as Paweł said in his comment earlier (we use it in our plugins, even in the karton one, so it's pretty important).

Can you add a method that returns a local path if the file is stored locally, or downloads the file to a temporary location if the file is stored on s3?

I've written some code that shows what I mean; maybe it'll be useful. It uses contextlib to create a simple context manager:

import tempfile
from contextlib import contextmanager

@contextmanager
def get_local_path(self):
    """ Gets a path to a file stored locally, or downloads it to a
    temporary file for remote resources """

    if app_config.mwdb.storage_provider == StorageProviderType.DISK:
        yield self._calculate_path()
        # return here, otherwise the generator would yield a second time
        return

    # 1. create a temporary file
    # 2. download file from minio to this temporary file
    # 3. yield path to this file
    # 4. delete the file
    #
    # for example (using the tempfile library):
    with tempfile.NamedTemporaryFile() as tf:
        # TODO: download file from minio to temporary file `tf` here
        # (https://docs.python.org/3/library/tempfile.html#examples)
        yield tf.name


def main():
    # `file` stands for a File model instance
    with file.get_local_path() as lp:
        print("ok, local path exists", lp)
    print("ok, temporary file was deleted")
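
A self-contained variant of the same idea, runnable outside mwdb, might look like the sketch below. The `stored_path` and `remote_stream` parameters are hypothetical: they replace the storage-provider check and the MinIO download, neither of which is shown in this thread.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def local_path(stored_path=None, remote_stream=None):
    """Yield a filesystem path for a file, downloading it first if remote.

    Pass stored_path for a file already on disk, or remote_stream (any
    readable binary stream) for a file held in remote storage.
    """
    if stored_path is not None:
        yield stored_path
        return  # local file: nothing to clean up

    with tempfile.NamedTemporaryFile() as tf:
        shutil.copyfileobj(remote_stream, tf)
        tf.flush()  # make the bytes visible to readers that open tf.name
        yield tf.name
    # leaving the inner with block deletes the temporary file


with local_path(remote_stream=io.BytesIO(b"payload")) as path:
    with open(path, "rb") as fh:
        data = fh.read()
    existed = os.path.exists(path)
gone = not os.path.exists(path)
```

Reopening a `NamedTemporaryFile` by name while it is still open works on POSIX systems but not on Windows, so this sketch assumes a POSIX host.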

Co-authored-by: msm <msm@tailcall.net>
@msm-code msm-code self-requested a review November 19, 2020 15:40

@msm-code msm-code left a comment

LGTM

@msm-code msm-code merged commit 4f458de into CERT-Polska:master Nov 19, 2020
@j-mie j-mie deleted the blob-storage branch November 19, 2020 15:41
Development

Successfully merging this pull request may close these issues.

Support for MinIO storage
2 participants