Blob storage #236

j-mie · 2020-11-10T14:50:17Z

Your checklist for this pull request

I've read the contributing guideline.
I've tested my changes by building and running the project, and testing changed functionality (if applicable)
I've added automated tests for my change (if applicable, optional)
I've updated documentation to reflect my change (if applicable)

What is the current behaviour?
Storage only supports local storage

What is the new behaviour?
Storage now supports S3 compatible blob storage (Amazon S3, MinIO etc)

Test plan
Set up MinIO or Amazon S3, then set:

MWDB_STORAGE_PROVIDER=blob
MWDB_UPLOADS_FOLDER=""
MWDB_HASH_PATHING=0
MWDB_BLOB_STORAGE_ENDPOINT=storageserver.example.com
MWDB_BLOB_STORAGE_ACCESS_KEY=key
MWDB_BLOB_STORAGE_SECRET_KEY=key
MWDB_BLOB_STORAGE_BUCKET_NAME=mwdb
MWDB_BLOB_STORAGE_SECURE=1
MWDB_BLOB_STORAGE_REGION_NAME=us-east-1
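
For reference, a minimal sketch of how the variables above might be read into one settings object. The `BlobStorageSettings` dataclass and the `load_blob_settings` helper are hypothetical names used only for illustration and are not taken from this PR; treating `1` as "TLS enabled" for `MWDB_BLOB_STORAGE_SECURE` is likewise an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class BlobStorageSettings:
    """Hypothetical container for the blob storage options listed above."""
    endpoint: str
    access_key: str
    secret_key: str
    bucket_name: str
    secure: bool
    region_name: Optional[str]


def load_blob_settings(env: Mapping[str, str] = os.environ) -> BlobStorageSettings:
    """Read the MWDB_BLOB_STORAGE_* variables from a mapping (default: os.environ)."""
    prefix = "MWDB_BLOB_STORAGE_"
    return BlobStorageSettings(
        endpoint=env[prefix + "ENDPOINT"],
        access_key=env[prefix + "ACCESS_KEY"],
        secret_key=env[prefix + "SECRET_KEY"],
        bucket_name=env[prefix + "BUCKET_NAME"],
        # Assumption: "1" enables TLS, any other value disables it.
        secure=env.get(prefix + "SECURE", "0") == "1",
        # An empty or missing region is treated as "not set".
        region_name=env.get(prefix + "REGION_NAME") or None,
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration without touching the real environment.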

Closing issues

closes #235

I wrote this before reading #235 (oops...), so there are a few changes I need to make:

Avoid receiving whole file into memory and then sending the contents to the MinIO. We should work with streams and pass the chunks on-the-fly.
We should avoid using File.get_path and provide more universal interface not related to the file storage. Keep in mind that file contents can be used by plugins e.g. https://github.com/CERT-Polska/mwdb-plugin-drakvuf/blob/master/mwdb_plugin_drakvuf/plugin.py#L24 and that can be challenging.
Additionally I'm happy to swap out boto3 for the MinIO client library if that's something that's desired

msm-code

First round of comments, focusing on code quality

Files commented on: mwdb/model/file.py, mwdb/core/config.py, mwdb/resources/download.py

msm-code · 2020-11-12T15:13:23Z

Additionally, as you've said:

Additionally I'm happy to swap out boto3 for the MinIO client library if that's something that's desired

We really prefer the minio client library here. Mostly because of consistency, as we use it everywhere else, but also because it's much lighter and works better for us in some use cases. It should be 100% compatible with other S3 implementations; at least we have never had any problems.

Avoid receiving whole file into memory and then sending the contents to the MinIO. We should work with streams and pass the chunks on-the-fly.

It's also quite important, especially for large files. OTOH mwdb usually works with small to medium files only, so this may not be a P1 issue right now. I can't find a working Flask example anywhere, but this may help (it uses the FastAPI library instead of Flask, though).

We should avoid using File.get_path and provide more universal interface not related to the file storage. Keep in mind that file contents can be used by plugins e.g. https://github.com/CERT-Polska/mwdb-plugin-drakvuf/blob/master/mwdb_plugin_drakvuf/plugin.py#L24 and that can be challenging

I'll contest psrok1 here a bit. I think what you did here is OK. We should just add another function called get_local_path to the file object, and use it in plugins. It should return a path for local files, or download a file to a temporary location for remote files. It should be relatively straightforward to implement (just need to remember to remove temporary files when they leave the request scope).

j-mie · 2020-11-13T23:16:55Z

Additionally, as you've said:

Additionally I'm happy to swap out boto3 for the MinIO client library if that's something that's desired

We really prefer the minio client library here. Mostly because of consistency, as we use it everywhere else, but also because it's much lighter and works better for us in some use cases. It should be 100% compatible with other S3 implementations; at least we have never had any problems.

I'll switch over to MinIO, it looks like I can do streaming responses.

Avoid receiving whole file into memory and then sending the contents to the MinIO. We should work with streams and pass the chunks on-the-fly.

It's also quite important, especially for large files. OTOH mwdb usually works with small to medium files only, so this may not be a P1 issue right now. I can't find a working Flask example anywhere, but this may help (it uses the FastAPI library instead of Flask, though).

I think we can use Flask's stream_with_context for downloads. It might make sense to look at the upload functionality in the future too, if it doesn't use chunking, but as you said, it's not often that you'll be analyzing binaries bigger than a few MB.
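
A rough sketch of the chunked-download idea, for illustration only. The `iter_chunks` name and the 64 KiB default chunk size are assumptions, not code from this PR. In a Flask view the generator could be wrapped as `Response(stream_with_context(iter_chunks(stream)))`; that part is left out so the snippet runs with the standard library alone.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the stream's contents piece by piece instead of reading it all at once."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read means the stream is exhausted.
            break
        yield chunk


# Example: a 10-byte payload read in 4-byte chunks.
chunks = list(iter_chunks(io.BytesIO(b"0123456789"), chunk_size=4))
```

At most one chunk is held in memory at a time, which is the point of streaming for large samples.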

We should avoid using File.get_path and provide more universal interface not related to the file storage. Keep in mind that file contents can be used by plugins e.g. https://github.com/CERT-Polska/mwdb-plugin-drakvuf/blob/master/mwdb_plugin_drakvuf/plugin.py#L24 and that can be challenging

I'll contest psrok1 here a bit. I think what you did here is OK. We should just add another function called get_local_path to the file object, and use it in plugins. It should return a path for local files, or download a file to a temporary location for remote files. It should be relatively straightforward to implement (just need to remember to remove temporary files when they leave the request scope).

I honestly wasn't a fan of the presumption that everything was going to be on disk. I think it makes the most sense to expose an API that lets you get the binary in memory (which avoids writing to disk for operations that can be done in memory); if you need it on disk, you can write it out yourself. That said, if it makes more sense from a usage perspective to just expose an API that writes the file to disk, I'm happy to do that too. I don't want to spend too much time on a fancy API for different file providers, but I'm happy to implement something basic so that plugins continue to work.

j-mie · 2020-11-17T20:32:51Z

I got upload working, but I'm having some issues with the download: I just seem to be getting empty files :/ Any ideas?

j-mie · 2020-11-18T12:15:27Z

I got streaming to work! I've implemented some methods on the file model to allow reading of files regardless of blob provider. It's a little bit opinionated in that it doesn't expose a way to write a file to disk locally, but I figure if you need that you could just read the file and write it to disk yourself.

For the drakvuf plugin you'd be able to do this:

--- a	2020-11-18 12:14:34.000000000 +0000
+++ b	2020-11-18 12:14:50.000000000 +0000
@@ -1,8 +1,6 @@
-        # Get contents path from "uploads" directory
-        contents_path = file.get_path()
         # Send request to Drakvuf Sandbox
         req = requests.post(f"{config.drakvuf.drakvuf_url}/upload", files={
-            "file": (file.sha256 + ".exe", open(contents_path, "rb")),
+            "file": (file.sha256 + ".exe", file),
         }, data={
             "timeout": config.drakvuf.timeout
         })

See:
https://github.com/psf/requests/blob/8149e9fe54c36951290f198e90d83c8a0498289c/requests/models.py#L158-L159
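
The change above works because the upload only needs an object with a `read()` method. A minimal sketch of that idea follows; `LazyFileReader` is a hypothetical name and is not the implementation added in this PR. The `opener` callback stands in for whatever opens the local file or the blob storage response.

```python
import io
from typing import BinaryIO, Callable, Optional


class LazyFileReader:
    """Hypothetical file-like wrapper that opens the underlying stream on first read."""

    def __init__(self, opener: Callable[[], BinaryIO]) -> None:
        # `opener` could open a local file or fetch an object from blob storage.
        self._opener = opener
        self._stream: Optional[BinaryIO] = None

    def read(self, size: int = -1) -> bytes:
        if self._stream is None:
            self._stream = self._opener()
        return self._stream.read(size)

    def close(self) -> None:
        if self._stream is not None:
            self._stream.close()
            self._stream = None


# Example with an in-memory stream standing in for the stored sample.
reader = LazyFileReader(lambda: io.BytesIO(b"MZ\x90\x00"))
header = reader.read(2)
rest = reader.read()
reader.close()
```

Callers never see where the bytes come from, so the same plugin code can serve both storage providers.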

msm-code · 2020-11-18T18:26:54Z

Thank you for all your work so far!

Almost there! But we really need an alternative to get_path, as Paweł said in his comment earlier (we use it in our plugins, even in the karton one, so it's pretty important).

Can you add a method that returns a local path if the file is stored locally, or downloads the file to a temporary location if the file is stored on s3?

I've written some code that shows what I mean; maybe it'll be useful. It uses contextlib to create a simple context manager:

import tempfile
from contextlib import contextmanager

# Intended as a method of the File model (it uses `self`)
@contextmanager
def with_local_path(self):
    """ Gets a path to a file stored locally, or downloads it to a
    temporary file for remote resources """

    if app_config.mwdb.storage_provider == StorageProviderType.DISK:
        yield self._calculate_path()
        # Stop here, otherwise the generator would go on and yield a second time
        return

    # 1. create a temporary file
    # 2. download file from minio to this temporary file
    # 3. yield path to this file
    # 4. delete the file
    #
    # for example (using the tempfile library):
    with tempfile.NamedTemporaryFile() as tf:
        # TODO: download file from minio to temporary file `tf` here
        # (https://docs.python.org/3/library/tempfile.html#examples)
        yield tf.name


def main(file):
    # `file` is assumed to be a File object exposing with_local_path()
    with file.with_local_path() as lp:
        print("ok, local path exists", lp)
    print("ok, temporary file was deleted")
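
To make the four steps above concrete, here is a self-contained sketch with the download step filled in by a stand-in. The `local_path` function, its `stored_path` and `download` parameters, and `fake_download` are all hypothetical; a real version would fetch the object from MinIO instead of writing sample bytes.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def local_path(stored_path: Optional[str],
               download: Callable[[str], None]) -> Iterator[str]:
    """Yield a local path; for remote files, download to a temporary file first.

    stored_path: path on disk if the file is stored locally, else None.
    download:    hypothetical callback that writes the remote object to a path.
    """
    if stored_path is not None:
        yield stored_path
        return

    # mkstemp returns an open descriptor; close it so `download` can reopen by name.
    fd, tmp_path = tempfile.mkstemp()
    os.close(fd)
    try:
        download(tmp_path)
        yield tmp_path
    finally:
        # Runs even if the caller's block raises, so the file never lingers.
        os.remove(tmp_path)


def fake_download(path: str) -> None:
    """Stand-in for fetching the object from blob storage."""
    with open(path, "wb") as f:
        f.write(b"sample")


with local_path(None, fake_download) as lp:
    existed_inside = os.path.exists(lp)
    with open(lp, "rb") as f:
        content = f.read()
exists_after = os.path.exists(lp)
```

This sketch uses `tempfile.mkstemp` rather than `NamedTemporaryFile` because on some platforms (notably Windows) a `NamedTemporaryFile` cannot be reopened by name while it is still open; either approach fits the outline above.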

Files commented on: mwdb/core/util.py, mwdb/resources/download.py, mwdb/model/file.py, tests/backend/test_file.py, mwdb/core/config.py

msm-code

LGTM

Blob Storage implementation

b07443c

j-mie force-pushed the blob-storage branch from 122dcdb to b07443c Compare November 10, 2020 15:21

msm-code suggested changes Nov 12, 2020

Apply suggestions from code review

e3bee16

Co-authored-by: msm-code <msm2e4d534d@gmail.com>

j-mie added 2 commits November 17, 2020 18:55

Fix default storage provider

a2522bc

Replace Boto3 with MinIO

40b2350

Fix download response streaming

8ce1e7c

Add file read API

9d535cc

j-mie force-pushed the blob-storage branch from ebfbed8 to 9d535cc Compare November 18, 2020 12:29

msm-code suggested changes Nov 18, 2020

Cleanup

688a1b3

msm-code reviewed Nov 18, 2020

Delete bad test

6b7b014

j-mie force-pushed the blob-storage branch from 4a69cd4 to 6b7b014 Compare November 18, 2020 19:12

Fix old variable usage

e1b636e

Co-authored-by: msm <msm@tailcall.net>

msm-code self-requested a review November 19, 2020 15:40

msm-code approved these changes Nov 19, 2020

msm-code merged commit 4f458de into CERT-Polska:master Nov 19, 2020

j-mie deleted the blob-storage branch November 19, 2020 15:41
