Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow pickle when loading numpy array file #836

Merged
merged 4 commits into from Oct 4, 2023
Merged

Conversation

tomglk
Copy link
Contributor

@tomglk tomglk commented Jul 12, 2023

Hi, we had had problems loading a detector which we stored using save_detector.

When we want to load it using load_detector, we get the error value error: 'Object arrays cannot be loaded when allow_pickle=False'

Currently we apply a hack of overwriting np.load with the parameter allow_pickle=True in order to be able to use the loading mechanism. (https://stackoverflow.com/a/56243777)

However, we think that it makes sense to set this param in the loading function itself.

If you have any questions, feel free to ask.
Any tips on how to solve this issue with other ways are also more than welcome. :)

Thanks, Tobi & Tom

Co-authored-by: tokaessm

@jklaise jklaise requested a review from ascillitoe July 12, 2023 17:13
@codecov
Copy link

codecov bot commented Jul 12, 2023

Codecov Report

Merging #836 (ac43508) into master (4a1b4f7) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #836      +/-   ##
==========================================
- Coverage   81.98%   81.91%   -0.07%     
==========================================
  Files         159      159              
  Lines       10375    10375              
==========================================
- Hits         8506     8499       -7     
- Misses       1869     1876       +7     
Files Coverage Δ
alibi_detect/saving/loading.py 93.92% <100.00%> (ø)

... and 1 file with indirect coverage changes

@ascillitoe
Copy link
Contributor

Hi @tomglk, thank you for contributing this.

Are you able to share why you need to serialise/deserialise dtype='object' ndarray's here? We'd be interested to know since this isn't a use case we had envisioned (although that doesn't mean it isn't valid!), and want to be sure there is strong motivation for allow_pickle=True due to the associated CVE-2019-644.

@tomglk
Copy link
Contributor Author

tomglk commented Jul 21, 2023

Hi @ascillitoe ,
I know that allow_pickle = True has bad security implications. I would welcome any other solution.

We store strings, that seems to be the problem.

This is a minimal example that breaks:

data = pd.DataFrame(
    columns=["c1", "c2", "c3"],
    data=[
        ["42", 1, datetime.now().timestamp()],
        ["42", 2, datetime.now().timestamp()],
        ["42", 3, datetime.now().timestamp()],
    ]
)

detector = TabularDrift(
    x_ref=data.to_numpy(),
    p_val=0.05,
    x_ref_preprocessed=True
)

save_detector(
    detector,
    "detector"
)

loaded_detector = load_detector(
    "detector"
)

Changing "42" to 42 fixes it.

It also breaks when you remove the .timestamp(), but storing a timestamp instead of a datetime object is not a problem imo.

Edit:
Setting the c1 column as category-type & adding the param categories_per_feature={0: None} also does not change the behaviour.

@tomglk
Copy link
Contributor Author

tomglk commented Jul 22, 2023

New commit does not set the value to true all the time (because that is unnecessary, if you do not have the problem).
I see that the first commit was overly aggressive in that regard.

Instead, the value can now be controlled via parameter.

Intended usage:

loaded_detector = load_detector(
    "detector",
    enable_unsafe_loading=True
)

This way users who actually have the problem do not have to hack the np.load()-function themselves. But they are forced to make a conscious and informed decision (CVE you mentioned is linked in docstrings).

Tests:
I did not touch the tests, because I cannot get the project running locally.
I can try to make it work next week. Or look into test cases without running anything, but that would be mildly horrible.

tomglk added 2 commits September 8, 2023 20:51
Merge branch 'master' of github.com:SeldonIO/alibi-detect
@tomglk
Copy link
Contributor Author

tomglk commented Sep 8, 2023

Hi @ascillitoe,

I added some tests for load_detector(). Also demonstrating that the problem occurs and can be prevented with the new parameter.

What do you think? Is there anything else that is missing or could the code be merged like this?

@jklaise
Copy link
Member

jklaise commented Oct 4, 2023

@tomglk apologies for the late reply. We were initially hesitant about this change as it essentially means implicitly supporting data types in numpy arrays that we have been cautious to support. We would like to at some point rethink the data formats that we officially support (with examples, docs and everything), but that's currently not a priority. Since this unblocks your use case we're happy to merge it (will likely have a patch release out within a couple of weeks). Thanks for your contribution!

@jklaise jklaise self-requested a review October 4, 2023 13:33
@jklaise jklaise merged commit d188e02 into SeldonIO:master Oct 4, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants