# Serialization: More than Pickling
PyCon US 2022  
Joe Lucas

Notebooks available: https://github.com/JosephTLucas/pycon22

API: https://ebird.org/home

Plug: https://operationcode.org

## What is "serialization"?

**to serialize** (_verb_): to translate a data structure into something that can be _stored_, _transmitted_, or _reconstructed_ later.

Serialization is not the storing, transmitting, or reconstructing itself. It is only the conversion of data into a format conducive to those actions.

### Types

Plaintext vs. **Binary**
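
To make the distinction concrete, here is a minimal sketch that serializes the same dictionary both ways using only the standard library. The `record` contents are made up for illustration.

```python
import json
import pickle

# Hypothetical data, used only for illustration.
record = {"species": "American Robin", "count": 3}

# Plaintext: a human-readable str that any JSON parser can consume.
as_text = json.dumps(record)

# Binary: a bytes object meant to be read back by Python's pickle module.
as_bytes = pickle.dumps(record)

print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, len(as_bytes), "bytes")

# Both formats reconstruct an equal dictionary.
assert json.loads(as_text) == record
assert pickle.loads(as_bytes) == record
```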

### Why?

You've spent hours training a machine learning model. How do you save it and use it later?

You've built an in-memory object with costly, time-dependent queries (e.g. a snapshot). How do you share it with colleagues?

In [None]:
from utils import EBIRD_KEY

In [None]:
import requests

class Bird_Counter:
    def __init__(self, count=0):
        self.count = count

    def get_birds(self):
        # Recent observations for Utah (region code US-UT)
        url = "https://api.ebird.org/v2/data/obs/US-UT/recent"
        headers = {"X-eBirdApiToken": EBIRD_KEY}
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        # Count observation records in the JSON array, not characters in the raw text
        self.count = len(response.json())

In [None]:
%%time
b = Bird_Counter()
b.get_birds()
print(f"There were {b.count} birds.")

## [Pickle](https://docs.python.org/3/library/pickle.html)

In [None]:
import pickle

with open("bird.pkl", "wb") as f:
    pickle.dump(b, f)

In [None]:
with open("bird.pkl", "rb") as f:
    c = pickle.load(f)

print(f"There were {c.count} birds.")

**Let's share our pickle with a friend.**

### Pros

1. Standard Library
2. We didn't have to define a schema
3. Well documented

    a. "Consider signing data with `hmac` if you need to ensure that it has not been tampered with."
    
    b. [Comparison with json](https://docs.python.org/3/library/pickle.html#comparison-with-json)
    
    c. [Format](https://docs.python.org/3/library/pickle.html#data-stream-format)
    
    d. [What can be (un)pickled?](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled)
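
The `hmac` suggestion quoted above could look roughly like the sketch below. It assumes the sender and receiver already share a secret key; `SECRET_KEY`, `sign`, and `verify_and_load` are hypothetical names, not part of the pickle API.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice it would be distributed out of band.
SECRET_KEY = b"replace-with-a-real-secret"


def sign(obj):
    """Pickle obj and return (payload, signature)."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return payload, signature


def verify_and_load(payload, signature):
    """Unpickle payload only if its signature matches."""
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


payload, signature = sign({"count": 42})
restored = verify_and_load(payload, signature)
print(restored)
```

A signature only shows that the bytes came from someone holding the key. It does not make the pickle itself any safer to load.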

### Cons

1. Security Considerations
2. Only interoperable with Python
3. `load` still requires access to the class definition
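
One mitigation for the security concern, described in the pickle documentation, is to restrict which globals an unpickler is allowed to resolve. The sketch below is adapted from that documentation; the allow-list contents are an assumption chosen for illustration.

```python
import builtins
import io
import pickle

# Illustrative allow-list; anything not named here is rejected.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global (class or function).
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but only allow-listed globals may be loaded."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


print(restricted_loads(pickle.dumps([1, 2, range(3)])))
```

With this allow-list, a custom class such as `Bird_Counter` is rejected as well, so this approach only suits payloads built from plain data types.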

### For those of you with pandas lying around...

In [None]:
import pandas as pd

pd.to_pickle(b, "pandas.pkl")

### How is it used in big projects?

https://github.com/scikit-learn/scikit-learn/search?l=Python&p=2&q=pickle

https://github.com/tensorflow/tensorflow/search?p=2&q=pickle

https://github.com/numpy/numpy/search?p=1&q=pickle

## Bad Friends

In [None]:
import os

class Bird_Counter:
    def __reduce__(self):
        # pickle calls the returned callable with these arguments at load time
        cmd = "echo this is bad"
        return os.system, (cmd,)

r = Bird_Counter()

In [None]:
with open("bad.pkl", "wb") as f:
    pickle.dump(r, f)
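
If a friend hands you a pickle like this, you can look inside it without loading it. The sketch below uses `pickletools.genops` from the standard library, which walks the opcodes without executing them. The `Harmless` class is a hypothetical stand-in that swaps `os.system` for `print` so the example has no side effects.

```python
import pickle
import pickletools


class Harmless:
    def __reduce__(self):
        # print stands in for os.system; nothing runs until the pickle is loaded.
        return print, ("this would have run at load time",)


data = pickle.dumps(Harmless())

# Collect opcode names and string arguments without ever unpickling.
opcode_names = []
strings = []
for opcode, arg, _position in pickletools.genops(data):
    opcode_names.append(opcode.name)
    if isinstance(arg, str):
        strings.append(arg)

# A REDUCE opcode means "call a callable with arguments" during load.
print("REDUCE" in opcode_names)
print(strings)
```

Seeing `REDUCE` next to an unexpected callable name is a strong hint not to load the file. The fickling project listed in the references automates this kind of inspection.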

## Things that use Pickle

[Shelve](https://docs.python.org/3/library/shelve.html)

[Joblib](https://joblib.readthedocs.io/en/latest/)

"In the specific case of scikit-learn, it may be better to use joblib’s replacement of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators"

-- https://scikit-learn.org/stable/modules/model_persistence.html#python-specific-serialization

The more we know about our data structure, the more we can optimize our serialization. If you're serializing an object from a library, check to see if there's an established method.

Ex: [NetworkX](https://networkx.org/documentation/stable/reference/readwrite/index.html)
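
As a rough illustration of that point, the sketch below compares generic pickling with a format that exploits what we know about the data. It assumes every value is a non-negative integer that fits in an unsigned 16-bit slot; the `counts` data is made up.

```python
import pickle
from array import array

# Hypothetical data: 10,000 small non-negative integers.
counts = list(range(10_000))

# Generic: pickle must tag every element because it knows nothing about the list.
generic = pickle.dumps(counts)

# Specialized: typecode "H" packs each value as an unsigned short,
# which only works because we assume all values fit in that range.
packed = array("H", counts).tobytes()

print(len(generic), len(packed))

# Round trip to confirm nothing was lost.
restored = array("H")
restored.frombytes(packed)
assert restored.tolist() == counts
```

The packed form is smaller but carries no type information of its own, so the reader has to know the layout in advance.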

## Alternatives

## [Dill](https://pypi.org/project/dill/)

"dill provides the user the **same interface as the pickle module**, and also includes some additional features. In addition to pickling python objects, dill provides the ability to **save the state of an interpreter session** in a single command. Hence, it would be feasable to **save an interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the ‘saved’ state** of the original interpreter session."

"dill can be used to store python objects to a file, but the **primary usage is to send python objects across the network as a byte stream**. dill is quite flexible, and allows arbitrary user defined classes and functions to be serialized. Thus dill is **not intended to be secure against erroneously or maliciously constructed data**."

In [None]:
import dill

In [None]:
import requests

class Dog_Counter:
    def __init__(self, count=0):
        self.count = count

    def get_dogs(self):
        # Same eBird endpoint as before; only the class name differs
        url = "https://api.ebird.org/v2/data/obs/US-UT/recent"
        headers = {"X-eBirdApiToken": EBIRD_KEY}
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        # Count observation records in the JSON array, not characters in the raw text
        self.count = len(response.json())

In [None]:
d = Dog_Counter()
d.get_dogs()
print(f"There were {d.count} dogs.")

In [None]:
with open("dog.dill", "wb") as f:
    dill.dump(d, f)

## [msgpack](https://github.com/msgpack/msgpack-python)

In [None]:
import msgpack

with open("msgpack_bird.bin", "wb") as f:
    msgpack.dump(b.__dict__, f)

## Disengage Autopilot

Everything after this point will require us to start thinking about schema.
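
Before reaching for a library, here is a standard-library sketch of what "thinking about schema" can mean: declaring the fields and their types up front, then serializing only those. `CountRecord` is a hypothetical name, and the type check is one possible design, not a required pattern.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CountRecord:
    """Hypothetical schema with a single integer field."""

    count: int

    def __post_init__(self):
        # Dataclasses do not enforce annotations, so check the type explicitly.
        # bool is excluded because it is a subclass of int.
        if not isinstance(self.count, int) or isinstance(self.count, bool):
            raise TypeError("count must be an int")


# Serialize: only the declared fields end up in the output.
encoded = json.dumps(asdict(CountRecord(count=42)))

# Deserialize: the schema validates the data on the way back in.
decoded = CountRecord(**json.loads(encoded))

print(encoded, decoded)
```

Unlike a pickle, the output is plain JSON that any language can read, and loading it cannot execute code.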

## [Marshmallow](https://github.com/marshmallow-code/marshmallow)

In [None]:
from marshmallow import Schema, fields

CountSchema = Schema.from_dict(
    {"count": fields.Int()}
)
schema = CountSchema()

In [None]:
result = schema.dumps(b)
print(result)

In [None]:
# `result` is already a JSON string, so write it directly;
# json.dump(result, f) would encode it a second time.
with open("marshmallow_bird.json", "w") as f:
    f.write(result)

## Others

[Avro](https://avro.apache.org/docs/current/gettingstartedpython.html), [Protobuf](https://developers.google.com/protocol-buffers/docs/pythontutorial)

## References

https://github.com/trailofbits/fickling

https://davidhamann.de/2020/04/05/exploiting-python-pickle/