# Serialization: More than Pickling  
Alternatively: Serialization, an unopinionated introduction  
PyCon US 2022  
Joe Lucas

Notebooks available: https://github.com/JosephTLucas/pycon22

API: https://ebird.org/home

Plug: https://operationcode.org

## What is "serialization"?

**to serialize** (_verb_): to translate a data structure into something that can be _stored_, _transmitted_, or _reconstructed_ later.

Not storing, transmitting, or reconstructing. Just converting the data into a format conducive to those actions.

### Types

Plaintext vs. **Binary**
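
A minimal sketch of the distinction, using only the standard library. The `sighting` record is a made-up stand-in for one eBird observation, not real API output.

```python
import json
import pickle

# Hypothetical record, loosely shaped like one observation from the API.
sighting = {"comName": "American Robin", "lat": 40.76, "lng": -111.89}

text_form = json.dumps(sighting)      # plaintext: a str you can read in an editor
binary_form = pickle.dumps(sighting)  # binary: a bytes object meant for machines

print(type(text_form).__name__, text_form)
print(type(binary_form).__name__, len(binary_form), "bytes")

# Both forms can be turned back into an equivalent dict.
restored_from_text = json.loads(text_form)
restored_from_binary = pickle.loads(binary_form)
```

Either result can be written to disk or sent over a socket; serialization itself only produced the `str` or `bytes`.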

### Why?

You've spent hours training a machine learning model. How do you save it and use it later?

You've built an in-memory object with costly, time-dependent queries (e.g. a snapshot). How do you share it with colleagues?

In [None]:
from utils import EBIRD_KEY

In [None]:
import requests
import json


class Bird_Counter:
    def __init__(self, variety=None, locations=None):
        # Avoid mutable default arguments: a shared [] would leak between instances.
        self.variety = variety if variety is not None else []
        self.locations = locations if locations is not None else []

    def get_birds(self):
        # Recent observations for Utah (eBird region code US-UT).
        url = "https://api.ebird.org/v2/data/obs/US-UT/recent"
        headers = {"X-eBirdApiToken": EBIRD_KEY}
        response = requests.get(url, headers=headers)
        observations = response.json()  # parse the response body once
        self.variety = [x["comName"] for x in observations]
        self.locations = [(x["lat"], x["lng"]) for x in observations]

## API Orientation

In [None]:
url = "https://api.ebird.org/v2/data/obs/US-UT/recent"
payload = {}
headers = {"X-eBirdApiToken": EBIRD_KEY}
response = requests.request("GET", url, headers=headers, data=payload)
result = json.loads(response.text)
result[0]

In [None]:
%%time
b = Bird_Counter()
b.get_birds()
print(f"There were {len(b.variety)} birds.")
print(f"One of the birds was a {b.variety[0]}.")
print(f"One of the locations was {b.locations[0]}.")

## [Pickle](https://docs.python.org/3/library/pickle.html)

In [None]:
import pickle

with open("bird.pkl", "wb") as f:
    pickle.dump(b, f)

In [None]:
with open("bird.pkl", "rb") as f:
    c = pickle.load(f)

print(f"There were {len(c.variety)} birds.")
print(f"One of the birds was a {c.variety[0]}.")
print(f"One of the locations was {c.locations[0]}.")

In [None]:
# False: Bird_Counter defines no __eq__, so == falls back to identity comparison.
b == c

**Let's share our pickle with a friend.**

### Pros

1. Standard Library
2. We didn't have to define a schema
3. Well documented

    a. "Consider signing data with `hmac` if you need to ensure that it has not been tampered with."
    
    b. [Comparison with json](https://docs.python.org/3/library/pickle.html#comparison-with-json)
    
    c. [Format](https://docs.python.org/3/library/pickle.html#data-stream-format)
    
    d. [What can be (un)pickled?](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled)
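
The `hmac` suggestion quoted above could look roughly like this. It is a sketch, not a vetted protocol: `SECRET_KEY`, `dumps_signed`, and `loads_signed` are hypothetical names, and the key would have to be shared out of band.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical shared key
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign(payload: bytes) -> bytes:
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()


def dumps_signed(obj) -> bytes:
    # Prefix the pickle bytes with their HMAC.
    payload = pickle.dumps(obj)
    return sign(payload) + payload


def loads_signed(blob: bytes):
    # Verify the HMAC before handing anything to pickle.loads().
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if not hmac.compare_digest(digest, sign(payload)):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


snapshot = {"variety": ["Mallard"], "locations": [(40.76, -111.89)]}
blob = dumps_signed(snapshot)

# Flip one bit in the payload to simulate tampering.
tampered = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    loads_signed(tampered)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

Signing only shows the bytes came from someone holding the key; it does not make the pickle format itself any safer.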

### Cons

1. Security considerations
2. Only interoperable with Python
3. `load` still requires access to the class definition
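
To see why the class definition is needed at load time, it helps to look inside a pickle. The sketch below uses `fractions.Fraction` from the standard library as a stand-in for `Bird_Counter`; the same idea applies to any class.

```python
import pickle
import pickletools
from fractions import Fraction

data = pickle.dumps(Fraction(1, 3))

# The stream names the module and the class, but carries none of the class's code.
has_module_name = b"fractions" in data
has_class_name = b"Fraction" in data
print(has_module_name, has_class_name)

# Disassemble the opcodes without executing them.
pickletools.dis(data)

# Loading works here only because `fractions.Fraction` can be imported.
restored = pickle.loads(data)
```

A friend who receives `bird.pkl` is in the same position: their interpreter has to be able to find `Bird_Counter` under the same module and name.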

### For those of you with pandas lying around...

In [None]:
import pandas as pd

pd.to_pickle(b, "pandas.pkl")

## Things that use Pickle

[Shelve](https://docs.python.org/3/library/shelve.html)

For example, we could create a new `Bird_Counter()` every day, pickle it, and use the date as a key (keys must be strings).
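
A sketch of that idea, assuming a plain dict as a stand-in for a `Bird_Counter` snapshot and an arbitrary example date. The shelf lives in a temporary directory so that the example cleans up after itself.

```python
import os
import shelve
import tempfile
from datetime import date

# Hypothetical snapshot; in the notebook this would be a Bird_Counter instance.
snapshot = {
    "variety": ["American Robin", "Mallard"],
    "locations": [(40.76, -111.89), (40.70, -111.80)],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bird_snapshots")
    key = date(2022, 4, 29).isoformat()  # shelve keys must be strings

    # Values are pickled behind the scenes when assigned.
    with shelve.open(path) as db:
        db[key] = snapshot

    # Reopen the shelf and read the day's snapshot back.
    with shelve.open(path) as db:
        restored = db[key]
        stored_keys = list(db.keys())
```

Because shelve pickles its values, the same pros and cons listed for pickle carry over.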

[Joblib](https://joblib.readthedocs.io/en/latest/)

"In the specific case of scikit-learn, it may be better to use joblib’s replacement of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators"

-- https://scikit-learn.org/stable/modules/model_persistence.html#python-specific-serialization

The more we know about our data structure, the more we can optimize our serialization. If you're serializing an object from a library, check to see if there's an established method.

## Related Projects

[Ice Pickle](https://github.com/koaning/icepickle): a safe way to serialize and deserialize linear scikit-learn models. (backed by hdf5)

[Fickling](https://github.com/trailofbits/fickling): decompiler, static analyzer, and bytecode rewriter for Python pickle object serializations. Fickling can take pickled data streams and decompile them into human-readable Python code that, when executed, will deserialize to the original serialized object.

## Alternatives

## [Dill](https://pypi.org/project/dill/)

"dill provides the user the **same interface as the pickle module**, and also includes some additional features. In addition to pickling python objects, dill provides the ability to **save the state of an interpreter session** in a single command. Hence, it would be feasable to **save an interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the ‘saved’ state** of the original interpreter session."

"dill can be used to store python objects to a file, but the **primary usage is to send python objects across the network as a byte stream**. dill is quite flexible, and allows arbitrary user defined classes and functions to be serialized. Thus dill is **not intended to be secure against erroneously or maliciously constructed data**."

In [None]:
import dill

In [None]:
import requests


class Dog_Counter:
    def __init__(self, variety=None, locations=None):
        # Avoid mutable default arguments: a shared [] would leak between instances.
        self.variety = variety if variety is not None else []
        self.locations = locations if locations is not None else []

    def get_dogs(self):
        # Reuses the eBird endpoint from Bird_Counter; only the names differ.
        url = "https://api.ebird.org/v2/data/obs/US-UT/recent"
        headers = {"X-eBirdApiToken": EBIRD_KEY}
        response = requests.get(url, headers=headers)
        observations = response.json()  # parse the response body once
        self.variety = [x["comName"] for x in observations]
        self.locations = [(x["lat"], x["lng"]) for x in observations]

In [None]:
d = Dog_Counter()
d.get_dogs()
print(f"There were {len(d.variety)} dogs.")

In [None]:
with open("dog.dill", "wb") as f:
    dill.dump(d, f)

## [msgpack](https://github.com/msgpack/msgpack-python)

"MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller."

In [None]:
import msgpack

with open("msgpack_bird.bin", "wb") as f:
    msgpack.dump(b.__dict__, f)

## Disengage Autopilot

Everything after this point will require us to start thinking about schema.
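
What "thinking about schema" means can be sketched with the standard library alone. This is not Marshmallow, just a hand-rolled illustration; `BirdCount` and `from_dict` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class BirdCount:
    # The schema: which fields exist and what type each one holds.
    variety: List[str] = field(default_factory=list)
    locations: List[Tuple[float, float]] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "BirdCount":
        # Coerce incoming data to the declared types; a missing key raises KeyError.
        variety = [str(name) for name in raw["variety"]]
        locations = [(float(lat), float(lng)) for lat, lng in raw["locations"]]
        return cls(variety, locations)


count = BirdCount(["Mallard"], [(40.76, -111.89)])
text = json.dumps(asdict(count))
restored = BirdCount.from_dict(json.loads(text))
```

The schema is what lets `from_dict` turn JSON's lists back into tuples. Schema libraries do this kind of bookkeeping for you.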

## [Marshmallow](https://github.com/marshmallow-code/marshmallow)

See also [Flask-Marshmallow](https://github.com/marshmallow-code/flask-marshmallow) and [Django REST Marshmallow](https://github.com/marshmallow-code/django-rest-marshmallow)

In [None]:
from marshmallow import Schema, fields

CountSchema = Schema.from_dict(
    {
        "variety": fields.List(fields.String),
        "locations": fields.List(fields.Tuple((fields.Float, fields.Float))),
    }
)
schema = CountSchema()

In [None]:
result = schema.dumps(b)

In [None]:
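# schema.dumps() already returned a JSON string, so write it as-is;
# passing it through json.dump() would encode it a second time.
with open("marshmallow_bird.json", "w") as f:
    f.write(result)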

## Others

[Avro](https://avro.apache.org/docs/current/gettingstartedpython.html), [Protobuf](https://developers.google.com/protocol-buffers/docs/pythontutorial)

## Bonus Serialization

How were we receiving data from the eBird API? `json`

### Pros
1. Human-readable
2. Available everywhere

### Cons
1. Repeats the schema (the field names) in every record
2. Limited types
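
Both cons can be shown with a few lines of standard-library code. The records below are made up for illustration.

```python
import json
from datetime import date

# Limited types: tuples come back as lists, and non-string keys become strings.
record = {"location": (40.76, -111.89), 1: "numeric key"}
round_tripped = json.loads(json.dumps(record))
print(round_tripped)

# Some common types are not supported at all without a custom encoder.
try:
    json.dumps({"seen_on": date(2022, 4, 29)})
    date_ok = True
except TypeError:
    date_ok = False

# The field names travel with every record.
observations = [{"comName": "Mallard", "lat": 40.7, "lng": -111.9}] * 3
key_repeats = json.dumps(observations).count("comName")
```

Schema-based formats such as Avro and Protobuf avoid the repetition by keeping the field names in the schema rather than in each record.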

In [None]:
json_birds = dict()

json_birds["variety"] = [x["comName"] for x in json.loads(response.text)]
json_birds["locations"] = [(x["lat"], x["lng"]) for x in json.loads(response.text)]

with open("birds.json", "w") as f:
    json.dump(json_birds, f)

## Time and Space

### JSON

In [None]:
%%timeit
with open("bird.json", "w") as f:
    json.dump(json_birds, f)

In [None]:
import os

os.path.getsize("bird.json")

### Pickle

In [None]:
%%timeit
with open("bird.pkl", "wb") as f:
    pickle.dump(b, f)

In [None]:
os.path.getsize("bird.pkl")

### Dill

In [None]:
%%timeit
with open("bird.dill", "wb") as f:
    dill.dump(b, f)

In [None]:
os.path.getsize("bird.dill")

### msgpack

In [None]:
%%timeit
with open("msgpack_bird.bin", "wb") as f:
    msgpack.dump(b.__dict__, f)

In [None]:
os.path.getsize("msgpack_bird.bin")

### Marshmallow

In [None]:
result = schema.dumps(b)

In [None]:
%%timeit
with open("marshmallow_bird.json", "w") as f:
    f.write(result)  # result is already a JSON string from schema.dumps()

In [None]:
os.path.getsize("marshmallow_bird.json")

## Complex Objects

In [None]:
def test(target):
    try:
        with open("test.json", "w") as f:
            json.dump(target, f)
        with open("test.json", "r") as f:
            json.load(f)
        print("JSON succeeds")
    except Exception:
        print("JSON fails")
    try:
        with open("test.pkl", "wb") as f:
            pickle.dump(target, f)
        with open("test.pkl", "rb") as f:
            pickle.load(f)
        print("Pickle succeeds")
    except Exception:
        print("Pickle fails")
    try:
        with open("test.dill", "wb") as f:
            dill.dump(target, f)
        with open("test.dill", "rb") as f:
            dill.load(f)
        print("Dill succeeds")
    except Exception:
        print("Dill fails")

In [None]:
import numpy as np

test(np.random.rand(5, 5))

In [None]:
test(print)

In [None]:
test(lambda x: x * 2)
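
The function results are less surprising once you know that pickle stores functions by reference (module plus name), not by value. A standard-library sketch, with `can_dump` as a hypothetical helper:

```python
import json
import pickle


def can_dump(dumps, target):
    # Return True if `dumps` can serialize `target`, False if it raises.
    try:
        dumps(target)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# print has an importable name (builtins.print), so pickle can reference it.
print("pickle + print :", can_dump(pickle.dumps, print))

# A lambda has no importable name, so pickle has nothing to reference.
print("pickle + lambda:", can_dump(pickle.dumps, lambda x: x * 2))

# JSON has no notion of callables at all.
print("json + print   :", can_dump(json.dumps, print))
print("json + lambda  :", can_dump(json.dumps, lambda x: x * 2))
```

Dill extends pickle so that objects like lambdas can be serialized by value, which is why it behaves differently here.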

## Schema Versioning and Evolution

In [None]:
class FakeObject:
    def __init__(self, a=0, b=0):
        self.a = a
        self.b = b

In [None]:
version1 = FakeObject(1, 1)

In [None]:
with open("fake_v1.pkl", "wb") as f:
    pickle.dump(version1, f)

**New dev joins project**

Schemas can change over time. If you don't explicitly account for this versioning or evolution, your code may behave in unexpected ways (or not at all).

This may not matter for hobby projects, but think about large client/server architectures where clients may be operating on different schema versions. This is the support that you get from Avro/Protobuf without building the logic yourself.
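
A sketch of how this can go wrong with pickle. The second class definition is a hypothetical change (renaming `b` to `c` and adding a `total` method), standing in for whatever the new developer does.

```python
import pickle


class FakeObject:  # version 1, as defined above
    def __init__(self, a=0, b=0):
        self.a = a
        self.b = b


v1_bytes = pickle.dumps(FakeObject(1, 1))


class FakeObject:  # version 2: hypothetical change, `b` renamed to `c`
    def __init__(self, a=0, c=0):
        self.a = a
        self.c = c

    def total(self):
        return self.a + self.c


# Loading succeeds: pickle restores the saved attributes without calling __init__.
loaded = pickle.loads(v1_bytes)
has_b = hasattr(loaded, "b")
has_c = hasattr(loaded, "c")

# The failure only shows up later, when new code touches the missing attribute.
try:
    loaded.total()
    total_ok = True
except AttributeError:
    total_ok = False
```

The object loads without complaint and looks like a current `FakeObject`, but it still carries the old attributes.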

## Bad Pickles

In [None]:
import os


class Bird_Counter:
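    def __reduce__(self):
        # pickle calls __reduce__ to learn how to rebuild the object; the callable
        # it returns is invoked at load time, here os.system(cmd).
        cmd = "echo this is bad"
        return os.system, (cmd,)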


r = Bird_Counter()
# ref: https://davidhamann.de/2020/04/05/exploiting-python-pickle/

In [None]:
with open("bad.pkl", "wb") as f:
    pickle.dump(r, f)
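
One mitigation described in the pickle documentation is to restrict which globals an unpickler may resolve. The sketch below follows that pattern; `Malicious`, `SafeUnpickler`, and `safe_loads` are hypothetical names, and an empty allow-list is an assumption that only suits plain containers and primitives.

```python
import io
import os
import pickle


class Malicious:
    def __reduce__(self):
        # Same trick as above: ask the loader to call os.system.
        return os.system, ("echo this is bad",)


class SafeUnpickler(pickle.Unpickler):
    ALLOWED = set()  # (module, name) pairs that may be resolved; none by default

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")
        return super().find_class(module, name)


def safe_loads(data: bytes):
    return SafeUnpickler(io.BytesIO(data)).load()


# Plain containers need no globals, so they load normally.
plain = safe_loads(pickle.dumps({"variety": ["Mallard"], "locations": [(40.76, -111.89)]}))

# The malicious payload is rejected before os.system is ever looked up.
try:
    safe_loads(pickle.dumps(Malicious()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

This narrows the attack surface but is not a complete defense; the safest option is still to unpickle only data you trust.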

## Conclusion

Your serialization format is a design decision that impacts speed, interoperability, and security. Choose the right tool for the right job.