# Intro to Pickle Library
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.

In [1]:
%%writefile createpickle1.py
import pickle

obj = {"a": 1, "b": {"c": 2, "d": 3}}
with open("sampleobj.pkl", "wb") as f:
    pickle.dump(obj, f)

Writing createpickle1.py


In [2]:
%%writefile readpickle1.py
import pickle

with open("sampleobj.pkl", "rb") as f:
    obj = pickle.load(f)
print(obj)

Writing readpickle1.py


In [3]:
!python createpickle1.py

In [4]:
!python readpickle1.py

{'a': 1, 'b': {'c': 2, 'd': 3}}
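
The same round trip also works without a file. As a minimal sketch, `pickle.dumps()` returns the byte stream as a `bytes` object and `pickle.loads()` rebuilds the object from it. The variable names below are only illustrative.

```python
import pickle

obj = {"a": 1, "b": {"c": 2, "d": 3}}

# Serialize to an in-memory bytes object instead of a file.
data = pickle.dumps(obj)

# Rebuild an equal, but distinct, object from the byte stream.
restored = pickle.loads(data)

print(type(data).__name__)  # bytes
print(restored == obj)      # True
print(restored is obj)      # False
```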


## `__setstate__` and `__getstate__`

Classes can further influence how their instances are pickled; if the class defines the method `__getstate__()`, it is called and the returned object is pickled as the contents for the instance, instead of the contents of the instance’s dictionary. If the `__getstate__()` method is absent, the instance’s `__dict__` is pickled as usual.

Upon unpickling, if the class defines `__setstate__()`, it is called with the unpickled state. In that case, there is no requirement for the state object to be a dictionary. Otherwise, the pickled state must be a dictionary and its items are assigned to the new instance’s dictionary.
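
A common use of this pair is to leave out an attribute that cannot be pickled and recreate it on load. The sketch below is one possible illustration, not part of the original notebook: the `Counter` class and its `_lock` attribute are hypothetical, and it assumes the code runs as a script so that the class can be found in `__main__`.

```python
import pickle
import threading


class Counter:
    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def __getstate__(self):
        # Copy the instance dict and drop the entry pickle cannot handle.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


c = Counter()
c.count = 5
restored = pickle.loads(pickle.dumps(c))
print(restored.count)  # 5
```

Without `__getstate__()`, the `pickle.dumps(c)` call would raise a `TypeError` because of the lock.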

In [5]:
%%writefile createpickle2.py
import pickle

class TestState:
    def __init__(self, val):
        self.val = val
    
    def __getstate__(self):
        print("Inside __getstate__")
        self.val *= 2
        return self.__dict__
    
    def __setstate__(self, state):
        print(f"Inside __setstate__. Unpickled with dict: {repr(state)}")
        self.__dict__ = state
        self.val *= 2
        
obj1 = TestState(2)
print(f"Before pickling value :{obj1.val}")
pickleobj = pickle.dumps(obj1)
obj2 = pickle.loads(pickleobj)
print(f"After unpickling value: {obj2.val}")

Writing createpickle2.py


In [6]:
!python createpickle2.py

Before pickling value :2
Inside __getstate__
Inside __setstate__. Unpickled with dict: {'val': 4}
After unpickling value: 8


Pickle does not call the methods described above directly. They are part of the copy protocol, which is implemented through the `__reduce__()` special method. The copy protocol provides a unified interface for retrieving the data necessary for pickling and copying objects.

Although powerful, implementing `__reduce__()` directly in your classes is error prone. For this reason, class designers should use the high-level interface (i.e., `__getnewargs_ex__()`, `__getstate__()` and `__setstate__()`) whenever possible.
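
To see that these hooks belong to a shared copy protocol rather than to pickle alone, the sketch below runs `copy.copy()` on an instance. The `Point` class and the `calls` list are made up for illustration. The `copy` module goes through `__reduce_ex__()`, which in turn uses `__getstate__()` and `__setstate__()`, and the state here is a tuple rather than a dictionary.

```python
import copy

calls = []  # records which hooks were invoked, in order


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getstate__(self):
        calls.append("__getstate__")
        return (self.x, self.y)  # the state does not have to be a dict

    def __setstate__(self, state):
        calls.append("__setstate__")
        self.x, self.y = state


p = Point(1, 2)
q = copy.copy(p)

print(calls)     # ['__getstate__', '__setstate__']
print(q.x, q.y)  # 1 2
print(q is p)    # False
```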

## Edge case 1

In [7]:
%%writefile createpickle3.py
import pickle

class PickleFail:
    def __init__(self, attr):
        self.attr = attr
    
    def __getstate__(self):
        return self.attr
    
    def __setstate__(self, state):
        self.attr = state
        print(f"In set state, attr={self.attr}")

pf = PickleFail("a")

with open("file.pkl", "wb") as f:
    pickle.dump(pf, f)

print("Loading where PickleFail is already defined in the namespace")

with open("file.pkl", "rb") as f:
    obj = pickle.load(f)

Writing createpickle3.py


In [8]:
!python createpickle3.py

Loading where PickleFail is already defined in the namespace
In set state, attr=a


In [9]:
%%writefile readpickle3.py
import pickle

print("Loading where PickleFail is not in the namespace yet")
with open("file.pkl", "rb") as f:
    obj = pickle.load(f)

Writing readpickle3.py


In [10]:
!python readpickle3.py

Loading where PickleFail is not in the namespace yet
Traceback (most recent call last):
  File "/home/safiuddin/deep_deep_dive/python/readpickle3.py", line 5, in <module>
    obj = pickle.load(f)
AttributeError: Can't get attribute 'PickleFail' on <module '__main__' from '/home/safiuddin/deep_deep_dive/python/readpickle3.py'>


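This happens because a pickle stores only a reference to the class (its module and name), not the class definition. `createpickle3.py` ran as a script, so the class was recorded as `__main__.PickleFail`. When `readpickle3.py` loads the file, its own `__main__` module has no `PickleFail` attribute, so the lookup fails.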
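
One way to check what a pickle records for an instance is to list its `GLOBAL` opcodes with `pickletools.genops()`. The sketch below uses `fractions.Fraction` from the standard library as a stand-in for `PickleFail`, since it is importable from anywhere.

```python
import pickle
import pickletools
from fractions import Fraction

data = pickle.dumps(Fraction(1, 2), protocol=0)

# Each GLOBAL opcode carries a "module name" pair; no class body is stored.
global_refs = [
    arg for opcode, arg, _pos in pickletools.genops(data)
    if opcode.name == "GLOBAL"
]
print(global_refs)  # ['fractions Fraction']
```

In practice, the usual way around the error is to define the class in a module that both the writing and the reading script import.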

## Don't load any pickle you find on the street

In [11]:
%%writefile createpickle4.py
import pickle

class Trojan:
    def __init__(self, a=10):
        self.a = a
    def __setstate__(self, state):
        print("Running malicious code here")

trojan = Trojan()

with open("trojan.pkl", "wb") as f:
    pickle.dump(trojan, f)

Writing createpickle4.py


In [12]:
!python createpickle4.py

In [13]:
%%writefile readpickle4.py
import pickle
from createpickle4 import Trojan
with open("trojan.pkl", "rb") as f:
    pickle.load(f)

Writing readpickle4.py


In [14]:
!python readpickle4.py

Running malicious code here
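
The reader does not even need to have the class available. As a hedged sketch (the `Payload` class is hypothetical and the callable is deliberately harmless), a `__reduce__()` method can tell pickle to rebuild the object by calling any importable function, and that call happens inside `pickle.loads()`.

```python
import pickle


class Payload:
    def __reduce__(self):
        # Ask pickle to "rebuild" this object by calling print(...).
        # A real attacker would choose something far less harmless.
        return (print, ("Running attacker-chosen code",))


data = pickle.dumps(Payload())

# Only the callable and its arguments are stored, not the class name.
print(b"Payload" in data)  # False

# Loading calls print(...) and returns its result, which is None.
result = pickle.loads(data)
```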


## Protocols

#### From official docs:

There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

- Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

- Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

- Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

- Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.

- Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.

- Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.
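
As a small sketch of how to check this on a given interpreter, the module exposes `pickle.DEFAULT_PROTOCOL` and `pickle.HIGHEST_PROTOCOL`. The loop below pickles a sample object (chosen only for illustration) with every available protocol and records the size of each result.

```python
import pickle

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

obj = {"a": 1, "b": [1, 2, 3]}
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(obj, protocol=protocol)
    sizes[protocol] = len(data)
    # Protocol 2 and later start with the PROTO opcode (0x80) and the version byte.
    if protocol >= 2:
        assert data[:2] == bytes([0x80, protocol])
    # Every protocol round-trips to an equal object.
    assert pickle.loads(data) == obj

print(sizes)
```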

In [15]:
%%writefile createpickle5.py
import pickle

class Sample:
    def __init__(self, attr=10):
        self.attr = attr
    def __getstate__(self):
        return self.attr
    def __setstate__(self, state):
        self.attr = state
        print(f"Unpickled! attr={self.attr}")

sample = Sample(attr="sample")
if __name__ == "__main__":
    for protocol in range(6):
        print(pickle.dumps(sample, protocol=protocol))
pickleobject = pickle.dumps(sample, protocol=0)

Overwriting createpickle5.py


In [16]:
!python createpickle5.py

b'ccopy_reg\n_reconstructor\np0\n(c__main__\nSample\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\nVsample\np5\nb.'
b'ccopy_reg\n_reconstructor\nq\x00(c__main__\nSample\nq\x01c__builtin__\nobject\nq\x02Ntq\x03Rq\x04X\x06\x00\x00\x00sampleq\x05b.'
b'\x80\x02c__main__\nSample\nq\x00)\x81q\x01X\x06\x00\x00\x00sampleq\x02b.'
b'\x80\x03c__main__\nSample\nq\x00)\x81q\x01X\x06\x00\x00\x00sampleq\x02b.'
b'\x80\x04\x95$\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x06Sample\x94\x93\x94)\x81\x94\x8c\x06sample\x94b.'
b'\x80\x05\x95$\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x06Sample\x94\x93\x94)\x81\x94\x8c\x06sample\x94b.'


## Pickletools

#### `pickletools.dis()`
Outputs a symbolic disassembly of the pickle.

In [17]:
%%writefile createpickle6.py
from createpickle5 import pickleobject
import pickle
import pickletools
pickletools.dis(pickleobject)

Overwriting createpickle6.py


In [18]:
!python createpickle6.py

    0: c    GLOBAL     'copy_reg _reconstructor'
   25: p    PUT        0
   28: (    MARK
   29: c        GLOBAL     'createpickle5 Sample'
   51: p        PUT        1
   54: c        GLOBAL     '__builtin__ object'
   74: p        PUT        2
   77: N        NONE
   78: t        TUPLE      (MARK at 28)
   79: p    PUT        3
   82: R    REDUCE
   83: p    PUT        4
   86: V    UNICODE    'sample'
   94: p    PUT        5
   97: b    BUILD
   98: .    STOP
highest protocol among opcodes = 0
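
`pickletools.dis()` also accepts an `out` stream and an `annotate` flag. The sketch below, using a small list chosen only for illustration, captures the listing in a string and asks for a short description next to each opcode.

```python
import io
import pickle
import pickletools

data = pickle.dumps([1, 2], protocol=0)

# Write the disassembly into a buffer instead of stdout.
buffer = io.StringIO()
pickletools.dis(data, out=buffer, annotate=1)
listing = buffer.getvalue()

print(listing)
```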


#### `pickletools.optimize()`
Returns a new equivalent pickle string after eliminating unused PUT opcodes. The optimized pickle is shorter, takes less transmission time, requires less storage space, and unpickles more efficiently.

In [19]:
%%writefile createpickle7.py
from createpickle5 import pickleobject
import pickle
import pickletools

op_pickle = pickletools.optimize(pickleobject)

if __name__ == "__main__":
    print("Before optimization")
    pickletools.dis(pickleobject)
    print("After optimization")
    pickletools.dis(op_pickle)

Overwriting createpickle7.py


In [20]:
!python createpickle7.py

Before optimization
    0: c    GLOBAL     'copy_reg _reconstructor'
   25: p    PUT        0
   28: (    MARK
   29: c        GLOBAL     'createpickle5 Sample'
   51: p        PUT        1
   54: c        GLOBAL     '__builtin__ object'
   74: p        PUT        2
   77: N        NONE
   78: t        TUPLE      (MARK at 28)
   79: p    PUT        3
   82: R    REDUCE
   83: p    PUT        4
   86: V    UNICODE    'sample'
   94: p    PUT        5
   97: b    BUILD
   98: .    STOP
highest protocol among opcodes = 0
After optimization
    0: c    GLOBAL     'copy_reg _reconstructor'
   25: (    MARK
   26: c        GLOBAL     'createpickle5 Sample'
   48: c        GLOBAL     '__builtin__ object'
   68: N        NONE
   69: t        TUPLE      (MARK at 25)
   70: R    REDUCE
   71: V    UNICODE    'sample'
   79: b    BUILD
   80: .    STOP
highest protocol among opcodes = 0
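
The size difference can also be measured directly. This is a minimal sketch with a made-up dictionary; the exact byte counts depend on the object being pickled.

```python
import pickle
import pickletools

obj = {"a": [1, 2], "b": "text"}
data = pickle.dumps(obj, protocol=0)
optimized = pickletools.optimize(data)

# The optimized pickle drops the PUT opcodes that no GET ever reads.
print(len(data), len(optimized))

# Both versions unpickle to the same value.
print(pickle.loads(optimized) == obj)  # True
```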


## Unpickling

During unpickling, these opcodes are normally interpreted by something called the Pickle Machine (PM).

The PM reads in a pickle program and performs each instruction in sequence. It terminates whenever it reaches a STOP opcode; whatever object is on top of the stack at that point is the final result of unpickling.

1. GLOBAL pushes a class or a function onto the stack, given its module and name as arguments. The name copy_reg in the disassembly is the Python 2 module name; it was renamed copyreg in Python 3, and pickle maps the old name back when loading.
2. MARK pushes a special markobject onto the stack so that we can later use it to specify a slice of the stack.
3. NONE just pushes None onto the stack.
4. TUPLE pops everything above the most recent “MARK”, removes the “MARK” itself, and pushes those items back as a single tuple.
5. REDUCE pops the top two items from the stack: an argument tuple and, below it, a callable. It calls the callable with the tuple expanded into positional arguments and pushes the result onto the stack.
6. UNICODE just pushes a unicode string onto the stack.
7. BUILD pops the top item off the stack and passes it as the argument to `__setstate__()` on the object that is now on top of the stack.
8. STOP just means that whatever is on top of the stack is our final result.
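
The opcode descriptions above can be condensed into a toy interpreter. This is only a sketch, not how the real Pickle Machine is implemented: `tiny_loads` is a hypothetical helper that leans on `pickletools.genops()` for parsing, handles just a few of the opcodes, and ignores the memo entirely.

```python
import importlib
import pickle
import pickletools

MARK = object()  # stand-in for the PM's markobject


def tiny_loads(data):
    """Interpret a pickle that uses only a handful of protocol 0 opcodes."""
    stack = []
    for opcode, arg, _pos in pickletools.genops(data):
        name = opcode.name
        if name == "GLOBAL":
            # arg looks like "module attr"
            module_name, attr = arg.split(" ", 1)
            module = importlib.import_module(module_name)
            stack.append(getattr(module, attr))
        elif name == "MARK":
            stack.append(MARK)
        elif name == "NONE":
            stack.append(None)
        elif name == "UNICODE":
            stack.append(arg)
        elif name == "TUPLE":
            # Pop back to the mark, then push the items as one tuple.
            items = []
            while stack[-1] is not MARK:
                items.append(stack.pop())
            stack.pop()  # discard the mark itself
            stack.append(tuple(reversed(items)))
        elif name == "REDUCE":
            args = stack.pop()
            func = stack.pop()
            stack.append(func(*args))
        elif name == "STOP":
            return stack[-1]
        else:
            raise NotImplementedError(f"opcode {name} is not handled in this sketch")


# Hand-written pickle that calls len("sample").
data = b"cbuiltins\nlen\n(Vsample\ntR."
print(tiny_loads(data))    # 6
print(pickle.loads(data))  # 6
```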

In [21]:
%%writefile unpickle8.py

# PM's longterm memory storage
memo = {}
# PM's stack, which most opcodes interact with
stack = []

# Push a global object (module.attr) on the stack.
#  0: c    GLOBAL     'copy_reg _reconstructor'
from copyreg import _reconstructor
stack.append(_reconstructor)

# Push markobject onto the stack.
# 25: (    MARK
stack.append('MARK')

# Push a global object (module.attr) on the stack.
# 26: c        GLOBAL     'createpickle5 Sample'
from createpickle5 import Sample
stack.append(Sample)

# Push a global object (module.attr) on the stack.
# 48: c        GLOBAL     '__builtin__ object'
stack.append(object)

# Push None on the stack.
# 68: N        NONE
stack.append(None)

print(f"the stack before the TUPLE operation: {stack}")

# Build a tuple out of the topmost stack slice, after markobject.
# 69: t        TUPLE      (MARK at 25)
last_mark_index = len(stack) - 1 - stack[::-1].index('MARK')
mark_tuple = tuple(stack[last_mark_index + 1:])
stack = stack[:last_mark_index] + [mark_tuple]

print(f"the stack after the TUPLE operation: {stack}")

# Push an object built from a callable and an argument tuple.
# 70: R    REDUCE
args = stack.pop()
callable_ = stack.pop()
stack.append(callable_(*args))

# Push a Python Unicode string object.
# 71: V    UNICODE    'sample'
stack.append(u'sample')

# Finish building an object, via __setstate__ or dict update.
# 79: b    BUILD
arg = stack.pop()
stack[-1].__setstate__(arg)

# Stop the unpickling machine.
# 80: .    STOP
unpickled_obj = stack[-1]
print("Unpickling completed")

if __name__ == "__main__":
    print(unpickled_obj)
    print(vars(unpickled_obj))

Overwriting unpickle8.py


In [22]:
!python unpickle8.py

the stack before the TUPLE operation: [<function _reconstructor at 0x7fbefaf58430>, 'MARK', <class 'createpickle5.Sample'>, <class 'object'>, None]
the stack after the TUPLE operation: [<function _reconstructor at 0x7fbefaf58430>, (<class 'createpickle5.Sample'>, <class 'object'>, None)]
Unpickled! attr=sample
Unpickling completed
<createpickle5.Sample object at 0x7fbefaf5bfd0>
{'attr': 'sample'}


We can further simplify our unpickling code.

In [23]:
%%writefile unpickle9.py

from createpickle5 import Sample
from copyreg import _reconstructor

unpickled_obj = _reconstructor(cls=Sample, base=object, state=None)
unpickled_obj.__setstate__("sample")

if __name__ == "__main__":
    print(unpickled_obj)
    print(vars(unpickled_obj))

Overwriting unpickle9.py


In [24]:
!python unpickle9.py

Unpickled! attr=sample
<createpickle5.Sample object at 0x7f9ffac26e20>
{'attr': 'sample'}


Simplifying further...

In [25]:
%%writefile unpickle10.py

from createpickle5 import Sample

unpickled_obj = object.__new__(Sample)
unpickled_obj.__setstate__("sample")

if __name__ == "__main__":
    print(unpickled_obj)
    print(vars(unpickled_obj))

Overwriting unpickle10.py


In [26]:
!python unpickle10.py

Unpickled! attr=sample
<createpickle5.Sample object at 0x7fd380b89a90>
{'attr': 'sample'}


## Manual pickling

In [27]:
%%writefile pickleobj.pkl
# add a function to the stack to execute arbitrary python
GLOBAL    '__builtin__ eval'
# mark the start of args tuple
MARK
    # add the python code that we want to execute
    UNICODE    'print("Running malicious code")'
    # wrap that code into a tuple so it can be parsed by REDUCE
    TUPLE
# call `eval()` with our python code as argument
REDUCE
STOP

Overwriting pickleobj.pkl


Now to convert this into an actual pickle, we need to replace each opcode with its corresponding ASCII code: `c` for `GLOBAL`, `(` for `MARK`, `V` for `UNICODE`, `t` for `TUPLE`, `R` for `REDUCE`, and `.` for `STOP`.
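
These one-character codes do not have to be memorized. As a small sketch, the `pickletools.opcodes` list describes every opcode, so the mapping can be looked up by name.

```python
import pickletools

# Map each opcode name to its one-character code.
codes = {op.name: op.code for op in pickletools.opcodes}

for name in ["GLOBAL", "MARK", "UNICODE", "TUPLE", "REDUCE", "STOP"]:
    print(f"{name:<8} {codes[name]!r}")
```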

In [28]:
%%writefile pickleobj.pkl
c__builtin__
eval
(Vprint("Running malicious code")
tR.

Overwriting pickleobj.pkl


In [29]:
%%writefile readpickle11.py
import pickle
with open("pickleobj.pkl", "rb") as f:
    pickle.load(f)

Overwriting readpickle11.py


In [30]:
!python readpickle11.py 

Running malicious code


In [31]:
%%writefile readpickle12.py
import pickle
pickle.loads(b'c__builtin__\neval\n(Vprint("Running malicious code")\ntR.')

Overwriting readpickle12.py


In [32]:
!python readpickle12.py

Running malicious code


## Restricting Globals

By default, unpickling will import any class or function that it finds in the pickle data. 
For this reason, you may want to control what gets unpickled by customizing `Unpickler.find_class()`. `Unpickler.find_class()` is called whenever a global (i.e., a class or a function) is requested. Thus it is possible to either completely forbid globals or restrict them to a safe subset.

In [33]:
%%writefile createpickle13.py
import builtins
import io
import pickle
from pickle import UnpicklingError, Unpickler
class SecureUnpickler(Unpickler):
    unsafe_builtins = ["eval"]
    safe_modules = ["builtins", "os"]
    def find_class(self, module, name):
        exception_ = UnpicklingError(f"global {module}.{name} is forbidden")
        if module not in self.safe_modules:
            raise exception_
        if module == "builtins" and name in self.unsafe_builtins:
            raise exception_
        if module == "os" and name == "system":
            raise exception_
        # module is a string here, so let the base class do the import and lookup
        return super().find_class(module, name)

def secure_loads(s):
    """Helper function analogous to pickle.loads()."""
    return SecureUnpickler(io.BytesIO(s)).load()

if __name__ == "__main__":
    try:
        secure_loads(pickle.dumps(set([1, 2])))
        print("successfully unpickled a set")
    except UnpicklingError as e:
        print(e)
    try:
        secure_loads(b"cos\nsystem\n(S'echo Malicious code'\ntR.")
        print("successfully unpickled os.system command")
    except UnpicklingError as e:
        print(e)
    try:
        import pandas as pd
        secure_loads(pickle.dumps(pd.DataFrame([1, 2])))
        print("successfully unpickled pandas DataFrame")
    except UnpicklingError as e:
        print(e)

Overwriting createpickle13.py


In [34]:
!python createpickle13.py

successfully unpickled a set
global os.system is forbidden
global pandas.core.frame.DataFrame is forbidden
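
A stricter variant is to allow only an explicit set of names and reject everything else. The sketch below follows the pattern shown in the official pickle documentation; the names `AllowlistUnpickler`, `allowlist_loads`, and the contents of `SAFE_BUILTINS` are choices made for this illustration.

```python
import builtins
import io
import pickle

SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only hand back globals that were explicitly approved.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Everything else is rejected before it is imported.
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def allowlist_loads(data):
    """Helper analogous to pickle.loads()."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


print(allowlist_loads(pickle.dumps([1, 2, range(3)])))  # [1, 2, range(0, 3)]

try:
    allowlist_loads(b"cos\nsystem\n(S'echo Malicious code'\ntR.")
except pickle.UnpicklingError as e:
    print(e)  # global os.system is forbidden
```

With an allowlist, a newly discovered dangerous callable is blocked by default instead of having to be added to a list of forbidden names.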


## Clean-up

In [35]:
!rm *pickle*.py
!rm *.pkl

## Further Reading:
- `object.__getnewargs_ex__()`
- `object.__getnewargs__()`
- `object.__reduce__()`

## References:
- https://docs.python.org/3/library/pickle.html
- https://intoli.com/blog/dangerous-pickles/