# Data Persistence

Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended. In other words, for a data store to be considered persistent, it must write to non-volatile storage.

In short, storing any data generated by a Python script in secondary storage is called data persistence.

Persisted data can be read back and reused later, whereas non-persisted data is lost when the process ends or the system shuts down, because it lives only in volatile memory (RAM).

To accomplish data persistence we use a process called serialization.

**Serialization**
Serialization is the process of converting a data structure into a linear form that can be stored or transmitted over a network.

In Python, serialization lets us take a complex object structure and transform it into a stream of bytes that can be saved to disk or sent over a network.

Serialization is sometimes also called marshalling.

To convert serialized data back into its original data structure, we use a process called deserialization or unmarshalling.
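The round trip described above can be sketched in a few lines. This is a minimal illustration using the standard-library `pickle` module, which is covered later in this section; the `record` dictionary is made-up sample data.

```python
import pickle

# Hypothetical sample data built only from built-in types.
record = {"name": "sensor-1", "readings": [1.5, 2.25, 3.0], "active": True}

# Serialization: object -> stream of bytes
payload = pickle.dumps(record)

# Deserialization: stream of bytes -> a new, equal object
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == record)      # True
print(restored is record)      # False (it is a reconstructed copy)
```

The restored object compares equal to the original but is a separate object, which is what you would expect after reading data back in a later process.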

Python's standard library has three modules for this purpose:
- `pickle`
- `marshal`
- `json`

`marshal` is the oldest of the three. It exists mainly to read and write the compiled bytecode of Python modules, i.e. the `.pyc` files created when the interpreter imports a module. So even though it can serialize objects, it is not recommended for general use.

`json` is the newest. It is convenient and widely accepted for data exchange, and it serializes objects in a lightweight, human-readable text form.

`pickle` serializes objects in a binary format, so its output is not human-readable, but it is often faster, handles most Python objects out of the box, and can serialize instances of custom classes.

**General guidelines for deciding which approach to use:**
- Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.

- The json module and XML are good choices if you need interoperability with different languages or a human-readable format.

- The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.

## 1. Pickle

The `pickle` module implements binary protocols for serializing and deserializing a Python object structure.

When the `pickle` module is used, serialization and deserialization are called pickling and unpickling.


**Data Stream format**
- The data format used by `pickle` is Python-specific.
  - Its advantage is that there are no restrictions imposed by external standards such as JSON.
  - Its disadvantage is that non-Python programs may not be able to reconstruct pickled Python objects.
- By default, the `pickle` data format uses a relatively compact binary representation. We can compress it further if we want.
- The `pickletools` module contains tools for analyzing data streams generated by `pickle`. The `pickletools` source code has extensive comments about the opcodes used by the pickle protocols.
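To make the points above concrete, here is a small sketch that disassembles a pickle stream with `pickletools.dis` and then compresses a larger pickle with `zlib`. The sample data and variable names are illustrative assumptions, and `zlib` is just one of several standard-library compressors that could be used.

```python
import io
import pickle
import pickletools
import zlib

data = {"id": 7, "tags": ["a", "b"]}  # hypothetical sample data
payload = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Disassemble the opcode stream into a text buffer instead of stdout.
out = io.StringIO()
pickletools.dis(payload, out=out)
listing = out.getvalue()
print(listing)  # one line per opcode, starting with PROTO and ending with STOP

# Pickles of repetitive data usually compress well.
rows = [{"status": "ok", "code": 200} for _ in range(1000)]
raw = pickle.dumps(rows)
compressed = zlib.compress(raw)
print(len(raw), len(compressed))

# Decompress before unpickling to get the original data back.
roundtrip = pickle.loads(zlib.decompress(compressed))
```

Compression is applied to the bytes after pickling, so the reader has to decompress first and unpickle second.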

The `pickle` module provides three classes: `Pickler`, `Unpickler`, and `PickleBuffer`.

**Methods in Pickle**
- `pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)` Write the pickled representation of the object obj to the open file object file. This is equivalent to `Pickler(file, protocol).dump(obj)`.
- `pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)` Return the pickled representation of the object obj as a bytes object, instead of writing it to a file.
- `pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)` Read the pickled representation of an object from the open file object file and return the reconstituted object hierarchy specified therein. This is equivalent to `Unpickler(file).load()`.
- `pickle.loads(data, /, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)` Return the reconstituted object hierarchy of the pickled representation data of an object. data must be a bytes-like object.
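The four functions above pair up as file-based (`dump`/`load`) and in-memory (`dumps`/`loads`). The following sketch shows both pairs; the `settings` data and the `settings.pkl` file name are assumptions for illustration, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

settings = {"theme": "dark", "recent": ["a.txt", "b.txt"], "volume": 0.8}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "settings.pkl")  # hypothetical file name

    # dump: write the pickle to a file opened in binary write mode.
    with open(path, "wb") as f:
        pickle.dump(settings, f, protocol=pickle.HIGHEST_PROTOCOL)

    # load: read it back from a file opened in binary read mode.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# dumps / loads: the same round trip without touching the file system.
in_memory = pickle.loads(pickle.dumps(settings))

print(loaded == settings)     # True
print(in_memory == settings)  # True
```

Opening the file in a `with` block ensures it is flushed and closed even if pickling raises an exception.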
  

**Comparison with Marshal**
- `pickle` keeps track of the objects it has already serialized, so later references to the same object are not serialized again. `marshal` does not do this.
- `pickle` can serialize user-defined classes and their instances, while `marshal` cannot.
- The `marshal` format is not guaranteed to be portable across Python versions, because its primary job is to support `.pyc` files. The `pickle` format is guaranteed to be backwards compatible across Python releases, provided a compatible pickle protocol is chosen.
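A short sketch of the first two points. It shows that `pickle` preserves shared and recursive references, and that `marshal` rejects a class instance that `pickle` accepts. `fractions.Fraction` from the standard library stands in for a user-defined class here; that substitution is an assumption made to keep the example self-contained.

```python
import marshal
import pickle
from fractions import Fraction

# Two keys refer to the same list object.
shared = [1, 2, 3]
container = {"first": shared, "second": shared}

clone = pickle.loads(pickle.dumps(container))
print(clone["first"] is clone["second"])  # True: the sharing survives the round trip

# A list that contains itself is handled too.
loop = []
loop.append(loop)
loop_clone = pickle.loads(pickle.dumps(loop))
print(loop_clone[0] is loop_clone)  # True

# A class instance: pickle accepts it, marshal raises ValueError.
half = Fraction(1, 2)
half_clone = pickle.loads(pickle.dumps(half))

try:
    marshal.dumps(half)
    marshal_ok = True
except ValueError as exc:
    marshal_ok = False
    print("marshal refused:", exc)
```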

**Comparison with Json**
- `pickle` is a binary serialization format, while `json` is a text serialization format (Unicode text, usually encoded as UTF-8), which makes it human-readable.
- `pickle` is Python-specific, while `json` is interoperable and widely used outside the Python ecosystem.
- By default, `json` can represent only a subset of Python's built-in types and no custom classes, while `pickle` can represent a very large number of Python types, including custom classes.
- `json` is far more secure than `pickle`: deserializing untrusted JSON does not in itself lead to arbitrary code execution.
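The differences in output type and type coverage can be seen in a small sketch. The sample values are assumptions for illustration; a `set` is used as an example of a built-in type that `json` does not handle by default.

```python
import json
import pickle

point = (1, 2)
tags = {"x", "y"}

# json produces text, and a tuple is written as a JSON array.
text = json.dumps({"point": point})
print(type(text).__name__)  # str
print(text)                 # {"point": [1, 2]}
from_json = json.loads(text)["point"]  # comes back as a list, not a tuple

# A set is outside the subset of types json supports by default.
try:
    json.dumps(tags)
    json_ok = True
except TypeError as exc:
    json_ok = False
    print("json refused:", exc)

# pickle produces bytes and restores both values with their original types.
blob = pickle.dumps({"point": point, "tags": tags})
restored = pickle.loads(blob)
print(type(blob).__name__)  # bytes
```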

**Note:**
- The `pickle` module is not secure.
- Malicious pickle data can be crafted to execute arbitrary code during unpickling, so only unpickle data you trust.
- Pickled data can also be tampered with.
- Signing the data with `hmac` lets you detect tampering before unpickling.
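One way to apply the signing advice is sketched below. The names `SECRET_KEY`, `sign`, and `verify_and_load` are hypothetical, and the layout (a 32-byte SHA-256 digest placed in front of the payload) is an assumption chosen for simplicity, not a standard format.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key; keep it private
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes

def sign(payload: bytes) -> bytes:
    """Prefix the payload with its HMAC-SHA256 digest."""
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload

def verify_and_load(signed: bytes):
    """Unpickle only if the digest matches; otherwise raise ValueError."""
    digest, payload = signed[:DIGEST_SIZE], signed[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

signed = sign(pickle.dumps({"user": "alice", "role": "admin"}))
trusted = verify_and_load(signed)
print(trusted)

# Flip one bit in the payload to simulate tampering.
tampered = signed[:-1] + bytes([signed[-1] ^ 0x01])
try:
    verify_and_load(tampered)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

The signature only shows that the bytes were produced by someone who holds the key and were not altered afterwards. It does not make unpickling safe if the key leaks or if the signer itself is untrusted.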

In [2]:
import pickle

class Alpha:
    x = 10
    y = True
    z = [1, 2, 3]
    w = {'a': 1, 'b': 2}
    a = (1, 2)

obj_alpha = Alpha()

pickled_obj = pickle.dumps(obj_alpha)

In [7]:
print(pickled_obj)

b'\x80\x04\x95\x19\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Alpha\x94\x93\x94)\x81\x94.'


In [4]:
unpickled_obj = pickle.loads(pickled_obj)

In [6]:
print(unpickled_obj)

<__main__.Alpha object at 0x0000022B55A217C0>


In [8]:
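# dumping an object to a file
# pickle.dump returns None, so there is nothing useful to assign;
# the with block makes sure the file is closed after writing

with open('model.sav', 'wb') as f:
    pickle.dump(obj_alpha, f)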

**Note:**
- `dump` and `load` write the pickled representation to, and read it from, an open file object.
- `dumps` and `loads` work in memory: `dumps` returns the pickled representation as a `bytes` object, and `loads` reconstructs an object from one.