# Data Persistence

Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended. In other words, for a data store to be considered persistent, it must write to non-volatile storage.

In short, storing data generated by a Python script in secondary (non-volatile) storage is called data persistence.

Persisted data can be used again later, whereas non-persistent data is lost after system shutdown because it lives only in volatile memory (RAM).

To accomplish data persistence we use a process called serialization.

**Serialization**
It is the process of converting a data structure into a linear form that can be stored or transmitted over a network.

In Python, serialization allows us to take a complex object structure and transform it into a stream of bytes that can be saved to disk or sent over a network.

Serialization is sometimes also called marshalling.

To convert serialized data back into its original data structure, we use a process called deserialization or unmarshalling.
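The round trip described above can be sketched as follows. This is a minimal illustration, assuming `json` as the serializer and a temporary file as the storage; the `record` data and the file name are made up for the example.

```python
import json
import os
import tempfile

# Hypothetical in-memory data structure used only for this illustration.
record = {"name": "sensor-1", "readings": [1.5, 2.0, 3.25], "active": True}

# Serialize: turn the structure into a linear stream of bytes.
payload = json.dumps(record).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.json")

    # Persist the bytes to non-volatile storage.
    with open(path, "wb") as f:
        f.write(payload)

    # Deserialize: read the bytes back and rebuild the structure.
    with open(path, "rb") as f:
        restored = json.loads(f.read().decode("utf-8"))

print(restored == record)  # True
```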

Python offers several options for this purpose:
- `pickle`
- `marshal`
- `json`
- `shelve`
- Storing to any database (built-in `sqlite3`)

`marshal` is the oldest of these modules. It exists mainly to read and write the compiled bytecode of Python modules, i.e. the `.pyc` files produced when the interpreter imports a module. So even though it can serialize objects, it is not recommended for general use.

`json` is the newest of these three modules. It is convenient and widely accepted for data exchange, and it serializes objects in a lightweight, human-readable text form.

`pickle` serializes objects in a binary format, so the output is not human-readable, but it is fast, works within any Python program, and can serialize instances of user-defined classes.

**General guidelines for deciding which approach to use:**
- Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.

- The json module and XML are good choices if you need interoperability with different languages or a human-readable format.

- The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.

## 1. Pickle

The `pickle` module implements binary protocols for serializing and deserializing a Python object structure.

When the `pickle` module is used, serialization and deserialization are called pickling and unpickling.


**Data Stream format**
- The data format used by `pickle` is Python-specific.
  - Its advantage is that there are no restrictions imposed by external standards such as JSON.
  - Its disadvantage is that non-Python programs may not be able to reconstruct pickled Python objects.
- By default, the `pickle` data format uses a relatively compact binary representation. We can compress it if we want.
- The module `pickletools` contains tools for analyzing data streams generated by `pickle`. The `pickletools` source code has extensive comments about the opcodes used by pickle protocols.
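As a rough sketch of the last two points, the snippet below disassembles a pickle with `pickletools.dis` and wraps it in `zlib` compression. The choice of `zlib` is an assumption made for the example; any compressor would do, and the `data` dictionary is made up.

```python
import io
import pickle
import pickletools
import zlib

# Hypothetical data used only for this illustration.
data = {"numbers": list(range(100)), "label": "demo"}
raw = pickle.dumps(data)

# Disassemble the pickle into a readable opcode listing.
listing = io.StringIO()
pickletools.dis(raw, out=listing)
print(listing.getvalue().splitlines()[0])  # first opcode line of the stream

# pickle does not compress by itself; wrap it with a compressor if size matters.
compressed = zlib.compress(raw)
restored = pickle.loads(zlib.decompress(compressed))
print(len(raw), len(compressed))
print(restored == data)  # True
```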

The `pickle` module provides three classes: `Pickler`, `Unpickler`, and `PickleBuffer`.

**Methods in Pickle**
- `pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)` Write the pickled representation of the object obj to the open file object file. This is equivalent to `Pickler(file, protocol).dump(obj)`.
- `pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)` Return the pickled representation of the object obj as a bytes object, instead of writing it to a file.
- `pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)` Read the pickled representation of an object from the open file object and return the reconstituted object hierarchy specified therein. This is equivalent to `Unpickler(file).load()`.
- `pickle.loads(data, /, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)` Return the reconstituted object hierarchy of the pickled representation `data` of an object. `data` must be a bytes-like object.
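The four functions above can be exercised side by side. The following is a small sketch; the `settings` dictionary and the file name `settings.pkl` are placeholders chosen for the example.

```python
import os
import pickle
import tempfile

# Hypothetical data used only for this illustration.
settings = {"theme": "dark", "recent": ["a.txt", "b.txt"], "zoom": 1.25}

# dumps/loads work with a bytes object held in memory.
blob = pickle.dumps(settings, protocol=pickle.HIGHEST_PROTOCOL)
from_bytes = pickle.loads(blob)

# dump/load work with an open binary file object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.pkl")
    with open(path, "wb") as f:
        pickle.dump(settings, f)
    with open(path, "rb") as f:
        from_file = pickle.load(f)

print(from_bytes == settings, from_file == settings)  # True True
```

Opening the file in a `with` block ensures it is flushed and closed before it is read back.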
  

**Comparison with Marshal**
- `pickle` keeps track of the objects it has already serialized, so that later references to the same object are not serialized again. `marshal` does not do this.
- `pickle` can serialize user-defined classes and their instances, while `marshal` cannot.
- `marshal` serialization is not guaranteed to be portable across Python versions, because its primary job is to support `.pyc` files. The `pickle` format is guaranteed to be backwards compatible across Python releases, provided a compatible pickle protocol is chosen.
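The object tracking mentioned in the first point can be observed directly. In this sketch (variable names are illustrative), a list referenced twice is still one object after unpickling, and a self-referencing dictionary survives the round trip.

```python
import pickle

# One list referenced from two places.
shared = [1, 2, 3]
container = {"first": shared, "second": shared}

restored = pickle.loads(pickle.dumps(container))
# The shared list was written once, so both keys point at the same object again.
print(restored["first"] is restored["second"])  # True

# A recursive structure: the dictionary contains a reference to itself.
node = {"name": "root"}
node["self"] = node

copy = pickle.loads(pickle.dumps(node))
print(copy["self"] is copy)  # True
```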

**Comparison with Json**
- `pickle` is a binary serialization format, while `json` is a text serialization format (Unicode text, usually encoded as UTF-8) and is therefore human-readable.
- `pickle` is Python-specific, while `json` is interoperable and widely used outside the Python ecosystem.
- By default, `json` can represent only a subset of Python's built-in types and no custom classes, while `pickle` can represent a very large number of Python types, including custom classes.
- `json` is far more secure than `pickle`: deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability, whereas unpickling untrusted data can.
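The difference in type coverage can be seen with a short experiment. As an assumption for this sketch, `fractions.Fraction` (a class from the standard library) and a `set` stand in for types that `json` does not handle by default; instances of user-defined classes behave the same way as long as the class can be imported when unpickling.

```python
import json
import pickle
from fractions import Fraction

# Hypothetical data containing types outside JSON's default subset.
value = {"ratio": Fraction(3, 4), "tags": {"a", "b"}}

# json refuses types it does not know how to represent.
try:
    json.dumps(value)
    json_error = None
except TypeError as exc:
    json_error = str(exc)
print("json failed:", json_error)

# pickle round-trips the same value unchanged.
restored = pickle.loads(pickle.dumps(value))
print(restored == value)  # True

# Even for supported types, json may not preserve the exact Python type.
pair = (1, 2)
via_json = json.loads(json.dumps(pair))
via_pickle = pickle.loads(pickle.dumps(pair))
print(via_json, via_pickle)  # [1, 2] (1, 2)
```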

**Note:** 
- The `pickle` module is not secure. Only unpickle data you trust.
- A maliciously crafted pickle can execute arbitrary code during unpickling.
- Pickled data can also be tampered with in storage or in transit.
- Signing the data with `hmac` lets you detect tampering before unpickling.
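One possible way to apply the `hmac` advice is sketched below. The helper names `sign` and `verify_and_load`, the key, and the layout (a SHA-256 digest prepended to the pickle) are all assumptions made for this example, not a prescribed scheme.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for the example
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign(obj):
    """Pickle obj and prepend an HMAC-SHA256 digest of the pickle."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload


def verify_and_load(blob):
    """Check the digest first and unpickle only if it matches."""
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


blob = sign({"user": "alice", "role": "admin"})
print(verify_and_load(blob))

# Flip one bit in the payload to simulate tampering.
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])
try:
    verify_and_load(tampered)
except ValueError as exc:
    print(exc)
```

The signature only proves that the data came from someone holding the key and was not modified afterwards. It does not make unpickling data from an untrusted party safe.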

In [2]:
import pickle

class Alpha:
    x = 10
    y = True
    z = [1, 2, 3]
    w = {'a': 1, 'b': 2}
    a = (1, 2)

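obj_alpha = Alpha()

# Class attributes are not stored in the pickle; only a reference to the
# class (__main__.Alpha) and the (empty) instance state are written.
pickled_obj = pickle.dumps(obj_alpha)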

In [7]:
print(pickled_obj)

b'\x80\x04\x95\x19\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Alpha\x94\x93\x94)\x81\x94.'


In [4]:
unpickled_obj = pickle.loads(pickled_obj)

In [6]:
print(unpickled_obj)

<__main__.Alpha object at 0x0000022B55A217C0>


In [8]:
# dumping an object to a file
# pickle.dump() returns None, so there is nothing useful to assign
with open('model.sav', 'wb') as f:
    pickle.dump(obj_alpha, f)

**Note:**
- `dump` and `load` write to and read from an open binary file.
- `dumps` and `loads` work with a `bytes` object in memory instead of a file.

## 2. Marshal

This module contains functions that read and write Python values in a binary format. The format is specific to Python but independent of machine architecture.

This is not a general persistence module. It mainly exists to support reading and writing the pseudo-compiled code of Python modules in `.pyc` files. Therefore the Python maintainers reserve the right to modify the marshal format in backward-incompatible ways should the need arise.

Not all Python object types are supported by `marshal`. In general, only objects whose value is independent of a particular invocation of Python can be serialized.

**Methods in Marshal**
- `marshal.dump(value, file[, version])` write the value to an open, writable binary file
- `marshal.dumps(value[, version])` return the bytes object that would be written to a file by `dump(value, file)`
- `marshal.load(file)` read one value from the open file and return it
- `marshal.loads(bytes)` convert a bytes-like object to a value
- `marshal.version` (an attribute, not a function) indicates the format version that the module uses
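For illustration only (the notes above advise against `marshal` for general persistence), here is a small sketch of a round trip with supported types and of what happens with an unsupported object. The `Custom` class and the `value` dictionary are made up for the example.

```python
import marshal

# Built-in types whose value does not depend on the running interpreter.
value = {"ids": [1, 2, 3], "ratio": 0.5, "name": "demo", "pair": (1, None)}

blob = marshal.dumps(value)
round_tripped = marshal.loads(blob)
print(round_tripped == value)  # True


class Custom:
    """Hypothetical user-defined class; marshal cannot serialize its instances."""


try:
    marshal.dumps(Custom())
    marshal_error = None
except ValueError as exc:
    marshal_error = str(exc)
print("marshal failed:", marshal_error)
```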

**Note:** 
- The binary format changes between Python versions, which is one of the reasons `marshal` is not suggested for serialization.
- `marshal` is not secure against erroneous or maliciously constructed data.
- Details of the format are intentionally undocumented, because it may change from one Python version to another.

In [11]:
import marshal

In [12]:
marshal.version

4

## 3. JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format inspired by JavaScript object literal syntax.


The `json` module contains the classes and functions required to handle JSON data in Python.

**Methods in JSON**
- `json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)` serialize a Python object as a JSON-formatted stream to `fp` (a file-like object that supports `.write()`)
  - The json module always produces str objects, not bytes objects. 
  - If `ensure_ascii` is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped, if false these characters will be output as-is.
  - If `check_circular` is false (default: True), then the circular reference check for container types will be skipped and a circular reference will result in a RecursionError (or worse).
  - If `allow_nan` is false (default: True), then it will be a ValueError to serialize out of range float values (nan, inf, -inf) in strict compliance of the JSON specification, if  true their JavaScript equivalents (NaN, Infinity, -Infinity) will be used.
  - If `sort_keys` is true (default: False), then the output of dictionaries will be sorted by key.
  - If `indent` is a non-negative integer or string, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, negative, or "" will only insert newlines. None (the default) selects the most compact representation. Using a positive integer indent indents that many spaces per level. If indent is a string (such as "\t"), that string is used to indent each level.
  
- `json.dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)` serialize a Python object to a JSON-formatted string. The arguments have the same meaning as above.
  
- `json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)` deserialize `fp` (a file-like object containing a JSON document) to a Python object
  - `object_hook` is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders (e.g. JSON-RPC class hinting).
  - `object_pairs_hook` is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders. If object_hook is also defined, the object_pairs_hook takes priority.
  - `parse_float`, if specified, will be called with the string of every JSON float to be decoded. By default, this is equivalent to float(num_str). This can be used to use another datatype or parser for JSON floats (e.g. decimal.Decimal).
  - `parse_int`, if specified, will be called with the string of every JSON int to be decoded. By default, this is equivalent to int(num_str). This can be used to use another datatype or parser for JSON integers (e.g. float).
  - `parse_constant`, if specified, will be called with one of the following strings: '-Infinity', 'Infinity', 'NaN'. This can be used to raise an exception if invalid JSON numbers are encountered.
  
- `json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)` deserialize a JSON string to a Python object. The arguments have the same meaning as above.

- `json.JSONDecoder.decode(s)` return the python representation of `s` (a JSON string)

- `json.JSONDecoder.raw_decode(s)` decode a JSON document from `s`(a JSON string) and return a 2-tuple of the python representation and the index in `s` where the document ended

- `json.JSONEncoder.default(o)` implement this method in a subclass so that it returns a serializable object for `o`, or calls the base implementation (which raises a `TypeError`)

- `json.JSONEncoder.encode(o)` return a JSON string representation of a Python data structure

- `json.JSONEncoder.iterencode(o)` encode the given object `o` and yield each string representation as available
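The encoding options listed above are easier to compare with a small experiment. The following sketch uses made-up data and only the `sort_keys`, `ensure_ascii`, `indent`, and `allow_nan` parameters.

```python
import json

# Hypothetical data with a non-ASCII string and an out-of-range float.
data = {"b": [1, 2], "a": "café", "c": float("nan")}

# Default settings: non-ASCII characters are escaped, NaN is written as NaN.
compact = json.dumps(data, sort_keys=True)
print(compact)
# {"a": "caf\u00e9", "b": [1, 2], "c": NaN}

# ensure_ascii=False keeps non-ASCII characters as they are.
readable = json.dumps({"a": "café"}, ensure_ascii=False)
print(readable)
# {"a": "café"}

# indent pretty-prints nested containers, one element per line.
pretty = json.dumps({"b": [1, 2], "a": 1}, indent=2, sort_keys=True)
print(pretty)

# allow_nan=False enforces strict JSON and rejects NaN and infinities.
try:
    json.dumps(data, allow_nan=False)
    nan_error = None
except ValueError as exc:
    nan_error = str(exc)
print("rejected:", nan_error)
```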


**Classes in json**
- `class json.JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)` Extensible JSON encoder for python data structures
  - Supports the following objects and types by default:
    |JSON|Python|
    |--------|---------|
    |object|dict|
    |array|list, tuple|
    |string|str|
    |number|int, float|
    |true|True|
    |false|False|
    |null|None|

- `class json.JSONDecoder(*, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, strict=True, object_pairs_hook=None)` JSON decoder. It performs the translations in the table above in the opposite direction: a JSON array always decodes to a `list`, and a JSON number decodes to `int` or `float`.
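The decoding hooks can be tried out in the same way. This sketch uses a made-up JSON document and shows `parse_float`, `object_pairs_hook`, and `JSONDecoder.raw_decode`; the use of `Decimal` and `OrderedDict` is just one possible choice.

```python
import json
from collections import OrderedDict
from decimal import Decimal

# Hypothetical JSON document used only for this illustration.
text = '{"price": 19.99, "qty": 3, "tags": ["a", "b"]}'

# parse_float receives the raw string of every JSON float.
exact = json.loads(text, parse_float=Decimal)
print(repr(exact["price"]))  # Decimal('19.99')

# object_pairs_hook receives each object as a list of (key, value) pairs.
ordered = json.loads(text, object_pairs_hook=OrderedDict)
print(list(ordered))  # ['price', 'qty', 'tags']

# raw_decode parses one document and reports where it ended,
# which helps when extra text follows the JSON.
decoder = json.JSONDecoder()
obj, end = decoder.raw_decode('{"ok": true} trailing text')
print(obj, end)  # {'ok': True} 12
```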
  

**Note:**
- JSON data from an untrusted source can contain malicious strings that cause the decoder to consume considerable CPU and memory resources.

In [16]:
import json

In [22]:
json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])

'["foo", {"bar": ["baz", null, 1.0, 2]}]'

In [23]:
from io import StringIO
io = StringIO()
json.dump(['streaming API'], io)
io.getvalue()

'["streaming API"]'

In [24]:
json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')

['foo', {'bar': ['baz', None, 1.0, 2]}]

In [25]:
from io import StringIO
io = StringIO('["streaming API"]')
json.load(io)

['streaming API']

In [26]:
# Extending JSON Encoder
class ComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, complex):
            return [obj.real, obj.imag]
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)

json.dumps(2 + 1j, cls=ComplexEncoder)
ComplexEncoder().encode(2 + 1j)
l = list(ComplexEncoder().iterencode(2 + 1j))
print(l)

['[2.0', ', 1.0', ']']


In [17]:
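# Example default() implementation that supports arbitrary iterables
class IterableEncoder(json.JSONEncoder):  # illustrative class name
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return super().default(o)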

In [18]:
json.JSONEncoder().encode({"foo": ["bar", "baz"]})

'{"foo": ["bar", "baz"]}'

In [21]:
l= [2,4,6]
e = []
for chunk in json.JSONEncoder().iterencode(l):
    e.append(chunk)

print(e)

['[2', ', 4', ', 6', ']']


## 4. Shelve

A "shelf" is a persistent, dictionary-like object. The values can be essentially arbitrary Python objects (anything that `pickle` can handle), and the keys are ordinary strings.


The `shelve` module contains all the code related to shelves.

**Methods in Shelve**
- `shelve.open(filename, flag='c', protocol=None, writeback=False)` open a persistent dictionary.
  - `filename` specified is the base filename for underlying database
  - A shelf must be closed after use. `shelve.open()` can also be used as a context manager in a `with` statement, much like file I/O, so that the shelf is closed automatically.
- `shelf.sync()` write back all entries in the cache if the shelf was opened with `writeback` set to True. This is called automatically when the shelf is closed with `close()`.
- `shelf.close()` synchronize and close the persistent shelf object.
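A short sketch of these functions in use follows. The base filename `inventory` and the stored values are placeholders, and a temporary directory keeps the example self-contained. It also shows the effect of `writeback` on in-place mutation.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "inventory")  # hypothetical base filename

    # Create the shelf and store some values; the with block closes it.
    with shelve.open(base) as shelf:
        shelf["apples"] = {"count": 3}
        shelf["pears"] = [1, 2, 3]

    # Reopen it later: the data is still there.
    with shelve.open(base) as shelf:
        keys = sorted(shelf.keys())
        apples = shelf["apples"]

    # Without writeback, mutating a stored value in place is not saved,
    # because shelf["pears"] returns a fresh unpickled copy.
    with shelve.open(base) as shelf:
        shelf["pears"].append(4)
    with shelve.open(base) as shelf:
        pears_plain = shelf["pears"]

    # With writeback=True, accessed entries are cached and written back
    # on sync() or close().
    with shelve.open(base, writeback=True) as shelf:
        shelf["pears"].append(4)
    with shelve.open(base) as shelf:
        pears_writeback = shelf["pears"]

print(keys)             # ['apples', 'pears']
print(apples)           # {'count': 3}
print(pears_plain)      # [1, 2, 3]
print(pears_writeback)  # [1, 2, 3, 4]
```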

**Classes in Shelve**
- `class shelve.Shelf(dict, protocol=None, writeback=False, keyencoding='utf-8')` subclass of `collections.abc.MutableMapping` which stores pickled values in the dict object.    
- `class shelve.BsdDbShelf(dict, protocol=None, writeback=False, keyencoding='utf-8')` subclass of Shelf which exposes `first(), next(), previous(), last() and set_location()` which are available in the third-party bsddb module from pybsddb but not in other database modules. 
  
- `class shelve.DbfilenameShelf(filename, flag='c', protocol=None, writeback=False)` subclass of Shelf which accepts a filename instead of a dict-like object. The underlying file will be opened using dbm.open(). By default, the file will be created and opened for both read and write.
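To see what `Shelf` does with the dict-like object it wraps, the sketch below passes it a plain in-memory dictionary (an assumption made purely for inspection; a real shelf would wrap a `dbm` database). Keys are encoded with `keyencoding` and values are stored as pickles.

```python
import pickle
import shelve

backing = {}                      # plain dict standing in for a database
shelf = shelve.Shelf(backing)
shelf["point"] = (1, 2)

stored_keys = list(backing)
print(stored_keys)                # [b'point']

# The stored value is an ordinary pickle of the original object.
stored_value = pickle.loads(backing[b"point"])
print(stored_value)               # (1, 2)

# Reading through the shelf unpickles it transparently.
fetched = shelf["point"]
print(fetched)                    # (1, 2)

shelf.close()
```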

In [27]:
import shelve