<center>
    <img src="https://upload.wikimedia.org/wikipedia/commons/a/a8/%D0%9B%D0%9E%D0%93%D0%9E_%D0%A8%D0%90%D0%94.png" width=500px/>
    <font>Python 2023</font><br/>
    <br/>
    <br/>
    <b style="font-size: 2em">Serialization and deserialization</b><br/>
    <br/>
    <font>Камиль Талипов</font><br/>
</center>

<div align="center"><b><font size=6>Why do we need all this?</font></b></div>

<div align="center"><img src="https://blogdotxkcddotcom.files.wordpress.com/2019/08/sendafile_1.png?w=1191&h=334" width="400px"/></div>

1. Web API: JSON/RPC/...
2. Application configuration
3. Caching / database storage

...

Formats:

1. Text: JSON, YAML, XML, ...
2. Binary: Pickle, protobuf, FlatBuffers, ...

See also: <a href="https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats">Comparison of data-serialization formats</a>

<div align="center"><b><font size=6>JSON</font></b></div>
<div align="center"><img src="https://www.json.org/img/json160.gif"/></div>

```json
{
    "orders": [
        {
            "id": "2345328",
            "date": "June 20, 2020 10:45:34",
            "trackId": "XGB2567TD",
            "customer": {
                "custId": "106156",
                "fname": "Max",
                "lname": "Hatfield",
                "city": "NY"
            }
        }
    ]
}
```

JSON - JavaScript Object Notation

Formal description: https://www.json.org/json-en.html

See also: <a href="https://www.youtube.com/playlist?list=PLEzQf147-uEoNCeDlRrXv6ClsLDN-HtNm">Videos about JSON</a>

Libraries for working with JSON:
1. json 
2. simplejson
3. simdjson
4. orjson
5. ujson

##### Offtopic

https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/

The json module has 4 main functions: 2 for working with streams (file objects) and 2 for working with strings.

Stream:
1. dump
2. load

String:
1. dumps
2. loads

In [1]:
import json

data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
data_dump = json.dumps(data)
print(data_dump)

with open('result.json', 'w') as fout:
    json.dump(data, fout)
!cat result.json

["foo", {"bar": ["baz", null, 1.0, 2]}]
["foo", {"bar": ["baz", null, 1.0, 2]}]

In [2]:
data_parsed = json.loads(data_dump)
print(data_parsed)

with open('result.json') as fin:
    print(json.load(fin))

['foo', {'bar': ['baz', None, 1.0, 2]}]
['foo', {'bar': ['baz', None, 1.0, 2]}]


In [3]:
print(data == data_parsed)

False


In [4]:
print(data)
print(data_parsed)

['foo', {'bar': ('baz', None, 1.0, 2)}]
['foo', {'bar': ['baz', None, 1.0, 2]}]


| Python  | JSON  |
|:---|:---|
| dict  | Object |
| list  | Array  |
| tuple  | Array  |
| str  | String  |
| int  | Number (int)  |
| float  | Number (real)  |
| True  | true  |
| False  | false  |
| None  | null  |

What about Decimal, complex, datetime, ...?

In [6]:
from decimal import Decimal
num = Decimal('0.1')
json.dumps(num)

TypeError: Object of type Decimal is not JSON serializable

In [None]:
def my_encode(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError('Unknown object type {}'.format(type(obj)))
    
print(json.dumps(num, default=my_encode))
print(json.dumps(num, default=str))

In [None]:
class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return str(obj)
        return json.JSONEncoder.default(self, obj)
    
    def encode(self, obj):
        res = json.JSONEncoder.encode(self, obj)
        if isinstance(obj, list):
            return 'formatted:{}'.format(res)
        return res

data = ['hello world', Decimal('1.23'), [1.1234, 2, 3]]

print(json.dumps(data, cls=MyEncoder))

How do we load Decimal, complex, datetime, ...?

In [None]:
class DecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return {'__Decimal__': str(obj)}
        return json.JSONEncoder.default(self, obj)
    
def as_Decimal(dct):
    val = dct.get('__Decimal__')
    if val is not None:
        return Decimal(val)
    return dct

a = [Decimal('0.1'), Decimal('0.001')]
a_json = json.dumps(a, cls=DecimalEncoder)
print(a)
print(a_json)
b = json.loads(a_json, object_hook=as_Decimal)
print(b)

In [None]:
json_str = '[0.01, 0.001]'
a = json.loads(json_str, parse_float=Decimal)
print(a)
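
The same tagging trick used for `Decimal` above extends to other types. The following is a minimal sketch, not the only possible convention: the helper names `encode_extra` and `decode_extra` and the `__datetime__` / `__complex__` tags are made up for this illustration.

```python
import json
from datetime import datetime


def encode_extra(obj):
    """Fallback for json.dumps: wrap unsupported types in a tagged dict."""
    if isinstance(obj, datetime):
        return {'__datetime__': obj.isoformat()}
    if isinstance(obj, complex):
        return {'__complex__': [obj.real, obj.imag]}
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')


def decode_extra(dct):
    """object_hook: turn tagged dicts back into Python objects."""
    if '__datetime__' in dct:
        return datetime.fromisoformat(dct['__datetime__'])
    if '__complex__' in dct:
        real, imag = dct['__complex__']
        return complex(real, imag)
    return dct


original = {'created': datetime(2023, 11, 14, 18, 0, 0), 'z': 3 + 2j}
encoded = json.dumps(original, default=encode_extra)
restored = json.loads(encoded, object_hook=decode_extra)
print(encoded)
print(restored == original)
```

Both sides must agree on the tag names, so this convention is private to the application and other JSON consumers will only see ordinary objects.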

Q: How can this be made more convenient? <br>
A: Use other JSON modules or additional helper modules.

Useful modules:
* dataclasses-json

In [None]:
import simplejson

a = [0.1, Decimal('0.001')]
a_json = simplejson.dumps(a)
print(a_json)

a_parsed = simplejson.loads(a_json)
print(a_parsed, type(a_parsed[1]))

a_parsed_dec = simplejson.loads(a_json, use_decimal=True)
print(a_parsed_dec, type(a_parsed_dec[0]))
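
For plain record-like objects the standard library alone is often enough. The sketch below uses only `dataclasses.asdict` and `json`; it does not show the dataclasses-json API, and the `Customer` class is a hypothetical example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Customer:
    cust_id: str
    fname: str
    city: str = 'NY'


customer = Customer(cust_id='106156', fname='Max')

# asdict() recursively converts the dataclass into plain dicts and lists.
customer_json = json.dumps(asdict(customer))
print(customer_json)

# Rebuild the object by passing the decoded fields as keyword arguments.
customer_restored = Customer(**json.loads(customer_json))
print(customer_restored == customer)
```

This works as long as every field is itself JSON-serializable; nested dataclasses come back as plain dicts and need extra handling.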

**JSON pitfalls**

In [None]:
# Keys are always str

dct = {
    1: 'one',
    2: 'two',
    3: 'three',
}

dct_json = json.dumps(dct)
print('json_str:', dct_json)
print(json.loads(dct_json))
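
If integer keys must survive a round trip, one option is to convert them back while loading. This is a sketch under the assumption that every key that looks like an integer really was one; `int_keys` is a hypothetical helper name.

```python
import json


def int_keys(pairs):
    """object_pairs_hook: convert keys that look like integers back to int."""
    result = {}
    for key, value in pairs:
        try:
            key = int(key)
        except ValueError:
            pass  # leave non-numeric keys as strings
        result[key] = value
    return result


dct = {1: 'one', 2: 'two', 3: 'three'}
dct_json = json.dumps(dct)

plain = json.loads(dct_json)
restored = json.loads(dct_json, object_pairs_hook=int_keys)
print(plain)
print(restored == dct)
```

The caveat is that a genuine string key such as `"1"` would also become an integer, so the hook is only safe when the data's shape is known in advance.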

In [None]:
# Multiple dumps 

val1 = [1, 2, 3]
val2 = {'key': 'value'}

with open('bad.json', 'w') as fout:
    json.dump(val1, fout)
    fout.write('\n')
    json.dump(val2, fout)
    
!cat bad.json

In [None]:
with open('bad.json') as fin:
    json.load(fin)

In [None]:
with open('bad.json') as fin:
    for line in fin:
        print(json.loads(line))
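
Reading line by line only works when each document fits on exactly one line. A more general sketch uses `json.JSONDecoder.raw_decode`, which parses one document and reports where it ended; the generator name `iter_json_documents` is made up for this example.

```python
import json


def iter_json_documents(text):
    """Yield every JSON document found in a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it manually.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# The second document spans two lines, which breaks the line-by-line approach.
stream = '[1, 2, 3]\n{"key":\n  "value"}\n'
documents = list(iter_json_documents(stream))
print(documents)
```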

In [None]:
# repr misuse

arr = [1, 2, 3]
print(repr(arr))
print(json.loads(repr(arr)))

arr2 = ["Hello world", "!"]
print(repr(arr2))
print(json.loads(repr(arr2)))
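
The second call fails because `repr` produces Python literal syntax (single quotes), not JSON. A short sketch of the matching pairs: `json.dumps` with `json.loads`, and `repr` with `ast.literal_eval`.

```python
import ast
import json

arr2 = ["Hello world", "!"]

as_repr = repr(arr2)        # Python literal syntax: single quotes
as_json = json.dumps(arr2)  # JSON syntax: double quotes
print(as_repr)
print(as_json)

# repr() output is Python, not JSON, so parse it with ast.literal_eval.
from_repr = ast.literal_eval(as_repr)
from_json = json.loads(as_json)
print(from_repr == from_json == arr2)

repr_error = None
try:
    json.loads(as_repr)
except json.JSONDecodeError as exc:
    repr_error = exc
    print('Not JSON:', exc)
```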

Lack of cross-platform consistency:

https://github.com/jqlang/jq/issues/1959

**Useful dump/dumps arguments**

`json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)`

`json.dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)`

In [None]:
# indent + sort_keys

data = [
    {
        'name': 'Max',
        'age': 20,
    },
    {
        'name': 'Alex',
        'age': 31,
    }
]

print(json.dumps(data))
print(json.dumps(data, indent=2, sort_keys=True))
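
The `separators` argument from the signature above controls the opposite trade-off: the most compact output. A small sketch comparing the three forms on the same data:

```python
import json

data = [{'name': 'Max', 'age': 20}, {'name': 'Alex', 'age': 31}]

default_form = json.dumps(data)
compact_form = json.dumps(data, separators=(',', ':'))  # no spaces at all
pretty_form = json.dumps(data, indent=2, sort_keys=True)

print(len(default_form), len(compact_form), len(pretty_form))
print(compact_form)
```

All three strings decode to the same object; only the whitespace (and, with `sort_keys`, the key order) differs.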

In [None]:
# ensure_ascii (demonstrated here with json.dump)
msg = 'Привет мир!'

with open('file_ascii.txt', 'w') as fout:
    json.dump(msg, fout)
!cat file_ascii.txt
!echo "\n"

with open('file_utf8.txt', 'w', encoding='utf8') as fout:
    json.dump(msg, fout, ensure_ascii=False)
!cat file_utf8.txt

In [None]:
with open('file_utf8.txt', encoding='utf8') as fin:
    print(json.load(fin))

It is better to specify the file encoding explicitly when `ensure_ascii=False`.

Otherwise `locale.getpreferredencoding()` will be used.
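
The effect of `ensure_ascii` can also be seen without touching files. The sketch below compares the two outputs for the same message and their size once encoded as UTF-8.

```python
import json

msg = 'Привет мир!'

escaped = json.dumps(msg)                       # ensure_ascii=True by default
readable = json.dumps(msg, ensure_ascii=False)  # keeps non-ASCII characters
print(escaped)
print(readable)

# Size on disk if both strings are written as UTF-8.
escaped_bytes = escaped.encode('utf-8')
readable_bytes = readable.encode('utf-8')
print(len(escaped_bytes), len(readable_bytes))

# Both forms decode to the same Python string.
print(json.loads(escaped) == json.loads(readable) == msg)
```

The escaped form is safe under any ASCII-compatible encoding, which is why it is the default; the readable form is shorter but depends on the file being written and read with the same encoding.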

In [None]:
# allow_nan

num = float('inf')
print(json.dumps(num))
print(json.dumps(num, allow_nan=False))

**Useful load/loads arguments**

`json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)`

`json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)`

In [None]:
json.loads('{"foo": "bar"}', object_pairs_hook=print)
json.loads('{"foo": "bar"}', object_hook=print)
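
Two of these hooks are handy for strict parsing. The sketch below rejects the non-standard constants that `allow_nan` would otherwise produce and refuses objects with repeated keys; the function names `reject_constant` and `reject_duplicates` are hypothetical.

```python
import json


def reject_constant(name):
    """parse_constant hook: called with 'NaN', 'Infinity' or '-Infinity'."""
    raise ValueError(f'Non-standard JSON constant: {name}')


def reject_duplicates(pairs):
    """object_pairs_hook: fail on repeated keys instead of keeping the last."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f'Duplicate key: {key!r}')
        result[key] = value
    return result


last_wins = json.loads('{"a": 1, "a": 2}')  # default: the last value is kept
print(last_wins)

rejected = []
for text, kwargs in [
    ('[1.0, NaN]', {'parse_constant': reject_constant}),
    ('{"a": 1, "a": 2}', {'object_pairs_hook': reject_duplicates}),
]:
    try:
        json.loads(text, **kwargs)
    except ValueError as exc:
        rejected.append(str(exc))
        print('Rejected:', exc)
```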

<div align="center"><b><font size=6>YAML</font></b></div>
<div align="center"><img src="https://upload.wikimedia.org/wikipedia/commons/9/92/Yaml_logo.png"/></div>

<div align="center"><img src="https://miro.medium.com/max/2000/1*2Rly7p5CqW8-sb3CneOOxQ.png" /></div>

YAML - Yet Another Markup Language<br>
YAML - YAML Ain't Markup Language

Official site: https://yaml.org <br>
Specification (ver. 1.1): https://yaml.org/spec/1.1/current.html <br>
Specification (ver. 1.2): https://yaml.org/spec/1.2.2/

Quick guide: https://learnxinyminutes.com/docs/yaml/

See also: <a href="https://stackoverflow.com/questions/1726802/what-is-the-difference-between-yaml-and-json/1729545#1729545">What is the difference between YAML and JSON?</a>

### YAML is a superset of JSON (1.2)
<div align="center"><img src="https://imgs.xkcd.com/comics/standards.png" /></div>

#### Mapping

```yaml
one: 1
two: 2 # comment
0.125: float key
1: one
2: 'two'
"key with :": "value"
flag: false
null_value: null
```

#### Nested mapping

```yaml
nested_map_1:
  key: value
  nested_map_2:
      new_key: new_value
```

#### Sequence
   
```yaml
- Item 1
- Item 2
-
  - nested item 1
  - nested item 2
- - new nested item 1
  - new nested item 2
- - - yet another item 1
    - yet another item 2
-
  - nested key 1: value
  - nested key 2: value2  
```

#### Sequence inside a mapping
```yaml
outer_key:
  inner_key:
    - item 1
    - item 2
```

#### JSON in YAML
```yaml
json_map: {"key": "value"}
json_seq: [1, 2, 3, "hello"]
quotes are optional: {key: [1, 3, 3, hello]}
```

#### Sets

```yaml
set1:
  ? item1
  ? item2
  ? item3
  
set2: {item1, item2, item3}

set3:
  item1: null
  item2: null
  item3: null
```

YAML examples from the course:
* https://gitlab.manytask.org/python/public-2023-fall/-/blob/main/.gitlab-ci.yml

#### Dates
```yaml
datetime: 2020-08-10T10:30:42.3Z
datetime_with_space: 2020-08-10 10:30:42
date: 2020-08-10
```

#### Tags and types

https://yaml.org/refcard.html

```yaml
explicit_string: !!str 1.23
py_complex: !!python/complex 3+2j
```

#### Binary data (base64-encoded)

```yaml
generic: !!binary |
 R0lGODlhDAAMAIQAAP//9/X17unp5WZmZgAAAOfn515eXvPz7Y6OjuDg4J+fn5
 OTk6enp56enmlpaWNjY6Ojo4SEhP/++f/++f/++f/++f/++f/++f/++f/++f/+
 +f/++f/++f/++f/++f/++SH+Dk1hZGUgd2l0aCBHSU1QACwAAAAADAAMAAAFLC
 AgjoEwnuNAFOhpEMTRiggcz4BNJHrv/zCFcLiwMWYNG84BwwEeECcgggoBADs=
```

Libraries for working with YAML:
1. PyYAML
2. ruamel.yaml

Main functions of the PyYAML module:
1. `load` / `safe_load` / `unsafe_load` / `full_load`
2. `dump` / `safe_dump`
3. `load_all` / `safe_load_all` / `unsafe_load_all` / `full_load_all`
4. `dump_all` / `safe_dump_all`

**Why so many?**
1. `def load(stream, Loader=None)`
2. `def dump(data, stream=None, Dumper=Dumper, **kwds)`

Arbitrary code execution issues:
1. https://github.com/yaml/pyyaml/pull/386
2. https://github.com/yaml/pyyaml/issues/420
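
The issues linked above come down to YAML tags that tell the loader to construct arbitrary Python objects. A minimal sketch, assuming PyYAML is installed: the payload string is a hypothetical example and deliberately harmless (it only names `os.getcwd`), and it is passed to `safe_load` only, so nothing is executed here.

```python
import yaml

# A document that asks the loader to call a Python function.
# os.getcwd is harmless, but an attacker could name any importable callable.
untrusted = '!!python/object/apply:os.getcwd []'

blocked = False
try:
    yaml.safe_load(untrusted)
except yaml.YAMLError as exc:
    blocked = True
    print('safe_load refused the document:', exc)

# Ordinary data is unaffected by the restriction.
print(yaml.safe_load('key: [1, 2, 3]'))
```

The safe loader only knows the standard YAML tags, so `python/...` tags are rejected before any object is built.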

<p style="color:red">
If you do not trust the source of a YAML file, use only the safe functions.
</p>

In [None]:
import yaml

data = yaml.safe_load('''
key 1: 
  - Item 1
  - Item 2
key 2:
  inner key: 10.5
''')
print(data)

In [None]:
print(yaml.safe_dump(data))

In [None]:
with open('sample.yaml', 'w') as fout:
    yaml.safe_dump(data, fout)
    
!cat sample.yaml
!echo "\n"

with open('sample.yaml') as fin:
    print(yaml.safe_load(fin) == data)

In [None]:
arr1 = [1, 2, 3]
arr2 = [4, 5, 6]

with open('multiple.yaml', 'w') as fout:
    yaml.safe_dump(arr1, fout)
    yaml.safe_dump(arr2, fout)
    
!cat multiple.yaml
!echo "\n"

with open('multiple.yaml') as fin:
    print(yaml.safe_load(fin))

In [None]:
arr1 = [1, 2, 3]
arr2 = [4, 5, 6]

with open('multiple_fix.yaml', 'w') as fout:
    yaml.safe_dump(arr1, fout, explicit_start=True)
    yaml.safe_dump(arr2, fout, explicit_start=True)
    
!cat multiple_fix.yaml
!echo "\n"

with open('multiple_fix.yaml') as fin:
    for arr in yaml.safe_load_all(fin):
        print(arr)

In [None]:
# safe_dump vs safe_dump_all

arr = [1, 2, 3, 4]
print(yaml.safe_dump(arr, explicit_start=True))
print()
print(yaml.safe_dump_all(arr, explicit_start=True))

In [None]:
arr = [1, 2, 3, 4]
dump_str = yaml.safe_dump_all(arr, explicit_start=True)
for item in yaml.safe_load_all(dump_str):
    print(item, type(item))
    
print(yaml.safe_load(dump_str))

<div align="center"><b><font size=6>Python Pickle</font></b></div>

A module for serializing and deserializing arbitrary Python objects.

Key features:
1. Binary format
2. Supports most Python objects
3. Not secure (deserialization can lead to arbitrary code execution)
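
To make the last point concrete, here is a sketch of how a pickle can run code on load, followed by a restricted unpickler in the spirit of the "restricting globals" recipe from the pickle documentation. The `Payload` class and the allow-list are illustrative assumptions, and the payload only calls `print`.

```python
import builtins
import io
import pickle


class Payload:
    """On unpickling, pickle calls print(...) instead of building a Payload."""

    def __reduce__(self):
        return (print, ('code executed during pickle.loads',))


malicious = pickle.dumps(Payload())
pickle.loads(malicious)  # runs print(); an attacker would pick another callable


class RestrictedUnpickler(pickle.Unpickler):
    SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

    def find_class(self, module, name):
        # Only allow a small, explicit set of globals to be looked up.
        if module == 'builtins' and name in self.SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f'global {module}.{name} is forbidden')


def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


allowed = restricted_loads(pickle.dumps([1, 2, range(15)]))
print(allowed)

blocked = False
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    blocked = True
    print('Blocked:', exc)
```

An allow-list narrows the attack surface but should not be treated as a full sandbox; for untrusted input, a data-only format such as JSON is the safer choice.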

The pickle format has 6 protocol versions: <br>
0 - the original "human-readable" format <br>
1 - the old binary format <br>
2 - a binary format introduced in Python 2.3 (more efficient pickling of new-style classes) <br>
3 - a binary format introduced in Python 3.0 (the default in Python 3.0-3.7) <br>
4 - a newer binary format, added in Python 3.4; it became the default in Python 3.8 <br>
5 - an extension of protocol 4 with out-of-band data support, added in Python 3.8 <br>

Any protocol can be chosen explicitly. <br>
However, the higher the protocol version, the newer the Python needed to read the resulting pickle.

In [None]:
import sys
import pickle

print(sys.version_info)
print(pickle.HIGHEST_PROTOCOL)
print(pickle.DEFAULT_PROTOCOL)

The pickle module has 4 main functions: 2 for working with streams (binary file objects) and 2 for working with byte strings (`bytes`).

Stream:
1. dump
2. load

Byte string:
1. dumps
2. loads

In [None]:
data = {
    'one': 1,
    'two': 2,
    'three': 3,
}

for protocol_version in range(6):
    data_dump = pickle.dumps(data, protocol=protocol_version)
    print(protocol_version, ':', data_dump)
    print(data == pickle.loads(data_dump))

In [None]:
with open('simple.pkl', 'wb') as fout:
    pickle.dump(data, fout)

with open('simple.pkl', 'rb') as fin:
    print(pickle.load(fin))

In [None]:
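# Missing class example: the class definition changes between dump and load

from dataclasses import dataclass
import pickle
from decimal import Decimal

@dataclass
class Item:
    name: str
    price: Decimal = Decimal('0.0')  # the default must be a Decimal, not a str
    quantity: int = 0
        
    @property
    def total_cost(self):
        return self.price * self.quantity
    

item = Item('book', price=Decimal('1.23'), quantity=2)
print(item.total_cost)

item_pickled = pickle.dumps(item, protocol=0)
print(item_pickled)

# Redefine the class without `quantity` and `total_cost`:
# the pickle holds only the instance attributes and a reference to the class by name.
@dataclass
class Item:
    name: str
    price: Decimal = Decimal('0.0')

# Unpickling uses whichever `Item` is defined at load time.
print(pickle.loads(item_pickled))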

In [None]:
# lambda example

my_lambda = lambda x: x ** 2
print(my_lambda(2))

pickle.dumps(my_lambda)

In [None]:
def func(x):
    return x + 2

pickle.dumps(func, protocol=0)

Pickle stores **only** the attributes of an object. <br>
For classes and functions it stores **only** their identifiers (module and qualified name), which are later used to "restore" the object by importing it.

**Useful arguments** <br>
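
A quick way to see this is to pickle a function from an importable module and inspect the result. The sketch uses `json.dumps` purely as a convenient, always-available function.

```python
import json
import pickle
import pickletools

func_pickled = pickle.dumps(json.dumps)

# The payload is tiny: it holds the names needed to import the function,
# not its bytecode.
print(len(func_pickled), func_pickled)
pickletools.dis(func_pickled)

# Unpickling imports the module and looks the name up again,
# so the very same function object comes back.
same_object = pickle.loads(func_pickled) is json.dumps
print(same_object)
```

This is also why a lambda cannot be pickled: it has no importable name that the loader could look up.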
`pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)` <br>
`pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)`

In [None]:
# Custom pickle logic
from urllib3 import PoolManager


class HTTPAdapter:
    __attrs__ = ('pool_connections', 'pool_timeout')
    
    def __init__(self, pool_connections, pool_timeout):
        self.pool_connections = pool_connections
        self.pool_timeout = pool_timeout
        self.pool_manager = None  # Set in _init_pool_manager
        self._init_pool_manager()
    
    def __getstate__(self):
        print('In __getstate__')
        return {attr: getattr(self, attr, None) for attr in self.__attrs__}
    
    def __setstate__(self, state):
        print('In __setstate__')
        for attr, value in state.items():
            setattr(self, attr, value)
        self._init_pool_manager()  # Reinit pool after deserialization
    
    def _init_pool_manager(self):
        self.pool_manager = PoolManager(self.pool_connections, timeout=self.pool_timeout)
        
        
adapter = HTTPAdapter(42, 32)
adapter_dump = pickle.dumps(adapter, protocol=0)
print(adapter_dump)
print('======Break======')
pickle.loads(adapter_dump)
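
The same pattern works for any attribute that cannot be pickled. Below is a standard-library-only sketch with a `threading.Lock`; the `Counter` class is a hypothetical example and is not related to the adapter above.

```python
import pickle
import threading


class Counter:
    """A counter guarded by a lock; the lock itself cannot be pickled."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_lock']  # drop the unpicklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # create a fresh lock after loading


counter = Counter()
counter.increment()

restored = pickle.loads(pickle.dumps(counter))
restored.increment()
print(counter.value, restored.value)
```

Note that `__init__` is not called during unpickling, so `__setstate__` is the place to rebuild anything that was left out of the state.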

<div align="center"><b><font size=6>Apache Parquet</font></b></div>
<div align="center"><img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Apache_Parquet_logo.svg"/></div>

* Write once, read many
* Disk-efficient, column based format
* Like CSV, but more efficient
  

#### How to store tabular data

1. Row-based: sequentially store rows (CSV).
2. Column-based: sequentially store columns (ORC).
3. Hybrid: sequentially store chunks of columns (Parquet).
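
The three layouts can be illustrated on a toy table by looking at the order in which values would be written. This is a conceptual sketch in plain Python, not how any of these formats is actually encoded; the function names are made up.

```python
rows = [
    (0, 'info_event'),
    (10, 'warn_event'),
    (42, 'error_event'),
    (43, 'info_event'),
]


def row_layout(rows):
    """Row-based: values of one row are adjacent."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column-based: all values of one column are adjacent."""
    return [value for column in zip(*rows) for value in column]


def hybrid_layout(rows, group_size):
    """Hybrid: split rows into groups, store each group column by column."""
    flat = []
    for start in range(0, len(rows), group_size):
        flat.extend(column_layout(rows[start:start + group_size]))
    return flat


print(row_layout(rows))
print(column_layout(rows))
print(hybrid_layout(rows, group_size=2))
```

In the hybrid output each group of two rows is self-contained, which is what lets a reader process row groups independently while still scanning one column at a time inside a group.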

<div align="center"><img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*QEQJjtnDb3JQ2xqhzARZZw.png"/></div>

#### How a Parquet file is laid out

https://github.com/apache/parquet-format/blob/master/README.md#glossary

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

<div align="center"><img src="https://parquet.apache.org/images/FileLayout.gif"/></div>

<div align="center"><img src="https://image.slidesharecdn.com/thecolumnarroadmap-180625170240/75/the-columnar-roadmap-apache-parquet-and-apache-arrow-14-2048.jpg?cb=1667639755"/></div>

#### Levels of parallelism
* Processing (MapReduce, Spark, ...) - File/Row Group
* IO - Column chunk
* Encoding/Compression - Page

#### Speeding up filtering: statistics

https://github.com/apache/parquet-format/blob/master/README.md#sort-order

https://github.com/apache/parquet-format/blob/master/PageIndex.md
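
The idea behind these statistics can be sketched in a few lines of plain Python: keep the minimum and maximum of each chunk and skip chunks that cannot contain a match. The `RowGroup` class and helper functions are hypothetical and only model the principle, not the Parquet metadata format.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list
    min_value: int
    max_value: int


def make_row_groups(values, group_size):
    """Split values into groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def filter_greater_than(groups, threshold):
    """Return matching values and the number of groups actually scanned."""
    matches, scanned = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove that no value in this group can match
        scanned += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, scanned


groups = make_row_groups([1, 3, 5, 7, 9, 11, 13, 15], group_size=2)
matches, scanned = filter_greater_than(groups, threshold=10)
print(matches, f'scanned {scanned} of {len(groups)} row groups')
```

The skipping is most effective when the data is sorted (or at least clustered) by the filtered column, because then the min/max ranges of different groups barely overlap.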

<div align="center"><img src="https://image.slidesharecdn.com/thecolumnarroadmap-180625170240/75/the-columnar-roadmap-apache-parquet-and-apache-arrow-33-2048.jpg?cb=1667639755"/></div>

<div align="center"><img src="https://image.slidesharecdn.com/thecolumnarroadmap-180625170240/75/the-columnar-roadmap-apache-parquet-and-apache-arrow-34-2048.jpg?cb=1667639755"/></div>

#### Code demo

https://arrow.apache.org/docs/python/index.html

https://arrow.apache.org/docs/python/parquet.html

https://github.com/chhantyal/parquet-cli

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq

from datetime import datetime

In [None]:
events_schema = pa.schema([
    ('id', pa.int64()),
    ('timestamp', pa.timestamp('ns')),
    ('event_name', pa.string())
])


ids = pa.array(
    [0, 10, 42],
    type=pa.int64()
)
timestamps = pa.array(
    [
        datetime(2023, 11, 14, 18, 0, 0),
        datetime(2023, 11, 14, 18, 0, 1),
        datetime(2023, 11, 14, 18, 0, 2),
    ],
    type=pa.timestamp('ns')
)
event_names = pa.array(
    ['info_event', 'warn_event', 'error_event'],
    type=pa.string()
)

batch = pa.RecordBatch.from_arrays(
    [ids, timestamps, event_names],
    schema=events_schema
)

table1 = pa.Table.from_batches([batch])
table2 = pa.Table.from_arrays(
    [ids, timestamps, event_names],
    schema=events_schema
)
    

In [None]:
table1.equals(table2)

In [None]:
pq.write_table(table1, 'pa_table.parquet', row_group_size=2)

In [None]:
import pandas as pd

df = pd.DataFrame(data={
    'id': [0, 10, 42],
    'timestamp': [
        pd.Timestamp(2023, 11, 14, 18, 0, 0),
        pd.Timestamp(2023, 11, 14, 18, 0, 1),
        pd.Timestamp(2023, 11, 14, 18, 0, 2),
    ],
    'event_name': ['info_event', 'warn_event', 'error_event'],
})

In [None]:
df.to_parquet('pd_dataframe.parquet', engine='pyarrow', row_group_size=2)

In [None]:
table_from_pandas = pq.read_table('pd_dataframe.parquet')
df_from_arrow = pd.read_parquet('pa_table.parquet', engine='pyarrow')

table_from_pandas.to_pandas().info()
print()
df_from_arrow.info()

In [None]:
table_from_pandas.to_pandas()

In [None]:
df_from_arrow

In [None]:
df_from_arrow.equals(table_from_pandas.to_pandas())

ParquetFile

In [None]:
parquet_file = pq.ParquetFile('pa_table.parquet')

In [None]:
parquet_file.metadata

In [None]:
parquet_file.metadata.row_group(1)

In [None]:
parquet_file.metadata.row_group(0).column(0)

In [None]:
parquet_file.metadata.row_group(0).column(0).statistics

In [None]:
parquet_file.read()

In [None]:
parquet_file.read_row_group(0)

#### Parquet configuration

https://parquet.apache.org/docs/file-format/configurations/

#### Parquet: what's next
1. Video: <a href="https://www.youtube.com/watch?v=dPb2ZXnt2_U">The columnar roadmap: Apache Parquet and Apache Arrow</a>
1. In-memory: <a href="https://arrow.apache.org">Apache Arrow</a>
1. Datalake: <a href="https://iceberg.apache.org">Apache Iceberg</a>,  <a href="https://hudi.apache.org">Apache Hudi</a>, <a href="https://delta.io">Delta Lake</a>
