# Storing JSON Documents in Db2
Updated: 2019-09-14

This notebook describes the ways that JSON can be stored inside a Db2 database.

## Storing JSON Documents in a Relational Database
The ISO standard does not specify how JSON documents should be stored in relational databases, leaving that decision to the database vendors. Since JSON can be easily stored in either its native character format or in the binary format (BSON) using existing database data types, most database products, including Db2, have chosen to not implement a native JSON data type.

One prerequisite for storing JSON in its character form is that it must be encoded in Unicode; the default Db2 encoding is UTF-8.

Db2 provides several data type options for both JSON formats and the DBA can decide based on their individual needs. Some of the factors to consider when choosing a database data type are:

* whether the original source data is in JSON or BSON format
* whether the desired column data type is supported for the table type chosen
* whether the maximum data length is compatible with the data type
* whether query performance is a critical success factor
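
The data length factor can be checked before a column type is chosen. The following is a minimal sketch in plain Python, assuming the column length is counted in bytes and the documents are encoded in UTF-8. `max_utf8_length` is a hypothetical helper and the two sample documents are made up for illustration.

```python
def max_utf8_length(documents):
    """Return the largest UTF-8 encoded size, in bytes, among the documents."""
    return max(len(doc.encode("utf-8")) for doc in documents)

# Made-up sample documents; the second contains a non-ASCII character.
samples = [
    '{"customerid": 100000, "city": "Amherst"}',
    '{"customerid": 100001, "city": "Zürich"}',   # "ü" takes 2 bytes in UTF-8
]

longest = max_utf8_length(samples)

# Compare against a candidate column definition such as VARCHAR(2000).
fits_varchar_2000 = longest <= 2000
print(longest, fits_varchar_2000)
```

Measuring encoded bytes rather than characters matters because a character outside the ASCII range occupies more than one byte in UTF-8.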

### Representing JSON as Character Strings
JSON is convenient for developers for a number of reasons, including the fact that it is "readable" since it is just a character string. The data is presented in a form that doesn't require any conversion or formatting to make sense of it. This makes interchanging data with other applications and systems much simpler.

Whitespace (blanks) between the elements of a JSON document is ignored. This allows a developer to format the record with indentation and spacing so that its structure is easy to read and understand. JSON itself does not require this extra formatting; it is strictly for the benefit of the human eye.
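
As a small illustration using only Python's standard `json` module, the sketch below parses an indented and a compact rendering of the same record and compares the results. The record is a shortened version of a customer document, used here only as an example.

```python
import json

# The same record written with and without formatting whitespace.
pretty = """
{
    "customerid": 100000,
    "identity": {
        "firstname": "Jacob",
        "lastname": "Hines"
    }
}
"""
compact = '{"customerid":100000,"identity":{"firstname":"Jacob","lastname":"Hines"}}'

# Whitespace between tokens is ignored by the parser.
same_document = json.loads(pretty) == json.loads(compact)

# The formatting still costs bytes when the text is stored as-is.
pretty_bytes = len(pretty.encode("utf-8"))
compact_bytes = len(compact.encode("utf-8"))

print(same_document, pretty_bytes, compact_bytes)
```

Both strings describe the same document, but the indented form occupies more bytes when stored as a character string.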

### Binary JSON (BSON)
There are drawbacks associated with representing JSON as human-readable strings. The storage required for formatting whitespace adds up when dealing with millions of records. Searching for values in a JSON record requires traversing the document and parsing every value encountered. The overhead of hundreds of users searching millions of documents can quickly become significant.

To improve the access time to individual fields and values within a JSON document, vendors have developed alternative storage formats for the data. One popular format is BSON, short for Binary JSON. Code libraries are available for most programming languages to convert JSON into the more efficient BSON format.

BSON is often, though not always, slightly smaller than the equivalent JSON text, but its main advantage is in searching within documents. The document is parsed into an internal format which allows for the efficient traversal of the fields and values. The overhead of converting a document to BSON can be quickly recovered when searching for fields within a large document.
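
The layout that makes BSON quick to traverse can be illustrated with a small hand-written encoder. This is only a sketch of the public BSON layout for two value types (strings and 32-bit integers). `encode_bson` is a hypothetical helper written for this illustration, not a Db2 function, and it is no substitute for a real conversion routine.

```python
import struct

def encode_bson(doc):
    """Encode a flat dict of str and 32-bit int values as a BSON document."""
    body = b""
    for name, value in doc.items():
        key = name.encode("utf-8") + b"\x00"      # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02 (string): int32 byte length, the bytes, then a NUL
            body += b"\x02" + key + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # type 0x10 (32-bit integer): four bytes, little-endian
            body += b"\x10" + key + struct.pack("<i", value)
        else:
            raise TypeError("this sketch handles only str and 32-bit int")
    # Total length covers the 4-byte length field, the body, and a final NUL.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

hello = encode_bson({"hello": "world"})
small = encode_bson({"a": 1})

print(hello)
print(len(small), len('{"a":1}'))
```

Every document and every string carries a length prefix, so a reader can skip over values it does not need instead of parsing them character by character. The second example also shows why BSON is not always smaller: `{"a":1}` is 7 bytes as compact JSON text but 12 bytes in this encoding.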

From a development perspective, JSON is the record structure that is being stored and manipulated whether it is stored in human-readable format or stored in a binary BSON form. From a processing perspective, BSON is the format typically used when query performance is critical.

### Character versus Binary JSON
Since JSON can be stored in character format or in binary format (BSON), the decision of which one to use is left up to the user. BSON has some slight space and compute advantages over JSON, but given the advanced compression capabilities of Db2, the space savings alone provide little benefit.

One reason to use BSON is for compatibility with existing applications that already create BSON objects. The JSON functions can determine from the data type which format the data is in and adjust the processing as required. This means that from a development perspective, there is no need to convert from JSON to BSON (or vice-versa) to use the JSON functions.

BSON is supported within Db2 in two ways. Data can be converted to BSON as it is inserted into a column using the built-in **`JSON_TO_BSON`** function, or the application can use its own routines to convert character strings into the proper BSON format. The BSON format used by the earlier Db2 JSON API functions is supported in addition to the new format. Applications that were written using the JSON API BSON format can use this new set of JSON functions without converting the data.

There are some restrictions to what the BSON document can contain within it. The following table summarizes the BSON data types that are supported by Db2.
 
| BSON ID  | TYPE 
|--------: |:-----
| 1 | Double               
| 2 | String               
| 3 | Object               
| 4 | Array                
| 8 | Boolean              
| 9 | Date                 
|10 | Null      
|16 | 32-bit integer
|18 | 64-bit integer

Any BSON types outside of these values will not be recognized during processing.
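
The table above can be turned into a quick pre-check in Python. The sketch below walks a parsed JSON document and reports which BSON type IDs its values would need. `bson_type_ids` is a hypothetical helper. The rule that integers fitting in 32 bits map to type 16 and larger ones to type 18 is an assumption about a typical converter, not documented Db2 behavior, and the `active` and `note` fields are made up for illustration.

```python
import json

# BSON type IDs taken from the table of supported types.
DOUBLE, STRING, OBJECT, ARRAY, BOOLEAN, NULL, INT32, INT64 = 1, 2, 3, 4, 8, 10, 16, 18

def bson_type_ids(value):
    """Return the set of BSON type IDs a parsed JSON value would need."""
    if value is None:
        return {NULL}
    if isinstance(value, bool):        # test bool first: bool is a subclass of int
        return {BOOLEAN}
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return {INT32}
        if -2**63 <= value < 2**63:
            return {INT64}
        raise ValueError("integer does not fit in 64 bits")
    if isinstance(value, float):
        return {DOUBLE}
    if isinstance(value, str):
        return {STRING}
    if isinstance(value, list):
        return {ARRAY}.union(*(bson_type_ids(v) for v in value))
    if isinstance(value, dict):
        return {OBJECT}.union(*(bson_type_ids(v) for v in value.values()))
    raise TypeError("unexpected type: " + type(value).__name__)

doc = json.loads(
    '{"customerid": 100000, "city": "Amherst", "active": true,'
    ' "purchases": [{"item_cost": 51.8, "note": null}]}'
)
ids = bson_type_ids(doc)
print(sorted(ids))
```

JSON text has no date literal, so type 9 (Date) never appears when starting from a character document; it only arises when an application builds BSON directly.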

### Sample JSON Data Set
The examples found in this chapter use a JSON data set (customers.js) that needs to be generated. A sample document is found below:
```json
{
    "customerid": 100000,
    "identity": 
      {
        "firstname": "Jacob", "lastname": "Hines", 
        "birthdate": "1982-09-18"
      },
    "contact": 
      {
        "street": "Main Street North",
        "city": "Amherst", "state": "OH", "zipcode": "44001",
        "email": "Ja.Hines@yahii.com",
        "phone": "813-689-8309"
      },
    "payment": 
      {
        "card_type": "MCCD", "card_no": "4742-3005-2829-9227"
      },
    "purchases": 
    [
      {
        "tx_date": "2018-02-14",
        "tx_no": 157972,
        "product_id": 1860,
        "product": "Ugliest Snow Blower",
        "quantity": 1,
        "item_cost": 51.8
      }, ... additional purchases ...
    ]
}
```
The JSON document contains five distinct pieces of information:
* Customerid – Primary key
* Identity – Information on the customer including name and birthdate
* Contact – Address, email, and phone number information
* Payment – Current payment card that is used
* Purchases – The purchases that the customer has made

The purchases array contains one entry for each item the customer purchased. Each entry holds the following information:
* tx_date – Date of the transaction
* tx_no – Transaction number
* product_id – Id for the product
* product – Name of the product
* quantity – Quantity of products purchased
* item_cost – Cost of one product

In a relational design, you would probably split these fields into different tables and use joins to bring the information back together. In a JSON document, all of this information is kept in one place, which makes retrieving an individual customer's purchases easier.
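
To show what "one place" means in practice, the following sketch uses Python's standard `json` module to read a shortened copy of the sample document and work with the customer's name and purchases without any joins. The variable names are chosen for this illustration only.

```python
import json

# Shortened copy of the sample customer document.
document = """
{
    "customerid": 100000,
    "identity": {"firstname": "Jacob", "lastname": "Hines", "birthdate": "1982-09-18"},
    "purchases": [
        {"tx_date": "2018-02-14", "tx_no": 157972, "product_id": 1860,
         "product": "Ugliest Snow Blower", "quantity": 1, "item_cost": 51.8}
    ]
}
"""

customer = json.loads(document)

# Nested objects are reached by walking the keys.
name = customer["identity"]["firstname"] + " " + customer["identity"]["lastname"]

# The purchases array is iterated directly; no second table is involved.
total = sum(p["quantity"] * p["item_cost"] for p in customer["purchases"])

print(name, total)
```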

### Load Db2 Extensions and Connect to the Database
The `connection` notebook contains the `CONNECT` statement which allows access to the `SAMPLE` database. If you need to modify the connection information, edit the `connection.ipynb` notebook.

In [None]:
%run ../db2.ipynb
%run ../connection.ipynb

### Create the Customer File
The following code will generate 25,000 customer records formatted as JSON documents.

In [None]:
%run generate_json.ipynb

### Creating a Table with Character JSON Columns
JSON data can be stored in any column that is defined as a character data type. The format of the table can be either ROW organized, or COLUMN organized. In the case of `COLUMN ORGANIZED` tables, the CLOB column data type is only supported in Db2 11.5.

The following SQL demonstrates the various ways a JSON character column can be defined in a table:

In [None]:
%%sql -quiet
DROP TABLE JSON_DATA;
CREATE TABLE JSON_DATA 
  (
    FIELD1 CHAR(255),
    FIELD2 VARCHAR(300),
    FIELD3 CLOB(1000)
  );

When using a CLOB column, specify an `INLINE LENGTH` so that as much of the data as possible is stored on the data page, where it benefits from buffer pool caching. If you do not specify an inline length for CLOB columns, the JSON data will not reside in the buffer pool, and searching and retrieving it will take an additional I/O operation.
The following SQL will recreate the JSON_DATA table specifying an inline length for the JSON column.

In [None]:
%%sql -quiet
DROP TABLE JSON_DATA;
CREATE TABLE JSON_DATA 
  (
    JSON CLOB(1000) INLINE LENGTH 1000
  );

Consideration should also be given to using a large enough table page size (32K) so that all of the JSON data can be stored on it.

**Note:** To use the Db2 JSON SYSTOOLS functions, you must store the data as BSON in BLOB objects.

### Creating a Table with Binary JSON (BSON) Columns
BSON data can be stored in columns defined as a binary data type which encompasses the `BINARY`, `VARBINARY`, and `BLOB` data types. There is one additional binary data type that is supported, `FOR BIT DATA` character columns. With the introduction of `BINARY` and `VARBINARY` fields, there is little reason to use the `FOR BIT DATA` specification. In the case of `COLUMN ORGANIZED` tables, the `BLOB` column data type is only supported in Db2 11.5.

In [None]:
%%sql -quiet
DROP TABLE JSON_DATA;
CREATE TABLE JSON_DATA 
  (
    FIELD1 BINARY(255),
    FIELD2 VARBINARY(300),
    FIELD3 BLOB(1000),
    FIELD4 VARCHAR(300) FOR BIT DATA
  );

When using a BLOB column, the same considerations mentioned for CLOB columns in the previous section also apply.

### Differences between JSON and BSON Storage
There are a number of considerations when choosing BSON over JSON. Using BSON can result in space savings (most of the time!) but requires extra processing to convert the JSON character strings. To illustrate the space savings, the following example uses the customers.js file created earlier in this notebook.

The customer file (customers.js) contains a single row for each JSON record, similar to the following:
```json
{"customerid": 100000, "identity": {"firstname": "Jonathan",...
```
Rather than having to write an application to read and insert the data, the Db2 IMPORT command can be used to insert this data in one step.
```sql
CREATE TABLE JSON_RAW_DATA 
  (
  CUSTOMER VARCHAR(2000)
  );
IMPORT FROM customers.js OF ASC METHOD l(1 2000) 
    INSERT INTO JSON_RAW_DATA;
```

### Load the Customer file into a table
The load step needs the full path of the customers.js file. Run the following cell first to build it from the current working directory.

In [None]:
import os
# Full path to the generated customer file in the working directory.
fname = os.getcwd() + "/customers.js"
print("Input file: " + fname)

Next we create the table that will contain the customer data.

In [None]:
%%sql -quiet 
DROP TABLE JSON_RAW_DATA;
CREATE TABLE JSON_RAW_DATA 
  (
  CUSTOMER VARCHAR(2000)
  );

The following code uses the Db2 magic commands to load the data into the table, committing every 5000 rows.

In [None]:
import io
import json
import time

print("Starting Load")
start_time = time.time()
i = 0  # rows read so far; set up front so the final print works even if prepare fails
%sql autocommit off
x = %sql prepare INSERT INTO JSON_RAW_DATA VALUES (?)
if (x != False):
    with open(fname,"r") as records:
        for record in records:
            i += 1
            rc = %sql execute :x using record@char
            if (rc == False): break
            if ((i % 5000) == 0): 
                print(str(i)+" rows read.")
                %sql commit hold
                
    %sql commit work  
%sql autocommit on
end_time = time.time()
print('Total load time for {:d} records is {:.2f} seconds'.format(i,end_time-start_time))
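
The load loop above inserts each line as-is. If the input file might contain blank or damaged lines, an optional pre-check can filter them out first. The sketch below is one way to do that with Python's standard `json` module. `valid_records` is a hypothetical helper, and the in-memory sample stands in for the real file so the example can run on its own.

```python
import io
import json

def valid_records(lines):
    """Yield (line_number, text) for each line that parses as a JSON object."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()            # drop the trailing newline
        if not text:
            continue                   # ignore blank lines
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            print("Skipping line " + str(number) + ": not valid JSON")
            continue
        if isinstance(parsed, dict):
            yield number, text

# Made-up sample input: line 2 is truncated and line 3 is blank.
sample_file = io.StringIO(
    '{"customerid": 100000}\n'
    '{"customerid": 100001\n'
    '\n'
    '{"customerid": 100002}\n'
)
records = list(valid_records(sample_file))
print(records)
```

In the notebook, the same generator could wrap the open file object so that only the text of valid records is passed to the prepared `INSERT`.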

In this step we create two tables to compare the storage formats: `JSON_CHAR` uses a character format and `JSON_BINARY` uses a binary format. A third table, `CUSTOMERS`, holds a character copy of the documents.

In [None]:
%%sql -q
DROP TABLE JSON_CHAR;
CREATE TABLE JSON_CHAR 
  (
    CUSTOMER VARCHAR(2000)
  );
    
DROP TABLE JSON_BINARY;
CREATE TABLE JSON_BINARY 
  (
    CUSTOMER VARBINARY(2000)
  );
DROP TABLE CUSTOMERS;
CREATE TABLE CUSTOMERS
  (
  INFO VARCHAR(2000)
  );

The data from the base table is inserted into these tables using `INSERT INTO ... SELECT ... FROM` syntax. The sizes of the character and binary tables are compared after the inserts complete.

In [None]:
%sql -q INSERT INTO CUSTOMERS SELECT * FROM JSON_RAW_DATA;
%sql -q INSERT INTO JSON_CHAR SELECT * FROM JSON_RAW_DATA;
char_load = sqlelapsed
%sql -q INSERT INTO JSON_BINARY SELECT JSON_TO_BSON(CUSTOMER) FROM JSON_RAW_DATA;
blob_load = sqlelapsed
char_size = %sql -r SELECT SUM(LENGTH(CUSTOMER)) FROM JSON_CHAR
blob_size = %sql -r SELECT SUM(LENGTH(CUSTOMER)) FROM JSON_BINARY
%sql -bar values ('CHAR',:char_size[1]),('BLOB',:blob_size[1])

The difference in storage is minimal but generally the size of a BSON table will be about 5% less than a character based table. Converting JSON data to BSON does incur some additional overhead so that may also be a consideration when storing the data. The execution time was captured from the previous `INSERT` statements and is summarized in the graph below.

In [None]:
%sql -bar values ('CHAR',:char_load),('BLOB',:blob_load)

Converting the data to BSON does add overhead to the `INSERT` time. If you are only storing and retrieving entire JSON documents, then the conversion to BSON may not be necessary. However, if you are continually querying these documents, the improved query time will be worth the overhead of converting to BSON. BSON has an internal tree structure that makes querying documents much more efficient, while character-based JSON objects must first be converted to BSON before any searching can be performed.

### Summary
JSON data can be stored in character or binary format. The decision of which format to use is left up to the DBA, and the Db2 functions can work against either format. Consideration should be given to storing documents in BSON format if the contents of the document are to be queried. While BSON has a slight advantage in space savings over character-based JSON documents, it does incur added overhead during the conversion process. However, this additional overhead is easily justified by the improved performance when searching within a document.

#### Credits: IBM 2019, George Baklarz [baklarz@ca.ibm.com]