# Quick Start

Suppose you have just exported a DynamoDB table to S3 and now want to analyze the exported data.

## Get Sample Data

In [1]:
import json
import time
from pathlib import Path
from urllib import request

import polars as pl
from rich import print as rprint
from fast_dynamodb_json.vendor.polars_utils import pprint_df
from fast_dynamodb_json.vendor.timer import DateTimeTimer

dir_here = Path.cwd()
path_json_gz = dir_here / "data.json.gz"

In [2]:
url = "https://github.com/MacHu-GWU/fast_dynamodb_json-project/releases/download/0.1.1/xz5bebbfty4bvho4mujjbfdz7m.json.gz"
with request.urlopen(url) as response:
    path_json_gz.write_bytes(response.read())

## Load Export JSON Line Data

To load the exported JSON Lines data, we strongly recommend the [polars.read_ndjson](https://docs.pola.rs/api/python/stable/reference/api/polars.read_ndjson.html) API. It removes the need to parse JSON line by line in Python: Polars reads the NDJSON (Newline Delimited JSON) file with its own optimized parser and returns a Polars DataFrame directly.

In [3]:
df = pl.read_ndjson(str(path_json_gz))
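
For comparison, the manual line-by-line parsing that ``polars.read_ndjson`` replaces might look like the sketch below. It uses only the standard library and a tiny stand-in file with invented contents, so the helper name ``read_ndjson_manually`` and the sample lines are assumptions rather than part of the notebook.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_ndjson_manually(path):
    """Parse a gzip-compressed NDJSON file one line at a time."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Tiny stand-in file (hypothetical contents) so the sketch runs on its own.
sample_lines = [
    '{"Item": {"OrderID": {"S": "order-1"}}}',
    '{"Item": {"OrderID": {"S": "order-2"}}}',
]
tmp_path = Path(tempfile.mkdtemp()) / "sample.json.gz"
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")

records = read_ndjson_manually(tmp_path)
print(len(records))
```

Every line becomes a Python dict here, which is exactly the per-record overhead a DataFrame-level reader avoids.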

## Preview Sample Data

In [4]:
# 143K records
df.shape

(142928, 1)

In [5]:
record = df.head(1).to_dicts()[0]
rprint(record)
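
The printed record is in DynamoDB JSON, where each attribute value is wrapped in a one-key dict whose key is a type descriptor such as ``S`` (string), ``N`` (number, stored as a string), ``BOOL``, ``L`` (list), or ``M`` (map). The sketch below uses a hypothetical record: the field names follow the schema defined later in this notebook, but the values are invented for illustration.

```python
# Hypothetical export record; values are invented for illustration.
record = {
    "Item": {
        "OrderID": {"S": "order-1"},
        "TotalAmount": {"N": "59.98"},  # numbers are serialized as strings
        "GiftWrap": {"BOOL": True},
        "AppliedCoupons": {"L": [{"S": "SAVE10"}]},
        "ShippingAddress": {"M": {"City": {"S": "Seattle"}}},
    }
}

# Each attribute maps to a one-key dict: {type_descriptor: value}.
descriptors = {
    name: next(iter(value)) for name, value in record["Item"].items()
}
print(descriptors)
```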

## Import fast_dynamodb_json Library

You can import ``fast_dynamodb_json.api`` to access all public APIs.

In [6]:
from fast_dynamodb_json.api import (
    Integer,
    Float,
    String,
    Binary,
    Bool,
    Null,
    Set,
    List,
    Struct,
    deserialize_df,
)

## Define Your DynamoDB Table Schema

You have to define the DynamoDB table schema so that ``fast_dynamodb_json`` knows how to resolve data type conflicts.

In [7]:
simple_schema = {
    "OrderID": String(),
    "CustomerID": String(),
    "OrderDate": String(),
    "TotalAmount": Float(),
    "Status": String(),
    "ShippingAddress": Struct(
        {
            "StreetAddress": String(),
            "City": String(),
            "State": String(),
            "ZipCode": String(),
            "Country": String(),
        }
    ),
    "Items": List(
        Struct(
            {
                "ProductID": String(),
                "Name": String(),
                "Price": Float(),
                "Quantity": Integer(),
            }
        ),
    ),
    "AppliedCoupons": List(String()),
    "PaymentMethod": String(),
    "LastFourDigits": String(),
    "EstimatedDeliveryDate": String(),
    "GiftWrap": Bool(),
    "GiftMessage": String(),
}
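
One plausible reading of a "data type conflict" is that DynamoDB does not enforce attribute types, so the same attribute can appear with different type descriptors in different items, while a DataFrame column needs a single type. The sketch below illustrates that situation with plain Python. It does not use ``fast_dynamodb_json``; the ``expected`` mapping, the ``find_conflicts`` helper, and the sample items are all hypothetical.

```python
# Expected DynamoDB type descriptor per attribute (a hand-written stand-in
# for a schema; this does not use fast_dynamodb_json itself).
expected = {"OrderID": "S", "TotalAmount": "N", "GiftWrap": "BOOL"}

# Hypothetical raw items; the second one stores TotalAmount as a string.
items = [
    {
        "OrderID": {"S": "order-1"},
        "TotalAmount": {"N": "59.98"},
        "GiftWrap": {"BOOL": False},
    },
    {
        "OrderID": {"S": "order-2"},
        "TotalAmount": {"S": "12.50"},
        "GiftWrap": {"BOOL": True},
    },
]


def find_conflicts(items, expected):
    """Return (item_index, attribute, found_descriptor) for every mismatch."""
    conflicts = []
    for i, item in enumerate(items):
        for name, value in item.items():
            found = next(iter(value))  # the single type-descriptor key
            if name in expected and found != expected[name]:
                conflicts.append((i, name, found))
    return conflicts


conflicts = find_conflicts(items, expected)
print(conflicts)
```

With an explicit schema, a deserializer knows which type each column should end up with instead of having to guess from the data.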

## Deserialize DynamoDB JSON

This library achieves exceptional performance through several key optimizations:

1. Vectorized Parsing: The parser operates in vectorization mode, enabling efficient processing of multiple data elements simultaneously. This approach maximizes CPU utilization and significantly reduces processing time.
2. Columnar Data Storage: The original DynamoDB JSON data is stored in a columnar format. This structure facilitates parallel processing across multiple CPU cores, further enhancing performance.
3. Zero-Copy Techniques: By employing zero-copy methods, the library minimizes data movement between memory locations. This strategy reduces memory bandwidth usage and improves overall processing speed.
4. Minimal Intermediate Object Creation: The library avoids generating unnecessary intermediate Python objects such as lists and dictionaries. This approach not only reduces memory consumption but also decreases the overhead associated with object creation and garbage collection.

These optimizations work in concert to deliver a highly efficient DynamoDB JSON deserialization process, resulting in faster execution times and lower memory usage compared to traditional parsing methods.
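
For contrast, the per-record approach that these optimizations avoid can be sketched in pure Python. This is an illustrative baseline under stated assumptions, not the implementation of ``fast_dynamodb_json`` or of any other library: the ``deserialize_value`` helper and the sample item are hypothetical, and only a subset of DynamoDB type descriptors is handled.

```python
from decimal import Decimal


def deserialize_value(value):
    """Convert one DynamoDB JSON value, e.g. {"S": "abc"}, to a Python object."""
    descriptor, raw = next(iter(value.items()))
    if descriptor == "S":
        return raw
    if descriptor == "N":
        return Decimal(raw)  # numbers arrive as strings
    if descriptor == "BOOL":
        return raw
    if descriptor == "NULL":
        return None
    if descriptor == "L":
        return [deserialize_value(v) for v in raw]
    if descriptor == "M":
        return {k: deserialize_value(v) for k, v in raw.items()}
    if descriptor == "SS":
        return set(raw)
    if descriptor == "NS":
        return {Decimal(n) for n in raw}
    raise ValueError(f"unsupported type descriptor: {descriptor!r}")


# Hypothetical item; values are invented for illustration.
item = {
    "OrderID": {"S": "order-1"},
    "TotalAmount": {"N": "59.98"},
    "Items": {"L": [{"M": {"ProductID": {"S": "p-1"}, "Quantity": {"N": "2"}}}]},
    "GiftMessage": {"NULL": True},
}

# One new dict, list, or Decimal is created for every nested value.
result = {name: deserialize_value(value) for name, value in item.items()}
print(result)
```

Every call allocates new Python objects and recurses per value, which is the cost that a vectorized, columnar approach is designed to sidestep.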

In [19]:
with DateTimeTimer(f"Deserialize {df.shape[0]} DynamoDB JSON records"):
    df_res = deserialize_df(df, simple_schema, dynamodb_json_col="Item")

Deserialize 142928 DynamoDB JSON records: from 2024-08-07 20:45:12.635942 to 2024-08-07 20:45:12.702129 elapsed 0.066187 second.


In [15]:
# 143K, 13 fields
df_res.shape

(142928, 13)

In [16]:
pprint_df(df_res.head(5))

+---------------+--------------+---------------------+---------------+-----------+-----------------+-----+
| OrderID       | CustomerID   | OrderDate           |   TotalAmount | Status    | ShippingAddress | ... |

In [17]:
record = df_res.head(1).to_dicts()[0]
rprint(record)

## Performance Comparison

To show how efficient ``fast_dynamodb_json`` is, we benchmark it against the widely used pure-Python ``dynamodb_json`` library on the same 142,928 records.

In [12]:
from dynamodb_json import json_util

with DateTimeTimer(f"Deserialize {df.shape[0]} DynamoDB JSON records"):
    items = list()
    for record in df.to_dicts():
        item = json_util.loads(record["Item"])
        items.append(item)

Deserialize 142928 DynamoDB JSON records: from 2024-08-08 01:14:03.451483 to 2024-08-08 01:14:22.992041 elapsed 19.540558 second.
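
A single timed run can be noisy, so it may be worth repeating a measurement and keeping the best time. The sketch below does this with the standard library's ``time.perf_counter``. The ``best_of`` helper is hypothetical and the workload is a stand-in; in this notebook the callable would wrap one of the deserialization calls.

```python
import time


def best_of(func, repeat=3):
    """Run func() `repeat` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace with the call you want to measure.
elapsed = best_of(lambda: sum(range(100_000)), repeat=3)
print(f"best of 3: {elapsed:.6f} second")
```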


In [13]:
rprint(items[0])

## Conclusion

This notebook demonstrates how to process DynamoDB JSON data with the ``fast_dynamodb_json`` library and compares its speed against the pure-Python ``dynamodb_json`` library on the same dataset.

Key takeaways:

1. ``fast_dynamodb_json`` processed 142,928 DynamoDB JSON records in approximately **0.066** seconds.
2. The same task using ``dynamodb_json`` took about **19.54** seconds.
3. This represents a speedup of roughly 295 times (19.540558 / 0.066187).
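
The speedup and the implied throughput follow directly from the two elapsed times reported by the timer output earlier in this notebook. A small sketch of the arithmetic:

```python
n_records = 142_928
fast_seconds = 0.066187   # fast_dynamodb_json, from the timer output
slow_seconds = 19.540558  # dynamodb_json, from the timer output

speedup = slow_seconds / fast_seconds
fast_throughput = n_records / fast_seconds  # records per second
slow_throughput = n_records / slow_seconds  # records per second

print(f"speedup: {speedup:.1f}x")
print(f"fast_dynamodb_json: {fast_throughput:,.0f} records/second")
print(f"dynamodb_json: {slow_throughput:,.0f} records/second")
```

These figures come from one run on one machine, so they are best read as an order-of-magnitude indication.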

The ``fast_dynamodb_json`` library achieves this impressive performance through vectorized parsing, columnar data storage, zero-copy techniques, and minimal intermediate object creation. These optimizations make it an excellent choice for large-scale DynamoDB data processing tasks, especially when working with exports or analytics pipelines.

By leveraging ``fast_dynamodb_json`` in conjunction with tools like Polars, data engineers and analysts can significantly reduce processing time and resource consumption, enabling more efficient and cost-effective data workflows.