# Store Large Objects in DynamoDB

This feature lets you store any Python object, or arbitrarily large data, in DynamoDB, even when it exceeds the 400KB item size limit.

NOTE: this solution is based on [pynamodb_mate](https://github.com/MacHu-GWU/pynamodb_mate-project) Python library.

**Summary**

DynamoDB is a good choice for a **pay-as-you-go**, **high-concurrency** key-value database. Sometimes you want to store a large binary object as a DynamoDB item attribute. For example, a web crawler may want to store the crawled HTML source to avoid revisiting the same URL. But DynamoDB limits each item to 400KB. How can you work around this?

The best practice is to serialize the data to binary and store it as an S3 object. The DynamoDB item then stores only the S3 URI of that object. ``pynamodb_mate`` provides a clean API that lets you store arbitrarily large Python objects in DynamoDB this way.
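To make the pattern concrete, here is a minimal sketch that runs without AWS. The two dictionaries are in-memory stand-ins for S3 and DynamoDB, and the function names, the bucket name ``my-bucket``, and the random key are all hypothetical. This is not how ``pynamodb_mate`` is implemented; it only illustrates that the item holds a short pointer while the payload lives elsewhere.

```python
import uuid

# In-memory stand-ins so the sketch runs without AWS (assumptions, not real clients).
fake_s3 = {}        # maps "s3://bucket/key" -> bytes
fake_dynamodb = {}  # maps hash key -> item (a small dict)


def put_page(url: str, html: str, bucket: str = "my-bucket") -> None:
    """Store the payload in 'S3' and only its URI in the 'DynamoDB' item."""
    body = html.encode("utf-8")                    # serialize to binary
    s3_uri = f"s3://{bucket}/pages/{uuid.uuid4().hex}.html"  # arbitrary key
    fake_s3[s3_uri] = body                         # large payload goes to S3
    fake_dynamodb[url] = {"url": url, "html": s3_uri}  # item stays tiny


def get_page(url: str) -> str:
    """Follow the pointer in the item and load the payload back."""
    item = fake_dynamodb[url]
    return fake_s3[item["html"]].decode("utf-8")


big_html = "<html>" + "x" * 1_000_000 + "</html>"  # about 1 MB, over the item limit
put_page("https://example.com", big_html)
print(len(fake_dynamodb["https://example.com"]["html"]))  # length of the stored URI
```

The item size no longer depends on the payload size, which is why the item limit stops being a concern.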

**How it Works**

``pynamodb_mate`` uses the SHA-256 fingerprint of the binary form of your data to build the S3 key. This ensures that identical data is always stored at the same S3 location, which avoids duplicate objects and redundant uploads.
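The naming idea can be sketched in a few lines. The helper ``s3_key_for`` is hypothetical, and details such as whether the library hashes the data before or after compression are not stated here, so treat this as an illustration of content-addressed keys rather than the library's exact logic. The ``{fingerprint}`` placeholder mirrors the ``key_template`` argument used later in this document.

```python
import hashlib


def s3_key_for(
    data: bytes,
    key_template: str = "pynamodb_mate/s3backed/{fingerprint}.html",
) -> str:
    """Build an S3 key from the SHA-256 hex digest of the data (illustrative)."""
    fingerprint = hashlib.sha256(data).hexdigest()
    return key_template.format(fingerprint=fingerprint)


key_a = s3_key_for(b"<html>Hello World!</html>")
key_b = s3_key_for(b"<html>Hello World!</html>")      # same bytes as key_a
key_c = s3_key_for(b"<html>Hello DynamoDB</html>")    # different bytes

print(key_a == key_b)  # identical data maps to the same key
print(key_a == key_c)  # different data maps to a different key
```

Because the key depends only on the bytes, writing the same content twice targets the same object instead of creating a second copy.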

When you delete the DynamoDB item, it won't delete the S3 object because there might be another application using it. In the future, ``pynamodb_mate`` will provide a clean API to delete the S3 object as well.
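Until such an API exists, a manual cleanup could look roughly like the sketch below. The helpers ``parse_s3_uri`` and ``delete_s3_object`` are hypothetical, the URI is assumed to have the form ``s3://bucket/key``, and how you obtain the stored URI from an item is not covered here. The only real AWS call is boto3's ``delete_object(Bucket=..., Key=...)``. Only do this when you are sure no other item or application references the object.

```python
from typing import Tuple


def parse_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); raise on anything else."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {s3_uri!r}")
    return bucket, key


def delete_s3_object(s3_client, s3_uri: str) -> None:
    """Delete the object behind the URI using a boto3 S3 client."""
    bucket, key = parse_s3_uri(s3_uri)
    s3_client.delete_object(Bucket=bucket, Key=key)


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# s3_client = boto3.session.Session().client("s3")
# delete_s3_object(s3_client, "s3://my-bucket/pynamodb_mate/s3backed/<fingerprint>.html")
```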

## Define your Data Model

In [1]:
import pynamodb_mate as pm
import boto3

In [2]:
s3_client = boto3.session.Session().client("s3")

bucket_name = "aws-data-lab-sanhe-for-opensource"

class UrlModel(pm.Model):
    class Meta:
        table_name = "pynamodb-mate-example-store-large-object"
        region = "us-east-1"
        billing_mode = pm.PAY_PER_REQUEST_BILLING_MODE

    url = pm.UnicodeAttribute(hash_key=True)

    # declare attribute
    html = pm.S3BackedBigTextAttribute(
        # required, for each S3Backed attribute, you have to specify
        # the s3 bucket to store your data
        bucket_name=bucket_name,
        # optional, by default it will use pynamodb-mate/bigtext/{fingerprint}.txt
        key_template="pynamodb_mate/s3backed/{fingerprint}.html",
        # optional, by default, the data is compressed
        compressed=True,
        # optional, explicitly specify the underlying s3 client you want to use;
        # this is useful if you want to use AWS credentials different from
        # those of the DynamoDB client.
        s3_client=s3_client,
    )

    content = pm.S3BackedBigBinaryAttribute(
        # required, for each S3Backed attribute, you have to specify
        # the s3 bucket to store your data
        bucket_name=bucket_name,
        # optional, by default it will use pynamodb-mate/bigtext/{fingerprint}.txt
        key_template="pynamodb_mate/s3backed/{fingerprint}.dat",
        # optional, by default, the data is compressed
        compressed=True,
        # optional, explicitly specify the underlying s3 client you want to use;
        # this is useful if you want to use AWS credentials different from
        # those of the DynamoDB client.
        s3_client=s3_client,
    )

# create the DynamoDB table if it does not exist; returns quickly if it already exists
UrlModel.create_table(wait=True)

## Write / Read / Update / Delete

In [3]:
# Write
url = "https://pynamodb-mate.readthedocs.io/en/latest/"
html = "<html>Hello World!</html>"
content = "this is a dummy image!".encode("utf-8")

# create item
url_model = UrlModel(url=url, html=html, content=content)
# write item to DynamoDB table
url_model.save()
# preview the DynamoDB item; the values of ``html`` and ``content`` are S3 URIs
print(f"preview the DynamoDB item: {url_model.item_detail_console_url}")

preview the DynamoDB item: https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1#edit-item?table=pynamodb-mate-example-store-large-object&itemMode=2&pk=https://pynamodb-mate.readthedocs.io/en/latest/&sk&ref=%23item-explorer%3Ftable%3Dpynamodb-mate-example-store-large-object&route=ROUTE_ITEM_EXPLORER


In [4]:
# Read
url_model = UrlModel.get(url)
print(f"url.html = {url_model.html!r}")
print(f"url.content = {url_model.content!r}")

url.html = '<html>Hello World!</html>'
url.content = b'this is a dummy image!'


In [5]:
# Update the item
url_model.update(
    actions=[
        UrlModel.html.set("<html>Hello DynamoDB</html>"),
        UrlModel.content.set("this is a real image!".encode("utf-8")),
    ]
)
url_model.refresh() # get the up-to-date data
print(f"url.html = {url_model.html!r}") # should give you new data
print(f"url.content = {url_model.content!r}") # should give you new data

url.html = '<html>Hello DynamoDB</html>'
url.content = b'this is a real image!'


In [6]:
# delete item from DynamoDB
# this won't delete s3 object
url_model.delete()

{'ConsumedCapacity': {'CapacityUnits': 1.0,
  'TableName': 'pynamodb-mate-example-store-large-object'}}

## Custom S3Backed Attribute

``pynamodb_mate`` has three built-in ``S3Backed`` attributes:

- ``pynamodb_mate.S3BackedBigBinaryAttribute``
- ``pynamodb_mate.S3BackedBigTextAttribute``
- ``pynamodb_mate.S3BackedJsonDictAttribute``

It is also easy to create your own S3Backed attribute. In this example, even though we already have the built-in ``S3BackedJsonDictAttribute``, let's learn by re-inventing the wheel.

In [8]:
# import the base class for S3Backed attribute
import json
from pynamodb_mate import S3BackedAttribute

class S3BackedJsonAttribute(S3BackedAttribute):
    # user_serializer is a method to define how you want to
    # convert your data to binary.
    def user_serializer(self, value: dict) -> bytes:
        return json.dumps(value).encode("utf-8")

    # user_deserializer is a method to define how you want to
    # recover your data from binary.
    def user_deserializer(self, value: bytes) -> dict:
        return json.loads(value.decode("utf-8"))

That's it, now you can use it like any other S3Backed attribute.
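One detail is worth considering when you write your own serializer. Since the S3 key is derived from the serialized bytes, two equal dictionaries that serialize to different bytes would land at different S3 locations. The sketch below uses plain functions (hypothetical names, independent of ``pynamodb_mate``) to show that ``sort_keys=True`` makes the JSON output independent of key order. Whether the built-in ``S3BackedJsonDictAttribute`` does this is not stated here.

```python
import json


def serialize(value: dict) -> bytes:
    # sort_keys=True gives the same bytes regardless of key insertion order
    return json.dumps(value, sort_keys=True).encode("utf-8")


def deserialize(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))


a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # equal to ``a`` as a dict, different insertion order

print(a == b)                           # the dicts compare equal
print(serialize(a) == serialize(b))     # and now serialize to identical bytes
print(deserialize(serialize(a)) == a)   # round trip recovers the original value
```

To apply this to the attribute defined above, you would pass ``sort_keys=True`` to ``json.dumps`` inside ``user_serializer``.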

In [9]:
class UrlModel(pm.Model):
    class Meta:
        table_name = "pynamodb-mate-example-store-large-object"
        region = "us-east-1"
        billing_mode = pm.PAY_PER_REQUEST_BILLING_MODE

    url = pm.UnicodeAttribute(hash_key=True)
    data = S3BackedJsonAttribute(
        bucket_name=bucket_name,
        # you want to use .json as the file extension
        key_template="pynamodb_mate/s3backed/{fingerprint}.json",
        s3_client=s3_client,
    )

UrlModel.create_table(wait=True)

In [10]:
# Write
url = "https://pynamodb-mate.readthedocs.io/en/latest/"
data = dict(a=1)

# create item
url_model = UrlModel(url=url, data=data)
# write item to DynamoDB table
url_model.save()
# preview the DynamoDB item, the value in ``.data`` is the s3 uri
print(f"preview the DynamoDB item: {url_model.item_detail_console_url}")

preview the DynamoDB item: https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1#edit-item?table=pynamodb-mate-example-store-large-object&itemMode=2&pk=https://pynamodb-mate.readthedocs.io/en/latest/&sk&ref=%23item-explorer%3Ftable%3Dpynamodb-mate-example-store-large-object&route=ROUTE_ITEM_EXPLORER


In [11]:
# Read
url_model = UrlModel.get(url)
print(f"url.data = {url_model.data!r}")

url.data = {'a': 1}


In [12]:
# Update the item
url_model.update(
    actions=[
        UrlModel.data.set(dict(b=2)),
    ]
)
url_model.refresh() # get the up-to-date data
print(f"url.data = {url_model.data!r}") # should give you new data

url.data = {'b': 2}


In [13]:
# delete item from DynamoDB
# this won't delete s3 object
url_model.delete()

{'ConsumedCapacity': {'CapacityUnits': 1.0,
  'TableName': 'pynamodb-mate-example-store-large-object'}}

In [16]:
print(url_model.to_dict()["data"])

{'b': 2}
