# Store Large Object in DynamoDB


## Overview

When storing large binary data in DynamoDB, AWS recommends saving the data in S3 and only keeping the S3 URI in DynamoDB. However, implementing this correctly can be challenging. This article will demonstrate the best practices for this pattern and provide code examples using the [pynamodb_mate](https://github.com/MacHu-GWU/pynamodb_mate-project) library to implement this pattern.
 

## Data Consistency across DynamoDB and S3

The AWS official documentation "[Best practices for storing large items and attributes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-use-s3-too.html)" notes that DynamoDB does not support transactions that span DynamoDB and S3, so the application itself must handle failures such as orphaned S3 objects. This article focuses on addressing this issue.

Firstly, **we need to determine the order of write operations to DynamoDB and S3**. Can they be performed simultaneously? I believe that simultaneous writing can be ruled out. Writing to S3 has significantly higher latency than writing to DynamoDB, so running the two in parallel gains little. Moreover, an update often touches other attributes alongside the large ones. For instance, an ``update_at`` attribute records the modification time of the DynamoDB item, and logically it should only be written to DynamoDB after all the S3 writes have completed.

That leaves two options for ordering the writes:

1. **Write to DynamoDB first, then write to S3**: This approach is not ideal. The S3 write can take a long time, and until it completes the DynamoDB item points to data that does not exist yet, so readers may see dirty data. If the S3 write then fails, you must roll the DynamoDB item back to its previous state, which is complex and error-prone.
2. **Write to S3 first, then write to DynamoDB**: This approach is better. Each S3 write creates a new object rather than overwriting an existing one, so nothing a reader can see changes until the DynamoDB write succeeds. If the S3 write fails, the operation fails immediately with DynamoDB untouched. If the later DynamoDB write fails, the item still points to the old data, so consistency is preserved and the new S3 object is merely unused.

A failed DynamoDB write therefore leaves an unused S3 object behind. Whether you catch the exception and delete that object immediately depends on the project: some need to retain historical records, while others can batch delete unused data after a certain period.
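
The S3-first ordering can be sketched in a few lines. This is a minimal illustration, not the implementation used by ``pynamodb_mate``: the function name ``put_large_item`` and its parameters are hypothetical, and the two clients are assumed to behave like ``boto3.client("s3")`` and ``boto3.client("dynamodb")`` (only ``put_object``, ``delete_object``, and ``put_item`` are called).

```python
from typing import Any


def put_large_item(
    s3_client: Any,
    dynamodb_client: Any,
    bucket: str,
    key: str,
    body: bytes,
    table_name: str,
    item: dict,
) -> None:
    """Write the payload to S3 first, then the DynamoDB item that points to it."""
    # Step 1: S3 write. If this raises, nothing has been written to DynamoDB,
    # so there is nothing to roll back.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    try:
        # Step 2: DynamoDB write. Readers only see the new S3 URI once this succeeds.
        dynamodb_client.put_item(TableName=table_name, Item=item)
    except Exception:
        # Step 2 failed, so the new S3 object is unused. Deleting it here is
        # optional (an assumption of this sketch); a batch job could do it later.
        s3_client.delete_object(Bucket=bucket, Key=key)
        raise
```

Re-raising after the cleanup keeps the failure visible to the caller, so it can retry or report the error.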

Deletion is slightly different. Generally, you delete from DynamoDB first and then from S3. If you delete from S3 first and the DynamoDB deletion fails, the item survives but points to a missing S3 object, so any subsequent read of that item breaks. If you delete from DynamoDB first, a failed S3 deletion doesn't matter: the S3 data is no longer referenced and can be removed later by a batch cleanup program.
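
The deletion order can be sketched the same way. Again this is only an illustration under assumptions: ``delete_large_item`` and its parameters are hypothetical names, and the clients are assumed to expose the boto3 methods ``delete_item`` and ``delete_object``.

```python
from typing import Any, List


def delete_large_item(
    s3_client: Any,
    dynamodb_client: Any,
    table_name: str,
    item_key: dict,
    bucket: str,
    s3_keys: List[str],
) -> List[str]:
    """Delete the DynamoDB item first, then its S3 objects.

    Returns the S3 keys that could not be deleted.
    """
    # Step 1: remove the item. If this raises, S3 is untouched and readers
    # still see a complete item.
    dynamodb_client.delete_item(TableName=table_name, Key=item_key)
    leftover = []
    for s3_key in s3_keys:
        try:
            # Step 2: the objects are unreferenced now, so a failure is harmless.
            s3_client.delete_object(Bucket=bucket, Key=s3_key)
        except Exception:
            # Record the key so that a later batch cleanup can retry it.
            leftover.append(s3_key)
    return leftover
```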

**In conclusion, when creating or updating, you should write to S3 first and then write to DynamoDB. When deleting, delete from DynamoDB first and then from S3**.


## S3 Key Naming Convention

When storing data that should be in DynamoDB on S3, there are many strategies for choosing the S3 location. Here are a few common approaches:

1. **Content-based**: The S3 key is derived from a hash of the content. Identical content maps to the same key, which eliminates duplicates, and different content never maps to the same key, so there is no risk of overwriting useful data. The downside is that before deleting an object from S3, you need to check whether other DynamoDB items still reference it, which can be complex.
2. **Based on pk and sk**: Since the combination of pk and sk is unique, using them together as the S3 key is also a good choice. However, the compound of pk and sk cannot be the complete key: an update would overwrite the existing object, and if the subsequent DynamoDB write failed, the correct data would already be lost. The compound key should therefore serve only as a prefix.

**Conclusion**: We choose to use ``${prefix}/${pk}/${sk}/${content_hash}`` as the S3 key, where ``prefix`` is a custom root directory. This approach combines the advantages of both strategies, ensuring that there is no overwriting of useful data and eliminating the need to worry about objects being referenced by other DynamoDB items when deleting data.
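
As a rough sketch of this convention, the helper below builds such a key. The name ``build_s3_key``, the use of SHA-256 as the content hash, and skipping the ``sk`` segment when there is no range key are all assumptions made for illustration; the exact layout used by ``pynamodb_mate`` may differ.

```python
import hashlib
from typing import Optional


def build_s3_key(prefix: str, pk: str, sk: Optional[str], content: bytes) -> str:
    """Return ``{prefix}/{pk}/{sk}/{content_hash}``; sk is skipped when it is None."""
    content_hash = hashlib.sha256(content).hexdigest()
    parts = [prefix.strip("/"), pk]
    if sk is not None:
        parts.append(sk)
    parts.append(content_hash)
    return "/".join(parts)


# Two versions of the same item share the pk prefix but get different keys,
# so writing the new version never overwrites the old one.
old_key = build_s3_key("projects/demo", "id-1", None, b"<b>Hello Alice</b>")
new_key = build_s3_key("projects/demo", "id-1", None, b"<b>Hello Bob</b>")
```

Because the hash is the last segment, everything belonging to one item sits under ``{prefix}/{pk}/{sk}/`` and can be listed or deleted without checking other items.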

## Code Example using pynamodb_mate

### Declare the Model

``pynamodb_mate`` provides an elegant way to use this pattern. Firstly, you need to create a DynamoDB ORM model and inject the ``LargeAttributeMixin`` mixin class to enable this feature.

In [1]:
import pynamodb_mate.api as pm
from boto_session_manager import BotoSesManager
from s3pathlib import S3Path, context # for demo only
from rich import print as rprint # for demo only

class Document(pm.Model, pm.patterns.large_attribute.LargeAttributeMixin):
    class Meta:
        table_name = "pynamodb-mate-test-large-attribute"
        region = "us-east-1"
        billing_mode = pm.constants.PAY_PER_REQUEST_BILLING_MODE

    pk = pm.UnicodeAttribute(hash_key=True)
    # this attribute tracks the last update time of the item
    update_at = pm.UTCDateTimeAttribute()
    # you can declare multiple large attributes in one model.
    html = pm.UnicodeAttribute(null=True)
    image = pm.UnicodeAttribute(null=True)
    # this attribute stores arbitrary user data
    data = pm.JSONAttribute()

### Define the AWS Configuration

Then we need to:

- Choose which AWS credentials (boto3 session) to use.
- Tell pynamodb_mate to use the specified boto3 session.
- Create the DynamoDB table.
- Define an S3 location to store the large attribute data.
- Prepare some helper functions.

In [2]:
from datetime import datetime, timezone

# define the boto session using the default profile. I prefer boto_session_manager,
# but you can use native boto3 instead
bsm = BotoSesManager()
# Tell s3pathlib to use this boto session
# s3pathlib is NOT a required library, it is used to simplify writing this demo.
context.attach_boto_session(bsm.boto_ses)
# Tell pynamodb_mate to use this boto session for DynamoDB connection
conn = pm.Connection()
# Create Table
Document.create_table(wait=True)
# Define S3 Bucket and Prefix
BUCKET = f"{bsm.aws_account_alias}-{bsm.aws_region}-data"
PREFIX = "projects/pynamodb_mate/examples/large_attribute/"

def get_utc_now() -> datetime:
    # datetime.utcnow() is deprecated since Python 3.12; this returns the same aware datetime
    return datetime.now(timezone.utc)

### Clean Up S3 prefix and DynamoDB Table

This is for demo only. I would like to ensure that I have a clean S3 bucket and DynamoDB table at the start.

In [3]:
S3Path(f"s3://{BUCKET}/{PREFIX}").to_dir().delete()
Document.delete_all()

0

### Two API styles

``pynamodb_mate`` provides two API styles: **Transaction API** and **Step-by-Step API**.

The **Transaction API** allows users to use a single Python function to create/update/delete the DynamoDB item and its underlying S3 objects. It automatically manages the data consistency between DynamoDB and S3.

The **Step-by-Step API** allows users to use one Python function to interact with S3 and another Python function to interact with DynamoDB. Users need to manually manage the data consistency between DynamoDB and S3.


### Transaction API example

#### Transaction API example - Create

In [4]:
pk = "id-1"
sk = None
html_data = "<b>Hello Alice</b>".encode("utf-8")
image_data = "this is image one".encode("utf-8")
utc_now = get_utc_now()

new_doc = Document.create_large_attribute_item(
    # ``boto3.client("s3")`` object
    s3_client=bsm.s3_client, 
    # hash key value of the DynamoDB item.
    pk=pk,
    # range key value if your DynamoDB table has range key, otherwise use None.
    sk=sk,
    # Python dictionary that maps each large attribute name to its
    # binary data. All data has to be encoded in binary format.
    kvs=dict(html=html_data, image=image_data),
    # S3 bucket to store the large attribute data.
    bucket=BUCKET,
    # S3 prefix to store the large attribute data, the final S3 key
    # would be ``s3://{bucket}/{prefix}/pk={pk}/sk={sk}/attr={attr}/md5={md5}``.
    prefix=PREFIX,
    # the update time of the DynamoDB item, it will be stored
    # in the S3 object metadata as well.
    update_at=utc_now,
    # additional DynamoDB item attributes other than
    # large attributes you want to set.
    attributes=dict(
        update_at=utc_now, 
        data={"version": 1},
    ),
    # if True and the S3 write succeeds but creating the DynamoDB item fails,
    # the created S3 object will be deleted.
    clean_up_when_failed=True,
)
rprint(new_doc.to_dict())

In [5]:
print(f"html content = {S3Path(new_doc.html).read_text()}")
print(f"image content = {S3Path(new_doc.image).read_bytes()}")

html content = <b>Hello Alice</b>
image content = b'this is image one'


#### Transaction API example - Update

In [6]:
html_data = "<b>Hello Bob</b>".encode("utf-8")
image_data = "this is image two".encode("utf-8")
old_doc = Document.get(pk, sk)
utc_now = get_utc_now()

new_doc = Document.update_large_attribute_item(
    # ``boto3.client("s3")`` object
    s3_client=bsm.s3_client, 
    # hash key value of the DynamoDB item.
    pk=pk,
    # range key value if your DynamoDB table has range key, otherwise use None.
    sk=sk,
    # Python dictionary that maps each large attribute name to its
    # binary data. All data has to be encoded in binary format.
    kvs=dict(html=html_data, image=image_data),
    # S3 bucket to store the large attribute data.
    bucket=BUCKET,
    # S3 prefix to store the large attribute data, the final S3 key
    # would be ``s3://{bucket}/{prefix}/pk={pk}/sk={sk}/attr={attr}/md5={md5}``.
    prefix=PREFIX,
    # the update time of the DynamoDB item, it will be stored
    # in the S3 object metadata as well.
    update_at=utc_now,
    # additional update actions (PynamoDB update expression syntax)
    # for attributes other than the large attributes. Please refer to
    # https://pynamodb.readthedocs.io/en/latest/updates.html
    update_actions=[
        Document.update_at.set(utc_now),
        Document.data.set({"version": 2}),
    ],
    # if True and the large attributes of the old DynamoDB item
    # have changed, the old S3 objects will be deleted.
    clean_up_when_succeeded=True,
    # if True and the S3 write succeeds but the DynamoDB update fails,
    # the created S3 object will be deleted.
    clean_up_when_failed=True,
)
rprint(new_doc.to_dict())

In [7]:
# S3 object of old document should be cleaned up
print(f"{S3Path(old_doc.html).exists(bsm=bsm) = }")
print(f"{S3Path(old_doc.image).exists(bsm=bsm) = }")

S3Path(old_doc.html).exists(bsm=bsm) = False
S3Path(old_doc.image).exists(bsm=bsm) = False


In [8]:
# S3 object of new document should be created
print(f"html content = {S3Path(new_doc.html).read_text()}")
print(f"image content = {S3Path(new_doc.image).read_bytes()}")

html content = <b>Hello Bob</b>
image content = b'this is image two'


#### Transaction API example - Delete

In [9]:
old_doc = Document.get(pk, sk)
Document.delete_large_attribute_item(
    # ``boto3.client("s3")`` object.
    s3_client=bsm.s3_client,
    # hash key value of the DynamoDB item.
    pk=pk,
    # range key value if your DynamoDB table has range key, otherwise use None.
    sk=sk,
    # list of large attribute names to delete. This is required when 
    # clean_up_when_succeeded is True. 
    # If clean_up_when_succeeded is False, this parameter has no effect.
    attributes=[
        Document.html.attr_name,
        Document.image.attr_name,
    ],
    # if True, the corresponding S3 objects will be deleted after the DynamoDB item has been deleted.
    clean_up_when_succeeded=True,
)
deleted_doc = Document.get_one_or_none(pk, sk)
print(f"{deleted_doc = }")

deleted_doc = None


In [10]:
# S3 object of old document should be cleaned up
print(f"{S3Path(old_doc.html).exists() = }")
print(f"{S3Path(old_doc.image).exists() = }")

S3Path(old_doc.html).exists() = False
S3Path(old_doc.image).exists() = False


### Step-by-Step API example

#### Step-by-Step API example - Create

In [11]:
# Do S3 write first
pk = "id-2"
html_data = "<b>Hello Alice</b>".encode("utf-8")
image_data = "this is image one".encode("utf-8")
utc_now = get_utc_now()

put_s3_res = Document.put_s3(
    s3_client=bsm.s3_client,
    pk=pk,
    sk=None,
    kvs=dict(html=html_data, image=image_data),
    bucket=BUCKET,
    prefix=PREFIX,
    update_at=utc_now,
)
rprint(put_s3_res)

In [12]:
# Then do the DynamoDB write. Note that the DynamoDB write operation may fail
try:
    new_doc = Document(
        pk=pk,
        update_at=utc_now,
        data={"version": 1},
        **put_s3_res.to_attributes(),
    )
    new_doc.save()
    rprint(new_doc.to_dict())
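except Exception:
    # the DynamoDB write failed, so remove the S3 objects created above,
    # then re-raise so that the failure is not silently swallowed
    put_s3_res.clean_up_created_s3_object_when_create_dynamodb_item_failed(bsm.s3_client)
    raise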

In [13]:
# S3 object of new document should be created
print(f"html content = {S3Path(new_doc.html).read_text()}")
print(f"image content = {S3Path(new_doc.image).read_bytes()}")

html content = <b>Hello Alice</b>
image content = b'this is image one'


#### Step-by-Step API example - Update

In [14]:
# Do S3 write first
html_data = "<b>Hello Bob</b>".encode("utf-8")
image_data = "this is image two".encode("utf-8")
old_doc = Document.get(pk)
old_doc_copy = Document.get(pk)
utc_now = get_utc_now()
put_s3_res = Document.put_s3(
    s3_client=bsm.s3_client,
    pk=pk,
    sk=None,
    kvs=dict(html=html_data, image=image_data),
    bucket=BUCKET,
    prefix=PREFIX,
    update_at=utc_now,
)
rprint(put_s3_res)

In [15]:
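# Then do the DynamoDB update. Note that the DynamoDB update operation may fail
try:
    actions = put_s3_res.to_update_actions(model_klass=Document)
    actions.append(Document.update_at.set(utc_now))
    actions.append(Document.data.set({"version": 2}))
    old_doc.update(actions=actions)
except Exception:
    # the DynamoDB update failed, so remove the S3 objects created above
    put_s3_res.clean_up_created_s3_object_when_update_dynamodb_item_failed(
        s3_client=bsm.s3_client,
    )
    raise
else:
    # the update succeeded; only now is it safe to remove the old S3 objects.
    # Keeping this outside the try block means a failed clean up can never
    # trigger deletion of the new S3 objects that the item now references.
    put_s3_res.clean_up_old_s3_object_when_update_dynamodb_item_succeeded(
        s3_client=bsm.s3_client,
        old_model=old_doc_copy,
    )
    new_doc = old_doc  # the in-memory old_doc holds the new values after update()
    rprint(new_doc.to_dict())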

In [16]:
# S3 object of old document should be cleaned up
print(f"{S3Path(old_doc_copy.html).exists() = }")
print(f"{S3Path(old_doc_copy.image).exists() = }")

S3Path(old_doc_copy.html).exists() = False
S3Path(old_doc_copy.image).exists() = False


In [17]:
# S3 object of new document should be created
print(f"html content = {S3Path(new_doc.html).read_text()}")
print(f"image content = {S3Path(new_doc.image).read_bytes()}")

html content = <b>Hello Bob</b>
image content = b'this is image two'


#### Step-by-Step API example - Delete

In [18]:
old_doc = Document.get(pk, sk)
old_doc.delete()

{'ConsumedCapacity': {'CapacityUnits': 1.0,
  'TableName': 'pynamodb-mate-test-large-attribute'}}

You decide whether to delete the S3 objects after the DynamoDB item is deleted.

In [19]:
_ = S3Path(old_doc.html).delete()
_ = S3Path(old_doc.image).delete()
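
If you defer the S3 deletion instead, a batch cleanup program has to work out which objects are no longer referenced. The sketch below shows one possible way to do that and is not part of ``pynamodb_mate``: ``find_orphaned_keys`` is a hypothetical helper, and it assumes you have already collected the S3 keys under the prefix with their last-modified times, plus the set of keys that DynamoDB items still reference. The grace period is an added assumption: because S3 is written before DynamoDB, a very recent object may belong to a write that is still in flight.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


def find_orphaned_keys(
    s3_objects: Dict[str, datetime],
    referenced_keys: Set[str],
    now: datetime,
    grace_period: timedelta = timedelta(hours=1),
) -> List[str]:
    """Return unreferenced S3 keys that are older than the grace period."""
    cutoff = now - grace_period
    return sorted(
        key
        for key, last_modified in s3_objects.items()
        # skip referenced objects and objects that are too recent to judge
        if key not in referenced_keys and last_modified < cutoff
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
s3_objects = {
    "demo/id-1/aaa": now - timedelta(hours=2),  # still referenced
    "demo/id-1/bbb": now - timedelta(hours=2),  # old and unreferenced
    "demo/id-2/ccc": now - timedelta(minutes=1),  # possibly an in-flight write
}
orphans = find_orphaned_keys(s3_objects, {"demo/id-1/aaa"}, now)
```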

## Summary

By leveraging pynamodb_mate, you can:

- Easily integrate the storage of large binary data in S3 with your DynamoDB ORM model.
- Ensure data consistency between DynamoDB and S3 during create, update, and delete operations.
- Choose between the Transaction API for simplified usage or the Step-by-Step API for more control and customization.
- Benefit from a well-defined S3 key naming convention that eliminates data overwriting and simplifies object management.

This library offers a valuable solution for handling large attributes in DynamoDB, saving you time and effort in implementing this pattern correctly.