# Working with Indexes

A secondary index is a data structure that contains a subset of attributes from a table, along with an alternate key to support `Query` operations. You can retrieve data from the index using `Query` and `Scan`, in much the same way as you would with a table. A table can have multiple secondary indexes, which give your applications access to many different query patterns.

## Local Secondary Index

Assume that you're modeling a database schema for a flash card application. Each user creates their own card decks, and the data is stored in DynamoDB as follows.

| Attribute | Role | Description | Sample Values |
| -- | -- | -- | -- |
| UserId | Partition key | User ID | `dongkyun` |
| DeckId | Sort key | Deck ID | `Python` |
| CardNo | Attribute | Card number | `9` |
| FrontMessage | Attribute | Message on the front of a card | `54f8349b01c74d38a4803895714a1c08` |
| BackMessage | Attribute | Message on the back of a card | `565a8f73cd9b42c9ba96aff6f7e77ffe` |
| LastUpdatedDateTime | Attribute | The last updated date and time | `2020-10-05 02:18:43.970576` |

This model supports queries for the decks of a user, but a new requirement has arrived: the application developers want to get the most recently updated decks of a specific user. With the current schema, the only way to satisfy it is to read all of the user's items and sort them afterwards.

To address this, we can create a local secondary index as follows.

| Attribute | Role | Description | Sample Values |
| -- | -- | -- | -- |
| UserId | Partition key | User ID | `dongkyun` |
| LastUpdatedDateTime | Sort key | The last updated date and time | `2020-10-05 02:18:43.970576` |

Here is a snippet that creates the table together with this index.

In [4]:
# import and get dynamodb resource
import boto3
from boto3.dynamodb.conditions import Key, Attr
from botocore.exceptions import ClientError
from pprint import pprint, pformat
from decimal import Decimal
import time
import multiprocessing as mp
import csv
from datetime import datetime
import uuid

dynamodb = boto3.resource('dynamodb')

In [2]:
# create a table
flash_cards = dynamodb.create_table(
    TableName='FlashCards',
    AttributeDefinitions=[
        {'AttributeName': 'UserId', 'AttributeType': 'S'},
        {'AttributeName': 'DeckId', 'AttributeType': 'S'},
        {'AttributeName': 'LastUpdatedDateTime', 'AttributeType': 'S'}
    ],
    KeySchema=[
        {'AttributeName': 'UserId', 'KeyType': 'HASH'},
        {'AttributeName': 'DeckId', 'KeyType': 'RANGE'}
    ],
    BillingMode='PAY_PER_REQUEST',
    LocalSecondaryIndexes=[
        {
            'IndexName': 'LSI_01_UserIdLastUpdatedDateTime',
            'KeySchema': [
                {'AttributeName': 'UserId', 'KeyType': 'HASH'},
                {'AttributeName': 'LastUpdatedDateTime', 'KeyType': 'RANGE'}
            ],
            'Projection': {'ProjectionType': 'ALL'}
        }
    ]
)

flash_cards.wait_until_exists()

In [5]:
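# put dummy data; items that share the same (UserId, DeckId) key overwrite each other,
# so only the last card (CardNo 9) of each deck remains in the table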
users = ['dongkyun', 'kunwoong']
decks = ['Python', 'AWS', 'DynamoDB']

for user in users:
    for deck in decks:
        for card in range(10):
            response = flash_cards.put_item(
                Item={
                    'UserId': user,
                    'DeckId': deck,
                    'CardNo': card,
                    'FrontMessage': uuid.uuid4().hex,
                    'BackMessage': uuid.uuid4().hex,
                    'LastUpdatedDateTime': str(datetime.now())
                }
            )

pprint(flash_cards.scan())

{'Count': 6,
 'Items': [{'BackMessage': '565a8f73cd9b42c9ba96aff6f7e77ffe',
            'CardNo': Decimal('9'),
            'DeckId': 'AWS',
            'FrontMessage': '54f8349b01c74d38a4803895714a1c08',
            'LastUpdatedDateTime': '2020-10-05 02:18:43.970576',
            'UserId': 'kunwoong'},
           {'BackMessage': 'f6d1627546dc431f8673338162996ef1',
            'CardNo': Decimal('9'),
            'DeckId': 'DynamoDB',
            'FrontMessage': '4f453c2660ff43b299b3b77364f6d7cf',
            'LastUpdatedDateTime': '2020-10-05 02:18:44.051005',
            'UserId': 'kunwoong'},
           {'BackMessage': '4df4545d898c489ab6658159c2865ad1',
            'CardNo': Decimal('9'),
            'DeckId': 'Python',
            'FrontMessage': '753df996f495442d9ee6af0bb93be4d8',
            'LastUpdatedDateTime': '2020-10-05 02:18:43.892100',
            'UserId': 'kunwoong'},
           {'BackMessage': 'aa518a9f00ca4733b2f04ab9c80578bb',
            'CardNo': Decimal('9'),
    

In [6]:
# check secondary index information
pprint(flash_cards.local_secondary_indexes)

[{'IndexArn': 'arn:aws:dynamodb:ap-northeast-2:886100642687:table/FlashCards/index/LSI_01_UserIdLastUpdatedDateTime',
  'IndexName': 'LSI_01_UserIdLastUpdatedDateTime',
  'IndexSizeBytes': 0,
  'ItemCount': 0,
  'KeySchema': [{'AttributeName': 'UserId', 'KeyType': 'HASH'},
                {'AttributeName': 'LastUpdatedDateTime', 'KeyType': 'RANGE'}],
  'Projection': {'ProjectionType': 'ALL'}}]
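
The `Projection` setting shown above decides which attributes are copied into the index. As a minimal in-memory sketch (plain Python, not the DynamoDB API), the three projection types can be modeled as below. The helper `project` and the choice of `CardNo` as an included attribute are assumptions made only for illustration; the one fact relied on is that the table's and the index's key attributes are always part of the index.

```python
def project(item, table_keys, index_keys, projection):
    """Return the attributes a toy index would store for `item`."""
    projection_type = projection['ProjectionType']
    if projection_type == 'ALL':
        # every attribute of the item is copied
        return dict(item)

    # key attributes of the table and of the index are always kept
    kept = set(table_keys) | set(index_keys)
    if projection_type == 'INCLUDE':
        kept |= set(projection.get('NonKeyAttributes', []))
    elif projection_type != 'KEYS_ONLY':
        raise ValueError('unknown projection type: {}'.format(projection_type))

    return {name: value for name, value in item.items() if name in kept}


card = {
    'UserId': 'dongkyun',
    'DeckId': 'DynamoDB',
    'CardNo': 9,
    'FrontMessage': 'bef7fd610acd4621ae3bc97ef5a5a599',
    'BackMessage': '66e993d4f23d4da490f61f7f981bbf30',
    'LastUpdatedDateTime': '2020-10-05 02:18:43.811198',
}
table_keys = ['UserId', 'DeckId']
index_keys = ['UserId', 'LastUpdatedDateTime']

keys_only = project(card, table_keys, index_keys, {'ProjectionType': 'KEYS_ONLY'})
include = project(card, table_keys, index_keys,
                  {'ProjectionType': 'INCLUDE', 'NonKeyAttributes': ['CardNo']})
everything = project(card, table_keys, index_keys, {'ProjectionType': 'ALL'})

print(sorted(keys_only))
print(sorted(include))
```

A narrower projection keeps the index smaller, at the cost of not having the other attributes available when querying the index.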


To use an index in a query, `IndexName` must be specified explicitly; otherwise the query runs against the base table and its key schema. The query below covers the additional access pattern mentioned above. Results of a query on an index are always sorted by the index's sort key, in ascending order by default. Setting `ScanIndexForward` to `False` reverses that order, so the most recently updated items come first.

In [11]:
# get the latest 10 decks
response = flash_cards.query(
    IndexName='LSI_01_UserIdLastUpdatedDateTime',
    ExpressionAttributeValues={
        ':user_id': 'dongkyun'
    },
    KeyConditionExpression='UserId = :user_id',
    ScanIndexForward=False,
    Limit=10,
    ReturnConsumedCapacity='INDEXES'
)

pprint(response)

{'ConsumedCapacity': {'CapacityUnits': 0.5,
                      'LocalSecondaryIndexes': {'LSI_01_UserIdLastUpdatedDateTime': {'CapacityUnits': 0.5}},
                      'Table': {'CapacityUnits': 0.0},
                      'TableName': 'FlashCards'},
 'Count': 3,
 'Items': [{'BackMessage': '66e993d4f23d4da490f61f7f981bbf30',
            'CardNo': Decimal('9'),
            'DeckId': 'DynamoDB',
            'FrontMessage': 'bef7fd610acd4621ae3bc97ef5a5a599',
            'LastUpdatedDateTime': '2020-10-05 02:18:43.811198',
            'UserId': 'dongkyun'},
           {'BackMessage': 'aa518a9f00ca4733b2f04ab9c80578bb',
            'CardNo': Decimal('9'),
            'DeckId': 'AWS',
            'FrontMessage': 'd7e58b0df85d42b7b01d338cd7ca21ea',
            'LastUpdatedDateTime': '2020-10-05 02:18:43.729397',
            'UserId': 'dongkyun'},
           {'BackMessage': '524b5ab5fe5a4637a0c90a0d4d887937',
            'CardNo': Decimal('9'),
            'DeckId': 'Python',
         

Without the index, we would have to run the following query against the table and sort the result on the application side.

In [12]:
response = flash_cards.query(
    ExpressionAttributeValues={
        ':user_id': 'dongkyun'
    },
    KeyConditionExpression='UserId = :user_id',
    ReturnConsumedCapacity='INDEXES'
)

latest_items = sorted(response['Items'], key=lambda item: item['LastUpdatedDateTime'], reverse=True)
pprint(latest_items)

[{'BackMessage': '66e993d4f23d4da490f61f7f981bbf30',
  'CardNo': Decimal('9'),
  'DeckId': 'DynamoDB',
  'FrontMessage': 'bef7fd610acd4621ae3bc97ef5a5a599',
  'LastUpdatedDateTime': '2020-10-05 02:18:43.811198',
  'UserId': 'dongkyun'},
 {'BackMessage': 'aa518a9f00ca4733b2f04ab9c80578bb',
  'CardNo': Decimal('9'),
  'DeckId': 'AWS',
  'FrontMessage': 'd7e58b0df85d42b7b01d338cd7ca21ea',
  'LastUpdatedDateTime': '2020-10-05 02:18:43.729397',
  'UserId': 'dongkyun'},
 {'BackMessage': '524b5ab5fe5a4637a0c90a0d4d887937',
  'CardNo': Decimal('9'),
  'DeckId': 'Python',
  'FrontMessage': 'e6a4e9b1d9e54e449d0a7ce6142da787',
  'LastUpdatedDateTime': '2020-10-05 02:18:43.647216',
  'UserId': 'dongkyun'}]
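
To see why the index helps, here is a minimal in-memory sketch (plain Python, not the DynamoDB API). It assumes an index behaves like a per-partition list that is kept sorted by the index sort key, so a descending read with a limit only has to touch the first few entries. The class `ToyIndex` and its method names are hypothetical; the timestamps are the ones printed above.

```python
import itertools
from bisect import insort
from collections import defaultdict


class ToyIndex:
    """Toy model of an index: items grouped by partition key, sorted by sort key."""

    def __init__(self, partition_key, sort_key):
        self.partition_key = partition_key
        self.sort_key = sort_key
        self._partitions = defaultdict(list)
        self._sequence = itertools.count()  # tie-breaker so dicts are never compared

    def put(self, item):
        entry = (item[self.sort_key], next(self._sequence), item)
        # insort keeps the partition ordered at write time
        insort(self._partitions[item[self.partition_key]], entry)

    def query(self, partition_value, scan_index_forward=True, limit=None):
        entries = self._partitions.get(partition_value, [])
        ordered = entries if scan_index_forward else reversed(entries)
        # stop after `limit` entries instead of reading the whole partition
        return [item for _, _, item in itertools.islice(ordered, limit)]


index = ToyIndex('UserId', 'LastUpdatedDateTime')
for deck, updated_at in [
    ('AWS', '2020-10-05 02:18:43.729397'),
    ('Python', '2020-10-05 02:18:43.647216'),
    ('DynamoDB', '2020-10-05 02:18:43.811198'),
]:
    index.put({'UserId': 'dongkyun', 'DeckId': deck, 'LastUpdatedDateTime': updated_at})

latest = index.query('dongkyun', scan_index_forward=False, limit=2)
oldest_first = index.query('dongkyun')
print([item['DeckId'] for item in latest])
```

The sorting cost is paid once per write rather than on every read, which is roughly the trade-off an index makes.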


Actually, we don't need an additional local secondary index for this use case. If the table's sort key were a combination of `LastUpdatedDateTime` and `DeckId`, the same access pattern could be served without any index. The index in this tutorial is for exercise only.
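
As a rough sketch of that alternative, the sort key could be built by concatenating the timestamp and the deck ID. The `#` separator, the timestamp format, and the helpers `make_sort_key` and `split_sort_key` are assumptions for illustration, not part of the schema above. Because fixed-width timestamps sort lexicographically in time order, a descending read on such a key would return the latest decks first.

```python
from datetime import datetime

SEPARATOR = '#'  # assumed delimiter; it must not appear in the timestamp
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'  # fixed width, so string order == time order


def make_sort_key(updated_at, deck_id):
    """Build a composite sort key such as '2020-10-05T02:18:43.811198#DynamoDB'."""
    return updated_at.strftime(TIMESTAMP_FORMAT) + SEPARATOR + deck_id


def split_sort_key(sort_key):
    """Recover the timestamp and the deck ID from a composite sort key."""
    timestamp, deck_id = sort_key.split(SEPARATOR, 1)
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT), deck_id


keys = [
    make_sort_key(datetime(2020, 10, 5, 2, 18, 43, 729397), 'AWS'),
    make_sort_key(datetime(2020, 10, 5, 2, 18, 43, 647216), 'Python'),
    make_sort_key(datetime(2020, 10, 5, 2, 18, 43, 811198), 'DynamoDB'),
]

# descending string order corresponds to ScanIndexForward=False
latest_first = sorted(keys, reverse=True)
print(latest_first[0])
```

One trade-off of this design: key attributes cannot be changed in place, so updating a deck's timestamp would mean deleting the old item and writing a new one.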

## Global Secondary Index

In this section, we're going to use a web server log file located at `data/logfile_medium1.csv`. The file content is quite simple, so you can understand its structure just by opening it. The partition key is the request ID in the first column, and there is no sort key.

In [13]:
# create a table
logs = dynamodb.create_table(
    TableName='Logs',
    AttributeDefinitions=[
        {'AttributeName': 'RequestId', 'AttributeType': 'S'}
    ],
    KeySchema=[
        {'AttributeName': 'RequestId', 'KeyType': 'HASH'}
    ],
    BillingMode='PAY_PER_REQUEST'
)

logs.wait_until_exists()

In [14]:
# import data, only 100 rows to save our time
items = []

with open('data/logfile_medium1.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f, fieldnames=['RequestId', 'IP', 'Date', 'Hour', 'Timezone', 'HttpMethod', 'Path', 'ResponseCode', 'Bytes', 'Client'])
    
    for row in reader:
        item = {key: value for key, value in row.items() if value != ''}
        item['RequestId'] = 'Request#' + item['RequestId']
        item['ResponseCode'] = int(item['ResponseCode'])
        item['Bytes'] = int(item['Bytes'])
    
        items.append(item)

with logs.batch_writer() as batch:
    for item in items[:100]:
        batch.put_item(Item=item)
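
The row-cleaning step can be tried without touching DynamoDB. The sketch below applies the same transformation to an in-memory CSV. The sample row is made up for illustration (only the column order follows the `fieldnames` list above), and `clean_row` is a hypothetical helper name.

```python
import csv
import io

FIELDNAMES = ['RequestId', 'IP', 'Date', 'Hour', 'Timezone', 'HttpMethod',
              'Path', 'ResponseCode', 'Bytes', 'Client']


def clean_row(row):
    """Apply the same clean-up as the import cell to one CSV row."""
    # drop attributes whose value is an empty string
    item = {key: value for key, value in row.items() if value != ''}
    # prefix the request ID and convert numeric columns to int
    item['RequestId'] = 'Request#' + item['RequestId']
    item['ResponseCode'] = int(item['ResponseCode'])
    item['Bytes'] = int(item['Bytes'])
    return item


# made-up sample row; the last column (Client) is intentionally empty
sample_csv = '57,10.0.0.1,2017-07-20,20,GMT-0700,GET,/gallery/main.php?g2_highlightId=685,302,5,\n'

reader = csv.DictReader(io.StringIO(sample_csv), fieldnames=FIELDNAMES)
cleaned = [clean_row(row) for row in reader]
print(cleaned[0])
```

Note that `clean_row` raises `KeyError` if `ResponseCode` or `Bytes` is empty in a row, just like the original loop would.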

Now a new requirement arrives for a batch process: fetch the data for a specific day, filtered by response code, such as `Date = '2017-07-20' and ResponseCode = 302`. With the current schema, the only option is to scan every item in the table.

By creating a global secondary index (PK: `Date`, SK: `ResponseCode`), we can satisfy the new query pattern. Unlike a local secondary index, a global secondary index can be added after table creation with an `UpdateTable` call (`Table.update()` in boto3).

In [15]:
# add GSI
logs = logs.update(
    AttributeDefinitions=[
        {'AttributeName': 'Date', 'AttributeType': 'S'},
        {'AttributeName': 'ResponseCode', 'AttributeType': 'N'}
    ],
    GlobalSecondaryIndexUpdates=[
        {
            'Create': {
                'IndexName': 'IndexDateResponseCode',
                'KeySchema': [
                    {'AttributeName': 'Date', 'KeyType': 'HASH'},
                    {'AttributeName': 'ResponseCode', 'KeyType': 'RANGE'}
                ],
                'Projection': {
                    'ProjectionType': 'INCLUDE',
                    'NonKeyAttributes': ['Hour', 'Timezone', 'Path']
                }
            }
        }
    ]
)

gsi_status = logs.global_secondary_indexes[0]['IndexStatus']
pprint(gsi_status)

'CREATING'


In [16]:
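# poll every 30 seconds until the index becomes ACTIVE
while gsi_status != 'ACTIVE':
    print('{}: {}'.format(datetime.now(), gsi_status))
    time.sleep(30)
    # re-create the Table resource so the index status is fetched again
    gsi_status = dynamodb.Table('Logs').global_secondary_indexes[0]['IndexStatus']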

2020-10-05 02:33:41.819176: CREATING
2020-10-05 02:34:11.859253: CREATING
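
The loop above waits indefinitely. A slightly more defensive variant, sketched below, adds a timeout. `wait_for_status` and its parameters are hypothetical names; the status source is passed in as a callable so the logic can be exercised without AWS. With the table from this tutorial, one would pass something like `lambda: dynamodb.Table('Logs').global_secondary_indexes[0]['IndexStatus']`.

```python
import time


def wait_for_status(get_status, target='ACTIVE', interval=30.0, timeout=600.0):
    """Poll `get_status()` until it returns `target` or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'status is still {!r} after {} seconds'.format(status, timeout))
        time.sleep(interval)


# fake status source standing in for the DynamoDB call
fake_statuses = iter(['CREATING', 'CREATING', 'ACTIVE'])
result = wait_for_status(lambda: next(fake_statuses), interval=0, timeout=5)
print(result)
```

Checking the status before sleeping means the function returns immediately when the index is already active.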


A global secondary index is queried in exactly the same way as a local one. To serve the new query pattern with `Date` and `ResponseCode`, we can run this query.

In [17]:
response = logs.query(
    IndexName='IndexDateResponseCode',
    KeyConditionExpression=Key('Date').eq('2017-07-20') & Key('ResponseCode').eq(302),
    Limit=5,
    ReturnConsumedCapacity='INDEXES'
)

pprint(response)

{'ConsumedCapacity': {'CapacityUnits': 0.5,
                      'GlobalSecondaryIndexes': {'IndexDateResponseCode': {'CapacityUnits': 0.5}},
                      'Table': {'CapacityUnits': 0.0},
                      'TableName': 'Logs'},
 'Count': 5,
 'Items': [{'Date': '2017-07-20',
            'Hour': '20',
            'Path': '/gallery/main.php?g2_itemId=17878&g2_highlightId=17974',
            'RequestId': 'Request#57',
            'ResponseCode': Decimal('302'),
            'Timezone': 'GMT-0700'},
           {'Date': '2017-07-20',
            'Hour': '20',
            'Path': '/gallery/main.php?g2_highlightId=685',
            'RequestId': 'Request#47',
            'ResponseCode': Decimal('302'),
            'Timezone': 'GMT-0700'},
           {'Date': '2017-07-20',
            'Hour': '20',
            'Path': '/gallery/main.php?g2_itemId=24659&g2_highlightId=24674',
            'RequestId': 'Request#20',
            'ResponseCode': Decimal('302'),
            'Timezone': 'G