Refactor Snapshot and ArchiveResult to use `ulid` and `typeid` instead of `uuidv4` #1430

pirate · 2024-05-12T11:46:21Z

Summary

Migrate to Snapshot.ulid and ArchiveResult.ulid (instead of using Snapshot.timestamp as unique key).

Related issues

Fixes: #74

Changes these areas

This is the new folder layout I'm migrating to after the switch from timestamps to ulids:

There is some nesting to avoid running into trouble with directories having 100k+ files and taking a long time to list.

For maximum practicality I went with [objecttype] / [date] / [domain] / [ulid] as the nesting order.

This satisfies a bunch of the most common use cases:

copying/backing up/deleting all of the data from a specific daterange
finding all of the data for a specific domain
finding all of the data matching a specific ulid

For maximum fun, the ULIDs also embed information about the object type, timestamp, url, subtype, and some randomness (in case you happen to snapshot the same domain with the same extractor a few thousand times in the same millisecond).

archive/
    results/
        [20241231]/
            [example.com]/
                [result.ulid]/
                    index.json

                    logs/
                        cmd.sh
                        stderr.log
                        stdout.log
                        returncode.log
                    data/
                        output.pdf                        

    snapshots/
        [20241231]/
            [example.com]/
                [snapshot.ulid]/
                    index.json
                    title.txt    -> ./title/title_from_html.txt
                    source.txt   -> /sources/[source.path]
                    persona      -> /personas/[personaname]

                    [result.type] -> /archive/results/*/*/[result.ulid]
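As a rough illustration of the [objecttype] / [date] / [domain] / [ulid] nesting above, here is a minimal sketch of how such a path could be assembled. The `storage_path` helper, its arguments, and the `"unknown"` fallback are hypothetical names made up for this example, not code from this PR, and the ULID string is only reused from the examples in this description as a placeholder.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def storage_path(obj_type: str, added: datetime, url: str, ulid_str: str) -> PurePosixPath:
    """Build archive/<objecttype>/<date>/<domain>/<ulid> for one object (hypothetical helper)."""
    date_dir = added.strftime("%Y%m%d")                # e.g. 20241231
    domain_dir = urlparse(url).hostname or "unknown"   # assumed fallback for URLs without a host
    return PurePosixPath("archive") / obj_type / date_dir / domain_dir / ulid_str


path = storage_path(
    "snapshots",
    datetime(2024, 12, 31, tzinfo=timezone.utc),
    "https://example.com/some/page",
    "01HX9FPYTRE4A5CCD900ZYEBQE",  # placeholder ULID, not tied to this date
)
print(path)
```

Putting the date ahead of the domain keeps date-range backups and deletes to a single directory glob, while per-domain lookups only need a glob over the date level.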

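import hashlib
from typing import NamedTuple

import ulid                       # assumed: the ulid-py package
from django.db import models
from typeid import TypeID         # assumed: the typeid-python package


class ULIDParts(NamedTuple):
    timestamp: str
    url: str
    subtype: str
    randomness: str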


class Snapshot(models.Model):
    ...
    
    @property
    def url_hash(self):
        """
        'E4A5CCD9AF4ED2A6E0954DF19FD274E9CDDB4853051F033FD518BFC90AA1AC25'
        """
        return hashlib.sha256(self.url.encode('utf-8')).hexdigest().upper()
    
    @property
    def ulid_from_urlhash(self):
        """
        'E4A5CCD9'     # takes first 8 characters of sha256(url)
        """
        return self.url_hash[:8]

    @property
    def ulid_from_timestamp(self):
        """
        '01HX9FPYTR'   # produces 10 character Timestamp section of ulid based on added date
        """
        return str(ulid.from_timestamp(self.added))[:10]

    @property
    def ulid_from_type(self):
        """
        Snapshots have 00 type, other objects have other subtypes like wget/media/etc.
        Also allows us to change the ulid spec later by putting special sigil values here.
        """
        return '00'

    @property
    def ulid_from_randomness(self):
        """
        'ZYEBQE'   # takes last 6 characters of randomness from existing legacy uuid db field
        """
        return str(ulid.from_uuid(self.id))[-6:]

    @property
    def ulid_tuple(self) -> ULIDParts:
        """
        ULIDParts(timestamp='01HX9FPYTR', url='E4A5CCD9', subtype='00', randomness='ZYEBQE')
        """
        return ULIDParts(
            self.ulid_from_timestamp,
            self.ulid_from_urlhash,
            self.ulid_from_type,
            self.ulid_from_randomness,
        )

    @property
    def ulid(self):
        """
        <ULID('01HX9FPYTRE4A5CCD900ZYEBQE')>         # new unique primary key
        """
        return ulid.parse(''.join(self.ulid_tuple))

    @property
    def uuid(self):
        """
        UUID('018f52fb-7b58-7114-5631-a9003fe72eee') # uuid4-compatible encoding of new ulid
        """
        return self.ulid.uuid

    @property
    def typeid(self):
        """
        TypeID('snapshot_01hx9fpytre4a5ccd900zyebqe')   # equivalent to <type>_<ulid>
        """
        return TypeID.from_uuid(prefix='snapshot', suffix=self.ulid.uuid)
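To make the 10 + 8 + 2 + 6 character layout easier to follow without Django or the ulid library, here is a stdlib-only sketch of the same composition. The `encode_base32` and `make_parts` helpers are hypothetical names for this illustration. It assumes the timestamp section is the Crockford base32 encoding of the millisecond epoch time (as in the ULID spec) and that the randomness section is the last 6 characters of the legacy UUID encoded the same way; it hashes the full URL as described above.

```python
import hashlib
import uuid
from datetime import datetime, timezone
from typing import NamedTuple

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # ULID alphabet (no I, L, O, U)


class ULIDParts(NamedTuple):
    timestamp: str
    url: str
    subtype: str
    randomness: str


def encode_base32(value: int, length: int) -> str:
    """Encode a non-negative int as fixed-width Crockford base32 (hypothetical helper)."""
    chars = []
    for _ in range(length):
        chars.append(CROCKFORD[value & 0x1F])  # take the lowest 5 bits
        value >>= 5
    return "".join(reversed(chars))


def make_parts(added: datetime, url: str, subtype: str, legacy_id: uuid.UUID) -> ULIDParts:
    """Assemble the four sections described above (hypothetical helper)."""
    millis = int(added.timestamp() * 1000)
    return ULIDParts(
        timestamp=encode_base32(millis, 10),                               # 10 chars
        url=hashlib.sha256(url.encode("utf-8")).hexdigest().upper()[:8],   # 8 chars
        subtype=subtype,                                                   # 2 chars
        randomness=encode_base32(legacy_id.int, 26)[-6:],                  # 6 chars
    )


parts = make_parts(
    datetime(2024, 5, 12, tzinfo=timezone.utc),
    "https://example.com",
    "00",
    uuid.uuid4(),
)
ulid_str = "".join(parts)
print(parts, ulid_str, len(ulid_str))
```

The uppercase hex digits of the sha256 prefix are all valid Crockford base32 characters, which is why the joined string can still be parsed as a ULID.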

This has the very cool property that all of the ArchiveResults under a certain snapshot share the same prefix, e.g.:

>>> snap = Snapshot.objects.filter(url__icontains='browserless')[0]
>>> snap.ulid_tuple
ULIDParts(timestamp='01HX9FPYTR', url='E4A5CCD9', subtype='00', randomness='ZYEBQE')
>>> snap.ulid
# 01HX9FPYTRE4A5CCD900ZYEBQE

>>> result = snap.archiveresult_set.last()
>>> result.ulid_tuple
ULIDParts(timestamp='01HX9FPYTR', url='E4A5CCD9', subtype='72', randomness='P0YDSB')
# 01HX9FPYTRE4A5CCD972P0YDSB

This means that all the data in the system that uses this ulid format will sort lexicographically together properly, and in the same order/grouping as the nested archive/ folder structure provides.

Even if all the data were thrown together in one big folder it would maintain all the nice ordering properties of objtype > date > domain > subtype > uuid.
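A small sketch of that sorting property: plain string sorting keeps a snapshot and its results adjacent, and grouping on the first 18 characters (timestamp + url hash) recovers each snapshot's group. The first and third IDs come from the example above; the other two are made-up values for illustration only.

```python
from itertools import groupby

ids = [
    "01HX9FPYTRE4A5CCD972P0YDSB",  # ArchiveResult from the example above
    "01HX9G0000AAAAAAAA00000001",  # hypothetical later snapshot of another URL
    "01HX9FPYTRE4A5CCD900ZYEBQE",  # Snapshot from the example above
    "01HX9FPYTRE4A5CCD915ABCDEF",  # hypothetical second result for the same snapshot
]

PREFIX_LEN = 10 + 8  # timestamp section + url hash section

ordered = sorted(ids)  # plain lexicographic sort, no ULID parsing needed
grouped = {
    prefix: list(group)
    for prefix, group in groupby(ordered, key=lambda s: s[:PREFIX_LEN])
}

for prefix, members in grouped.items():
    print(prefix, members)
```

Because the snapshot uses subtype `00`, it sorts ahead of its own results within each group.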

pirate · 2024-05-13T14:49:11Z

pirate added 2 commits May 12, 2024 04:45

add ulid and typeid to Snapshot and ArchiveResult

33bc462

automatically create storage directories and symlinks based on ulid

ce833e8

pirate linked an issue May 12, 2024 that may be closed by this pull request

Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp #74

Closed

pirate added 7 commits May 12, 2024 19:25

dont wait for ipython history saver thread before shell exit

b5ad134

move monkey patches to dedicated file

e97d779

switch from monkey patching WebhookModel to using swappable

f896e5d

create abid_utils with new ABID type for ArchiveBox IDs

4f9f22e

remove accidentally commited db

9733b8d

switch everywhere to use Snapshot.pk and ArchiveResult.pk instead of id

0420662

add migrations to create and populate ABIDField and UUIDField values

206e7e7

pirate added 5 commits May 13, 2024 07:49

only use domain part of uri for hash

1ba8215

add created, modified, updated, created_by and update django admin

241a7c6

add migrations for third round of field changes

a4cc10d

add API support for obj.pk .uuid .abid

406f570

add docstrings

fdf6f46

pirate mentioned this pull request May 16, 2024

Make Could not find profile "Default" in CHROME_USER_DATA_DIR a warning instead of an error, and move to new PERSONAS_DIR system #1425

Open

pirate added 5 commits May 17, 2024 20:11

fix abid calculation

a1afd02

make abids searchable in the admin ui

acfd346

show original section titles in config admin ui

29c7aa2

fix singlefile extractor exception when result is none

774ce3f

change live snapshot preview iframe sandbox rules

e4176db

pirate merged commit 3114980 into snapshot-detail-ui Jun 3, 2024
2 of 7 checks passed

pirate deleted the ulid-typeid-refactor branch June 3, 2024 00:53
