Refactor Snapshot and ArchiveResult to use ulid and typeid instead of uuidv4 #1430

Merged
pirate merged 19 commits into snapshot-detail-ui from ulid-typeid-refactor
Jun 3, 2024

Conversation

pirate
Member

@pirate pirate commented May 12, 2024

Summary

Migrate to Snapshot.ulid and ArchiveResult.ulid (instead of using Snapshot.timestamp as unique key).

Related issues

Fixes: #74

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

This is the new folder layout I'm migrating to after the switch from timestamps to ulids:

There is some nesting to avoid running into trouble with directories having 100k+ files and taking a long time to list.

For maximum practicality I went with [objecttype] / [date] / [domain] / [ulid] as the nesting order.

This satisfies a bunch of the most common use cases:

  • copying/backing up/deleting all of the data from a specific daterange
  • finding all of the data for a specific domain
  • finding all of the data matching a specific ulid

For maximum fun, the ULIDs also embed information about the object type, timestamp, url, subtype, and some randomness (in case you happen to snapshot the same domain with the same extractor a few thousand times in the same millisecond).

archive/
    results/
        [20241231]/
            [example.com]/
                [result.ulid]
                    index.json

                    logs/
                        cmd.sh
                        stderr.log
                        stdout.log
                        returncode.log
                    data/
                        output.pdf                        

    snapshots/
        [20241231]/
            [example.com]/
                [snapshot.ulid]
                    index.json
                    title.txt    -> ./title/title_from_html.txt
                    source.txt   -> /sources/[source.path]
                    persona      -> /personas/[personaname]

                    [result.type] -> /archive/results/*/*/[result.ulid]
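The [objecttype] / [date] / [domain] / [ulid] nesting above can be sketched as a small path builder plus glob patterns for the three use cases listed earlier. This is a minimal illustration, not the code from this PR: `object_dir` and `glob_for` are hypothetical helper names, and deriving the domain folder from the URL hostname is an assumption.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def object_dir(objtype: str, added: datetime, url: str, ulid_str: str) -> PurePosixPath:
    """Build archive/<objtype>/<YYYYMMDD>/<domain>/<ulid> (hypothetical helper)."""
    date = added.strftime("%Y%m%d")
    # Assumption: the [domain] folder is the URL's hostname.
    domain = urlparse(url).hostname or "unknown"
    return PurePosixPath("archive") / objtype / date / domain / ulid_str


def glob_for(objtype: str = "*", date: str = "*", domain: str = "*", ulid_str: str = "*") -> str:
    """Glob pattern matching every object dir that fits the given filters."""
    return f"archive/{objtype}/{date}/{domain}/{ulid_str}"


snapshot_path = object_dir(
    "snapshots",
    datetime(2024, 12, 31, tzinfo=timezone.utc),
    "https://example.com/some/page",
    "01HX9FPYTRE4A5CCD900ZYEBQE",
)
print(snapshot_path)

# The three common lookups map onto one wildcard position each:
print(glob_for(date="202412*"))                         # everything from a date range
print(glob_for(domain="example.com"))                   # everything for one domain
print(glob_for(ulid_str="01HX9FPYTRE4A5CCD900ZYEBQE"))  # one specific object
```

Because the date comes before the domain, date-range operations touch a handful of top-level folders, while domain and ulid lookups need a wildcard over the date level.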

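# Imports assumed for this snippet: the `ulid` module is taken to be the ulid-py
# package and `TypeID` to come from typeid-python.
import hashlib
from typing import NamedTuple

import ulid
from django.db import models
from typeid import TypeID


class ULIDParts(NamedTuple):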
    timestamp: str
    url: str
    subtype: str
    randomness: str


class Snapshot(models.Model):
    ...
    
    @property
    def url_hash(self):
        """
        'E4A5CCD9AF4ED2A6E0954DF19FD274E9CDDB4853051F033FD518BFC90AA1AC25'
        """
        return hashlib.sha256(self.url.encode('utf-8')).hexdigest().upper()
    
    @property
    def ulid_from_urlhash(self):
        """
        'E4A5CCD9'     # takes first 8 characters of sha256(url)
        """
        return self.url_hash[:8]

    @property
    def ulid_from_timestamp(self):
        """
        '01HX9FPYTR'   # produces 10 character Timestamp section of ulid based on added date
        """
        return str(ulid.from_timestamp(self.added))[:10]

    @property
    def ulid_from_type(self):
        """
        Snapshots have 00 type, other objects have other subtypes like wget/media/etc.
        Also allows us to change the ulid spec later by putting special sigil values here.
        """
        return '00'

    @property
    def ulid_from_randomness(self):
        """
        'ZYEBQE'   # takes last 6 characters of randomness from existing legacy uuid db field
        """
        return str(ulid.from_uuid(self.id))[-6:]

    @property
    def ulid_tuple(self) -> ULIDParts:
        """
        ULIDParts(timestamp='01HX9FPYTR', url='E4A5CCD9', subtype='00', randomness='ZYEBQE')
        """
        return ULIDParts(
            self.ulid_from_timestamp,
            self.ulid_from_urlhash,
            self.ulid_from_type,
            self.ulid_from_randomness,
        )

    @property
    def ulid(self):
        """
        <ULID('01HX9FPYTRE4A5CCD900ZYEBQE')>         # new unique primary key
        """
        return ulid.parse(''.join(self.ulid_tuple))

    @property
    def uuid(self):
        """
        UUID('018f52fb-7b58-7114-5631-a9003fe72eee') # uuid4-compatible encoding of new ulid
        """
        return self.ulid.uuid

    @property
    def typeid(self):
        """
        TypeID('snapshot_01hx9fpytre4a5ccd900zyebqe')   # equivalent to <type>_<ulid>
        """
        return TypeID.from_uuid(prefix='snapshot', suffix=self.ulid.uuid)

This has the very cool property that all of the ArchiveResults under a certain snapshot share the same prefix (the timestamp and url hash sections), e.g.:

>>> snap = Snapshot.objects.filter(url__icontains='browserless')[0]
>>> snap.ulid_tuple
ULIDParts(timestamp='01HX9FPYTR', url='E4A5CCD9', subtype='00', randomness='ZYEBQE')
>>> snap.ulid
# 01HX9FPYTRE4A5CCD900ZYEBQE

>>> result = snap.archiveresult_set.last()
>>> result.ulid_tuple
ULIDParts(timestamp='01HX9FPYTR', url='E4A5CCD9', subtype='72', randomness='P0YDSB')
>>> result.ulid
# 01HX9FPYTRE4A5CCD972P0YDSB

This means that all the data in the system that uses this ulid format will sort lexicographically together properly, and in the same order/grouping as the nested archive/ folder structure provides.

Even if all the data were thrown together in one big folder it would maintain all the nice ordering properties of objtype > date > domain > subtype > uuid.
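To illustrate the ordering claim, here is a minimal sketch that works on plain strings only (no `ulid` library needed). It assumes the 10 + 8 + 2 + 6 character layout shown in `ULIDParts` above; `split_ulid` and `snapshot_prefix` are hypothetical helpers, and the third ID is made up for the example.

```python
from typing import NamedTuple


class ULIDParts(NamedTuple):
    timestamp: str   # 10 chars
    url: str         # 8 chars, start of sha256(url)
    subtype: str     # 2 chars, '00' for snapshots
    randomness: str  # 6 chars


def split_ulid(value: str) -> ULIDParts:
    """Split a 26-character ulid string into its four sections (hypothetical helper)."""
    if len(value) != 26:
        raise ValueError(f"expected 26 characters, got {len(value)}")
    return ULIDParts(value[:10], value[10:18], value[18:20], value[20:])


def snapshot_prefix(value: str) -> str:
    """Timestamp + url hash: shared by a snapshot and all of its results."""
    return value[:18]


snapshot_id = "01HX9FPYTRE4A5CCD900ZYEBQE"  # from the example above
result_id = "01HX9FPYTRE4A5CCD972P0YDSB"    # from the example above
# Made-up ID with a slightly later timestamp section, for illustration only.
later_id = "".join(["01HX9FPYTS", "AAAAAAAA", "00", "AAAAAA"])

ordered = sorted([later_id, result_id, snapshot_id])
for item in ordered:
    print(item, split_ulid(item))

# The snapshot sorts directly before its own result, and both share a prefix.
print(snapshot_prefix(snapshot_id) == snapshot_prefix(result_id))
```

Plain string sorting is enough here because the digits and uppercase letters used in these IDs sort in the same order in ASCII as in the ULID alphabet, and the '00' subtype places each snapshot ahead of its results.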

@pirate
Member Author

pirate commented May 13, 2024

Screenshot 2024-05-13 at 7 48 38 AM
Screenshot 2024-05-13 at 7 48 22 AM

@pirate pirate merged commit 3114980 into snapshot-detail-ui Jun 3, 2024
2 of 7 checks passed
@pirate pirate deleted the ulid-typeid-refactor branch June 3, 2024 00:53
Successfully merging this pull request may close these issues.

Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp
1 participant