Skip to content

Persistence

Joshua Essex edited this page Oct 8, 2019 · 1 revision

Persistence

Once entities and entity graphs have successfully passed through our ingest pipeline, they are ready to be persisted to our database. There is not much complexity to discuss here, but a few points are worth emphasizing.

General-purpose

There is no jurisdiction-specific logic within the persistence layer, neither through configuration parameters, inheritance, or plugin hooks. Further, the code is general-purpose across different conceptual schemas, i.e. it can take in entity graphs from both the county corrections and state corrections and supervision schemas and persist them through the same logical flow. This is important because it implies certain database invariants will be consistent across all data we ever write into our database.

Attempts to de-generalize persistence code should be reviewed carefully.

SQLAlchemy

Our persistence layer is built around the SQLAlchemy ORM. The ORM requires classes for our entity models that are defined using SQLAlchemy types and decorators -- queries to the database (both reads and writes) can then be constructed using OOP-style method calls directly on these entities, instead of hand-written queries. To ORM or not to ORM is a classical debate in software engineering, and we have chosen to ORM for the time being as it has accelerated database-related development considerably. This may be revisited in the future.

Executing queries within a transaction is a matter of opening a session, merging objects meant to be added, updated, or deleted to that session, and then committing that session. An example of this can be seen in recidiviz.persistence.database.database#_save_record_trees. This takes in a list of entity graphs rooted around specific person objects, all of which are meant to be updated within the database, and merges them into the given session. The session is flushed at the end, but opening and committing the session happens further up in the stack, in the persistence.py#write function which invokes this.

Application vs persistence layer definitions

This does mean that for every application-layer entity model (e.g. the classes in state/entities.py or county/entities.py) there is an equivalent persistence-layer entity model (e.g. the classes in state/schema.py or county/schema.py) and there is logic in the recidiviz.persistence.database.schema_entity_converter package for transforming between these objects while passing through the persistence interface.

Here is a simple, abbreviated example showing the difference in these two entity types, from our county corrections schema:

Application-layer definition of a Bond:

@attr.s
class Bond(ExternalIdEntity, BuildableAttr, DefaultableAttr):
    """Models a Bond on a particular Charge."""

    amount_dollars: Optional[int] = attr.ib()
    bond_type: Optional[BondType] = attr.ib()
    bond_type_raw_text: Optional[str] = attr.ib()
    status: BondStatus = attr.ib()  # non-nullable
    status_raw_text: Optional[str] = attr.ib()
    bond_agent: Optional[str] = attr.ib()

    bond_id: Optional[int] = attr.ib(default=None)
    booking_id: Optional[int] = attr.ib(default=None)

Persistence-layer definition of a Bond:

class _BondSharedColumns:
    """A mixin which defines all columns common to Bond and BondHistory"""

    # Consider this class a mixin and only allow instantiating subclasses
    def __new__(cls, *_, **__):
        if cls is _BondSharedColumns:
            raise Exception('_BondSharedColumns cannot be instantiated')
        return super().__new__(cls)

    external_id = Column(String(255), index=True)
    amount_dollars = Column(Integer)
    bond_type = Column(bond_type)
    bond_type_raw_text = Column(String(255))
    status = Column(bond_status, nullable=False)
    status_raw_text = Column(String(255))
    bond_agent = Column(String(255))

    @declared_attr
    def booking_id(self):
        return Column(
            Integer,
            ForeignKey(
                'booking.booking_id', deferrable=True, initially='DEFERRED'),
            nullable=False)


class Bond(JailsBase, _BondSharedColumns):
    """Represents a bond in the SQL schema"""
    __tablename__ = 'bond'

    bond_id = Column(Integer, primary_key=True)

Historical snapshot updates

As discussed in detail in Temporal Tables, our database includes application-time temporal tables to allow tracking the state of each entity at a given point in/duration of time. The aforementioned _save_record_trees function in database.py calls the historical_snapshot_update module, which invokes our logic for determining which snapshots need to be created/opened and which need to be closed, based on the state of the ingested entities and their counterparts in the database, should they exist.

The SQLAlchemy session is passed from database.py to historical_snapshot_update.py so that both primary and historical table writes happen within the same transaction.

Once the historical update has completed successfully, the database.py#save_record_trees call returns and the caller commits the session, sealing the entire transaction in the database. If a failure occurred during snapshot updating, the error propagates upward until it gets caught, at which point the transaction is explicitly rolled back.

Scrape-based aggregate ingest

Public reports ingested via our scrape-based aggregate ingest channel have their own persistence flow, within recidiviz.persistence.database.aggregate. This includes SQLAlchemy ORM entity models for different aggregate tables, and a dao.py class for writing ingested numbers. This has little overlap with the rest of the logic described above.