Data System High Level Design

David Ihnen edited this page Feb 9, 2020 · 6 revisions

This is a back end data system designed to deliver, at scale and volume, information to drive the phylogeny explorer application.

The data has some interesting characteristics.

  • slow mutation rate relative to query rate - the delta:query ratio approaches 0
  • graph structure - the overall shape is that of a graph
  • graph-directed inheritance - normalized attributes apply to children
  • normalized storage - each attribute is represented only at its change points in the canonical data
  • highly denormalized delivery - delivered entry records carry every relevant attribute
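
The relationship between normalized storage and denormalized delivery can be sketched as follows. This is a minimal illustration, not the project's implementation: the `PARENT` and `CHANGES` tables, the clade names, and the attribute names are all hypothetical, and it assumes each clade has a single parent and that a value set nearer to the clade overrides one set by a more distant ancestor.

```python
from typing import Dict, Optional

# Hypothetical normalized storage: each clade stores only the attributes
# that change at that node, plus a pointer to its (assumed single) parent.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "child": "root",
    "grandchild": "child",
}
CHANGES: Dict[str, Dict[str, str]] = {
    "root": {"color": "red", "size": "small"},
    "child": {"size": "large"},
    "grandchild": {"shape": "round"},
}


def denormalize(clade: str) -> Dict[str, str]:
    """Build the full delivered record for `clade` from its ancestor chain."""
    chain = []
    node: Optional[str] = clade
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    record: Dict[str, str] = {}
    # Apply the root first so that nearer ancestors override distant ones.
    for ancestor in reversed(chain):
        record.update(CHANGES.get(ancestor, {}))
    return record


print(denormalize("grandchild"))
# {'color': 'red', 'size': 'large', 'shape': 'round'}
```

The delivered record repeats inherited values on every entry, which is the trade the design accepts: more stored bytes in exchange for no inheritance walk at request time.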

We have identified several types of query that we will need to serve:

  • Clade Entry - retrieval of a particular clade entry. Dozens of times at page load, second-by-second during exploration
  • Text Search index
    • Attribute names - the process of finding attributes to use for other searches
      • Autocomplete support - performant text substring searching - 100ms @ 95% peak active user load.
    • Clade names
      • Autocomplete support - performant text substring searching - 100ms @ 95% peak active user load.
    • Description text
      • more of a 'help us find things by description' search, with its own search result page
  • Graph search index
    • retrieval of clade entries based on their relationships
  • Time interval index - clades which intersect with a particular time point
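
As one example of these query types, the time interval lookup can be sketched as below. The `CladeSpan` type, its field names, and the sample data are hypothetical, and the linear scan is only a stand-in: a real index would more likely use a sorted structure or an interval tree.

```python
from typing import List, NamedTuple


class CladeSpan(NamedTuple):
    clade_id: str
    first: float  # hypothetical: start of the interval, any consistent time unit
    last: float   # hypothetical: end of the interval, with first <= last


def clades_at(spans: List[CladeSpan], time_point: float) -> List[str]:
    """Return ids of clades whose closed interval [first, last] contains time_point."""
    return [s.clade_id for s in spans if s.first <= time_point <= s.last]


SPANS = [CladeSpan("a", 0, 10), CladeSpan("b", 5, 20), CladeSpan("c", 15, 30)]
print(clades_at(SPANS, 7))  # ['a', 'b']
```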

#Scale and flow

A guiding principle of this design is to use the least possible amount of processing resources to serve a request. In evolved designs this is handled with a cache, but caching has downsides: it only holds answers to questions that have been asked before, and cache misses load the server heavily, whether they come from expiration or from operational events such as a new CloudFront cache node coming online or a new version of the application being deployed.

We can take a proactive approach effectively because the delta:query ratio approaches zero. We can spend several very intense spot-minutes with high CPU capability pre-answering, assembling, and indexing all of the common query result patterns for rapid retrieval later. The rest of the site's operation then needs minimal resources, which allows a smaller cluster of servers to handle the live load. This approach also makes it trivial to keep requests stateless and to distribute them across more hardware as demand increases, with no star topology and its accompanying headaches of replication synchronization and ultimate dependency on a single conceptual system that coordinates state across the application.
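
The pre-answering idea can be sketched as a batch step that writes finished results and a request step that only reads them. This is an illustration under assumptions: the function names, the one-JSON-file-per-entry layout, and the use of a local directory as the store are all hypothetical choices, not something the design specifies.

```python
import json
from pathlib import Path
from typing import Dict


def publish_entries(entries: Dict[str, dict], out_dir: Path) -> None:
    """Batch step: write one pre-assembled JSON document per clade entry."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for clade_id, record in entries.items():
        path = out_dir / f"{clade_id}.json"
        path.write_text(json.dumps(record, sort_keys=True))


def serve_entry(clade_id: str, out_dir: Path) -> dict:
    """Request step: a plain read, with no computation and no shared state."""
    return json.loads((out_dir / f"{clade_id}.json").read_text())
```

Because `serve_entry` depends only on the published files, any number of identical servers could answer the same request without coordinating with each other.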

SYSTEM CONCEPTUAL BLOCKS

  • Static Content

    • A simple file system accessible through the main web url.
    • Provides application resources, images, and other unchanging bits of front end related content.
    • Also has a category for media resources particular to a clade entry
  • Front End Web Server

    • Provides routing for the web application front end
    • Provides api endpoints internal to the web application
  • Authentication and Authorization Store

    • Who are our users
    • What are our groups
    • What are our roles
    • What actions can a role do
    • What actions can't a role do
    • Which users are in which groups
    • Which groups are in which groups
  • Data API Server

    • Provides API endpoints for access to the indexed and queryable data
    • External
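
The questions listed for the Authentication and Authorization Store suggest a small permission model, sketched below. Everything here is an assumption made for illustration: the table names, the sample users and groups, the choice to attach roles to groups, and the rule that an explicit deny overrides any allow.

```python
from typing import Dict, Set

# Hypothetical data, shaped after the questions the store must answer.
USER_GROUPS: Dict[str, Set[str]] = {"alice": {"editors"}, "bob": {"trainees"}}
GROUP_PARENTS: Dict[str, Set[str]] = {  # which groups are in which groups
    "trainees": {"editors"},
    "editors": {"members"},
    "members": set(),
}
GROUP_ROLES: Dict[str, Set[str]] = {  # assumption: roles attach to groups
    "trainees": {"trainee"},
    "editors": {"editor"},
    "members": {"reader"},
}
ROLE_ALLOW: Dict[str, Set[str]] = {"reader": {"view"}, "editor": {"edit"}}
ROLE_DENY: Dict[str, Set[str]] = {"trainee": {"edit"}}


def effective_groups(user: str) -> Set[str]:
    """Return the user's groups plus every group reachable through nesting."""
    seen: Set[str] = set()
    stack = list(USER_GROUPS.get(user, set()))
    while stack:
        group = stack.pop()
        if group in seen:  # also guards against cyclic group nesting
            continue
        seen.add(group)
        stack.extend(GROUP_PARENTS.get(group, set()))
    return seen


def can(user: str, action: str) -> bool:
    """Allow only if some role allows the action and no role denies it."""
    roles: Set[str] = set()
    for group in effective_groups(user):
        roles |= GROUP_ROLES.get(group, set())
    allowed = any(action in ROLE_ALLOW.get(r, set()) for r in roles)
    denied = any(action in ROLE_DENY.get(r, set()) for r in roles)
    return allowed and not denied
```

With this sample data, `can("alice", "edit")` is true, while `can("bob", "edit")` is false because the hypothetical `trainee` role denies it even though `bob` inherits the `editor` role through group nesting.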

STATES

ANONYMOUS STATE

The user story lifetime starts by hitting a url that directs the browser to load the static resources that represent the client code of the application, in the form of a webpack bundle and other static content. The application initializes and makes a connection to the front end web server, probably through REST requests by default.

##USER SESSION

The negotiation between the web application and the front end web server continues until user identity is established, as relevant, with the Authentication and Authorization Store.

##QUERY RESULTS PENDING FRONT END ASSEMBLY

The application will make a query based on its display requirements or in response to the user's actions. This query will be communicated to the front end web server via a defined API.

The front end web server will delegate the query to a series of APIs exposed by the Data API server.

The Data API server provides a stable and versioned data access API definition. As the back end representations are rearranged to fit the requirements and flow necessary, the Data API server will hide those changes from the operations of the front end web server (and from other clients as may occur).

The Data API server will fulfill its contract by making suitable requests against the various back end indexes and servers that have been established to fill its needs.

Once the result is assembled, it will be returned to the front end web server.
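
The delegation described above can be sketched as a versioned facade that assembles one result from several back end lookups. All names here (`DataApiV1`, `entry_index`, `graph_index`, `front_end_handler`) and the shape of the returned data are hypothetical; the point is only that the front end depends on the versioned contract and never on the back ends directly.

```python
from typing import Callable, List


# Hypothetical back end lookups; each stands in for one index or server.
def entry_index(clade_id: str) -> dict:
    return {"id": clade_id, "name": f"clade-{clade_id}"}


def graph_index(clade_id: str) -> List[str]:
    return [f"{clade_id}.1", f"{clade_id}.2"]


class DataApiV1:
    """Versioned contract: callers depend on this shape, not on the back ends."""

    version = "v1"

    def __init__(
        self,
        entries: Callable[[str], dict],
        children: Callable[[str], List[str]],
    ) -> None:
        self._entries = entries
        self._children = children

    def clade_with_children(self, clade_id: str) -> dict:
        """Assemble one result from several back end hits."""
        result = dict(self._entries(clade_id))
        result["children"] = [self._entries(c) for c in self._children(clade_id)]
        return result


def front_end_handler(api: DataApiV1, clade_id: str) -> dict:
    """Front end web server side: delegate the query, then relay the result."""
    return {"api_version": api.version, "data": api.clade_with_children(clade_id)}
```

Swapping `entry_index` or `graph_index` for a different back end changes nothing for `front_end_handler`, as long as the `DataApiV1` methods keep returning the same shape.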

##QUERY RESULTS COMPLETE

The front end web server will relay the result to the application and complete the request.