
Cache PAG serialization #20

@SamStudio8

Description

After three months of Majora-ing I think we have discovered an interesting flaw in the process model. I think it's important that we're able to model the concepts of samples, tubes, boxes, files, directories and the processes that are applied to them. It means we can quickly return information on particular artifacts and more easily model how to create and validate such artifacts through the API. It makes natural sense to send and receive information about these real world items through the API with structures that try to represent them.

Yet, when it comes to analyses, we most often want to dump our knowledge about these carefully crafted objects into a gigantic unstructured flat file to tabulate, count and plot things of interest. It's not impossible to do this - we already can unroll all the links between artifacts and processes to traverse the process tree model that is central to how Majora records the journal of an artifact.
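To make the traversal concrete, here is a minimal sketch in plain Python (hypothetical stand-ins, not Majora's actual Django models) of what "unrolling the links" between artifacts and processes amounts to — walking backwards from a final artifact and collecting every process on its journey:

```python
from dataclasses import dataclass, field

# Hypothetical in-memory stand-ins for artifact/process records.
@dataclass
class Process:
    name: str
    metadata: dict
    in_artifact: "Artifact"
    out_artifact: "Artifact"

@dataclass
class Artifact:
    name: str
    upstream: list = field(default_factory=list)  # processes that produced this artifact

def unroll(artifact):
    """Walk backwards from an artifact, yielding every process on its journey."""
    for process in artifact.upstream:
        yield from unroll(process.in_artifact)
        yield process

# A toy journey: sample -> (extraction) -> library -> (sequencing) -> fasta
sample = Artifact("sample-1")
library = Artifact("library-1")
fasta = Artifact("fasta-1")
library.upstream.append(Process("extraction", {"kit": "A"}, sample, library))
fasta.upstream.append(Process("sequencing", {"run": "R1"}, library, fasta))

print([p.name for p in unroll(fasta)])  # ['extraction', 'sequencing']
```

In the real system each hop is a database query across several models, which is where the slowness comes from.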

The two issues with this are:

  • The unravelling is quite slow, likely owing to the suboptimal implementation (given my Django learning curve and time constraints) and the sheer number of models involved
  • The unravelling is quite inflexible. Currently the API supports unravelling Published Artifact Groups and Sequencing Runs and not much else. The serializers for the latter are even a special implementation to work specifically for flattening metadata and metrics for artifacts that lead up to a sequencing run.

The first is not hugely problematic: we request this data from the database infrequently. The second, however, is why I'm writing this issue. I want users to be able to request specific information ("columns") of metadata pertaining to any group of artifacts in the system - ideally in a fast and simple fashion.
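The "columns" idea might look something like the following sketch (plain Python, hypothetical function and field names - not an existing Majora API) - each process serialises its metadata and the caller restricts the output to the columns they asked for:

```python
def serialize(processes, columns=None):
    """Flatten process metadata into rows, keeping only the requested columns."""
    rows = []
    for process in processes:
        row = {"process": process["name"], **process["metadata"]}
        if columns is not None:
            row = {k: v for k, v in row.items() if k in columns}
        rows.append(row)
    return rows

journey = [
    {"name": "extraction", "metadata": {"kit": "A", "operator": "sam"}},
    {"name": "sequencing", "metadata": {"run": "R1", "operator": "sam"}},
]
print(serialize(journey, columns={"process", "operator"}))
# [{'process': 'extraction', 'operator': 'sam'}, {'process': 'sequencing', 'operator': 'sam'}]
```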

This led me to think more about what the PAG really represents: if you think about it, the Published Artifact Group is a brief highlight reel of the journey an artifact has taken through its analysis lifespan (eg. for the COVID work, a PAG shows the sample and its FASTA - skipping everything in-between). We can formalise the idea of binding everything (including that in-between part) by specifically linking all the processes that were performed onto the Published Artifact Group.

I've previously discussed this idea and first thought about collecting all the processes from the start of the process tree to the end (eg. a sample, through to its FASTA) and adding these to a process_set on the Published Artifact Group. One could then ask all the processes in this group to serialize themselves, potentially with some context (eg. "these columns only"). We can formalise this slightly better by adding a concrete idea of a "journal" as a many-to-many relation on the Artifact and Process-related models.

That is, we still maintain the audit linkage of what processes were applied to which artifacts and when. But once the result of such a journey is final and a Published Artifact Group is minted, we can collect all those processes and label them with a specific journal_id. This means we can fetch all the processes related to a PAG/journal and serialise them without processing the tree.
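A sketch of the journal idea (again plain Python with hypothetical names, standing in for the Django models and their many-to-many relation): the tree is walked exactly once, at PAG-minting time, to stamp each process with a journal id; thereafter fetching the journey is a flat lookup.

```python
from collections import defaultdict

# Hypothetical flat store of processes, already linked into the process tree.
processes = [
    {"id": 1, "name": "extraction", "journal_id": None},
    {"id": 2, "name": "sequencing", "journal_id": None},
]

journal_index = defaultdict(list)  # stands in for a many-to-many table

def mint_pag(pag_id, journey):
    """When a PAG is minted, stamp each process on its journey with a journal id."""
    for process in journey:
        process["journal_id"] = pag_id
        journal_index[pag_id].append(process)

def processes_for(pag_id):
    """Fetch the journal directly - no tree traversal needed."""
    return journal_index[pag_id]

mint_pag("PAG-1", processes)
print([p["name"] for p in processes_for("PAG-1")])  # ['extraction', 'sequencing']
```

The audit linkage is untouched; the journal is purely an extra, denormalised index over it.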

If this still doesn't suffice, a post_save hook on the PAG could serialize all the information and store it as JSON in PostgreSQL or similar.
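That fallback could be as simple as the following sketch (a dict standing in for a JSON column; the hook and field names are hypothetical, not an existing Majora implementation):

```python
import json

cache = {}  # stand-in for a JSON(B) column keyed by PAG

def on_pag_saved(pag_id, journey):
    """post_save-style hook: serialize the whole journey once and cache it."""
    cache[pag_id] = json.dumps(
        {"pag": pag_id, "processes": [p["name"] for p in journey]}
    )

def fetch_cached(pag_id):
    """Serve later requests straight from the cached JSON blob."""
    return json.loads(cache[pag_id])

journey = [{"name": "extraction"}, {"name": "sequencing"}]
on_pag_saved("PAG-1", journey)
print(fetch_cached("PAG-1")["processes"])  # ['extraction', 'sequencing']
```

The trade-off is staleness: the blob must be rewritten whenever the PAG changes, which is why the journal relation feels like the better first step.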

Labels

  • P:HIGH (Presents a significant roadblock to activities)
  • enhancement (New feature or request)
  • next (cool things coming soon)
  • perf
