Example of encoding metadata requirements in Python
Peter  15/11

This illustrates how decision trees and metadata requirements might be encoded in Python. Please don’t worry too much about the precise specification encoded in this example – it includes some of my proposals, but I’m just using it to indicate how we might set out details for discussion, implementation and revision.

Looking forward to feedback and comments.

In [1]:
def answer_yes_not_no(prompt):
    """Asks a yes/no question, and returns a boolean, True for answer yes"""
    answer = input(prompt + " (answer y or n) ")
    # Accept common spellings of yes; anything else counts as no.
    # Should cover many more possibilities, probably something in library
    return answer.strip().lower() in ("y", "yes")

def define_small_molecule(): # Collect enough information to identify unambiguously
    """Options to give database reference, chemical structure etc"""
    pass # To be implemented

def get_concentration_and_units(prompt):
    """Ask for input of a concentration including an acceptable unit"""
    # Could return a data type including a number and a unit, perhaps a pointer to one in a list
    pass # To be implemented

def get_mass_and_units(prompt):
    """Ask for input of a mass including an acceptable unit"""
    # Could return a data type including a number and a unit, perhaps a pointer to one in a list
    pass # To be implemented

def get_volume_and_units(prompt):
    """Ask for input of a volume including an acceptable unit"""
    # Could return a data type including a number and a unit, perhaps a pointer to one in a list
    pass # To be implemented

def define_enzyme():
    """Collects information to unambiguously define an enzyme, by either chemical nature, provenance or preparation process"""
    prompt = "\nIs this a fairly pure enzyme of known sequence?"
    if answer_yes_not_no(prompt):
        prompt = "\nIs any post-translational modification known for sure? (perhaps absent)"
        if answer_yes_not_no(prompt):
            collect_sequence()
            collect_posttrans()
            return
        else:
            prompt = "\nWas the enzyme produced in a well-characterised expression system?"
            if answer_yes_not_no(prompt):
                describe_expression_system()
                return
    prompt = "\nWas the enzyme obtained from an external source thought to be reproducible, e.g. commercial supplier?"
    if answer_yes_not_no(prompt):
        collect_provenance()
        return
    prompt = "\nWas the enzyme produced by expression from a known gene sequence?"
    if answer_yes_not_no(prompt):
        describe_expression()
    else:
        print("Are you from the last century?!") # Probably should actually cover this possibility, isolation from some biological material
    return

def collect_sequence():
    """Collects the sequence of an enzyme used"""
    pass # To be implemented

def collect_posttrans():
    """Collects full description of any post-translational modification of an enzyme used"""
    pass # To be implemented

def describe_expression_system():
    """Collects full details of the expression system used to prepare an enzyme, which will determine post-translational modification"""
    pass # To be implemented

def collect_provenance():
    """Collects necessary information on source to identify an enzyme, e.g. manufacturer, code and lot numbers"""
    pass # To be implemented

def describe_expression():
    """Collects necessary information on how an enzyme has been expressed from a construct of known sequence, recovered and possibly purified"""
    pass # To be implemented

def define_macromolecule():
    """Collects necessary information to define a macromolecule or other less well defined ingredient in a reaction mixture"""
    
    # Not the enzyme
    # May need to have separate parts for different classes, e.g. carbohydrates, proteins
    # Quite often this will concern a natural mixed material like lignocellulose - need experts on these to say how to define
    pass # To be implemented

def specify_pH_adjustment():
    """Collects details of how a pH adjustment is made"""
    pass # To be implemented

    
def define_solid_phase():
    """Collects details of how a solid phase added to the reaction is made, perhaps the biocatalyst"""
    
    # Might start by splitting to describe biocatalyst or other solid phase, but perhaps use same system for both
    # Exactly what needs to be defined depends on other features of the reaction system, so water content needed if
    # the reaction system is low water. So might need to establish this in global variable that is examined by define_solid_phase
    pass # To be implemented

def define_pure_solid():
    """Collects details to describe a fairly pure solid"""
    
    # Chemical identity, but also probably polymorph, particle size and shape in applicable cases
    pass # To be implemented

def define_gas_or_SC_phase():
    """Collects details of a gaseous or supercritical phase"""
    
    # Or would it be better to separate these
    pass # To be implemented
 


def define_liquid_or_solution(): # Define the recipe for a homogeneous liquid or solution
    """Collects details of all components with their concentrations, specifies solvent, pH if this adjusted"""
    print("\nDescribe the composition of this liquid/solution by listing all ingredients.\n")
    prompt = "Please select one option:\n"
    prompt = prompt + "0. No more ingredients\n"
    prompt = prompt + "1. Specify the main liquid/solvent if this is not the default water\n"
    prompt = prompt + "2. Specify a well defined chemical entity, usually a small molecule\n"
    prompt = prompt + "3. Specify an enzyme preparation\n" 
    prompt = prompt + "4. Specify a less well defined material, probably a macromolecule\n"
    prompt = prompt + "5. Specify a pH adjustment step - should be placed after specifying all ingredients present when adjustment is made\n"
    finished = False
    while not finished:
        answer = input(prompt)
        if answer == "0":
            finished = True
        elif answer == "1":
            define_small_molecule() # And need to put answer into solvent field in place of default water
        elif answer == "2":
            define_small_molecule()
            get_concentration_and_units("Enter the concentration of this ingredient")
        elif answer == "3":
            define_enzyme()
            get_concentration_and_units("Enter the concentration of this enzyme")
        elif answer == "4":
            define_macromolecule()
            get_concentration_and_units("Enter the concentration of this ingredient")
        elif answer == "5":
            specify_pH_adjustment()
        else:
            print("Invalid choice")

def describe_multiphase(): # Define the recipe for a multiphase reaction mixture
    """Collects details of exactly how the reaction mixture was prepared"""
    print("\nDescribe the recipe of the reaction mixture as a series of well-defined ‘Additions’.") # Needs more explanation
    prompt = "\nPlease select one option:\n"
    prompt = prompt + "0. No more additions\n"
    prompt = prompt + "1. The next addition is a pure liquid or a homogeneous liquid mixture or solution of defined composition\n"
    prompt = prompt + "2. The next addition is a solid phase, perhaps with adsorbed species, prepared by a defined procedure. Typically this will be the biocatalyst preparation.\n"
    prompt = prompt + "3. The next addition is a fairly pure solid compound\n" 
    prompt = prompt + "4. The next addition is a gas or supercritical phase of defined composition, including solute content in the supercritical case\n"
    finished = False
    while not finished:
        answer = input(prompt)
        if answer == "0":
            finished = True
        elif answer == "1":
            define_liquid_or_solution()
            get_volume_and_units("Enter the volume of this liquid addition")
        elif answer == "2":
            define_solid_phase()
            get_mass_and_units("Enter the mass of this solid phase addition")
        elif answer == "3":
            define_pure_solid()
            get_mass_and_units("Enter the mass of this solid addition")
        elif answer == "4":
            define_gas_or_SC_phase()
            # Need to think how to report quantity, perhaps different in two cases
        else:
            print("Invalid choice")
 
        
# The (very short) main program

prompt = "Is the reaction mixture definitely a single phase?  (Beware of phase separation events that might not be expected "
prompt = prompt + "or easily noticed. For example, some other liquids are infinitely miscible with water, but may separate into "
prompt = prompt + "two liquid phases in the presence of other solutes (e.g. salts can make acetonitrile-water mixtures separate "
prompt = prompt + "into two phases). Another example is the precipitation of enzymes, sometimes as micro- or nano-particles "
prompt = prompt + "that are hard to detect.)"
isonephase = answer_yes_not_no(prompt)
if isonephase:
    define_liquid_or_solution()
else:
    describe_multiphase()
        


Describe the recipe of the reaction mixture as a series of well-defined ‘Additions’.
Invalid choice

Describe the composition of this liquid/solution by listing all ingredients.

Invalid choice
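
The two menu loops above could also be driven by a table of options rather than an if/elif chain, which keeps each label next to its handler and lets the loop be exercised without anyone at the keyboard. This is only a minimal sketch: `run_menu`, its `ask` and `show` parameters and the stand-in handlers are hypothetical names, not part of the code above.

```python
def run_menu(title, options, ask=input, show=print):
    """Repeatedly offer numbered options until the user picks "0".

    options maps a choice string to a (label, handler) pair.
    ask and show default to input and print, but can be replaced in tests.
    Returns the list of valid choices made, in order.
    """
    lines = [title, "0. No more entries"]
    lines += [f"{key}. {label}" for key, (label, _) in options.items()]
    prompt = "\n".join(lines) + "\n"
    chosen = []
    while True:
        answer = ask(prompt).strip()
        if answer == "0":
            return chosen
        if answer in options:
            _, handler = options[answer]
            handler()
            chosen.append(answer)
        else:
            show("Invalid choice")


# Stand-in handlers; the real ones would be define_small_molecule() etc.
log = []
options = {
    "1": ("Specify a small molecule", lambda: log.append("small molecule")),
    "2": ("Specify an enzyme preparation", lambda: log.append("enzyme")),
}

# Scripted answers replace the keyboard: valid, invalid, valid, then stop.
scripted = iter(["2", "x", "1", "0"])
messages = []
result = run_menu("Please select one option:", options,
                  ask=lambda prompt: next(scripted), show=messages.append)
print(result)    # ['2', '1']
print(log)       # ['enzyme', 'small molecule']
print(messages)  # ['Invalid choice']
```

Passing the input function in as a parameter is a design choice, not a requirement: it means the same loop could later take its answers from a file instead of a user.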

# Running the code

To run the code below, you need to install Jan's sdRDM package. Please install it with pip:

`python -m pip install git+https://github.com/JR-1991/software-driven-rdm.git`

In [5]:
from sdRDM import DataModel, Field
from typing import List, Optional

class Enzyme(DataModel):

    name: str = Field(
        description="Name of the enzyme"
    )
    sequence: str = Field(
        description="Amino acid sequence of the enzyme"
    )
    posttrans: Optional[str] = Field(
        description="Post-translational modification of the enzyme"
    )


enzyme1 = Enzyme(
    name = "carboxylase",
    sequence = "aaaaa",
    posttrans = "glycosylation"
)

enzyme2 = Enzyme(
    name = "carboxylase",
    sequence = "aaaaa"
)

print(enzyme2.yaml())

name: carboxylase
sequence: aaaaa
__source__:
  root: Enzyme





The above code regularly needs input of information about the experiment, which might come from asking a user, or perhaps something like parsing an EnzymeML file. The input is actually obtained by the various Python functions (most of which are still to be implemented!). As far as I can see the only basic sub-program element available in Python is the function defined by def – although this is clearly much more versatile than functions in some other languages. Most of the functions are currently coded as returning nothing as a formal result. They would actually each obtain several attributes of an experiment, although perhaps these could be packaged into a Python data structure and returned as one result. But perhaps they just have to put values obtained into global variables.
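
One way a collecting function could package several attributes and return them as a single result, rather than writing to global variables, is sketched below. It assumes Python's standard `dataclasses` module; the `Provenance` class, its field names and the `ask` parameter are hypothetical, chosen to match the docstring of `collect_provenance` above.

```python
from dataclasses import dataclass


@dataclass
class Provenance:
    """What identifies a purchased enzyme preparation."""
    manufacturer: str
    product_code: str
    lot_number: str


def collect_provenance(ask=input):
    """Ask for the source details and return them as one Provenance object."""
    return Provenance(
        manufacturer=ask("Manufacturer: ").strip(),
        product_code=ask("Product code: ").strip(),
        lot_number=ask("Lot number: ").strip(),
    )


# Scripted answers stand in for a user (made-up values for illustration).
answers = iter(["Example Supplier", "E-0001", "LOT42"])
provenance = collect_provenance(ask=lambda prompt: next(answers))
print(provenance.manufacturer)  # Example Supplier
print(provenance)
```

The caller then decides where the returned object is stored, so the function itself never needs to touch anything outside its own scope.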

So my key question, on which I would like advice from those more expert in Python, is the best data structure(s) to use for the attributes that make up the metadata of a given experiment. In each case there will be a series of required attributes, selected from a larger comprehensive list of attributes that may sometimes be needed. These attribute values are of a wide variety of types, including real numbers with units, text strings, booleans (e.g. single phase?), selections from a restricted list (e.g. batch, fed-batch, continuous), integers (e.g. number of impellers), database references, chemical structure diagrams, time programs (e.g. for temperature or pH). 
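
Two of the attribute types listed above, a real number with units and a selection from a restricted list, might be represented as follows. This is a sketch using only the standard library; `Quantity`, `OperationMode`, `parse_concentration` and the list of units are assumptions for illustration, and the acceptable units would need to be agreed.

```python
from dataclasses import dataclass
from enum import Enum


class OperationMode(Enum):
    """A selection from a restricted list."""
    BATCH = "batch"
    FED_BATCH = "fed-batch"
    CONTINUOUS = "continuous"


# Illustrative only; the real list of acceptable units would need agreeing.
CONCENTRATION_UNITS = ("mol/L", "mmol/L", "g/L")


@dataclass(frozen=True)
class Quantity:
    """A real number together with its unit."""
    value: float
    unit: str


def parse_concentration(text):
    """Turn text such as '5 mmol/L' into a Quantity, rejecting unknown units."""
    number, _, unit = text.strip().partition(" ")
    unit = unit.strip()
    if unit not in CONCENTRATION_UNITS:
        raise ValueError(f"Unit {unit!r} is not one of {CONCENTRATION_UNITS}")
    return Quantity(float(number), unit)


print(parse_concentration("5 mmol/L"))   # Quantity(value=5.0, unit='mmol/L')
print(OperationMode("fed-batch"))        # OperationMode.FED_BATCH
```

A function like `get_concentration_and_units` could return such a `Quantity`, and an invalid unit or an unlisted mode raises a `ValueError` instead of being stored silently.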

Would we use some of the data types available to construct something comparable to a database record?  Would each experiment be best recorded as an instance of a Python Class? Or something else? 
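
As one possible answer, each experiment could be an instance of a class whose attributes cover the comprehensive list, with a separate check for whichever subset is required in a given case. The sketch below uses a standard-library dataclass; `ExperimentMetadata`, its fields and the `missing` method are hypothetical and cover only a handful of the attributes mentioned above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExperimentMetadata:
    """One record per experiment; attributes not yet collected stay None."""
    single_phase: Optional[bool] = None
    operation_mode: Optional[str] = None      # e.g. "batch", "fed-batch", "continuous"
    number_of_impellers: Optional[int] = None
    enzyme_name: Optional[str] = None

    def missing(self, required):
        """Return the names in `required` that have not been filled in yet."""
        known = {f.name for f in fields(self)}
        unknown = set(required) - known
        if unknown:
            raise KeyError(f"Not attributes of the record: {sorted(unknown)}")
        return [name for name in required if getattr(self, name) is None]


experiment = ExperimentMetadata(single_phase=True, enzyme_name="carboxylase")
# Which attributes are required depends on the kind of experiment.
required_here = ["single_phase", "operation_mode", "enzyme_name"]
print(experiment.missing(required_here))  # ['operation_mode']
```

The sdRDM `DataModel` classes shown earlier follow a similar pattern of typed, optionally empty fields, so the same idea may carry over there.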

I also have some worries about the use of global variables for this purpose. Of course Python allows a function to alter global variables used in other parts of the code, and it seems to me, as a beginner, that this is standard practice. But is there some convention of good practice that reduces the risk that arises with extensive code – that something in one function alters variables used elsewhere that it’s not supposed to change (according to its stated job)? You will perhaps recognise here the concern of someone brought up on Pascal, where global variables should be used only very sparingly, and the definition of a function or procedure sets out explicitly which variables it is allowed to change. Python is not restrictive like this, but to avoid problems in writing the code that calls a function, I think there has to be a very clear specification of which metadata values each function is allowed to alter, and as much protection as possible to stop it accidentally altering others.
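
One convention that may address this concern is to avoid globals altogether: pass the record into each function explicitly, make the record immutable, and have the function return an updated copy. The sketch below assumes the standard-library `dataclasses` module; `ReactionRecord`, its fields and `record_ph` are hypothetical names for illustration.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class ReactionRecord:
    """Frozen: attributes cannot be reassigned after creation."""
    ph: Optional[float] = None
    temperature_c: Optional[float] = None


def record_ph(record, ph):
    """Return a copy of the record with only the pH changed."""
    return replace(record, ph=ph)


original = ReactionRecord(temperature_c=25.0)
updated = record_ph(original, 7.5)
print(original)  # ReactionRecord(ph=None, temperature_c=25.0)
print(updated)   # ReactionRecord(ph=7.5, temperature_c=25.0)

# An accidental in-place change is refused rather than silently applied.
try:
    original.temperature_c = 37.0
except FrozenInstanceError as error:
    print("Refused:", type(error).__name__)  # Refused: FrozenInstanceError
```

With this pattern the function's signature and return value state which values it may change, which is close in spirit to the Pascal discipline described above, although Python enforces it by convention and by the frozen dataclass rather than in the language itself.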

I look forward to hearing views and recommendations on this from any Python specialists, and am happy for this notebook to be shared as appropriate.

Peter  15/11
