failing to read PDB files generated by VMD #48

Open
lmiq opened this issue May 24, 2024 · 60 comments
Comments

@lmiq
Contributor

lmiq commented May 24, 2024

When trying to read this file (and other similar ones), I get:

julia> read("vmd.pdb", BioStructures.PDB)
ERROR: Two copies of the same atom have the same alternative location ID. Existing atom:
Atom N with serial 1, coordinates [-41.156, -40.019, 21.411]
New atom record to add:
AtomRecord N with serial 1501, coordinates [-42.523, -41.722, -19.881]

This is one example VMD pdb file.

vmd.pdb.zip

@jgreener64
Member

The offending PDB lines are

ATOM      1  N   POPE    1     -41.156 -40.019  21.411  1.00  0.00      L11  N

and

ATOM   1501  N   POPE    1     -42.523 -41.722 -19.881  1.00  0.00      L21  N

which both try to put an atom with name N onto residue 1 of the chain with empty chain ID.

We treat this defensively since there are no or very few cases in the PDB that have such duplicate atoms. I reported a handful of cases to the PDB a while ago and they updated the records, so I believe it is considered a format violation.

In this case you could use different chain IDs (the empty column 22) or keep incrementing the residue number.
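To make the collision concrete, here is a minimal sketch (not BioStructures code) that reads the fixed-width columns of the two records and builds the chain / residue number / atom name key that each one claims. The helper name `hierarchy_key` is hypothetical; the column ranges are those of the legacy PDB layout.

```julia
# Hypothetical helper: the "slot" an ATOM record claims in a
# chain -> residue -> atom hierarchy, read from fixed-width columns.
function hierarchy_key(line::AbstractString)
    chain  = String(line[22:22])           # chain ID (blank in this file)
    resnum = String(strip(line[23:26]))    # residue number
    atom   = String(strip(line[13:16]))    # atom name
    return (chain, resnum, atom)
end

lines = [
    "ATOM      1  N   POPE    1     -41.156 -40.019  21.411  1.00  0.00      L11  N",
    "ATOM   1501  N   POPE    1     -42.523 -41.722 -19.881  1.00  0.00      L21  N",
]

keys_seen  = Set{NTuple{3,String}}()
collisions = String[]                      # records whose slot is already taken
for line in lines
    key = hierarchy_key(line)
    if key in keys_seen
        push!(collisions, line)
    else
        push!(keys_seen, key)
    end
end

# The segment ID (columns 73-76) is the only label that differs.
segments = [String(strip(line[73:76])) for line in lines]
```

Both records map to the same key, `(" ", "1", "N")`, so the second one lands in `collisions`; only the segment field (`L11` vs `L21`) tells them apart.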

@timholy
Contributor

timholy commented May 24, 2024

Since this file was written by VMD, I assume this should be reported to them? @lmiq, can you do that?

@lmiq
Contributor Author

lmiq commented May 24, 2024

Well, there is nothing wrong with that file for the use it was designed for, so I do not think there is anything to report upstream. The residues in that file are parsed simply by incrementing the residue counter whenever the residue index changes (up or down). Limits on the number of residues per chain, on the number of chains, and so on cannot matter for PDB files used in MD simulations.

This is the choice to be made here: whether or not to support parsing non-standard files to some degree.

I perfectly understand if BioStructures keeps adhering strictly to the standard, but that would prevent us from using it in the broader context of MD simulations.

Note that this limitation, specifically, comes from trying to read the data into the hierarchical structure, so in some sense it is related to that initial choice of representation.

Also, note the L11 and L21 segment identifiers. These are used as the "top" level of the hierarchy in VMD and some simulation packages, but are rarely used when reporting experimental structures.

@timholy
Contributor

timholy commented May 24, 2024

If it's true that it violates https://www.wwpdb.org/documentation/file-format (I haven't checked), I'd say that's clearly a problem. It's not OK that there's a workaround. It introduces ambiguities and issues, precisely as being discussed here.

But if the format is ambiguous, then that's another issue entirely.

@jgreener64
Member

jgreener64 commented May 24, 2024

It's certainly challenging to balance the considerations.

Note that this limitation, specifically, is associated to trying to read the data into the hierarchical structure, so in some sense this is related to that initial choice of representation.

This is the key: BioStructures guarantees that every atom has "meaning" about what it is (i.e. which residue and chain it is on), which is useful in many contexts (such as file interchange) but in turn necessitates some complexity in representation and strictness in parsing.

The only way the original case could work currently is if the existing atom is overwritten (or the new atom ignored) with a warning, which is unlikely to be desired behaviour even if it runs without error.

I guess philosophically the aim of BioStructures is not so much to read structural files as to represent unambiguously the molecules within them. This affects other design decisions, such as why countatoms will treat disordered versions of the same atom as one (unless expand_disordered is used).

But if the format is ambiguous, then that's another issue entirely.

Sadly, this seems to be the case. There is a lot of documentation on the column format, but as far as I know not so much on the row format (e.g. what duplicates mean). In general, though, I prefer it when other tool authors use MODEL blocks to distinguish conformations in an MD trajectory. That works well in BioStructures.
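As a rough illustration of why MODEL blocks avoid the duplicate problem, the sketch below groups coordinate records by MODEL/ENDMDL so that the same atom can appear once per conformation without clashing. The function name `split_models` and the two toy records are assumptions made for the example; this is not how BioStructures parses files.

```julia
# Group ATOM/HETATM records by MODEL ... ENDMDL block (sketch, no validation).
# Records outside any MODEL block are ignored by this simplified version.
function split_models(lines::Vector{String})
    models  = Vector{Vector{String}}()
    current = String[]
    for line in lines
        if startswith(line, "MODEL")
            current = String[]             # start a fresh conformation
        elseif startswith(line, "ENDMDL")
            push!(models, current)         # close the conformation
        elseif startswith(line, "ATOM") || startswith(line, "HETATM")
            push!(current, line)
        end
    end
    return models
end

# Toy input: the same water oxygen in two frames.
frames = split_models([
    "MODEL        1",
    "ATOM      1  O   HOH A   1     -27.550   4.020 -27.318  1.00  0.00",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  O   HOH A   1     -19.119   6.215  44.984  1.00  0.00",
    "ENDMDL",
])
```

Each frame holds one record with the same chain, residue number and atom name, and there is no conflict because the records live in different models.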

@lmiq
Contributor Author

lmiq commented May 24, 2024

From the description, that file does not adhere to the standard because:

- Non-blank alphanumerical character is used for chain identifier.

and there the chain identifier is blank. But the standard does not say anything about two residues in the same chain having the same number (which is actually the issue here). For actual bio-structures it is just assumed that that does not make sense. But when part of the system is thousands of water molecules, that is just unnecessarily limiting.

@timholy
Contributor

timholy commented May 24, 2024

Given that the format is ambiguous, then perhaps we do need to deal with ambiguity. I don't think this should be open-ended: it's kind of a disaster to take the strategy "can I find something that seems to make sense in column 5? No? OK, let's see if things seem more sensible if I substitute the value in column 11? No? OK, maybe I can infer a good value from column 13?" But if instead one passes flavor=:vmd to the reader, and that follows strict rules, things are more sensible.

@timholy
Contributor

timholy commented May 24, 2024

when part of the system is thousands of water molecules

is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead?

@lmiq
Contributor Author

lmiq commented May 24, 2024

As a minimal example:

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     2     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     2     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     2     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI 

VMD, Pymol, Packmol, and other common simulation packages interpret this without problems. There are three water molecules there.

The issue is that this does not fit into a hierarchical structure unless the reader creates chains arbitrarily. That would come with a bunch of problems as well.

is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead?

Well, in some sense yes, the PDB format is problematic. But that ship sailed decades ago in the MD field. There are other formats that are more appropriate, particularly given the limitations of the coordinate fields, but the legacy PDB format is still widely used because of its readability. This is something we (MD simulation package authors) just have to live with.

PS: MIToS reads this if one adds the occupancy and b-factor fields (added now), because its hierarchy is at the residue level only, and it simply does not care about the "meaning" of two residues with identical numbers in the same chain.

@timholy
Contributor

timholy commented May 24, 2024

So maybe we need a flavor=:md mode or something? @jgreener64 would that be acceptable?

@jgreener64
Member

Yes, but the question still remains about how to put the two atoms into the same hierarchical structure.

We could assign an unused residue number or chain ID, but that feels clunky and runs into problems if the assigned value is found later during reading. We could have another data type that stores multiple coordinates for the same atom, but that raises concerns about how to write the file back out in the same order. We could assign duplicates into a new Model, which would mimic MODEL/ENDMDL tags but doesn't fit the case of many waters in an MD file.

@lmiq
Contributor Author

lmiq commented May 24, 2024

Maybe the flavor should be the type of hierarchy desired: :segment, :chain, :residue (MIToS now), :atom (PDBTools) or whatever. I don't know, however, if that would make most of the other functionality of BioStructures fail to interoperate now, and thus the package very complex.

@timholy
Contributor

timholy commented May 24, 2024

I guess my thought is that we decide on concrete flavor-specific behavior and document it. Passing a flavor might require that the entire file be scanned first to identify "occupied" names/indices/whatever, which would be a bit of a performance hit, but perhaps that is the price of abusing a file format.

@lmiq
Contributor Author

lmiq commented May 24, 2024

I think it is a little more complicated than that. If we keep the hierarchy we are forced to attribute some of the fields arbitrarily. That will cause confusion.

@timholy
Contributor

timholy commented May 24, 2024

Isn't the hierarchy about how you assign bonds? Meaning, you have implicit hierarchy any time you want to link entries in rows together.

@jgreener64
Member

The issue is that multiple atoms want the same "spot" in the hierarchy, because the file contains multiple objects which have the same name. So you either have to discard one atom, have multiple versions of the hierarchy, or store duplicate atoms in that spot (beyond disorder, which we already do).

We could have a flag renumber_residues when reading that ignores the residue numbering in the file, and increments the assigned residue number every time the residue number changes. This would read cases like

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     2     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     2     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     2     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI 

as

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     2     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     2     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     2     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     3      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     3      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     3      13.304   6.973  29.226  0.00  0.00      HATI 

but would still error on

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     1     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     1     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     1     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI 

Renumbering like this is a useful feature anyway that is available in a number of PDB packages.
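The renumbering rule can be sketched in a few lines, assuming the residue-number column has already been parsed into integers. The function name `renumber` is hypothetical and this is not the eventual BioStructures implementation.

```julia
# Ignore the residue numbers in the file and assign a new number
# every time the number on a record differs from the previous record.
function renumber(resnums::Vector{Int})
    assigned = Int[]
    counter  = 0
    previous = nothing
    for r in resnums
        if r != previous
            counter += 1
        end
        push!(assigned, counter)
        previous = r
    end
    return assigned
end

first_case  = renumber([1, 1, 1, 2, 2, 2, 1, 1, 1])   # three waters numbered 1, 2, 1
second_case = renumber(fill(1, 9))                    # three waters all numbered 1
```

`first_case` is `[1, 1, 1, 2, 2, 2, 3, 3, 3]`, matching the renumbered listing above. `second_case` stays all ones, so the nine atoms still end up in a single residue and the duplicate atom names collide.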

@lmiq
Contributor Author

lmiq commented May 24, 2024

I'm not sure I understand your comment. In MD files, I think it is safe to assume that the "residue" is the object par excellence. But the "residue number", "name", "chain", or any other property is just a meaningless label that helps one draw or select different parts of the structure.

Thus, if there were an option to flatten the hierarchy so that "residues" sit at the top level, the MD files could be parsed without major issues (as MIToS does now; PDBTools parses at the atom level but provides an iterator over residues).

What I can't say is whether parsing the file into a different hierarchy would require all the other functions of BioStructures to gain additional methods, effectively creating a parallel package within it.

By the way, I noticed now that:

julia> pdb = read("./test.pdb", BioStructures.PDB)
ProteinStructure test.pdb with 1 models, 2 chains (A, ), 284 residues, 2078 atoms

julia> typeof(pdb)
BioStructures.ProteinStructure

that in itself would not fit MD files well, as many of them do not contain any protein whatsoever.

@lmiq
Contributor Author

lmiq commented May 24, 2024

We could have a flag renumber_residues when reading that ignores the residue numbering in the file,

That would help, but I think it would still fall short, as the hierarchy on chains is not really meaningful in this context.

Concerning this:

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     1     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     1     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     1     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI 

That specifically is more complicated. VMD and Pymol recognize that only because they compute the connectivity from the distances and deduce that there are 3 residues there. That I think is beyond what is expected for a reading package. Also, Amber, for instance, just requires TER to be added after each molecule, and ignores everything else.
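For comparison, the Amber-style convention is easy to sketch: if every molecule is closed by a TER record, a reader can count molecules without looking at any label or distance. The function name `molecule_index` and the shortened records are assumptions made for the example.

```julia
# Assign a molecule counter to each ATOM/HETATM record,
# starting a new molecule after every TER record.
function molecule_index(lines::Vector{String})
    index   = Int[]
    current = 1
    for line in lines
        if startswith(line, "TER")
            current += 1
        elseif startswith(line, "ATOM") || startswith(line, "HETATM")
            push!(index, current)
        end
    end
    return index
end

# Two waters that both carry residue number 1 (records shortened for brevity).
mols = molecule_index([
    "ATOM   4247  H   HOH     1", "ATOM   4248  H   HOH     1", "ATOM   4249  O   HOH     1",
    "TER",
    "ATOM   4250  H   HOH     1", "ATOM   4251  H   HOH     1", "ATOM   4252  O   HOH     1",
    "TER",
])
```

Here `mols` is `[1, 1, 1, 2, 2, 2]` even though every record has the same residue number.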

@jgreener64
Member

as many of them do not have any protein whatsoever

The name ProteinStructure is an early design decision I have long regretted. Perhaps if we do a breaking release for #49 we can rename it to MolecularStructure or something. Hopefully it shouldn't break too much code, and it should be easy enough to search and make PRs for in other repos.

@timholy
Contributor

timholy commented May 24, 2024

But the "residue number", or "name", "chain", or any other property, are just meaningless labels that help one to draw or select different parts of the structure...
Thus, if there was an option to just flatten the hierarchy to have "residues" at the top level, the MD files could be parsed without major issues

I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right? That's a valid perspective if either (1) you don't actually care about bonds, or (2) there's an external program to assign bonds. I'd guess that few people are actually in camp 1, but camp 2 is presumably common. Is there any ambiguity, though, about assigning bonds simply by distance? If so, then the reader is still partially responsible for indicating hierarchy, like via the TER that Amber requires.

So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution?

@lmiq
Contributor Author

lmiq commented May 24, 2024

Bonds (or more generally the topology of the molecules) are defined in different files in MD simulations. In the PDB files they can be written with the CONECT keyword, at the end, but most commonly (never? not sure) these are not used.

There are certainly problems in assigning bonds based on distances. Sometimes atoms from different molecules are too close, and that causes errors. VMD throws errors/warnings in these cases, when it detects that some atom appears to have more bonds than allowed. Assigning bonds based on distances is useful, but it is only a workaround for when the topology files are not provided.

I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right?

You could still select one individually by a residue counter (which may or may not match the "residue number" written in the PDB file), exactly as happens when such files are read by MIToS, where the residue counter is just the index of that residue in the residue array that results from reading.

So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution?

The packages dissociate the residue and atom counts from the "residue number" and "atom index" as written in the PDB files. What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.
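A minimal sketch of that rule, assuming each record has already been reduced to its four residue-level labels (the named-tuple layout and the name `residue_counter` are illustrative, not the actual PDBTools or MIToS code):

```julia
# Increment the residue counter whenever any residue-level label
# (number, chain, name, segment) differs from the previous record.
function residue_counter(records)
    counters = Int[]
    n = 0
    for (i, rec) in enumerate(records)
        if i == 1 || rec != records[i - 1]
            n += 1
        end
        push!(counters, n)
    end
    return counters
end

# The three waters from the minimal example: numbers 1, 2, 1 in segment HATI.
waters = [(resnum = r, chain = "", resname = "HOH", segment = "HATI")
          for r in [1, 1, 1, 2, 2, 2, 1, 1, 1]]

# The two lipid records from the original report: same number, different segment.
lipids = [(resnum = 1, chain = "", resname = "POPE", segment = "L11"),
          (resnum = 1, chain = "", resname = "POPE", segment = "L21")]

water_counters = residue_counter(waters)
lipid_counters = residue_counter(lipids)
```

`water_counters` is `[1, 1, 1, 2, 2, 2, 3, 3, 3]` and `lipid_counters` is `[1, 2]`, while the `resnum` labels in the records are left exactly as written in the file.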

@timholy
Contributor

timholy commented May 24, 2024

I'd favor strict-by-default and requiring that you pass renumber=true to activate that mode, but renumbering seems like a sensible solution.

@timholy
Contributor

timholy commented May 24, 2024

is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.

Just to check, the case in #48 (comment) doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal?

@jgreener64
Member

jgreener64 commented May 24, 2024

What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.

I'd favor strict-by-default and requiring that you pass renumber=true to activate that mode, but renumbering seems like a sensible solution.

This is what I had in mind with renumber_residues, I can look into making the change. Having different molecules in the same chain is quite common in the PDB, e.g. for waters.

Note, this would mean you can't write the original PDB file back out as we would discard the residue numbers in the file.

@lmiq
Contributor Author

lmiq commented May 24, 2024

This is what I had in mind with renumber_residues, I can look into making the change. Having different molecules in the same chain is quite common in the PDB, e.g. for waters.

Uhm... I don't think residue renumbering is the correct approach. We do not want residue numbers to be changed when reading protein residues, for example.

What is necessary is to dissociate the residue counter from the residue number as written in the PDB file. (And I would say also dissociate the atom counter from the atom index as written in the PDB file). Note: VMD has different attributes for each: resnum and resid. In PDBTools I also have that, and also added pdb_index for the atom index as written in the PDB file.

I see now that the residues are stored in a Dict in BioStructures, which makes things slightly harder. If it was an ordered Dict, the index of the residue in the ordered Dict would be the residue counter.

I wonder if having the residues of each chain in a vector wouldn't be more appropriate. I'm not very comfortable with the order of the residues in the file being meaningless. For instance, this is how the residues of 6hn6 appear:

julia> pdb[1].chains["A"].residues
Dict{String, BioStructures.AbstractResidue} with 282 entries:
  "407" => Residue 407:A with name LYS, 9 atoms
  "371" => Residue 371:A with name ARG, 11 atoms
  "447" => Residue 447:A with name ILE, 8 atoms
  "335" => Residue 335:A with name ASN, 8 atoms

not having them in order is somewhat strange.

@lmiq
Contributor Author

lmiq commented May 24, 2024

Just changing the dicts to OrderedDict provides:

julia> pdb[1].chains["A"].residues
OrderedDict{String, AbstractResidue} with 282 entries:
  "225" => Residue 225:A with name SER, 6 atoms
  "226" => Residue 226:A with name ALA, 5 atoms
  "227" => Residue 227:A with name ASN, 8 atoms
  "228" => Residue 228:A with name GLU, 9 atoms

which then, if read with the "new residue approach", would implicitly store both the counter and the residue number. But I would probably prefer having a ResidueVector type with an overloaded getindex function to retrieve residue["225"] when appropriate.

Or having to explicitly use something like residue[ResID(225)], as having the residue number as a string is already not ideal.
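A rough sketch of what such a container could look like, with positional indexing for the counter and a wrapper type for the file's residue number. All type names here (`ResID`, `SketchResidue`, `ResidueVector`) are hypothetical stand-ins, not existing BioStructures types, and returning every match for a repeated number is just one possible choice.

```julia
struct ResID
    number::Int                 # residue number as written in the file
end

struct SketchResidue
    resnum::Int
    name::String
end

struct ResidueVector
    residues::Vector{SketchResidue}
end

# Positional access: the residue counter is simply the index in the vector.
Base.getindex(rv::ResidueVector, i::Int) = rv.residues[i]

# Label access: return every residue carrying that number,
# since MD files may repeat residue numbers.
Base.getindex(rv::ResidueVector, id::ResID) =
    [r for r in rv.residues if r.resnum == id.number]

rv = ResidueVector([SketchResidue(225, "SER"),
                    SketchResidue(226, "ALA"),
                    SketchResidue(225, "HOH")])

first_by_counter = rv[1]            # counter-based lookup
all_with_number  = rv[ResID(225)]   # label-based lookup, may return several
```

With this layout `rv[1]` is the SER residue, while `rv[ResID(225)]` returns both residues numbered 225.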

Just to check, the case in #48 (comment) doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal?

@timholy sorry, missed this comment. Yes, that case I think we can leave out. Maybe support TER after each molecule as a workaround, as an important MD package (Amber) uses that, but it would be very specific (I had to support that in Packmol, nevertheless).

@jgreener64
Member

Am I understanding right that the key in the OrderedDict case would be the residue number from the file, and the ordering is based on the residue counter? How would this work in the case of multiple residues with the same residue number in the file (since that would be multiple values with the same key)?

I guess this could be solved with a vector type, but then what should be returned when you index into it with "225", considering that there might be multiple residues with that residue number in the file?

I'm not very comfortable with the order of the residues in the file being meaningless

The order is not stored directly in the object, but the residues are sorted when writing out. This means that different file representations of the same underlying molecule get written out to the same file, but you do lose the ordering of the input file. In the usual case of ascending residue numbers in the input file (possibly with gaps), this is preserved in the output file.

@timholy
Contributor

timholy commented May 24, 2024

We do not want residue numbers to be changed when reading protein residues, for example.

I think @jgreener64's point is that this admirable goal makes things harder---if you only increment the counter when something is in conflict, what happens if a later structure in the file uses the number you "stole" in order to do the increment? Two possible solutions are (1) to parse the file twice to identify all conflicts in advance, or (2) transiently distinguish "unconflicted" ids and "conflicted" ids and then assign final ids after all unconflicted ones have been assigned. While I proposed (1) above, maybe (2) is the better choice.
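Option (2) could look roughly like the sketch below, which works on one residue number per block of records. Blocks whose number is free keep it; conflicting blocks are deferred and only receive fresh numbers once every unconflicted number is known. The function name `assign_ids` and the choice of "one past the largest number seen" for fresh ids are assumptions for illustration only.

```julia
# blocks[i] is the residue number written in the file for the i-th residue block.
function assign_ids(blocks::Vector{Int})
    final    = Vector{Int}(undef, length(blocks))
    used     = Set{Int}()
    deferred = Int[]                      # indices of conflicting blocks
    for (i, n) in enumerate(blocks)
        if n in used
            push!(deferred, i)            # conflict: decide later
        else
            push!(used, n)
            final[i] = n                  # unconflicted: keep the file's number
        end
    end
    # Fresh ids start above every number that appeared in the file.
    fresh = isempty(used) ? 1 : maximum(used) + 1
    for i in deferred
        final[i] = fresh
        fresh += 1
    end
    return final
end

ids = assign_ids([1, 2, 1, 3])
```

Here `ids` is `[1, 2, 4, 3]`: the second block numbered 1 is moved to 4 instead of "stealing" 3, which a later block in the file legitimately uses.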

I see now that the residues are stored in a Dict in BioStructures, which makes things slightly harder. If it was an ordered Dict, the index of the residue in the ordered Dict would be the residue counter.

I've learned that there's a separate field that gives the order of the keys, and you can access the dict with integer indexing as dict[keyorder[i]]. So effectively there is an OrderedDict even if it isn't represented that way. I'm not sure I fully understand how disordered locations (when you have more than one location for a residue or atom) interact that way, though, which I think is the main purpose behind using strings as residue keys.
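The pattern amounts to the following sketch, where a plain `Dict` is paired with a vector recording the order of its keys. The variable names and the toy contents are illustrative; the actual field names in BioStructures are not shown in this thread.

```julia
# An unordered Dict plus a separate vector that remembers key order.
residues = Dict("225" => "SER", "226" => "ALA", "227" => "ASN")
keyorder = ["225", "226", "227"]

# Integer indexing goes through the order vector.
nth(i) = residues[keyorder[i]]

in_order = [nth(i) for i in eachindex(keyorder)]
```

`in_order` is `["SER", "ALA", "ASN"]` regardless of how the `Dict` happens to iterate.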

@timholy...yes, that case I think we can leave out.

👍

@lmiq
Contributor Author

lmiq commented May 24, 2024

I think I need to take a step back and explain how we use the data here.

From our perspective, all fields of the PDB are just labels. Apart from the data of the fields, in the process of reading the file, one annotates, incrementally, an independent residue counter and an independent atom index counter.

At the end, we have a vector of atoms where each atom carries all that information:

julia> pdb = PDBTools.readPDB("/home/leandro/Downloads/6hn6.pdb")
   Array{Atoms,1} with 2090 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     SER     A      225        1   43.004   80.351   76.389  1.00 118.26     1       -         1
       2   CA     SER     A      225        1   42.216   81.220   75.509  1.00 116.77     1       -         2
       3    C     SER     A      225        1   42.952   81.518   74.177  1.00 118.83     1       -         3
                                                        
    2088    O     HOH     A      638      306   23.920   74.193   77.532  1.00 55.18     1       -      2089
    2089    O     HOH     A      639      307    6.395   85.305   51.528  1.00 104.62     1       -      2090
    2090    O     HOH     A      640      308   19.024   61.159   68.618  1.00 64.07     1       -      2091

There, for instance, index is the incremental atom counter, and residue is the incremental residue counter, while index_pdb is the content of the index field in the PDB file (rarely used) and resnum is the content of the residue number field in the PDB file.

Then, working with such a data structure consists in applying filters, which return vectors of atoms, or indexes:

julia> filter(sel"resnum < 300", pdb)
   Array{Atoms,1} with 587 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     SER     A      225        1   43.004   80.351   76.389  1.00 118.26     1       -         1
       2   CA     SER     A      225        1   42.216   81.220   75.509  1.00 116.77     1       -         2
       3    C     SER     A      225        1   42.952   81.518   74.177  1.00 118.83     1       -         3
                                                        
     585  CG1     ILE     A      299       75   23.370   60.986   68.597  1.00 44.83     1       -       585
     586  CG2     ILE     A      299       75   23.842   62.784   70.389  1.00 46.63     1       -       586
     587  CD1     ILE     A      299       75   24.450   60.003   68.979  1.00 49.51     1       -       587

julia> findall(sel"residue = 1", pdb)
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

We do not need these filtering operations to be particularly fast, or lazy. But we do need to be able to select subsets of the structure with great versatility, potentially using incremental or pdb-defined residue numbers (very common), or any other atom property.

That data structure doesn't prevent us from iterating over residues, with an appropriate iterator:

julia> for res in eachresidue(filter(sel"residue <= 2", pdb))
           println(resname(res)," ",resnum(res))
       end
SER 225
ALA 226

and we could (but haven't yet) define similar iterators for models, chains, etc., of course. We just don't use those as often.

This way of dealing with the data places no restrictions on duplicate residues, residue numbers, etc. If two residues have the same data, they will just be filtered together, as we would expect them to be. We can still differentiate them by their incremental residue counter and incremental atom indices. There are no conflicts when reading the data, and no special issues associated with repeated fields.
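The behaviour with repeated labels can be sketched with a plain vector of structs. The `FlatAtom` type below is a hypothetical stand-in, not the PDBTools atom type, and ordinary `filter` is used in place of the `sel"..."` selection syntax.

```julia
struct FlatAtom
    index::Int       # incremental atom counter
    name::String
    resname::String
    resnum::Int      # residue number as written in the file
    residue::Int     # incremental residue counter
end

# Three waters whose file residue numbers are 1, 2, 1.
atom_names = ["H", "H", "O"]
file_nums  = [1, 2, 1]
atoms = [FlatAtom(3 * (r - 1) + k, atom_names[k], "HOH", file_nums[r], r) for r in 1:3 for k in 1:3]

by_label   = filter(a -> a.resnum == 1, atoms)    # both waters labelled 1
by_counter = filter(a -> a.residue == 3, atoms)   # only the third water
```

Selecting on `resnum == 1` returns six atoms (two waters), while selecting on the counter `residue == 3` returns just the three atoms of the last water.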

This way of storing and using the data is convenient for us. And I think that we must understand why or how it is not convenient for other people, to justify having different data structures, and to what extent.

For us, using a syntax like pdb["A"].residues["247"].atoms[1] is not very useful nor practical, since most of the time we are selecting subsets of atoms that may belong to different chains, residues, or molecules, like element O and water, or protein and backbone. Having the underlying data in a tree does not really help much.

@timholy
Contributor

timholy commented May 24, 2024

We've also gotten a lot of benefit from random-access indexing. For example we might pick out all the positively-charged residues and examine their spatial distribution. We also compute displacement vectors from the alpha carbon to the side chain center-of-mass to determine whether residues in a 7TM are "interior" or "exterior." And so on.

I don't think the pdb["A"] portion of this will get in our way, but indeed when it gets down to the residue level we need utilities that seem missing from BioStructures. I'm optimistic all that is resolvable but it may take some careful design. And I agree that it can't successfully be a platform for the whole community unless we find ways to make that kind of thing easy.

That said, I think BioStructures is doing something important: it focuses on structural representation rather than file format, and it can get multiple file formats into that same representation. It also seems to take the complexities of that process seriously. It's why I'm prepared to do a fair amount of work to port our own code over to BioStructures, if the technical issues (convenience, comprehensiveness, speed, heaviness of dependency, etc) can all be resolved. I for one am quite optimistic that this is the case. And if not, I think the goal is worthy even if we have to have a fresh start. Fundamentally, the current fragmentation of the representations for structures is very detrimental to the long-term health of the Julia ecosystem for questions that relate to structural biology.

I have a conference deadline in a few weeks, so I am not sure how much I can do between now and then, but this is something I'm willing to put some work into to help make it happen.

@timholy
Contributor

timholy commented May 25, 2024

What I meant from the previous post is that we should focus first in what is the utility of the packages before deciding the underlying format, and I wanted to understand if struct["A"].residue["274"].atoms[1] is something useful

Overall, this discussion seems a bit like a struct-vs-DataFrame debate. If you have a set of objects with containerization relationships, then x.y[2].z[3] is equivalent to the DataFrames-world x[:y => ==(2), :z => ==(3)]. Both are viable ways to get your work done. Some things might seem easier or more natural depending on

  • what the operation is
  • what your previous experience is
  • what utility functions are available

The last, of course, is something that can be enhanced any time there is an unmet need. But I don't think that all users would agree that x[:y => ==(2), :z => ==(3)] is a lot more natural than x.y[2].z[3].
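To make the equivalence concrete, here is a minimal sketch in plain Julia with no packages. The types `ToyAtom`, `ToyResidue`, and `ToyChain` and the flat rows are hypothetical; they are not the API of BioStructures, PDBTools, or MIToS.

```julia
# Hypothetical toy types, for illustration only.
struct ToyAtom
    name::String
end
struct ToyResidue
    atoms::Vector{ToyAtom}
end
struct ToyChain
    residues::Vector{ToyResidue}
end

# Hierarchical view: one chain with two residues.
toy_chain = ToyChain([
    ToyResidue([ToyAtom("N"), ToyAtom("CA")]),
    ToyResidue([ToyAtom("N"), ToyAtom("CA"), ToyAtom("CB")]),
])

# Flat view of the same data: one row per atom, with the container
# relationships stored as columns.
flat = [(residue = r, index = i, name = atom.name)
        for (r, res) in enumerate(toy_chain.residues)
        for (i, atom) in enumerate(res.atoms)]

# Both queries select the third atom of the second residue.
hier_hit = toy_chain.residues[2].atoms[3]
flat_hit = only(filter(row -> row.residue == 2 && row.index == 3, flat))
```

The hierarchical query follows two indices directly, while the flat query scans every row; which one reads more naturally is largely a matter of habit.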

I think this analogy is useful because it highlights that the three packages seem to pick different levels for their default representation:

  • BioStructures: fully embraces the struct perspective
  • PDBTools: fully embraces the DataFrame perspective, as it flattens everything down to the level of atoms and uses "columns" to indicate containerization relationships.
  • MIToS: hybrid. Selects the chain-level at reading time, and thus you only have one chain in a single variable. It then flattens DataFrame-like at the level of residues, but each residue is a struct.

This diversity highlights the fact that this choice is a bit arbitrary. To me that's a strong argument that there are many ways we can all get our work done, and that the most important thing to do is unify around a single representation, regardless of whether it's hierarchical or flat. We need a standard representation to build a large, high-quality stack in this space.

to justify the possible limitations of the representation that such a data structure brings

I'd propose that we distinguish the discussion about "hierarchical or flat" from the "strict vs permissive" discussion. While a Dict can only associate one value per key, if you really need to represent "both values" assigned to 'a' in

Dict('a' => 1, 'a' => 2)

you can always use

Dict('a' => [1, 2])

as long as the rest of the code knows that the value field should be interpreted as a container of values.
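As a small illustration in plain Julia, the sketch below builds such a container-valued `Dict` from a list of records with repeated keys. The `records` data are made up for the example.

```julia
# A literal with a repeated key keeps only one value per key.
d = Dict('a' => 1, 'a' => 2)

# Storing a vector per key keeps every value that shares the key.
records = ['a' => 1, 'b' => 5, 'a' => 2]
multi = Dict{Char,Vector{Int}}()
for (key, value) in records
    # get! creates the empty vector the first time a key is seen
    push!(get!(() -> Int[], multi, key), value)
end
```

After the loop, `multi['a']` holds both values in the order they were read, so file order is preserved inside each key.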

@lmiq
Contributor Author

lmiq commented May 25, 2024

  • fully embraces the struct perspective

Sort of, because PDBs have also "model" and "segment" identifiers. There is a lot of arbitrariness in the way the PDB format classifies parts of the structures. "chain" is a synonym of "molecule" from a chemical point of view, but if that was taken to heart then water molecules should split on different chains.

IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level. A "molecule" is effectively meaningful, but even the content of the molecule is subject to ambiguities (as multiple conformations or incomplete atomic positions).

All the remaining information that a PDB (or other format) contains are just arbitrary annotations, which I think can only be represented generally enough if at the atom level with custom fields.

If we had an underlying molecular structure format that represented molecules where atoms have optional fields, that could effectively be useful for interoperability of the various molecular-structure packages. Anticipating the possibility of storing the topology of the molecule in the same format is probably a must.

ps: The only real constraint we have on the underlying format is concerning performance for specific operations. I think the choice of one specific format could be limited in view if we identify an important and common operation that needs to be very fast on very large data sets. But probably in those cases intermediate representations are needed anyway.

@timholy
Contributor

timholy commented May 25, 2024

CC @anton083. This discussion started on Slack (I invited @murrellb but a search didn't turn up a user handle for you), but most of the content is here now. The bottom line is that I'm trying to build momentum around the idea that currently we have 3 main packages for reading PDBs and that we really should settle on just one. This productive discussion is mostly about identifying any barriers that currently prevent this, and what can be done to fix them.

@timholy
Contributor

timholy commented May 25, 2024

if that was taken to heart then water molecules should split on different chains.

Personally that seems like a good idea to me, if the one-character limit of PDB files weren't a constraint. Of course you'd want tools to "find/discard all waters" but that's basically sugar.

IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level. A "molecule" is effectively meaningful, but even the content of the molecule is subject to ambiguities (as multiple conformations or incomplete atomic positions).

Conceptually there's a lot of value in that idea---you're right that Residue is really awkward for things like, e.g., ligands which are unrelated to amino acids. I guess the only real value of Residue (or an equivalent "column in the dataframe") is that a flat list of atoms without annotation does not allow you to infer bonds, whereas a list of Residues does let you implicitly encode bonds since (1) proteins are (mostly) made up of specific building blocks, and (2) these blocks are connected linearly. That fails, obviously, for things that aren't proteins. So I agree that CO₂ (a non-AA chain) might be more fairly represented as

3×4 DataFrame
 Row │ id     symbol  residue  chain
     │ Int64  String  Missing  Int64
─────┼───────────────────────────────
   1 │     1  C       missing      1
   2 │     2  O       missing      1
   3 │     3  O       missing      1

than via

struct Chain
    residues::Vector{Residue}
end

struct Residue
    atoms::Vector{Atom}
end

For a hierarchical representation, we could fix that with

struct Chain
    residues::Union{Vector{Residue},Vector{Atom}}
end

or

struct Chain
    residues::Union{Vector{Residue}, Nothing}   # specify either (not both!)
    atoms::Union{Vector{Atom}, Nothing}
end

but presumably that would be viable.

The only real constraint we have on the underlying format is concerning performance for specific operations.

This is one place where hierarchical representations have advantages. If you have a completely flat representation, say with 20 chains, each with 300 residues, each with ~8 atoms, then any time you select anything from the "dataframe" you need to traverse 20 * 300 * 8 = 48000 rows. Whereas with a hierarchical representation, if you only need "chain D" then you immediately reduce down to 300 residues regardless of whether any other efficiencies are possible. Of course, hybrid strategies achieve hybrid performance.

As an example, one thing we do in our analyses all the time is compute the center-of-mass of each residue side-chain. This is easy and efficient with a hierarchical representation. With a flat representation of atoms, you'd need the equivalent of DataFrames' groupby to make that efficient---it's an O(n) operation (n = number of atoms) rather than the naive O(n*m) (m = number of residues) if you did each residue as a separate query on the full list of atoms.

@lmiq
Contributor Author

lmiq commented May 25, 2024

Personally that seems like a good idea to me, if the one-character limit of PDB files weren't a constraint.

The problem here is that this is not up to us to decide. There are already PDB files in use in which multiple molecules belong to the same chain.

For generality, I would rather use

struct Molecule
    atoms::Vector{Atom}
    bonds::Vector{Tuple{Int,Int}}
end

where Atom is a data structure with a dictionary of potentially custom fields, including chain identifiers, etc. I would then rely on higher-level functions to deal with all possible variations and subtleties of how these fields can be used.

As an example, one thing we do in our analyses all the time is compute the center-of-mass of each residue side-chain.

Yes, those things can be easier with the hierarchical approach. But, is that performance really important? And, at the same time, with the flat approach that specifically can also be done in O(n) (just mentioning because I think it is important to identify actual applications where one or other format is effectively limiting):

julia> using PDBTools, StaticArrays

julia> cm_side_chains = SVector{3,Float64}[]
       for residue in eachresidue(atoms) # lazy
           side_chain = select(residue, by = issidechain) # select side chain atoms of this residue 
           length(side_chain) > 0 && push!(cm_side_chains, center_of_mass(side_chain))
       end

(yes, if I needed the CM of one specific residue, I still have to traverse the array).

@timholy
Contributor

timholy commented May 25, 2024

The problem here is that this is not up to us to decide.

How we represent structures is/can be distinct from how they are encoded in the file. Several of our posts above concern renumbering schemes that ignore one or more tags in the file.

To the extent that Chain is a synonym for "molecule" (is it? I'm unsure), then obviously in reality each water molecule is its own chain.

struct Molecule

That's getting a bit into MolecularGraph.jl territory. It's an excellent package and we use it heavily.

with the flat approach that specifically can also be done in O(n)

Yes, that's basically what I meant by "groupby"; if your code example is meant to run in O(n), then presumably eachresidue makes a single pass to first collect atoms that are together in the same residue, or it assumes that atoms are listed in order by residue. The first is safe but impossible to do in a lazy manner; the latter is dangerous if the atoms are stored in a representation like Vector which permits reordering. You can do it safely without intermediate collection if you do something like

accum = [zeros(3) for _ = 1:nresidues]
count = fill(0, nresidues)
for atom in list
    idx = residueindex(atom)
    accum[idx] += atom.coords
    count[idx] += 1
end
centerofmass = accum ./ count

but of course that's a bit algorithm-specific (it works for linear measures on the residue, less obviously well for nonlinear measures).
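For measures that are not linear, one option is to spend the single pass on collecting atom indices per residue and then apply any function to each group. The sketch below uses plain Julia; the atom records and the `max_radius` measure are hypothetical, and the index lists are exactly the kind of intermediate collection mentioned above, so this is not lazy.

```julia
# Hypothetical flat atom records; `resid` is the residue index of each atom.
atoms = [
    (resid = 1, coords = [0.0, 0.0, 0.0]),
    (resid = 2, coords = [2.0, 0.0, 0.0]),
    (resid = 1, coords = [1.0, 0.0, 0.0]),   # deliberately out of order
    (resid = 2, coords = [2.0, 2.0, 0.0]),
]

# Single O(n) pass: record which atom indices belong to each residue.
groups = Dict{Int,Vector{Int}}()
for (i, atom) in enumerate(atoms)
    push!(get!(() -> Int[], groups, atom.resid), i)
end

# Any per-residue measure, linear or not, can now be applied group by group.
# Example: largest distance of any atom from the residue's geometric center.
function max_radius(coords)
    center = sum(coords) / length(coords)
    return maximum(c -> sqrt(sum(abs2, c .- center)), coords)
end

radii = Dict(resid => max_radius([atoms[i].coords for i in idxs])
             for (resid, idxs) in groups)
```

Because grouping is keyed on the residue index rather than on adjacency, reordering the atom vector does not change the result.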

@lmiq
Contributor Author

lmiq commented May 25, 2024

To the extent that Chain is a synonym for "molecule" (is it? I'm unsure), then obviously in reality each water molecule is its own chain.

It is because "chain" is thought of as "polymer chain", which makes sense for proteins, nucleic acids, or other polymers. Calling a water molecule a "chain" would be an abuse of notation, though.

Yet, a molecule could be a vector of general "ChemicalUnit"s, where a small molecule is a single unit and each polymer residue is also one.

struct Molecule
    units::Vector{ChemicalUnit}
end
struct ChemicalUnit
    atoms::Vector{Atom}
end

Then a water molecule would have a single chemical unit, while a polymer could have many and would map to what PDB files call a "chain".

Just throwing out an idea; I'm not claiming "ChemicalUnit" is necessarily a good name.

@jgreener64
Member

jgreener64 commented May 25, 2024

I wanted to understand if struct["A"].residue["274"].atoms[1] is something useful to justify the possible limitations of the representation that such a data structure brings.

I find this useful for interactive structural biology applications, but as you say any representation can probably do this by overloading getindex so I wouldn't say it justifies the data structure by itself.

PDB files accept multiple conformations of the same residue and have a specific notation to it. I don't think having a format that inherently is incapable of representing such is appropriate for a general structure representation package.

I think we are all aware of this but to be clear, BioStructures allows the 3 types of multiple conformations that appear in the PDB: alternative atom (representing ambiguous experimental determination), alternative residue (ambiguous experimental determination or mutation) and MODEL blocks (representing NMR models or MD trajectories). What it doesn't allow is multiple atoms with the same label, because the only way to distinguish these is by the order in the file, and this is a property of a file not the underlying molecule.

I'd propose that we distinguish the discussion about "hierarchical or flat" from the "strict vs permissive" discussion.

I agree with this, I am suggesting that even if the data structure allows it we should think very carefully before being permissive in this sense.

Dict('a' => [1, 2])

We could think about making this change, but would have to think about how it plays with collectatoms etc., I guess we would expand the repeats when a flag like expand_duplicates is given.

If we had an underlying molecular structure format that represented molecules where atoms have optional fields, that could effectively be useful for interoperability of the various molecular-structure packages.

This is exactly the aim of AtomsBase.jl, though it hasn't taken off for the biomolecular case yet. They have some readers and writers too. It would be nice to have support for that in BioStructures.

IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level

Residue-level is also natural and useful for biopolymers, for example for assigning parameters for a molecular dynamics simulation which are parameterised by residue.

@lmiq
Contributor Author

lmiq commented May 25, 2024

What it doesn't allow is multiple atoms with the same label, because the only way to distinguish these is by the order in the file, and this is a property of a file not the underlying molecule

But what's the label then? Two identical atoms belonging to different conformations of the same residue do not have a clear way to be distinguished by a label in the PDB. If we need to invent a label, I actually find it more useful to just stick with the index of the atom in the "file" (or more generally, "MolecularSystem"), which in fact is a useful index.

@timholy
Contributor

timholy commented May 25, 2024

I'm still coming to grips with this myself, but isn't this what alt_loc_id is about? For atoms, that's column 17 in the PDB file, see https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html. In PDBTools I think you bundle that into the residue name, https://github.com/m3g/PDBTools.jl/blob/3e246aec4ecbb2922ab85678404394e3245000a1/src/read_atom.jl#L58

@lmiq
Contributor Author

lmiq commented May 25, 2024

It is. I meant it is not useful as a unique identifier, unless some juggling is done.

@jgreener64
Member

jgreener64 commented May 25, 2024

Why is it not useful? The alt loc ID can be A, B etc. which uniquely identifies the conformation.

As I see it small-scale conformational variability is handled okay. The case we are missing might appear when you have >9999 water molecules, you can't write 10000 as a residue number and so you have some scheme where you reuse residue numbers on the same chain to represent completely distinct molecules.

Another case would be where you have different conformations of the whole system, which could represent a MD trajectory, but MODEL blocks can be used there as written by many MD packages.

@lmiq
Contributor Author

lmiq commented May 25, 2024

I was just editing my comment above.

Now I could check what BioStructures does, which is to store disordered atoms with a different type (that was the piece I was missing in terms of using the field for representing the multiple conformations):

julia> s["A"].residues["364"].atoms[" CB "]
DisorderedAtom CB with alt loc IDs A,B

julia> s["A"].residues["364"].atoms[" CB "]['A']
Atom CB with serial 1085, coordinates [25.31, 88.547, 65.041], alt loc ID A

julia> s["A"].residues["364"].atoms[" CB "]['B']
Atom CB with serial 1086, coordinates [25.318, 88.585, 65.034], alt loc ID B

I don't have any objection to that, in principle.

Retrospectively, I think, I didn't use BioStructures for my purposes because it failed reading any molecular simulation PDB files, but that might be just because of the issue here reported to begin with.

If a good solution to that is found, I might test if the typical files I have to handle are parsed and, then, I wouldn't mind moving the reading infrastructure and even the data formats of PDBTools to that. If someone smarter than me can implement a full-featured selection syntax like that of VMD (and which PDBTools implements only partially), I think PDBTools could be retired for good.

9999 water molecules, you can't write 10000 as a residue number

PDBTools (and VMD) switch to hexadecimal representations when that happens (also for the atom indices).
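A naive sketch of reading such a field in plain Julia, assuming the convention described above (decimal while the value fits, hexadecimal afterwards). The helper name `parse_index_field` is hypothetical and is not taken from PDBTools, VMD, or BioStructures.

```julia
# Hypothetical helper: read an integer field that may be decimal or hexadecimal.
function parse_index_field(field::AbstractString)
    s = strip(field)
    value = tryparse(Int, s)               # plain decimal first
    value === nothing || return value
    value = tryparse(Int, s; base = 16)    # fall back to hexadecimal
    value === nothing && error("cannot parse index field: $(repr(field))")
    return value
end
```

This is ambiguous for hexadecimal fields that happen to contain only decimal digits (for example "2710", which is 10000 in hexadecimal), so a real reader would probably also need to track where in the file the switch to hexadecimal happened.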

ps: I'm still uneasy about not using the sequence of atoms as written in the files. While I see the appeal in the fact that that sequence is somewhat arbitrary and non-physical, the fact is that editing these molecular systems is very commonly a back-and-forth lookup on the actual files, and I would rather have a clear correlation between the representation of the molecular system and what is written. I do not see any advantage of not being able to readily access the atoms by their incremental indices in the PDB file. Maybe I could get accustomed to using struct[AtomIndex(1)] or struct[sel"index 1"] (which would be an extension of the PDBTools style).

@jgreener64
Member

Agree that a selection syntax would be great. MDAnalysis is another reference point there: https://userguide.mdanalysis.org/stable/selections.html.

PDBTools (and VMD) switch to hexadecimal representations when that happens (also for the atom indices).

We could probably extend to read this in.

I'm still uneasy about not using the sequence of atoms as written in the files

We could store this as an extra field. In general I think it is just the atom serial though, which we do store. The order that residues are read in we don't store.

@lmiq
Contributor Author

lmiq commented May 25, 2024

9999 water molecules, you can't write 10000 as a residue number

I think it is just the atom serial though, which we do store.

Not necessarily, because the atom indices too frequently overflow the format. VMD just ignores them and counts the atoms sequentially. PDBTools stores the serial numbers of the PDB in the pdb_index field, which is effectively rarely used.

Concerning levels of classification: there is the "segment" level as well, which is actually important. For instance, we are dealing now with a virus particle. Viruses have a lot of symmetry, and the deposited PDB has only a minimal unit that has to be replicated to compose the complete virus capsid. Within that minimal unit there are of course different chains. Upon replication of the unit, we then have multiple repeated chains, and we do not want to change the names of these chains, because it is inconvenient for many sorts of analysis. The different replicas of the minimal unit are then differentiated by their "segment names". Thus, in this case, we have, in the same MODEL, many repeated CHAIN identifiers, with repeated fields for everything else, except the "segment" name.

@timholy
Contributor

timholy commented May 26, 2024

Awesome, @lmiq. I'll be happy to help out once I get past my conference deadline.

Regarding atom order, one thing to think about is the overhead of allocating a Vector{Float64} for each atom xyz coordinate. One alternative would be to store locations at the Chain level in a 3xn matrix, and have each atom refer to the appropriate column in that matrix. That might fit nicely with the representation you're requesting. But the performance implications would have to be checked carefully.
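A minimal sketch of that layout in plain Julia follows. The types `MatrixAtom` and `MatrixChain` and the accessor `atom_coords` are hypothetical names for illustration, not existing BioStructures types, and no performance claim is implied.

```julia
# Hypothetical sketch: coordinates live in one 3×n matrix per chain,
# and each atom only remembers which column is its own.
struct MatrixAtom
    name::String
    column::Int
end

struct MatrixChain
    atoms::Vector{MatrixAtom}
    coords::Matrix{Float64}   # column j holds the xyz of the atom with column == j
end

# Access one atom's coordinates as a view, without copying them.
atom_coords(chain::MatrixChain, atom::MatrixAtom) = view(chain.coords, :, atom.column)

xyz = [0.0 1.5 3.0;
       0.0 0.0 1.0;
       0.0 0.0 0.0]
mchain = MatrixChain([MatrixAtom("N", 1), MatrixAtom("CA", 2), MatrixAtom("C", 3)], xyz)

ca = atom_coords(mchain, mchain.atoms[2])

# Whole-chain operations become plain array code, e.g. the geometric center.
center = vec(sum(mchain.coords; dims = 2)) ./ size(mchain.coords, 2)
```

The column order of the matrix can also follow the order of the atoms in the input file, which would keep the file-order index available.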

@timholy
Contributor

timholy commented May 26, 2024

Not necessarily, because the atom indices too frequently overflow the format.

It is kind of amazing that folks (including me!) haven't adopted mmCIF, because these are the kinds of problems it was designed to solve. In my own case, the small step from "PDB" the database to "PDB" the file format made me assume "oh, PDB is the right format to download." I only just read enough about this in the past few days to realize that I should switch to mmCIF files, and how weird it is that we all still prefer a file format heavily tailored to the era when keyboards only had capital letters, monitors had black backgrounds with glowing green fonts, and terminals were fixed at 80-character width. It's a little like the story about how the gap between two horses pulling a chariot ended up affecting the design of the space shuttle, because the horse-gap set the wheel spacing, the wheel spacing determined the ruts in the ground and thus the road width, which ended up affecting the dimensions of tunnels in the modern US interstate highway system needed to transport assembled components over long distances.

I wonder if they had named it "PDB+" if it would have wider adoption? There seems to be a lot in a name.

@lmiq
Contributor Author

lmiq commented May 26, 2024

The alternative to store the coordinates in a matrix or Vector{SVector} is something that may be worth additional thinking. That would allow using the coordinates efficiently. If that comes with some other benefits, even better.

I'm still, though, not convinced that not having an order for the atoms in the complete system, following the input file, is really an advantage. There's the argument that chains or molecules could be randomly written in the original file, but that is something I've never seen. (on the contrary, the order of the chains, heteroatoms, etc, is usually carefully thought in terms of their importance).

Concerning the PDB format, there are objective advantages of having a format for which one can copy and paste chunks of data from one file to the other. (and it worries me a bit the fact that mmCIF can, as far as I understand, have other fields, with other levels of user-defined molecular architecture organization, which probably will appear as the systems become larger and larger - and a long-lasting reading format should probably be able to adapt to that).

@jgreener64
Member

One way to store segment information would be to have a ChainID-like type that stores chain_id and segment_id. Then you can index into a Model with a ChainID; the current behaviour of indexing with a string or character would access the ChainID with that chain_id and no segment_id (or any segment_id, if there is only one chain with that chain_id).
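A rough sketch of that lookup rule in plain Julia. The type `ChainKey`, the function `lookup`, and the toy `model` dictionary are hypothetical stand-ins, not the current BioStructures API.

```julia
# Hypothetical key type combining chain and segment identifiers.
struct ChainKey
    chain_id::String
    segment_id::String
end

# Explicit value semantics so that ChainKey works as a Dict key.
Base.:(==)(a::ChainKey, b::ChainKey) =
    a.chain_id == b.chain_id && a.segment_id == b.segment_id
Base.hash(k::ChainKey, h::UInt) = hash(k.segment_id, hash(k.chain_id, h))

# A toy "model": the string values stand in for real chain objects.
model = Dict(
    ChainKey("A", "SEG1") => "chain A of replica 1",
    ChainKey("A", "SEG2") => "chain A of replica 2",
    ChainKey("B", "")     => "chain B, no segment",
)

# Full key: unambiguous lookup.
lookup(model, key::ChainKey) = model[key]

# Bare chain ID: prefer the entry with no segment ID; otherwise accept the
# chain only if exactly one segment uses that chain ID.
function lookup(model, chain_id::AbstractString)
    key = ChainKey(chain_id, "")
    haskey(model, key) && return model[key]
    matches = [k for k in keys(model) if k.chain_id == chain_id]
    length(matches) == 1 && return model[only(matches)]
    error("chain ID $(repr(chain_id)) is missing or ambiguous; pass a ChainKey")
end
```

With this rule, `lookup(model, "B")` keeps working as string indexing does today, while the replicated chain "A" has to be addressed with a full `ChainKey`.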

We do currently read all fields from mmCIF files into a MMCIFDict and use a subset to create the molecular representation. All fields are available in the MMCIFDict, though, which is an advantage of that format: files can store arbitrary data in a codified way rather than using the columns left undefined in the PDB format. This ability to store arbitrary data is one of the reasons for the push to mmCIF and in my opinion is a good thing.

@timholy
Contributor

timholy commented May 27, 2024

there are objective advantages of having a format for which one can copy and paste chunks of data

Yep. There is a table format (expand the "styling plans"), one that exists now and one which seems planned (not sure what that means...). But it clearly isn't universally used.

@lmiq lmiq mentioned this issue May 27, 2024
@lmiq
Contributor Author

lmiq commented May 28, 2024

Found another detail about reading MD PDB files: in these files usually 4-letter residue names are accepted, using the character in position 21 as the fourth character (which does not have any function in the original PDB format). For instance, this is a very common way to represent water:

ATOM   4013  OH2 TIP3B   1     -28.223  19.920 -27.748  1.00  0.00      WAT1 O
ATOM   4014  H1  TIP3B   1     -27.453  20.358 -27.476  1.00  0.00      WAT1 H
ATOM   4015  H2  TIP3B   1     -27.834  19.111 -28.148  1.00  0.00      WAT1 H

where TIP3 is the residue name. Currently BioStructures reads the residue name as TIP.
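The difference comes down to which columns are sliced out of the fixed-width record. A minimal sketch in plain Julia; the function `read_resname` and its `four_letter` keyword are hypothetical and only illustrate the two readings.

```julia
# One of the ATOM records shown above.
line = "ATOM   4013  OH2 TIP3B   1     -28.223  19.920 -27.748  1.00  0.00      WAT1 O"

# Standard PDB: the residue name is columns 18-20.
# MD-style extension: column 21 is used as a fourth character.
read_resname(line; four_letter = false) =
    four_letter ? strip(line[18:21]) : line[18:20]

strict_name = read_resname(line)                      # three-character reading
md_name     = read_resname(line; four_letter = true)  # four-character reading
chain_id    = line[22]                                # chain ID is still column 22
```

For a standard record, column 21 is blank, so the four-character reading followed by `strip` gives the same three-letter name.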

This raises an additional complication because blank characters are currently meaningful as the residue or atom keys (IMO they should be stripped, but I don't know if this adds additional complications because of the variable length of each key).

@jgreener64
Member

To me that seems like a clear format violation.

I did consider whether to strip atom and residue names for storage, but decided against it as there are common cases like " CA " and "CA " representing C-alpha and calcium respectively. The getter functions atomname and resname have a strip keyword argument, which is true by default.
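The collision is easy to see with plain strings. In the sketch below, the four-character names follow the usual PDB column convention, where the element symbol is right-justified in the first two columns of the atom name; only Base Julia is used.

```julia
c_alpha = " CA "   # alpha carbon: element C in the second column
calcium = "CA  "   # calcium: element CA in the first two columns

# The stored, unstripped names are different keys...
distinct_when_stored = c_alpha != calcium

# ...but they become identical once whitespace is stripped.
same_when_stripped = strip(c_alpha) == strip(calcium)
```

Both comparisons are true, which is why stripping is only safe at the getter level and not for the stored keys.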

@lmiq
Contributor Author

lmiq commented May 28, 2024

To me that seems like a clear format violation.

Yes, it is, but a common one. The question is should this package support that, or do we need another parser?

common cases like " CA " and "CA " representing C-alpha and calcium respectively

Ouch. I didn't know about that. I've only ever seen calcium as "CAL". Also, do you think that is standardized?

@jgreener64
Member

We could have it behind a flag I guess.

Atom and residue names don't seem to be well standardised but if you store the spacing you can at least write it out the same way it was read in.

@jgreener64
Member

Based on discussions here and on Slack a few changes have recently gone in:

  • Selection strings from @lmiq.
  • Faster PDB parsing from @timholy.
  • ProteinStructure has been renamed to MolecularStructure.
  • Not exporting x, x!, etc.
  • The BioSequences.jl and BioAlignments.jl dependencies have been moved to extensions.

Now is a good time to mention any other breaking changes of interest as the next version will be v4 (I will unify the selection functions a bit first).

Regarding the heaviness of the package, for me using BioStructures now takes 0.15 s on Julia 1.11.0-beta2. @time_imports using BioStructures gives the following for me in case anyone has further suggestions.

      1.1 ms  TranscodingStreams
      1.8 ms  BioGenerics
      0.2 ms  PrecompileTools
      5.8 ms  BioSymbols
               ┌ 0.1 ms Zlib_jll.__init__() 
     89.3 ms  Zlib_jll 97.10% compilation time
      0.8 ms  CodecZlib
      3.7 ms  Format
      0.7 ms  Serialization
      1.3 ms  MsgPack
      0.3 ms  MMTF
      9.3 ms  RecipesBase
               ┌ 44.1 ms STRIDE_jll.__init__() 99.53% compilation time (100% recompilation)
     44.4 ms  STRIDE_jll 98.86% compilation time (100% recompilation)
               ┌ 0.1 ms DSSP_jll.__init__() 
      0.4 ms  DSSP_jll
      0.6 ms  Statistics
     63.2 ms  BioStructures

STRIDE_jll (and DSSP_jll) could be moved to extensions, I'm not sure how conventional that is though.

On Julia 1.10.3 Zlib_jll and STRIDE_jll take considerably less time to load, though the overall load time is longer (0.26 s) since SparseArrays is not a weak dependency in Statistics.

@lmiq
Contributor Author

lmiq commented Jun 4, 2024

From the discussion here up to now, my impression is that we cannot escape having a new type (MDPDB maybe) for reading the PDBs of MD simulations with their loose standards (particularly, now, the 4-letter residues), so I don't think developments on that front will bring breaking changes.

@timholy
Contributor

timholy commented Jun 6, 2024

The JLL load time is presumably something that should be fixed elsewhere. I wouldn't tie yourself in knots to work around the issue in this package.

Nice progress on decreasing load time!

@marcom
Contributor

marcom commented Jun 6, 2024

One solution to the nonstandard PDB files would be to have an additional data type that serves the equivalent job of MMCIFDict (maybe PDBFile) that simply represents what data is present in the file, regardless of other considerations. That would allow reading and working with these files, even if they are not necessarily considered correct.
