Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 69 2 #76

Closed
wants to merge 3 commits into from
Closed

Issue 69 2 #76

wants to merge 3 commits into from

Conversation

manulera
Copy link
Contributor

@manulera manulera commented Jun 4, 2024

Hi @tnrich, apologies in advance for the long issue, but there was a bunch of things.

This is the sequence-utils part of #69. There are three commits:

  1. dcdaf44: re-factor of getAminoAcidDataForEachBaseOfDna.js. It removes special cases, and does everything in the same loop (before the X aminoacid corresponding to truncated codons was added separately and differently for forward and reverse translations). I think it makes the code a bit clearer, and I thought doing a single loop was necessary to implement the next feature. I have also splitted out some pre-processing into a separate function to have only the logic in getAminoAcidDataForEachBaseOfDna.
  2. 878ce24: added a few extra tests for getAminoAcidDataForEachBaseOfDna in the protein case. Some are skipped (see below for questions on this).
  3. 6c45135: actual implementation of the feature. Tests are still missing because I wanted to double-check with you a few things before writing them (see below).

Some tests fail, but I am not sure they are related to the feature I implemented. Some don't seem to (mainMenu fail for instance).

Below the questions I had

Unrelated to the implementation

  • optionalSubrangeRange is a feature-like object (with start and end) passed to getAminoAcidDataForEachBaseOfDna to indicate the subset of the sequence that would be translated. Should this be allowed at all when isProtein === true? The reason why I ask is because now the locations of that feature are used to describe the exons (joined feature CDS-like) and that would be quite meaningless for a protein sequence. Not sure if it's worth to allow optionalSubrangeRange in the first place. The old code ignores the range for protein sequences (I have added some skipped tests to getAminoAcidDataForEachBaseOfDna.test.js, they are self-explaining). In my implementation, I have kept it that way, but it's probably best to return an error instead, since people may think that passing the range works for protein sequences.
  • I found the meaning of the includeEndEdge argument a bit confusing in the function isPositionWithinRange. For instance, with the name of the argument, I would have expected isPositionWithinRange(3,{start:1,end:2}, 4, true, true) to be false, and it's true. But isPositionWithinRange(0,{start:1,end:2}, 4, true, true) is false. I think it would be good to use a more self-explaining argument name or explain more in the docstring what exactly this argument represents.

Related to the implementation

Right now, the output of getAminoAcidDataForEachBaseOfDna is an array that contains the aminoacid-level info for each basepair. For instance, for the bases ATG of the ATG of a CDS, the corresponding array is (where getAA("atg") returns an object with aminoacid properties):

aaData = getAminoAcidDataForEachBaseOfDna("atg", true);
    assert.deepEqual(aaData, [
      {
        aminoAcid: getAA("atg"),
        positionInCodon: 0,
        aminoAcidIndex: 0,
        sequenceIndex: 0,
        codonRange: {
          start: 0,
          end: 2
        },
        fullCodon: true
      },
      {
        aminoAcid: getAA("atg"),
        positionInCodon: 1,
        aminoAcidIndex: 0,
        sequenceIndex: 1,
        codonRange: {
          start: 0,
          end: 2
        },
        fullCodon: true
      },
      {
        aminoAcid: getAA("atg"),
        positionInCodon: 2,
        aminoAcidIndex: 0,
        sequenceIndex: 2,
        codonRange: {
          start: 0,
          end: 2
        },
        fullCodon: true
      }
    ]);

The question is what to do when a basepair is part of an intron. The easiest thing (that gives the right representation in the UI) is to insert some kind of "mock" element in the array. Currently I am inserting something like this:

{
    aminoAcid: null,
    positionInCodon: null,
    aminoAcidIndex: null,
    sequenceIndex: 2,
    codonRange: null,
    fullCodon: null
}

The function expects to have an array of the same length as the input protein sequence, so I think something in these lines makes the most sense. It also gives good rendering (see below). What do you think?

Representation

As it is, it already gives a better representation that before (before the wrong amino acid sequence was displayed, because it was not taking into account the introns), but there are some awkward things.

You can see all the examples below in example.zip. You can reverse the sequence and rotate it and you can see that even if the features span the origin or inverted, they behave the same way.

If the intron does not split an aminoacid, it looks good:

Screenshot 2024-06-04 at 10 54 17

If the intron splits the aminoacid into two, the aminoacid is shown like this (displaced towards the side with most bases).

Screenshot 2024-06-04 at 11 29 51 Screenshot 2024-06-04 at 11 29 19

There is also the case where the aminoacid is splitted into three exons, because there is an exon with a single basepair.

Screenshot 2024-06-04 at 11 30 06

Improvements to representation?

For amino acids that are splitted into several exons, I think it would be better to use the representation that is used when the aminoacid is splitted between two lines in the sequence map (bottom image), so there would be a fragment of the aminoacid symbol in each bit of the intron. I can give a go at this if you give me some guidance.

Screenshot 2024-06-04 at 11 31 31

I think it would also be good to have a thin line joining the translation bits that correspond to the same CDS, like the thin line that connects the different positions in a feature with multiple parts, something like below. I can give a go at this if you give me some advice as well.

Screenshot 2024-06-04 at 10 54 17

ready to add gaps, tests pass

final refactor
* What happens with proteins for which `optionalSubrangeRange` is provided? Should it just return an error?
* If that's allowed, what if they have `locations`? Exons within the protein?
* What should be inserted in the aminoAcidDataForEachBaseOfDNA array for an intron base
* Ask about isPositionWithin range, why do I have to use true, false as the last two arguments?
@tnrich
Copy link
Collaborator

tnrich commented Jun 4, 2024

Hi @manulera nice work! Looking good so far!

To get the AA's truncating in a better way, I'd look into modifying the logic around isTruncatedEnd/isTruncatedStart:

image

I've added a bit of code to start you down the right path with that on this branch https://github.com/TeselaGen/tg-oss/compare/tnwFiddleWithAaGaps

Here's what the example seq looks like rn there:
image

I think with a bit more trial and error you should be able to get it rendering pretty nicely!

Good luck and lemme know if you need anything else!

@tnrich
Copy link
Collaborator

tnrich commented Jun 5, 2024

Closing as this has been merged and published :)

@tnrich
Copy link
Collaborator

tnrich commented Jun 6, 2024

I found the meaning of the includeEndEdge argument a bit confusing in the function isPositionWithinRange. For instance, with the name of the argument, I would have expected isPositionWithinRange(3,{start:1,end:2}, 4, true, true) to be false, and it's true. But isPositionWithinRange(0,{start:1,end:2}, 4, true, true) is false. I think it would be good to use a more self-explaining argument name or explain more in the docstring what exactly this argument represents.

This is working as expected AFAICT with range indices being 0-based.

 G T A C C  <-- sequence
 0 1 2 3 4  <-- sequence indices
   r r      <-- range
0 1 2 3 4 5  <-- position indices

@manulera manulera deleted the issue_69_2 branch June 7, 2024 09:07
@manulera
Copy link
Contributor Author

manulera commented Jun 7, 2024

Oh, I see, I didn't realise that position and index referred to different things. It makes sense now. Thanks for the explanation.

By the way, I meant to say it before, but it has been quite manageable navigating the codebase and contributing despite it being so big, so congrats for the great repo!

@tnrich
Copy link
Collaborator

tnrich commented Jun 7, 2024

Thanks @manulera ! Glad that you're helping in such a big way!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants