Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Add tag for reporting number of converted/unconverted bases #328

Closed
y9c opened this issue Oct 7, 2021 · 7 comments
Closed
Labels

Comments

@y9c
Copy link

y9c commented Oct 7, 2021

The Yf tag in hisat2-3n is a little confusing. When run hisat2-3n with --base-change C,T argument, I guess the "number of conversion" should be the number of C bases that converted into T. But I found that this tag is reporting other kinds of mutations (mismatch), say G to T mutation, and the number of C to T mutation is not included. However, the XM tag reports the exact number of C to T conversion. This looks very weird, because the Yf and XM tag are documented as

Yf:i:<N>: Number of conversions are detected in the read.

XM:i:<N> : The number of mismatches in the alignment. 

I would like to know if it is possible to report number of non C to T conversion in to XM tag, and use another two tags to report number of C to T conversion and number of non C to T conversion?
C to T is an example, this can also be applied to other settings.

@imzhangyun
Copy link
Contributor

Hello Chang,

Thank you very much for this issue!!! I introduced this bug in last commit (910c979fdd422d7ae0d78a554cb3e5187b06d267), and I just fixed it. I am really sorry about this bug. HISAT-3N should be normal now.

There are 4 different cases a read can be mapped to reference(genome):
Case 1. Forward sequence of the read mapped to forward strand of reference (Watson strand)
Case 2. Forward sequence of the read mapped to Reverse strand of reference (Crick strand)
Case 3. Reverse complement sequence of the read mapped to forward strand of reference (Watson strand)
Case 4. Reverse complement sequence of the read mapped to Reverse strand of reference (Crick strand)

Yf tag will report the conversion number. For example, with --base-change C,T option, Yf report the C->T count for case 1 and 4, G->A count for case 2 and 3.

XM tag will report mismatch and not include Yf tag.

For example with --base-change C,T option:

Reference Sequence: NNNNNNNAAACCCCGGGGGTTTTTTTTTNNNNNNNNN
                           *   ---        *
Read Sequence:             CAACTTTGGGGGTTTATTTTT

Yf is 3. XM is 2. The MD tag will show all the differences between reference and read sequence. So the MD tag should be: MD:Z:A3C0C0C8T5

Sorry again for the bug.
Leo

@y9c
Copy link
Author

y9c commented Oct 7, 2021

By reading the code, I found that badConversion is assign to XM tag. So I think this is consistent with my understanding. But the badConversion is not calculated properly?

@y9c
Copy link
Author

y9c commented Oct 7, 2021

Thank you very much @imzhangyun.

Can I have another tag, like Yn to count number on unconverted based in the read. It will be 1 in your example.

@imzhangyun
Copy link
Contributor

Thank you for your suggestion. Since the Yn tag could make the output SAM file bigger, we may not add Yn tag to current HISAT-3N.

Also Yn and Yf may not represent all the C in read/reference sequence.
For example:

Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN
                              ** ---          ****
Read Sequence:             AAAGGCTTTGGGGGTTTTTCCCC

In this case, Yn is 1, XM is 6. Yf is 3. Yn + Yf != count of C in the reference or read sequence. I think it is better to give this job to downstream analysis software/script.

Best,
Leo

@y9c
Copy link
Author

y9c commented Oct 7, 2021

How about make all the tags optional, just in the same way as STAR aligner do? I wonder if there is any concern to make Yn + Yf= C counts?

@imzhangyun
Copy link
Contributor

Hello Chang,

Sorry for the last response. I just add a new tag Zf:i:<N>. The Zf tag counts the unconverted base in the read sequence.

Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN
                              ** ---          ****
Read Sequence:             AAAGGCTTTGGGGGTTTTTCCCC

In this case, Zf = 1, Yf = 3, XM = 6. (--base-change C,T)

Too get the HISAT-3N with Zf tag, please check our "hisat-3n_Zf" branch. I tested it and it looks OK. Could you also test it and see if it works for you? Then I will merge it to the "hisat-3n' branch.

Best,
Leo

@y9c
Copy link
Author

y9c commented Oct 8, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants