Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group method: not all UMI-containing reads assigned UG tag #635

Open
epiliper opened this issue Mar 20, 2024 · 7 comments
Open

group method: not all UMI-containing reads assigned UG tag #635

epiliper opened this issue Mar 20, 2024 · 7 comments
Assignees
Labels

Comments

@epiliper
Copy link

epiliper commented Mar 20, 2024

Hi! After running group on a sorted bamfile of roughly 3.4 million reads, I noticed that only half (~1.6 million reads) were actually given UG tags.

I checked this by running samtools view -d UG on the grouped file and counting the number of reads meeting these filter conditions.

All reads in the bamfile I used should contain UMIs. Is this expected behavior, and if so, what might cause some reads to not get assigned to a read group?

Thanks in advance for your patience; I'd really appreciate any help/explanation.

I'm a new user so apologies if I'm missing the obvious.

@TomSmithCGAT
Copy link
Member

Hi @epiliper - Could you please post the umi_tools group command used. Cheers

@epiliper
Copy link
Author

Command used:

umi_tools group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=":" --paired -S UMI3-RSV90-10_S2_umi.bam

Here is the QNAME of a typical read in my input bamfile:

M04202:286:000000000-L5TBR:1:1102:16626:7061:TGTCAGGCTAT

Where TGTCAGGCTAT is the UMI.

When I run the above command, it runs with these options:

UMI-tools version: 1.1.4
 output generated by group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=: --paired -S UMI3-RSV90-10_S2_umi.bam
job started at Thu Mar 21 04:04:31 2024 on MacBeth.users.gowaveg.com -- fb9c6e30-e5bd-4b2f-9b5f-62ccf0d1abd9
pid: 40348, system: Darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:43 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6000 arm64
assigned_tag                            : None
cell_tag                                : None
cell_tag_delim                          : None
cell_tag_split                          : -
chimeric_pairs                          : use
chrom                                   : None
compresslevel                           : 6
detection_method                        : None
filter_umi                              : None
gene_tag                                : None
 gene_transcript_map                     : None
get_umi_method                          : read_id
ignore_tlen                             : False
ignore_umi                              : False
in_sam                                  : False
 log2stderr                              : False
loglevel                                : 1
 mapping_quality                         : 0
method                                  : directional
no_sort_output                          : False
out_sam                                 : False
output_bam                              : True
output_unmapped                         : False
paired                                  : True
per_cell                                : False
per_contig                              : False
per_gene                                : False
random_seed                             : None
read_length                             : False
short_help                              : None
skip_regex                              : ^(__|Unassigned)
soft_clip_threshold                     : 4
spliced                                 : False
stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin                                   : <_io.TextIOWrapper name='UMI3-RSV90-10_S2.aligned.sorted.bam' mode='r' encoding='UTF-8'>
stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
stdout                                  : <_io.TextIOWrapper name='UMI3-RSV90-10_S2_umi.bam' mode='w' encoding='UTF-8'>
subset                                  : None
threshold                               : 1
timeit_file                             : None
timeit_header                           : None
timeit_name                             : all
tmpdir                                  : None
tsv                                     : None
umi_group_tag                           : BX
umi_sep                                 : :
umi_tag                                 : RX
umi_tag_delim                           : None
umi_tag_split                           : None
umi_whitelist                           : None
umi_whitelist_paired                    : None
unmapped_reads                          : discard
unpaired_reads                          : use
whole_contig                            : False

Thanks again.

@epiliper
Copy link
Author

epiliper commented Mar 21, 2024

Update: I think I know what might be happening...

Looking at part of the definition of get_bundles() in lines 320-327 of sam-methods:

if read.is_read2:
                if self.return_read2:
                    if not read.is_unmapped or (
                            read.is_unmapped and self.return_unmapped):
                        yield read, None, "single_read"
                continue
            else:
                self.read_events['Input Reads'] += 1

If I'm understanding this right, umi-tools doesn't consider read2 in grouping and downstream analysis. That would explain why roughly half of my reads are getting tagged.

Does this sound like a reasonable cause?

I apologize if I missed this in the docs; I guess checking the percentage of reads tagged with "UG" after running group isn't commonly done.

@TomSmithCGAT
Copy link
Member

Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉

Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options

for bundle, key, status in bundle_iterator(inreads):
# write out read2s and unmapped/chimeric (if these options are set)
if status == 'single_read':
# bundle is just a single read here
nInput += 1
if outfile:
outfile.write(bundle)
nOutput += 1
continue

Looking over the documentation, I don't think this is stated anywhere. @IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.

@IanSudbery
Copy link
Member

IanSudbery commented Mar 26, 2024 via email

@TomSmithCGAT
Copy link
Member

OK, I'll put updating the docs regarding this on the to-do.

If we do want to include read tags on read2, I'm torn between allowing this in group and a separate tool to add these afterwards. The former seem cleaner and shouldn't be too hard, but the later will probably be quicker to actually get done! Thoughts?..

@TomSmithCGAT TomSmithCGAT self-assigned this Mar 26, 2024
@epiliper
Copy link
Author

Thank you so much for confirming the cause of half-tagged reads! Peace of mind has been restored.

If you're asking me for my thoughts:

  • I asked some labmates also involved in UMI NGS, and we think a solution built within UMI-tools group for R2 processing would be great.

  • A coworker proposed separating, grouping and deduplicating R1 and R2 in parallel, then aligning them both to the reference in a final bam file.

All of this is of course just proposals; Either way, we'd be interested to hear more about this considering it would have a massive impact on how we set up our NGS runs.

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants