AssertionError: not all umis are the same length(!): 4 - 5 #461

MonalisaHota · 2021-03-04T07:53:24Z

Hello,
I am trying to run UMI-tools to remove duplicate reads based on UMIs, but I am getting the error "AssertionError: not all umis are the same length(!): 4 - 5". Can anyone suggest how to resolve it?

I am using the following command:

umi_tools dedup --stdin=filtered.bam --umi-separator=":" --log=LOGFILE --output-stats=stats.txt -S output.bam > OUTFILE

This is the head of my bam file:

A00609:116:H7JCGDSXY:1:2171:13937:29825 16 1 10534 255 3S96M2D21M31S * 0 0 CGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCCCCCATGTACTCTGCGTTGATACCACTGCTT FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFF:FF:FFFFFFF:FFFFFFF:FFFFF:FFFF NH:i:1 HI:i:1 AS:i:103 nM:i:3 RE:A:I xf:i:0 li:i:0 CR:Z:CCGTGAGAGAACGCGT CY:Z::FF:F:FF:FFF,F:F CB:Z:CCGTGAGAGAACGCGT-1 UR:Z:TTTTCCGCACTT UY:ZG:Z:cells:0:1:H7JCGDSXY:1

I am getting the following error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/dedup.py", line 329, in main
    reads, umis, umi_counts = processor(
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/network.py", line 419, in __call__
    clusters = self.UMIClusterer(counts, threshold)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/network.py", line 367, in __call__
    assert max(len_umis) == min(len_umis), (
AssertionError: not all umis are the same length(!): 4 - 5

TomSmithCGAT · 2021-03-04T08:49:47Z

By default, umi_tools dedup expects the UMI (with or without a cell barcode) to be encoded in the read name. E.g., in the read above, it will take 29825 to be the UMI. You can ignore the error itself, since the underlying issue is telling the command where the UMI is encoded.
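
To see where the 4 and 5 in the error come from, here is a minimal Python sketch that takes the last `:`-separated field of a read name as the UMI. It only illustrates the behaviour described above and is not UMI-tools' actual code; the helper name `umi_from_read_name` and the second read name are made up for the example.

```python
def umi_from_read_name(read_name, separator=":"):
    """Return the text after the last separator, i.e. what gets treated as the UMI."""
    return read_name.rsplit(separator, 1)[-1]


# The first name is from the BAM record above; the second is a hypothetical
# read whose final field happens to have only four digits.
names = [
    "A00609:116:H7JCGDSXY:1:2171:13937:29825",
    "A00609:116:H7JCGDSXY:1:2171:13937:9825",
]
umis = [umi_from_read_name(n) for n in names]
lengths = sorted({len(u) for u in umis})
print(umis, lengths)
```

The final field of these read names is a number rather than a UMI, so its length varies from read to read, which is what the assertion trips over.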

It looks like you have output from cellranger, with the UMI in BAM tag UR and the cell barcode in BAM tags CR and CB. From the Cell Ranger documentation, we can see that CR is the cell barcode directly from the fastq sequence, and CB is the barcode after error correction, in which case you'll want to use CB, I assume.

If you add the following to your command, umi_tools will deduplicate the reads in each cell independently, which I assume is what you want to achieve. You can drop the --cell-tag if not.

--extract-umi-method=tag --umi-tag=UR --cell-tag=CB
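
For reference, a small Python sketch that assembles the original command with the options above appended. The flags and file names are the ones quoted in this thread; the function name `build_dedup_command` and its parameters are just for illustration, and the command is only printed here, not executed.

```python
import shlex


def build_dedup_command(in_bam, out_bam, per_cell=True, with_stats=True):
    """Assemble the umi_tools dedup call, reading UMIs from BAM tags."""
    cmd = [
        "umi_tools", "dedup",
        f"--stdin={in_bam}",
        "--umi-separator=:",           # from the original command
        "--log=LOGFILE",
        "-S", out_bam,
        "--extract-umi-method=tag",    # take the UMI from a BAM tag
        "--umi-tag=UR",
    ]
    if per_cell:
        cmd.append("--cell-tag=CB")    # cell barcode tag, as suggested above
    if with_stats:
        cmd.append("--output-stats=stats.txt")  # optional, can be dropped
    return cmd


cmd = build_dedup_command("filtered.bam", "output.bam")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```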

Also note that running with --output-stats can slow down processing and increase memory usage considerably. If you find this is the case, you may want to collect stats on a suitable subset of the data.

MonalisaHota · 2021-03-05T03:57:07Z

Thank you so much for your quick response. I tried with these options and the dedup ran fine.

TomSmithCGAT · 2022-02-21T08:46:02Z

Hi @iammrtza. In the 3 lines above, the UMIs are all different lengths (UMI is TCCCCGCCC in first line), so the error message appears to be correct. Overall, it appears your UMI lengths are between 4-12. They need to all be the same length.

What was the command you used to extract the UMIs from the fastq?
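
A quick way to check this on your own data is to tally the UMI lengths found in the read names. The sketch below assumes the UMI is whatever follows the last `_` in the read name; the read names are invented for the example (only TCCCCGCCC is taken from above), and `umi_length_counts` is a hypothetical helper, not part of UMI-tools.

```python
from collections import Counter


def umi_length_counts(read_names, separator="_"):
    """Count how many read names carry a UMI of each length."""
    return Counter(len(name.rsplit(separator, 1)[-1]) for name in read_names)


# Hypothetical read names, for illustration only.
names = ["read1_TCCCCGCCC", "read2_ACGT", "read3_ACGTACGTACGT"]
counts = umi_length_counts(names)
print(sorted(counts.items()))

# More than one distinct length means dedup will raise the assertion.
print("mixed lengths:", len(counts) > 1)
```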

iammrtza · 2022-02-21T08:52:39Z

Hi Tom,

My Illumina read structure is depicted here. After removing adapter/junk (using cutadapt), I grepped for the "common sequence"; what was left was the UMI, which I added to the read headers in the FASTA file. Then I removed the "common sequence" using cutadapt again.

iammrtza · 2022-02-21T08:58:27Z

Is it OK if I add dummy letters (e.g. A) to the shorter UMIs so they reach the maximum length of 12? (In my Illumina reads, the longest UMI should be 12.)

TomSmithCGAT · 2022-02-21T09:21:35Z

I think your strategy is probably suboptimal. As far as I understand what you're doing, the potential issues are:

Cutadapt is unaware that you have UMIs and could in theory remove part of them from your reads.
Cutadapt will also struggle to identify junk unless it's very clearly so, e.g. low sequence quality.
I expect your grep is not allowing for any errors in your common sequence.

The read structure is identical with respect to the common sequence and UMI length, so you can use umi_tools extract to perform all these steps in one go. The following should work, with COMMON_SEQUENCE replaced by the expected common sequence. Sequences matching the regex groups discard_1 and discard_2 will be removed. Hence, umi_tools extract replaces all 3 of your steps above.

UMI-tools uses the regex package rather than base re. The {s<=1} after the common sequence in the regex allows up to one substitution error. See https://umi-tools.readthedocs.io/en/latest/regex.html for more details on how to specify the regex (you can ignore the stuff about cell barcodes, that's for single cell applications).

umi_tools extract \
--extract-method=regex \
--bc-pattern="(?P<discard_1>COMMON_SEQUENCE{s<=1})(?P<umi_1>.{12})(?P<discard_2>.*)" \
-L extract.log

This will demand the UMIs are 12 nt. If your read length is insufficiently long to get through the sRNA, common sequence and UMI, you may find many of your reads don't match the regex. This will be reported in the log file.
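
To make the intent of the pattern concrete, here is a plain-Python sketch of the same idea: find the common sequence with at most one substitution, take the next 12 nt as the UMI, and reject reads that are too short. This is not how UMI-tools implements it (that uses the regex package). The function names, the placeholder common sequence `GATTACA`, the example reads, and the choice to scan the read for the common sequence are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_read(read, common, umi_len=12, max_subs=1):
    """Return (insert, umi), or None when the read does not fit the layout."""
    for start in range(len(read) - len(common) + 1):
        window = read[start:start + len(common)]
        if hamming(window, common) <= max_subs:
            umi_start = start + len(common)
            umi = read[umi_start:umi_start + umi_len]
            if len(umi) == umi_len:
                return read[:start], umi
            return None  # common sequence found, but the read ends too early
    return None  # common sequence not found within the allowed substitutions


COMMON = "GATTACA"  # hypothetical placeholder, not a real common sequence

# One substitution in the common sequence (GATTGCA), full 12 nt UMI.
full = split_read("TTTTCCCC" + "GATTGCA" + "ACGTACGTACGT" + "GG", COMMON)
# Read ends 4 nt into the UMI, so it is rejected.
short = split_read("TTTTCCCC" + "GATTACA" + "ACGT", COMMON)
print(full, short)
```

Reads like the second one are the ones that would be counted as not matching the regex in the log file.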

TomSmithCGAT · 2022-02-21T09:24:54Z

Just to add, looks like @IanSudbery has previously addressed this exact question here: https://www.biostars.org/p/9469084/

TomSmithCGAT closed this as completed Mar 7, 2021

TomSmithCGAT mentioned this issue Feb 21, 2022

Error: UMI lengths are not the same! #515

Closed
