AssertionError: not all umis are the same length(!): 4 - 5 #461

Closed
MonalisaHota opened this issue Mar 4, 2021 · 7 comments

@MonalisaHota

Hello,
I am trying to run UMI-tools to remove duplicate reads based on their UMIs, but I get the error "AssertionError: not all umis are the same length(!): 4 - 5". Can anyone suggest how to resolve it?

I am using the following command:

umi_tools dedup --stdin=filtered.bam --umi-separator=":" --log=LOGFILE --output-stats=stats.txt -S output.bam > OUTFILE

This is the head of my bam file:

A00609:116:H7JCGDSXY:1:2171:13937:29825 16 1 10534 255 3S96M2D21M31S * 0 0 CGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCCCCCATGTACTCTGCGTTGATACCACTGCTT FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFF:FF:FFFFFFF:FFFFFFF:FFFFF:FFFF NH:i:1 HI:i:1 AS:i:103 nM:i:3 RE:A:I xf:i:0 li:i:0 CR:Z:CCGTGAGAGAACGCGT CY:Z::FF:F:FF:FFF,F:F CB:Z:CCGTGAGAGAACGCGT-1 UR:Z:TTTTCCGCACTT UY:ZG:Z:cells:0:1:H7JCGDSXY:1

I am getting the following error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/dedup.py", line 329, in main
    reads, umis, umi_counts = processor(
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/network.py", line 419, in __call__
    clusters = self.UMIClusterer(counts, threshold)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/umi_tools/network.py", line 367, in __call__
    assert max(len_umis) == min(len_umis), (
AssertionError: not all umis are the same length(!): 4 - 5

@TomSmithCGAT
Member

By default, umi_tools dedup expects the UMI +/- cell barcode to be encoded in the read name. E.g. with --umi-separator=":", it will take 29825 in the read above to be the UMI. The assertion error itself is only a symptom: the real issue is that the command does not tell umi_tools where the UMI is actually encoded.
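
To illustrate why part of the read name ends up being treated as a UMI, here is a minimal Python sketch of read-name extraction. It assumes the UMI is simply the last field after the separator. The helper `umi_from_read_name` and the second read name are hypothetical and only there for illustration; this is not the umi_tools source code.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the last separator-delimited field of a read name."""
    return read_name.split(separator)[-1]

# Read name taken from the BAM record in the question.
name = "A00609:116:H7JCGDSXY:1:2171:13937:29825"
# Hypothetical second read whose last coordinate has only 4 digits.
other = "A00609:116:H7JCGDSXY:1:2171:13937:2982"

umi = umi_from_read_name(name, separator=":")
print(umi)  # 29825

# Flow-cell coordinates have varying numbers of digits, so the
# "UMIs" obtained this way have mixed lengths.
lengths = sorted({len(umi_from_read_name(n, ":")) for n in (name, other)})
print(lengths)  # [4, 5]
```

With coordinates of 4 and 5 digits, the mixed lengths match the "4 - 5" reported in the assertion.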

It looks like you have output from Cell Ranger, with the UMI in BAM tag UR and the cell barcode in BAM tags CR and CB. According to the Cell Ranger documentation, CR is the cell barcode taken directly from the fastq sequence and CB is the barcode after error correction, so I assume you'll want to use CB.

If you add the following to your command, umi_tools will deduplicate the reads in each cell independently, which I assume is what you want to achieve. You can drop the --cell-tag if not.

--extract-umi-method=tag --umi-tag=UR --cell-tag=CB
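
As a rough picture of what tag-based, per-cell deduplication means, the sketch below reads the UR and CB tags from the optional fields of a record and counts reads per (cell, position, UMI). The function names and the second cell barcode are hypothetical. It also only groups identical UMIs, which is a simplification: umi_tools additionally merges similar UMIs, and that step is not reproduced here.

```python
from collections import defaultdict

def parse_tags(optional_fields):
    """Turn SAM optional fields ("TAG:TYPE:VALUE") into a dict."""
    tags = {}
    for field in optional_fields:
        # maxsplit=2 because the value itself may contain ":".
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags

def count_by_cell_and_umi(records, umi_tag="UR", cell_tag="CB"):
    """Count reads sharing the same cell barcode, position and UMI."""
    counts = defaultdict(int)
    for position, optional_fields in records:
        tags = parse_tags(optional_fields)
        counts[(tags[cell_tag], position, tags[umi_tag])] += 1
    return dict(counts)

# First two records reuse the tags from the question; the third
# uses a made-up cell barcode to show that cells are kept apart.
records = [
    (10534, ["CB:Z:CCGTGAGAGAACGCGT-1", "UR:Z:TTTTCCGCACTT"]),
    (10534, ["CB:Z:CCGTGAGAGAACGCGT-1", "UR:Z:TTTTCCGCACTT"]),
    (10534, ["CB:Z:AAACCCAAGAAACACT-1", "UR:Z:TTTTCCGCACTT"]),
]
counts = count_by_cell_and_umi(records)
print(counts[("CCGTGAGAGAACGCGT-1", 10534, "TTTTCCGCACTT")])  # 2
print(len(counts))  # 2
```

The two reads from the same cell collapse into one group, while the read with the same UMI in a different cell stays separate, which is the effect of passing --cell-tag.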

Also note that running --output-stats can slow down the processing and increase memory usage considerably. If you find this is the case, you may want to collect stats on a suitable subset of the data.

@MonalisaHota
Author

Thank you so much for your quick response. I tried with these options and the dedup ran fine.

@TomSmithCGAT
Member

TomSmithCGAT commented Feb 21, 2022

Hi @iammrtza. In the 3 lines above, the UMIs are all different lengths (the UMI is TCCCCGCCC in the first line), so the error message appears to be correct. Overall, it appears your UMI lengths range from 4 to 12. They all need to be the same length.
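
A quick way to check this before running dedup is to tally the UMI lengths found in the read names. The sketch below assumes the UMI was appended to the read name after a "_" separator. The read names are made up, apart from the TCCCCGCCC UMI mentioned above, and `umi_length_counts` is a hypothetical helper.

```python
from collections import Counter

def umi_length_counts(read_names, separator="_"):
    """Count how many read names carry a UMI of each length."""
    return Counter(len(name.rsplit(separator, 1)[-1]) for name in read_names)

# Hypothetical read names with UMIs of 9, 4 and 12 nt.
names = ["read1_TCCCCGCCC", "read2_ACGT", "read3_ACGTACGTACGT"]
counts = umi_length_counts(names)
print(sorted(counts))   # [4, 9, 12]

# More than one distinct length means dedup will hit the assertion.
mixed = len(counts) > 1
print(mixed)            # True
```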

What was the command you used to extract the UMIs from the fastq?

@iammrtza
Copy link

Hi Tom,

My Illumina read structure is depicted here. After removing adapter/junk (using cutadapt), I grepped for the "common sequence"; what was left was the UMI, and I added it to the read headers in the FASTA file. Then I removed the "common sequence" using cutadapt again.

@iammrtza

Is it OK if I add dummy letters (e.g. A) to the shorter UMIs so that they reach the maximum length of 12? (In my Illumina reads, the longest UMI should be 12 nt.)

@TomSmithCGAT
Member

TomSmithCGAT commented Feb 21, 2022

I think your strategy is probably suboptimal. As far as I understand what you're doing, the potential issues are:

  1. Cutadapt is unaware that you have UMIs and could in theory remove part of them from your reads.
  2. Cutadapt will also struggle to identify junk unless it's very clearly so, e.g. low sequence quality.
  3. I expect your grep is not allowing for any errors in your common sequence.

The read structure is identical with respect to the common sequence and UMI length, so you can use umi_tools extract to perform all these steps in one go. The following should work, with COMMON_SEQUENCE replaced by the expected common sequence. Sequences matching the regex groups discard_1 and discard_2 will be removed. Hence, umi_tools extract replaces all 3 of your steps above.

UMI-tools uses the regex package rather than base re. The {s<=1} after the common sequence in the regex allows up to one substitution error. See https://umi-tools.readthedocs.io/en/latest/regex.html for more details on how to specify the regex (you can ignore the stuff about cell barcodes, that's for single-cell applications).

umi_tools extract \
  --extract-method=regex \
  --bc-pattern="(?P<discard_1>COMMON_SEQUENCE){s<=1}(?P<umi_1>.{12})(?P<discard_2>.*)" \
  -L extract.log

This will demand the UMIs are 12 nt. If your read length is insufficiently long to get through the sRNA, common sequence and UMI, you may find many of your reads don't match the regex. This will be reported in the log file.
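
To make the role of the named groups concrete, here is a small Python sketch using the standard `re` module. Base `re` has no fuzzy matching, so the `{s<=1}` part is left out and the common sequence must match exactly. The common sequence, the example reads, and the `extract` helper are all hypothetical. Keeping whatever precedes the match as the read is an assumption made for this illustration, not a statement about how umi_tools is implemented.

```python
import re

# Hypothetical common sequence; replace with the real one.
COMMON = "TGGAATTCTCGG"

# Same group names as in the --bc-pattern above, without fuzzy matching.
PATTERN = re.compile(
    rf"(?P<discard_1>{COMMON})(?P<umi_1>.{{12}})(?P<discard_2>.*)"
)

def extract(read):
    """Return (umi, kept_sequence), or None if the read does not match."""
    match = PATTERN.search(read)
    if match is None:
        return None
    kept = read[:match.start("discard_1")]  # sequence before the common part
    return match.group("umi_1"), kept

# Made-up read: insert + common sequence + 12 nt UMI + trailing bases.
full_read = "ACGTACGTACGTACGTACGT" + COMMON + "AAAACCCCGGGG" + "TTTT"
print(extract(full_read))   # ('AAAACCCCGGGG', 'ACGTACGTACGTACGTACGT')

# Read that ends before the full 12 nt UMI: no match.
short_read = "ACGT" + COMMON + "AAAA"
print(extract(short_read))  # None
```

The second case mirrors the caveat above: a read that ends before the full 12 nt UMI fails to match.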

@TomSmithCGAT
Member

Just to add, it looks like @IanSudbery has previously addressed this exact question here: https://www.biostars.org/p/9469084/
