Identification of P evermanni transcript fasta #1793
Comments
I did also try googling, and I found someone with the exact same issue, but there was no resolution.
Do not collapse; use the FASTA as is.
Is this FASTA from bedtools? Extract only the mRNAs.
I see, that should fix it! Can I ask why we're using only mRNA? My understanding is that CDS are protein-coding sequences, so I was thinking that CDS sequences were kind of a proxy for mRNA, since presumably any protein-coding sequence would eventually be transcribed/translated into mRNA. Is the difference just that CDS still contains untranslated regions that we don't want?
This is usually the nature of scaffolded assemblies, and is why the term "scaffold" is used instead of "contig." Briefly, "contig" is short for "contiguous." As you might guess, a contig is a contiguous sequence with no gaps. A scaffold is sort of the structure on the way to figuring out the entire sequence (contig). As such, it's kind of a placeholder: there's sequencing data indicating that the 5' and 3' sequences are part of the same region of DNA, but there's not enough sequencing data to sequence the entire stretch of DNA. Thus, the two ends of the DNA fragment which were sequenced are "joined" by a stretch of Ns. BTW, this info is certainly not meant to be condescending! I'm just not sure how much you (or others in the lab who might encounter this issue) know about how genome assemblies are generated and represented.
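To make the scaffold gaps concrete, here is a minimal Python sketch that reports where the runs of Ns sit in a sequence. The function name `find_n_runs` and the example scaffold are made up for illustration; they are not taken from the P. evermanni assembly or from any tool mentioned in this thread.

```python
import re

def find_n_runs(sequence):
    """Return (start, end) pairs for each run of N or n.

    Coordinates are 0-based and the end is exclusive.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]

# Hypothetical scaffold: two sequenced ends joined by a gap of 10 Ns.
scaffold = "ACGTAC" + "N" * 10 + "GGTCA"
print(find_n_runs(scaffold))  # [(6, 16)]
```

Running something like this over each record in an extracted FASTA would show whether the Ns are internal gaps of the kind described above.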
Oh interesting, that's super helpful! So, if that means the mRNA fasta has been extracted correctly, then the Ns shouldn't be a problem for pseudoalignment, right? My basic understanding of kallisto pseudoalignment is that it avoids directly aligning reads to a reference by instead breaking reads into k-mers and then mapping them to identify which reference sequence the read likely originated from. It seems like internal stretches of Ns (or the "pseudorandom nucleotides" they're replaced with) shouldn't interfere with that process.
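The k-mer idea described above can be illustrated with a toy Python sketch. This is not kallisto's implementation (kallisto builds a de Bruijn graph over the transcriptome and works with equivalence classes); it only shows the general principle of looking up a read's k-mers and intersecting the sets of transcripts that contain them. All names (`build_index`, `compatible_transcripts`, `tx1`, `tx2`) are hypothetical, and two simplifications are assumptions of this sketch: k-mers containing N are left out of the index, and read k-mers absent from the index are skipped.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            if "N" not in kmer:  # simplification: ignore k-mers spanning a gap
                index[kmer].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all read k-mers found in the index."""
    result = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if not hits:
            continue  # simplification: skip k-mers not present in the index
        result = set(hits) if result is None else result & hits
    return result or set()

# Hypothetical transcripts; tx1 has an internal stretch of Ns.
transcripts = {
    "tx1": "ACGTACGGTNNNNNTTGACCA",
    "tx2": "TTGACCAGGCTA",
}
index = build_index(transcripts, k=5)
print(compatible_transcripts("ACGTACGG", index, k=5))  # {'tx1'}
```

In this toy, a read that falls entirely on one side of the N stretch is still assigned to `tx1`, which matches the intuition that internal gaps only affect reads overlapping the gap itself.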
Can you please share the code you used (link to file in GitHub repo) to extract mRNA seqs?
Sure, here's the current .md file
Kallisto indexing problem is now being addressed in #1795 |
looking more at it I think the meaning it looks like |
I've been trying to use fastx_collapser to collapse a CDS fasta file to unique sequences only (so I can build a kallisto index with it). However, I've been getting the persistent error:
"fastx_collapser: Invalid input: This looks like a multi-line FASTA file.
Line 149 contains a nucleotides string instead of a '>' prefix."
I examined the first several lines of my CDS fasta and, while it looked like it was already single-line, I ran it through fasta_format to format it as a single-line fasta just to be sure. However, I'm still getting the same error. I also looked specifically at Line 149 and it starts with a '>' before the sequence ID and the following sequence is single-line...
Does anyone have any idea why fastx_collapser doesn't want to run on my CDS fasta file? Maybe someone's had this issue before?
file linked here, can also see a screenshot of relevant section below
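For anyone hitting a similar error, one way to double-check the "is it really single-line?" question is to scan the file for anything that breaks the strict header/sequence alternation. The Python sketch below is an independent check written for illustration; it is not how fastx_collapser validates its input, and the function name `check_single_line_fasta` is hypothetical. It also flags carriage-return line endings and blank lines, on the assumption that hidden characters are worth ruling out.

```python
def check_single_line_fasta(lines):
    """Return (line_number, problem) pairs for lines that break the
    strict '>header' / single sequence line alternation."""
    problems = []
    previous_was_header = False
    for number, raw in enumerate(lines, start=1):
        if raw.endswith("\r\n") or raw.endswith("\r"):
            problems.append((number, "carriage return line ending"))
        line = raw.rstrip("\r\n")
        if not line:
            problems.append((number, "blank line"))
            continue
        if line.startswith(">"):
            if previous_was_header:
                problems.append((number, "header follows a header with no sequence"))
            previous_was_header = True
        else:
            if not previous_was_header:
                problems.append((number, "sequence line not preceded by a header"))
            previous_was_header = False
    return problems

# Made-up example: record "a" is wrapped over two sequence lines.
example = [">a\n", "ACGT\n", "ACGT\n", ">b\n", "GGNNCC\n"]
print(check_single_line_fasta(example))
# [(3, 'sequence line not preceded by a header')]
```

To run it on a real file, pass an open handle, for example `open(path, newline="")` so that carriage returns are preserved rather than translated; `path` here stands for whatever FASTA is being checked.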