Identification of P evermanni transcript fasta #1793

Closed
shedurkin opened this issue Jan 31, 2024 · 12 comments
Comments

@shedurkin
Contributor

shedurkin commented Jan 31, 2024

I've been trying to use fastx_collapser to collapse a CDS fasta file to unique sequences only (so I can build a kallisto index with it). However, I've been getting the persistent error:

"fastx_collapser: Invalid input: This looks like a multi-line FASTA file.
Line 149 contains a nucleotides string instead of a '>' prefix."

I examined the first several lines of my CDS fasta and, while it looked like it was already single-line, I ran it through fasta_format to format it as a single-line fasta just to be sure. However, I'm still getting the same error. I also looked specifically at Line 149 and it starts with a '>' before the sequence ID and the following sequence is single-line...

Does anyone have any idea why fastx_collapser doesn't want to run on my CDS fasta file? Maybe someone's had this issue before?

file linked here, can also see a screenshot of relevant section below
Screenshot (227)
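
One way to double-check the "multi-line FASTA" complaint independently of fastx_collapser is to scan the file for any line that breaks a strict header/sequence alternation. The sketch below is only an illustration: the function name `find_layout_problems` is hypothetical, and it does not reproduce whatever check fastx_collapser performs internally, so a clean result here does not guarantee the tool will accept the file.

```python
def find_layout_problems(lines):
    """Return (line_number, reason) pairs for lines that break the strict
    two-line layout: header, sequence, header, sequence, ..."""
    problems = []
    expect_header = True
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        is_header = line.startswith(">")
        if expect_header and not is_header:
            # A second sequence line in a row, i.e. a wrapped record.
            problems.append((number, "sequence where a header was expected"))
        elif not expect_header and is_header:
            # Two headers in a row; this header still awaits its sequence.
            problems.append((number, "header with no sequence before it"))
        else:
            expect_header = not expect_header
    return problems


# Small in-memory example; for a real file, pass an open file handle instead.
example = [">a\n", "ACGT\n", ">b\n", "AC\n", "GT\n"]
for number, reason in find_layout_problems(example):
    print(f"line {number}: {reason}")
```

Because the function only needs an iterable of lines, it can be run over a large file without loading it all into memory.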

@shedurkin
Contributor Author

I did also try googling, and I found someone with the exact same issue, but there was no resolution

@sr320
Member

sr320 commented Jan 31, 2024

Do not collapse- use fasta as is.

@shedurkin
Contributor Author

shedurkin commented Jan 31, 2024

When I originally ran kallisto index on my un-collapsed CDS fasta file I received the error:

Error: repeated name in FASTA file /home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/data/Porites_evermanni_CDS.fasta
Porites_evermani_scaffold_1:486174-486885
Run with --make-unique to replace repeated names with unique names

I was able to look at the CDS gff file and see that duplicate listing is real (screenshot below), so I assumed there could be other duplicates that would require collapsing
Screenshot (228)
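
To see how widespread the repeated names are before deciding how to handle them, the headers can be tallied directly. This is a minimal sketch with a hypothetical helper name, `repeated_names`; it assumes the record name is the first whitespace-delimited token after `>`, which may not match how kallisto itself parses names.

```python
from collections import Counter


def repeated_names(lines):
    """Map each FASTA record name that occurs more than once to its count."""
    names = [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]
    return {name: count for name, count in Counter(names).items() if count > 1}


# Toy input reusing the name from the error message above.
example = [
    ">Porites_evermani_scaffold_1:486174-486885\n",
    "ACGT\n",
    ">Porites_evermani_scaffold_1:486174-486885\n",
    "ACGT\n",
    ">some_other_record\n",
    "GGCC\n",
]
print(repeated_names(example))
```

Identical names do not necessarily mean identical sequences, so comparing the sequences behind each repeated name is a sensible next step.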

@sr320
Member

sr320 commented Jan 31, 2024

This Fasta is from bedtools? Only get mRNAs

@shedurkin
Contributor Author

I see, that should fix it! Can I ask why we're using only mRNA? My understanding is that CDS are protein-coding sequences, so I was thinking that CDS sequences were kind of a proxy for mRNA, since presumably any protein-coding sequence would eventually be transcribed/translated into mRNA. Is the difference just that CDS still contains untranslated regions that we don't want?

@shedurkin
Contributor Author

I extracted only mRNA lines in the gff, used the gff and a scaffold fasta to build a fasta file of every mRNA sequence, and then tried building a kallisto index with that fasta. I was able to build an index, but I'm getting another issue now.

During indexing I was warned: [build] warning: replaced 10722052 non-ACGUT characters in the input sequence with pseudorandom nucleotides

Since the file contains 40,389 sequences, that means kallisto index is identifying an average of about 265 non-ACGUT characters per sequence, which seems super concerning. The fasta file is too big to look at all at once, but I copied some of the first few sequences into a google doc to try character-searching for non-ACGUT characters, and several of the first few sequences have very long stretches of just Ns in the middle of the read (example pic below). Does anyone have any idea why this is happening?
Looking through the gff, some of the mRNAs also look to be much longer than I would expect -- could that be related?
Screenshot (229)
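
Rather than pasting sequences into a document, the non-ACGUT content can be summarized per record, which shows whether the roughly 265-per-sequence average is spread evenly or concentrated in a few records with long N runs. The helper names `read_fasta` and `ambiguity_report` below are hypothetical, and the counting rule (anything other than A, C, G, T, U, case-insensitive) is an assumption about what the kallisto warning counts.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA lines, joining wrapped lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ambiguity_report(lines):
    """Yield (name, length, non_acgtu_count, longest_n_run) per record."""
    for name, sequence in read_fasta(lines):
        upper = sequence.upper()
        non_acgtu = sum(1 for base in upper if base not in "ACGTU")
        longest_n = max((len(m.group()) for m in re.finditer(r"N+", upper)), default=0)
        yield name, len(upper), non_acgtu, longest_n


example = [">t1", "ACGTNNNNNACGT", ">t2", "ACGU"]
for row in ambiguity_report(example):
    print(row)
```

Sorting the rows by the third or fourth column would surface the records that contribute most of the replaced characters.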

@kubu4
Contributor

kubu4 commented Jan 31, 2024

"long stretches of just Ns in the middle of the read"

This is usually the nature of scaffolded assemblies, and is why the term "scaffold" is used instead of "contig."

Briefly, a "contig" is an abbreviation for "contiguous." As you might be able to guess, a "contig" is a contiguous sequence, with no gaps.

A "scaffold" is sort of the structure on the way to figuring out the entire sequence (contig). As such, it's kind of a placeholder. There's sequencing data that indicates that the 5' and 3' sequences are part of the same region of DNA; however, there's not enough sequencing data to sequence the entire stretch of DNA. Thus, the two ends of the DNA fragment which were sequenced are "joined" by a stretch of Ns to indicate that the nucleotides between the 5' and 3' ends of the sequencing data are not known.

BTW, this info is certainly not meant to be condescending! Just am not sure how much you (or others in the lab who might encounter this issue) know about how genome assemblies are generated and represented.
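
The contig/scaffold relationship described above can be made concrete by cutting a scaffold back into its gap-free pieces at each run of Ns. This is a toy sketch: `split_scaffold` and its `min_gap` parameter are hypothetical names, and real assemblies may encode gaps of unknown size with a fixed-length N run, so the gap length here should not be read as a measured distance.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffold at runs of at least `min_gap` Ns.

    Returns (start, end, contig) tuples with 0-based, half-open coordinates.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = []
    position = 0
    for gap in gap_pattern.finditer(sequence):
        if gap.start() > position:
            contigs.append((position, gap.start(), sequence[position:gap.start()]))
        position = gap.end()
    if position < len(sequence):
        contigs.append((position, len(sequence), sequence[position:]))
    return contigs


# Two sequenced ends joined by a placeholder gap of four Ns.
print(split_scaffold("ACGTNNNNGGCC"))
```

Any feature extracted from a scaffold by coordinates will carry along whatever N gaps fall inside those coordinates, which is consistent with Ns showing up in the middle of extracted sequences.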

@shedurkin
Contributor Author

Oh interesting, that's super helpful! So, if that means the mRNA fasta has been extracted correctly, then the Ns shouldn't be a problem for pseudoalignment, right? My basic understanding of kallisto pseudoalignment is that it avoids directly aligning reads to a reference by instead breaking reads into k-mers and then mapping them to identify which reference sequence the read likely originated from. It seems like internal stretches of Ns (or the "pseudorandom nucleotides" they're replaced with) shouldn't interfere with that process.
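
The k-mer idea in the comment above can be sketched as a toy lookup: index every k-mer of each reference sequence, then intersect the sets of references hit by a read's k-mers. This is not kallisto's implementation (which builds a de Bruijn graph with equivalence classes); the function names are hypothetical, reverse complements are ignored, and skipping N-containing k-mers is a simplification, since kallisto reports replacing such characters with pseudorandom nucleotides instead.

```python
def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(transcripts, k):
    """Map each N-free k-mer to the set of transcript names containing it."""
    index = {}
    for name, sequence in transcripts.items():
        for kmer in kmers(sequence.upper(), k):
            if "N" in kmer:
                continue  # toy choice: k-mers touching a gap are not indexed
            index.setdefault(kmer, set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every indexed k-mer in the read."""
    compatible = None
    for kmer in kmers(read.upper(), k):
        hits = index.get(kmer)
        if hits is None:
            continue  # k-mer not in the index; ignored in this toy version
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


transcripts = {
    "tx1": "ACGTACGGTTNNNNNNCCATGCAAT",
    "tx2": "TTTTGGGGCCCCAAAA",
}
index = build_index(transcripts, k=5)
print(compatible_transcripts("CCATGCA", index, k=5))    # read from a gap-free flank
print(compatible_transcripts("GGTTACCCA", index, k=5))  # read spanning the gap
```

In this toy setup, reads drawn from gap-free flanks still find their transcript, while reads that span the unknown region find nothing to match, which is the intuition behind expecting internal Ns to cost some reads rather than break the process.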

@kubu4
Contributor

kubu4 commented Jan 31, 2024

Can you please share code you used (link to file in GitHub repo) to extract mRNA seqs?

@shedurkin
Contributor Author

sure, here's the current .md file

@kubu4
Contributor

kubu4 commented Jan 31, 2024

Kallisto indexing problem is now being addressed in #1795

@sr320
Member

sr320 commented Feb 1, 2024

Looking more at it, I think the mRNA feature is misleading and may actually be the gene. The easiest and funnest way to proceed is to load the evermanni gff in IGV to decide how to get a fasta file of evermanni transcripts <- ultimate goal

Meaning it looks like mRNA has introns, which would be a problem for alignment.
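
If the mRNA features do span introns, one possible route to intron-free transcripts is to concatenate each transcript's child features instead of extracting the mRNA span. The sketch below rests on assumptions that would need checking against the actual evermanni gff (for example in IGV): that child features are typed `exon`, that they point to their transcript with a GFF3-style `Parent=` attribute, and that coordinates are 1-based and inclusive. The function names are hypothetical.

```python
def reverse_complement(sequence):
    """Reverse-complement a DNA string, leaving N as N."""
    return sequence.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def spliced_transcripts(gff_lines, scaffolds, feature_type="exon"):
    """Concatenate child features per Parent ID into intron-free sequences.

    `scaffolds` maps scaffold name to sequence. The feature type and the
    Parent= attribute are assumptions about the annotation's layout.
    """
    parts_by_parent = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        parent = attributes.get("Parent")
        if parent is None:
            continue
        part = (fields[0], int(fields[3]), int(fields[4]), fields[6])
        parts_by_parent.setdefault(parent, []).append(part)

    transcripts = {}
    for parent, parts in parts_by_parent.items():
        parts.sort(key=lambda part: part[1])  # order by start coordinate
        seqid, strand = parts[0][0], parts[0][3]
        # GFF is 1-based inclusive, so start - 1 converts to a Python slice.
        sequence = "".join(scaffolds[seqid][start - 1:end] for _, start, end, _ in parts)
        transcripts[parent] = reverse_complement(sequence) if strand == "-" else sequence
    return transcripts


scaffolds = {"s1": "AAACCCGGGTTTACGT"}
gff = [
    "s1\tsrc\texon\t1\t3\t.\t+\t.\tParent=tx1",
    "s1\tsrc\texon\t7\t9\t.\t+\t.\tParent=tx1",
]
print(spliced_transcripts(gff, scaffolds))
```

Comparing the length of each spliced sequence against its mRNA span would give a quick indication of how much intronic sequence the mRNA features include.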

@sr320 changed the title from "Issue using fastx_collapser" to "Identification of P evermanni transcript fasta" on Feb 1, 2024
@sr320 closed this as completed on Mar 22, 2024