Identification of P evermanni transcript fasta #1793

shedurkin · 2024-01-31T01:50:48Z

I've been trying to use fastx_collapser to collapse a CDS fasta file to unique sequences only (so I can build a kallisto index with it). However, I've been getting the persistent error:

"fastx_collapser: Invalid input: This looks like a multi-line FASTA file.
Line 149 contains a nucleotides string instead of a '>' prefix."

I examined the first several lines of my CDS fasta and, while it looked like it was already single-line, I ran it through fasta_format to format it as a single-line fasta just to be sure. However, I'm still getting the same error. I also looked specifically at Line 149 and it starts with a '>' before the sequence ID and the following sequence is single-line...
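One way to double-check the single-line claim without eyeballing the file is to scan every record and report any whose sequence spans more than one line. This is only a minimal sketch: the function name `find_wrapped_records` is made up for illustration, and it says nothing about how fastx_collapser itself parses the file.

```python
def find_wrapped_records(lines):
    """Return (header_line_number, header) for records whose sequence spans 2+ lines."""
    wrapped = []
    header = None
    header_line_no = 0
    seq_lines = 0
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # tolerate Unix and Windows line endings
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header, header_line_no, seq_lines = line[1:], line_no, 0
        else:
            seq_lines += 1
            # Report each wrapped record once, when its second sequence line appears
            if seq_lines == 2 and header is not None:
                wrapped.append((header_line_no, header))
    return wrapped


# Toy input: record "b" is wrapped over two lines
example = [">a\n", "ACGT\n", ">b\n", "ACGT\n", "TTGA\n", ">c\n", "GG\n"]
print(find_wrapped_records(example))  # [(3, 'b')]
```

For a real file, the same function can be given an open file handle, e.g. `find_wrapped_records(open(path))`, where `path` is whatever fasta you are checking.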

Does anyone have any idea why fastx_collapser doesn't want to run on my CDS fasta file? Maybe someone's had this issue before?

file linked here, can also see a screenshot of relevant section below

shedurkin · 2024-01-31T01:52:40Z

I did also try googling, and I found someone with the exact same issue, but there was no resolution

sr320 · 2024-01-31T01:57:11Z

Do not collapse- use fasta as is.

shedurkin · 2024-01-31T03:17:53Z

When I originally ran kallisto index on my un-collapsed CDS fasta file I received the error:

Error: repeated name in FASTA file /home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/data/Porites_evermanni_CDS.fasta
Porites_evermani_scaffold_1:486174-486885
Run with --make-unique to replace repeated names with unique names

I was able to look at the CDS gff file and see that duplicate listing is real (screenshot below), so I assumed there could be other duplicates that would require collapsing
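To see how many names are repeated rather than guessing, the headers can be tallied directly. A minimal sketch, assuming the sequence name is the header text up to the first whitespace (the helper name `duplicated_fasta_names` is hypothetical):

```python
from collections import Counter


def duplicated_fasta_names(lines):
    """Return {name: count} for every FASTA name that appears more than once."""
    names = Counter(
        line[1:].split()[0]  # name = header text up to the first whitespace (assumption)
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    )
    return {name: n for name, n in names.items() if n > 1}


# Toy input reusing the name reported in the error above
example = [
    ">Porites_evermani_scaffold_1:486174-486885\n", "ACGT\n",
    ">Porites_evermani_scaffold_1:486174-486885\n", "ACGT\n",
    ">other\n", "GG\n",
]
print(duplicated_fasta_names(example))
# {'Porites_evermani_scaffold_1:486174-486885': 2}
```

An empty result would mean every name is already unique.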

sr320 · 2024-01-31T03:59:52Z

This Fasta is from bedtools? Only get mRNAs

shedurkin · 2024-01-31T18:56:56Z

I see, that should fix it! Can I ask why we're using only mRNA? My understanding is that CDS are protein-coding sequences, so I was thinking that CDS sequences were kind of a proxy for mRNA, since presumably any protein coding sequence would eventually be transcribed/translated into mRNA. Is the difference just that CDS still contains untranslated regions that we don't want?

shedurkin · 2024-01-31T20:50:53Z

I extracted only mRNA lines in the gff, used the gff and a scaffold fasta to build a fasta file of every mRNA sequence, and then tried building a kallisto index with that fasta. I was able to build an index, but I'm getting another issue now.
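The "extract only mRNA lines" step can be sketched as a filter on the third (feature type) column of the GFF. This is an illustration only, not the code actually used in this issue: the function name and the toy GFF lines are made up.

```python
def filter_gff_features(lines, feature_type="mRNA"):
    """Keep GFF records whose third column (feature type) equals feature_type."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == feature_type:
            kept.append(line.rstrip("\n"))
    return kept


# Hypothetical GFF content, for illustration only
gff = [
    "##gff-version 3",
    "scaffold_1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1",
    "scaffold_1\tsrc\tCDS\t100\t300\t.\t+\t0\tParent=tx1",
    "scaffold_1\tsrc\tCDS\t600\t900\t.\t+\t0\tParent=tx1",
]
for record in filter_gff_features(gff):
    print(record)
```

Note that in this toy example the single mRNA record spans 100-900, while the two CDS records cover only 100-300 and 600-900, so a sequence cut from the mRNA coordinates would include whatever lies between them.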

During indexing I was warned: [build] warning: replaced 10722052 non-ACGUT characters in the input sequence with pseudorandom nucleotides

Since the file contains 40,389 sequences, that means kallisto index is identifying an average of 265 non-ACGUT characters per sequence, which seems super concerning. The fasta file is too big to look at all at once, but I copied some of the first few sequences into a Google Doc to try character-searching for non-ACGUT characters, and several of the first few sequences have very long stretches of just Ns in the middle of the read (example pic below). Does anyone have any idea why this is happening?
Looking through the gff, some of the mRNAs also look to be much longer than I would expect -- could that be related?
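Rather than searching by hand, the non-ACGUT characters and the longest run of Ns can be counted per sequence. A minimal sketch with hypothetical helper names; treating lowercase bases as valid is an assumption made here, not something confirmed about kallisto.

```python
import re

# Anything that is not A, C, G, U or T (either case; the case handling is an assumption)
NON_ACGUT = re.compile(r"[^ACGUTacgut]")
N_RUN = re.compile(r"[Nn]+")


def count_non_acgut(seq):
    """Number of characters in seq that are not A, C, G, U or T."""
    return len(NON_ACGUT.findall(seq))


def longest_n_run(seq):
    """Length of the longest uninterrupted run of N in seq (0 if there is none)."""
    return max((len(m.group()) for m in N_RUN.finditer(seq)), default=0)


# The average quoted above: replaced characters divided by number of sequences
print(round(10722052 / 40389, 1))  # 265.5

toy = "ACGTNNNNNNACGTNNAC"
print(count_non_acgut(toy), longest_n_run(toy))  # 8 6
```

Because an average can hide a skewed distribution, looking at the per-sequence counts would show whether a few sequences carry most of the Ns.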

kubu4 · 2024-01-31T21:21:29Z

> long stretches of just Ns in the middle of the read

This is usually the nature of scaffolded assemblies, and is why the term "scaffold" is used instead of "contig."

Briefly, a "contig" is an abbreviation for "contiguous." As you might be able to guess, a "contig" is a contiguous sequence, with no gaps.

A "scaffold" is sort of the structure on the way to figuring out the entire sequence (contig). As such, it's kind of a placeholder. There's sequencing data that indicates that the 5' and 3' sequences are part of the same region of DNA; however, there's not enough sequencing data to sequence the entire stretch of DNA. Thus, the two ends of the DNA fragment which were sequenced are "joined" by a stretch of Ns to indicate that the nucleotides between the 5' and 3' ends of the sequencing data are not known.
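The relationship between the two terms can be shown on a toy sequence: splitting a scaffold at its runs of N gives back the gap-free contigs it was built from. This is just a sketch; the function name and the `min_gap` parameter are made up for illustration.

```python
import re


def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold into contigs at every run of at least min_gap Ns."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces that appear when the scaffold starts or ends with a gap
    return [piece for piece in gap.split(scaffold) if piece]


toy_scaffold = "ACGTNNNNNNGGCCNTT"
print(split_scaffold(toy_scaffold))             # ['ACGT', 'GGCC', 'TT']
print(split_scaffold(toy_scaffold, min_gap=2))  # ['ACGT', 'GGCCNTT']
```

With `min_gap=2`, the single N is treated as an ambiguous base inside a contig rather than as a gap, which is why the second result keeps `GGCCNTT` in one piece.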

BTW, this info is certainly not meant to be condescending! Just am not sure how much you (or others in the lab who might encounter this issue) know about how genome assemblies are generated and represented.

shedurkin · 2024-01-31T21:44:04Z

Oh interesting, that's super helpful! So, if that means the mRNA fasta has been extracted correctly, then the Ns shouldn't be a problem for pseudoalignment, right? My basic understanding of kallisto pseudoalignment is that it avoids directly aligning reads to a reference by instead breaking reads into k-mers and then mapping them to identify which reference sequence the read likely originated from. It seems like internal stretches of Ns (or the "pseudorandom nucleotides" they're replaced with) shouldn't interfere with that process.
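The k-mer idea described above can be illustrated with a toy lookup table. This is a deliberately simplified sketch and not kallisto's implementation (kallisto works on a de Bruijn graph of the transcriptome and tracks equivalence classes); the function names, sequences and k value are all made up.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every read k-mer found in the index."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            continue  # simplification: k-mers missing from the index are ignored
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


toy_transcripts = {"tx1": "ACGTACGGTCA", "tx2": "TTGACCGTAAC"}
toy_index = build_index(toy_transcripts, k=5)
print(compatible_transcripts("GTACGGT", toy_index, k=5))  # {'tx1'}
```

In this toy model, a k-mer that overlaps a stretch of replaced Ns is just an index entry that no real read is expected to match, so it is simply never looked up.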

kubu4 · 2024-01-31T22:11:44Z

Can you please share code you used (link to file in GitHub repo) to extract mRNA seqs?

shedurkin · 2024-01-31T22:33:51Z

sure, here's the current .md file

kubu4 · 2024-01-31T23:56:21Z

Kallisto indexing problem is now being addressed in #1795

sr320 · 2024-02-01T00:02:52Z

Looking more at it, I think the mRNA feature is misleading and may actually be gene... Easiest and funnest way to proceed is to load the evermanni gff in IGV to decide how to get a fasta file of evermanni transcripts <- ultimate goal

meaning it looks like mRNA has introns which would be a problem for alignment
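The difference between cutting out the whole mRNA span and building a spliced transcript can be sketched on a toy scaffold. This is a minimal illustration, not the approach settled on in this issue: the function names and coordinates are made up, and it assumes exon (or CDS) coordinates are 1-based and inclusive, as in GFF.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence, leaving N as N."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def spliced_transcript(scaffold, exons, strand="+"):
    """Join exon sequences in genomic order.

    exons is a list of (start, end) pairs, 1-based and inclusive (GFF style).
    """
    pieces = [scaffold[start - 1:end] for start, end in sorted(exons)]
    seq = "".join(pieces)
    return reverse_complement(seq) if strand == "-" else seq


toy_scaffold = "AAACCCGGGTTTAAACCC"
toy_exons = [(4, 6), (10, 12)]  # hypothetical exons with one intron between them

print(toy_scaffold[4 - 1:12])                            # CCCGGGTTT (whole span, intron included)
print(spliced_transcript(toy_scaffold, toy_exons))       # CCCTTT (intron removed)
print(spliced_transcript(toy_scaffold, toy_exons, "-"))  # AAAGGG
```

The first line is what a single mRNA or gene interval yields; the second is closer to what an RNA-seq read derived from a processed transcript would look like.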

kubu4 mentioned this issue Jan 31, 2024

Kallisto indexing [build] warning: replaced 10722052 non-ACGUT characters in the input sequence with pseudorandom nucleotides #1795

Closed

sr320 changed the title ~~Issue using fastx_collapser~~ Identification of P evermanni transcript fasta Feb 1, 2024

sr320 closed this as completed Mar 22, 2024
