-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Excessive clustering observed for one gene #39
Comments
Hi @mathog , That's not entirely surprisingly at this "preclustering" stage. You have so far only run Please continue with the true family finding step as indicated in the tutorial which would be commands such as:
Once you have run family finding, please let me know how the dlp gene cluster looks. If it's still too relaxed, you can play with the parameters to increase k-mer percentage stringency. |
On 25-Jun-2018 12:45, Elizabeth Tseng wrote:
Hi @mathog ,
So you're saying that the cluster that contains the dlp gene (cluster
54527), in addition to correctly containing half the dlp gene, also
contained a whole bunch of other genes?
Yes
That's not entirely surprisingly at this "preclustering" stage. You
have so far only run `run_preCluster.py` which is actually a
pre-processing step of the attempt to find gene families. The
preclustering step is extremely generous, basically grouping
everything that has a minimap2 hit, and minimap2 can detect very low
identity (~60% sometimes low) partial hits amongst sequences.
Please continue with the true family finding step as indicated in the
[tutorial](https://github.com/Magdoll/Cogent/wiki/Running-Cogent#findbig)
which would be commands such as:
```
generate_batch_cmd_for_Cogent_family_finding.py --cpus=12
--cmd_filename=cmd preCluster.cluster_info.csv preCluster_out
test_human
```
Ran the equivalent. Then did
bash cmd
and it blew up after 20 directories. Just to see how this works the dlp
directory was located (54527) and did:
cd preCluster_out/54527
nice run_mash.py -k 30 --cpus=40 isoseq_flnc.fasta >rmp.log 2>&1 &
#13:03 - 16:54.
process_kmer_to_graph.py isoseq_flnc.fasta
isoseq_flnc.fasta.s1000k30.dist \
../../test_sp 54527 > pktg.log 2>&1 &
#about 7 min
ls -tal
161346067 Jun 25 17:07 54527.graphml
8301840 Jun 25 17:06 54527.partition.txt
236904627 Jun 22 13:18 isoseq_flnc.fasta
2068796284 Jun 25 16:54 isoseq_flnc.fasta.s1000k30.dist
9564315 Jun 25 17:06 pktg.log
3479564 Jun 25 16:54 rmp.log
ls -1 ../../test_sp | grep ^54527_ | wc -l
#43335
The next step is supposed to be
generate_batch_cmd_for_Cogent_reconstruction.py <dirname>
but putting 54527 in there fails in the top level (no such subdirectory)
or in test_sp (names are 54527_*). It runs in preCluster_out without
crashing
but didn't make any files (or give any warnings). At what location is
that
command supposed to be run?
Thanks,
David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
|
Hi @mathog , The last parameter in
is where the new output will be in. In this case, it's This means, assuming everything went through, you can run
If you are only interested in, say, cluster 54527 and don't care the other ones failed (you said it blew up after 20 directories, which I think should still be looked into), then as long as It'd still be nice to have all the family finding finish. There's a bash script in the tutorial that let's you find out failed dirs:
|
I used this script to run the final reconstruct_contig.py stage:
There were ~45000 subdirectories in test_sp, nearly all of them derived from 54527 (because the initial run blew up and didn't make many, and 54527 was run manually because it contained the test gene). It took 108 minutes to run. After that completed this was used to find the subdirectory for the dlp test case in the results and to characterize its contents:
So, essentially what Cogent did was to collapse classes of similar Trinity transcripts into single (or at least fewer) instances. Functionally that is pretty close to what EvidentialGene does - either way the result is a list of transcripts with more or less one per transcript class. What neither did though was to merge between the families. Path5 and path11 have an essentially identical region of 8kbp where they overlap. In this test case we have reason to believe that path11 is a reasonably accurate representation of the 5' end of (at least some) transcripts: it has what looks like a complete 5' end of the coding sequence for a protein matching (as well as can be expected) that in other organisms. The 5' end of path5 is missing the start of that enormous ORF and the first start in its large ORF is preceded by many starts, so it likely wouldn't code for anything. It is not clear at this point if the path5 transcripts indicate Trinity assembly errors or if they represent a real type of defective transcript. Brian Haas has been making good progress in getting Trinity to work better with polymorphic, non-single direction reads (what we have), and if that class stops appearing I won't be greatly surprised. dlp is a very hard case - the gene is ~400kbp long and the most recent Trinity (development branch) runs produced a few reasonable looking (full CDS, all transcript sequence mapping in order onto genomic sequence within the gene) transcripts over 15kbp in size. These are essentially the same as path5 from 865bp to the 3' end, but the 5' ends differ. Then there are the ~45k other test_sp subdirectories also derived from 54527 which presumably contain results for other genes, but perhaps also some dlp? (Commands below uses my extract and execinput programs, they are in drm_tools on sourceforge, or use perl, awk, etc.)
It looks like bits of dlp ended up in 10 other directories. The only known repeat in dlpRNA appears once in the transcript at about 7440bp and corresponds to a short genomic direct repeat, and that position isn't in any of these matches. |
Hi @mathog ,
What do you refer to by "families" here? Do you mean merge transcripts/isoforms of the same gene (same locus, different splice forms) or different genes (different loci)? The intention of Cogent was not to merge different genes (unless it can't help it because the two genes are way too similar). I'm trying to understand the relationship between path5 and path11. Do I understand correctly that: path11 matches completely to dlp gene 1-9288 bp. But is on the reverse strand? I should mention Cogent is strand-aware (because it's used for Iso-Seq data which is strand aware). It looks to me that the input transcripts may be flipped? Not that it matters much, but you may want to flip them to the transcript sense strand. I'm wondering out loud why path5 and path11 are not merged by Cogent, assembling effectively 1-13356bp of dlp gene (the whole gene). It seems to be it should be based on the mapped coordinates to dlp, unless there's some sequence difference between them that I can't tell based on this report. Can you show an alignment between path5 and path11? |
The dystrophin like protein (dlp) is being employed as a hard test case while building transcriptomes for several different sea urchin species. The full length dlp mRNA is more than 13kbp and the protein more than 4kaa. 22 RNAseq libraries were built using Trinity 2.6.6 and concatenated (with minor header renaming to avoid conflicts) into all_trin.fasta.
This large fasta file is the input for Cogent. It was searched for dlp pieces with Last like so:
which found 275 hits. The input sequences were clustered with:
There were no errors in the log file.
The largest dlp transcript in the Trinity results was TRINITYtrinLocDN33715c0g1t2
which was found in cluster 54527.
found 121 hits. So about half the hits that Last found are in this cluster. The missing hits are not the problem, the problem is all the other stuff in there:
In other words 7.3% of all sequences and 8.1% of all sequence landed in this cluster.
The Last results and cluster results are not consistent.
A plot of (genomic) tuple counts along the entirety of NCBI's dlp transcript does not show any repetitive sequences, which would have been prone to recruiting other sequences.
What explains this result?
The text was updated successfully, but these errors were encountered: