Issues running EvidenceModeler #55

juanlu16 · 2022-04-07T10:27:11Z

I have some questions about EVM and also some problems running it, and I was wondering if you could help. I would be extremely grateful for any help with them.

My question is this: I have several prediction files built from transcriptomes with PASA. I would just like to merge the predictions or alignments from these files, and I would prefer not to supply an ab initio gene prediction file. However, the gene prediction file is mandatory to run EVM. What can I do? I had thought about passing a file from the PASA results as the prediction file (--gene_predictions), but I don't know if this is the right thing to do.

I also have a problem. To test that EvidenceModeler works correctly on my computer, I launched a test with several PASA transcript files, but the run stops before finishing with the message "Out of memory!". What could be causing this?

Thank you very much in advance for your help. I am very grateful.

Best regards,

Juan Luis

brianjohnhaas · 2022-04-07T12:09:10Z

Hi Juan, responses below:

> I have several prediction files with transcriptomes from PASA. I would just like to merge the predictions or alignments from these files, and I would prefer not to supply an ab initio gene prediction file. However, the gene prediction file is mandatory to run EVM. What can I do? I had thought about passing a file from the PASA results as the prediction file (--gene_predictions), but I don't know if this is the right thing to do.

EVM primarily works off ab initio gene predictions and uses alignments (protein and transcript) to guide the selection of exons and introns. You mentioned that you would prefer not to use ab initio gene predictions. If you have PASA transcripts, you could run TransDecoder on them to predict ORFs and propagate them to genome structures.

> To test that EvidenceModeler works correctly on my computer, I launched a test with several PASA transcript files, but the run stops before finishing with the message "Out of memory!". What could be causing this?

EVM should come with a bunch of example data sets. Try running some of these to check that it works on your system. If you're getting out-of-memory errors, explore the EVM genome partitioning step and make the partitions smaller, so that each one still contains multiple complete genes but is not overwhelmed by the amount of data per genome partition.

Hope this helps,
~brian

-- -- Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas <http://broad.mit.edu/~bhaas>
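
The partitioning advice above can be pictured as cutting each contig into fixed-size windows that overlap their neighbours, so that a gene spanning a boundary is still seen whole in at least one window. The following is only a minimal sketch of that idea, not EVM's actual partitioning code; the function name `partition_ranges` and its parameters are hypothetical and merely mirror the segment-size and overlap settings discussed in this thread.

```python
def partition_ranges(seq_length, segment_size, overlap_size):
    """Return 1-based inclusive (start, end) windows that cover a sequence.

    Consecutive windows share `overlap_size` bases. This is an illustrative
    sketch only; it is not taken from the EVM source code.
    """
    if seq_length < 1:
        raise ValueError("seq_length must be at least 1")
    if segment_size <= overlap_size:
        raise ValueError("segment_size must be larger than overlap_size")

    ranges = []
    start = 1
    while True:
        end = min(start + segment_size - 1, seq_length)
        ranges.append((start, end))
        if end == seq_length:
            break
        # Step back by the overlap so boundary-spanning genes stay intact.
        start = end - overlap_size + 1
    return ranges


if __name__ == "__main__":
    # A toy 250 bp contig split into 100 bp windows with a 10 bp overlap.
    for window in partition_ranges(250, 100, 10):
        print(window)
```

Smaller windows reduce the amount of evidence loaded at once, but the window and overlap should stay larger than the longest expected gene, otherwise no single window holds it completely.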

juanlu16 · 2022-04-18T12:28:03Z

Hi Brian Jonnas,

I am really grateful for your reply.

From what I understand of your explanation, the ab initio prediction is strictly necessary, isn't it? I have a prediction made with Augustus, but I would like to use only PASA data, since I consider the PASA data more reliable. What do you recommend?

Regarding the "Out of memory!" problem, I followed your suggestion to reduce the partition size, lowering it to 10000 from the 100000 used on the EvidenceModeler website. However, I get the same error when I run this test with predictions of 2000 genes from each of the four species I want to merge, i.e. a total of 8000 genes to merge into approximately 2000 genes. I had 165 GB of RAM for this test, so if the run is saturating all of it, I assume EvidenceModeler really does need a lot of RAM. My question is therefore more specific. I am working with genomes of about 24 very large plant varieties (between 13 and 17 Gb each); each raisin alignment-prediction (gff3 files) contains about 125000 alignments/assemblies on average, and there is a prediction for a reference genome with 71000 predicted genes. Do you know roughly how much RAM a system needs for EvidenceModeler to run correctly with all this data?

If this process needs a lot of RAM and other resources, we had thought about extracting the predicted genes for each chromosome and running EvidenceModeler chromosome by chromosome. Would this be a good option?

Thank you very much in advance for your help.

Best regards,

Juan Luis

brianjohnhaas · 2022-04-18T14:01:14Z

I'm not sure what the issue is with EVM, but it's likely related to some aspect of the data. If you have a small example that you could share with me that demonstrates the problem, I could look into it further.

If you have Augustus predictions and transcript data that you want to integrate, you could just upload the Augustus predictions as annotations into PASA and have PASA add UTRs and alt splice variants. If you run PASA with the --TRANSDECODER option, it'll add novel genes and revise the annotations more aggressively based on what it infers as full-length transcripts.

best,
~b

juanlu16 · 2022-04-19T14:00:48Z

Dear Brian Jonnas:

First, I attach a document containing screenshots and a description of all the files I used as input to run EvidenceModeler, plus the initial run command I used.

Second, EvidenceModeler has now let me run a test with my own data without the earlier memory problems, because I reduced the "segmentSize" parameter to 40000 instead of the 100000 used in the example tutorial. However, the result was not successful: the final gff3 file does not contain any data. I therefore wanted to ask you two specific questions:

1. Does the --transcript_alignments parameter accept only a single input file, or can it accept several? If it accepts only a single input file, could I merge all the files from the PASA run for each variety into a single final file and use that file as input for the --transcript_alignments parameter?
2. Why might the gff3 files generated for each reference genome partition, and the final gff3 file, contain no data?

Thank you very much for your help and your patience, Brian Jonnas.

Best regards,

Juan Luis

INPUT FILES.pdf

brianjohnhaas · 2022-04-19T15:58:15Z

Hi, EVM generally expects at least three sets of ab initio predictions and then the alignment data. Each parameter takes only a single gff3 file, so you must concatenate the data together for the corresponding data type. Since you're just using Augustus with transcript data, try just using PASA to update your Augustus predictions. Not sure who Jonnas is. ;-)
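
Concatenating several gff3 files of one data type into a single file, as described above, could be done along these lines. This is a minimal sketch under stated assumptions: the function name `concatenate_gff3` and the file names in the usage example are hypothetical, and the files are copied verbatim with no attempt to detect identifiers that collide between inputs.

```python
def concatenate_gff3(input_paths, output_path):
    """Copy every line of each input gff3 file into one output file.

    Returns the number of feature lines written (comment lines starting
    with '#' and blank lines are copied but not counted). Lines are copied
    verbatim; a missing final newline is added so that records from two
    files never run together on one line.
    """
    feature_lines = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                for line in handle:
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
                    if line.strip() and not line.startswith("#"):
                        feature_lines += 1
    return feature_lines


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names):
    #   python concat_gff3.py merged.gff3 variety1.gff3 variety2.gff3
    output, inputs = sys.argv[1], sys.argv[2:]
    print(concatenate_gff3(inputs, output), "feature lines written")
```

If two inputs reuse the same ID values, the merged file may need those identifiers made unique first; that step is not shown here.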

juanlu16 · 2022-04-20T17:35:00Z

Hi Brian

I concatenated the gff3 files from PASA and ran EVM again, but the generated gff3 files are still empty.

On the other hand, I am looking at improving the Augustus data with the PASA data, but from what I have seen the prediction would have to be redone, which would be a long and costly process, and I need to merge all the predictions as soon as possible because this is urgent.
I had thought that perhaps the fault might be in the input data. To check that the input gff3 files are OK, I used the script gff3_gene_prediction_file_validator.pl. When I ran that script with the Augustus prediction as input, I got the following error:

(base) support@srvapp02:/BIOINFOR/EvidenceModeler/test2$ /home/support/sw_installed/EVidenceModeler-1.1.1/EvmUtils/gff3_gene_prediction_file_validator.pl ab_inition_prediction.gff3
Fatal Error: cannot parse ID from entry
at /home/support/sw_installed/EVidenceModeler-1.1.1/EvmUtils/gff3_gene_prediction_file_validator.pl line 54, <$fh> line 18.
(base) support@srvapp02:/BIOINFOR/EvidenceModeler/test2$

I think it is because the identifier is missing in this gff3 file. How can I correct the file? How can I add the ID of each gene? Could the lack of these gene identifiers in the Augustus file be the reason that the gff3 generated by EvidenceModeler is empty?
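
One way to see which records lack an identifier is to scan the ninth column of the file for a missing ID= attribute. The sketch below is not a reimplementation of gff3_gene_prediction_file_validator.pl, whose exact checks are not shown in this thread; the function name `features_missing_id` and the choice of feature types to inspect are assumptions made for illustration.

```python
def features_missing_id(gff3_lines, feature_types=("gene", "mRNA")):
    """Return (line_number, feature_type) pairs for records without ID=.

    Only the feature types listed in `feature_types` are checked. Lines
    that do not have nine tab-separated columns are reported with the
    label "malformed". Line numbers are 1-based.
    """
    missing = []
    for number, line in enumerate(gff3_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            missing.append((number, "malformed"))
            continue
        if fields[2] not in feature_types:
            continue
        # Column 9 holds semicolon-separated key=value attribute pairs.
        attributes = dict(
            pair.strip().split("=", 1)
            for pair in fields[8].split(";")
            if "=" in pair
        )
        if "ID" not in attributes:
            missing.append((number, fields[2]))
    return missing


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for number, kind in features_missing_id(handle):
            print(f"line {number}: {kind} record has no ID attribute")
```

A record whose ninth column is a bare name rather than key=value pairs would be reported by this check, which may or may not be the situation behind the error above.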

If you like, I could email you the original files so that you can check whether they are OK.

Thank you very much for your attention and speed, Brian.

Best regards!

Juan Luis

brianjohnhaas · 2022-04-20T18:29:36Z

Sure, send me what you can and I'll take a look. We have a converter that should make Augustus gff3 look more compatible with PASA or EVM, in case that helps. The converter is in the EvmUtils/ area of the EVM software.

best,
~b

juanlu16 · 2022-04-21T07:35:23Z

Hi Brian

I have sent the files to your email.

Best,

Juan Luis

juanlu16 · 2022-05-10T08:03:45Z

Hi Brian

I hope all goes well for you.

I sent you an email with the files I use as input for EVM, but given the lack of response I suspect it did not reach you. I also tried to share the files here so that you could access them more easily, but it is not possible to upload these file formats. I am sending you a new email with the files to see whether you receive them this time and can take a look.

Sorry for the inconvenience, Brian, but EVM is software that interests me a lot.

Thank you very much in advance for your attention and help.

Best regards,

Juan Luis

brianjohnhaas · 2022-05-10T14:16:13Z

Hi Juan,

I don't think I received anything from you. Please try again: bhaas at broadinstitute dot org

many thanks,
~b

brianjohnhaas · 2022-05-10T14:28:32Z

Actually, I did receive them. They came through to my personal email. More soon.

brianjohnhaas · 2022-05-10T18:41:27Z

Hi Juan,

Lots of problems with these inputs:

- The genome has CHR1 as the accession, whereas the annotation files have chr1. These need to match exactly.
- The weights file's evidence source field doesn't match exactly with the 2nd column of the gff3 input files.
- The Augustus format needs to be converted to a gff3 file that EVM can work with. The following seemed to do the conversion: EVidenceModeler/EvmUtils/misc/augustus_GTF_to_EVM_GFF3.pl ab_inition_prediction.gff3 > gene_predictions.gff3
- Even after I partitioned the data into 1M chunks with 100k overlap, the gene predictions for one of the partitions I examined didn't have splice sites at the expected positions in the genome, so I'm not sure if there's an overall problem with the genome file you sent me not matching up with the Augustus predictions (given the CHR1 vs. chr1 issue, I suspect this might be the case).

Then, when it comes to the inputs, there's only the one ab initio type (Augustus), and EVM won't do much other than rearrange some intron/exon combinations and maybe merge some predictions based on the transcript data. EVM is really targeted towards combining multiple ab initio types along with the alignments.

I'm going to continue to encourage you to use PASA for integrating the transcript data with the Augustus predictions, as it would be your best bet towards achieving your goal here given your inputs.

Hope this helps,
~b
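
The mismatch between the weights file and the gff3 source column can be checked mechanically before a run. The following is a minimal sketch, assuming the weights file has whitespace-separated columns with the evidence source in the second column (an assumption about the layout, not something shown in this thread); the function names are hypothetical.

```python
def gff3_sources(gff3_lines):
    """Collect the distinct values of column 2 (source) from gff3 lines."""
    sources = set()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            sources.add(fields[1])
    return sources


def weights_sources(weights_lines):
    """Collect the second column of a weights file.

    Assumes whitespace-separated columns with the evidence source second.
    """
    sources = set()
    for line in weights_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) >= 2:
            sources.add(fields[1])
    return sources


def unmatched_sources(gff3_lines, weights_lines):
    """Return (sources only in the gff3, sources only in the weights file).

    The comparison is exact and case-sensitive, since the names are said
    to need an exact match.
    """
    in_gff3 = gff3_sources(gff3_lines)
    in_weights = weights_sources(weights_lines)
    return sorted(in_gff3 - in_weights), sorted(in_weights - in_gff3)
```

Two empty lists mean every source named in the gff3 input has a weights entry and vice versa; anything else points to a spelling or capitalisation difference to fix.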

brianjohnhaas · 2022-10-11T08:46:55Z

Thanks, Juan! From looking at the files, I expect EVidenceModeler/EvmUtils/misc/augustus_GTF_to_EVM_GFF3.pl should convert your Augustus ab initio predictions to a compatible gff3 format.

Note that the example fasta file you sent has 'CHR1' as the accession instead of 'chr1', which is a problem. You'll need to have the chromosome or contig identifiers match up exactly. Also, when I tried to take the resulting gff3 file and match it up to this genome sequence, the coding regions didn't translate as I expected them to, so maybe there's a disconnect there.

Hope this helps.

best,
~b
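
The identifier mismatch described above (CHR1 in the fasta versus chr1 in the annotation) can be detected by comparing the fasta headers against the first column of the gff3 file. This is a minimal sketch; the function names are hypothetical, and it assumes the sequence accession is the first whitespace-delimited word of each fasta header line.

```python
def fasta_ids(fasta_lines):
    """Return the accession (first word after '>') of every fasta record."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def gff3_seqids(gff3_lines):
    """Return the distinct values of column 1 (seqid) from gff3 lines."""
    seqids = set()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        seqids.add(line.split("\t")[0])
    return seqids


def compare_seqids(fasta_lines, gff3_lines):
    """Find gff3 seqids that have no exact match in the genome fasta.

    Returns (missing, case_only): `missing` is a sorted list of seqids
    absent from the fasta, and `case_only` maps those that differ from a
    fasta accession only by letter case to that accession.
    """
    genome = fasta_ids(fasta_lines)
    missing = sorted(gff3_seqids(gff3_lines) - genome)
    lowered = {name.lower(): name for name in genome}
    case_only = {
        name: lowered[name.lower()] for name in missing if name.lower() in lowered
    }
    return missing, case_only
```

An entry in `case_only` suggests that renaming one side is enough, whereas a seqid that is missing entirely suggests the annotation was made against a different assembly.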
