java.lang.NullPointerException when running MS-GF+ #13

Open
RiegardtJohnson opened this issue Jun 7, 2017 · 11 comments

Comments

@RiegardtJohnson

I ran an MS-GF+ (v2017.01.13) search using SearchGUI and received the following error while the output files were being generated:

Writing results...
java.lang.NullPointerException
at edu.ucsd.msjava.mzid.MZIdentMLGen.getDBSequence(MZIdentMLGen.java:661)
at edu.ucsd.msjava.mzid.MZIdentMLGen.getPeptideEvidenceList(MZIdentMLGen.java:619)
at edu.ucsd.msjava.mzid.MZIdentMLGen.addSpectrumIdentificationResults(MZIdentMLGen.java:347)
at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:397)
at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:106)
at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:57)

The search finishes without any other errors, but no output .mzid files are generated. The command used to run the search was as follows:
ms-gf+ command:
/home/user/anaconda2/jre/bin/java -Xmx50g -jar /run/media/user/Data/rmj_proteomics/SearchGUI-3.2.18/resources/MS-GF+/MSGFPlus.jar -s /run/media/user/Data/rmj_proteomics/proteomics/RECONVERTED/RJ_FC2_DCE.mgf -d /run/media/user/Data/rmj_proteomics/TREMBL_database/nr_fungal/nr_fungal_concatenated_target_decoy.fasta -o /run/media/user/Data/rmj_proteomics/proteomics/nr_fungal_lin/.SearchGUI_temp/RJ_FC2_DCE.msgf.mzid -t 10.0ppm -tda 0 -mod /run/media/user/Data/rmj_proteomics/SearchGUI-3.2.18/resources/MS-GF+/params/Mods.txt -minCharge 2 -maxCharge 6 -inst 3 -thread 23 -m 3 -e 1 -ntt 2 -protocol 0 -minLength 8 -maxLength 45 -n 10 -addFeatures 0 -ti 0,4

Can you advise on how to resolve this error?

Kind regards,
Riegardt Johnson

@alchemistmatt
Collaborator

The information you provided is useful, but it's not enough for us to solve the problem. It may be related to the protein names or protein sequences in the FASTA file, but without the actual files we can't diagnose it. Please send SearchGUI-3.2.18/resources/MS-GF+/params/Mods.txt along with a portion of the .mgf file (e.g. a sampling of 25 spectra from the middle of the scan range) to proteomics@pnnl.gov

@alchemistmatt
Collaborator

Also, please provide us info on where you obtained the TREMBL nr_fungal FASTA file. It would also be helpful if you sent us a portion of your FASTA file, including both the normal proteins and the decoy proteins that you added. This will let us see the format you're using for protein names, descriptions, and sequences.

I'm going to guess you're using uniprot_trembl_fungi.dat.gz from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/ but please confirm.

@alchemistmatt
Collaborator

If you're using the full-size TREMBL nr_fungal FASTA file, I'm frankly surprised that MS-GF+ is not running out of memory. We have found that when FASTA files get larger than ~800 MB, we run into memory issues: the system needs 16 GB of memory or more, scaling with FASTA file size. In cases like that we split the FASTA file into multiple parts, run MS-GF+ once on each FASTA file part, then merge the results together.
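The splitting step could be sketched roughly as follows. This is only an illustration, not the script used at PNNL: the function names and the `max_bytes` budget are made up for the example, and the one assumption is that a protein record must never be cut in half.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines, header line first."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # drop blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield record


def iter_fasta_parts(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield groups of whole records, each group about max_bytes or less.

    A single record larger than max_bytes still gets a part of its own.
    """
    current: List[str] = []
    current_size = 0
    for record in read_fasta_records(lines):
        size = sum(len(line.encode("utf-8")) for line in record)
        if current and current_size + size > max_bytes:
            yield current
            current, current_size = [], 0
        current.extend(record)
        current_size += size
    if current:
        yield current
```

Each yielded part can be written to its own file (for example with `writelines`) and searched separately, so only one part is held in memory at a time.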

The May 2017 release of uniprot_trembl_fungi.dat has 6.6 million proteins, giving a 4 GB FASTA file. The decoy version of that is 8 GB. I see you're allocating 50 GB via java -Xmx50g, so hopefully that's enough memory, but I suggest you first get things working with a sampling of that huge FASTA file. Something like head -n 5000000 uniprot_trembl_fungi.fasta > uniprot_trembl_fungi_excerpt.fasta
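One caveat with cutting by line count is that the excerpt can end partway through a protein's sequence. A small sketch that instead keeps the first N complete records might look like this (`fasta_head` and `max_records` are names invented for the example):

```python
from typing import Iterable, Iterator


def fasta_head(lines: Iterable[str], max_records: int) -> Iterator[str]:
    """Yield the lines of the first max_records complete FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > max_records:
                break  # stop before the next record's header
        if seen:  # ignore anything before the first header
            yield line
```

Used on real files, this would be something like `out.writelines(fasta_head(src, 100000))` with `src` and `out` opened as text files.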

@Stortebecker

I got a similar error when searching an mzML file that had undergone MS2-level peak picking with the OpenMS tool PeakPickerHiRes. When I instead use the vendor peak picking provided by MSConvert, MS-GF+ runs without any error.

You can find the database, the original file and the vendor-peak-picked file here. I uploaded the PeakPickerHiRes output to Dropbox.

The command I ran:
java -jar MSGFPlus.jar -s PeakPickerHiRes_on_qExactive01819.mzml -d Human_database_cRAP_added.fasta -t 10ppm

The error I got:

Loading database finished (elapsed time: 20,16 sec)
Reading spectra...
java.lang.NullPointerException
at edu.ucsd.msjava.msutil.Spectrum.getCharge(Spectrum.java:124)
at edu.ucsd.msjava.msutil.SpecKey.getSpecKeyList(SpecKey.java:91)
at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:220)
at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

@hroest

hroest commented Dec 7, 2017

@Stortebecker maybe this is related to OpenMS/OpenMS#3082

@alchemistmatt is it possible that MSGF+ relies on optional elements in the mzML file?

@FarmGeek4Life
Collaborator

@Stortebecker That file has no charge state information for the precursors, which is what MS-GF+ is trying to read when it crashes. PeakPickerHiRes does not report the charge states, but as of 2014 there was work in progress to implement charge state determination/deconvolution algorithms as options in OpenMS, according to OpenMS issue #877.
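To check a file for this before searching, something along these lines could list the spectra whose precursor carries no charge state. It is a sketch that assumes the standard mzML namespace and the PSI-MS CV accession MS:1000041 ("charge state"); it does not claim to mirror what the MS-GF+ parser reads, and `spectra_missing_charge` is a made-up name.

```python
import xml.etree.ElementTree as ET

MZML_NS = "{http://psi.hupo.org/ms/mzml}"
CHARGE_STATE = "MS:1000041"  # PSI-MS CV accession for "charge state"


def spectra_missing_charge(source):
    """Return ids of spectra that have a precursor but no charge state cvParam.

    `source` is a file name or a binary file object containing mzML.
    """
    missing = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != MZML_NS + "spectrum":
            continue
        precursors = elem.findall(f".//{MZML_NS}precursor")
        if precursors:
            has_charge = any(
                cv.get("accession") == CHARGE_STATE
                for precursor in precursors
                for cv in precursor.iter(MZML_NS + "cvParam")
            )
            if not has_charge:
                missing.append(elem.get("id"))
        elem.clear()  # release the spectrum's children to keep memory low
    return missing
```

Spectra without any precursor element (for example MS1 scans) are skipped rather than reported.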

@RiegardtJohnson: This is a problem with how the search is implemented in MS-GF+, combined with a limitation of Java. Java uses a 32-bit signed integer as the index for an array, which limits an array to ~2.147 billion entries; MS-GF+ accesses all peptides in the FASTA file in a way that means each residue is one entry in an array. Your database file, at 4 GB, is big enough to hit this limit for just a target or decoy search; creating the concatenated target/decoy file for a combined target and decoy search doubles the number of residues, which only makes it worse.
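As a back-of-the-envelope check, the residue count of a FASTA file can be compared against Java's maximum array index. The sketch below is only an approximation: it ignores any per-protein delimiters or other overhead MS-GF+ may add, so a database that passes this check could still be too large. Both function names are invented for the example.

```python
from typing import Iterable

JAVA_MAX_ARRAY_INDEX = 2**31 - 1  # Integer.MAX_VALUE, 2,147,483,647


def count_residues(lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA lines, skipping headers and whitespace."""
    total = 0
    for line in lines:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


def fits_in_java_array(residues: int, concatenated_decoy: bool = False) -> bool:
    """Rough test: one array entry per residue, doubled if decoys are appended."""
    needed = residues * 2 if concatenated_decoy else residues
    return needed <= JAVA_MAX_ARRAY_INDEX
```

For a real file, `count_residues(open("db.fasta"))` streams the file line by line; `db.fasta` here is a placeholder name.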

@Wang-kaifei

Dear Developers,

I had the same problem recently. I was using a 14 GB FASTA file, and from reading the replies above I realised that I need to split the database before searching.

Because a single MS/MS spectrum can be matched to different peptides in different searches, it seems to me that the results of these searches cannot simply be concatenated.

So I wonder if there is an official tool for merging the results from these sliced searches?

The command I use is: java -Xms150G -Xmx210G -jar MSGFPlus.jar -conf param_file_path
I am using the software version: MSGFPlus_v20230112

Any replies will be appreciated!

@alchemistmatt
Collaborator

Use the MzidMerger to combine .mzid files from separate MS-GF+ searches of the same instrument file

@Wang-kaifei

> Use the MzidMerger to combine .mzid files from separate MS-GF+ searches of the same instrument file

Thanks a lot, I will try it!

@Wang-kaifei

Dear developers,

I've run into another problem.
When I use the command: dotnet /data/liuqingxiu/wkf/MSGFMerge/net5.0/MzidMerger.exe -inDir a -out b, I receive the following error:

Error:
An assembly specified in the application dependencies manifest (MzidMerger.deps.json) has already been found but with a different file extension:
package: 'MzidMerger', version: '1.3.1'
path: 'MzidMerger.dll'
previously found assembly: '/data/liuqingxiu/wkf/MSGFMerge/net5.0/MzidMerger.exe'

I'm using Ubuntu 20.04 with dotnet version 5.0.408.

Any replies will be appreciated!

@FarmGeek4Life
Collaborator

FarmGeek4Life commented Dec 22, 2023 via email
