BenderGroup/PIDGINv4

Prediction IncluDinG INactivity (PIDGIN) Version 4.2
====================================================

UPDATE MAR 2022: The new no-orthologue models can be downloaded at https://doi.org/10.6084/m9.figshare.19108382.v1 - remove the old no_ortho directory, and download and unzip the folder in the PIDGINv4 root directory. See the ReadtheDocs for full installation instructions.

For now, the orthologue (--ortho) command is deprecated - use the old models at your own risk. If you require the orthologue models for your research, please get in touch!

Authors: Maria-Anna Trapotsi, Layla Hosseini-Gerami and Lewis Mervin

Email: mat64@cam.ac.uk and lh605@cam.ac.uk

Supervisor: Dr. A. Bender

Protein target prediction using Random Forests (RFs) trained on bioactivity data from PubChem (extracted Mar 2020) and ChEMBL (version 26), built with RDKit and Scikit-learn. The models employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto [1]. This project is the successor to PIDGIN version 1 [2] and PIDGIN version 2 [3], and is the updated and retrained version of PIDGIN version 3. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available as implemented in [4].

  • Molecular Descriptors: 2048-bit RDKit Extended Connectivity FingerPrints (ECFP) [5]
  • Algorithm: Random Forests with dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
  • Models generated at four different activity cut-offs: 100μM, 10μM, 1μM and 0.1μM
  • Models generated both with and without mapping to orthologues, as implemented in [6]
  • Pathway information from NCBI BioSystems
  • Disease information from DisGeNET
  • Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
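
The two enrichment tests named in the last bullet can be illustrated on a single 2x2 contingency table. The following is a minimal pure-Python sketch, not PIDGIN's own implementation: the function names, the table layout (query set versus background set, predicted versus not predicted) and the choice of a one-sided Fisher test and an uncorrected Pearson chi-squared test are all assumptions made for illustration.

```python
from math import comb, erfc, sqrt


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Hypothetical layout (an assumption, not taken from PIDGIN):
      a = query compounds predicted for the target
      b = query compounds not predicted
      c = background compounds predicted
      d = background compounds not predicted
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b       # size of the query set
    col1 = a + c       # all compounds predicted for the target
    n = a + b + c + d  # all compounds
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(n, row1)


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 d.o.f., no Yates correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table sums to zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the survival function is erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


if __name__ == "__main__":
    print(fisher_exact_greater(3, 1, 1, 3))
    print(chi_squared_2x2(3, 1, 1, 3))
```

For the toy table [[3, 1], [1, 3]] the Fisher tail probability is 17/70 and the chi-squared statistic is 2.0. The exact test is generally preferred when cell counts are small, which is why both are worth comparing on sparse targets.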

Details for sizes across all activity cut-offs

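============================================== ========================= ===========================
Statistic                                      Without orthologues       With orthologues
============================================== ========================= ===========================
Distinct Models                                11,782                    16,772
Distinct Targets [exhaustive total]            3,698 [11,782]            17,021 [63,140]
Total Bioactivities Over all models            50,210,041                437,574,005
Actives                                        4,079,996                 4,087,155
Inactives [Of which are Sphere Exclusion (SE)] 46,130,045 [35,119,663]   463,237,781 [314,117,438]
============================================== ========================= ===========================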

Full details on all models are provided in the uniprot_information.txt files in the orthologue and no_orthologue directories.

INSTRUCTIONS

Development occurs on GitHub.

Install with Conda

Documentation, installation and instructions are on ReadtheDocs.

IMPORTANT

  • Use the ReadtheDocs! You MUST download the models before running!
  • The program accepts input either as line-separated SMILES (.smi/.smiles) or as an SD file (.sdf)
  • If a SMILES line contains data after the SMILES string, the first entry after each SMILES string is automatically interpreted as its identifier (see the OpenSMILES specification §4.5), although there are options to change this behaviour
  • Molecules are automatically standardized when running models (can be turned off)
  • Do not modify the 'pkls', 'ad_data' etc. names or directories
  • Files in the examples directory are included for testing as on the ReadtheDocs tutorials.
  • For installation and usage instructions, see the documentation.
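
The SMILES input convention described above (one molecule per line, with the first entry after the SMILES string taken as the identifier) can be sketched as follows. This is an illustrative parser only, not PIDGIN's reader: the function names and the `delimiter` parameter are hypothetical, and whitespace is assumed as the default separator.

```python
def parse_smiles_line(line, delimiter=None):
    """Split one input line into (smiles, identifier).

    `delimiter=None` splits on any whitespace (an assumption for this sketch).
    Returns None for blank lines; identifier is None when the line holds
    only a SMILES string.
    """
    fields = line.strip().split(delimiter)
    if not fields or not fields[0]:
        return None
    smiles = fields[0]
    # Only the first entry after the SMILES is used as the identifier.
    identifier = fields[1] if len(fields) > 1 else None
    return smiles, identifier


def read_smiles(lines, delimiter=None):
    """Parse an iterable of lines, skipping blank ones."""
    records = []
    for line in lines:
        parsed = parse_smiles_line(line, delimiter)
        if parsed is not None:
            records.append(parsed)
    return records


if __name__ == "__main__":
    example = ["CCO ethanol", "", "c1ccccc1"]
    for smiles, identifier in read_smiles(example):
        print(smiles, identifier)
```

Any further columns on a line are ignored in this sketch; the documented options for changing how identifiers are read are not modelled here.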

License

PIDGINv4 is available under the GNU General Public License v3.0 (GPLv3).

References


  1. Aniceto, N, et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016).

  2. Mervin, L H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015).

  3. Mervin, L H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018).

  4. Mervin, L H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016).

  5. Rogers D & Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742-54 (2010).

  6. Mervin, L H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018).