Analysis of log P predictions
General analysis of log P predictions includes predicted vs. experimental log P correlation plots and six performance statistics (RMSE, MAE, ME, R^2, linear regression slope (m), and error slope (ES)) for all submissions. 95% percentile bootstrap confidence intervals are reported for all statistics. The error slope (ES) statistic is calculated as the slope of the line fit to the QQ plot of model uncertainty predictions.
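The statistics above can be sketched roughly as follows. This is a minimal illustration, not the code in logP_analysis.py: the function names, the sign convention for the error (predicted minus experimental), the choice to regress predicted on experimental values, the number of bootstrap samples, and the example values are all assumptions. The error slope is left out because it needs uncertainty estimates.

```python
import numpy as np


def performance_statistics(predicted, experimental):
    """Return RMSE, MAE, ME, R^2 and regression slope m for paired values."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    error = predicted - experimental  # assumed sign convention
    # Least-squares line: predicted = m * experimental + b (assumed direction)
    m, _b = np.polyfit(experimental, predicted, 1)
    r = np.corrcoef(experimental, predicted)[0, 1]
    return {
        "RMSE": float(np.sqrt(np.mean(error ** 2))),
        "MAE": float(np.mean(np.abs(error))),
        "ME": float(np.mean(error)),
        "R2": float(r ** 2),
        "m": float(m),
    }


def bootstrap_ci(predicted, experimental, n_samples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for each statistic."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(predicted)
    samples = {}
    for _ in range(n_samples):
        # Resample molecules with replacement, keeping pairs together
        idx = rng.integers(0, n, size=n)
        stats = performance_statistics(predicted[idx], experimental[idx])
        for name, value in stats.items():
            samples.setdefault(name, []).append(value)
    lower = 100.0 * (1.0 - level) / 2.0
    upper = 100.0 - lower
    return {
        name: (float(np.percentile(values, lower)),
               float(np.percentile(values, upper)))
        for name, values in samples.items()
    }


if __name__ == "__main__":
    # Made-up values for illustration only, not challenge data
    experimental = [1.9, 2.4, 3.0, 3.1, 3.3, 3.6, 3.9, 4.0, 4.1, 4.5]
    predicted = [2.3, 2.1, 3.4, 2.8, 3.9, 3.5, 4.6, 3.7, 4.4, 5.1]
    point = performance_statistics(predicted, experimental)
    intervals = bootstrap_ci(predicted, experimental)
    for name in point:
        low, high = intervals[name]
        print(f"{name}: {point[name]:.2f} [{low:.2f}, {high:.2f}]")
```

Resampling whole (predicted, experimental) pairs keeps each molecule's prediction tied to its own experimental value, which is what a percentile bootstrap over molecules requires.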
Molecular statistics analysis was performed to identify which molecules of the SAMPL6 log P Challenge set had log P values that were more difficult to predict accurately across participating methods. Error statistics (MAE and RMSE) were calculated for each molecule, averaging across all methods or across all methods within a method category (Physical, Empirical, Mixed, or Other).
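The per-molecule averaging described above could look something like the sketch below. It is an illustration only and not taken from logP_analysis2.py: the record layout, the function name, and the molecule IDs (mol_A, mol_B) are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def molecular_error_statistics(records, category=None):
    """Compute MAE and RMSE per molecule across methods.

    records: iterable of (molecule_id, method_category, predicted, experimental).
    category: if given, only methods in that category are included.
    """
    errors = defaultdict(list)
    for molecule_id, method_category, predicted, experimental in records:
        if category is None or method_category == category:
            errors[molecule_id].append(predicted - experimental)

    stats = {}
    for molecule_id, errs in errors.items():
        n = len(errs)
        stats[molecule_id] = {
            "MAE": sum(abs(e) for e in errs) / n,
            "RMSE": sqrt(sum(e * e for e in errs) / n),
        }
    return stats


if __name__ == "__main__":
    # Hypothetical predictions from three methods for two molecules
    records = [
        ("mol_A", "Physical", 3.0, 2.0),
        ("mol_A", "Empirical", 2.5, 2.0),
        ("mol_A", "Physical", 1.0, 2.0),
        ("mol_B", "Physical", 4.2, 4.0),
        ("mol_B", "Empirical", 3.9, 4.0),
    ]
    print(molecular_error_statistics(records))
    print(molecular_error_statistics(records, category="Physical"))
```

Passing a category restricts the average to one method class, which corresponds to the per-category tables; omitting it averages over every method.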
run.sh- Bash script that runs the Python analysis scripts and compiles TeX files.
logP_analysis.py- Python script that parses submissions and performs the analysis.
logP_analysis2.py- Python script that performs the molecular statistics analysis (error statistics, MAE and RMSE, calculated across methods for each molecule).
logP_predictions/- This directory includes SAMPL6 log P submission files.
analysis_outputs/- All analysis outputs are organized under this directory, which also includes the submission IDs assigned to each submission.
error_for_each_logP.pdf- Violin plots showing the distribution of prediction errors for each experimental log P value.
logPCorrelationPlots/- This directory contains plots of predicted vs. experimental log P values with a linear regression line (blue) for each method. Files are named by the submission ID of each method, which can be found in statistics_table.pdf. In the correlation plots, the dashed black line has a slope of 1. Dark and light green shaded areas indicate ±0.5 and ±1.0 log P unit error regions, respectively.
logPCorrelationPlotsWithSEM/- This directory contains plots similar to those in the logPCorrelationPlots/ directory, with error bars added for the Standard Error of the Mean (SEM) of experimental and predicted values for submissions that reported these values. Since experimental log P SEM values are small, horizontal error bars are mostly not visible.
AbsoluteErrorPlots/- This directory contains a bar plot for each method showing the absolute error of each log P prediction relative to the experimental value.
StatisticsTables/- This directory contains machine-readable copies of the statistics table, bootstrap distributions of performance statistics, and overall performance comparison plots based on RMSE and MAE values.
statistics.pdf- A table of performance statistics (RMSE, MAE, ME, R^2, linear regression slope (m), and error slope (ES)) for all submissions.
statistics_bootstrap_distributions.pdf- Violin plots showing bootstrap distributions of performance statistics of each method. Each method is labelled by submission ID.
QQPlots/- Quantile-Quantile plots for the analysis of model uncertainty predictions.
MolecularStatisticsTables/- This directory contains tables and barplots from the molecular statistics analysis (error statistics, MAE and RMSE, calculated across methods for each molecule).
MAE_vs_molecule_ID_plot.pdf- Barplot of MAE calculated for each molecule averaging over all prediction methods.
RMSE_vs_molecule_ID_plot.pdf- Barplot of RMSE calculated for each molecule averaging over all prediction methods.
molecular_error_statistics.csv- MAE and RMSE statistics calculated for each molecule averaging over all prediction methods. 95% confidence intervals were calculated via bootstrapping (10000 samples).
molecular_MAE_comparison_between_method_categories.pdf- Barplot of MAE calculated for each method category for each molecule averaging over all predictions in that method category. Colors of bars indicate method categories.
Empirical/- This directory contains the table and barplots of the molecular statistics analysis calculated only for methods in the Empirical method category.
Mixed/- This directory contains the table and barplots of the molecular statistics analysis calculated only for methods in the Mixed method category.
Other/- This directory contains the table and barplots of the molecular statistics analysis calculated only for methods in the Other method category.
Physical/- This directory contains the table and barplots of the molecular statistics analysis calculated only for methods in the Physical method category.
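For readers unfamiliar with the error slope (ES) derived from the plots in QQPlots/, here is one possible way such a statistic could be computed. This is a sketch under stated assumptions, not the implementation in logP_analysis.py. It assumes normally distributed errors, combines predicted and experimental SEM in quadrature, compares the observed fraction of experimental values that fall within each predicted confidence interval to the expected fraction, and fits a line through the origin. The function name and the grid of 101 confidence levels are hypothetical.

```python
import math

import numpy as np


def error_slope(predicted, experimental, predicted_sem, experimental_sem=None):
    """Slope of observed vs. expected coverage (assumed QQ construction).

    predicted_sem must be positive for every prediction.
    Returns (slope, expected_fractions, observed_fractions).
    """
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    predicted_sem = np.asarray(predicted_sem, dtype=float)
    if experimental_sem is None:
        experimental_sem = np.zeros_like(predicted)
    experimental_sem = np.asarray(experimental_sem, dtype=float)

    # Combined uncertainty and normalized absolute error
    sigma = np.sqrt(predicted_sem ** 2 + experimental_sem ** 2)
    z = np.abs(predicted - experimental) / sigma

    # Smallest central confidence level of a normal distribution
    # that just covers each observed error
    coverage_level = np.array([math.erf(v / math.sqrt(2.0)) for v in z])

    expected = np.linspace(0.0, 1.0, 101)
    observed = np.array([np.mean(coverage_level <= x) for x in expected])

    # Least-squares slope of a line through the origin
    slope = float(np.sum(expected * observed) / np.sum(expected ** 2))
    return slope, expected, observed


if __name__ == "__main__":
    # Made-up values for illustration only
    slope, _, _ = error_slope(
        predicted=[2.1, 3.4, 2.8, 4.6],
        experimental=[2.4, 3.0, 3.1, 3.9],
        predicted_sem=[0.3, 0.3, 0.3, 0.3],
    )
    print(f"error slope: {slope:.2f}")
```

Under this construction, a slope near 1 suggests the reported uncertainties are roughly consistent with the observed errors, a slope below 1 suggests they are too small, and a slope above 1 suggests they are too large.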
Submission IDs for log P prediction methods
SAMPL6 log P challenge submissions are listed in ascending order of RMSE.
| Submission ID | Method Name | Category |
|---|---|---|
| gmoq5 | Global XGBoost-Based QSPR LogP Predictor | Empirical |
| sq07q | Local XGBoost-Based QSPR LogP Predictor | Empirical |
| hdpuj | RayLogP-II, a cheminformatic QSPR model predicting the octanol/water partition coefficient, logP. | Empirical |
| 0a7a8 | ML Prediction using MD Feature Vector Trained on logP_octanol_water, with Additional Meta-learner | Mixed |
| w6jta | ML Prediction using MD Feature Vector Trained on logP_octanol_water | Mixed |
| 5krdi | ZINC15 versus PM3 | Mixed |
| gnxuu | ML Prediction using MD Feature Vector Trained on logP_octanol_water | Mixed |
| kxsp3 | PLS2 from NIST data and QM-generated QSAR Descriptors | Mixed |
| 5mahv | ML Prediction using MD Feature Vector Trained on Hydration Free Energy | Mixed |
| y0xxd | FS-GM (Fast switching Growth Method) | Physical |
| 2ggir | FS-AGM (Fast switching Annihilation/Growth Method) | Physical |
| dyxbt | B3PW91-TZ SMD set1 | Physical |
| h83sb | Linear Regression with B3LYP/6-31G+ | Mixed |
| f3dpg | PLS from NIST data and QM-generated QSAR Descriptors | Mixed |
| 25s67 | FS-AGM (Fast switching Annihilation/Growth Method) | Physical |
| 7gg6s | MLR from NIST data and QM-generated QSAR Descriptors | Mixed |
| hwf2k | Extended solvent-contact model approach | Empirical |
| ggm6n | FS-GM (Fast switching Growth Method) | Physical |
| cr3hs | PLS3 from NIST data and QM-generated QSAR Descriptors subset | Mixed |
| ahmtf | B3PW91-TZ SMD kcl-wet-oct | Physical |
| o7djk | B3PW91-TZ SMD wetoct | Physical |
| bzeez | FS-AGM (Fast switching Annihilation/Growth Method) | Physical |
| 5svjv | FS-GM (Fast switching Growth Method) | Physical |
| bq6fo | Extended solvent-contact model approach | Mixed |