This notebook describes the methods used to generate a control dataset for benchmarking the pipeline against a set of structures that are in a **'protomeric'** relationship. Before the methods, we define which structural relationships fall into this category.

**Protomers**

In the context of this pipeline, protomers are pairs of structures that share the same connectivity and spatial configuration but differ in protonation state. Differences in oxidation state are not handled explicitly, which is a limitation of the current approach. In practice, if structures A and A' have exactly the same connectivity and the same stereochemical properties as identified by the pipeline, but differ in charge as calculated with RDKit, the pipeline assigns them as protomers.
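As a rough illustration of this definition, the rule can be sketched as a comparison of three properties per structure. The `StructureRecord` fields and the `is_protomer_pair` function are hypothetical stand-ins, not the stereomapper API; in the pipeline these properties would be derived from the structures themselves, with charge calculated using RDKit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureRecord:
    """Hypothetical summary of one structure (not a stereomapper class)."""
    connectivity: str  # any canonical key for the atom connectivity
    stereo: str        # any canonical key for the stereochemical configuration
    charge: int        # net formal charge


def is_protomer_pair(a: StructureRecord, b: StructureRecord) -> bool:
    """True when connectivity and stereo match but the charge differs."""
    return (
        a.connectivity == b.connectivity
        and a.stereo == b.stereo
        and a.charge != b.charge
    )


# Toy values for illustration only.
acid = StructureRecord("lactate-skeleton", "S", 0)
anion = StructureRecord("lactate-skeleton", "S", -1)
other_anion = StructureRecord("lactate-skeleton", "R", -1)

print(is_protomer_pair(acid, anion))        # same stereo, different charge
print(is_protomer_pair(acid, other_anion))  # stereo differs as well
```

Under this sketch, a pair that differs in both stereo and charge is not a protomer pair, which is in line with the "Stereo and charge differ" notes that appear in the results further down.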

In [None]:
import logging
import os
import sqlite3
import subprocess
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator as rdFG
from stereomapper.domain.chemistry import ChemistryUtils

load_dotenv()

logger = logging.getLogger()
logger.setLevel(logging.INFO)

## Run pipeline on the set of diversified pairs of possible protomers

In [None]:
protomers = Path('protomers_benchmarking')  # obtained from the Zenodo repository, as per the manuscript

PosixPath('/home/jackmcgoldrick/drive/Recon4IMD/WP5_Reconstruction/T5.1_ReconXKG/data/stereomapper/control_sets/stereoisomer_control/protomer_benchmark_data')

In [None]:
# run stereomapper as a subprocess
results = Path.home() / "drive/Recon4IMD/WP5_Reconstruction/T5.1_ReconXKG/results/stereomapper/project_results/benchmarking/041125_protomer_benchmark.sqlite"
# keep the path relative to the home directory, like `results`; joining an absolute path would discard Path.home()
cache_path = Path.home() / "drive/Recon4IMD/WP5_Reconstruction/T5.1_ReconXKG/results/stereomapper/project_results/benchmarking/cache/041125_protomer_benchmark.sqlite"
cmd = [
    "stereomapper",
    "run",
    "-d", protomers.as_posix(),
    "-o", results.as_posix(),
    "-p", cache_path.as_posix(),
    "--fresh-cache"
]

subprocess.run(cmd) 

INFO    Logging initialised. File: /media/JACK/stereomapper_logs/stereomapper_20251104_123324.log
Pipeline: 100%|██████████████████████████████████████████████████████████| , Complete! [00:21<00:00]


✅ Pipeline completed in 21.2s
📦 Inputs attempted: 6,730 (skipped 0)
📊 Successes: 6,724 | Failures: 6
🔗 Inchikey groups — processed 3,467, skipped 0, failed 0
🧮 Relationship rows: 3,312
🧾 Unique inchikeys observed: 3,467
💾 Cache hit rate: 0.0%




CompletedProcess(args=['stereomapper', 'run', '-d', '/home/jackmcgoldrick/drive/Recon4IMD/WP5_Reconstruction/T5.1_ReconXKG/data/stereomapper/control_sets/stereoisomer_control/protomer_benchmark_data', '-o', '/home/jackmcgoldrick/drive/Recon4IMD/WP5_Reconstruction/T5.1_ReconXKG/results/stereomapper/project_results/benchmarking/041125_protomer_benchmark.sqlite', '-p', '/home/jackmcgoldrick/drive/Recon4IMD/WP5_Reconstruction/T5.1_ReconXKG/results/stereomapper/project_results/benchmarking/cache/041125_protomer_benchmark.sqlite', '--fresh-cache'], returncode=0)
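
`subprocess.run` does not raise on a non-zero exit status by default, so a failed pipeline run could go unnoticed in a notebook. A minimal sketch of a stricter wrapper follows; `run_checked` is a hypothetical helper name, and the demonstration uses the Python interpreter as a stand-in command so that it runs without stereomapper installed.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command, capture its output as text, and raise on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Stand-in command; in the notebook this would be the `cmd` list built above.
done = run_checked([sys.executable, "-c", "print('pipeline ok')"])
print(done.returncode, done.stdout.strip())

try:
    run_checked([sys.executable, "-c", "raise SystemExit(3)"])
except subprocess.CalledProcessError as err:
    print("command failed with exit status", err.returncode)
```

Capturing output hides the live progress bar, so `capture_output=True` can be dropped when the progress display matters more than keeping the log text.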

In [27]:
# open file with sqlite3
conn = sqlite3.connect(results.as_posix())
cursor = conn.cursor()
df_res_protomers = pd.read_sql_query("SELECT * FROM relationships", conn)

# # take only cases where cluster size == 1
# df_res_protomers = df_res_protomers.loc[ (df_res_protomers['cluster_a_size'].astype(int) == 1) &
#                                         (df_res_protomers['cluster_b_size'].astype(int) == 1) ]

df_res_protomers

Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag
0,1,2,"[""chebi:CHEBI:52504""]","[""chebi:CHEBI:40617""]",1,1,Protomers,74.0,"{""confidence_bin"":""medium""}",,v1.0
1,4,5,"[""chebi:CHEBI:17521""]","[""chebi:CHEBI:29751""]",1,1,Protomers,91.0,"{""confidence_bin"":""high""}",,v1.0
2,6,7,"[""chebi:CHEBI:18381""]","[""chebi:CHEBI:58467""]",1,1,Protomers,96.0,"{""confidence_bin"":""high""}",,v1.0
3,8,9,"[""chebi:CHEBI:33462""]","[""chebi:CHEBI:16215""]",1,1,Protomers,89.0,"{""confidence_bin"":""medium""}",,v1.0
4,10,11,"[""chebi:CHEBI:64808""]","[""chebi:CHEBI:64790""]",1,1,Protomers,93.0,"{""confidence_bin"":""high""}",,v1.0
...,...,...,...,...,...,...,...,...,...,...,...
3307,6692,6693,"[""chebi:CHEBI:17793""]","[""chebi:CHEBI:77992""]",1,1,Protomers,95.0,"{""confidence_bin"":""high""}",,v1.0
3308,6694,6695,"[""chebi:CHEBI:136357""]","[""chebi:CHEBI:133821""]",1,1,Protomers,95.0,"{""confidence_bin"":""high""}",,v1.0
3309,6696,6697,"[""chebi:CHEBI:49269""]","[""chebi:CHEBI:58803""]",1,1,Protomers,96.0,"{""confidence_bin"":""high""}",,v1.0
3310,6698,6699,"[""chebi:CHEBI:17803""]","[""chebi:CHEBI:58277""]",1,1,Protomers,91.0,"{""confidence_bin"":""high""}",,v1.0


We need to link the results back to the original pairs via their pairkeys, so that only protomer pairs actually present in the original dataset are counted.

In [28]:
df_res_protomers['cluster_a_members'] = df_res_protomers['cluster_a_members'].apply(safe_load_members)
df_res_protomers['cluster_b_members'] = df_res_protomers['cluster_b_members'].apply(safe_load_members)

df_res_protomers['pairkey'] = df_res_protomers.apply(
    lambda r: canonical_pairkey(r['cluster_a_members'], r['cluster_b_members']),
    axis=1
)

df_res_protomers

Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
0,1,2,[chebi:CHEBI:52504],[chebi:CHEBI:40617],1,1,Protomers,74.0,"{""confidence_bin"":""medium""}",,v1.0,CHEBI_40617__CHEBI_52504
1,4,5,[chebi:CHEBI:17521],[chebi:CHEBI:29751],1,1,Protomers,91.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17521__CHEBI_29751
2,6,7,[chebi:CHEBI:18381],[chebi:CHEBI:58467],1,1,Protomers,96.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_18381__CHEBI_58467
3,8,9,[chebi:CHEBI:33462],[chebi:CHEBI:16215],1,1,Protomers,89.0,"{""confidence_bin"":""medium""}",,v1.0,CHEBI_16215__CHEBI_33462
4,10,11,[chebi:CHEBI:64808],[chebi:CHEBI:64790],1,1,Protomers,93.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_64790__CHEBI_64808
...,...,...,...,...,...,...,...,...,...,...,...,...
3307,6692,6693,[chebi:CHEBI:17793],[chebi:CHEBI:77992],1,1,Protomers,95.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17793__CHEBI_77992
3308,6694,6695,[chebi:CHEBI:136357],[chebi:CHEBI:133821],1,1,Protomers,95.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_133821__CHEBI_136357
3309,6696,6697,[chebi:CHEBI:49269],[chebi:CHEBI:58803],1,1,Protomers,96.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_49269__CHEBI_58803
3310,6698,6699,[chebi:CHEBI:17803],[chebi:CHEBI:58277],1,1,Protomers,91.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17803__CHEBI_58277
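
The helpers `safe_load_members` and `canonical_pairkey` used in the cell above are not defined in this section. The following is a minimal sketch of what they might look like, inferred only from the pairkeys displayed in the output. It assumes the key is built from the first member of each cluster, with the leading `chebi:` prefix dropped and the two tokens sorted as strings; the real helpers may differ.

```python
import json


def safe_load_members(value):
    """Return cluster members as a list.

    Accepts a JSON-encoded list (as stored in the database), an
    already-parsed list, or a missing value.
    """
    if isinstance(value, list):
        return value
    if value is None:
        return []
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON (e.g. NaN from pandas).
        return []
    return parsed if isinstance(parsed, list) else [parsed]


def canonical_pairkey(members_a, members_b):
    """Build an order-independent key from the first member of each cluster."""
    def to_token(member):
        # "chebi:CHEBI:52504" -> "CHEBI_52504"
        return member.split(":", 1)[1].replace(":", "_")

    # Assumption: plain string sorting, so "CHEBI_155841" sorts before "CHEBI_32810".
    first, second = sorted([to_token(members_a[0]), to_token(members_b[0])])
    return f"{first}__{second}"


members_a = safe_load_members('["chebi:CHEBI:52504"]')
members_b = safe_load_members('["chebi:CHEBI:40617"]')
print(canonical_pairkey(members_a, members_b))
```

Sorting the two tokens makes the key independent of which cluster is listed as A or B, which is what allows the merge against the original pair list to match regardless of order.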


In [29]:
# Now merge using canonical pairkey
merged = df_res_protomers.merge(divese_pairs_df[['pairkey']], on='pairkey', how='inner')

merged

Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
0,1,2,[chebi:CHEBI:52504],[chebi:CHEBI:40617],1,1,Protomers,74.0,"{""confidence_bin"":""medium""}",,v1.0,CHEBI_40617__CHEBI_52504
1,4,5,[chebi:CHEBI:17521],[chebi:CHEBI:29751],1,1,Protomers,91.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17521__CHEBI_29751
2,6,7,[chebi:CHEBI:18381],[chebi:CHEBI:58467],1,1,Protomers,96.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_18381__CHEBI_58467
3,8,9,[chebi:CHEBI:33462],[chebi:CHEBI:16215],1,1,Protomers,89.0,"{""confidence_bin"":""medium""}",,v1.0,CHEBI_16215__CHEBI_33462
4,10,11,[chebi:CHEBI:64808],[chebi:CHEBI:64790],1,1,Protomers,93.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_64790__CHEBI_64808
...,...,...,...,...,...,...,...,...,...,...,...,...
3194,6692,6693,[chebi:CHEBI:17793],[chebi:CHEBI:77992],1,1,Protomers,95.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17793__CHEBI_77992
3195,6694,6695,[chebi:CHEBI:136357],[chebi:CHEBI:133821],1,1,Protomers,95.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_133821__CHEBI_136357
3196,6696,6697,[chebi:CHEBI:49269],[chebi:CHEBI:58803],1,1,Protomers,96.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_49269__CHEBI_58803
3197,6698,6699,[chebi:CHEBI:17803],[chebi:CHEBI:58277],1,1,Protomers,91.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17803__CHEBI_58277


We have to restrict the comparison to cases where both cluster sizes are 1 to avoid discrepancies. This causes a slight loss of data, which will be noted in the methods section of the manuscript.

In [30]:
df_res_protomers['classification'].value_counts()

classification
Protomers        3216
Unclassified       83
Parent-child       11
Diastereomers       1
Enantiomers         1
Name: count, dtype: int64

Very promising results: the vast majority of pairs (3,216 of 3,312) are assigned the classification "Protomers", indicating that the pipeline's predictions are very reliable in these cases. Let's take a closer look at the 'Unclassified' cases to determine the cause; it is possible that they are in fact stereoisomers with different charges.

In [31]:

df_fp_extra_info = df_res_protomers.loc[
    (df_res_protomers['classification'] == 'Unclassified') &
    (df_res_protomers['extra_info'].notnull()),
    :
]
df_fp_extra_info


Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
106,223,224,[chebi:CHEBI:86002],"[chebi:CHEBI:16685, chebi:CHEBI:2012]",1,2,Unclassified,,"{""confidence_bin"":null}",Stereo and charge differ (complex); no classif...,v1.0,CHEBI_16685__CHEBI_86002
107,223,225,[chebi:CHEBI:86002],[chebi:CHEBI:57859],1,1,Unclassified,,"{""confidence_bin"":null}",Stereo and charge differ (complex); no classif...,v1.0,CHEBI_57859__CHEBI_86002
127,262,264,[chebi:CHEBI:15978],[chebi:CHEBI:15943],1,1,Unclassified,,"{""confidence_bin"":null}",SRU presence mismatch; cannot compare,v1.0,CHEBI_15943__CHEBI_15978
128,263,264,[chebi:CHEBI:57597],[chebi:CHEBI:15943],1,1,Unclassified,,"{""confidence_bin"":null}",SRU presence mismatch; cannot compare,v1.0,CHEBI_15943__CHEBI_57597
146,299,300,[chebi:CHEBI:90408],[chebi:CHEBI:87351],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_87351__CHEBI_90408
...,...,...,...,...,...,...,...,...,...,...,...,...
3211,6511,6512,[chebi:CHEBI:133440],[chebi:CHEBI:134518],1,1,Unclassified,,"{""confidence_bin"":null}",Diastereomers must share protonation/charge; n...,v1.0,CHEBI_133440__CHEBI_134518
3219,6528,6529,[chebi:CHEBI:29512],[chebi:CHEBI:194199],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_194199__CHEBI_29512
3286,6657,6658,[chebi:CHEBI:57937],[chebi:CHEBI:16886],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_16886__CHEBI_57937
3290,6658,6660,[chebi:CHEBI:16886],[chebi:CHEBI:77859],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_16886__CHEBI_77859


In [34]:
df_fp_extra_info_charged = df_fp_extra_info[
    df_fp_extra_info['extra_info'].str.contains("charge", na=False)
]
df_fp_non_charged = df_fp_extra_info[
    ~df_fp_extra_info['extra_info'].str.contains("charge", na=False)
]
df_fp_extra_info_charged

Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
106,223,224,[chebi:CHEBI:86002],"[chebi:CHEBI:16685, chebi:CHEBI:2012]",1,2,Unclassified,,"{""confidence_bin"":null}",Stereo and charge differ (complex); no classif...,v1.0,CHEBI_16685__CHEBI_86002
107,223,225,[chebi:CHEBI:86002],[chebi:CHEBI:57859],1,1,Unclassified,,"{""confidence_bin"":null}",Stereo and charge differ (complex); no classif...,v1.0,CHEBI_57859__CHEBI_86002
146,299,300,[chebi:CHEBI:90408],[chebi:CHEBI:87351],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_87351__CHEBI_90408
360,742,743,[chebi:CHEBI:57438],[chebi:CHEBI:15618],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_15618__CHEBI_57438
376,775,776,[chebi:CHEBI:87636],[chebi:CHEBI:87634],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_87634__CHEBI_87636
...,...,...,...,...,...,...,...,...,...,...,...,...
3211,6511,6512,[chebi:CHEBI:133440],[chebi:CHEBI:134518],1,1,Unclassified,,"{""confidence_bin"":null}",Diastereomers must share protonation/charge; n...,v1.0,CHEBI_133440__CHEBI_134518
3219,6528,6529,[chebi:CHEBI:29512],[chebi:CHEBI:194199],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_194199__CHEBI_29512
3286,6657,6658,[chebi:CHEBI:57937],[chebi:CHEBI:16886],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_16886__CHEBI_57937
3290,6658,6660,[chebi:CHEBI:16886],[chebi:CHEBI:77859],1,1,Unclassified,,"{""confidence_bin"":null}",Parent-child stereochemical relationships must...,v1.0,CHEBI_16886__CHEBI_77859


In [35]:
df_fp_non_charged

Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
127,262,264,[chebi:CHEBI:15978],[chebi:CHEBI:15943],1,1,Unclassified,,"{""confidence_bin"":null}",SRU presence mismatch; cannot compare,v1.0,CHEBI_15943__CHEBI_15978
128,263,264,[chebi:CHEBI:57597],[chebi:CHEBI:15943],1,1,Unclassified,,"{""confidence_bin"":null}",SRU presence mismatch; cannot compare,v1.0,CHEBI_15943__CHEBI_57597
1286,2643,2644,[chebi:CHEBI:30916],[chebi:CHEBI:178083],1,1,Unclassified,,"{""confidence_bin"":null}",Radioactivity mismatch; cannot compare,v1.0,CHEBI_178083__CHEBI_30916
1288,2643,2646,[chebi:CHEBI:30916],[chebi:CHEBI:178058],1,1,Unclassified,,"{""confidence_bin"":null}",Radioactivity mismatch; cannot compare,v1.0,CHEBI_178058__CHEBI_30916
1289,2644,2645,[chebi:CHEBI:178083],[chebi:CHEBI:16810],1,1,Unclassified,,"{""confidence_bin"":null}",Radioactivity mismatch; cannot compare,v1.0,CHEBI_16810__CHEBI_178083
1291,2645,2646,[chebi:CHEBI:16810],[chebi:CHEBI:178058],1,1,Unclassified,,"{""confidence_bin"":null}",Radioactivity mismatch; cannot compare,v1.0,CHEBI_16810__CHEBI_178058
1592,3268,3269,[chebi:CHEBI:39745],[chebi:CHEBI:52641],1,1,Unclassified,,"{""confidence_bin"":null}",SRU presence mismatch; cannot compare,v1.0,CHEBI_39745__CHEBI_52641
1593,3268,3270,[chebi:CHEBI:39745],"[chebi:CHEBI:16838, chebi:CHEBI:43474]",1,2,Unclassified,,"{""confidence_bin"":null}",SRU presence mismatch; cannot compare,v1.0,CHEBI_16838__CHEBI_39745


Most of the Unclassified cases arise because the reference pairs are possible stereoisomers rather than truly identical species differing only in protonation, which is how protomers are defined here; the remaining few are flagged for SRU or radioactivity mismatches. There is no way to determine which source is correct, since both sources and methods could introduce mistakes. For evaluation, two separate sets of results will be produced, one including these false positives and one excluding them, to show the difference and keep the results transparent.

### Manual review of the 13 other off-target relationships

In [39]:


# now clean up by removing 'chebi:CHEBI:' and replace with 'chebi:'
df_res_protomers['cluster_a_members'] = df_res_protomers['cluster_a_members'].apply(lambda members: [m.replace("chebi:CHEBI:", "chebi:") for m in members])
df_res_protomers['cluster_b_members'] = df_res_protomers['cluster_b_members'].apply(lambda members: [m.replace("chebi:CHEBI:", "chebi:") for m in members])

df_res_protomers.loc[ df_res_protomers['classification'] == 'Parent-child', : ]


Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
804,1660,1661,[chebi:32812],[chebi:6331],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_32812__CHEBI_6331
862,1768,1770,[chebi:32531],[chebi:32513],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_32513__CHEBI_32531
1385,2839,2840,[chebi:32810],[chebi:155841],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_155841__CHEBI_32810
1755,3590,3591,[chebi:32696],[chebi:32689],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_32689__CHEBI_32696
1766,3606,3607,[chebi:58070],[chebi:58539],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_58070__CHEBI_58539
2293,4680,4681,[chebi:17242],[chebi:27956],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17242__CHEBI_27956
2711,5522,5525,[chebi:57293],[chebi:64285],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_57293__CHEBI_64285
2768,5625,5626,[chebi:14321],[chebi:29986],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_14321__CHEBI_29986
2858,5803,5805,[chebi:17742],[chebi:62213],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_17742__CHEBI_62213
2994,6068,6069,[chebi:32456],[chebi:32442],1,1,Parent-child,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_32442__CHEBI_32456


Upon manual review of all `2D vs 3D structures`, these are in fact all correct classifications by stereomapper. They are not necessarily mistakes by ChEBI; the two sources simply take different approaches to classification, which causes subtle disagreements.

In [40]:
df_res_protomers.loc[ (df_res_protomers['classification'] == 'Diastereomers') |
                     (df_res_protomers['classification'] == "Enantiomers"), : 
]

Unnamed: 0,cluster_a,cluster_b,cluster_a_members,cluster_b_members,cluster_a_size,cluster_b_size,classification,score,score_details,extra_info,version_tag,pairkey
944,1933,1934,[chebi:33817],[chebi:30624],1,1,Diastereomers,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_30624__CHEBI_33817
2712,5523,5524,[chebi:15395],[chebi:64298],1,1,Enantiomers,100.0,"{""confidence_bin"":""high""}",,v1.0,CHEBI_15395__CHEBI_64298


Both of these are correct classifications by stereomapper.

## Evaluation

In [41]:
# get tp, fp, fn counts
tp = len(df_res_protomers[ df_res_protomers['classification'] == 'Protomers' ])
fp = len(df_res_protomers[ df_res_protomers['classification'] != 'Protomers' ])
fn = divese_pairs_df.shape[0] - tp

print(f"TP: {tp}, FP: {fp}, FN: {fn}")

TP: 3216, FP: 96, FN: 149


The majority of false positives have been accounted for. False negatives are likely due either to the structures that were noted to be clustered, resulting in missing comparisons / mappings, or to identifiers that have no downloadable structure from ChEBI. The former is more likely, as the molfiles, which have been validated to exist, are taken from ChEBI.

Wildcard structures have been removed, so they cannot be the cause.

In [42]:
# calculate precision, recall, f1
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0 

print(f"Precision: {precision:.2%}, Recall: {recall:.2%}, F1 Score: {f1:.2%}")

Precision: 97.10%, Recall: 95.57%, F1 Score: 96.33%


Very good results, indicating that the pipeline is highly reliable when predicting protomers. These figures do not exclude the false positives mentioned previously.

In [43]:
# remove false positives which have no classification but have extra_info indicating the diff in scope
df_filtered = df_res_protomers[~(
    (df_res_protomers['classification'] == 'Unclassified') &
    (df_res_protomers['extra_info'].notnull())
)]

# calc tp, fp, fn again
tp_v2 = len(df_filtered[ df_filtered['classification'] == 'Protomers' ])
fp_v2 = len(df_filtered[ df_filtered['classification'] != 'Protomers' ])
fn_v2 = divese_pairs_df.shape[0] - tp_v2

fp_v2 -= 13  # account for cases where there is disagreement in scope (parent-child, diastereomers etc.)

print(f"TP: {tp_v2}, FP: {fp_v2}, FN: {fn_v2}")

TP: 3216, FP: 0, FN: 149
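
The hard-coded 13 in the cell above matches the sum of the Parent-child, Diastereomers and Enantiomers rows in the earlier `value_counts()` output. As a sketch of how the adjustment could be derived rather than typed in, the snippet below rebuilds the false-positive bookkeeping from those counts. It assumes that every Unclassified row carries a non-null `extra_info` and is therefore removed by the filter, which is what the reported result of FP: 0 implies.

```python
# Counts copied from the value_counts() output shown earlier.
counts = {
    "Protomers": 3216,
    "Unclassified": 83,
    "Parent-child": 11,
    "Diastereomers": 1,
    "Enantiomers": 1,
}

# Everything not classified as Protomers counts as a false positive.
fp_all = sum(n for label, n in counts.items() if label != "Protomers")

# Manually reviewed classes treated as disagreements in scope.
scope_labels = ("Parent-child", "Diastereomers", "Enantiomers")
scope_disagreements = sum(counts[label] for label in scope_labels)

# Assumption: all Unclassified rows have extra_info and are filtered out.
fp_after_filter = fp_all - counts["Unclassified"]
fp_v2 = fp_after_filter - scope_disagreements

print(fp_all, scope_disagreements, fp_v2)  # 96 13 0
```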


In [44]:
# calculate precision, recall, f1
precision_v2 = tp_v2 / (tp_v2 + fp_v2) if (tp_v2 + fp_v2) > 0 else 0
recall_v2 = tp_v2 / (tp_v2 + fn_v2) if (tp_v2 + fn_v2) > 0 else 0
f1_v2 = 2 * (precision_v2 * recall_v2) / (precision_v2 + recall_v2) if (precision_v2 + recall_v2) > 0 else 0

print(f"Precision: {precision_v2:.2%}, Recall: {recall_v2:.2%}, F1 Score: {f1_v2:.2%}")

Precision: 100.00%, Recall: 95.57%, F1 Score: 97.74%
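
The two evaluation cells repeat the same arithmetic. A small helper, sketched here under the hypothetical name `prf`, reproduces both sets of figures from the reported TP, FP and FN counts and keeps the zero-division guards in one place.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Counts as printed by the two evaluation cells.
scenarios = {
    "all rows": (3216, 96, 149),
    "scope-adjusted": (3216, 0, 149),
}

for label, (tp, fp, fn) in scenarios.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{label}: Precision: {p:.2%}, Recall: {r:.2%}, F1 Score: {f:.2%}")
```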


This time we see 100% precision, though this was to be expected; recall is unchanged at 95.57%. This result reflects the internal scope of the pipeline, whereas the initial results reflect the real-world scenario, which is certainly more applicable.