## **03 — Build strain-space (S-space) signatures with Chemical Checker (S1.001):**

This notebook builds the strain-space (S-space) feature representation from the unified fitness table by running the Chemical Checker (CC) signature pipeline:
- create the type-0 matrix (fitness per strain),
- fit `sign0` → `sign1` (scaled, PCA-reduced continuous signatures) → `neig1` (nearest-neighbor graph) → `sign2` (128-dimensional embedding),
- export the final type-II signatures for modeling.


### Inputs
- `feature_pipeline/strain_space/inputs/stage1/raw_fitness.csv` (drug identifiers + strain fitness columns)
- Environment variable `CC_CONFIG` pointing to a valid ChemicalChecker `cc_config.json`

### Outputs
- Cached CC dataset artifacts under: `feature_pipeline/strain_space/cache/S1.001/`
  - `sign0_input/S_sign0_input.tsv`
  - `sign0/S_sign0.tsv`
  - `signI/S_sign1.tsv`
  - `signII/S_sign2.tsv`
- Final exported strain-space signatures:
  - `data/features/strain_space_ss/S_sign2.tsv`


# **S space**
### using CC protocol:
1) Pre-processing & finalize type-0 matrix: Decide a dataset name (e.g. S1.001) and lock in the final df_fitness (rows = InChIKeys, cols = strain IDs, values = fitness).

2) Generate type 0 signatures (sign0)
- Load that matrix into a CC instance as sign0 for S1.001.
- Let CC do its internal feature filtering and NA handling on top of your cleaning.

3) Generate type I signatures (sign1): For continuous data (like your fitness values), CC will scale each column (median 0, MAD 1, capped at ±10) and run PCA, keeping enough components to explain ~90% variance.

4) Build type-I nearest-neighbor graph (neig1): Compute nearest neighbors between compounds using type I signatures; this defines the similarity network used in the next step.

5) Generate type II signatures (sign2): Run node2vec on that graph to get a 128-dimensional embedding per compound → these are your S-space type II signatures.

6) (Optional) QC: diagnosis plots: Generate diagnosis plots for sign0/sign1/sign2 to check that the S-space behaves sensibly (value ranges, distance distributions, redundancy, etc.).
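
To make step 3 concrete, here is a minimal NumPy sketch of the scaling and PCA described above (median 0, MAD 1, capped at ±10, then enough components for ~90% variance). It is an illustration only, not CC's implementation: `robust_scale` and `pca_keep_variance` are hypothetical helper names, the toy matrix is random, and the sketch assumes a matrix without missing values.

```python
import numpy as np

def robust_scale(X, cap=10.0):
    """Center each column on its median, divide by its MAD, clip to +/-cap."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard against constant columns
    return np.clip((X - med) / mad, -cap, cap)

def pca_keep_variance(X, var_target=0.9):
    """Project onto the fewest principal components reaching var_target."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(explained, var_target) + 1)
    return Xc @ Vt[:k].T, k

# Toy stand-in for a complete fitness matrix (50 compounds x 10 strains).
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 10))

scaled = robust_scale(toy)
scores, n_components = pca_keep_variance(scaled)
print(scaled.shape, scores.shape, n_components)
```

The real fitness table contains missing values, which CC handles internally during `sign0`/`sign1` fitting; this sketch does not attempt to reproduce that.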

In [None]:
import pandas as pd
import numpy as np
import os
from halo.paths import FEATURE_PIPELINE, SS_FEATURES

# setting up chemical checker protocol
cc_config_path = FEATURE_PIPELINE / "chemicalchecker" / "cc_config.json"
assert cc_config_path.exists(), f"Missing cc_config.json in: {cc_config_path}"
os.environ["CC_CONFIG"] = str(cc_config_path)

from chemicalchecker import ChemicalChecker

In [2]:
raw = pd.read_csv(FEATURE_PIPELINE / "strain_space" / "inputs" / "stage1" / "raw_fitness.csv").copy()
raw.head()

Unnamed: 0,drug,3_letter_code,inchikey,escherichia coli bw25113,escherichia coli iai1,salmonella typhimurium lt2,salmonella typhimurium 14028,pseudomonas aeruginosa pao1,pseudomonas aeruginosa pa14,bacillus subtilis,staphylococcus aureus dsm 20231,staphylococcus aureus newman,streptococcus pneumoniae
0,spiramycin,spm,actoxuheucptew-ceuobaopsa-n,0.73,0.79,0.8,0.84,0.86,0.88,,,,
1,clarithromycin,clr,agoydepgaoxock-kcbohyoisa-n,0.45,0.57,0.6,0.54,0.36,0.44,0.4,0.47,0.39,0.34
2,doxorubicin,dxr,aojjsuzboxzqnb-tzssrymlsa-n,0.81,0.86,0.81,0.84,0.9,0.91,0.85,0.92,0.82,0.92
3,auranofin,aur,aujrcfubupvwsz-xtzhgvarsa-m,,,,,,,0.32,0.37,0.66,0.31
4,teicoplanin,tec,bjnllbuohpvgft-cayrisatsa-n,0.88,0.88,0.85,0.87,0.82,0.83,0.76,0.99,0.87,0.34


In [3]:
raw.shape

(103, 13)

### **1) Preprocessing & finalizing type-0 signatures**

 1) Check if any drug has all missing values:  
 If a row has no fitness info, the drug is useless for S-space.

In [27]:
fitness_cols = ['escherichia coli bw25113', 'escherichia coli iai1', 
                'salmonella typhimurium lt2', 'salmonella typhimurium 14028',
                'pseudomonas aeruginosa pao1', 'pseudomonas aeruginosa pa14',
                'bacillus subtilis', 'staphylococcus aureus dsm 20231', 'staphylococcus aureus newman',
                'streptococcus pneumoniae']

raw[fitness_cols].isna().all(axis=1).sum()

np.int64(0)

In [28]:
raw[fitness_cols].notna().all(axis=1).sum()

np.int64(42)

- Every drug has at least one strain measurement, which is what matters for a CC type-0 row. 42 drugs have complete fitness data across all 10 strains.

2) Check missingness per strain:   
If a strain has >50% missing, CC will remove it.

In [29]:
raw[fitness_cols].isna().mean().sort_values()  # fraction of missing values per strain

staphylococcus aureus dsm 20231    0.242718
escherichia coli bw25113           0.320388
salmonella typhimurium lt2         0.320388
escherichia coli iai1              0.320388
salmonella typhimurium 14028       0.320388
pseudomonas aeruginosa pao1        0.339806
pseudomonas aeruginosa pa14        0.339806
bacillus subtilis                  0.446602
staphylococcus aureus newman       0.446602
streptococcus pneumoniae           0.446602
dtype: float64

3) Check strain variance:  
The CC pipeline marks strains with a very low standard deviation (roughly below 0.05 to 0.1) as redundant, so every strain should sit well above that range.

In [30]:
raw[fitness_cols].std().sort_values()

staphylococcus aureus dsm 20231    0.219115
streptococcus pneumoniae           0.230113
bacillus subtilis                  0.234837
staphylococcus aureus newman       0.252149
salmonella typhimurium 14028       0.309034
salmonella typhimurium lt2         0.314783
escherichia coli bw25113           0.315354
escherichia coli iai1              0.350362
pseudomonas aeruginosa pao1        0.375330
pseudomonas aeruginosa pa14        0.382240
dtype: float64
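
The missingness and variance checks above can be combined into one per-strain report. The sketch below is a hypothetical helper (`strain_qc` is not part of CC or of this project); the 0.5 and 0.05 thresholds are taken from the notes above and are assumptions about CC's behavior, not values read from its code.

```python
import numpy as np
import pandas as pd

def strain_qc(fitness: pd.DataFrame,
              max_missing: float = 0.5,
              min_std: float = 0.05) -> pd.DataFrame:
    """Summarize per-strain missingness and spread, and flag risky columns."""
    report = pd.DataFrame({
        "missing_frac": fitness.isna().mean(),
        "std": fitness.std(),  # NaNs are skipped by default
    })
    report["too_sparse"] = report["missing_frac"] > max_missing
    report["low_variance"] = report["std"] < min_std
    return report

# Toy table: one healthy strain, one mostly missing, one nearly constant.
toy = pd.DataFrame({
    "strain_a": [0.9, 0.5, 0.1, 0.7],
    "strain_b": [np.nan, np.nan, np.nan, 0.8],
    "strain_c": [0.50, 0.51, 0.50, 0.51],
})
qc = strain_qc(toy)
print(qc)
```

On the real table this would be called as `strain_qc(raw[fitness_cols])`; based on the outputs above, no strain would be flagged by either rule.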

4) Check for duplicated strain profiles across drugs:  
Drugs with identical fitness profiles carry redundant information; CC marks such rows as redundant and drops the duplicates.

In [31]:
raw[fitness_cols].duplicated().sum()

np.int64(5)

In [32]:
raw.loc[raw[fitness_cols].duplicated()]

Unnamed: 0,drug,3_letter_code,inchikey,escherichia coli bw25113,escherichia coli iai1,salmonella typhimurium lt2,salmonella typhimurium 14028,pseudomonas aeruginosa pao1,pseudomonas aeruginosa pa14,bacillus subtilis,staphylococcus aureus dsm 20231,staphylococcus aureus newman,streptococcus pneumoniae
55,hexestrol,hex,pbbgszcbwvpool-hdicaceksa-n,,,,,,,,0.83,,
56,promethazine,pro,pwwvaxiegoywee-uhfffaoysa-n,,,,,,,,0.98,,
63,celecoxib,cel,rzekvgvhfleqil-uhfffaoysa-n,,,,,,,,0.94,,
90,gefitinib,gef,xgallcvxezpnrq-uhfffaoysa-n,,,,,,,,0.95,,
100,chlorpromazine,cpr,zpeimtdsqakgnt-uhfffaoysa-n,,,,,,,,0.94,,


- The 5 duplicated rows are all drugs that were tested only on `staphylococcus aureus dsm 20231`: with a single measured value, each one matches an earlier single-strain profile exactly.
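
Profiles with a single measured strain collide easily, because two drugs only need to share one value to become exact duplicates. The toy sketch below (hypothetical drug names and values) shows how to count single-strain rows and how `duplicated` treats the missing entries as equal.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "SA_DSM20231": [0.83, 0.83, 0.94, 0.94, 0.60],
        "SA_NEWMAN": [np.nan, np.nan, np.nan, np.nan, 0.55],
    },
    index=["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"],
)

# Number of strains actually measured per drug.
n_measured = toy.notna().sum(axis=1)
single_strain = n_measured == 1

# keep=False marks every member of a duplicate group, not only the repeats.
dup_any = toy.duplicated(keep=False)
n_repeats = int(toy.duplicated().sum())

print(int(single_strain.sum()), int(dup_any.sum()), n_repeats)
```

Applied to the real table, `raw[fitness_cols].notna().sum(axis=1) == 1` would list every drug measured on exactly one strain, including the first occurrence of each duplicated profile.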

5) Check for duplicated inchikeys:

In [33]:
raw['inchikey'].duplicated().sum()

np.int64(0)

In [34]:
raw.loc[raw['inchikey'].duplicated(keep=False), ['drug', '3_letter_code', 'inchikey']]

Unnamed: 0,drug,3_letter_code,inchikey


In [35]:
raw['inchikey'].duplicated().sum() == 0

np.True_

6) Building type-0 matrix:

In [36]:
df = raw.copy()
df = df[['inchikey'] + fitness_cols].copy()

rename_map = {
    'escherichia coli bw25113': 'ECOLI_BW25113',
    'escherichia coli iai1': 'ECOLI_IAI1',
    'salmonella typhimurium lt2': 'STYPI_LT2',
    'salmonella typhimurium 14028': 'STYPI_14028',
    'pseudomonas aeruginosa pao1': 'PA_PAO1',
    'pseudomonas aeruginosa pa14': 'PA_PA14',
    'bacillus subtilis': 'BSUBTILIS',
    'staphylococcus aureus dsm 20231': 'SA_DSM20231',
    'staphylococcus aureus newman': 'SA_NEWMAN',
    'streptococcus pneumoniae': 'SPNEUMO'
}

df = df.rename(columns=rename_map)
df = df.set_index("inchikey")

df.columns

Index(['ECOLI_BW25113', 'ECOLI_IAI1', 'STYPI_LT2', 'STYPI_14028', 'PA_PAO1',
       'PA_PA14', 'BSUBTILIS', 'SA_DSM20231', 'SA_NEWMAN', 'SPNEUMO'],
      dtype='object')

In [37]:
df.shape

(103, 10)

In [38]:
df.head()

Unnamed: 0_level_0,ECOLI_BW25113,ECOLI_IAI1,STYPI_LT2,STYPI_14028,PA_PAO1,PA_PA14,BSUBTILIS,SA_DSM20231,SA_NEWMAN,SPNEUMO
inchikey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
actoxuheucptew-ceuobaopsa-n,0.73,0.79,0.8,0.84,0.86,0.88,,,,
agoydepgaoxock-kcbohyoisa-n,0.45,0.57,0.6,0.54,0.36,0.44,0.4,0.47,0.39,0.34
aojjsuzboxzqnb-tzssrymlsa-n,0.81,0.86,0.81,0.84,0.9,0.91,0.85,0.92,0.82,0.92
aujrcfubupvwsz-xtzhgvarsa-m,,,,,,,0.32,0.37,0.66,0.31
bjnllbuohpvgft-cayrisatsa-n,0.88,0.88,0.85,0.87,0.82,0.83,0.76,0.99,0.87,0.34


7) Checking numeric / NA again

In [39]:
df = df.apply(pd.to_numeric, errors="coerce")
df = df.replace([np.inf, -np.inf], np.nan)

In [40]:
df.head()

Unnamed: 0_level_0,ECOLI_BW25113,ECOLI_IAI1,STYPI_LT2,STYPI_14028,PA_PAO1,PA_PA14,BSUBTILIS,SA_DSM20231,SA_NEWMAN,SPNEUMO
inchikey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
actoxuheucptew-ceuobaopsa-n,0.73,0.79,0.8,0.84,0.86,0.88,,,,
agoydepgaoxock-kcbohyoisa-n,0.45,0.57,0.6,0.54,0.36,0.44,0.4,0.47,0.39,0.34
aojjsuzboxzqnb-tzssrymlsa-n,0.81,0.86,0.81,0.84,0.9,0.91,0.85,0.92,0.82,0.92
aujrcfubupvwsz-xtzhgvarsa-m,,,,,,,0.32,0.37,0.66,0.31
bjnllbuohpvgft-cayrisatsa-n,0.88,0.88,0.85,0.87,0.82,0.83,0.76,0.99,0.87,0.34


In [41]:
df.shape

(103, 10)

In [42]:
mask = df.isna().all(axis=1)
df = df.loc[~mask].copy()
df.shape

(103, 10)

In [43]:
# inchikey is already the index, so count non-missing values in every strain column
df.notna().sum()

ECOLI_BW25113    70
ECOLI_IAI1       70
STYPI_LT2        70
STYPI_14028      70
PA_PAO1          68
PA_PA14          68
BSUBTILIS        57
SA_DSM20231      78
SA_NEWMAN        57
SPNEUMO          57
dtype: int64
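
Before writing the type-0 input to disk, the cleaning steps above can be bundled with a few explicit assertions. This is a sketch under assumptions: `check_type0_matrix` is a hypothetical helper, and the toy frame only imitates the kinds of problems (text values, infinities, empty rows) that the cleaning is meant to catch.

```python
import numpy as np
import pandas as pd

def check_type0_matrix(mat: pd.DataFrame) -> None:
    """Raise AssertionError if the matrix breaks the assumptions made above."""
    assert mat.index.is_unique, "duplicated compound keys"
    assert all(pd.api.types.is_numeric_dtype(t) for t in mat.dtypes), "non-numeric column"
    assert not np.isinf(mat.to_numpy(dtype=float)).any(), "infinite values present"
    assert not mat.isna().all(axis=1).any(), "compound with no measurements"

# Toy input with a text column, an unparseable entry, and an infinity.
toy = pd.DataFrame(
    {"S1": ["0.7", "n/a", "0.2"], "S2": [0.9, np.inf, 0.4]},
    index=["key_a", "key_b", "key_c"],
)

# Same cleaning as in the cells above: coerce to numeric, inf -> NaN,
# then drop compounds left without any measurement.
clean = toy.apply(pd.to_numeric, errors="coerce").replace([np.inf, -np.inf], np.nan)
clean = clean.loc[~clean.isna().all(axis=1)]

check_type0_matrix(clean)
print(clean)
```

Here `key_b` is dropped because both of its entries become missing after cleaning; on the real table no row is dropped, as the unchanged `(103, 10)` shape shows.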

In [None]:
df.head() # keeping index as inchikey in the first column

Unnamed: 0_level_0,ECOLI_BW25113,ECOLI_IAI1,STYPI_LT2,STYPI_14028,PA_PAO1,PA_PA14,BSUBTILIS,SA_DSM20231,SA_NEWMAN,SPNEUMO
inchikey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
actoxuheucptew-ceuobaopsa-n,0.73,0.79,0.8,0.84,0.86,0.88,,,,
agoydepgaoxock-kcbohyoisa-n,0.45,0.57,0.6,0.54,0.36,0.44,0.4,0.47,0.39,0.34
aojjsuzboxzqnb-tzssrymlsa-n,0.81,0.86,0.81,0.84,0.9,0.91,0.85,0.92,0.82,0.92
aujrcfubupvwsz-xtzhgvarsa-m,,,,,,,0.32,0.37,0.66,0.31
bjnllbuohpvgft-cayrisatsa-n,0.88,0.88,0.85,0.87,0.82,0.83,0.76,0.99,0.87,0.34


In [50]:
# Output path to save the features:
out_path = FEATURE_PIPELINE / "strain_space" / "cache" / "S1.001"
(out_path / "sign0_input").mkdir(parents=True, exist_ok=True)
(out_path / "sign0").mkdir(parents=True, exist_ok=True)
(out_path / "signI").mkdir(parents=True, exist_ok=True)
(out_path / "signII").mkdir(parents=True, exist_ok=True)

In [51]:
df.to_csv(out_path / "sign0_input" / "S_sign0_input.tsv", sep="\t", index=True)
df.head()

Unnamed: 0_level_0,ECOLI_BW25113,ECOLI_IAI1,STYPI_LT2,STYPI_14028,PA_PAO1,PA_PA14,BSUBTILIS,SA_DSM20231,SA_NEWMAN,SPNEUMO
inchikey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
actoxuheucptew-ceuobaopsa-n,0.73,0.79,0.8,0.84,0.86,0.88,,,,
agoydepgaoxock-kcbohyoisa-n,0.45,0.57,0.6,0.54,0.36,0.44,0.4,0.47,0.39,0.34
aojjsuzboxzqnb-tzssrymlsa-n,0.81,0.86,0.81,0.84,0.9,0.91,0.85,0.92,0.82,0.92
aujrcfubupvwsz-xtzhgvarsa-m,,,,,,,0.32,0.37,0.66,0.31
bjnllbuohpvgft-cayrisatsa-n,0.88,0.88,0.85,0.87,0.82,0.83,0.76,0.99,0.87,0.34


### **2) Generate type-0 signature (sign0)**

In [52]:
local_dir = FEATURE_PIPELINE / "strain_space" / "cache" / "S1.001" / "cc_instance"
local_dir.mkdir(parents=True, exist_ok=True)

cc_local = ChemicalChecker(str(local_dir), dbconnect=False)
print("CC instance created!")

CC instance created!


In [53]:
sign0_input = FEATURE_PIPELINE / "strain_space" / "cache" / "S1.001" / "sign0_input" / "S_sign0_input.tsv"
df = pd.read_csv(sign0_input, sep="\t", index_col=0)

print(df.shape)
df.head()

(103, 10)


Unnamed: 0_level_0,ECOLI_BW25113,ECOLI_IAI1,STYPI_LT2,STYPI_14028,PA_PAO1,PA_PA14,BSUBTILIS,SA_DSM20231,SA_NEWMAN,SPNEUMO
inchikey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
actoxuheucptew-ceuobaopsa-n,0.73,0.79,0.8,0.84,0.86,0.88,,,,
agoydepgaoxock-kcbohyoisa-n,0.45,0.57,0.6,0.54,0.36,0.44,0.4,0.47,0.39,0.34
aojjsuzboxzqnb-tzssrymlsa-n,0.81,0.86,0.81,0.84,0.9,0.91,0.85,0.92,0.82,0.92
aujrcfubupvwsz-xtzhgvarsa-m,,,,,,,0.32,0.37,0.66,0.31
bjnllbuohpvgft-cayrisatsa-n,0.88,0.88,0.85,0.87,0.82,0.83,0.76,0.99,0.87,0.34


In [59]:
dataset = "S1.001"

sign0 = cc_local.signature(dataset, "sign0")
sign0.clear_all()

X = df.values                # 2D numpy array, shape (n_compounds, n_strains)
keys = list(df.index)        # inchikeys
features = list(df.columns)  # strain IDs

sign0.fit(X=X, keys=keys, features=features)

print("sign0 done")

Iterating on `V` axis 1: 100%|██████████| 1/1 [00:00<00:00, 344.47it/s]


Features frequency (103, 10)


Iterating on `V` axis 0: 100%|██████████| 1/1 [00:00<00:00, 627.33it/s]
Iterating on `V` axis 0: 100%|██████████| 1/1 [00:00<00:00, 1243.86it/s]
Iterating on `V` axis 1: 100%|██████████| 1/1 [00:00<00:00, 168.87it/s]
Iterating on `V` axis 0: 100%|██████████| 1/1 [00:00<00:00, 535.47it/s]


Flter nans and inf (103, 10)
Filter too many features (103, 10)
sign0 done


### **3) Generate type-I signatures (sign1)**

In [60]:
sign1 = cc_local.signature(dataset, "sign1")
sign1.clear_all()
sign1.fit(sign0)

print("sign1 done")

sign1 done


In [61]:
neig1 = cc_local.signature(dataset, "neig1")
neig1.clear_all()
neig1.fit(sign1)

print("neig1 done")

neig1 done
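
For intuition about what a nearest-neighbor structure like `neig1` holds, the sketch below runs a brute-force k-nearest-neighbor search with cosine distance on a toy matrix. It is an assumption-laden illustration: `knn_cosine` is a hypothetical helper, and CC's own `neig1` may use a different backend, metric, neighbor count, or treatment of self-matches.

```python
import numpy as np

def knn_cosine(X, k=5):
    """Return (indices, distances) of the k nearest rows by cosine distance."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)
    dist = 1.0 - Xn @ Xn.T
    np.fill_diagonal(dist, np.inf)  # exclude self-matches in this sketch
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)

# Toy stand-in for type-I signatures (20 compounds x 4 components).
rng = np.random.default_rng(0)
toy_sign1 = rng.normal(size=(20, 4))

neighbors, distances = knn_cosine(toy_sign1, k=3)
print(neighbors[:3])
print(distances[:3].round(3))
```

Each row of `neighbors` lists the closest compounds for one compound; a graph built from such lists is the kind of similarity network that the type-II embedding step works on.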


In [62]:
print(len(sign1.keys), len(sign1.data))
print(len(neig1.keys), len(neig1.data))

103 103
103 103


### **4) Generate type-II signatures (sign2)**

In [71]:
sign2 = cc_local.signature(dataset, "sign2")
sign2.clear_all()
sign2.fit(sign1, neig1, oos_predictor=False)

print("sign2 done")

[ERROR   ] Problem with LinkPrediction: Insufficient nodes for validation: 82


sign2 done


The logged error:  

[ERROR   ] Problem with LinkPrediction: Insufficient nodes for validation: 82  
- After building sign2, CC attempts link prediction on the graph as a sanity check / diagnostic: how well does the graph predict held-out similarities between compounds? It reports metrics such as AUC, precision, and recall.  
- For that, it needs to split nodes/edges into train/validation/test sets.  
- With a graph this small (~80–100 compounds), there are not enough nodes/edges to carve out a proper validation set, so CC skips the link-prediction statistics and continues.

In short, link-prediction QC is an optional graph sanity check that this dataset is too small to support. The fit itself still completes, as the `sign2 done` line shows.

Saving signatures:

In [None]:
X0 = sign0.data
df_sign0 = pd.DataFrame(X0, index=sign0.keys, columns=sign0.features)

df_sign0.to_csv(out_path / "sign0" / "S_sign0.tsv", sep="\t")
print("saved sign0 to", out_path / "sign0" / "S_sign0.tsv")

saved sign0 to /home/hany/projects/cc_ml/training_data/cc/S1.001/sign0/S_sign0.tsv


In [None]:
X1 = sign1.data
df_sign1 = pd.DataFrame(X1, index=sign1.keys, columns=sign1.features)

df_sign1.to_csv(out_path / "signI" / "S_sign1.tsv", sep="\t")
print("saved sign1 to", out_path / "signI" / "S_sign1.tsv")

saved sign1 to /home/hany/projects/cc_ml/training_data/cc/S1.001/signI/S_sign1.tsv


In [None]:
X2 = sign2.data
df_sign2 = pd.DataFrame(X2, index=sign2.keys, columns=sign2.features)

df_sign2.to_csv(out_path / "signII" / "S_sign2.tsv", sep="\t")
print("saved sign2 to", out_path / "signII" / "S_sign2.tsv")


saved sign2 to /home/hany/projects/cc_ml/training_data/cc/S1.001/signII/S_sign2.tsv


In [None]:
final_out = SS_FEATURES / "S_sign2.tsv"
final_out.parent.mkdir(parents=True, exist_ok=True)
df_sign2.to_csv(final_out, sep="\t")
print("exported final sign2 to", final_out)
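
A quick round-trip check can confirm that an exported TSV reloads with the same keys and values. The sketch below uses a toy frame and a temporary directory so that it runs on its own; the key and column names are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-in for the exported signature table (3 compounds x 4 features).
rng = np.random.default_rng(0)
toy_sign2 = pd.DataFrame(
    rng.normal(size=(3, 4)),
    index=["key_a", "key_b", "key_c"],
    columns=[f"f{i}" for i in range(4)],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "S_sign2.tsv"
    toy_sign2.to_csv(path, sep="\t")
    reloaded = pd.read_csv(path, sep="\t", index_col=0)

same_keys = list(reloaded.index) == list(toy_sign2.index)
same_values = np.allclose(reloaded.to_numpy(), toy_sign2.to_numpy())
print(same_keys, same_values, reloaded.shape)
```

For the real export, the same comparison would be run between `df_sign2` and `pd.read_csv(final_out, sep="\t", index_col=0)`.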