# **Prepare Chemical Checker feature matrix**

This notebook converts the raw Chemical Checker export (`cc_features_raw.csv`, one row per **drug × CC sublevel**) into fixed-length feature tables used by HALO models.

## Inputs
- `data/features/chemicalchecker_cc/cc_features_raw.csv` (via `fetch_cc_features.py`)

## Outputs
- `data/features/chemicalchecker_cc/cc_features_concat_25x128.csv` <br>
  Concatenation of all 25 CC sublevels (A1–E5) → **3200-d** vector per drug (25×128).
- `data/features/chemicalchecker_cc/cc_features_concat_15x128.csv` <br>  
  Concatenation of the first 15 sublevels (A1–C5) → **1920-d** vector per drug (15×128).
- *(optional, if saved)* `data/features/chemicalchecker_cc/cc_features_concat_5x128_by_level.csv`  <br>
  Concatenation of 5 sublevels per top level → **640-d** vector per drug per level (A–E).
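
The concatenation described above can also be expressed as a pivot from the long (drug × sublevel) table to one wide row per drug. The sketch below is an illustration on toy data, not the code this notebook runs: `concat_sublevels` is a hypothetical helper, and it assumes each (drug, sublevel) pair appears at most once (otherwise `unstack` raises an error).

```python
import pandas as pd

def concat_sublevels(long_df, sublevels, dim_cols):
    """Pivot a long (drug x sublevel) table into one wide row per drug."""
    subset = long_df[long_df["level"].isin(sublevels)]
    # After set_index + unstack, columns are a MultiIndex of (dim, level).
    wide = subset.set_index(["drug", "level"])[dim_cols].unstack("level")
    # Reorder to sublevel-major: all dims of the first sublevel, then the next.
    wide = wide.swaplevel(axis=1)
    wide = wide.reindex(columns=pd.MultiIndex.from_product([sublevels, dim_cols]))
    # Drop drugs that are missing any requested sublevel.
    wide = wide.dropna()
    wide.columns = [f"dim_{i}" for i in range(wide.shape[1])]
    return wide.reset_index()

# Toy data: 2 sublevels x 2 dims instead of 25 x 128; drug "z" lacks A2.
toy = pd.DataFrame({
    "drug": ["x", "x", "y", "y", "z"],
    "level": ["A2", "A1", "A1", "A2", "A1"],
    "dim_0": [3.0, 1.0, 5.0, 7.0, 9.0],
    "dim_1": [4.0, 2.0, 6.0, 8.0, 10.0],
})
wide = concat_sublevels(toy, ["A1", "A2"], ["dim_0", "dim_1"])
print(wide)
```

With the real data, passing the 25 sublevel labels and the 128 `dim_*` columns would give 25×128 = 3200 feature columns per drug, independent of the row order in the raw CSV.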


### Notes / assumptions

- Rows are filtered to `fetch_status == "success"`.
- A drug is kept only if **all required sublevels** are present (25-of-25 or 15-of-15), ensuring consistent vector length.
- InChIKeys are standardized to uppercase for stable joins with downstream datasets.
- CC vectors are expected to already be bounded roughly within [-1, 1]; this notebook reports out-of-range counts as a sanity check (it does not rescale CC features).
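
As a small illustration of the last two notes, the sketch below normalizes InChIKeys and counts feature columns with values outside [-1, 1] on a toy table. The helper names `normalize_inchikey` and `count_out_of_range` are hypothetical and not part of `halo`; the toy values are made up for the example.

```python
import pandas as pd

def normalize_inchikey(series):
    """Strip whitespace and uppercase InChIKeys so joins do not depend on case."""
    return series.astype(str).str.strip().str.upper()

def count_out_of_range(df, feature_cols, low=-1.0, high=1.0):
    """Count feature columns that contain at least one value outside [low, high]."""
    values = df[feature_cols].to_numpy(dtype=float)
    outside = (values < low) | (values > high)
    return int(outside.any(axis=0).sum())

# Toy table: one key is lowercase with padding, and dim_1 holds one value above 1.
toy = pd.DataFrame({
    "inchikey": [" bsynrymutxbxsq-uhfffaoysa-n", "LZTCFLDZLBOLDW-UHFFFAOYSA-N "],
    "dim_0": [0.09, -0.09],
    "dim_1": [1.20, 0.05],
})
toy["inchikey"] = normalize_inchikey(toy["inchikey"])
n_bad = count_out_of_range(toy, ["dim_0", "dim_1"])
print(toy["inchikey"].tolist(), n_bad)
```

The count is only a report: values are left untouched, consistent with the note that CC features are not rescaled here.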


In [None]:
import pandas as pd
import numpy as np

from halo.paths import CC_FEATURES
from halo.mappers.drug_mapper import DrugMapper

mapper = DrugMapper()

In [None]:
cc_data = pd.read_csv(CC_FEATURES / "cc_features_raw.csv").copy()

### **Cleaning CC vectors**

In [None]:
cc_data.head()

Unnamed: 0,drug,inchikey,level,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,...,dim_119,dim_120,dim_121,dim_122,dim_123,dim_124,dim_125,dim_126,dim_127,fetch_status
0,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,A1,-0.090141,0.090142,0.089605,0.090142,-0.090142,0.090142,0.090142,...,-0.090032,-0.087664,-0.090142,0.087257,0.090142,-0.090142,-0.089546,-0.090142,0.090142,success
1,acetylsalicylicacid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N,A1,-0.092068,-0.089593,-0.092069,0.092069,0.091756,0.092069,0.092069,...,-0.092069,0.092069,0.061538,0.091995,-0.021338,-0.092069,-0.082015,-0.092064,0.092069,success
2,alahopcin,NTBVVEFUJUCXPF-FYCPLRARSA-N,A1,,,,,,,,...,,,,,,,,,,not_found
3,amikacin,LKCWBDHBTVXHDL-RMDFUYIESA-N,A1,-0.089872,0.089872,0.089872,-0.089872,0.089872,-0.089872,0.089872,...,-0.089872,0.089872,0.089872,0.089872,-0.089872,-0.088932,0.089872,0.089872,0.089872,success
4,amikacin liposomal,LKCWBDHBTVXHDL-CAIQVSFASA-N,A1,,,,,,,,...,,,,,,,,,,not_found


### Cleaning rows with NA values:

In [None]:
# Instantiate DrugMapper only to reuse its check_na helper
mapper = DrugMapper()

vector_cols = [f'dim_{i}' for i in range(128)]
cc_data = mapper.check_na(cc_data, critical_columns=['drug', 'inchikey', 'level'] + vector_cols)

Missing values report (before dropping): dim_0      850
dim_1      850
dim_2      850
dim_3      850
dim_4      850
          ... 
dim_123    850
dim_124    850
dim_125    850
dim_126    850
dim_127    850
Length: 128, dtype: int64


### Drugs that have all 25 sublevels:

Only drugs that have vectors for all 25 sublevels are kept; a drug missing even one of the 25 vectors is dropped. The resulting dataset is used for the scenario 1 models (5 models, one per level) and the scenario 2 model (1 model over all 25 sublevels).
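
A count of unique labels equal to 25 confirms completeness only if every label is one of the expected A1 to E5 sublevels. A stricter variant, sketched below under that concern, compares each drug's labels against the expected set. `ALL_SUBLEVELS` and `drugs_with_sublevels` are hypothetical names introduced for this example, and the toy label `Z9` is made up.

```python
import pandas as pd

# All 25 CC sublevel labels, A1 through E5.
ALL_SUBLEVELS = [f"{letter}{i}" for letter in "ABCDE" for i in range(1, 6)]

def drugs_with_sublevels(long_df, required):
    """Return the sorted drugs whose labels cover every entry in `required`."""
    required = set(required)
    levels_per_drug = long_df.groupby("drug")["level"].apply(set)
    mask = [required.issubset(levels) for levels in levels_per_drug]
    return sorted(levels_per_drug.index[mask])

# Toy data: "x" is complete; "y" has 25 distinct labels but lacks E5.
toy = pd.DataFrame({
    "drug": ["x"] * 25 + ["y"] * 25,
    "level": ALL_SUBLEVELS + ALL_SUBLEVELS[:24] + ["Z9"],
})
complete = drugs_with_sublevels(toy, ALL_SUBLEVELS)
print(complete)
```

Here a count-based check would accept both drugs, while the set-based check keeps only `x`. The same helper covers the 15-sublevel case by passing `ALL_SUBLEVELS[:15]`.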

In [None]:
# Keep only rows whose 128-d vector was fetched successfully
cc_data_success = cc_data[cc_data['fetch_status'] == 'success']

# Keep only drugs that have all 25 sublevels
levels_per_drug = cc_data_success.groupby('drug')['level'].nunique()
drugs_with_all_25_levels = levels_per_drug[levels_per_drug == 25].index

# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
cc_raw_25 = cc_data_success[cc_data_success['drug'].isin(drugs_with_all_25_levels)].copy()
cc_raw_25['inchikey'] = cc_raw_25['inchikey'].astype(str).str.strip().str.upper()

In [None]:
print(f"The number of drugs with all 25 levels of data present: {cc_raw_25['drug'].nunique()}")

The number of drugs with all 25 levels of data present: 261


### Drugs that have the first three levels (A–C, 15 sublevels):

In [None]:
first_15_levels = [
    'A1', 'A2', 'A3', 'A4', 'A5',
    'B1', 'B2', 'B3', 'B4', 'B5',
    'C1', 'C2', 'C3', 'C4', 'C5'
]

cc_data_success = cc_data[
    (cc_data['fetch_status'] == 'success') &
    (cc_data['level'].isin(first_15_levels))
]

levels_per_drug = cc_data_success.groupby('drug')['level'].nunique()
drugs_with_first_15_levels = levels_per_drug[levels_per_drug == 15].index

# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
cc_raw_15 = cc_data_success[cc_data_success['drug'].isin(drugs_with_first_15_levels)].copy()
cc_raw_15['inchikey'] = cc_raw_15['inchikey'].astype(str).str.strip().str.upper()
# cc_raw_15.to_csv('cc_raw_15.csv', index=False)

In [None]:
cc_data_success.groupby('drug')['level'].unique()

drug
a22                       [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
acetylsalicylicacid       [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
alfacalcidol              [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
amikacin                  [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
aminosalicylate sodium    [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
                                                ...                        
valnemulin                [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
vancomycin                [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
vanillin                  [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
viomycin                  [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
virginiamycin m1          [A1, A2, A3, A4, A5, B1, B2, B3, B4, B5, C1, C...
Name: level, Length: 261, dtype: object

In [None]:
print(f"The number of drugs with the first 15 levels: {cc_raw_15['drug'].nunique()}")

The number of drugs with the first 15 levels: 261


### **Concatenating CC vectors**

In [None]:
levels_map = {
    'A': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'B': ['B1', 'B2', 'B3', 'B4', 'B5'],
    'C': ['C1', 'C2', 'C3', 'C4', 'C5'],
    'D': ['D1', 'D2', 'D3', 'D4', 'D5'],
    'E': ['E1', 'E2', 'E3', 'E4', 'E5']
}
vector_cols = [f'dim_{i}' for i in range(128)]

### 25 sublevels into 5 levels:

In [None]:
result_rows = []

for drug in cc_raw_25['drug'].unique():
    drug_data = cc_raw_25[cc_raw_25['drug'] == drug]
    inchikey = drug_data['inchikey'].iloc[0]

    for level, sublevels in levels_map.items():
        # Sort by sublevel so the five 128-d blocks are always concatenated in order (e.g. A1..A5)
        sublevel_rows = drug_data[drug_data['level'].isin(sublevels)].sort_values('level')
        concatenated_vector = sublevel_rows[vector_cols].to_numpy().reshape(-1)
        result_rows.append([drug, inchikey, level] + concatenated_vector.tolist())

vector_cols_concat = [f'dim_{i}' for i in range(128*5)]
features_5_levels = pd.DataFrame(result_rows, columns=['drug', 'inchikey', 'level'] + vector_cols_concat)
features_5_levels['inchikey'] = features_5_levels['inchikey'].astype(str).str.strip().str.upper()

In [None]:
len(features_5_levels)
features_5_levels.head()

Unnamed: 0,drug,inchikey,level,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,...,dim_630,dim_631,dim_632,dim_633,dim_634,dim_635,dim_636,dim_637,dim_638,dim_639
0,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,A,-0.090141,0.090142,0.089605,0.090142,-0.090142,0.090142,0.090142,...,-0.090884,-0.090883,-0.090884,0.090884,-0.090884,0.090883,0.090884,-0.090884,-0.090878,-0.090884
1,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,B,0.106896,0.007379,0.100421,0.100793,-0.113483,-0.114617,0.096072,...,0.094317,-0.092843,0.087745,-0.094319,0.094318,-0.094319,0.094319,-0.093529,-0.094319,0.085881
2,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,C,0.100826,0.099161,0.048998,0.096343,-0.099945,-0.055314,-0.095259,...,-0.088958,-0.088958,-0.088958,0.088958,-0.088958,-0.088958,-0.088958,-0.088958,0.075624,0.088958
3,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,D,-0.090586,-0.090572,0.090585,-0.088503,0.090586,-0.090586,-0.090422,...,0.086173,0.091215,-0.09122,-0.090666,-0.091202,-0.054083,0.091219,0.09122,-0.090715,0.09122
4,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,E,-0.091627,-0.03136,0.098398,-0.098952,0.052909,-0.098256,0.095695,...,-0.062263,0.101168,-0.072766,-0.100151,0.104266,0.104266,0.104267,0.002837,0.104063,0.055747


### all 25 sublevels into one:


In [None]:
result_rows = []

for drug in cc_raw_25['drug'].unique():
    drug_data = cc_raw_25[cc_raw_25['drug'] == drug]
    inchikey = drug_data['inchikey'].iloc[0]

    # Making sure that for each drug, the levels are sorted A through E
    drug_data_sorted = drug_data.sort_values('level') 
    concatenated_vector = drug_data_sorted[vector_cols].to_numpy().reshape(-1)
    result_rows.append([drug, inchikey] + concatenated_vector.tolist())


vector_cols_concat = [f'dim_{i}' for i in range(128*25)]
features_25_levels_into_1 = pd.DataFrame(result_rows, columns=['drug', 'inchikey'] + vector_cols_concat) 
features_25_levels_into_1['inchikey'] = features_25_levels_into_1['inchikey'].astype(str).str.strip().str.upper()
features_25_levels_into_1.to_csv(CC_FEATURES / "features_25_levels_into_1.csv", index=False)

In [None]:
len(features_25_levels_into_1)
features_25_levels_into_1.head()

Unnamed: 0,drug,inchikey,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,...,dim_3190,dim_3191,dim_3192,dim_3193,dim_3194,dim_3195,dim_3196,dim_3197,dim_3198,dim_3199
0,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,-0.090141,0.090142,0.089605,0.090142,-0.090142,0.090142,0.090142,-0.090142,...,-0.062263,0.101168,-0.072766,-0.100151,0.104266,0.104266,0.104267,0.002837,0.104063,0.055747
1,acetylsalicylicacid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N,-0.092068,-0.089593,-0.092069,0.092069,0.091756,0.092069,0.092069,-0.092069,...,0.089002,0.089002,0.089002,-0.089002,0.089002,0.089002,0.088994,-0.089002,-0.089002,0.089002
2,amikacin,LKCWBDHBTVXHDL-RMDFUYIESA-N,-0.089872,0.089872,0.089872,-0.089872,0.089872,-0.089872,0.089872,0.089872,...,0.091894,-0.091891,0.091894,0.091894,-0.090335,-0.09036,-0.065358,-0.091894,0.091894,-0.091894
3,aminosalicylate sodium,GMUQJDAYXZXBOT-UHFFFAOYSA-M,-0.091514,0.064881,0.091503,0.091512,0.091503,0.091514,0.091503,-0.091514,...,0.138266,-0.073934,0.038391,0.134433,-0.048002,-0.111298,-0.114748,0.04023,-0.062523,-0.028201
4,aminosalicylic acid,WUBBRNOQWQTFEX-UHFFFAOYSA-N,-0.091677,-0.091677,-0.091677,0.091619,-0.07631,0.091244,0.087477,-0.091676,...,0.089803,-0.089803,-0.089803,0.089803,-0.089803,-0.089803,0.089803,-0.089803,-0.089803,0.089803


### first 15 sublevels (first 3 levels) into one:

In [None]:
result_rows = []

for drug in cc_raw_15['drug'].unique():
    drug_data = cc_raw_15[cc_raw_15['drug'] == drug]
    inchikey = drug_data['inchikey'].iloc[0]

    # cc_raw_15 already contains only A1-C5; sort so every drug is concatenated in the same order
    drug_data_sorted = drug_data.sort_values('level')
    concatenated_vector = drug_data_sorted[vector_cols].to_numpy().reshape(-1)
    result_rows.append([drug, inchikey] + concatenated_vector.tolist())


vector_cols_concat = [f'dim_{i}' for i in range(128*15)]
features_15_levels_into_1 = pd.DataFrame(result_rows, columns=['drug', 'inchikey'] + vector_cols_concat) 
features_15_levels_into_1['inchikey'] = features_15_levels_into_1['inchikey'].astype(str).str.strip().str.upper()
features_15_levels_into_1.to_csv(CC_FEATURES / "features_15_levels_into_1.csv", index=False)

In [None]:
features_15_levels_into_1.head()
# len(features_15_levels_into_1)

Unnamed: 0,drug,inchikey,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,...,dim_1910,dim_1911,dim_1912,dim_1913,dim_1914,dim_1915,dim_1916,dim_1917,dim_1918,dim_1919
0,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N,-0.090141,0.090142,0.089605,0.090142,-0.090142,0.090142,0.090142,-0.090142,...,-0.088958,-0.088958,-0.088958,0.088958,-0.088958,-0.088958,-0.088958,-0.088958,0.075624,0.088958
1,acetylsalicylicacid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N,-0.092068,-0.089593,-0.092069,0.092069,0.091756,0.092069,0.092069,-0.092069,...,-0.089182,0.089061,-0.089182,-0.089182,0.089182,0.089182,-0.073587,-0.089182,0.089182,-0.089182
2,amikacin,LKCWBDHBTVXHDL-RMDFUYIESA-N,-0.089872,0.089872,0.089872,-0.089872,0.089872,-0.089872,0.089872,0.089872,...,0.069477,0.076226,-0.037971,0.114151,0.114751,-0.112671,0.112541,0.014948,-0.109783,0.043254
3,aminosalicylate sodium,GMUQJDAYXZXBOT-UHFFFAOYSA-M,-0.091514,0.064881,0.091503,0.091512,0.091503,0.091514,0.091503,-0.091514,...,-0.097063,-0.078369,-0.061149,0.104115,0.074525,-0.096866,-0.111583,-0.059589,-0.10958,-0.112264
4,aminosalicylic acid,WUBBRNOQWQTFEX-UHFFFAOYSA-N,-0.091677,-0.091677,-0.091677,0.091619,-0.07631,0.091244,0.087477,-0.091676,...,0.089816,-0.089816,0.089816,-0.089816,-0.089816,0.089816,0.089816,-0.089816,-0.089816,-0.089816


### **Checking if features are scaled**

In [None]:
feat_cols = [c for c in features_25_levels_into_1.columns if c not in ['drug', 'inchikey']]
min_vals = np.min(features_25_levels_into_1[feat_cols], axis=0)
max_vals = np.max(features_25_levels_into_1[feat_cols], axis=0)

scaled_check = np.all((min_vals >= -1) & (max_vals <= 1))

if scaled_check:
    print("All features lie within [-1, 1].")
else:
    print("Some features fall outside [-1, 1].")

All features lie within [-1, 1].


In [None]:
out_of_range = np.sum((min_vals < -1) | (max_vals > 1))
print(f"{out_of_range} out of {features_25_levels_into_1[feat_cols].shape[1]} features exceed [-1, 1] range")

0 out of 3200 features exceed [-1, 1] range
