# **Build interim datasets** (curated drug–drug interaction tables)

This notebook harmonizes drug identifiers and strain names across source datasets and exports curated “interim” tables used downstream for model training/testing and external validation.

## Inputs
- Extracted source tables under `data/b_extracted/`:
  - Brochado (training/test; Bliss score)
  - Cacace (training/test; Bliss score)
  - ACDB (training/test; Bliss-only subset)
  - Chandrasekaran (external validation; Loewe α)
- Drug reference list: `data/reference/drug_lists/list_antibacterial.csv`
- CC availability tables:
  - `data/features/chemicalchecker_cc/features_25_levels_into_1.csv`
  - `data/features/chemicalchecker_cc/features_15_levels_into_1.csv` (optional)

## Outputs (written to `data/c_interim/`)
- `source_a_brochado/brochado_cleaned_data.csv`
- `source_b_cacace/cacace_cleaned_data.csv`
- `source_c_acdb/acdb_cleaned_data_bliss.csv`
- `source_d_chandrasekaran/chandrasekaran_cleaned_data.csv`

## Notes
- `list_antibacterial.csv` intentionally includes alternative names mapped to the same InChIKey; these aliases help match naming in heterogeneous sources.
- All exports are filtered to compounds with available CC features (25-level file) for downstream modeling consistency.
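
As a rough illustration of the alias note above, the sketch below resolves differently spelled names to a single InChIKey with a plain pandas lookup. The toy rows are copied from the head of `list_antibacterial.csv` shown in this notebook; the helper name `name_to_inchikey` is hypothetical and is not part of `DrugMapper`.

```python
import pandas as pd

# Toy slice of the reference list: three spellings share one InChIKey.
aliases = pd.DataFrame(
    {
        "drug": [
            "acetylsalicylic acid",
            "acetylsalicylicacid",
            "acetylsalisylic acid",
            "a22",
        ],
        "inchikey": [
            "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
            "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
            "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
            "LZTCFLDZLBOLDW-UHFFFAOYSA-N",
        ],
    }
)

lookup = aliases.set_index("drug")["inchikey"]


def name_to_inchikey(raw_name):
    """Hypothetical helper: normalise a raw name, then look it up (None if unlisted)."""
    key = raw_name.strip().lower()
    return lookup.get(key)


resolved = [
    name_to_inchikey(name)
    for name in ["Acetylsalisylic acid", " acetylsalicylicacid ", "unknown drug"]
]
```

The first two spellings resolve to the same key, while an unlisted name yields `None` and can be reported as missing.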

In [1]:
import pandas as pd

from halo.paths import DRUG_LISTS, CC_FEATURES, EXTRACTED, INTERIM
INTERIM.mkdir(parents=True, exist_ok=True)

In [2]:
from halo.mappers.drug_mapper import DrugMapper

mapper = DrugMapper()

In [3]:
list_antibacterial = pd.read_csv(DRUG_LISTS / "list_antibacterial.csv").copy()
list_antibacterial.head()

Unnamed: 0,drug,inchikey
0,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N
1,acetylsalicylic acid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
2,acetylsalicylicacid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
3,acetylsalisylic acid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
4,alahopcin,NTBVVEFUJUCXPF-FYCPLRARSA-N


In [4]:
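features_25_levels_into_1 = pd.read_csv(CC_FEATURES / "features_25_levels_into_1.csv")

# The 15-level file is optional (see Inputs), so load it only when present.
cc15_path = CC_FEATURES / "features_15_levels_into_1.csv"
features_15_levels_into_1 = pd.read_csv(cc15_path) if cc15_path.exists() else None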

In [5]:
fixed_strains = {
    "e. coli bw25113": "escherichia coli bw25113",
    "e. coli iai1": "escherichia coli iai1",
    "st lt2": "salmonella typhimurium lt2",
    "st 14028": "salmonella typhimurium 14028",
    "pao1": "pseudomonas aeruginosa pao1",
    "pa14": "pseudomonas aeruginosa pa14",
    "escherichia coli iai1": "escherichia coli iai1",
    "escherichia coli bw25113": "escherichia coli bw25113",
    "s. aureus dsm 20231": "staphylococcus aureus dsm 20231",
    "s. aureus newman": "staphylococcus aureus newman",
    "s. pneumoniae": "streptococcus pneumoniae",
    "salmonella typhimurium lt2": "salmonella typhimurium lt2",
    "salmonella typhimurium 14028": "salmonella typhimurium 14028",
    "b. subtilis": "bacillus subtilis",
    "pseudomonas aeruginosa pao1": "pseudomonas aeruginosa pao1",
    "pseudomonas aeruginosa pa14": "pseudomonas aeruginosa pa14"
}

fixed_species = {
    "escherichia coli bw25113": "escherichia coli",
    "escherichia coli iai1": "escherichia coli",
    "salmonella typhimurium lt2": "salmonella typhimurium",
    "salmonella typhimurium 14028": "salmonella typhimurium",
    "pseudomonas aeruginosa pao1": "pseudomonas aeruginosa",
    "pseudomonas aeruginosa pa14": "pseudomonas aeruginosa",
    "staphylococcus aureus dsm 20231": "staphylococcus aureus",
    "staphylococcus aureus newman": "staphylococcus aureus",
    "streptococcus pneumoniae": "streptococcus pneumoniae",
    "bacillus subtilis": "bacillus subtilis"
}

abbrev_to_name = {
    "AMK": "amikacin",
    "GEN": "gentamicin",
    "TOB": "tobramycin",
    "TET": "tetracycline",
    "CHL": "chloramphenicol",
    "CLA": "clarithromycin",
    "ERY": "erythromycin",
    "CIP": "ciprofloxacin",
    "LEV": "levofloxacin",
    "NAL": "nalidixic acid",
    "TRI": "trimethoprim",
    "OXA": "oxacillin",
    "CEF": "cefoxitin",
    "NIT": "nitrofurantoin",
    "FUS": "fusidic acid",
    "RIF": "rifampicin",
    "VAN": "vancomycin",
    "SPE": "spectinomycin",
}

# **Chandrasekaran_data**

In [6]:
chan_data = pd.read_csv(EXTRACTED / "source_d_chandrasekaran" / "chandrasekaran_data.csv").copy()
chan_data.head()

Unnamed: 0,Drug A code,Drug B code,Experimental Interaction Score
0,AMK,CEF,1.3735
1,AMK,CHL,2.7295
2,AMK,CIP,-0.024
3,AMK,CLA,-0.326
4,AMK,ERY,1.4335


In [7]:
chan_data.shape

(171, 3)

In [8]:
chan_data['Drug A'] = chan_data['Drug A code'].map(abbrev_to_name).str.lower().str.strip()
chan_data['Drug B'] = chan_data['Drug B code'].map(abbrev_to_name).str.lower().str.strip()
chan_data['Experimental Interaction Score'] = pd.to_numeric(chan_data['Experimental Interaction Score'], errors='coerce')
chan_data['Experimental Interaction Score'] = chan_data['Experimental Interaction Score'].round(3)

# Drop rows whose drug code is not in abbrev_to_name (unmapped codes become NaN).
chan_data = chan_data.dropna(subset=['Drug A', 'Drug B'])

chan_data.head()

Unnamed: 0,Drug A code,Drug B code,Experimental Interaction Score,Drug A,Drug B
0,AMK,CEF,1.374,amikacin,cefoxitin
1,AMK,CHL,2.73,amikacin,chloramphenicol
2,AMK,CIP,-0.024,amikacin,ciprofloxacin
3,AMK,CLA,-0.326,amikacin,clarithromycin
4,AMK,ERY,1.434,amikacin,erythromycin


In [9]:
chan_data = mapper.inspect_and_clean(chan_data)
all_chan_compounds = mapper.compounds_list(chan_data)

number of initial combinations in this dataset: 153
number of unique antibacterial/antiviral compounds in this dataset:18


In [10]:
chan_data = mapper.enrich(chan_data, list_antibacterial)
chan_missing_compounds = mapper.missing_compounds(chan_data, dataset_name='chan')
chan_missing_compounds

missing Drug A Inchikeys: 0
missing Drug B Inchikeys: 0


[]

In [11]:
chan_data['Drug Pair'] = chan_data.apply(lambda x: tuple(sorted([x['Drug A Inchikey'], x['Drug B Inchikey']])), axis=1)

In [12]:
chan_data.shape

(153, 8)

In [13]:
chan_data.head()

Unnamed: 0,Drug A code,Drug B code,Experimental Interaction Score,Drug A,Drug B,Drug A Inchikey,Drug B Inchikey,Drug Pair
0,AMK,CEF,1.374,amikacin,cefoxitin,LKCWBDHBTVXHDL-RMDFUYIESA-N,WZOZEZRFJCJXNZ-ZBFHGGJFSA-N,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, WZOZEZRFJCJXNZ-Z..."
1,AMK,CHL,2.73,amikacin,chloramphenicol,LKCWBDHBTVXHDL-RMDFUYIESA-N,WIIZWVCIJKGZOK-RKDXNWHRSA-N,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, WIIZWVCIJKGZOK-R..."
2,AMK,CIP,-0.024,amikacin,ciprofloxacin,LKCWBDHBTVXHDL-RMDFUYIESA-N,MYSWGUAQZAJSOK-UHFFFAOYSA-N,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, MYSWGUAQZAJSOK-U..."
3,AMK,CLA,-0.326,amikacin,clarithromycin,LKCWBDHBTVXHDL-RMDFUYIESA-N,AGOYDEPGAOXOCK-KCBOHYOISA-N,"(AGOYDEPGAOXOCK-KCBOHYOISA-N, LKCWBDHBTVXHDL-R..."
4,AMK,ERY,1.434,amikacin,erythromycin,LKCWBDHBTVXHDL-RMDFUYIESA-N,ULGZDMOVFRHVEP-RWJQBGPGSA-N,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, ULGZDMOVFRHVEP-R..."


In [14]:
chan_cleaned_data = chan_data[['Drug A', 'Drug B', 'Drug A Inchikey', 'Drug B Inchikey', 'Experimental Interaction Score', 'Drug Pair']].copy()
chan_cleaned_data.head()

Unnamed: 0,Drug A,Drug B,Drug A Inchikey,Drug B Inchikey,Experimental Interaction Score,Drug Pair
0,amikacin,cefoxitin,LKCWBDHBTVXHDL-RMDFUYIESA-N,WZOZEZRFJCJXNZ-ZBFHGGJFSA-N,1.374,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, WZOZEZRFJCJXNZ-Z..."
1,amikacin,chloramphenicol,LKCWBDHBTVXHDL-RMDFUYIESA-N,WIIZWVCIJKGZOK-RKDXNWHRSA-N,2.73,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, WIIZWVCIJKGZOK-R..."
2,amikacin,ciprofloxacin,LKCWBDHBTVXHDL-RMDFUYIESA-N,MYSWGUAQZAJSOK-UHFFFAOYSA-N,-0.024,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, MYSWGUAQZAJSOK-U..."
3,amikacin,clarithromycin,LKCWBDHBTVXHDL-RMDFUYIESA-N,AGOYDEPGAOXOCK-KCBOHYOISA-N,-0.326,"(AGOYDEPGAOXOCK-KCBOHYOISA-N, LKCWBDHBTVXHDL-R..."
4,amikacin,erythromycin,LKCWBDHBTVXHDL-RMDFUYIESA-N,ULGZDMOVFRHVEP-RWJQBGPGSA-N,1.434,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, ULGZDMOVFRHVEP-R..."


In [15]:
chan_cleaned_data = mapper.check_na(chan_cleaned_data, critical_columns=['Drug A Inchikey', 'Drug B Inchikey', 'Experimental Interaction Score'])

No missing rows found in the critical columns.


### Chandrasekaran external set (Loewe α)
We convert Loewe α into binary labels using the published thresholds:
- synergy: α ≤ -0.5
- antagonism: α ≥ 1.0

Values in between are excluded (ambiguous / additive range).

In [16]:
def alpha_to_label(a):
    if a <= -0.5:
        return "synergy"
    elif a >= 1.0:
        return "antagonism"
    else:
        return None

chan_cleaned_data["Interaction Type"] = chan_cleaned_data["Experimental Interaction Score"].apply(alpha_to_label)
chan_cleaned_data = chan_cleaned_data[chan_cleaned_data["Interaction Type"].notna()].copy()

In [17]:
chan_cleaned_data.head()

Unnamed: 0,Drug A,Drug B,Drug A Inchikey,Drug B Inchikey,Experimental Interaction Score,Drug Pair,Interaction Type
0,amikacin,cefoxitin,LKCWBDHBTVXHDL-RMDFUYIESA-N,WZOZEZRFJCJXNZ-ZBFHGGJFSA-N,1.374,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, WZOZEZRFJCJXNZ-Z...",antagonism
1,amikacin,chloramphenicol,LKCWBDHBTVXHDL-RMDFUYIESA-N,WIIZWVCIJKGZOK-RKDXNWHRSA-N,2.73,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, WIIZWVCIJKGZOK-R...",antagonism
4,amikacin,erythromycin,LKCWBDHBTVXHDL-RMDFUYIESA-N,ULGZDMOVFRHVEP-RWJQBGPGSA-N,1.434,"(LKCWBDHBTVXHDL-RMDFUYIESA-N, ULGZDMOVFRHVEP-R...",antagonism
5,amikacin,fusidic acid,LKCWBDHBTVXHDL-RMDFUYIESA-N,IECPWNUMDGFDKC-MZJAQBGESA-N,-0.55,"(IECPWNUMDGFDKC-MZJAQBGESA-N, LKCWBDHBTVXHDL-R...",synergy
12,amikacin,rifampicin,LKCWBDHBTVXHDL-RMDFUYIESA-N,JQXXHWHPUNPDRT-WLSIYKJHSA-N,1.1,"(JQXXHWHPUNPDRT-WLSIYKJHSA-N, LKCWBDHBTVXHDL-R...",antagonism


In [18]:
chan_cleaned_data = mapper.filter_cc_missing(features_25_levels_into_1, chan_cleaned_data)
(INTERIM / "source_d_chandrasekaran").mkdir(parents=True, exist_ok=True)
chan_cleaned_data.to_csv(INTERIM / "source_d_chandrasekaran" / "chandrasekaran_cleaned_data.csv", index=False)

The number of final combinations is: 70
The inchikeys missing from feature set are: ['NWXMGUDVXFXRIG-WESIUVDSSA-N']
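
The CC-availability filter can be pictured roughly as follows. This is a minimal sketch, not the actual `DrugMapper.filter_cc_missing` implementation: it assumes the feature table exposes an `inchikey` column (a hypothetical name), uses placeholder keys such as `AAA`, and keeps a pair only when both compounds have features.

```python
import pandas as pd

# Toy feature table; the column name `inchikey` is an assumption.
features = pd.DataFrame({"inchikey": ["AAA", "BBB", "CCC"]})

pairs = pd.DataFrame(
    {
        "Drug A Inchikey": ["AAA", "AAA", "BBB"],
        "Drug B Inchikey": ["BBB", "DDD", "CCC"],
    }
)

available = set(features["inchikey"])

# Keep a row only if both compounds have CC features.
has_both = (
    pairs["Drug A Inchikey"].isin(available)
    & pairs["Drug B Inchikey"].isin(available)
)
filtered = pairs[has_both].reset_index(drop=True)

# Report which InChIKeys caused rows to be dropped.
used = set(pairs["Drug A Inchikey"]) | set(pairs["Drug B Inchikey"])
missing = sorted(used - available)
```

Under this reading, a single compound without features removes every combination it takes part in, which would explain why one missing InChIKey can noticeably shrink a small dataset.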


# **Brochado_data**

In [19]:
brochado_data = pd.read_csv(EXTRACTED / "source_a_brochado" / "brochado_data.csv").copy()
brochado_data.head()

Unnamed: 0,Drug 1,Drug 2,E. coli BW25113,E. coli iAi1,ST LT2,ST 14028,PAO1,PA14,p-value E. coli BW25113,p-value E. coli iAi1,p-value ST LT2,p-value ST 14028,p-value PAO1,p-value PA14,interaction E. coli BW25113,interaction E. coli iAi1,interaction ST LT2,interaction ST14028,interaction PAO1,interaction PA14
0,Amoxicillin,Oxacillin,-0.548378,-0.651226,,,,,0.001666,0.001463,,,,,Synergy,Synergy,,,,
1,Amoxicillin,Cefsulodin,-0.457704,-0.510561,,,,,0.001666,0.001463,,,,,Synergy,Synergy,,,,
2,Amoxicillin,Trimethoprim,-0.168982,0.092046,,,,,0.001666,,,,,,Synergy,,,,,
3,Amoxicillin,Acetylsalisylic acid,-0.2265,0.084217,,,,,0.001666,,,,,,Synergy,,,,,
4,Chloramphenicol,Fusidic acid,-0.160244,-0.327036,,,,,0.001666,0.001463,,,,,Synergy,Synergy,,,,


In [20]:
len(brochado_data)

1079

Rename the drug columns to `Drug A` and `Drug B`:

In [21]:
brochado_data.rename(columns={'Drug 1': 'Drug A', 'Drug 2': 'Drug B'}, inplace=True)

Reshape `brochado_data` from wide format (one column per strain) to long format (one row per drug pair and strain):

In [22]:
# Mapping columns
strains = ["E. coli BW25113", "E. coli iAi1", "ST LT2", "ST 14028", "PAO1", "PA14"]

bliss_cols = strains

pval_cols_map = {
    "p-value E. coli BW25113": "E. coli BW25113",
    "p-value E. coli iAi1": "E. coli iAi1",
    "p-value ST LT2": "ST LT2",
    "p-value ST 14028": "ST 14028",
    "p-value PAO1": "PAO1",
    "p-value PA14": "PA14",
}

interaction_cols_map = {
    "interaction E. coli BW25113": "E. coli BW25113",
    "interaction E. coli iAi1": "E. coli iAi1",
    "interaction ST LT2": "ST LT2",
    "interaction ST14028": "ST 14028",
    "interaction PAO1": "PAO1",
    "interaction PA14": "PA14",
}

bliss_df = brochado_data.melt(
    id_vars=["Drug A", "Drug B"],
    value_vars=bliss_cols,
    var_name="Strain",
    value_name="Bliss Score",
)

_p = brochado_data.melt(
    id_vars=["Drug A", "Drug B"],
    value_vars=list(pval_cols_map.keys()),
    var_name="col",
    value_name="p-value",
)
_p["Strain"] = _p["col"].map(pval_cols_map)
pvals_df = _p.drop(columns="col")

_i = brochado_data.melt(
    id_vars=["Drug A", "Drug B"],
    value_vars=list(interaction_cols_map.keys()),
    var_name="col",
    value_name="Interaction Type",
)
_i["Strain"] = _i["col"].map(interaction_cols_map)
inter_df = _i.drop(columns="col")

from functools import reduce
brochado_data = reduce(
    lambda left, right: left.merge(right, on=["Drug A", "Drug B", "Strain"], how="left"),
    [bliss_df, pvals_df, inter_df],
)

In [23]:
# brochado_data = brochado_data[['Drug A', 'Drug B', 'Strain', 'Bliss Score', 'p-value', 'Interaction Type']]
brochado_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Bliss Score,p-value,Interaction Type
0,Amoxicillin,Oxacillin,E. coli BW25113,-0.548378,0.001666,Synergy
1,Amoxicillin,Cefsulodin,E. coli BW25113,-0.457704,0.001666,Synergy
2,Amoxicillin,Trimethoprim,E. coli BW25113,-0.168982,0.001666,Synergy
3,Amoxicillin,Acetylsalisylic acid,E. coli BW25113,-0.2265,0.001666,Synergy
4,Chloramphenicol,Fusidic acid,E. coli BW25113,-0.160244,0.001666,Synergy


In [24]:
len(brochado_data)

6474
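
As a quick sanity check on the reshape, each wide row should yield one long row per strain, which matches 1079 × 6 = 6474 here. A toy sketch of the same `melt` pattern, using made-up drug names and values:

```python
import pandas as pd

# Toy wide table: one row per drug pair, one score column per strain.
wide = pd.DataFrame(
    {
        "Drug A": ["d1", "d1"],
        "Drug B": ["d2", "d3"],
        "PAO1": [-0.2, None],
        "PA14": [0.1, 0.3],
    }
)

strain_cols = ["PAO1", "PA14"]

long_df = wide.melt(
    id_vars=["Drug A", "Drug B"],
    value_vars=strain_cols,
    var_name="Strain",
    value_name="Bliss Score",
)

# melt keeps untested combinations as NaN rows, so the count is pairs x strains.
n_expected = len(wide) * len(strain_cols)
n_missing = int(long_df["Bliss Score"].isna().sum())
```

The NaN rows produced this way are the ones a later missing-value check on `Bliss Score` would report and drop.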

In [25]:
brochado_data = mapper.inspect_and_clean(brochado_data)
all_brochado_compounds = mapper.compounds_list(brochado_data)

number of initial combinations in this dataset: 6474
number of unique antibacterial/antiviral compounds in this dataset:79


In [26]:
brochado_data = mapper.enrich(brochado_data, list_antibacterial)
brochado_missing_compounds = mapper.missing_compounds(brochado_data, dataset_name='brochado')

missing Drug A Inchikeys: 0
missing Drug B Inchikeys: 0


In [27]:
# brochado_missing_drugs
len(brochado_missing_compounds)

0

In [28]:
brochado_cleaned_data = mapper.check_na(brochado_data, critical_columns=['Drug A', 'Drug B', 'Strain', 'Bliss Score', 'Drug A Inchikey', 'Drug B Inchikey'])

Missing values report (before dropping): Bliss Score    3434
dtype: int64


### Fixing `Strain` and adding `Specie`:

In [29]:
brochado_cleaned_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Bliss Score,p-value,Interaction Type,Drug A Inchikey,Drug B Inchikey
0,amoxicillin,oxacillin,E. coli BW25113,-0.548378,0.001666,Synergy,LSQZJLSUYDQPKJ-NJBDSQKTSA-N,UWYHMGVUTGAWSP-JKIFEVAISA-N
1,amoxicillin,cefsulodin,E. coli BW25113,-0.457704,0.001666,Synergy,LSQZJLSUYDQPKJ-NJBDSQKTSA-N,SYLKGLMBLAAGSC-QLVMHMETSA-N
2,amoxicillin,trimethoprim,E. coli BW25113,-0.168982,0.001666,Synergy,LSQZJLSUYDQPKJ-NJBDSQKTSA-N,IEDVJHCEMCRBQM-UHFFFAOYSA-N
3,amoxicillin,acetylsalisylic acid,E. coli BW25113,-0.2265,0.001666,Synergy,LSQZJLSUYDQPKJ-NJBDSQKTSA-N,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
4,chloramphenicol,fusidic acid,E. coli BW25113,-0.160244,0.001666,Synergy,WIIZWVCIJKGZOK-RKDXNWHRSA-N,IECPWNUMDGFDKC-MZJAQBGESA-N


In [30]:
brochado_cleaned_data['Strain'].value_counts()

Strain
E. coli BW25113    619
E. coli iAi1       619
ST LT2             520
ST 14028           520
PAO1               381
PA14               381
Name: count, dtype: int64

In [31]:
brochado_cleaned_data['Strain'] = brochado_cleaned_data['Strain'].astype(str).str.strip().str.lower()
brochado_cleaned_data['Strain'] = brochado_cleaned_data['Strain'].replace(fixed_strains)
brochado_cleaned_data['Strain'] = brochado_cleaned_data['Strain'].astype(str).str.strip().str.lower()

In [32]:
brochado_cleaned_data['Strain'].value_counts()

Strain
escherichia coli bw25113        619
escherichia coli iai1           619
salmonella typhimurium lt2      520
salmonella typhimurium 14028    520
pseudomonas aeruginosa pao1     381
pseudomonas aeruginosa pa14     381
Name: count, dtype: int64

In [33]:
brochado_cleaned_data['Specie'] = brochado_cleaned_data['Strain'].map(fixed_species)
brochado_cleaned_data['Specie'] = brochado_cleaned_data['Specie'].astype(str).str.strip().str.lower()

In [34]:
brochado_cleaned_data['Specie'].value_counts()

Specie
escherichia coli          1238
salmonella typhimurium    1040
pseudomonas aeruginosa     762
Name: count, dtype: int64
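
One caveat of the `map(...)` followed by `astype(str)` pattern above: a strain that is absent from the dictionary maps to a missing value, and in pandas 2.x the string cast then turns it into the literal text `nan`, which a later NA check would not catch. A small sketch of a guard that runs before the cast, using a toy dictionary and a made-up strain name:

```python
import pandas as pd

species_map = {"escherichia coli bw25113": "escherichia coli"}  # toy subset
strains = pd.Series(["escherichia coli bw25113", "unknown strain x"])  # second is made up

mapped = strains.map(species_map)

# Collect strains that had no dictionary entry, before any cast to str.
unmapped = sorted(strains[mapped.isna()].unique())

if unmapped:
    print(f"strains without a species mapping: {unmapped}")
```

In this notebook the value counts show only expected species, so the guard would report nothing; it mainly matters when a new source introduces strain spellings not yet in `fixed_strains` or `fixed_species`.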

### Dealing with duplicated rows:

In [35]:
brochado_cleaned_data = mapper.refine_combinations(brochado_cleaned_data, other_columns=['Strain', 'Bliss Score'])

numebr of repeated row: 0
duplicated rows: Empty DataFrame
Columns: [Drug A, Drug B, Strain, Bliss Score, p-value, Interaction Type, Drug A Inchikey, Drug B Inchikey, Specie, Drug Pair]
Index: []
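
For intuition, duplicate detection on unordered drug pairs can be sketched as below. This is an illustration with toy keys, not the actual `refine_combinations` code: it assumes a duplicate means the same unordered InChIKey pair with the same strain and score.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Drug A Inchikey": ["K1", "K2", "K1"],
        "Drug B Inchikey": ["K2", "K1", "K3"],
        "Strain": ["s1", "s1", "s1"],
        "Bliss Score": [0.2, 0.2, -0.1],
    }
)

# Order-independent key: (K1, K2) and (K2, K1) map to the same tuple.
df["Drug Pair"] = [
    tuple(sorted(pair))
    for pair in zip(df["Drug A Inchikey"], df["Drug B Inchikey"])
]

key_cols = ["Drug Pair", "Strain", "Bliss Score"]
n_repeated = int(df.duplicated(subset=key_cols).sum())
deduplicated = df.drop_duplicates(subset=key_cols).reset_index(drop=True)
```

Sorting the two keys before building the tuple is what makes A+B and B+A collapse into one combination.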


In [36]:
brochado_cleaned_data = mapper.filter_cc_missing(features_25_levels_into_1, brochado_cleaned_data)
# brochado_cleaned_data = mapper.filter_cc_missing(features_15_levels_into_1, brochado_cleaned_data)
(INTERIM / "source_a_brochado").mkdir(parents=True, exist_ok=True)
brochado_cleaned_data.to_csv(INTERIM / "source_a_brochado" / "brochado_cleaned_data.csv", index=False)

The number of final combinations is: 2298
The inchikeys missing from feature set are: ['NAN', 'NVNLLIYOARQCIX-GSJOZIGCSA-N', 'QRBLKGHRWFGINE-UGWAGOLRSA-N', 'SGKRLCUYIXIAHR-AKNGSSGZSA-N', 'SOVUOXKZCCAWOJ-HJYUBDRYSA-N']



# **Cacace_data**

In [37]:
cacace_data = pd.read_csv(EXTRACTED / "source_b_cacace" / "cacace_data.csv").copy()
cacace_data.head()

Unnamed: 0,Combination,Bliss_interaction_score,p_adjusted,Strain,Type,Screen
0,Amoxicillin_Fluorouracil,-0.136761,0.051145,S. aureus DSM 20231,neutral,non_antibiotic_screen
1,Acetylsalicylicacid_Fluorouracil,0.101205,1.0,S. aureus DSM 20231,neutral,non_antibiotic_screen
2,Auranofin_Fluorouracil,-0.180292,0.146626,S. aureus DSM 20231,neutral,non_antibiotic_screen
3,Azithromycin_Fluorouracil,-0.148086,1.0,S. aureus DSM 20231,neutral,non_antibiotic_screen
4,Bacitracin_Fluorouracil,-0.05441,0.964089,S. aureus DSM 20231,neutral,non_antibiotic_screen


In [38]:
len(cacace_data)

10714

In [39]:
cacace_data['p_adjusted'].describe()

count    10714.000000
mean         0.733438
std          0.385384
min          0.001435
25%          0.421007
50%          0.997110
75%          0.998662
max          1.000000
Name: p_adjusted, dtype: float64

Keep only rows with an adjusted p-value below 0.05:

In [40]:
cacace_data = cacace_data[cacace_data['p_adjusted'] < 0.05].reset_index(drop=True)

In [41]:
len(cacace_data)

1340

Split the `Combination` column into `Drug A` and `Drug B`:

In [42]:
cacace_data[['Drug A', 'Drug B']] = cacace_data['Combination'].str.split('_', expand=True)
cacace_data = cacace_data.drop(columns=['Combination'])
cacace_data.head()

Unnamed: 0,Bliss_interaction_score,p_adjusted,Strain,Type,Screen,Drug A,Drug B
0,-0.114271,0.029174,S. aureus DSM 20231,synergy,non_antibiotic_screen,Cefotaxime,Fluorouracil
1,0.194907,0.002674,S. aureus DSM 20231,antagonism,non_antibiotic_screen,Fluorouracil,Loperamide
2,-0.204682,0.004703,S. aureus DSM 20231,synergy,non_antibiotic_screen,Fluorouracil,Nitrofurantoin
3,-0.184713,0.002674,S. aureus DSM 20231,synergy,non_antibiotic_screen,Fluorouracil,Streptomycin
4,-0.294366,0.002674,S. aureus DSM 20231,synergy,non_antibiotic_screen,Acetylsalicylicacid,Alfacalcidol


In [43]:
cacace_data = cacace_data.rename(columns={
    "Bliss_interaction_score": "Bliss Score",
    "p_adjusted": "p-value",
    "Type": "Interaction Type"
})

cacace_data = cacace_data[['Drug A', 'Drug B', 'Strain', 'Bliss Score', 'p-value', 'Interaction Type']]

In [44]:
cacace_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Bliss Score,p-value,Interaction Type
0,Cefotaxime,Fluorouracil,S. aureus DSM 20231,-0.114271,0.029174,synergy
1,Fluorouracil,Loperamide,S. aureus DSM 20231,0.194907,0.002674,antagonism
2,Fluorouracil,Nitrofurantoin,S. aureus DSM 20231,-0.204682,0.004703,synergy
3,Fluorouracil,Streptomycin,S. aureus DSM 20231,-0.184713,0.002674,synergy
4,Acetylsalicylicacid,Alfacalcidol,S. aureus DSM 20231,-0.294366,0.002674,synergy


Fix drug names that lost their internal space in the `Combination` strings (e.g. `Acetylsalicylicacid`). First list the unique compounds with the `compounds_list` method, then map the affected names:

In [45]:
all_cacace_compounds = mapper.compounds_list(cacace_data)
len(all_cacace_compounds)
# all_cacace_compounds

107

In [46]:
mapping_dict = {
    'Acetylsalicylicacid': 'Acetylsalicylic acid',
    'Amoxicillinclavulanic': 'Amoxicillin clavulanic',
    'CycloserineD': 'Cycloserine D',
    'Fusidic acid': 'Fusidic acid',
    'MitomycinC': 'Mitomycin C',
    'PenicillinG': 'Penicillin G',
    'Pseudomonic acid': 'Pseudomonic acid',
    'Virginiamycin M1': 'Virginiamycin M1'
}
# cacace_data['Drug A'] = cacace_data['Drug A'].apply(lambda x: mapping_dict.get(x, x))
# cacace_data['Drug B'] = cacace_data['Drug B'].apply(lambda x: mapping_dict.get(x, x))
cacace_data['Drug A'] = cacace_data['Drug A'].replace(mapping_dict)
cacace_data['Drug B'] = cacace_data['Drug B'].replace(mapping_dict)

In [47]:
cacace_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Bliss Score,p-value,Interaction Type
0,Cefotaxime,Fluorouracil,S. aureus DSM 20231,-0.114271,0.029174,synergy
1,Fluorouracil,Loperamide,S. aureus DSM 20231,0.194907,0.002674,antagonism
2,Fluorouracil,Nitrofurantoin,S. aureus DSM 20231,-0.204682,0.004703,synergy
3,Fluorouracil,Streptomycin,S. aureus DSM 20231,-0.184713,0.002674,synergy
4,Acetylsalicylic acid,Alfacalcidol,S. aureus DSM 20231,-0.294366,0.002674,synergy


In [48]:
cacace_data = mapper.inspect_and_clean(cacace_data)
all_cacace_compounds = mapper.compounds_list(cacace_data)

number of initial combinations in this dataset: 1340
number of unique antibacterial/antiviral compounds in this dataset:107


In [49]:
cacace_data = mapper.enrich(cacace_data, list_antibacterial)
cacace_missing_compounds = mapper.missing_compounds(cacace_data, dataset_name='cacace')

missing Drug A Inchikeys: 0
missing Drug B Inchikeys: 0


In [50]:
# len(cacace_missing_compounds)
cacace_missing_compounds

[]

In [51]:
cacace_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Bliss Score,p-value,Interaction Type,Drug A Inchikey,Drug B Inchikey
0,cefotaxime,fluorouracil,S. aureus DSM 20231,-0.114271,0.029174,synergy,GPRBEKHLDVQUJE-QSWIMTSFSA-N,GHASVSINZRGABV-UHFFFAOYSA-N
1,fluorouracil,loperamide,S. aureus DSM 20231,0.194907,0.002674,antagonism,GHASVSINZRGABV-UHFFFAOYSA-N,NAN
2,fluorouracil,nitrofurantoin,S. aureus DSM 20231,-0.204682,0.004703,synergy,GHASVSINZRGABV-UHFFFAOYSA-N,NXFQHRVNIOXGAQ-YCRREMRBSA-N
3,fluorouracil,streptomycin,S. aureus DSM 20231,-0.184713,0.002674,synergy,GHASVSINZRGABV-UHFFFAOYSA-N,UCSJYZPVAKXKNQ-HZYVHMACSA-N
4,acetylsalicylic acid,alfacalcidol,S. aureus DSM 20231,-0.294366,0.002674,synergy,BSYNRYMUTXBXSQ-UHFFFAOYSA-N,NAN


In [53]:
cacace_cleaned_data = mapper.check_na(cacace_data, critical_columns=['Drug A', 'Drug B', 'Strain', 'Bliss Score', 'Drug A Inchikey', 'Drug B Inchikey'])

No missing rows found in the critical columns.
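
Note that the preview above contains the string `NAN` in `Drug B Inchikey` (for loperamide and alfacalcidol), and `NAN` is later listed among the InChIKeys missing from the feature set. A string placeholder is not a pandas missing value, so NA checks pass over it. A minimal sketch of counting such placeholders explicitly, on toy rows taken from that preview:

```python
import pandas as pd

rows = pd.DataFrame(
    {
        "Drug B": ["fluorouracil", "loperamide", "alfacalcidol"],
        "Drug B Inchikey": ["GHASVSINZRGABV-UHFFFAOYSA-N", "NAN", "NAN"],
    }
)

# isna() does not flag the placeholder string.
n_true_na = int(rows["Drug B Inchikey"].isna().sum())

# Compare against the placeholder explicitly instead.
is_placeholder = rows["Drug B Inchikey"].str.upper().eq("NAN")
n_placeholder = int(is_placeholder.sum())
drugs_without_key = sorted(rows.loc[is_placeholder, "Drug B"])
```

Rows carrying the placeholder are still removed at export time, because `NAN` has no entry in the CC feature table.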


### Fixing `Strain` and adding `Specie`:

In [54]:
len(cacace_cleaned_data)

1340

In [55]:
cacace_cleaned_data['Strain'].value_counts()

Strain
S. aureus DSM 20231    467
S. aureus Newman       338
S. pneumoniae          290
B. subtilis            245
Name: count, dtype: int64

In [56]:
cacace_cleaned_data['Strain'] = cacace_cleaned_data['Strain'].astype(str).str.strip().str.lower()
cacace_cleaned_data['Strain'] = cacace_cleaned_data['Strain'].replace(fixed_strains)
cacace_cleaned_data['Strain'] = cacace_cleaned_data['Strain'].astype(str).str.strip().str.lower()

In [57]:
cacace_cleaned_data['Strain'].value_counts()

Strain
staphylococcus aureus dsm 20231    467
staphylococcus aureus newman       338
streptococcus pneumoniae           290
bacillus subtilis                  245
Name: count, dtype: int64

In [58]:
cacace_cleaned_data['Specie'] = cacace_cleaned_data['Strain'].map(fixed_species)
cacace_cleaned_data['Specie'] = cacace_cleaned_data['Specie'].astype(str).str.strip().str.lower()

In [59]:
cacace_cleaned_data['Specie'].value_counts()

Specie
staphylococcus aureus       805
streptococcus pneumoniae    290
bacillus subtilis           245
Name: count, dtype: int64

### Dealing with duplicated rows and NA

In [60]:
cacace_cleaned_data = mapper.refine_combinations(cacace_cleaned_data, other_columns=['Strain', 'Bliss Score'])

numebr of repeated row: 0
duplicated rows: Empty DataFrame
Columns: [Drug A, Drug B, Strain, Bliss Score, p-value, Interaction Type, Drug A Inchikey, Drug B Inchikey, Specie, Drug Pair]
Index: []


In [61]:
cacace_cleaned_data = mapper.filter_cc_missing(features_25_levels_into_1, cacace_cleaned_data)
# cacace_cleaned_data = mapper.filter_cc_missing(features_15_levels_into_1, cacace_cleaned_data)
(INTERIM / "source_b_cacace").mkdir(parents=True, exist_ok=True)
cacace_cleaned_data.to_csv(INTERIM / "source_b_cacace" / "cacace_cleaned_data.csv", index=False)

The number of final combinations is: 862
The inchikeys missing from feature set are: ['DHPRQBPJLMKORJ-XRNKAMNCSA-N', 'NAN', 'NVNLLIYOARQCIX-GSJOZIGCSA-N', 'QRBLKGHRWFGINE-UGWAGOLRSA-N', 'SGKRLCUYIXIAHR-AKNGSSGZSA-N', 'SOVUOXKZCCAWOJ-HJYUBDRYSA-N']



In [62]:
# cacace_cleaned_data['Bliss Score'].describe()

# **ACDB_data**


In [93]:
acdb_data = pd.read_csv(EXTRACTED / "source_c_acdb" / "acdb_data.csv").copy()
acdb_data.head()

Unnamed: 0,Drug A,Drug B,organism,value,type,method,PMID
0,Amikacin,Levofloxacin,Acinetobacter,5.0,Antagonism,FICI,28207768
1,Amikacin,Levofloxacin,Citrobacter,4.0,Indifferent,FICI,28207768
2,Amikacin,Levofloxacin,Escherichia coli,2.13,Indifferent,FICI,28207768
3,Amikacin,Chloramphenicol,Escherichia coli ATCC10798,2.8,Antagonism,,31405069
4,Amikacin,Spectinomycin,Escherichia coli ATCC10798,2.12,Antagonism,,31405069


In [94]:
acdb_data.columns = acdb_data.columns.str.strip()
acdb_data.rename(columns={'organism': 'Strain', 'value': 'Score', 'method': 'Method', 'type': 'Interaction Type'}, inplace=True)
acdb_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Score,Interaction Type,Method,PMID
0,Amikacin,Levofloxacin,Acinetobacter,5.0,Antagonism,FICI,28207768
1,Amikacin,Levofloxacin,Citrobacter,4.0,Indifferent,FICI,28207768
2,Amikacin,Levofloxacin,Escherichia coli,2.13,Indifferent,FICI,28207768
3,Amikacin,Chloramphenicol,Escherichia coli ATCC10798,2.8,Antagonism,,31405069
4,Amikacin,Spectinomycin,Escherichia coli ATCC10798,2.12,Antagonism,,31405069


In [95]:
len(acdb_data)

6040

In [96]:
# acdb_data = acdb_data[['Drug A', 'Drug B', 'Strain', 'Bliss Score', 'Interaction Type', 'Method', 'PMID']]

Note: the `value` column is renamed to `Score` rather than `Bliss Score`, because not all scores in this column are Bliss scores; some come from other methods (e.g. FICI). Only the Bliss-method subset is renamed to `Bliss Score` further below.

In [97]:
acdb_data = mapper.inspect_and_clean(acdb_data)
all_acdb_compounds = mapper.compounds_list(acdb_data)

number of initial combinations in this dataset: 6040
number of unique antibacterial/antiviral compounds in this dataset:87


In [98]:
acdb_data = mapper.enrich(acdb_data, list_antibacterial)
acdb_missing_compounds = mapper.missing_compounds(acdb_data, dataset_name='acdb_data')

missing Drug A Inchikeys: 0
missing Drug B Inchikeys: 0


In [99]:
acdb_data.head()

Unnamed: 0,Drug A,Drug B,Strain,Score,Interaction Type,Method,PMID,Drug A Inchikey,Drug B Inchikey
0,amikacin,levofloxacin,Acinetobacter,5.0,Antagonism,FICI,28207768,LKCWBDHBTVXHDL-RMDFUYIESA-N,GSDSWSVVBLHKDQ-JTQLQIEISA-N
1,amikacin,levofloxacin,Citrobacter,4.0,Indifferent,FICI,28207768,LKCWBDHBTVXHDL-RMDFUYIESA-N,GSDSWSVVBLHKDQ-JTQLQIEISA-N
2,amikacin,levofloxacin,Escherichia coli,2.13,Indifferent,FICI,28207768,LKCWBDHBTVXHDL-RMDFUYIESA-N,GSDSWSVVBLHKDQ-JTQLQIEISA-N
3,amikacin,chloramphenicol,Escherichia coli ATCC10798,2.8,Antagonism,,31405069,LKCWBDHBTVXHDL-RMDFUYIESA-N,WIIZWVCIJKGZOK-RKDXNWHRSA-N
4,amikacin,spectinomycin,Escherichia coli ATCC10798,2.12,Antagonism,,31405069,LKCWBDHBTVXHDL-RMDFUYIESA-N,UNFWWIHTNXNPBV-WXKVUWSESA-N


In [100]:
acdb_missing_compounds

[]


### Checking NA and keeping only Bliss rows:

In [102]:
# acdb_cleaned_data['Method'].value_counts()

In [103]:
acdb_cleaned_data = mapper.check_na(acdb_data, critical_columns=['Drug A', 'Drug B', 'Strain', 'Score' ,'Drug A Inchikey', 'Drug B Inchikey'])

Missing values report (before dropping): Score    1024
dtype: int64


Take the subset of `acdb_cleaned_data` measured with the Bliss method and rename its `Score` column to `Bliss Score`:

In [104]:
acdb_cleaned_data_bliss = acdb_cleaned_data.copy()
acdb_cleaned_data_bliss = acdb_cleaned_data_bliss[acdb_cleaned_data_bliss['Method'] == 'Bliss']
acdb_cleaned_data_bliss = acdb_cleaned_data_bliss.rename(columns={'Score': 'Bliss Score'})

### Fixing `Strain` and adding `Specie`:

In [105]:
acdb_cleaned_data_bliss['Strain'].value_counts()

Strain
Escherichia coli BW25113        566
Escherichia coli iAi1           566
Salmonella typhimurium LT2      435
Salmonella typhimurium 14028    435
Pseudomonas aeruginosa PAO1     312
Pseudomonas aeruginosa PA14     312
Name: count, dtype: int64

In [106]:
len(acdb_cleaned_data_bliss)

2626

In [107]:
acdb_cleaned_data_bliss['Strain'] = acdb_cleaned_data_bliss['Strain'].astype(str).str.strip().str.lower()
acdb_cleaned_data_bliss['Strain'] = acdb_cleaned_data_bliss['Strain'].replace(fixed_strains)
acdb_cleaned_data_bliss['Strain'] = acdb_cleaned_data_bliss['Strain'].astype(str).str.strip().str.lower()

In [108]:
acdb_cleaned_data_bliss['Strain'].value_counts()

Strain
escherichia coli bw25113        566
escherichia coli iai1           566
salmonella typhimurium lt2      435
salmonella typhimurium 14028    435
pseudomonas aeruginosa pao1     312
pseudomonas aeruginosa pa14     312
Name: count, dtype: int64

In [109]:
len(acdb_cleaned_data_bliss)

2626

In [110]:
acdb_cleaned_data_bliss['Specie'] = acdb_cleaned_data_bliss['Strain'].map(fixed_species)
acdb_cleaned_data_bliss['Specie'] = acdb_cleaned_data_bliss['Specie'].astype(str).str.strip().str.lower()

In [111]:
acdb_cleaned_data_bliss['Specie'].value_counts()

Specie
escherichia coli          1132
salmonella typhimurium     870
pseudomonas aeruginosa     624
Name: count, dtype: int64

In [112]:
len(acdb_cleaned_data_bliss)

2626

### Dealing with duplicated rows and NA

In [114]:
acdb_cleaned_data_bliss = mapper.refine_combinations(acdb_cleaned_data_bliss, other_columns=['Strain', 'Bliss Score'])

numebr of repeated row: 1384
duplicated rows:               Drug A         Drug B                        Strain  \
351     azithromycin       amikacin      escherichia coli bw25113   
354     azithromycin       amikacin         escherichia coli iai1   
372     azithromycin       amikacin   pseudomonas aeruginosa pa14   
383     azithromycin       amikacin   pseudomonas aeruginosa pao1   
414        aztreonam   azithromycin      escherichia coli bw25113   
...              ...            ...                           ...   
6020  nitrofurantoin       amikacin   pseudomonas aeruginosa pao1   
6021  nitrofurantoin     novobiocin  salmonella typhimurium 14028   
6022  nitrofurantoin  ciprofloxacin  salmonella typhimurium 14028   
6023  nitrofurantoin     novobiocin    salmonella typhimurium lt2   
6024  nitrofurantoin  ciprofloxacin    salmonella typhimurium lt2   

      Bliss Score Interaction Type Method      PMID  \
351          0.20       Antagonism  Bliss  29973719   
354          0.

In [115]:
acdb_cleaned_data_bliss = mapper.filter_cc_missing(features_25_levels_into_1, acdb_cleaned_data_bliss)
# acdb_cleaned_data_bliss = mapper.filter_cc_missing(features_15_levels_into_1, acdb_cleaned_data_bliss)
(INTERIM / "source_c_acdb").mkdir(parents=True, exist_ok=True)
acdb_cleaned_data_bliss.to_csv(INTERIM / "source_c_acdb" / "acdb_cleaned_data_bliss.csv", index=False)

The number of final combinations is: 1148
The inchikeys missing from feature set are: ['SGKRLCUYIXIAHR-AKNGSSGZSA-N', 'SOVUOXKZCCAWOJ-HJYUBDRYSA-N']


In [116]:
len(acdb_cleaned_data_bliss)

1148


___

#### All the missing compounds:

In [75]:
# all_missing = sorted(set(chan_missing_compounds + brochado_missing_compounds + cacace_missing_compounds + acdb_missing_compounds))
# len(all_missing)
# all_missing