## **00 — Clean Brochado:** single-treatment fitness table (stage0 → stage1)

This notebook loads the pre-exported Brochado single-treatment fitness table, standardizes column names and strain labels, does a quick completeness check (drug × strain coverage), and writes the cleaned table to the strain-space pipeline inputs (stage1).

### Inputs
- `feature_pipeline/strain_space/inputs/stage0/before_raw_brochado.csv`

### Outputs
- `feature_pipeline/strain_space/inputs/stage1/raw_brochado_fitness.csv`

### Notes / assumptions
- Strain columns are renamed to full strain names (e.g., `ecbw25113` → `escherichia coli bw25113`).
- The first two columns are treated as identifiers (`drug`, `3_letter_code`); all remaining columns are interpreted as strain fitness values.
- The long-format melt at the end is for coverage QC only and does not affect the saved output.
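
The renaming assumption above can be guarded with a small check. `DataFrame.rename` silently ignores mapping keys that are absent from the table, and it leaves unmapped columns untouched, so a new or misspelled strain code would otherwise pass unnoticed. The following is a minimal sketch on a toy table; `STRAIN_NAMES`, `ID_COLS`, and `rename_strains` are hypothetical names introduced for illustration and are not part of this notebook.

```python
import pandas as pd

# Hypothetical mapping that mirrors the strain codes used in this notebook.
STRAIN_NAMES = {
    "ecbw25113": "escherichia coli bw25113",
    "eciai1": "escherichia coli iai1",
    "stlt2": "salmonella typhimurium lt2",
    "st14028": "salmonella typhimurium 14028",
    "pao1": "pseudomonas aeruginosa pao1",
    "pa14": "pseudomonas aeruginosa pa14",
}
ID_COLS = ["drug", "3_letter_code"]


def rename_strains(df: pd.DataFrame) -> pd.DataFrame:
    """Rename strain columns, failing loudly on any column without a mapping."""
    strain_cols = [c for c in df.columns if c not in ID_COLS]
    unmapped = sorted(set(strain_cols) - set(STRAIN_NAMES))
    if unmapped:
        raise KeyError(f"No full strain name for columns: {unmapped}")
    return df.rename(columns=STRAIN_NAMES)


# Toy table with two of the six strain columns (already lower-cased).
toy = pd.DataFrame(
    {
        "drug": ["Amoxicillin"],
        "3_letter_code": ["AMX"],
        "ecbw25113": [0.45],
        "pa14": [0.92],
    }
)
renamed = rename_strains(toy)
print(list(renamed.columns))
```

Raising on unmapped columns is a design choice made for this sketch; a warning may be preferable if extra columns are expected upstream.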


In [None]:
import pandas as pd
from halo.paths import FEATURE_PIPELINE

In [5]:
single_treat = pd.read_csv(FEATURE_PIPELINE / "strain_space" / "inputs" / "stage0" / "before_raw_brochado.csv")

In [6]:
single_treat.head()

Unnamed: 0,drug,3_letter_code,ECBW25113,ECiAi1,STLT2,ST14028,PAO1,PA14
0,Amoxicillin,AMX,0.45,0.0,0.84,0.76,0.89,0.92
1,Oxacillin,OXA,0.78,0.83,0.85,0.86,0.88,0.93
2,Cefotaxime,CTX,0.57,0.39,0.71,0.79,0.0,0.76
3,Cefaclor,CEC,0.47,0.55,0.0,0.0,0.9,0.92
4,Cefsulodin,CFS,0.62,0.86,0.82,0.89,0.0,0.0


In [7]:
# Normalize column names: strip surrounding whitespace and lower-case
single_treat.columns = single_treat.columns.astype(str).str.strip().str.lower()
single_treat.columns

Index(['drug', '3_letter_code', 'ecbw25113', 'eciai1', 'stlt2', 'st14028',
       'pao1', 'pa14'],
      dtype='object')

In [8]:
single_treat = single_treat.rename(columns={
    'ecbw25113': 'escherichia coli bw25113',
    'eciai1': 'escherichia coli iai1',
    'stlt2': 'salmonella typhimurium lt2',
    'st14028': 'salmonella typhimurium 14028',
    'pao1': 'pseudomonas aeruginosa pao1',
    'pa14': 'pseudomonas aeruginosa pa14'
})

single_treat.head()

Unnamed: 0,drug,3_letter_code,escherichia coli bw25113,escherichia coli iai1,salmonella typhimurium lt2,salmonella typhimurium 14028,pseudomonas aeruginosa pao1,pseudomonas aeruginosa pa14
0,Amoxicillin,AMX,0.45,0.0,0.84,0.76,0.89,0.92
1,Oxacillin,OXA,0.78,0.83,0.85,0.86,0.88,0.93
2,Cefotaxime,CTX,0.57,0.39,0.71,0.79,0.0,0.76
3,Cefaclor,CEC,0.47,0.55,0.0,0.0,0.9,0.92
4,Cefsulodin,CFS,0.62,0.86,0.82,0.89,0.0,0.0


In [9]:
# Number of distinct drug names (NaN, if present, counts as one value)
single_treat['drug'].nunique(dropna=False)

79

In [10]:
single_treat.shape

(79, 8)

In [11]:
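out_path = FEATURE_PIPELINE / "strain_space" / "inputs" / "stage1" / "raw_brochado_fitness.csv"
# Create the stage1 directory if it does not exist yet
out_path.parent.mkdir(parents=True, exist_ok=True)
single_treat.to_csv(out_path, index=False)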
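
One way to confirm that the export preserved the table is to read the file back and compare it with the in-memory frame. The sketch below does this on a toy table written to a temporary directory rather than the pipeline path; the toy values and the file name `toy_fitness.csv` are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy table (illustrative values only), including one missing fitness value.
toy = pd.DataFrame(
    {
        "drug": ["drug_a", "drug_b"],
        "3_letter_code": ["AAA", "BBB"],
        "strain_x": [0.45, np.nan],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "stage1" / "toy_fitness.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises AssertionError if columns, dtypes, or values differ;
# NaN values in matching positions compare as equal.
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.shape)
```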

Convert to long format to check how many drug × strain combinations are present in this table (QC only; the saved output is unaffected):

In [12]:
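# Sanity check: the first two columns must be the identifier columns
assert list(single_treat.columns[:2]) == ["drug", "3_letter_code"], single_treat.columns[:5]
# All remaining columns hold per-strain fitness values
fitness_cols = single_treat.columns[2:]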

In [13]:
long_df = single_treat.melt(
    id_vars=['drug', '3_letter_code'],
    value_vars=fitness_cols,
    var_name='strain',
    value_name='fitness'
)

# count non-null entries
non_na_count = long_df['fitness'].notna().sum()

print("Total drug–strain values:", len(long_df))
print("Non-NA fitness values:", non_na_count)
print("Missing:", len(long_df) - non_na_count)

Total drug–strain values: 474
Non-NA fitness values: 464
Missing: 10
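
The totals above do not show where the missing values sit. A possible follow-up, sketched here on a toy table with placeholder drug and strain names (not the Brochado data), counts missing values per strain and lists the affected drug and strain pairs.

```python
import numpy as np
import pandas as pd

# Toy wide table (illustrative values only).
toy = pd.DataFrame(
    {
        "drug": ["drug_a", "drug_b", "drug_c"],
        "3_letter_code": ["AAA", "BBB", "CCC"],
        "strain_x": [0.5, np.nan, 0.9],
        "strain_y": [np.nan, np.nan, 0.7],
    }
)

# Same reshaping as in the notebook: one row per drug and strain pair.
long_toy = toy.melt(
    id_vars=["drug", "3_letter_code"],
    var_name="strain",
    value_name="fitness",
)

# Number of missing fitness values for each strain.
missing_per_strain = long_toy["fitness"].isna().groupby(long_toy["strain"]).sum()

# The drug and strain pairs that have no measurement.
missing_pairs = long_toy.loc[long_toy["fitness"].isna(), ["drug", "strain"]]

print(missing_per_strain)
print(missing_pairs)
```

Applied to `long_df`, the same two expressions would show whether the missing entries are concentrated in particular strains or drugs.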
