## **01 — Clean Cacace:** fitness tables (donor + recipient → stage1)

This notebook loads the Cacace donor and recipient single-treatment fitness tables and aggregates fitness per (strain, drug) pair by averaging across concentrations and replicates. It then standardizes strain names, merges the donor and recipient views into a single fitness value, pivots to a wide drug × strain matrix, and saves the cleaned table for downstream strain-space construction.


### Inputs
- `feature_pipeline/strain_space/inputs/stage0/before_raw_cacace_donor.csv`
- `feature_pipeline/strain_space/inputs/stage0/before_raw_cacace_recipient.csv`

### Output
- `feature_pipeline/strain_space/inputs/stage1/raw_cacace_fitness.csv`

### Notes / assumptions
- Fitness is averaged **across concentrations and replicates** within each (Strain, Drug) pair (removes concentration dependence).
- Donor and recipient fitness values are merged with an **outer join**; the final fitness is the mean of available donor/recipient values.
- Strain names are normalized to lowercase/stripped strings and a small set of known strain labels is harmonized via a manual mapping.
- Final output is a **wide** matrix: one row per drug, one column per strain.


In [None]:
import pandas as pd
from halo.paths import FEATURE_PIPELINE

In [2]:
single_treat_donor = pd.read_csv(FEATURE_PIPELINE / "strain_space" / "inputs" / "stage0" / "before_raw_cacace_donor.csv")
single_treat_recipient = pd.read_csv(FEATURE_PIPELINE / "strain_space" / "inputs" / "stage0" / "before_raw_cacace_recipient.csv")

### Cleaning up Cacace donor drug-strain pairs:

In [3]:
single_treat_donor.head()

Unnamed: 0,Strain,Fitness_estimated,Donor,Donor_conc,Biol_rep
0,SANewman,1.0,AMX,1,1
1,SANewman,1.0,AMX,2,1
2,SANewman,0.93666,AMX,3,1
3,SANewman,0.43533,ASA,1,1
4,SANewman,0.860668,ASA,2,1


In [4]:
len(single_treat_donor)

1718

In [5]:
len(single_treat_donor['Donor'].unique())

121

* Group the rows by `['Strain', 'Donor']`, i.e. one group per (strain, drug) pair.
* Compute the mean of `Fitness_estimated` across all concentrations and replicates in each group; this removes concentration dependence.
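
As a minimal sketch of this aggregation, the toy table below mirrors the donor column names, but the strains and fitness values are made up for illustration. The extra `n_obs` column is an assumption added here, not part of the notebook; it records how many measurements went into each mean.

```python
import pandas as pd

# Toy long-format table with the donor column names; values are invented.
toy = pd.DataFrame({
    "Strain": ["A", "A", "A", "B", "B"],
    "Donor": ["AMX", "AMX", "AMX", "AMX", "AMX"],
    "Donor_conc": [1, 2, 3, 1, 2],
    "Fitness_estimated": [1.0, 0.8, 0.6, 0.5, 0.3],
})

# One row per (Strain, Donor): the concentration axis is averaged out.
toy_mean = toy.groupby(["Strain", "Donor"], as_index=False).agg(
    fitness_mean=("Fitness_estimated", "mean"),
    n_obs=("Fitness_estimated", "count"),  # number of rows behind each mean
)
print(toy_mean)
```

Keeping a count next to the mean makes it easier to spot pairs whose average rests on very few measurements.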

In [6]:
donor_df = single_treat_donor.groupby(['Strain', 'Donor'], as_index=False)['Fitness_estimated'].mean()
donor_df = donor_df.rename(columns={
    'Strain': 'strain',
    'Donor': 'drug',
    'Fitness_estimated': 'donor_fitness'})

donor_df.isna().sum() 

strain           0
drug             0
donor_fitness    0
dtype: int64

No NAs: every (strain, drug) pair in the donor table has a numeric fitness.

In [7]:
len(donor_df)

313

In [8]:
donor_df['strain'] = donor_df['strain'].astype(str).str.strip().str.lower()

In [9]:
donor_df['strain'].value_counts()

strain
sabsubtilissm20231    121
sanewman               66
bsubtilis              64
spneumoniae            62
Name: count, dtype: int64

In [11]:
fixed_strains = {
    'sabsubtilissm20231': "staphylococcus aureus dsm 20231",
    'sanewman': "staphylococcus aureus newman",
    'bsubtilis': "bacillus subtilis",
    'spneumoniae': "streptococcus pneumoniae"
}

# Map the lowercased raw labels to full names; the mapped values are already lowercase and stripped.
donor_df['strain'] = donor_df['strain'].replace(fixed_strains)

In [12]:
donor_df.head()

Unnamed: 0,strain,drug,donor_fitness
0,bacillus subtilis,AMX,0.360845
1,bacillus subtilis,AMXCLA,0.816363
2,bacillus subtilis,ASA,0.479643
3,bacillus subtilis,AUR,0.297503
4,bacillus subtilis,AZM,0.709415
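
The same normalization is applied to both tables, so it could be factored into one helper. The sketch below is one possible shape for that, not the notebook's code: `normalize_strains` is a hypothetical name, and raising on unmapped labels is an added assumption (`Series.replace` silently leaves unknown labels unchanged). The mapping is copied verbatim from the cell above.

```python
import pandas as pd

# Mapping copied from the notebook; keys are the lowercased raw labels.
FIXED_STRAINS = {
    "sabsubtilissm20231": "staphylococcus aureus dsm 20231",
    "sanewman": "staphylococcus aureus newman",
    "bsubtilis": "bacillus subtilis",
    "spneumoniae": "streptococcus pneumoniae",
}

def normalize_strains(labels: pd.Series, mapping: dict) -> pd.Series:
    """Lowercase and strip labels, then map them; fail on labels missing from the mapping."""
    cleaned = labels.astype(str).str.strip().str.lower()
    unknown = sorted(set(cleaned) - set(mapping))
    if unknown:
        raise ValueError(f"Unmapped strain labels: {unknown}")
    return cleaned.map(mapping)

# Hypothetical usage on two raw labels with stray whitespace and mixed case.
demo = normalize_strains(pd.Series([" SANewman", "Bsubtilis "]), FIXED_STRAINS)
print(demo.tolist())
```

Failing loudly on an unknown label is a design choice: a new strain in the raw data would otherwise pass through under its raw name and end up as an unexpected extra column after the pivot.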


### Cleaning up Cacace recipient drug-strain pairs:

In [13]:
single_treat_recipient.head()

Unnamed: 0,Strain,Fitness_estimated,Recipient_Drug,Concentration,Replicate
0,SANewman,1.0,ADEP,0.25,1
1,SANewman,1.0,ADEP,0.25,2
2,SANewman,1.0,ADEP,0.5,1
3,SANewman,0.98604,ADEP,0.5,2
4,SANewman,1.0,ADEP,1.0,1


In [14]:
len(single_treat_recipient)

5208

In [15]:
len(single_treat_recipient['Recipient_Drug'].unique())

65

In [16]:
rec_df = single_treat_recipient.groupby(['Strain', 'Recipient_Drug'], as_index=False)['Fitness_estimated'].mean()
rec_df = rec_df.rename(columns={
    'Strain': 'strain',
    'Recipient_Drug': 'drug',
    'Fitness_estimated': 'rec_fitness'
})

rec_df.isna().sum()

strain         0
drug           0
rec_fitness    0
dtype: int64

No NAs: every (strain, drug) pair in the recipient table has a numeric fitness.

In [17]:
len(rec_df)

251

In [18]:
rec_df['strain'] = rec_df['strain'].astype(str).str.strip().str.lower()

fixed_strains = {
    'sabsubtilissm20231': "staphylococcus aureus dsm 20231",
    'sanewman': "staphylococcus aureus newman",
    'bsubtilis': "bacillus subtilis",
    'spneumoniae': "streptococcus pneumoniae"
}

# Map the lowercased raw labels to full names; the mapped values are already lowercase and stripped.
rec_df['strain'] = rec_df['strain'].replace(fixed_strains)

In [19]:
rec_df.head()

Unnamed: 0,strain,drug,rec_fitness
0,bacillus subtilis,AMX,0.295742
1,bacillus subtilis,ASA,0.803276
2,bacillus subtilis,AUR,0.347032
3,bacillus subtilis,AZM,0.702392
4,bacillus subtilis,BAC,0.839152


### Merge donor and recipient views:

In [20]:
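# Outer join keeps (strain, drug) pairs that were measured in only one of the two tables.
merged = pd.merge(donor_df, rec_df, on=['strain', 'drug'], how='outer')
# The row-wise mean skips NaN, so a pair present in one table keeps that table's value.
merged['fitness'] = merged[['donor_fitness', 'rec_fitness']].mean(axis=1)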

In [21]:
merged.head()

Unnamed: 0,strain,drug,donor_fitness,rec_fitness,fitness
0,bacillus subtilis,AMX,0.360845,0.295742,0.328294
1,bacillus subtilis,AMXCLA,0.816363,,0.816363
2,bacillus subtilis,ASA,0.479643,0.803276,0.64146
3,bacillus subtilis,AUR,0.297503,0.347032,0.322268
4,bacillus subtilis,AZM,0.709415,0.702392,0.705904
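
The `AMXCLA` row above has no recipient value, yet its `fitness` equals the donor value. That follows from `DataFrame.mean` skipping NaN by default. A minimal sketch on made-up data (the strain `s1` and all numbers are illustrative only):

```python
import pandas as pd

# Toy stand-ins for donor_df and rec_df; values are invented.
donor = pd.DataFrame({
    "strain": ["s1", "s1"],
    "drug": ["AMX", "AMXCLA"],
    "donor_fitness": [0.4, 0.8],
})
rec = pd.DataFrame({
    "strain": ["s1", "s1"],
    "drug": ["AMX", "ASA"],
    "rec_fitness": [0.2, 0.6],
})

both = donor.merge(rec, on=["strain", "drug"], how="outer")
# skipna=True is the default: the mean is taken over whichever values exist.
both["fitness"] = both[["donor_fitness", "rec_fitness"]].mean(axis=1)
print(both)
```

Here `AMX` gets the average of its two values, while `AMXCLA` and `ASA` each keep the single value that exists. Pairs measured in both tables and pairs measured in one therefore carry different amounts of evidence behind the same `fitness` column.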


In [22]:
len(merged)

314

In [23]:
len(merged['drug'].unique())

121

In [24]:
donor_df.groupby('strain')['drug'].nunique()

strain
bacillus subtilis                   64
staphylococcus aureus dsm 20231    121
staphylococcus aureus newman        66
streptococcus pneumoniae            62
Name: drug, dtype: int64

In [25]:
rec_df.groupby('strain')['drug'].nunique()

strain
bacillus subtilis                  62
staphylococcus aureus dsm 20231    65
staphylococcus aureus newman       62
streptococcus pneumoniae           62
Name: drug, dtype: int64
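
The row counts reported above (313 donor pairs, 251 recipient pairs, 314 merged pairs) imply 250 shared pairs, 63 donor-only pairs, and one recipient-only pair. One way to list such pairs, sketched here on made-up data rather than the real tables, is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column naming the source of each row.

```python
import pandas as pd

# Toy stand-ins for donor_df and rec_df; strains, drugs, and values are invented.
donor = pd.DataFrame({
    "strain": ["s1", "s1", "s2"],
    "drug": ["AMX", "AMXCLA", "AMX"],
    "donor_fitness": [0.4, 0.8, 0.9],
})
rec = pd.DataFrame({
    "strain": ["s1", "s2"],
    "drug": ["AMX", "ASA"],
    "rec_fitness": [0.2, 0.6],
})

# _merge is 'both', 'left_only' (donor only), or 'right_only' (recipient only).
coverage = donor.merge(rec, on=["strain", "drug"], how="outer", indicator=True)
source_counts = coverage["_merge"].value_counts()
rec_only = coverage.loc[coverage["_merge"] == "right_only", ["strain", "drug"]]
print(source_counts)
print(rec_only)
```
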

In [26]:
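# Long -> wide: one row per drug, one column per strain; unmeasured (drug, strain) pairs become NaN.
merged = merged.pivot(index='drug', columns='strain', values='fitness')
# Drop the 'strain' name on the column axis and turn 'drug' back into a regular column.
merged = merged.rename_axis(None, axis=1).reset_index()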

In [27]:
merged.columns = (merged.columns.astype(str).str.strip().str.lower())

In [28]:
merged.head()

Unnamed: 0,drug,bacillus subtilis,staphylococcus aureus dsm 20231,staphylococcus aureus newman,streptococcus pneumoniae
0,5FU,,0.741594,,
1,ADEP,,0.717285,0.563376,
2,ALF,,0.985689,,
3,ALL,,0.996244,,
4,AMX,0.328294,0.370839,0.866437,0.930972


In [29]:
merged.shape

(121, 5)
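
The long-to-wide step can be sketched on a small made-up table. `DataFrame.pivot` raises a `ValueError` if any (drug, strain) pair occurs more than once, so the explicit duplicate check below is an optional guard added for illustration, not something the notebook does.

```python
import pandas as pd

# Toy long table; drugs, strains, and values are invented.
long_df = pd.DataFrame({
    "drug": ["AMX", "AMX", "ASA"],
    "strain": ["s1", "s2", "s1"],
    "fitness": [0.3, 0.9, 0.6],
})

# Guard: each (drug, strain) pair must be unique before pivoting.
assert not long_df.duplicated(["drug", "strain"]).any()

wide = (
    long_df.pivot(index="drug", columns="strain", values="fitness")
    .rename_axis(None, axis=1)  # drop the 'strain' label on the column axis
    .reset_index()              # make 'drug' a regular column again
)
print(wide)
```

The (`ASA`, `s2`) pair is absent from the long table, so it shows up as NaN in the wide one, just as most strains are NaN for `5FU` in the real output.
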

In [30]:
out_path = FEATURE_PIPELINE / "strain_space" / "inputs" / "stage1" / "raw_cacace_fitness.csv"
merged.to_csv(out_path, index=False)
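
A quick round-trip check can confirm that the saved file reloads with the same shape, column names, and NaN pattern. The sketch below uses a made-up table and a temporary directory rather than the real `out_path`, so it has no side effects on the pipeline.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy wide table standing in for `merged`; values are invented.
wide = pd.DataFrame({
    "drug": ["AMX", "ASA"],
    "strain a": [0.5, np.nan],
    "strain b": [0.25, 0.75],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_fitness.csv"
    wide.to_csv(path, index=False)  # same call pattern as the notebook
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values differ (NaN positions must match).
pd.testing.assert_frame_equal(wide, reloaded)
```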