# **Antibacterial reference list** (QC + CC query export)

This notebook documents quality-control checks for the antibacterial reference list used in this study, and exports a de-duplicated list for Chemical Checker (CC) API queries.

## Why aliases are kept in the master list
`list_antibacterial.csv` intentionally contains alternative drug names that map to the same compound (same InChIKey).  <br> These aliases are needed later when extracting drug combinations from source datasets where naming is inconsistent.
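
As a minimal sketch of what "aliases" means here, the snippet below groups drug names by InChIKey. The rows are copied from the preview of `list_antibacterial.csv` shown later in this notebook; the variable names `preview`, `aliases_by_key` and `multi_name` are illustrative and are not used elsewhere in the workflow.

```python
import pandas as pd

# Rows copied from the preview of list_antibacterial.csv in this notebook.
rows = [
    ("a22", "LZTCFLDZLBOLDW-UHFFFAOYSA-N"),
    ("acetylsalicylic acid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("acetylsalicylicacid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("acetylsalisylic acid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("alahopcin", "NTBVVEFUJUCXPF-FYCPLRARSA-N"),
]
preview = pd.DataFrame(rows, columns=["drug", "inchikey"])

# Collect every name that shares an InChIKey.
aliases_by_key = preview.groupby("inchikey")["drug"].agg(list)

# Keep only the compounds that are listed under more than one name.
multi_name = aliases_by_key[aliases_by_key.map(len) > 1]
print(multi_name)
```

In this preview only the acetylsalicylic acid key carries several spellings, which is the kind of naming inconsistency the master list is meant to absorb.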

## Inputs
- `data/reference/drug_lists/list_antibacterial.csv` (final master list)

## Outputs
- `data/reference/drug_lists/list_antibacterial_for_cc.csv`  <br>
  Unique compounds for CC API lookup (**de-duplicated by InChIKey**).

## QC checks performed here
- Duplicate drug names (after normalization) are reported for inspection.
- InChIKey formatting is standardized (trimmed, uppercased) and validated (expected length = 27 characters including hyphens).


In [1]:
import pandas as pd
from halo.paths import DRUG_LISTS

In [8]:
list_antibacterial = pd.read_csv(DRUG_LISTS / "list_antibacterial.csv").copy()
len(list_antibacterial)

345

In [9]:
list_antibacterial.head()

Unnamed: 0,drug,inchikey
0,a22,LZTCFLDZLBOLDW-UHFFFAOYSA-N
1,acetylsalicylic acid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
2,acetylsalicylicacid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
3,acetylsalisylic acid,BSYNRYMUTXBXSQ-UHFFFAOYSA-N
4,alahopcin,NTBVVEFUJUCXPF-FYCPLRARSA-N


### Checking `list_antibacterial` for duplicate drug names:

In [10]:
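# Normalize names (trim whitespace, lowercase) so trivial variants compare equal
list_antibacterial["drug"] = list_antibacterial["drug"].astype(str).str.strip().str.lower()
# keep=False flags every row of a duplicated name, not only the later occurrences
dup_drug_rows = list_antibacterial[list_antibacterial.duplicated(subset="drug", keep=False)]
dup_drug_rows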

Unnamed: 0,drug,inchikey
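
The empty table means that no normalized name occurs more than once. To illustrate why the check uses `keep=False`, here is a small sketch on invented toy data (the whitespace and case variants below are not taken from the real list): with `keep=False` every member of a duplicated group is reported, whereas the default only reports the later occurrences.

```python
import pandas as pd

# Toy data (invented for illustration): two spellings that differ only in
# case and surrounding whitespace, plus one unrelated name.
toy = pd.DataFrame({"drug": [" Acetylsalicylic acid", "acetylsalicylic acid ", "a22"]})

# Same normalization as in the cell above.
toy["drug"] = toy["drug"].astype(str).str.strip().str.lower()

# keep=False: all rows that belong to a duplicated name.
all_members = toy[toy.duplicated(subset="drug", keep=False)]

# Default (keep="first"): only the second and later occurrences.
later_only = toy[toy.duplicated(subset="drug")]

print(len(all_members), len(later_only))
```

Seeing all members side by side makes it easier to decide which row to drop or correct during inspection.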


### Building the list of antibacterials to pass to the CC:
* Remove duplicate compounds from `list_antibacterial` based on InChIKey (the first name listed for each compound is kept).
* Keep only the `drug` and `inchikey` columns.
* Convert all drug names to lowercase.
* Sort the resulting list alphabetically by drug name.

In [22]:
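# drop_duplicates keeps the first row per InChIKey; .copy() makes the later column assignments safe
final_list = list_antibacterial.drop_duplicates(subset='inchikey')[['drug', 'inchikey']].copy()
len(final_list)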

292

In [23]:
final_list['drug'] = final_list['drug'].str.lower()
final_list = final_list.sort_values(by='drug')
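
The steps above can also be sketched as a single helper. This is an illustrative variant, not the code used to produce the export: the function name `dedupe_by_inchikey` is hypothetical, and unlike the cells above it normalizes the InChIKey column *before* de-duplicating, on the assumption that keys differing only in case or surrounding whitespace should collapse to one compound. The malformed key in the toy data is invented.

```python
import pandas as pd


def dedupe_by_inchikey(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per InChIKey, sorted by drug name (hypothetical helper)."""
    out = df[["drug", "inchikey"]].copy()
    # Normalize both columns first so formatting variants compare equal.
    out["drug"] = out["drug"].astype(str).str.strip().str.lower()
    out["inchikey"] = out["inchikey"].astype(str).str.strip().str.upper()
    # keep="first": the name retained for a compound is the first one in file order.
    out = out.drop_duplicates(subset="inchikey", keep="first")
    return out.sort_values(by="drug").reset_index(drop=True)


# Toy input: the second key is a badly formatted copy of the first (invented).
toy = pd.DataFrame({
    "drug": ["acetylsalicylic acid", "acetylsalicylicacid", "a22"],
    "inchikey": [
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        " bsynrymutxbxsq-uhfffaoysa-n",
        "LZTCFLDZLBOLDW-UHFFFAOYSA-N",
    ],
})
result = dedupe_by_inchikey(toy)
print(result)
```

Because the first occurrence wins, the name exported for each compound depends on the row order of the master list.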

### Checking the length of InChIKeys:
A standard InChIKey is 27 characters long including hyphens: blocks of 14, 10 and 1 uppercase letters separated by two hyphens.

In [24]:
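# Standardize formatting: trim whitespace and uppercase
final_list['inchikey'] = final_list['inchikey'].astype(str).str.strip().str.upper()
# A standard InChIKey is exactly 27 characters long
invalid_rows = final_list[final_list['inchikey'].str.len() != 27]
print(invalid_rows)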

Empty DataFrame
Columns: [drug, inchikey]
Index: []
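
A length check alone accepts any 27-character string. As a stricter optional sketch, the pattern below also checks the block layout of a standard InChIKey (14 letters, hyphen, 10 letters, hyphen, 1 letter). The names `INCHIKEY_PATTERN` and `is_wellformed_inchikey` are hypothetical, and the two malformed keys are invented for illustration.

```python
import re

import pandas as pd

# Layout of a standard InChIKey: 14 + 10 + 1 uppercase letters, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(value: str) -> bool:
    """True if the whole string matches the InChIKey block layout."""
    return INCHIKEY_PATTERN.fullmatch(value) is not None


keys = pd.Series([
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # from the preview of the master list
    "BSYNRYMUTXBXSQ_UHFFFAOYSA_N",  # invented: 27 characters, wrong separators
    "BSYNRYMUTXBXSQ-UHFFFAOYSA",    # invented: truncated key
])
wellformed = keys.map(is_wellformed_inchikey)
print(wellformed.tolist())
```

The second key would pass the length check but fails the pattern, which is the case this stricter test is meant to catch.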


In [None]:
final_list.to_csv(DRUG_LISTS / "list_antibacterial_for_cc.csv", index=False)