# CHECK DATA LEAKAGE IN DEEPSYNERGY

### Assignment for Week 2
##### Author: Zhaoguo Wei  
###### Group 3

I checked whether any data leakage occurs in the three cross-validation schemes (leave-drug-combination-out, leave-drug-out, and leave-cell-line-out).

**Task details:**

1. Check for data leakage using the provided files:
   - smiles.csv and labels.csv (on GitHub, the authors have only uploaded a labels.csv for one CV scheme → leave-drug-combination-out)
   - Optional: generate new label files to cover the other two CV schemes → leave-drug-out and leave-cell-line-out
2. Write a script to perform the checks in step 1.
3. Verify (or “validate”) the results.
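
For the optional item, a minimal sketch of how a leave-cell-line-out label file could be generated is shown here. It is not the DeepSynergy authors' procedure: the function name `assign_cell_line_folds`, the default of five folds, the random seed, and the toy data are all assumptions made for illustration. The idea is simply that every row of a given cell line must receive the same fold number.

```python
import pandas as pd

def assign_cell_line_folds(labels: pd.DataFrame, n_folds: int = 5, seed: int = 0) -> pd.DataFrame:
    """Return a copy of `labels` whose `fold` column groups rows by cell line.

    Hypothetical helper: n_folds and seed are illustrative defaults.
    """
    # Shuffle the unique cell lines reproducibly, then deal them out round-robin.
    cell_lines = pd.Series(sorted(labels["cell_line"].unique()))
    shuffled = cell_lines.sample(frac=1.0, random_state=seed).tolist()
    fold_of_cell = {cell: i % n_folds for i, cell in enumerate(shuffled)}

    out = labels.copy()
    out["fold"] = out["cell_line"].map(fold_of_cell).astype(int)
    return out

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "drug_a_name": ["A", "A", "B", "C", "A", "B"],
    "drug_b_name": ["B", "C", "C", "D", "B", "D"],
    "cell_line":   ["X", "X", "Y", "Y", "Z", "Z"],
})
toy_folds = assign_cell_line_folds(toy, n_folds=3)

# Each cell line should appear in exactly one fold.
print(toy_folds.groupby("cell_line")["fold"].nunique())
```

Leave-drug-out is harder to express with a single `fold` column, because every row contains two drugs that may fall into different folds, so it is not covered by this sketch.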


## Script: Code for all three CV methods

```python
import pandas as pd

def main():
    # 1. Load labels.csv and normalize names
    labels = pd.read_csv("/Volumes/Lenovo/projektseminar/labels.csv", dtype=str)
    for col in ["drug_a_name", "drug_b_name", "cell_line"]:
        labels[col] = labels[col].str.strip().str.upper()
    labels["fold"] = labels["fold"].astype(int)

    # 2. Build a unique key for each drugA+drugB+cell combination
    #    (drug names are sorted so that A+B and B+A map to the same key)
    def combo_key(row):
        drugs = sorted([row["drug_a_name"], row["drug_b_name"]])
        return f"{drugs[0]}__{drugs[1]}__{row['cell_line']}"
    labels["combo_key"] = labels.apply(combo_key, axis=1)

    # 3. Split into train/test: fold 0 is the test set, all other folds are training
    mask_test  = labels["fold"] == 0
    mask_train = ~mask_test

    # 4. Check Leave-combination-out: no combo overlap
    train_combos = set(labels.loc[mask_train, "combo_key"])
    test_combos  = set(labels.loc[mask_test,  "combo_key"])
    combo_leak   = not train_combos.isdisjoint(test_combos)

    # 5. Check Leave-drug-out: no drug overlap
    train_drugs = set(labels.loc[mask_train, "drug_a_name"]) | set(labels.loc[mask_train, "drug_b_name"])
    test_drugs  = set(labels.loc[mask_test,  "drug_a_name"]) | set(labels.loc[mask_test,  "drug_b_name"])
    drug_leak   = not train_drugs.isdisjoint(test_drugs)

    # 6. Check Leave-cell-line-out: no cell-line overlap
    train_cells = set(labels.loc[mask_train, "cell_line"])
    test_cells  = set(labels.loc[mask_test,  "cell_line"])
    cell_leak   = not train_cells.isdisjoint(test_cells)

    # 7. Print results
    print("=== CV Data Leakage Check for labels.csv ===")
    print(f"1. Leave-combination-out : {'⚠ Leakage' if combo_leak else '✔ No leakage'}")
    print(f"2. Leave-drug-out        : {'⚠ Leakage' if drug_leak else '✔ No leakage'}")
    print(f"3. Leave-cell-line-out   : {'⚠ Leakage' if cell_leak else '✔ No leakage'}")

    # 8. Summary: list every scheme whose no-overlap criterion is satisfied
    schemes = []
    if not combo_leak:
        schemes.append("Leave-combination-out")
    if not drug_leak:
        schemes.append("Leave-drug-out")
    if not cell_leak:
        schemes.append("Leave-cell-line-out")

    if schemes:
        print("→ Matches scheme(s): " + ", ".join(schemes))
    else:
        print("→ Does NOT match any of the three strict CV schemes.")

if __name__ == "__main__":
    main()
```
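
The script treats only fold 0 as the test set. A possible extension, sketched here under the assumption that every fold is used as the test set once, repeats the overlap check for each fold in turn. The helper name `leakage_by_fold` and the toy frames are hypothetical and are not part of the original script.

```python
import pandas as pd

def leakage_by_fold(labels: pd.DataFrame, key_col: str) -> dict:
    """For each fold used as the test set, count keys that also occur in training."""
    overlaps = {}
    for fold in sorted(labels["fold"].unique()):
        is_test = labels["fold"] == fold
        test_keys = set(labels.loc[is_test, key_col])
        train_keys = set(labels.loc[~is_test, key_col])
        # Number of distinct keys shared between this test fold and its training set
        overlaps[int(fold)] = len(test_keys & train_keys)
    return overlaps

# Toy data, made up for illustration only
clean = pd.DataFrame({"cell_line": ["X", "X", "Y", "Y"], "fold": [0, 0, 1, 1]})
leaky = pd.DataFrame({"cell_line": ["X", "Y", "X", "Y"], "fold": [0, 0, 1, 1]})

print(leakage_by_fold(clean, "cell_line"))  # {0: 0, 1: 0}
print(leakage_by_fold(leaky, "cell_line"))  # {0: 2, 1: 2}
```

A count of zero for every fold means the criterion holds for the whole cross-validation, not just for one split. The same function could be called with `"combo_key"` once that column has been built.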


```
=== CV Data Leakage Check for labels.csv ===
1. Leave-combination-out : ✔ No leakage
2. Leave-drug-out        : ⚠ Leakage
3. Leave-cell-line-out   : ⚠ Leakage
→ Matches scheme(s): Leave-combination-out
```


With fold 0 held out as the test set, only the combination criterion is satisfied; the drug and cell-line criteria both fail. The fold assignment in labels.csv is therefore consistent with leave-drug-combination-out cross-validation and with neither of the other two schemes. This is expected, because the DeepSynergy authors only provided a labels.csv for the leave-drug-combination-out scheme.

**→ In other words, there is no data leakage under the leave-drug-combination-out method.**
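
As an additional verification, note that `combo_key` in the script includes the cell line, so the combination check compares drug-drug-cell triples. If leave-drug-combination-out is understood as holding out whole drug pairs across all cell lines (an assumption here, not something the script tests), the check could be repeated on the pair alone. The following is a minimal sketch with hypothetical helper names `pair_key` and `leaked_pairs` and made-up toy data. It has not been run on labels.csv, so it makes no claim about the real result.

```python
import pandas as pd

def pair_key(drug_a: str, drug_b: str) -> str:
    """Order-independent key for a drug pair (the cell line is left out on purpose)."""
    first, second = sorted([drug_a, drug_b])
    return f"{first}__{second}"

def leaked_pairs(labels: pd.DataFrame, test_fold: int = 0) -> set:
    """Drug pairs that occur both in the test fold and in the training folds."""
    keys = pd.Series(
        [pair_key(a, b) for a, b in zip(labels["drug_a_name"], labels["drug_b_name"])],
        index=labels.index,
    )
    is_test = labels["fold"] == test_fold
    return set(keys[is_test]) & set(keys[~is_test])

# Toy data, made up for illustration only: the pair A+B is tested on cell line X
# in fold 0 but also appears (as B+A, on cell line Y) in fold 1.
toy = pd.DataFrame({
    "drug_a_name": ["A", "B", "C"],
    "drug_b_name": ["B", "A", "D"],
    "cell_line":   ["X", "Y", "X"],
    "fold":        [0, 1, 1],
})
print(leaked_pairs(toy, test_fold=0))  # {'A__B'}
```

In the toy data the triple-based key would report no overlap, because `A__B__X` and `A__B__Y` differ, while the pair-based key flags `A__B`. An empty set from `leaked_pairs` on the real file would support the conclusion above at the pair level as well.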