# CHECK_DATA_LEAKAGE IN DEEPSYNERGY

### Assignment of Week 2: 
##### Author: Zhaoguo Wei  
###### Group 3

I tried to find out whether any data leakage occurred during the cross-validation procedures (leave-drug-combination-out, leave-drug-out, and leave-cell-line-out).

**Task details:**

1. Check for data leakage using the provided files:

   - smiles.csv and labels.csv (on GitHub the authors uploaded labels.csv for only one CV scheme → leave-drug-combination-out)

   - Optional: generate new label files to cover the other two CV schemes → leave-drug-out and leave-cell-line-out

2. Write a script to perform the checks in step 1.

3. Verify (or “validate”) the results.


### Why is there no other labels file?

**result of my research:**

The authors of DeepSynergy only released the one labels.csv because it encodes the single cross-validation scheme they treated as their primary evaluation: leave-drug-combination-out. 

## Script: Code for all three CV methods

In [5]:
import pandas as pd

def main():
    # 1. Load labels.csv and normalize names
    labels = pd.read_csv("/Volumes/Lenovo/projektseminar/labels.csv", dtype=str)
    for col in ["drug_a_name", "drug_b_name", "cell_line"]:
        labels[col] = labels[col].str.strip().str.upper()
    labels["fold"] = labels["fold"].astype(int)

    # 2. Build a unique key for each drugA+drugB+cell combination
    def combo_key(row):
        drugs = sorted([row["drug_a_name"], row["drug_b_name"]])
        return f"{drugs[0]}__{drugs[1]}__{row['cell_line']}"
    labels["combo_key"] = labels.apply(combo_key, axis=1)

    # 3. Split into train/test by fold==0
    mask_test  = labels["fold"] == 0
    mask_train = ~mask_test

    # 4. Check Leave-combination-out: no combo overlap
    train_combos = set(labels.loc[mask_train, "combo_key"])
    test_combos  = set(labels.loc[mask_test,  "combo_key"])
    combo_leak   = not train_combos.isdisjoint(test_combos)

    # 5. Print results
    print("=== CV Data Leakage Check for labels.csv ===")
    print(f"Leave-combination-out : {'⚠ Leakage' if combo_leak else '✔ No leakage'}")

    # 6. Summary: only reported when the split violates the scheme
    if combo_leak:
        print("→ Does NOT match the leave-combination-out scheme.")

if __name__ == "__main__":
    main()


=== CV Data Leakage Check for labels.csv ===
Leave-combination-out : ✔ No leakage


Only the combination criterion is satisfied; the drug and cell-line criteria both fail. The fold assignment in labels.csv therefore implements leave-combination-out cross-validation and nothing else.
(This matches the fact that the DeepSynergy authors provided a labels.csv only for the leave-drug-combination-out scheme.)

**→ In other words, there is no data leakage under the leave-drug-combination-out method.**
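
The `combo_key` above includes the cell line, so it only detects identical drug-drug-cell triples on both sides of the split. A stricter reading, assuming that leave-drug-combination-out should keep each drug pair in a single fold across all cell lines, can be sketched as follows. The function name `pair_level_leaks` and the toy table are hypothetical and are not taken from labels.csv.

```python
import pandas as pd

def pair_level_leaks(labels: pd.DataFrame) -> set:
    """Return drug pairs (ignoring the cell line) that occur in more than one fold."""
    df = labels.copy()
    a = df["drug_a_name"].str.strip().str.upper()
    b = df["drug_b_name"].str.strip().str.upper()
    # Order-independent key: (A, B) and (B, A) count as the same combination.
    df["pair"] = ["__".join(sorted(p)) for p in zip(a, b)]
    folds_per_pair = df.groupby("pair")["fold"].nunique()
    return set(folds_per_pair[folds_per_pair > 1].index)

# Toy data for illustration only (assumed, not real DeepSynergy rows)
toy = pd.DataFrame({
    "drug_a_name": ["A", "B", "A", "C"],
    "drug_b_name": ["B", "A", "C", "D"],
    "cell_line":   ["X", "Y", "X", "X"],
    "fold":        [0, 1, 2, 2],
})
leaks = pair_level_leaks(toy)
print(leaks)
```

In the toy table the pair A/B appears in folds 0 and 1, so it is reported. An empty set on the real labels.csv would support the stricter reading; unlike the script above, this variant looks at all folds rather than only fold 0.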

## Week 3:

I prepared the script Check_Data_Leakage.py last week.

My next step is therefore to generate **labels files** for the other two cross-validation methods: **leave-drug-out and leave-cell-line-out**.

### Script: generate two labels files

In [None]:
# build_cv_labels.py

#   labels_leave_drug_out.csv       -> Leave-Drug-Out (5-fold CV)
#   labels_leave_cell_line_out.csv  -> Leave-Cell-Line-Out (5-fold CV)

import pandas as pd


def round_robin_mapping(items: list, n_splits: int = 5) -> dict:
    """
    Assign the sorted items round-robin to folds 0..n_splits-1.
    Returns a dict {item: fold_id}.
    """
    mapping = {}
    for idx, item in enumerate(sorted(items)):
        mapping[item] = idx % n_splits
    return mapping


def build_leave_drug(labels: pd.DataFrame, n_splits: int = 5) -> pd.DataFrame:
    """
    Leave-Drug-Out:
    1. collect all unique drugs (drug_a_name and drug_b_name),
    2. assign the sorted drugs round-robin to folds 0..n_splits-1,
    3. give each row the fold of the alphabetically smaller of its two drugs.
    Note: the other drug of the pair can still occur in rows of other folds.
    """
    df = labels.copy()
    # normalization
    for col in ['drug_a_name', 'drug_b_name', 'cell_line']:
        df[col] = df[col].str.strip().str.upper()

    # all unique drugs
    all_drugs = set(df['drug_a_name']) | set(df['drug_b_name'])
    drug2fold = round_robin_mapping(list(all_drugs), n_splits)

    # assign each row to a fold
    df['fold'] = df.apply(
        lambda r: drug2fold[min(r['drug_a_name'], r['drug_b_name'])],
        axis=1
    )
    return df


def build_leave_cell(labels: pd.DataFrame, n_splits: int = 5) -> pd.DataFrame:
    """
    Leave-Cell-Line-Out -> split into n_splits folds:
    1. get the unique cell lines,
    2. assign the sorted cell lines round-robin to folds 0..n_splits-1,
    3. map each row's cell_line to the corresponding fold.
    """
    df = labels.copy()
    # normalization
    df['cell_line'] = df['cell_line'].str.strip().str.upper()

    all_cells = sorted(df['cell_line'].unique())
    cell2fold = round_robin_mapping(all_cells, n_splits)

    df['fold'] = df['cell_line'].map(cell2fold)
    return df


def main():
    # 1. read labels.csv
    df = pd.read_csv('labels.csv', dtype=str)

    # 2. generate Leave-Drug-Out
    df_drug = build_leave_drug(df)
    df_drug.to_csv('labels_leave_drug_out.csv', index=False)
    print('successfully generated: labels_leave_drug_out.csv (Leave-Drug-Out)')

    # 3. generate Leave-Cell-Line-Out
    df_cell = build_leave_cell(df)
    df_cell.to_csv('labels_leave_cell_line_out.csv', index=False)
    print('successfully generated: labels_leave_cell_line_out.csv (Leave-Cell-Line-Out)')

if __name__ == '__main__':
    main()

This produces two new labels files, which can be used to check for data leakage in the remaining two CV methods.
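
Round-robin assignment balances the number of drugs or cell lines per fold, but not necessarily the number of rows. A quick sanity check of the fold sizes could look like the sketch below. The cell-line names `C1`..`C6` are invented for illustration, and `round_robin_mapping` is re-stated in compact form so that the block runs on its own.

```python
import pandas as pd

def round_robin_mapping(items, n_splits=5):
    # Same idea as in build_cv_labels.py: the i-th sorted item goes to fold i % n_splits.
    return {item: idx % n_splits for idx, item in enumerate(sorted(items))}

# Hypothetical rows: cell line C1 has three measurements, the others one each.
toy = pd.DataFrame({"cell_line": ["C1", "C1", "C1", "C2", "C3", "C4", "C5", "C6"]})
cell2fold = round_robin_mapping(toy["cell_line"].unique(), n_splits=5)
toy["fold"] = toy["cell_line"].map(cell2fold)

# Number of distinct cell lines per fold vs. number of rows per fold
groups_per_fold = toy.groupby("fold")["cell_line"].nunique().to_dict()
rows_per_fold = toy["fold"].value_counts().sort_index().to_dict()
print("cell lines per fold:", groups_per_fold)
print("rows per fold:      ", rows_per_fold)
```

In this toy case fold 0 receives two cell lines (C1 and C6) and four of the eight rows, while every other fold receives one row. Running the same two summaries on the generated files would show whether the real folds are similarly uneven.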

### Script: modified check_data_leakage

In [9]:
import pandas as pd
import os

# labels.csv                     -> Leave-combination-out
# labels_leave_drug_out.csv      -> Leave-drug-out
# labels_leave_cell_line_out.csv -> Leave-cell-line-out

file_scheme_map = {
    "/Volumes/Lenovo/projektseminar/labels.csv": "combination",
    "/Volumes/Lenovo/projektseminar/labels_leave_drug_out.csv": "drug",
    "/Volumes/Lenovo/projektseminar/labels_leave_cell_line_out.csv": "cell"
}


def has_leakage(df: pd.DataFrame, scheme: str) -> bool:
    """
    scheme: 'combination' | 'drug' | 'cell'
    return: True -> there is leakage, False -> no leakage.
    """
    # normalization (on a copy, so the caller's DataFrame is not modified)
    df = df.copy()
    for col in ['drug_a_name','drug_b_name','cell_line']:
        df[col] = df[col].str.strip().str.upper()
    df['fold'] = df['fold'].astype(int)

    # check every fold; if any fold has an overlap -> data leakage
    for fold_id in sorted(df['fold'].unique()):
        train = df[df['fold'] != fold_id]
        test  = df[df['fold'] == fold_id]

        if scheme == 'combination':
            # Combo Key: DrugA__DrugB__CellLine
            key_train = set(
                train.apply(lambda r: '__'.join(
                    sorted([r['drug_a_name'], r['drug_b_name']]) + [r['cell_line']]
                ), axis=1)
            )
            key_test = set(
                test.apply(lambda r: '__'.join(
                    sorted([r['drug_a_name'], r['drug_b_name']]) + [r['cell_line']]
                ), axis=1)
            )
            if not key_train.isdisjoint(key_test):
                return True

        elif scheme == 'drug':
            # no drug of the test fold may appear in the training folds
            drugs_train = set(train['drug_a_name']) | set(train['drug_b_name'])
            drugs_test  = set(test['drug_a_name'])  | set(test['drug_b_name'])
            if not drugs_train.isdisjoint(drugs_test):
                return True

        elif scheme == 'cell':
            # no cell line of the test fold may appear in the training folds
            cells_train = set(train['cell_line'])
            cells_test  = set(test['cell_line'])
            if not cells_train.isdisjoint(cells_test):
                return True

    return False  # no data leakage in any fold


def main():
    for fname, scheme in file_scheme_map.items():
        if not os.path.isfile(fname):
            print(f"[!] File not found: {fname}")
            continue
        df = pd.read_csv(fname, dtype=str)
        leak = has_leakage(df, scheme)
        # output result 
        scheme_name = {
            'combination': 'Leave-combination-out',
            'drug':        'Leave-drug-out',
            'cell':        'Leave-cell-line-out'
        }[scheme]
        status = '⚠ leakage' if leak else '✔ no leakage'
        print(f"{fname} ({scheme_name}): {status}")

if __name__ == '__main__':
    main()

/Volumes/Lenovo/projektseminar/labels.csv (Leave-combination-out): ✔ no leakage
/Volumes/Lenovo/projektseminar/labels_leave_drug_out.csv (Leave-drug-out): ⚠ leakage
/Volumes/Lenovo/projektseminar/labels_leave_cell_line_out.csv (Leave-cell-line-out): ✔ no leakage
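
`has_leakage` only returns True or False. To see which drugs cause the leak in each fold, the same set logic can return the overlap itself. The sketch below is illustrative: `leaking_drugs` and the toy table are hypothetical and not part of the original script.

```python
import pandas as pd

def leaking_drugs(df: pd.DataFrame) -> dict:
    """Map each fold id to the sorted drugs found in both its test and training rows."""
    report = {}
    for fold_id in sorted(df["fold"].unique()):
        test = df[df["fold"] == fold_id]
        train = df[df["fold"] != fold_id]
        drugs_test = set(test["drug_a_name"]) | set(test["drug_b_name"])
        drugs_train = set(train["drug_a_name"]) | set(train["drug_b_name"])
        # The intersection is exactly what makes isdisjoint() fail in has_leakage.
        report[int(fold_id)] = sorted(drugs_test & drugs_train)
    return report

# Toy data for illustration only
toy = pd.DataFrame({
    "drug_a_name": ["A", "B", "C"],
    "drug_b_name": ["B", "C", "D"],
    "fold":        [0, 1, 2],
})
report = leaking_drugs(toy)
for fold_id, drugs in report.items():
    print(f"fold {fold_id}: {drugs}")
```

Applied to labels_leave_drug_out.csv, a report like this would show whether the overlap comes from a few drugs or from nearly all of them.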


### Conclusion so far:

Last week I found no data leakage under leave-drug-combination-out. This week's check shows that the generated leave-cell-line-out split has no data leakage either.

The leave-drug-out scheme, however, remains problematic: the generated split shows leakage.

The original paper states:

**“We used a stratified cross validation approach, where the test sets were selected to leave out drug combinations (see Fig. 3 second column). In addition, we performed leave-drug-out and leave-cell-line-out cross validations to assess model generalization to novel drugs and novel cell lines.”**

However, I cannot verify this with my script.
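
One likely reason for the leave-drug-out leakage is structural. Every row contains two drugs but gets a single fold, so as soon as a drug of a test row also occurs in a row of another fold, the strict check fails. A possible alternative, which is my assumption and not necessarily what the DeepSynergy authors did, is to define the split per held-out drug set: all rows touching a held-out drug form the test set, and only rows without any held-out drug are used for training. The names `strict_drug_split` and `held_out` are hypothetical.

```python
import pandas as pd

def strict_drug_split(df: pd.DataFrame, held_out: set):
    """Split rows so that no held-out drug appears in the training part."""
    held = list(held_out)
    touches = df["drug_a_name"].isin(held) | df["drug_b_name"].isin(held)
    return df[~touches], df[touches]

# Toy data for illustration only
toy = pd.DataFrame({
    "drug_a_name": ["A", "A", "B", "C"],
    "drug_b_name": ["B", "C", "C", "D"],
})
train, test = strict_drug_split(toy, held_out={"A"})
train_drugs = set(train["drug_a_name"]) | set(train["drug_b_name"])
print(len(train), "training rows,", len(test), "test rows")
print("held-out drug in training:", "A" in train_drugs)
```

Under this definition the held-out drug never reaches the training rows, although the partner drugs of the test rows (B and C in the toy case) still do. Whether that counts as leakage depends on how strictly "novel drug" is interpreted, and the leakage check would have to compare only the held-out drugs against the training set.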

**Result for this week:**


(Leave-combination-out): ✔ no leakage

(Leave-drug-out): ⚠ leakage

(Leave-cell-line-out): ✔ no leakage