# User Story 14 / 15
@LuiseJedlitschka

**Cross-Validation Strategy: Leave-One-Group-Out**

To evaluate how well our models generalize and to detect overfitting, we used the leave-one-group-out cross-validation strategy as implemented in scikit-learn (LeaveOneGroupOut). In this approach, each dataset (corresponding to one specific phage) serves exactly once as the test set, while the remaining datasets together form the training set. In every split, the model is therefore validated on data from a phage it has not seen during training, which gives an estimate of performance across different biological backgrounds.
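
As a minimal sketch of how this splitting behaves, the toy example below runs LeaveOneGroupOut on six made-up genes from three groups. The group names (phage_A, phage_B, phage_C), the feature values, and the labels are placeholders for illustration only and are not taken from our data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 6 genes from 3 hypothetical phages (all names and values are placeholders)
X = np.arange(12).reshape(6, 2)
y = np.array(["early", "late", "middle", "early", "late", "late"])
groups = np.array(["phage_A", "phage_A", "phage_B", "phage_B", "phage_C", "phage_C"])

logo = LeaveOneGroupOut()
splits = list(logo.split(X, y, groups=groups))

for i, (train_idx, test_idx) in enumerate(splits):
    held_out = sorted(set(groups[test_idx]))       # exactly one group per split
    train_groups = sorted(set(groups[train_idx]))  # all remaining groups
    print(f"Split {i}: test={held_out}, train={train_groups}")
```

The number of splits equals the number of distinct groups, and the held-out group never appears in the corresponding training set.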

Note:
The data was not stratified by the classification classes ("early", "middle", "late") during splitting, because stratification is not part of the leave-one-group-out strategy. As a result, the class distribution can differ between the training and test set of a split, and from one split to the next.
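
To illustrate what such a difference looks like, here is a small sketch that compares class proportions between a training and a test set. The helper name class_ratio_table and the toy labels are hypothetical and only serve as an example; they are not part of our pipeline.

```python
import pandas as pd

def class_ratio_table(train_labels, test_labels, classes=("early", "middle", "late")):
    """Return per-class proportions in train and test plus their absolute difference."""
    classes = list(classes)
    train_ratio = pd.Series(train_labels).value_counts(normalize=True)
    test_ratio = pd.Series(test_labels).value_counts(normalize=True)
    table = pd.DataFrame({
        # Classes missing from a set get a proportion of 0
        "train_ratio": train_ratio.reindex(classes, fill_value=0.0),
        "test_ratio": test_ratio.reindex(classes, fill_value=0.0),
    })
    table["abs_diff"] = (table["train_ratio"] - table["test_ratio"]).abs()
    return table

# Toy labels (placeholders): the held-out group happens to contain only "late" genes
train = ["early", "early", "middle", "late"]
test = ["late", "late"]
table = class_ratio_table(train, test)
print(table)
```

A large abs_diff for a class signals that a split's test set is not representative of the training set for that class, which should be kept in mind when comparing per-split metrics.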

An overview of the class distribution in the training and test sets for each split is provided in
data/leave-one-group-out-split/logo_class_distributions.tsv.

All corresponding training and test files for each split are saved in
data/leave-one-group-out-split/splits/.

In [4]:
import os
import glob
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Directory with the TSV files
directory = "../data/feature_tables"
# Output directory
output_dir = "../data/leave-one-group-out-split"
os.makedirs(output_dir, exist_ok=True)

# Sorted list of all .tsv files in the directory (sorted for a reproducible row order)
tsv_files = sorted(glob.glob(os.path.join(directory, "*.tsv")))

# Combine all TSV files into one DataFrame; each row gets its source file name as group label
df = pd.concat(
    [pd.read_csv(f, sep="\t").assign(group=os.path.basename(f)) for f in tsv_files],
    ignore_index=True
)

# Save the combined table (features plus the group label of each gene)
output_path = os.path.join(output_dir, "combined.tsv")
df.to_csv(output_path, sep="\t", index=False)

# Prepare subfolder for splits
splits_dir = os.path.join(output_dir, "splits")
os.makedirs(splits_dir, exist_ok=True)

# Hold out one group per split as the test set (one split per input file, 7 here)
logo = LeaveOneGroupOut()
results = []

for i, (train_idx, test_idx) in enumerate(logo.split(df, groups=df['group'])):
    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]

    # save train and test data for this split 
    train_path = os.path.join(splits_dir, f"train_split_{i}.tsv")
    test_path = os.path.join(splits_dir, f"test_split_{i}.tsv")
    train_df.to_csv(train_path, sep="\t", index=False)
    test_df.to_csv(test_path, sep="\t", index=False)
    
    # overall class distribution
    all_classes = sorted(df['classification_x'].unique())
    
    # class distribution in train set
    train_counts = train_df['classification_x'].value_counts(normalize=True)
    # class distribution in test set
    test_counts = test_df['classification_x'].value_counts(normalize=True)

    # Check for overlapping genes
    overlapping_genes = set(train_df["Geneid"]).intersection(set(test_df["Geneid"]))
    if overlapping_genes:
        print(f"Split {i}: {len(overlapping_genes)} overlapping genes found!")
    else:
        print(f"Split {i}: No overlapping genes.")
    
    for cls in all_classes:
        results.append({
            'split': i,
            'group_left_out': test_df['group'].iloc[0],  # file name of the held-out group
            'class': cls,
            'train_ratio': train_counts.get(cls, 0),
            'test_ratio': test_counts.get(cls, 0),
            'train_count': train_df['classification_x'].value_counts().get(cls, 0),
            'test_count': test_df['classification_x'].value_counts().get(cls, 0)
        })

# convert results to DataFrame
split_summary = pd.DataFrame(results)

# Save overview of each split
overview_path = os.path.join(output_dir, "logo_class_distributions.tsv")
split_summary.to_csv(overview_path, sep='\t', index=False)


Split 0: No overlapping genes.
Split 1: No overlapping genes.
Split 2: No overlapping genes.
Split 3: No overlapping genes.
Split 4: No overlapping genes.
Split 5: No overlapping genes.
Split 6: No overlapping genes.
