# <center>**Split Data for Model Training**</center>  
**Author**: Shirshak Aryal  
**Last Updated**: 18 July 2025

---
**Purpose:** This notebook loads the fully engineered molecular feature dataset and splits it into training, validation, and test sets. Saving one fixed partition lets every model be trained, tuned, and evaluated on exactly the same rows, so results are comparable across models and the test set gives an unbiased estimate of performance.

---

## 1. Setup Notebook
This section initializes the notebook environment by importing necessary libraries and configuring system settings relevant to data loading and splitting.

### 1.1. Configure Environment
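This sub-section sets environment variables that cap the number of threads used by the OpenMP, MKL, OpenBLAS, and NumExpr backends. They are set before any numerical library is imported, since these backends typically read the values at load time.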

In [1]:
# General CPU Usage Optimization
import os
os.environ['OMP_NUM_THREADS'] = '16'
os.environ['MKL_NUM_THREADS'] = '16'
os.environ['OPENBLAS_NUM_THREADS'] = '16'
os.environ['NUMEXPR_NUM_THREADS'] = '16'
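
The hard-coded value of 16 assumes a machine with at least that many cores. As a minimal sketch of an alternative, the thread cap could be derived from the detected CPU count instead. The helper name `configure_threads` and the `THREAD_ENV_VARS` tuple are hypothetical and not part of the original notebook.

```python
import os

# Same four variables as in the cell above
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def configure_threads(max_threads: int = 16) -> int:
    """Cap backend threads at the smaller of max_threads and the CPU count."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    n_threads = max(1, min(max_threads, available))
    for var in THREAD_ENV_VARS:
        os.environ[var] = str(n_threads)
    return n_threads


n_threads = configure_threads()
print(f"Thread cap set to {n_threads}")
```

As with the original cell, this would need to run before NumPy, pandas, or scikit-learn are imported to have any effect.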

### 1.2. Import Libraries
All required Python libraries for data manipulation and data splitting are imported here.

In [2]:
# Standard Library Imports
from pathlib import Path

# Core Data Science Libraries
import pandas as pd

# ML Data Splitting Libraries
from sklearn.model_selection import train_test_split

### 1.3. Set Data Splits Save Location

In [4]:
splits_dir = Path("../data/splits")
splits_dir.mkdir(parents=True, exist_ok=True)
print(f"The data splits will be saved in: {splits_dir}")

The data splits will be saved in: ..\data\splits


## 2. Load Fully Feature-Engineered Data
This section loads the comprehensive dataset containing all engineered molecular features and the target variable (`pGI50`), prepared in the previous notebook.

In [5]:
features_dir = Path("../data/features")
try:
    features_df = pd.read_parquet(features_dir / "gi50_features.parquet")
    print(f"Loaded fully feature-engineered data from '{features_dir / 'gi50_features.parquet'}'.")
    print(f"Shape of loaded data: {features_df.shape}")
    display(features_df.head())
    # info() prints its summary and returns None, so it is not wrapped in display()
    features_df.info()
except FileNotFoundError:
    print(f"Error: 'gi50_features.parquet' not found in '{features_dir}'.")
    print("Please ensure you have run the dataset-saving step in the '01_Engineer_Features.ipynb' file.")

Loaded fully feature-engineered data from '..\data\features\gi50_features.parquet'.
Shape of loaded data: (18743, 2269)


|  | molregno | pGI50 | canonical_smiles | num_activities | MaxAbsEStateIndex | MaxEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | ... | morgan_fp_2038 | morgan_fp_2039 | morgan_fp_2040 | morgan_fp_2041 | morgan_fp_2042 | morgan_fp_2043 | morgan_fp_2044 | morgan_fp_2045 | morgan_fp_2046 | morgan_fp_2047 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 148 | 7.999957 | `O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23` | 1 | 11.996238 | 11.996238 | 0.005972 | -0.940881 | 0.216285 | 11.818182 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 666 | 4.823909 | `Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3...` | 1 | 11.735952 | 11.735952 | 0.229569 | -0.467042 | 0.716604 | 22.645161 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 696 | 5.421428 | `Cc1cc(O)nc2c3c(ccc12)OC(C)(C)C=C3` | 7 | 9.639307 | 9.639307 | 0.047809 | -0.299061 | 0.767926 | 16.388889 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 717 | 5.583359 | `CC(O)(CS(=O)(=O)c1ccc(F)cc1)C(=O)Nc1ccc(C#N)c(...` | 1 | 13.006084 | 13.006084 | 0.342865 | -4.860872 | 0.560412 | 13.965517 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 846 | 5.405822 | `CC1(C)CC(C)(C)c2cc(NC(=S)Nc3ccc([N+](=O)[O-])c...` | 9 | 10.730998 | 10.730998 | 0.052565 | -0.422109 | 0.375924 | 16.888889 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18743 entries, 0 to 18742
Columns: 2269 entries, molregno to morgan_fp_2047
dtypes: float64(218), int64(2050), object(1)
memory usage: 324.5+ MB
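
Before splitting, it can help to confirm that the loaded frame has the expected target column and no missing target values. The following is a minimal sketch of such a check on a toy frame. The helper `check_features_frame` is hypothetical, and the toy values only loosely mirror the rows shown above.

```python
import pandas as pd


def check_features_frame(df: pd.DataFrame, target: str = "pGI50") -> None:
    """Raise if the target column is absent, has NaNs, or column names repeat."""
    if target not in df.columns:
        raise KeyError(f"Target column '{target}' not found.")
    n_missing = int(df[target].isna().sum())
    if n_missing:
        raise ValueError(f"Target column '{target}' has {n_missing} missing values.")
    n_duplicated = int(df.columns.duplicated().sum())
    if n_duplicated:
        raise ValueError(f"{n_duplicated} duplicated column names found.")


# Toy stand-in for features_df (illustrative values only)
toy = pd.DataFrame(
    {"molregno": [148, 666, 696], "pGI50": [8.0, 4.8, 5.4], "qed": [0.22, 0.72, 0.77]}
)
check_features_frame(toy)  # passes silently
```

In the notebook itself, the same call would be made on `features_df` right after loading.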

## 3. Define Features and Target
This section explicitly separates the dataset into features (X) and the target variable (y), preparing them for the splitting process.

In [6]:
# Everything except the target goes into X; a missing 'pGI50' column raises a KeyError
X = features_df.drop(columns=['pGI50'])
y = features_df['pGI50']

print("\nDataFrame for splitting created.")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
display(X.head())


DataFrame for splitting created.
X shape: (18743, 2268)
y shape: (18743,)


|  | molregno | canonical_smiles | num_activities | MaxAbsEStateIndex | MaxEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | MolWt | ... | morgan_fp_2038 | morgan_fp_2039 | morgan_fp_2040 | morgan_fp_2041 | morgan_fp_2042 | morgan_fp_2043 | morgan_fp_2044 | morgan_fp_2045 | morgan_fp_2046 | morgan_fp_2047 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 148 | `O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23` | 1 | 11.996238 | 11.996238 | 0.005972 | -0.940881 | 0.216285 | 11.818182 | 302.194 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 666 | `Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3...` | 1 | 11.735952 | 11.735952 | 0.229569 | -0.467042 | 0.716604 | 22.645161 | 441.549 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 696 | `Cc1cc(O)nc2c3c(ccc12)OC(C)(C)C=C3` | 7 | 9.639307 | 9.639307 | 0.047809 | -0.299061 | 0.767926 | 16.388889 | 241.29 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 717 | `CC(O)(CS(=O)(=O)c1ccc(F)cc1)C(=O)Nc1ccc(C#N)c(...` | 1 | 13.006084 | 13.006084 | 0.342865 | -4.860872 | 0.560412 | 13.965517 | 430.379 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 846 | `CC1(C)CC(C)(C)c2cc(NC(=S)Nc3ccc([N+](=O)[O-])c...` | 9 | 10.730998 | 10.730998 | 0.052565 | -0.422109 | 0.375924 | 16.888889 | 401.557 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
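
As the preview shows, `X` still carries `molregno`, `canonical_smiles`, and `num_activities` alongside the descriptors and fingerprint bits. A downstream notebook would presumably need to set these aside before fitting a model. The sketch below shows one way that might look on a toy frame. The helper `select_model_features` and the choice of which columns count as identifiers are assumptions, not something the original notebook states.

```python
import pandas as pd

# Assumption: these columns are bookkeeping fields rather than model inputs
ID_COLUMNS = ["molregno", "canonical_smiles", "num_activities"]


def select_model_features(X: pd.DataFrame, id_columns=ID_COLUMNS) -> pd.DataFrame:
    """Drop identifier columns, then keep only numeric columns."""
    without_ids = X.drop(columns=id_columns, errors="ignore")
    return without_ids.select_dtypes(include="number")


# Toy stand-in for X (illustrative values and column names only)
toy_X = pd.DataFrame(
    {
        "molregno": [148, 666],
        "canonical_smiles": ["CCO", "c1ccccc1"],
        "num_activities": [1, 1],
        "qed": [0.216285, 0.716604],
        "morgan_fp_0": [0, 1],
    }
)
model_X = select_model_features(toy_X)
print(list(model_X.columns))  # ['qed', 'morgan_fp_0']
```

Keeping the identifier columns in the saved splits, as this notebook does, leaves the option of tracing predictions back to individual molecules later.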


## 4. Split Data
The dataset is partitioned into training, validation, and test sets to ensure robust model development and unbiased performance evaluation.

In [10]:
# Define a consistent random state for reproducibility
RANDOM_STATE = 42

print("\nPerforming initial train-test split (85% Train+Val, 15% Test)...")
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y,
                                                            test_size=0.15, # 15% for the final test set
                                                            random_state=RANDOM_STATE,
                                                            shuffle=True)

print(f"X_train_val shape: {X_train_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train_val shape: {y_train_val.shape}")
print(f"y_test shape: {y_test.shape}")

print("\nPerforming second split (Train and Validation from Train+Val set)...")
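# The validation set should be 15% of the full dataset, which is 0.15 / 0.85 of the remaining train+val rows
val_size_ratio_from_train_val = 0.15 / (1 - 0.15)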
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                  test_size=val_size_ratio_from_train_val,
                                                  random_state=RANDOM_STATE,
                                                  shuffle=True)

print("\nFinal Split Shapes:")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

# Double check total rows
total_rows_after_split = X_train.shape[0] + X_val.shape[0] + X_test.shape[0]
print(f"Total rows after splitting: {total_rows_after_split} (Should match original {X.shape[0]})")


Performing initial train-test split (85% Train+Val, 15% Test)...
X_train_val shape: (15931, 2268)
X_test shape: (2812, 2268)
y_train_val shape: (15931,)
y_test shape: (2812,)

Performing second split (Train and Validation from Train+Val set)...

Final Split Shapes:
X_train shape: (13119, 2268)
y_train shape: (13119,)
X_val shape: (2812, 2268)
y_val shape: (2812,)
X_test shape: (2812, 2268)
y_test shape: (2812,)
Total rows after splitting: 18743 (Should match original 18743)
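
The two-stage split above can be wrapped in a single function so that the fractions are named once and the rescaling of the validation fraction is explicit. This is a sketch under the same 70/15/15 assumption. The function name `three_way_split` and the demo data are hypothetical, and only the row count of 18743 is taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=42):
    """Split into train/val/test, with val_frac and test_frac given relative to the full data."""
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed, shuffle=True
    )
    # Rescale: val_frac of the whole dataset as a fraction of what is left
    val_frac_of_rest = val_frac / (1 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val,
        test_size=val_frac_of_rest, random_state=seed, shuffle=True,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Demo data with the same number of rows as the real dataset
n_rows = 18743
X_demo = pd.DataFrame({"feature": np.arange(n_rows)})
y_demo = pd.Series(np.arange(n_rows, dtype=float), name="pGI50")

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))

# Every original row should land in exactly one split
all_index = np.concatenate([X_tr.index, X_va.index, X_te.index])
assert len(np.unique(all_index)) == n_rows
```

With 18743 rows, the split sizes should match the 13119 / 2812 / 2812 shapes reported above, since the fractions are the same.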


## 5. Save Data Splits
The distinct training, validation, and test sets (both features and targets) are saved locally for direct use in subsequent model training and evaluation notebooks.

In [11]:
print(f"\nSaving training, validation, and testing splits to {splits_dir}...")

X_train.to_parquet(splits_dir / "X_train.parquet", index=False)
X_val.to_parquet(splits_dir / "X_val.parquet", index=False)
X_test.to_parquet(splits_dir / "X_test.parquet", index=False)

# Written without the index, like X, so features and targets reload with matching row labels
y_train.to_frame().to_parquet(splits_dir / "y_train.parquet", index=False)
y_val.to_frame().to_parquet(splits_dir / "y_val.parquet", index=False)
y_test.to_frame().to_parquet(splits_dir / "y_test.parquet", index=False)

print("Data splits saved successfully to .parquet files.")
print("\nData splitting complete. Splits are ready for all model development notebooks.")


Saving training, validation, and testing splits to ..\data\splits...
Data splits saved successfully to .parquet files.

Data splitting complete. Splits are ready for all model development notebooks.
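
Once the splits are reloaded in a later notebook, a quick sanity check is to confirm that no molecule appears in more than one split. The sketch below does this by comparing `molregno` values. It assumes that `molregno` uniquely identifies a molecule, and the helper `find_shared_ids` and the toy frames are hypothetical.

```python
import pandas as pd


def find_shared_ids(splits: dict, id_column: str = "molregno") -> dict:
    """Return the IDs shared by each pair of splits; empty dict means no overlap."""
    names = list(splits)
    shared = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = set(splits[first][id_column]) & set(splits[second][id_column])
            if overlap:
                shared[(first, second)] = overlap
    return shared


# Toy stand-ins for the reloaded X_train / X_val / X_test frames
toy_splits = {
    "train": pd.DataFrame({"molregno": [148, 666, 696]}),
    "val": pd.DataFrame({"molregno": [717]}),
    "test": pd.DataFrame({"molregno": [846]}),
}
print(find_shared_ids(toy_splits))  # {}
```

An empty result is what a row-wise random split should produce when each molecule occupies a single row.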
