# 01 · Data Cleaning & Validation

Following the AGENT specification and FPF rules, this notebook inspects the raw hybrid nanofluid dataset, enforces the mandatory cleaning rules, and persists `clean_dataset.csv` together with the canonical train/test split.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import sys
sys.path.append("..")
import preprocess
pd.set_option("display.precision", 6)
RAW_DATASET = Path("../survey_sample_data.xlsx")
CLEAN_PATH = Path("../clean_dataset.csv")
TRAIN_PATH = Path("../data/processed/train_dataset.csv")
TEST_PATH = Path("../data/processed/test_dataset.csv")

## Inspect raw Excel payload

We load the file through the helper in `preprocess.py`, which does not rely on `openpyxl`, so the parsing step is deterministic across environments.
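The body of `preprocess.load_raw_dataset` is not shown here. As a rough illustration of how an `.xlsx` file can be read without `openpyxl`, the sketch below treats the workbook as a ZIP archive of XML parts and uses only the standard library. The names `read_xlsx_rows` and `col_index` are hypothetical, and the sketch assumes the data sits in `xl/worksheets/sheet1.xml` and that every cell carries an `r` reference.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by the XML parts inside an .xlsx archive.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}


def col_index(cell_ref):
    """Map a cell reference such as 'C7' to a zero-based column index (2)."""
    letters = re.match(r"[A-Z]+", cell_ref).group()
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def read_xlsx_rows(source, sheet_part="xl/worksheets/sheet1.xml"):
    """Return worksheet rows as lists of strings (None for empty cells).

    Assumption: the sheet lives at `sheet_part`; a robust reader would
    resolve the part name through xl/workbook.xml instead.
    """
    with zipfile.ZipFile(source) as archive:
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            sst = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for item in sst.findall("m:si", NS):
                # Rich text splits one string over several <t> runs.
                runs = item.iter(f"{{{MAIN_NS}}}t")
                shared.append("".join(t.text or "" for t in runs))
        sheet = ET.fromstring(archive.read(sheet_part))

    rows = []
    for row in sheet.iterfind("m:sheetData/m:row", NS):
        cells = {}
        for cell in row.findall("m:c", NS):
            value = cell.find("m:v", NS)
            if value is None or value.text is None:
                continue  # empty cell
            # t="s" means the value is an index into the shared-string table.
            text = shared[int(value.text)] if cell.get("t") == "s" else value.text
            cells[col_index(cell.get("r"))] = text
        width = max(cells) + 1 if cells else 0
        rows.append([cells.get(i) for i in range(width)])
    return rows


# Build a tiny two-row workbook in memory so the sketch runs as-is.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "xl/sharedStrings.xml",
        f'<sst xmlns="{MAIN_NS}"><si><t>M</t></si><si><t>S</t></si></sst>',
    )
    archive.writestr(
        "xl/worksheets/sheet1.xml",
        f'<worksheet xmlns="{MAIN_NS}"><sheetData>'
        '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>'
        '<row r="2"><c r="A2"><v>0.5</v></c><c r="B2"><v>60</v></c></row>'
        "</sheetData></worksheet>",
    )
demo.seek(0)
rows = read_xlsx_rows(demo)
print(rows)
```

A reader of this kind returns every cell as text, so numeric coercion remains a separate step.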

In [2]:
raw_df = preprocess.load_raw_dataset(RAW_DATASET)
display(raw_df.head())
print(f"Raw shape: {raw_df.shape}")
raw_df.dtypes

| | M | S | K | phi1 | phi2 | Ec | Pr | eta | f3 | f5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 60.0 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.0 | -0.52987 | 5.64949 |
| 1 | 0.5 | 60.0 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.333 | -0.6091013882717312 | -0.6818898582443035 |
| 2 | 0.5 | 60.0 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.666 | -0.678774162938626 | -0.4617618360928274 |
| 3 | 0.5 | 60.0 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.999 | -0.733691010908527 | -0.4139963996124683 |
| 4 | 0.5 | 60.0 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 1.332 | -0.7654594598978063 | -0.4044683056319904 |


Raw shape: (216, 10)


M       object
S       object
K       object
phi1    object
phi2    object
Ec      object
Pr      object
eta     object
f3      object
f5      object
dtype: object

Observations:

- The spreadsheet ships with a blank column (header = `NaN`).
- `S` is stored in degrees (the first rows read 60.0), so it must be converted to radians.
- Every physical field is typed as `object`, so the columns must be coerced to floats before any model fitting.
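The coercion step can be sketched with `pandas.to_numeric`. This is a minimal illustration on a toy frame, not the body of `preprocess.clean_dataset`, which is not shown. The toy values and the choice to drop incomplete rows are assumptions made for the example.

```python
import pandas as pd

# Toy frame: numbers stored as text, plus one unparseable entry.
raw = pd.DataFrame(
    {"M": ["0.5", "1.0", "n/a"], "S": ["60", "30", "60"]},
    dtype=object,
)

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce").astype("float64")

# Dropping incomplete rows is one possible policy, assumed here for illustration.
complete = numeric.dropna().reset_index(drop=True)

print(numeric.dtypes.to_dict())
print(complete.shape)
```

Coercing with `errors="coerce"` makes bad cells visible as `NaN`, so they can be counted and handled explicitly rather than silently breaking a later cast.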

## Apply mandatory cleaning rules

In [3]:
clean_df = preprocess.clean_dataset(raw_df)
display(clean_df.head())
print(f"Clean shape: {clean_df.shape}")
clean_df.describe()

| | M | S | K | phi1 | phi2 | Ec | Pr | eta | f3 | f5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.0 | -0.52987 | 5.64949 |
| 1 | 0.5 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.333 | -0.609101 | -0.68189 |
| 2 | 0.5 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.666 | -0.678774 | -0.461762 |
| 3 | 0.5 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 0.999 | -0.733691 | -0.413996 |
| 4 | 0.5 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 1.332 | -0.765459 | -0.404468 |


Clean shape: (168, 10)


| | M | S | K | phi1 | phi2 | Ec | Pr | eta | f3 | f5 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 | 168.0 |
| mean | 1.0 | 1.008239 | 0.991071 | 0.02 | 0.011786 | 0.545238 | 203.285714 | 1.420625 | -0.620662 | 0.141786 |
| std | 0.173032 | 0.123694 | 0.159276 | 0.003283 | 0.00517 | 0.361767 | 17.371304 | 0.919777 | 0.288707 | 1.996767 |
| min | 0.5 | 0.523599 | 0.5 | 0.01 | 0.01 | 0.1 | 150.0 | 0.0 | -1.283619 | -5.072533 |
| 25% | 1.0 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.3 | 204.0 | 0.666 | -0.762508 | -0.426502 |
| 50% | 1.0 | 1.047198 | 1.0 | 0.02 | 0.01 | 0.5 | 204.0 | 1.332 | -0.52987 | -0.249773 |
| 75% | 1.0 | 1.047198 | 1.0 | 0.02 | 0.01 | 1.0 | 204.0 | 2.334 | -0.378159 | -0.08258 |
| max | 1.5 | 1.047198 | 1.5 | 0.03 | 0.03 | 1.0 | 250.0 | 3.0 | -0.216338 | 5.64949 |
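The `S` values in the summary can be cross-checked against degree inputs with the standard library. The sketch below is illustrative only: `to_radians_if_degrees` is a hypothetical helper, and its rule (treat any magnitude above 2π as degrees) is an assumed heuristic, not necessarily what `preprocess.clean_dataset` does.

```python
import math


def to_radians_if_degrees(values, limit=2 * math.pi):
    """Convert to radians only when the values cannot already be radians.

    Assumption: any magnitude above `limit` (2*pi by default) is taken as
    evidence that the column is expressed in degrees.
    """
    if max(abs(v) for v in values) > limit:
        return [math.radians(v) for v in values]
    return list(values)


raw_s = [30.0, 60.0]                      # degrees
s_rad = to_radians_if_degrees(raw_s)      # converted once
s_again = to_radians_if_degrees(s_rad)    # already radians, left untouched

print([round(v, 6) for v in s_rad])
```

Rounded to six decimals, 30° and 60° give 0.523599 and 1.047198, the `min` and `max` reported for `S`. Applying the helper a second time leaves the values unchanged, which guards against converting twice.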


We verify the physics-informed guards below:

- `S` must be in radians so the trigonometric terms match those of the RK4 + shooting solver.
- `eta` stays inside `[0, 5]`, per the similarity variable definition.
- Both gradients (`f3` and `f5`) remain well within the ±10⁴ band, which is consistent with numerically stable solutions.

In [4]:
s_range = (clean_df["S"].min(), clean_df["S"].max())
eta_range = (clean_df["eta"].min(), clean_df["eta"].max())
grad_ranges = {
    "f3": (clean_df["f3"].min(), clean_df["f3"].max()),
    "f5": (clean_df["f5"].min(), clean_df["f5"].max()),
}
print({"S_rad": s_range, "eta": eta_range, **grad_ranges})

{'S_rad': (np.float64(0.5235987755982988), np.float64(1.0471975511965976)), 'eta': (np.float64(0.0), np.float64(3.0)), 'f3': (np.float64(-1.2836186023144711), np.float64(-0.2163378389205405)), 'f5': (np.float64(-5.072533298528624), np.float64(5.64949))}
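Printing the ranges leaves the comparison to the reader. A minimal sketch of turning the same guards into explicit checks follows. `check_guards` is a hypothetical helper, the `[0, 5]` and ±10⁴ limits come from the list above, and the `[0, 2π]` bound on `S` is an assumed proxy for "is in radians".

```python
import math


def check_guards(ranges, eta_bounds=(0.0, 5.0), grad_limit=1e4):
    """Return a list of violations; an empty list means every guard passes."""
    problems = []

    s_min, s_max = ranges["S_rad"]
    # Heuristic (assumption): an angle in radians should lie in [0, 2*pi].
    if s_min < 0.0 or s_max > 2 * math.pi:
        problems.append(f"S outside [0, 2*pi]: {(s_min, s_max)}")

    eta_min, eta_max = ranges["eta"]
    if eta_min < eta_bounds[0] or eta_max > eta_bounds[1]:
        problems.append(f"eta outside {eta_bounds}: {(eta_min, eta_max)}")

    for name in ("f3", "f5"):
        low, high = ranges[name]
        finite = math.isfinite(low) and math.isfinite(high)
        if not finite or max(abs(low), abs(high)) > grad_limit:
            problems.append(f"{name} outside +/-{grad_limit:g}: {(low, high)}")

    return problems


# Ranges copied from the printed output above.
observed = {
    "S_rad": (0.5235987755982988, 1.0471975511965976),
    "eta": (0.0, 3.0),
    "f3": (-1.2836186023144711, -0.2163378389205405),
    "f5": (-5.072533298528624, 5.64949),
}
problems = check_guards(observed)

# The same data with S left in degrees trips exactly one guard.
degrees_problems = check_guards({**observed, "S_rad": (30.0, 60.0)})
print(problems, degrees_problems)
```

Returning a list of messages instead of raising on the first failure reports every violated guard in one pass; wrapping the call in `assert not problems, problems` would make the notebook fail fast.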


## Persist clean dataset + canonical train/test split

In [5]:
train_df, test_df = preprocess.split_dataset(clean_df, test_size=0.2, random_state=42)
preprocess.save_datasets(clean_df, train_df, test_df, CLEAN_PATH, TRAIN_PATH, TEST_PATH)
CLEAN_PATH, TRAIN_PATH, TEST_PATH

(WindowsPath('clean_dataset.csv'),
 WindowsPath('data/processed/train_dataset.csv'),
 WindowsPath('data/processed/test_dataset.csv'))
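`preprocess.split_dataset` is not shown in this notebook. A minimal sketch of a seeded 80/20 row split, assuming a plain random sample with no stratification or grouping, might look like the following. `split_frame` and the toy frame are hypothetical.

```python
import pandas as pd


def split_frame(df, test_size=0.2, random_state=42):
    """Seeded random row split (hypothetical stand-in for the real helper)."""
    test_part = df.sample(frac=test_size, random_state=random_state)
    train_part = df.drop(index=test_part.index)
    return train_part.reset_index(drop=True), test_part.reset_index(drop=True)


# Toy frame with ten rows and unique eta values.
demo = pd.DataFrame(
    {
        "eta": [round(0.333 * i, 3) for i in range(10)],
        "f3": [-0.5 - 0.01 * i for i in range(10)],
    }
)

demo_train, demo_test = split_frame(demo)
train_again, test_again = split_frame(demo)  # same seed, same partition

print(len(demo_train), len(demo_test))
```

Fixing `random_state` is what makes the split canonical: rerunning the cell reproduces the same partition, so later notebooks can be compared on identical test rows.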

The resulting CSVs keep the canonical ordering `[M, S, K, phi1, phi2, Ec, Pr, eta, f3, f5]` so downstream classical ML and neural models can map the physics inputs directly to the gradients.
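The column-order guarantee can be checked on a written file with the standard library. The sketch below is illustrative: `has_canonical_header` is a hypothetical helper, and in-memory strings stand in for the real CSV files.

```python
import csv
import io

CANONICAL_COLUMNS = ["M", "S", "K", "phi1", "phi2", "Ec", "Pr", "eta", "f3", "f5"]


def has_canonical_header(handle):
    """True when the first CSV row equals the canonical column list exactly."""
    header = next(csv.reader(handle), None)
    return header == CANONICAL_COLUMNS


good_csv = (
    "M,S,K,phi1,phi2,Ec,Pr,eta,f3,f5\n"
    "0.5,1.047198,1.0,0.02,0.01,0.5,204.0,0.0,-0.52987,5.64949\n"
)
# A file written with the row index included gains a leading unnamed column.
indexed_csv = "," + good_csv

good_ok = has_canonical_header(io.StringIO(good_csv))
indexed_ok = has_canonical_header(io.StringIO(indexed_csv))
print(good_ok, indexed_ok)
```

For a file on disk, the same function can be called as `has_canonical_header(open(path, newline=""))`. The second case matters because `DataFrame.to_csv` writes the index by default, which would shift every feature by one position for a reader that relies on column order.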