# Machine Learning Model Trainer

## Detailed Analysis

This script manages the training of a hierarchy of machine learning classifiers used to distinguish between Hadronic and Quark stars. It implements a training pipeline that includes data cleaning, group-aware splitting, and probability calibration.

The script trains five models (`Geo`, `A`, `B`, `C`, `D`), each with access to progressively more physical information, from basic macroscopic observables to internal microscopic parameters. Comparing adjacent models in this hierarchy shows how much classification performance is gained by adding a variable such as tidal deformability or phase-space topology.

## Physics and Math

### Feature Hierarchy
The models are defined by the subset of physics features they utilize:

1.  **Model Geo:** $M, R$ (Macroscopic, Metric only).
2.  **Model A:** $M, R, \log_{10}\Lambda$ (Macroscopic, Full Observables).
3.  **Model B:** $+ \varepsilon_c$ (Central Density).
4.  **Model C:** $+ c_s^2$ (Speed of Sound).
5.  **Model D:** $+ dR/dM$ (Curve Slope / Topology).
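
The nesting above can be written as cumulative column lists. This is a minimal sketch, not the script's actual definitions: `Eps_Central` and `Slope14` are column names that appear later in this document (with `Slope14` assumed to hold $dR/dM$), while `Mass`, `Radius`, `LogLambda`, and `Cs2` are placeholder names chosen for illustration.

```python
# Cumulative feature sets: each model adds one column to the previous one.
# 'Mass', 'Radius', 'LogLambda' and 'Cs2' are hypothetical column names.
GEO = ["Mass", "Radius"]
ADDED = [
    ("A", "LogLambda"),    # log10 of the tidal deformability
    ("B", "Eps_Central"),  # central density
    ("C", "Cs2"),          # speed of sound squared
    ("D", "Slope14"),      # assumed to hold dR/dM
]

FEATURE_SETS = {"Geo": list(GEO)}
current = list(GEO)
for name, extra in ADDED:
    current = current + [extra]  # new list each time, so earlier sets are not mutated
    FEATURE_SETS[name] = current

for name, columns in FEATURE_SETS.items():
    print(f"{name}: {columns}")
```

Building the sets cumulatively guarantees that every model is a strict superset of the one before it, which is what makes the model-to-model comparison meaningful.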

### Validation Strategy
Standard random splitting is insufficient for this physics problem because points belonging to the same Equation of State (EoS) curve are highly correlated. To prevent data leakage (where the model "memorizes" the curve rather than learning the physics), **Group Shuffle Split** is used. The "Group" is defined by the `Curve_ID`. This ensures that if a specific EoS is in the training set, no stars generated from that EoS appear in the test set.

### Probability Calibration
Random Forests often produce uncalibrated probabilities (e.g., predicting 0.6 for a group of stars of which 80% actually belong to the positive class). To correct this, **Isotonic Regression** is applied to the classifier outputs:

$$
P_{calibrated} = f(P_{raw})
$$

where $f$ is a non-decreasing function learned on held-out cross-validation folds. After calibration, the output can be read as an estimate of the posterior probability $P(\text{Quark} | \text{Features})$, to the extent that the held-out folds are representative of new data.
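
To make the mapping $f$ concrete, the sketch below fits scikit-learn's `IsotonicRegression` to synthetic scores. The data are invented for illustration: the raw scores are deliberately under-confident (the true positive fraction is $\sqrt{P_{raw}}$, an arbitrary choice), and none of the variable names come from the training script.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic raw scores; labels are drawn so the true positive fraction is
# sqrt(p_raw), i.e. the raw scores systematically underestimate it.
p_raw = rng.uniform(0.0, 1.0, size=2000)
labels = (rng.uniform(0.0, 1.0, size=2000) < np.sqrt(p_raw)).astype(int)

# Learn a non-decreasing map from raw score to observed frequency.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_raw, labels)

grid = np.linspace(0.0, 1.0, 11)
p_cal = iso.predict(grid)
for raw, cal in zip(grid, p_cal):
    print(f"raw={raw:.1f} -> calibrated={cal:.2f}")
```

The fitted function is a step-wise, non-decreasing curve, so the ranking of stars by score is preserved while the probability values themselves are corrected.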

## Code Walkthrough

### 1. Pre-Processing and Safety Checks
The script first prepares the dataset. A logarithmic transformation is applied to the Tidal Deformability ($\Lambda$) to stabilize the distribution.

Rows containing `NaN` in any required feature, such as `Slope14`, are dropped. This occurs for low-mass stars that collapse before reaching the canonical $1.4 M_{\odot}$ reference point. Removing them keeps the comparison fair: all five models (Geo through D) are trained and tested on exactly the same set of stars.

```python
# Drop rows where critical physics features are NaN
df_clean = df.dropna(subset=['Slope14', 'Eps_Central', ...]).copy()

groups = df_clean['Curve_ID']
```
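
The snippet below sketches both preprocessing steps on a toy table. The values are made up, `Lambda` and `LogLambda` are assumed column names, and the list of required columns is only an example, not the script's full list.

```python
import numpy as np
import pandas as pd

# Toy table: curve 1 has no slope value at the reference mass.
df = pd.DataFrame({
    "Curve_ID": [0, 0, 1, 1],
    "Lambda": [1200.0, 300.0, 800.0, 150.0],  # hypothetical column name
    "Eps_Central": [450.0, 900.0, 500.0, 1100.0],
    "Slope14": [-1.2, -1.2, np.nan, np.nan],
})

# Log-transform the tidal deformability to compress its dynamic range.
df["LogLambda"] = np.log10(df["Lambda"])

# Drop incomplete rows once, up front, so every model sees the same stars.
required = ["Slope14", "Eps_Central", "LogLambda"]  # example subset only
df_clean = df.dropna(subset=required).copy()
groups = df_clean["Curve_ID"]

print(f"{len(df)} rows -> {len(df_clean)} rows")
```

Dropping rows before any model-specific column selection is what keeps the row set identical across models, even for models that never use `Slope14`.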

### 2. Cross-Validation Split
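The cleaned dataset is split once into training and test sets, using `Curve_ID` as the grouping key. With `GroupShuffleSplit`, `test_size=0.25` is the fraction of groups held out, so roughly a quarter of the EoS curves (not of the individual stars) go to the test set.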

```python
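from sklearn.model_selection import GroupShuffleSplit
# One group-aware split: every Curve_ID lands entirely in train or entirely in test
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(df_clean, y, groups=groups))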
```
This ensures that the test set consists only of EoS curves the model never saw during training.
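
A quick way to confirm the no-leakage property is to check that the two sets share no curve. The sketch below does this on synthetic group labels (20 curves with 10 stars each, an arbitrary choice); in the real script the same check could be run on `groups.iloc[train_idx]` and `groups.iloc[test_idx]`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 20 curves, 10 stars per curve.
groups = np.repeat(np.arange(20), 10)
X = np.zeros((groups.size, 2))  # feature values are irrelevant for the split
y = groups % 2                  # one class label per curve

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

train_curves = set(groups[train_idx])
test_curves = set(groups[test_idx])

# No curve may appear on both sides of the split.
assert train_curves.isdisjoint(test_curves)
print(f"{len(train_curves)} training curves, {len(test_curves)} test curves")
```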

### 3. Hyperparameter Configuration
Two configurations are defined:
*   **Base Configuration:** Deep trees (`max_depth=15`) with `max_features='sqrt'` are used for the complex models (A-D) so that they can capture the intricate topological boundaries in the phase space.
*   **Geo Configuration:** A constrained configuration (`max_depth=10`) is used for the geometric model to limit overfitting, as the classes overlap significantly in the Mass-Radius plane alone.
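
One way to express the two configurations is as keyword dictionaries selected by model name. This is a sketch under assumptions: only `max_depth` and `max_features` come from the description above; the helper `make_forest`, the `random_state` value, and leaving every other parameter (including `max_features` for Geo) at the library default are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier

# Only these values are taken from the text; everything else is a default.
BASE_PARAMS = {"max_depth": 15, "max_features": "sqrt"}
GEO_PARAMS = {"max_depth": 10}

def make_forest(model_name, random_state=42):
    """Hypothetical helper: build an unfitted forest for one model in the hierarchy."""
    params = GEO_PARAMS if model_name == "Geo" else BASE_PARAMS
    return RandomForestClassifier(random_state=random_state, **params)

for name in ["Geo", "A", "B", "C", "D"]:
    forest = make_forest(name)
    print(name, forest.max_depth, forest.max_features)
```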

### 4. Training Loop with Calibration
The script iterates through each model definition. Instead of fitting the `RandomForestClassifier` directly, it is wrapped in a `CalibratedClassifierCV`.

```python
from sklearn.calibration import CalibratedClassifierCV
# Wrap the (unfitted) forest in isotonic calibration with 3 internal folds
cal_rf = CalibratedClassifierCV(base_rf, method='isotonic', cv=3)

# Fit the Calibrated Stack
cal_rf.fit(df_clean.iloc[train_idx][cols], y.iloc[train_idx])
```
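With `cv=3`, scikit-learn fits a copy of the base estimator on two thirds of the training data and learns the isotonic mapping on the remaining held-out fold, repeating this for each of the three folds. By default, the fitted object then averages the predictions of the three calibrated forests.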
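
One caveat: an integer `cv` builds stratified folds that ignore `Curve_ID`, so stars from the same curve can land on both sides of an internal calibration fold. If that matters, `CalibratedClassifierCV` also accepts an iterable of precomputed splits. The sketch below shows that variant on synthetic data; it is an optional alternative, not what the script does, and the forest settings other than `max_depth` and `max_features` are arbitrary.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic training set: 30 curves, 20 stars each, one class per curve.
groups = np.repeat(np.arange(30), 20)
y = (np.arange(30) % 2)[groups]
X = rng.normal(size=(groups.size, 3)) + y[:, None]

# Precompute group-aware folds so no curve is split across a calibration fold.
folds = list(GroupKFold(n_splits=3).split(X, y, groups=groups))
for fit_idx, cal_idx in folds:
    assert set(groups[fit_idx]).isdisjoint(groups[cal_idx])

base_rf = RandomForestClassifier(
    n_estimators=50, max_depth=15, max_features="sqrt", random_state=0
)
cal_rf = CalibratedClassifierCV(base_rf, method="isotonic", cv=folds)
cal_rf.fit(X, y)

proba_quark = cal_rf.predict_proba(X)[:, 1]
print(len(cal_rf.calibrated_classifiers_), "calibrated forests fitted")
```

The fold indices refer to row positions in the array passed to `fit`, so they must be computed on the training subset, not on the full table.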

### 5. Diagnostics and Patching
After training, the script calculates training and testing accuracies to detect overfitting.
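
A minimal version of this diagnostic is sketched below. The helper name `accuracy_gap` and the toy data are assumptions for illustration; the toy split is a plain slice rather than a group-aware split, since only the metric is being demonstrated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return train accuracy, test accuracy, and their difference."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc

# Toy data standing in for the real feature table.
rng = np.random.default_rng(1)
y_all = rng.integers(0, 2, size=400)
X_all = rng.normal(size=(400, 2)) + 0.8 * y_all[:, None]
X_tr, y_tr = X_all[:300], y_all[:300]
X_te, y_te = X_all[300:], y_all[300:]

model = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=0)
model.fit(X_tr, y_tr)

train_acc, test_acc, gap = accuracy_gap(model, X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large positive gap suggests the forest is fitting noise in the training curves; a gap near zero suggests the depth limit is doing its job.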

Finally, `CalibratedClassifierCV` does not expose a `feature_importances_` attribute of its own, so a patching step is performed for Model D: the feature importances of the Random Forests inside the calibrated sub-estimators are averaged and re-attached to the main object. This allows downstream visualization scripts (like `plot_physical_insights.py`) to access and plot the Gini importance scores.
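
The patching step could look roughly like the sketch below. The helper `averaged_importances` is hypothetical, and the data are synthetic with only the first feature carrying signal. The inner forest is stored as `estimator` in recent scikit-learn releases and as `base_estimator` in older ones, so the sketch checks both.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

def averaged_importances(cal_clf):
    """Average Gini importances over the forests inside a fitted calibrated model."""
    per_fold = []
    for fitted in cal_clf.calibrated_classifiers_:
        forest = getattr(fitted, "estimator", None)
        if forest is None:  # older scikit-learn naming
            forest = fitted.base_estimator
        per_fold.append(forest.feature_importances_)
    return np.mean(per_fold, axis=0)

# Synthetic data: feature 0 separates the classes, features 1 and 2 are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=600)
X = rng.normal(size=(600, 3))
X[:, 0] += 2.0 * y

base_rf = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=0)
cal_rf = CalibratedClassifierCV(base_rf, method="isotonic", cv=3).fit(X, y)

# Re-attach the averaged importances so plotting code can read them as usual.
cal_rf.feature_importances_ = averaged_importances(cal_rf)
print(cal_rf.feature_importances_)
```

Because each forest's importances sum to one, their average does too, so the patched attribute can be plotted on the same scale as a plain Random Forest's.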