# Dragon ML Toolbox - Regression Tutorial with DatasetMaker

This notebook demonstrates the complete workflow for a **regression task** using the `dragon-ml-toolbox`. It showcases the new `DatasetMaker` for streamlined data preprocessing before training with `MyTrainer`.

## 1. Imports

First, we import all necessary components. Notice the new import of `DatasetMaker` from `ml_tools.datasetmaster`.

In [None]:
import torch
from torch import nn
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
from pathlib import Path

# Import from your dragon_ml_toolbox package
from ml_tools.datasetmaster import DatasetMaker
from ml_tools.ML_trainer import MyTrainer
from ml_tools.ML_callbacks import EarlyStopping, ModelCheckpoint
from ml_tools.keys import LogKeys

## 2. Setup Device

We'll automatically select the best available hardware accelerator.

In [None]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

## 3. Prepare the Data with `DatasetMaker`

Here, we'll generate a synthetic regression dataset and use the fluent interface of `DatasetMaker` to one-hot encode the categorical features, split the data 80/20 into training and test sets, and standardize the continuous features. This replaces the need for manual splitting and normalization.

In [None]:
# Create a synthetic dataset of continuous features (categorical ones are added below)
X, y = make_regression(
    n_samples=1000, 
    n_features=10, 
    n_informative=7, 
    noise=25,
    random_state=42
)
X = pd.DataFrame(X, columns=[f'cont_feature_{i+1}' for i in range(10)])

# Add some categorical features for a more realistic example
X['cat_feature_1'] = pd.cut(X['cont_feature_1'], bins=4, labels=['A', 'B', 'C', 'D'])
# Seed the generator so the random categories are reproducible across runs
X['cat_feature_2'] = np.random.default_rng(42).choice(['TypeX', 'TypeY', 'TypeZ'], size=1000)

# Combine features and target into a single DataFrame
df = X.copy()
df['target'] = y

print("Original data sample:")
display(df.head())

# --- Use DatasetMaker for preprocessing ---
maker = DatasetMaker(pandas_df=df, label_col='target')

(
    maker.process_categoricals(method='one-hot', drop_first=True)
         .split_data(test_size=0.2, random_state=42)
         .normalize_continuous(method='standard')
)

# Get the final PyTorch datasets
train_dataset, test_dataset = maker.get_datasets()

# We can also inspect the processed dataframes to get feature names and shapes
X_train_df, X_test_df, y_train_s, y_test_s = maker.inspect_dataframes()

print("\nShape of processed training features:", X_train_df.shape)
print("Shape of processed testing features:", X_test_df.shape)
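
For comparison, the manual steps that the chain above stands in for can be sketched with plain pandas and NumPy. This is only an illustration under stated assumptions: the stand-in data is random rather than produced by `make_regression`, the `*_manual` names are hypothetical, and nothing here claims to mirror how `DatasetMaker` works internally. scikit-learn's `train_test_split` and `StandardScaler` would be the more usual tools for steps 2 and 3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 1000
continuous_cols = [f'cont_feature_{i+1}' for i in range(10)]

# Stand-in data shaped like the tutorial's DataFrame (random values, not make_regression)
demo = pd.DataFrame(rng.normal(size=(n_rows, 10)), columns=continuous_cols)
demo['cat_feature_1'] = pd.cut(demo['cont_feature_1'], bins=4, labels=['A', 'B', 'C', 'D'])
demo['cat_feature_2'] = rng.choice(['TypeX', 'TypeY', 'TypeZ'], size=n_rows)
demo['target'] = rng.normal(size=n_rows)

# 1. One-hot encode the categorical columns, dropping the first level of each
features = pd.get_dummies(demo.drop(columns='target'), drop_first=True, dtype=float)
target = demo['target']

# 2. Shuffle the row positions and hold out 20% of them for testing
perm = rng.permutation(n_rows)
n_test = int(n_rows * 0.2)
test_idx, train_idx = perm[:n_test], perm[n_test:]
X_train_manual = features.iloc[train_idx].copy()
X_test_manual = features.iloc[test_idx].copy()
y_train_manual, y_test_manual = target.iloc[train_idx], target.iloc[test_idx]

# 3. Standardize continuous columns with statistics from the training split only,
#    so no information from the test split leaks into preprocessing
train_mean = X_train_manual[continuous_cols].mean()
train_std = X_train_manual[continuous_cols].std(ddof=0)
X_train_manual[continuous_cols] = (X_train_manual[continuous_cols] - train_mean) / train_std
X_test_manual[continuous_cols] = (X_test_manual[continuous_cols] - train_mean) / train_std

print(X_train_manual.shape, X_test_manual.shape)
```

In this sketch the 12 raw feature columns become 15 after encoding: 10 continuous, plus 3 for the four-level `cat_feature_1` and 2 for the three-level `cat_feature_2`, because `drop_first=True` removes one level per feature.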

## 4. Define Model, Criterion, and Optimizer

The model's input layer size is now determined by the shape of the processed data from `DatasetMaker`.

In [None]:
class SimpleRegressor(nn.Module):
    def __init__(self, input_features, output_features=1):
        super().__init__()
        self.layer_1 = nn.Linear(input_features, 128)
        self.layer_2 = nn.Linear(128, 64)
        self.layer_3 = nn.Linear(64, output_features)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.layer_1(x))
        x = self.relu(self.layer_2(x))
        return self.layer_3(x)

# Get the number of input features from our processed training data
input_size = X_train_df.shape[1]

# Instantiate the components
model = SimpleRegressor(input_features=input_size)
criterion = nn.MSELoss() # Mean Squared Error is common for regression
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
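
As a rough sense of scale, the number of trainable parameters in `SimpleRegressor` can be worked out by hand. The sketch below assumes 15 input features (10 continuous plus 3 and 2 one-hot columns, which is what one-hot encoding with `drop_first=True` would typically yield). The real value comes from `X_train_df.shape[1]` and may differ. `linear_params` and `assumed_input_size` are hypothetical names used only for this illustration.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


assumed_input_size = 15  # assumption: 10 continuous + 3 + 2 one-hot columns
layer_sizes = [assumed_input_size, 128, 64, 1]

# Sum over consecutive layer pairs: 15->128, 128->64, 64->1
total_params = sum(
    linear_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
print(total_params)  # 2048 + 8256 + 65 = 10369
```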

## 5. Configure Callbacks

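We'll configure `ModelCheckpoint` and `EarlyStopping` to monitor the validation loss. Both use `mode='min'` because a lower loss is better: the checkpoint keeps only the best weights, and early stopping ends training after 10 epochs without improvement.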

In [None]:
CHECKPOINT_DIR = 'checkpoints_regression'
MONITOR_METRIC = LogKeys.VAL_LOSS

model_checkpoint = ModelCheckpoint(
    save_dir=CHECKPOINT_DIR,
    monitor=MONITOR_METRIC,
    save_best_only=True, 
    mode='min',
    verbose=1
)

early_stopping = EarlyStopping(
    monitor=MONITOR_METRIC,
    patience=10,
    mode='min',
    verbose=1
)
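
The idea behind `patience` with `mode='min'` can be illustrated with a small, self-contained sketch. `PatienceTracker` is a hypothetical helper written for this explanation, not the toolbox's `EarlyStopping` implementation, which may differ in details such as minimum-delta handling.

```python
class PatienceTracker:
    """Hypothetical helper illustrating patience-based early stopping (mode='min')."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.wait = 0  # epochs since the last improvement

    def step(self, value: float) -> bool:
        """Record one epoch's metric; return True when training should stop."""
        if value < self.best - self.min_delta:
            self.best = value
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


tracker = PatienceTracker(patience=3)
val_losses = [1.00, 0.80, 0.75, 0.76, 0.78, 0.77, 0.74]  # made-up values

stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if tracker.step(loss):
        stopped_at = epoch
        break

print(stopped_at, tracker.best)  # 6 0.75
```

With these made-up losses, the best value (0.75) occurs at epoch 3, and three consecutive epochs without improvement trigger the stop at epoch 6, so the later 0.74 is never reached. This is the trade-off a small `patience` makes.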

## 6. Initialize the Trainer

Instantiate `MyTrainer` with the datasets created by `DatasetMaker`.

In [None]:
trainer = MyTrainer(
    model=model,
    train_dataset=train_dataset, # From DatasetMaker
    test_dataset=test_dataset,   # From DatasetMaker
    kind='regression', # Specify the task
    criterion=criterion,
    optimizer=optimizer,
    device=device,
    callbacks=[model_checkpoint, early_stopping]
)

## 7. Train the Model

Call `.fit()` to start training for up to 100 epochs. `MyTrainer` runs the training loop and invokes the callbacks, so `EarlyStopping` may end the run sooner.

In [None]:
history = trainer.fit(epochs=100, batch_size=64, shuffle=True)
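
To get a feel for what `batch_size=64` means here, the number of batches per epoch can be estimated by hand. The sketch assumes the 80/20 split leaves exactly 800 training rows and that the final partial batch is kept (the default behavior of PyTorch's `DataLoader`, where `drop_last=False`). Whether `MyTrainer` does the same is an assumption.

```python
import math

n_train = 800    # assumption: 1000 samples with test_size=0.2
batch_size = 64

batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)  # 13 32
```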

## 8. Evaluate the Model

Load the best model and call `.evaluate()` to generate and save a full performance report for our regression task.

In [None]:
# Load the best model saved by the callback
best_model_path = model_checkpoint.last_best_filepath

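if best_model_path and Path(best_model_path).exists():
    print(f'Loading best model weights from: {best_model_path}')
    # map_location keeps the load working if the checkpoint was saved on another device
    trainer.model.load_state_dict(torch.load(best_model_path, map_location=device))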
else:
    print('Warning: No best model found. Evaluating with the last model state.')

# Define a directory to save all evaluation artifacts
EVAL_DIR = Path('tutorial_results_regression') / 'evaluation_report'

# Evaluate the model
trainer.evaluate(save_dir=EVAL_DIR)
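
The exact contents of the report are not shown here. As a hand-rolled illustration of metrics commonly reported for regression (not `MyTrainer`'s implementation), the sketch below computes MSE, RMSE, MAE, and R² in plain Python. `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return common regression metrics for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        'mse': mse,
        'rmse': math.sqrt(mse),
        'mae': sum(abs(e) for e in errors) / n,
        'r2': 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

For these four made-up points the MSE is 0.375, the MAE is 0.5, and R² is roughly 0.949.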

## 9. Explain the Model

Finally, use `.explain()` to generate SHAP plots. The feature names are taken directly from the processed DataFrame columns provided by `DatasetMaker`.

In [None]:
# Define a directory to save all explanation artifacts
EXPLAIN_DIR = Path('tutorial_results_regression') / 'explanation_report'

# Generate and save SHAP summary plots
trainer.explain(
    explain_dataset=test_dataset, 
    n_samples=100,
    feature_names=X_train_df.columns.tolist(), # Get feature names from our processed DF
    save_dir=EXPLAIN_DIR
)