# Training a segmentation model

This sample notebook shows the training process.
It trains a PyTorch estimator in script mode.
I experimented with different hyperparameters to get the best results.

In [33]:
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.experiments.run import Run
from sagemaker.pytorch import PyTorch
from sagemaker.session import TrainingInput

In [34]:
sess = sagemaker.Session()
bucket = sess.default_bucket()

In [38]:
parameters = {
    "epoch": "50",
    "num-workers": "4",
    "alpha": "0.5",
    "lr": "0.01",
    "architecture": "unet++",
    "backbone": "efficientnet-b1"
}
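
In script mode, the SageMaker training toolkit hands each entry of the `hyperparameters` dict to the entry point as a command-line argument (for example `--epoch 50 --num-workers 4`). The real `train.py` in `../src` is not shown in this notebook, so the following is only a sketch of how such a script might read these values. The helper name `parse_args`, the argument types, and the defaults are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed as CLI arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Argument names mirror the keys of the `parameters` dict above.
    # Types and defaults are assumptions; the values arrive as strings.
    parser.add_argument("--epoch", type=int, default=50)
    parser.add_argument("--num-workers", type=int, default=4)
    parser.add_argument("--alpha", type=float, default=0.5)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--architecture", type=str, default="unet++")
    parser.add_argument("--backbone", type=str, default="efficientnet-b1")
    # parse_known_args ignores any extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling `parse_args(["--epoch", "50", "--lr", "0.01"])` returns a namespace in which `args.epoch` is an `int` and `args.lr` is a `float`; argparse exposes `--num-workers` as `args.num_workers`.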

In [36]:
tensor_board_output_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://{bucket}/tensorboard/",
    container_local_output_path="/opt/ml/output/tensorboard"
)

In [41]:
estimator = PyTorch(
    entry_point="train.py",
    source_dir="../src",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    image_uri="971422676823.dkr.ecr.us-east-1.amazonaws.com/ct-images-segmentation:gpu-latest",
    hyperparameters=parameters,
    tensorboard_output_config=tensor_board_output_config,
    output_path=f"s3://{bucket}/training_jobs",
    base_job_name="unet-plus-plus-with-dice-loss-b1"
)

estimator.fit(
    {
        "training": TrainingInput(f"s3://{bucket}/data/processed/train", distribution="FullyReplicated"),
        "test": TrainingInput(f"s3://{bucket}/data/processed/val", distribution="FullyReplicated"),
    },
    wait=True
)
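
Each key of the dict passed to `fit` becomes an input channel. With the SageMaker training toolkit, a channel's data is downloaded to `/opt/ml/input/data/<channel>` inside the container, and the path is also exposed through an `SM_CHANNEL_<NAME>` environment variable. The following is a minimal sketch of how the entry point could resolve these directories; `channel_dir` is a hypothetical helper, not part of the SDK, and the actual `train.py` may do this differently.

```python
import os
from pathlib import Path


def channel_dir(name, environ=None):
    """Return the local directory of an input channel (hypothetical helper).

    Prefers the SM_CHANNEL_<NAME> environment variable and falls back to the
    conventional /opt/ml/input/data/<name> location.
    """
    environ = os.environ if environ is None else environ
    default = f"/opt/ml/input/data/{name}"
    return Path(environ.get(f"SM_CHANNEL_{name.upper()}", default))


train_dir = channel_dir("training")
val_dir = channel_dir("test")  # the "test" channel is fed from data/processed/val
```

Note that the channel named `test` receives the validation split here, so a script reading it would treat that directory as validation data.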

INFO:sagemaker:Creating training-job with name: unet-plus-plus-with-dice-loss-b1-2025-06-27-22-49-28-161


2025-06-27 22:49:28 Starting - Starting the training job
2025-06-27 22:49:28 Pending - Training job waiting for capacity............
2025-06-27 22:51:12 Pending - Preparing the instances for training...
2025-06-27 22:51:41 Downloading - Downloading input data...
2025-06-27 22:52:11 Downloading - Downloading the training image...........................
2025-06-27 22:56:50 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
CUDA compat package should be installed for NVIDIA driver smaller than 560.35.05
Current installed NVIDIA driver version is 550.163.01
Adding CUDA compat to LD_LIBRARY_PATH
/usr/local/cuda/compat:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/lib/x86_64-linux-gnu:/usr/local/lib:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2025-06-27 22:57:11,549 sagemaker-training-toolk