# 01 · Load Data 
**Purpose** This notebook is the starting point for working with the LUNA16 dataset: it locates the raw files and runs a first sanity check on them.

## Workflow
- Locate and load raw CT scan volumes (.mhd files) and the official annotations.csv file provided with LUNA16.
- Provide a quick overview of the dataset by reporting:
  - Number of subsets available
  - Number of CT volumes discovered
  - Number of annotated nodules
- Demonstrate how to read a sample CT scan using SimpleITK and convert it into a NumPy array for further processing.
- Display the array shape (slices × height × width) and intensity value range (in Hounsfield Units).

This notebook does not perform preprocessing, patch extraction, or model training. Instead, it ensures that the raw data is correctly mounted and accessible, and provides an initial sanity check before downstream steps.
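
The mount check described above can be made explicit before anything is loaded. The following is a minimal sketch, assuming the layout the cells below rely on: an `annotations.csv` at the dataset root and directories whose names start with `subset`. `check_layout` is a hypothetical helper written for illustration, not part of any library.

```python
from pathlib import Path


def check_layout(raw_dir):
    """Return a list of problems found; an empty list means the layout looks usable."""
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        # Nothing else can be checked if the root itself is missing.
        return [f"dataset root not found: {raw_dir}"]

    problems = []
    if not (raw_dir / "annotations.csv").is_file():
        problems.append("annotations.csv is missing")
    if not any(p.is_dir() for p in raw_dir.glob("subset*")):
        problems.append("no subset* directories found")
    if not any(raw_dir.rglob("*.mhd")):
        problems.append("no .mhd files found")
    return problems


if __name__ == "__main__":
    # Same root path as used in the cells below.
    for problem in check_layout("/kaggle/input/luna16"):
        print("WARNING:", problem)
```

Returning a list instead of raising on the first failure lets a single run report every missing piece at once.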

In [None]:
import pandas as pd
import SimpleITK as sitk

from pathlib import Path

In [None]:
RAW_DIR = Path("/kaggle/input/luna16")

ann = pd.read_csv(RAW_DIR / "annotations.csv")
mhd_files = sorted(RAW_DIR.rglob("*.mhd"))

In [None]:
print(f"Subsets       : {len(list(RAW_DIR.glob('subset*')))}")
print(f"CT volumes    : {len(mhd_files)}")
print(f"Annotated nodules: {len(ann)}")

Subsets       : 5
CT volumes    : 1333
Annotated nodules: 1186
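
These totals do not show which directories the volumes come from, or whether the annotated scans are among the files on disk. The sketch below breaks the counts down. It assumes that each `.mhd` file is named after its series UID and that `annotations.csv` has a `seriesuid` column, as in the official LUNA16 release; neither has been verified for this mount. Both functions are hypothetical helpers, and the demo uses made-up UIDs.

```python
from collections import Counter
from pathlib import Path


def volumes_per_directory(mhd_paths):
    """Count .mhd files by the name of their immediate parent directory."""
    return Counter(Path(p).parent.name for p in mhd_paths)


def annotation_coverage(series_uids, mhd_paths):
    """Split annotated series UIDs into (found on disk, not found on disk)."""
    on_disk = {Path(p).stem for p in mhd_paths}  # file name without ".mhd"
    annotated = set(series_uids)                 # one entry per scan, not per nodule
    return annotated & on_disk, annotated - on_disk


if __name__ == "__main__":
    # Toy stand-ins for mhd_files and ann["seriesuid"]; these UIDs are made up.
    toy_paths = [Path("subset0/1.2.3.mhd"), Path("subset0/1.2.4.mhd"), Path("subset1/1.2.5.mhd")]
    toy_uids = ["1.2.3", "1.2.3", "1.2.9"]
    print(volumes_per_directory(toy_paths))
    found, missing = annotation_coverage(toy_uids, toy_paths)
    print(len(found), "annotated scans on disk,", len(missing), "not found")
```

In this notebook the calls would be `volumes_per_directory(mhd_files)` and `annotation_coverage(ann["seriesuid"], mhd_files)`. If only some subsets are mounted, a non-empty "not found" set is expected, because the annotation file covers the whole dataset.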


In [None]:
sample = mhd_files[0]
img = sitk.ReadImage(str(sample))
arr = sitk.GetArrayFromImage(img)

print("shape", arr.shape, "HU-range", (arr.min(), arr.max()))

shape (194, 512, 512) HU-range (0, 5)
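
A range of (0, 5) is unusual for a CT volume in Hounsfield Units, where air sits near -1000 and dense tissue reaches several hundred or more. Values like these are more typical of a label or mask volume. Because `rglob("*.mhd")` collects every `.mhd` file under the root, the first match is not guaranteed to be a scan. The sketch below shows two possible guards, assuming the scans live somewhere under the `subset*` directories. `ct_scan_paths` and `looks_like_hu` are hypothetical helpers, and the thresholds are illustrative only.

```python
from pathlib import Path


def ct_scan_paths(raw_dir):
    """Return sorted .mhd paths that sit somewhere below a subset* directory."""
    raw_dir = Path(raw_dir)
    return sorted(
        p for p in raw_dir.rglob("*.mhd")
        # Inspect directory components only, not the file name itself.
        if any(part.startswith("subset") for part in p.relative_to(raw_dir).parts[:-1])
    )


def looks_like_hu(lo, hi, air_max=-500, tissue_min=100):
    """Heuristic: a CT volume should span from air-like to tissue-like values."""
    return lo <= air_max and hi >= tissue_min


if __name__ == "__main__":
    print(looks_like_hu(0, 5))         # the range printed above -> False
    print(looks_like_hu(-1024, 1500))  # a made-up, CT-like range -> True
```

In the notebook, the check would be `looks_like_hu(arr.min(), arr.max())` on the loaded sample. A failed check is a prompt to inspect `sample` and its parent directory; it does not prove that the file is unusable.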
