# Background

We discovered that several entries in the `candidates.csv` file appear multiple times. These repeated entries are not exact duplicates, which suggests that the original annotations made by human experts were not properly cleaned before being added to the file. For example, some entries might refer to the same lung nodule, but appear in different CT image slices. Interestingly, these types of variations might actually be helpful for training our classifier, since they provide multiple views of the same object.

To address this issue, we have prepared a cleaned version of the `annotations.csv` file. The LUNA dataset we are using is derived from a more comprehensive dataset, the Lung Image Database Consortium Image Collection (LIDC-IDRI), which includes detailed annotations from multiple radiologists.

This script extracts the original LIDC annotations, identifies the actual nodules, removes redundant or duplicate entries, and saves the cleaned data to a file named `annotations_with_malignancy.csv`.

Using this new file, we can now use the `getCandidateInfoList` function defined in `dsets.py` to extract nodules based on the improved annotation data. The function reads the file with a CSV reader, iterates over each annotation, converts the fields to the appropriate data types, and stores each record in a structure called `CandidateInfoTuple`.
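
The parsing step described above can be sketched as follows. This is only an illustration: the field layout of `CandidateInfoTuple` and the helper name `read_annotation_rows` are assumptions, and the real definitions in `dsets.py` may differ. The column names are the ones used elsewhere in this notebook.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field layout; the real CandidateInfoTuple in dsets.py may differ.
CandidateInfoTuple = namedtuple(
    "CandidateInfoTuple",
    "is_nodule, has_annotation, is_malignant, diameter_mm, series_uid, center_xyz",
)


def read_annotation_rows(csv_file):
    """Parse annotation rows into typed tuples (illustrative helper, not from dsets.py)."""
    reader = csv.reader(csv_file)
    header = next(reader)
    col = {name: i for i, name in enumerate(header)}
    candidates = []
    for row in reader:
        # Convert the three coordinate strings into a tuple of floats.
        center_xyz = tuple(float(row[col[k]]) for k in ("coordX", "coordY", "coordZ"))
        candidates.append(CandidateInfoTuple(
            True,                            # every annotated row is a nodule
            True,                            # and has an annotation
            row[col["mal_bool"]] == "True",  # pandas writes booleans as "True"/"False"
            float(row[col["diameter_mm"]]),
            row[col["seriesuid"]],
            center_xyz,
        ))
    return candidates


# Small in-memory stand-in for annotations_with_malignancy.csv (made-up values).
sample = io.StringIO(
    "seriesuid,coordX,coordY,coordZ,diameter_mm,mal_bool\n"
    "uid-a,1.0,2.0,3.0,5.5,True\n"
    "uid-b,-4.0,0.5,9.0,12.25,False\n"
)
rows = read_annotation_rows(sample)
```

Looking columns up by header name rather than by position keeps the sketch independent of the column order in the saved file.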



# Purpose: Data Cleansing
- This script processes lung CT scan annotation data to enrich it with malignancy information derived from the LIDC-IDRI dataset.
- It also converts raw annotation data into a structured format (DataFrame) for improved accessibility, analysis, and modeling.

Key Objectives:
1. **Data Integration**:
  - Load LUNA challenge annotations (`annotations.csv`).
  - Extract scan and malignancy metadata using PyLIDC.

2. **Malignancy Computation**:
  - Determine malignancy for each nodule cluster.
  - Convert pixel coordinates to physical space using SimpleITK for spatial accuracy.

3. **Annotation Matching**:
  - Match PyLIDC-derived malignancy nodules to LUNA annotations based on centroid proximity.
  - Enrich the LUNA annotations with malignancy label (`mal_bool`), details (`mal_details`), and bounding box information.

4. **Data Cleansing**:
  - Drop annotations that could not be matched to PyLIDC data or CT volumes.
  - Save the cleaned and enhanced annotation data as `annotations_with_malignancy.csv`.

## Why This Matters:
Accurate malignancy labeling is crucial for training robust medical imaging models. This preprocessing step ensures the training data includes high-quality, expert-derived labels while filtering out unmatched or incomplete records.

Output:
- `annotations_with_malignancy.csv`: A refined annotation dataset with malignancy labels for downstream deep learning pipelines.


In [3]:
import torch
import SimpleITK as sitk
import pandas
import glob, os
import numpy
import tqdm
import pylidc


We first load the annotations from the LUNA challenge. The `annotations` variable is a DataFrame (table) containing nodule annotations.

In [5]:
annotations = pandas.read_csv('data/part2/luna/annotations.csv')

For the CTs where we have a `.mhd` file, we collect the `malignancy_data` from PyLIDC.

This is a bit tedious, as we need to convert the pixel locations provided by PyLIDC to physical points.
We will see some warnings about annotations being too close to each other (PyLIDC expects at most 4 annotations per site, and we rely on that grouping when deciding whether a nodule is malignant).

This takes quite a while (~1-2 seconds per scan on the author's computer).
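
For intuition, the index-to-physical conversion can be sketched in plain NumPy. In the notebook this is done by SimpleITK's `TransformIndexToPhysicalPoint`. The helper `index_to_physical` below is a hypothetical stand-in that applies the usual ITK relation (origin plus direction matrix times the spacing-scaled index), and the origin and spacing values are made up.

```python
import numpy as np


def index_to_physical(index_xyz, origin_xyz, spacing_xyz, direction=None):
    """Map a voxel index (x, y, z) to physical coordinates in millimetres.

    Hypothetical helper: physical = origin + direction @ (index * spacing).
    `direction` may be a flat 9-tuple (as SimpleITK returns it) or a 3x3 array.
    """
    if direction is None:
        direction = np.eye(3)  # assume axis-aligned scan
    direction = np.asarray(direction, dtype=float).reshape(3, 3)
    scaled = np.asarray(index_xyz, dtype=float) * np.asarray(spacing_xyz, dtype=float)
    return np.asarray(origin_xyz, dtype=float) + direction @ scaled


# Made-up origin and spacing, for illustration only.
point = index_to_physical(
    index_xyz=(10, 20, 5),
    origin_xyz=(-100.0, -150.0, -300.0),
    spacing_xyz=(0.5, 0.5, 2.0),
)
```

The notebook code also reorders the PyLIDC centroid with `[[1, 0, 2]]` before the transform, so that the index is passed to SimpleITK in (x, y, z) order.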

A cluster is a group of annotations (by different radiologists) that are close enough to refer to the same physical nodule.

---

### ✅ Why there are **multiple malignancy scores** per `ann_cluster`

In the **LIDC-IDRI annotations** that underlie LUNA16 (accessed through `pylidc`), each **nodule (tumor candidate)** is often **independently annotated by multiple radiologists**, up to 4 different doctors.

Each radiologist:

* Views the same scan.
* Finds the same nodule.
* Assigns a **malignancy score** from 1 to 5:

  * 1 = highly benign
  * 5 = highly malignant

---

### 🔁 What is an `ann_cluster`?

An `ann_cluster` is a group of annotations that refer to the **same physical nodule**, made by different radiologists.

So, if four radiologists each mark the same nodule, the cluster will contain **4 annotation objects**, each with its own:

* `centroid` (center)
* `bbox_matrix()` (bounding box)
* `malignancy` score

---

### 🎯 Why do we use multiple scores?

1. **Inter-rater variation**: Doctors may disagree.
2. **Reduce noise**: Instead of trusting one opinion, we average or vote.
3. **Robustness**: By requiring "at least two scores ≥ 4", the code ensures that the **nodule is likely malignant**, not due to one outlier opinion.

We consider a cluster of annotations to be malignant (i.e., cancerous) if at least two of the radiologists who marked that nodule gave it a high malignancy score.
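
The voting rule can be written as a small standalone function. This is a sketch of the criterion described above. The function name and its parameters are illustrative, with defaults matching the "at least two scores ≥ 4" rule.

```python
def is_cluster_malignant(scores, threshold=4, min_votes=2):
    """Return True if at least `min_votes` radiologists scored the nodule >= `threshold`.

    Illustrative helper; the notebook computes the same thing inline.
    """
    votes = sum(1 for score in scores if score >= threshold)
    return votes >= min_votes


# Example score lists (made up) for three clusters.
agree = is_cluster_malignant([5, 4, 2, 1])    # two high scores
outlier = is_cluster_malignant([5, 3, 3, 2])  # only one high score
single = is_cluster_malignant([5])            # one annotator only
```

Note that a cluster with a single annotation can never be labeled malignant under this rule, no matter how high its score is.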

---




In [22]:
import numpy as np

# Monkey patch for compatibility with old versions of pylidc using deprecated np.int
if not hasattr(np, 'int'):
    np.int = int

# Each item will contain a tumor's 3D position, bounding box, and malignancy.
malignancy_data = []
# Track any scan files that could not be found.
missing = []
# Store the pixel spacing for each scan — how much physical space each voxel represents
spacing_dict = {}

# Loads all available CT scans using pylidc, mapping each scan’s unique ID to its scan object.
# The keys are each scan’s unique ID. The values are the scan objects themselves.
scans = {s.series_instance_uid: s for s in pylidc.query(pylidc.Scan).all()}

# Access the column named seriesuid, which stores the scan ID each annotation belongs to.
# Then, extract all distinct values (i.e., removes duplicates).
# Finally, store those unique scan IDs in the variable suids.
suids = annotations.seriesuid.unique()

# Loop over each scan ID, with a progress bar from tqdm.
for suid in tqdm.tqdm(suids):
    # fn is a list of filenames (file paths), found using a wildcard pattern.
    fn = glob.glob('F:/Organized_LUNA16_Train_Data/subset*/{}.mhd'.format(suid))
    if len(fn) == 0 or '*' in fn[0]:
        missing.append(suid)
        continue
    fn = fn[0]
    x = sitk.ReadImage(fn) # Load the image
    spacing_dict[suid] = x.GetSpacing() #Get voxel spacing
    s = scans[suid]
    
    # s.cluster_annotations() groups multiple annotations into clusters. Each cluster represents one real-world nodule.
    for ann_cluster in s.cluster_annotations():
        # A cluster (a set of annotations referring to the same nodule) is considered malignant if at least two
        # annotations give it a malignancy score ≥ 4 (scale usually 1–5).
        is_malignant = len([a.malignancy for a in ann_cluster if a.malignancy >= 4])>=2 # Malignancy criterion
        # Take the average along axis 0 (i.e., across rows, column-by-column).
        centroid = numpy.mean([a.centroid for a in ann_cluster], axis=0)
        bbox = numpy.mean([a.bbox_matrix() for a in ann_cluster], 0).T
        coord = x.TransformIndexToPhysicalPoint([int(numpy.round(i)) for i in centroid[[1, 0, 2]]])
        bbox_low = x.TransformIndexToPhysicalPoint([int(numpy.round(i)) for i in bbox[0, [1, 0, 2]]])
        bbox_high = x.TransformIndexToPhysicalPoint([int(numpy.round(i)) for i in bbox[1, [1, 0, 2]]])
        malignancy_data.append((suid, coord[0], coord[1], coord[2], bbox_low[0], bbox_low[1], bbox_low[2], bbox_high[0], bbox_high[1], bbox_high[2], is_malignant, [a.malignancy for a in ann_cluster]))


 11%|█████████▎                                                                       | 69/601 [00:47<06:27,  1.37it/s]

Failed to reduce all groups to <= 4 Annotations.
Some nodules may be close and must be grouped manually.


100%|████████████████████████████████████████████████████████████████████████████████| 601/601 [06:55<00:00,  1.45it/s]


You can check how many `mhd`s you are missing. It seems that the LUNA data has dropped a couple. Don't worry if there are <10 missing.

In [30]:
if missing:
  print("Missing scan UIDs:")
  for i, uid in enumerate(missing, 1):
    print(f"{i:2d}. {uid}")
else:
  print("No missing scan UIDs.")

No missing scan UIDs.


And now we match the malignancy data to the annotations. This is a lot faster...

In [32]:
# Create a structured DataFrame from a list of data records called malignancy_data
df_mal = pandas.DataFrame(malignancy_data, columns=['seriesuid', 'coordX', 'coordY', 'coordZ', 'bboxLowX', 'bboxLowY', 'bboxLowZ', 'bboxHighX', 'bboxHighY', 'bboxHighZ', 'mal_bool', 'mal_details'])

processed_annot = []
annotations['mal_bool'] = float('nan')
annotations['mal_details'] = [[] for _ in annotations.iterrows()]
bbox_keys = ['bboxLowX', 'bboxLowY', 'bboxLowZ', 'bboxHighX', 'bboxHighY', 'bboxHighZ']
for k in bbox_keys:
    annotations[k] = float('nan')
for series_id in tqdm.tqdm(annotations.seriesuid.unique()):
    # series_id = '1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860'
    # c = candidates[candidates.seriesuid == series_id]
    a = annotations[annotations.seriesuid == series_id]
    m = df_mal[df_mal.seriesuid == series_id]
    if len(m) > 0:
        m_ctrs = m[['coordX', 'coordY', 'coordZ']].values
        a_ctrs = a[['coordX', 'coordY', 'coordZ']].values
        #print(m_ctrs.shape, a_ctrs.shape)
        matches = (numpy.linalg.norm(a_ctrs[:, None] - m_ctrs[None], ord=2, axis=-1) / a.diameter_mm.values[:, None] < 0.5)
        has_match = matches.max(-1)
        match_idx = matches.argmax(-1)[has_match]
        a_matched = a[has_match].copy()
        # c_matched['diameter_mm'] = a.diameter_mm.values[match_idx]
        a_matched['mal_bool'] = m.mal_bool.values[match_idx]
        a_matched['mal_details'] = m.mal_details.values[match_idx]
        for k in bbox_keys:
            a_matched[k] = m[k].values[match_idx]
        processed_annot.append(a_matched)
        processed_annot.append(a[~has_match])
    else:
        # No PyLIDC clusters for this scan: keep the annotations unmatched (mal_bool stays NaN).
        processed_annot.append(a)
processed_annot = pandas.concat(processed_annot)
processed_annot.sort_values('mal_bool', ascending=False, inplace=True)
processed_annot['len_mal_details'] = processed_annot.mal_details.apply(len)

100%|███████████████████████████████████████████████████████████████████████████████| 601/601 [00:00<00:00, 638.24it/s]
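
The matching rule in the cell above can be illustrated on toy data. The coordinates and diameters below are made up. The logic mirrors the notebook: a LUNA annotation matches a PyLIDC cluster when the distance between their centers is less than half the annotated diameter.

```python
import numpy as np

# Two LUNA annotation centers with their diameters (made-up values, in mm).
a_ctrs = np.array([[0.0, 0.0, 0.0], [50.0, 50.0, 50.0]])
diameters = np.array([10.0, 6.0])

# Three PyLIDC cluster centers (made-up values, in mm).
m_ctrs = np.array([[40.0, 40.0, 40.0], [1.0, 2.0, 2.0], [50.0, 50.0, 54.0]])

# Broadcasting (2, 1, 3) against (1, 3, 3) gives all pairwise differences;
# the norm over the last axis yields a (2, 3) distance matrix.
dists = np.linalg.norm(a_ctrs[:, None] - m_ctrs[None], axis=-1)

# A pair matches when the distance is below half the annotation diameter.
matches = dists / diameters[:, None] < 0.5

has_match = matches.max(-1)      # does each annotation match any cluster?
match_idx = matches.argmax(-1)   # index of the first matching cluster (0 if none)
matched_clusters = match_idx[has_match]
```

Because `argmax` on an all-`False` row returns 0, `match_idx` is only meaningful where `has_match` is `True`, which is why the notebook indexes it with `has_match`. Note also that `argmax` picks the first cluster within the threshold, not necessarily the nearest one.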


Finally, we drop NAs (where we didn't find a match) and save it in the right place.

In [33]:
df_nona = processed_annot.dropna()
df_nona.to_csv('./data/part2/luna/annotations_with_malignancy.csv', index=False)
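
As a quick illustration of what `dropna()` does here, consider a toy table (made-up values) in which one annotation found no PyLIDC match and therefore still carries `NaN` in `mal_bool`:

```python
import pandas

# Toy stand-in for processed_annot; "uid-b" was never matched.
toy = pandas.DataFrame({
    "seriesuid": ["uid-a", "uid-b", "uid-c"],
    "diameter_mm": [5.5, 12.25, 7.0],
    "mal_bool": [True, float("nan"), False],
})

# dropna() with default arguments removes every row containing at least one NaN.
cleaned = toy.dropna()
kept_uids = cleaned.seriesuid.tolist()
```

Only the rows with a malignancy label survive, which is the behavior the final cell relies on.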