Each label column contains one of four values: 1.0, -1.0, 0.0, or missing. These labels have the following interpretation:

- 1.0 - The label was positively mentioned in the associated study, and is present in one or more of the corresponding images
e.g. "A large pleural effusion"   
- 0.0 - The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images
e.g. "No pneumothorax."   
- -1.0 - The label was either: 
  (1) mentioned with uncertainty in the report, and therefore may or may not be present to some degree in the corresponding image, or (2) mentioned with ambiguous language in the report and it is unclear if the pathology exists or not    
  Explicit uncertainty: "The cardiac size cannot be evaluated."  
  Ambiguous language: "The cardiac contours are stable."  
- Missing (empty element) - No mention of the label was made in the report
---
In this project I use:   
`1` to represent `positive` (same as the original category indicator);   
`0` to represent `negative` (same as the original category indicator);  
`0` to represent `NaN`: a `Missing` value in the original table is reasonably assumed to indicate the absence of the disease, so `NaN` is replaced by `0`;  
`uncertainty` is more complicated to preprocess, and there are multiple strategies:  
- binary: treat `uncertainty` as a non-positive case, so it is represented with 0
- binary_2: follow the strategies used in the papers [[1]](https://arxiv.org/pdf/1901.07031.pdf) and [[2]](https://arxiv.org/pdf/2211.14929)
  - `Atelectasis` and `Edema`: U-ones
  - `Cardiomegaly`: multi-class
  - *`rest`*: U-zeros
  - `ignore`: U-ignore, ignore the uncertain cases and train with masked binary cross entropy
  - > Strategy_1 : `Atelectasis`, `Edema`: U-ones; and the `rest`: U-zeros.
  - > Strategy_2 : U-ignore, ignore the uncertain cases
  
- multiple-classes: `uncertainty` is viewed as an independent indicator and is represented with -1 
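
The strategies listed above can be sketched without any dependencies as follows. This is only an illustration: the names `map_label`, `masked_bce` and `U_ONES_LABELS` are hypothetical and do not appear in the notebook, and `strategy_1` follows the rule stated in the Strategy_1 line (U-ones for `Atelectasis` and `Edema`, U-zeros for everything else).

```python
import math

# Assumption: labels that use U-ones under Strategy_1 (taken from the list above).
U_ONES_LABELS = {"Atelectasis", "Edema"}

def map_label(name, value, strategy="strategy_1"):
    """Map one raw value (1.0, 0.0, -1.0, or NaN/None) to a training target.

    Returns None when the entry should be masked out of the loss (U-ignore).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0  # missing is treated as negative, as in this project
    if value != -1:
        return int(value)  # 1 stays 1, 0 stays 0
    if strategy == "binary":
        return 0  # U-zeros for every label
    if strategy == "strategy_1":
        return 1 if name in U_ONES_LABELS else 0
    if strategy == "strategy_2":
        return None  # U-ignore
    if strategy == "multi_class":
        return -1  # keep uncertainty as its own class
    raise ValueError(f"unknown strategy: {strategy}")

def masked_bce(probs, targets):
    """Mean binary cross entropy over the entries whose target is not None."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probs, targets)
        if t is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical row with two uncertain labels, one missing label, one positive label.
row = {"Atelectasis": -1.0, "Cardiomegaly": -1.0, "Edema": float("nan"), "Pneumonia": 1.0}
strategy_1 = {k: map_label(k, v, "strategy_1") for k, v in row.items()}
strategy_2 = {k: map_label(k, v, "strategy_2") for k, v in row.items()}
print(strategy_1)
print(strategy_2)
```

With U-ignore, the `None` targets contribute nothing to the loss, so only labels that were explicitly mentioned (or missing and mapped to 0) drive the gradient.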
 

##### import packages

In [285]:
import copy
import os
from collections import defaultdict
from dataclasses import fields  # used by _asdict_inner below
from typing import Any, Dict, Optional, Tuple, Union

import clip
import numpy as np
import open_clip
import pandas as pd
import torch
import torch.nn as nn
import torchvision.transforms.functional as F  # used by ResizeMaxSize below
from PIL import Image, ImageFile
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from torchvision.transforms import InterpolationMode

# allow PIL to load images whose files are truncated
ImageFile.LOAD_TRUNCATED_IMAGES = True

##### load original dataset

In [237]:
split_data = pd.read_csv("/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/mimic-cxr-2.0.0-split.csv")
original_label_data = pd.read_csv("/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/mimic-cxr-2.0.0-chexpert.csv")
original_meta_data = pd.read_csv("/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/mimic-cxr-2.0.0-metadata.csv")

In [205]:

original_label_data[original_label_data['study_id']==58235663]

split_data[split_data['study_id']==58235663]

Unnamed: 0,dicom_id,study_id,subject_id,split,original_14_labels
58215,1a671a62-0a32dfc6-5f85029c-81c3922e-3f5a2c27,58235663,11573679,train,


##### extract 14 labels

In [165]:
def extract_14_label_4_each_record(each_row):
    # Collect the 14 label columns of one row; missing values (NaN) become 0.
    label_dic = {}
    for column_name, column_data in each_row.items():
        if column_name in ["subject_id", "study_id", "original_14_labels", "strategy1_14_labels"]:
            continue
        label_dic[column_name] = 0 if pd.isnull(column_data) else column_data
    return label_dic
  
original_label_data["original_14_labels"] = original_label_data.apply(extract_14_label_4_each_record, axis=1)

def extract_14_label_4_each_record_with_strategy_1(each_row):
    # Strategy_1: uncertain (-1) becomes 1 for Atelectasis/Edema (U-ones) and 0 for the rest (U-zeros).
    label_dic = {}
    for column_name, column_data in each_row.items():
        if column_name in ["subject_id", "study_id", "original_14_labels", "strategy1_14_labels"]:
            continue
        if column_data == -1:
            label_dic[column_name] = 1 if column_name in ["Atelectasis", "Edema"] else 0
        else:
            label_dic[column_name] = 0 if pd.isnull(column_data) else column_data
    return label_dic
  
original_label_data["strategy1_14_labels"] = original_label_data.apply(extract_14_label_4_each_record_with_strategy_1, axis=1)
  
      


In [181]:
col_index = ['subject_id', 'study_id', 'original_14_labels',  'strategy1_14_labels']
process_data = original_label_data[col_index].copy()  # copy so the assignments below do not write into a view
process_data

def get_original_14_labels_vector(row):
  # keep only the label values, in column order, as a tuple
  return tuple(row['original_14_labels'].values())

def get_strategy1_14_labels_vector(row):
  return tuple(row['strategy1_14_labels'].values())
  
process_data.loc[:,'original_14_labels'] = process_data.apply(get_original_14_labels_vector, axis=1)
process_data.loc[:,'strategy1_14_labels'] = process_data.apply(get_strategy1_14_labels_vector, axis=1)
process_data.to_csv('/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/process_data.csv', index=False)  # index=False: do not save the row index
process_data.head(1)
  

Unnamed: 0,subject_id,study_id,original_14_labels,strategy1_14_labels
0,10000032,50414267,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)"


##### add split data indicator

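The split table has more rows than the original label table, but the two datasets contain the same number of study IDs.  
Also, no study ID in the split table is used for more than one purpose (train, test, validate).  
The extra rows in the split table come from study IDs being repeated there (one study, one label, multiple views).    
During training, the images of all views of a study share one label. The label of each image is looked up in `process_data`.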
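
As a dependency-free sketch of the two properties described here (one split per study, and every view inheriting its study's label), the checks could look like the following. The tiny tables and the names `split_rows`, `labels_by_study`, `leaking` and `image_labels` are made up for illustration and are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical miniature of the split table: one row per image (dicom_id).
split_rows = [
    {"dicom_id": "img-a", "study_id": 1, "split": "train"},
    {"dicom_id": "img-b", "study_id": 1, "split": "train"},
    {"dicom_id": "img-c", "study_id": 2, "split": "validate"},
]
# Hypothetical miniature of the per-study label table: one entry per study.
labels_by_study = {1: (0, 1), 2: (1, 0)}

# 1. Each study should belong to exactly one split.
splits_per_study = defaultdict(set)
for r in split_rows:
    splits_per_study[r["study_id"]].add(r["split"])
leaking = [sid for sid, splits in splits_per_study.items() if len(splits) > 1]

# 2. Every image inherits the label of its study (None if the study has no label row).
image_labels = {r["dicom_id"]: labels_by_study.get(r["study_id"]) for r in split_rows}

print("studies in more than one split:", leaking)
print(image_labels)
```

On the real tables, the first check could likely be written with pandas as `split_data.groupby("study_id")["split"].nunique().max() == 1`.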


---
Build `program_data_set` to hold the data used by the final project.

In [211]:
program_data_set = split_data.copy()
program_data_set.loc[:, "original_14_labels"] = None
program_data_set.loc[:, "strategy1_14_labels"] = None
program_data_set.loc[:, "ViewPosition"] = None
program_data_set.head(1)

Unnamed: 0,dicom_id,study_id,subject_id,split,original_14_labels,strategy1_14_labels,ViewPosition
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,50414267,10000032,train,,,


In [213]:
dictionary = process_data.set_index('study_id').to_dict(orient='index')

##### add labels

In [224]:
# add label
except_sid_original = []
except_sid_original_strategy1 = []

def search_label_in_process_and_fill_split_data(row):
    study_id = row.study_id

    if study_id not in dictionary:
      except_sid_original.append(study_id)
      return
      
    original_14_labels = dictionary[study_id]["original_14_labels"]
    return original_14_labels
  
def search_Strategy1_label_in_process_and_fill_split_data(row):
    study_id = row.study_id
    if study_id not in dictionary:
      except_sid_original_strategy1.append(study_id)
      return
    strategy1_14_labels = dictionary[study_id]["strategy1_14_labels"]
    return strategy1_14_labels
  
program_data_set["original_14_labels"] = program_data_set.apply(search_label_in_process_and_fill_split_data, axis=1)
program_data_set["strategy1_14_labels"] = program_data_set.apply(search_Strategy1_label_in_process_and_fill_split_data, axis=1)

In [231]:
condition = program_data_set['study_id'].isin(except_sid_original)  # rows whose study_id has no entry in the label table
program_data_set = program_data_set[~condition]

In [235]:
program_data_set.head(3)

Unnamed: 0,dicom_id,study_id,subject_id,split,original_14_labels,strategy1_14_labels,ViewPosition
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,50414267,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,50414267,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,53189527,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",


In [236]:
program_data_set.to_csv("/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/program_data_set_3_8.csv", index=False)  # index=False: do not save the row index

##### add view position

In [247]:
# add view position
meta_dict = original_meta_data.set_index('dicom_id').to_dict(orient = "index")

In [251]:
all_view = []
for index, row in program_data_set.iterrows():
  dicom_id = row.dicom_id
  view = meta_dict[dicom_id]['ViewPosition']
  all_view.append(view)

In [255]:
program_data_set.loc[:, "ViewPosition"] = all_view
program_data_set.to_csv("/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/program_data_set_3_8.csv", index=False)  # index=False: do not save the row index

In [256]:
program_data_set.head()

Unnamed: 0,dicom_id,study_id,subject_id,split,original_14_labels,strategy1_14_labels,ViewPosition
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,50414267,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",PA
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,50414267,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",LATERAL
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,53189527,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",PA
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,53189527,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",LATERAL
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,53911762,10000032,train,"(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)","(0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0)",AP


##### add image_tensor_path

In [None]:
basic = "/public_bme/data/lds/"

def get_image_file_path(row):
    p = "p" + str(row.subject_id)[:2]
    pp = 'p' + str(row.subject_id)
    s = "s" + str(row.study_id)
    img = row.dicom_id + ".jpg"
    file_path = f"{basic}/{p}/{pp}/{s}/{img}"
    return file_path

# build the file path for every row
file_paths = [get_image_file_path(row) for _, row in program_data_set.iterrows()]

# check that every file path exists
for file_path in file_paths:
    assert os.path.exists(file_path)

print("pass")
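
The directory layout used by `get_image_file_path` (`p<first two digits of subject_id>/p<subject_id>/s<study_id>/<dicom_id>.jpg`) can be sketched in isolation as below. The helper name `build_image_path` and the root `/hypothetical_root` are placeholders, not the notebook's values. The sketch also collects missing files into a list instead of stopping at the first failed `assert`, which may be more convenient when many files are absent.

```python
import os

def build_image_path(root, subject_id, study_id, dicom_id):
    """Return <root>/p<first two digits>/p<subject_id>/s<study_id>/<dicom_id>.jpg."""
    subject = str(subject_id)
    return os.path.join(
        root,
        "p" + subject[:2],
        "p" + subject,
        "s" + str(study_id),
        dicom_id + ".jpg",
    )

# Identifiers taken from the first row shown in the tables of this notebook.
path = build_image_path(
    "/hypothetical_root",
    10000032,
    50414267,
    "02aa804e-bde0afdd-112c0b34-7bc16630-4e384014",
)

# Derive the tensor path by swapping the extension, as the notebook does with str.replace.
tensor_path = os.path.splitext(path)[0] + "_Clip.pth"

# Collect every missing file rather than asserting on the first one.
missing = [p for p in [path] if not os.path.exists(p)]

print(path)
print(tensor_path)
print(len(missing), "file(s) missing")
```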

In [283]:
BiomedClip_img_tensor_paths = [path.replace(".jpg", "_BioMedClip.pth") for path in file_paths]
Clip_img_tensor_path = [path.replace(".jpg", "_Clip.pth") for path in file_paths]

In [288]:
program_data_set.loc[:,"image_file_path"] = file_paths
program_data_set.loc[:,"BiomedClip_img_tensor_path"] = BiomedClip_img_tensor_paths
program_data_set.loc[:,"Clip_img_tensor_path"] = Clip_img_tensor_path

program_data_set.to_csv("/home_data/home/v-liudsh/coding/constrastive_P/diagnosisP/exchange/Fine-Grained_Features_Alignment_via_Constrastive_Learning/data/project_using_data/program_data_set_3_8.csv", index=False)  # index=False: do not save the row index

In [287]:
# image preprocess logics -- BiomedClip & CLIP
try:
    BICUBIC = InterpolationMode.BICUBIC
except ImportError:
    BICUBIC = Image.BICUBIC

def _convert_image_to_rgb(image):
    return image.convert("RGB")

def _transform(n_px):
    return Compose([
        Resize(n_px, interpolation=BICUBIC),
        CenterCrop(n_px),
        _convert_image_to_rgb,
        ToTensor(),
        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ])

def CLIP_Process(image_path, dest):
    img = Image.open(image_path)
    a = 224
    b = _transform(a)
    c = b(img)
    if ((dest.split(".")[-1]) != "pth"):
      dest+=".pth"
      
    torch.save(c, dest)
    return c

OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)

_FIELDS = '__dataclass_fields__'
def _is_dataclass_instance(obj):
    """Returns True if obj is an instance of a dataclass."""
    return hasattr(type(obj), _FIELDS)

def asdict(obj, *, dict_factory=dict):
    """Return the fields of a dataclass instance as a new dictionary mapping
    field names to field values.

    Example usage:

      @dataclass
      class C:
          x: int
          y: int

      c = C(1, 2)
      assert asdict(c) == {'x': 1, 'y': 2}

    If given, 'dict_factory' will be used instead of built-in dict.
    The function applies recursively to field values that are
    dataclass instances. This will also look into built-in containers:
    tuples, lists, and dicts.
    """
    if not _is_dataclass_instance(obj):
        raise TypeError("asdict() should be called on dataclass instances")
    return _asdict_inner(obj, dict_factory)

def _asdict_inner(obj, dict_factory):
    if _is_dataclass_instance(obj):
        result = []
        for f in fields(obj):
            value = _asdict_inner(getattr(obj, f.name), dict_factory)
            result.append((f.name, value))
        return dict_factory(result)
    elif isinstance(obj, tuple) and hasattr(obj, '_fields'):
        return type(obj)(*[_asdict_inner(v, dict_factory) for v in obj])
    elif isinstance(obj, (list, tuple)):
        # Assume we can create an object of this type by passing in a
        # generator (which is not true for namedtuples, handled
        # above).
        return type(obj)(_asdict_inner(v, dict_factory) for v in obj)
    elif isinstance(obj, dict):
        return type(obj)((_asdict_inner(k, dict_factory),
                          _asdict_inner(v, dict_factory))
                         for k, v in obj.items())
    else:
        return copy.deepcopy(obj)

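from dataclasses import dataclass

# dataclass so that AugmentationCfg(**aug_cfg) in image_transform accepts keyword arguments
@dataclass
class AugmentationCfg: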
    scale: Tuple[float, float] = (0.9, 1.0)
    ratio: Optional[Tuple[float, float]] = None
    color_jitter: Optional[Union[float, Tuple[float, float, float]]] = None
    interpolation: Optional[str] = None
    re_prob: Optional[float] = None
    re_count: Optional[int] = None
    use_timm: bool = False

class ResizeMaxSize(nn.Module):
    def __init__(self, max_size, interpolation=InterpolationMode.BICUBIC, fn='max', fill=0):
        super().__init__()
        if not isinstance(max_size, int):
            raise TypeError(f"Size should be int. Got {type(max_size)}")
        self.max_size = max_size
        self.interpolation = interpolation
        self.fn = min if fn == 'min' else max
        self.fill = fill

    def forward(self, img):
        if isinstance(img, torch.Tensor):
            height, width = img.shape[-2:]  # image tensors are laid out as (..., H, W)
        else:
            width, height = img.size
        scale = self.max_size / float(max(height, width))
        new_size = tuple(round(dim * scale) for dim in (height, width))
        if scale != 1.0:
            img = F.resize(img, new_size, self.interpolation)
        if not width == height:
            pad_h = self.max_size - new_size[0]
            pad_w = self.max_size - new_size[1]
            img = F.pad(img, padding=[pad_w//2, pad_h//2, pad_w - pad_w//2, pad_h - pad_h//2], fill=self.fill)
        return img

def image_transform(
        image_size: int,
        is_train:bool = False,
        mean: Optional[Tuple[float, ...]] = None,
        std: Optional[Tuple[float, ...]] = None,
        resize_longest_max: bool = False,
        fill_color: int = 0,
        aug_cfg: Optional[Union[Dict[str, Any], AugmentationCfg]] = None,
):
    mean = mean or OPENAI_DATASET_MEAN
    if not isinstance(mean, (list, tuple)):
        mean = (mean,) * 3

    std = std or OPENAI_DATASET_STD
    if not isinstance(std, (list, tuple)):
        std = (std,) * 3

    if isinstance(image_size, (list, tuple)) and image_size[0] == image_size[1]:
        # for square size, pass size as int so that Resize() uses aspect preserving shortest edge
        image_size = image_size[0]

    if isinstance(aug_cfg, dict):
        aug_cfg = AugmentationCfg(**aug_cfg)
    else:
        aug_cfg = aug_cfg or AugmentationCfg()
    normalize = Normalize(mean=mean, std=std)
    if is_train:
        raise NotImplementedError("!!LDS!!")
    else:
        if resize_longest_max:
            transforms = [
                ResizeMaxSize(image_size, fill=fill_color)
            ]
        else:
            transforms = [
                Resize(image_size, interpolation=InterpolationMode.BICUBIC),
                CenterCrop(image_size),
            ]
        transforms.extend([
            _convert_image_to_rgb,
            ToTensor(),
            normalize,
        ])
        return Compose(transforms)

def BiomedCLIP_processor(image_path, dest):
    img = Image.open(image_path)
    preprocess_val = image_transform(224)
    data = preprocess_val(img)
    if ((dest.split(".")[-1]) != "pth"):
      dest+=".pth"
      
    torch.save(data, dest)
    return data
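
The geometry computed in `ResizeMaxSize.forward` (scale the longest side to `max_size`, then pad the shorter side symmetrically) can be checked on plain numbers. The function name `resize_and_pad_geometry` is hypothetical. Note that `BiomedCLIP_processor` calls `image_transform(224)` with the default `resize_longest_max=False`, so this branch is not exercised by the cells above.

```python
def resize_and_pad_geometry(width, height, max_size):
    """Return (new_height, new_width, padding) for a longest-side resize followed by
    padding to a max_size x max_size square; padding is [left, top, right, bottom]."""
    scale = max_size / float(max(height, width))
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = max_size - new_h, max_size - new_w
    # Split the padding as evenly as possible between the two sides.
    padding = [pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2]
    return new_h, new_w, padding

# A portrait image, twice as tall as it is wide.
new_h, new_w, padding = resize_and_pad_geometry(width=400, height=800, max_size=224)
print(new_h, new_w, padding)
```

For this portrait input the height already fills the square, so all of the padding goes to the left and right edges.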


In [None]:
# generate .pth (CLIP and BiomedCLIP)
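data = program_data_set  # assumption: `data` is the table built above; it is not defined in the cells shown
img_paths = data.image_file_path
BiomedClip_tensor_paths = data.BiomedClip_img_tensor_path
total = len(BiomedClip_tensor_paths)
print(total)
print(len(BiomedClip_tensor_paths), len(img_paths))
dev = max(1, total // 10)  # progress step; at least 1 so the modulus below never divides by zero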
count = 0

for (img_path, tensor_path) in (zip(img_paths, BiomedClip_tensor_paths)):
  try:
    # print(type(img_path), img_path, type(tensor_path), tensor_path)
    BiomedCLIP_processor(img_path, tensor_path)
    if count%dev == 0:
      print(count/dev)
      print(img_path, tensor_path)
    count+=1
  except Exception as e:
    print(e)


Clip_img_tensor_paths = data.Clip_img_tensor_path
total = len(Clip_img_tensor_paths)
print(Clip_img_tensor_paths)
print(len(Clip_img_tensor_paths), len(img_paths))
dev = total // 10
count = 0


for  (img_path, tensor_path) in (zip(img_paths, Clip_img_tensor_paths)):
  try:
    # print(type(img_path), img_path, type(tensor_path), tensor_path)
    CLIP_Process(img_path, tensor_path)
    if count%dev == 0:
      print(count/dev)
      print(img_path, tensor_path)
    count+=1
  except Exception as e:
    print(e)
  