# ChestX-ray14 dataset
In this lab we use the [ChestX-ray14 dataset](https://www.kaggle.com/datasets/nih-chest-xrays/data), which contains chest X-ray images. It extends the [ChestX-ray8 dataset](https://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf) by increasing the number of disease categories from 8 to 14 and adding some more images.

- **Number of Images**: 112,120 frontal-view X-ray images obtained from 30,805 patients.
- **Number of Classes (Diseases)**: 14 disease classes (Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening, and Hernia).
- **Size**: Approximately 42GB.
- **Release Date**: Released in late 2017.

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
data_dir = "/datasets/ChestX-ray14"
image_dir = os.path.join(data_dir, "images")

In [None]:
metadata = pd.read_csv(os.path.join(data_dir, "Data_Entry_2017.csv"))

Each image carries zero (`No Finding`), one, or several of the 14 labels (pathological conditions); multiple labels are separated by `|`.

In [None]:
metadata.head()

In [None]:
metadata.info()

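`Patient ID` identifies the patient from whom each X-ray image was obtained.

An important point when working with medical data is that the same patient may be imaged several times, so a single patient can contribute multiple different images.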

In [None]:
print(f"Total images: {len(metadata)}; unique patient IDs: {metadata['Patient ID'].nunique()}")

<mark>Exercise</mark> Complete the function `encode_labels_one_hot`, which takes the metadata DataFrame and converts its labels into a one-hot encoded DataFrame as shown below.
- Use [pandas.Series.str.get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html)

Example

- Input DataFrame

| Image Index        | Finding Labels                | Follow-up # | Patient ID | ...|
|--------------------|-------------------------------|-------------|------------|---|
| 00000001_000.png    | Cardiomegaly                  | 0           | 1          | ... |
| 00000001_001.png    | Cardiomegaly\|Emphysema       | 1           | 1          | ... |
| 00000001_002.png    | Cardiomegaly\|Effusion        | 2           | 1          | ... |
| 00000002_000.png    | No Finding                    | 0           | 2          | ... |
| 00000003_000.png    | Hernia                        | 0           | 3          | ...   |

- Output DataFrame: each row describes the same record as the row at the same position in the input DataFrame

| Cardiomegaly | Emphysema | Effusion | No Finding | Hernia | ... |
|--------------|-----------|----------|------------|--------|-----|
| 1            | 0         | 0        | 0          | 0      | ... |
| 1            | 1         | 0        | 0          | 0      | ... |
| 1            | 0         | 1        | 0          | 0      | ... |
| 0            | 0         | 0        | 1          | 0      | ... |
| 0            | 0         | 0        | 0          | 1      | ... |
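As a small illustration of the API named above, the following sketch applies `Series.str.get_dummies` to a toy column. The three rows are made up to mirror the example tables and are not read from the dataset. Note that the resulting columns come out in alphabetical order, so their order can differ from the illustrative table.

```python
import pandas as pd

# Toy 'Finding Labels' column (made-up rows mirroring the example tables)
labels = pd.Series([
    "Cardiomegaly",
    "Cardiomegaly|Emphysema",
    "No Finding",
])

# One indicator column per distinct label; a row gets 1 for each label it contains
one_hot = labels.str.get_dummies(sep="|")

print(one_hot.columns.tolist())
print(one_hot.values.tolist())
```

The row order is preserved, which is why the encoded frame can later be concatenated column-wise with the original metadata.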



In [None]:
def encode_labels_one_hot(df, label_column_name):
    """
    Takes a DataFrame whose 'label_column_name' column holds medical-diagnosis labels separated by '|'
    and converts it to one-hot encoded form. Each column is one diagnosis label with values 0 or 1.

    Args:
    - df: Pandas DataFrame containing the data.
    - label_column_name: The column name in the DataFrame containing the labels (separated by '|') to be one-hot encoded.

    Returns:
    - pandas.DataFrame: one-hot encoded DataFrame with one column per label and values 0 or 1
    
    """

    ##### YOUR CODE START #####  

    ##### YOUR CODE END #####

    return one_hot_encoded_df


In [None]:
label_onehot = encode_labels_one_hot(metadata, 'Finding Labels')
label_onehot.head()

In [None]:
class_labels = sorted(label_onehot.columns.tolist())
class_labels

In [None]:
metadata_with_onehot = pd.concat([metadata, label_onehot], axis=1)
metadata_with_onehot.head()

Let us count the number of images for each label.

In [None]:
def check_label_imbalance(df, class_labels):
    def calculate_uniformity(counts):
        mean = np.mean(counts)
        std = np.std(counts)
        cv = std / mean if mean != 0 else 0
        return {
            'mean': mean,
            'std': std,
            'cv': cv,
        }
    #for label in class_labels:
    #    print(f"The class '{label}' has {df[label].sum()} ({df[label].sum() / df.shape[0] * 100:.2f}%) images")

    sns.barplot(df[class_labels].astype(bool).sum(), color='b', orient = "y")
    plt.title('Distribution of Classes', fontsize=15)
    plt.xlabel('Number of Images', fontsize=15)
    plt.ylabel('Diseases', fontsize=15)
    plt.show()

    counts = [df[label].astype(bool).sum() for label in class_labels]
    print(f"Coefficient of Variation of class counts: {calculate_uniformity(counts)['cv']}")



In [None]:
check_label_imbalance(df = metadata_with_onehot, class_labels = class_labels)

The result above shows that the class imbalance is severe.

- Hernia is the most imbalanced class, accounting for only about 0.2% of all cases.
- Infiltration is the most frequent finding, yet only 17.5% of all images are labeled positive (1) for it.

Ideally we would use a balanced dataset so that positive (1) and negative (0) cases contribute equally to the `loss`.

With class imbalance this severe, however, training with a standard loss function (e.g., cross-entropy loss) makes the deep learning model prioritize the label that dominates the data (the negative cases).

There are two main approaches to this problem:
- Make the classes as balanced as possible while preparing the dataset
- Use a weighted loss.

## Split dataset into train/val/test
First, let us look at `split_data_by_image`, a conventional split function that divides the dataset at the image level.

In [None]:
from sklearn.model_selection import train_test_split

def split_data_by_image(df, val_size=0.2, test_size=0.2, random_state=42):
    """
    Splits the dataset into training, validation, and test sets based on images.
    
    Args:
    - df: DataFrame containing image data and labels.
    - test_size: Proportion of the full dataset to be reserved for the test set.
    - val_size: Proportion of the full dataset to be reserved for the validation set (rescaled internally to val_size / (1 - test_size) for the second split).
    - random_state: Seed for reproducibility.
    
    Returns:
    - train_df: DataFrame containing the training set.
    - val_df: DataFrame containing the validation set.
    - test_df: DataFrame containing the test set.
    """
    
    train_val_df, test_df = train_test_split(df, test_size=test_size, random_state = random_state)
    
    train_df, val_df = train_test_split(train_val_df, test_size=val_size / (1 - test_size), random_state=random_state)
    
    print(f"Training set: {train_df.shape[0]} images")
    print(f"Validation set: {val_df.shape[0]} images")
    print(f"Test set: {test_df.shape[0]} images")
    
    return train_df, val_df, test_df

In [None]:
train_df, val_df, test_df = split_data_by_image(metadata_with_onehot, 0.2, 0.2, 42)

However, with this kind of split, different images from the same patient (same Patient ID) can end up spread across the train/val/test sets. Because images from the same patient are correlated, this causes data leakage.

It is therefore important to verify that no patient has images in more than one set. Otherwise, data leakage will make the measured model performance over-optimistic.

<mark>Exercise</mark> Write the function `check_data_leakage`, which takes train_df and test_df and checks whether any records share the same 'Patient ID'.
- Use `set.intersection()` to obtain `overlap`, the set of Patient IDs that appear in both.
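To illustrate only the set operation mentioned above, here is a minimal sketch on plain Python lists. The ID values are hypothetical and the variable names `train_ids` and `test_ids` are chosen for illustration; adapting this to DataFrame columns is left to the exercise.

```python
# Hypothetical patient IDs for two splits (toy values, not from the dataset)
train_ids = [1, 1, 2, 3, 5]
test_ids = [3, 4, 4, 5]

# Converting to sets drops repeated IDs; intersection keeps IDs present in both
overlap = set(train_ids).intersection(set(test_ids))

print(sorted(overlap))
print(bool(overlap))  # an empty set would be falsy, a non-empty one is truthy
```

Because a non-empty set is truthy, the result can be used directly in a condition such as `if overlap:`.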

In [None]:
def check_data_leakage(train_df, test_df):
    """
    Checks if there is any patient overlap between the training and test sets.
    
    Args:
    - train_df: DataFrame containing the training set.
    - test_df: DataFrame containing the test set.
    
    Returns:
    - Boolean indicating if data leakage was found.
    """

    ##### YOUR CODE START #####  

    ##### YOUR CODE END #####

    if overlap:
        print(f"Data leakage detected! {len(overlap)} patients are in both train and test sets.")
        return True
    else:
        print("No data leakage detected.")
        return False

In [None]:
check_data_leakage(train_df, test_df)

<mark>Exercise</mark> To fix the data leakage problem, complete the function `split_data_by_patient`, which splits the dataset at the patient level.
- Goal: split the dataset so that each patient appears in exactly one of the train, validation, or test sets.
- Steps
    1. Group the data by Patient ID: group all images from the same patient to build a new DataFrame in which each row represents one patient.
        - Use pandas [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), [agg](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
    2. Split the patients into train, validation, and test sets.
       - Use the `train_test_split` function ([docs](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html))
       - Split the whole dataset into (train + val) and test sets (pass `test_size` and `random_state` as arguments).
       - Split the (train + val) set into train and val sets (`test_size=val_size / (1 - test_size)`).
    3. In the original DataFrame (one row per image), select the records whose Patient ID belongs to each of the train/val/test sets, and return the three resulting DataFrames.
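The idea behind these steps can be sketched without pandas or scikit-learn. The version below is a stand-in that uses only the standard library (`random` instead of `train_test_split`, tuples instead of a DataFrame). The helper `split_ids` and the toy records are hypothetical and only show the principle: split the unique patient IDs first, then map the images back.

```python
import random

def split_ids(ids, val_size=0.2, test_size=0.2, seed=42):
    """Shuffle the unique IDs and cut them into train/val/test groups."""
    unique_ids = sorted(set(ids))
    random.Random(seed).shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_size)
    n_val = round(len(unique_ids) * val_size)
    test = set(unique_ids[:n_test])
    val = set(unique_ids[n_test:n_test + n_val])
    train = set(unique_ids[n_test + n_val:])
    return train, val, test

# Toy records: (image name, patient ID); some patients have several images
records = [("a.png", 1), ("b.png", 1), ("c.png", 1), ("d.png", 2), ("e.png", 3),
           ("f.png", 4), ("g.png", 5), ("h.png", 5), ("i.png", 6), ("j.png", 7),
           ("k.png", 8), ("l.png", 9), ("m.png", 10)]

train_ids, val_ids, test_ids = split_ids([pid for _, pid in records])

# Map every image back to the split that owns its patient
train = [r for r in records if r[1] in train_ids]
val = [r for r in records if r[1] in val_ids]
test = [r for r in records if r[1] in test_ids]

print(len(train), len(val), len(test))
```

Since the split is made on patients, the image counts per set only approximate the requested proportions. With both sizes set to 0.2, the second `train_test_split` call in the exercise uses 0.2 / 0.8 = 0.25 of the remaining 80%, which is 20% of all patients.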

In [None]:
from sklearn.model_selection import train_test_split

def split_data_by_patient(df, class_labels, val_size=0.2, test_size=0.2, random_state = 42):
    """
    Splits the dataset into train, validation, and test sets while preventing patient overlap 

    Args:
    - df: DataFrame containing image data, patient IDs, and one-hot encoded class labels.
    - class_labels: List of class labels to consider for multi-label classification.
    - val_size: Proportion of the dataset for the validation set (default 0.2).
    - test_size: Proportion of the dataset for the test set (default 0.2).

    Returns:
    - train_df: DataFrame containing training set images and labels.
    - val_df: DataFrame containing validation set images and labels.
    - test_df: DataFrame containing test set images and labels.
    """

    ##### YOUR CODE START #####  

    ##### YOUR CODE END #####

    assert df.shape[0] == (train_df.shape[0] + val_df.shape[0] + test_df.shape[0]), "Data split error: sum of train/val/test is not total number of images"

    return train_df, val_df, test_df

In [None]:
train_df, val_df, test_df = split_data_by_patient(metadata_with_onehot, class_labels, 0.2, 0.2)

print(f"Train set: {train_df.shape[0]} images and {len(train_df['Patient ID'].drop_duplicates())} patients")
print(f"Validation set: {val_df.shape[0]} images and {len(val_df['Patient ID'].drop_duplicates())} patients")
print(f"Test set: {test_df.shape[0]} images and {len(test_df['Patient ID'].drop_duplicates())} patients")
check_data_leakage(train_df, test_df)
check_data_leakage(train_df, val_df)

**Expected output:**
```
Train set: 67444 images and 18483 patients
Validation set: 22382 images and 6161 patients
Test set: 22294 images and 6161 patients
No data leakage detected.
No data leakage detected.
```


In [None]:
#check_label_imbalance(train_df, class_labels)
#check_label_imbalance(val_df, class_labels)
check_label_imbalance(test_df, class_labels)

Data leakage is resolved, but the class imbalance problem remains.
To address it, a simple balanced data split function can be implemented as below.

In [None]:
def split_data_by_patient_balanced(df, class_labels, test_size=0.2, val_size=0.2, random_state=42, min_patients_per_class=400):
    """
    Splits the dataset into train, validation, and test sets based on Patient ID, ensuring better balance across classes.
    
    Args:
    - df: DataFrame containing image data, patient IDs, and one-hot encoded class labels.
    - class_labels: List of class labels to consider for multi-label classification.
    - test_size: Relative share of the held-out patients assigned to the test set (test_size / (val_size + test_size)).
    - val_size: Relative share of the held-out patients assigned to the validation set; test_size + val_size is also the fraction of each class's available patients that may be sampled.
    - random_state: Seed for reproducibility.
    - min_patients_per_class: Upper bound on the number of patients sampled per class into the combined validation/test pool.
    
    Returns:
    - train_df: DataFrame containing training set images and labels.
    - val_df: DataFrame containing validation set images and labels.
    - test_df: DataFrame containing test set images and labels.
    """

    # Step 1: Group data by Patient ID and aggregate the class labels using sum
    df_patient_groups = df[class_labels + ['Patient ID']].groupby('Patient ID').agg("sum").reset_index()
    
    # Calculate the number of patients per class
    class_to_patient_counts = {label: df_patient_groups[label].astype(bool).sum() for label in class_labels}
    # print(f"Class to Patient Counts: {class_to_patient_counts}")

    class_labels_sorted = sorted(class_labels, key=lambda x: class_to_patient_counts[x])

    # for label in class_labels_sorted:
    #     print(f"The class '{label}' has {df_patient_groups[label].astype(bool).sum()} ({df_patient_groups[label].astype(bool).sum() / df_patient_groups.shape[0] * 100:.2f}%) patients")

    # Step 2: Select a combined validation and test set with a balanced number of patients per class
    val_test_patients = pd.DataFrame(columns=df_patient_groups.columns)

    # Step 3: Start by adding minority class patients to the validation and test sets
    for class_label in class_labels_sorted:  # Start with the minority classes
        patients_with_class = df_patient_groups[df_patient_groups[class_label] > 0]

        # Avoid sampling patients that are already in the validation/test set
        available_patients = patients_with_class[~patients_with_class['Patient ID'].isin(val_test_patients['Patient ID'])]
        
        # Randomly sample patients for validation and test sets from minority classes
        sampled_patients = available_patients.sample(min(min_patients_per_class, int(len(available_patients)*(test_size + val_size))), random_state=random_state)
        val_test_patients = pd.concat([val_test_patients, sampled_patients])

        print(f"selected {sampled_patients.shape[0]} patients for class {class_label} out of {patients_with_class.shape[0]} cases")


    # Ensure uniqueness of selected patients in the combined set
    val_test_patients = val_test_patients.drop_duplicates(subset=['Patient ID'])
    
    # Step 4: Split the combined set into validation and test sets
    val_patients, test_patients = train_test_split(val_test_patients, test_size=test_size / (val_size + test_size), random_state=random_state)
    
    # Step 5: The remaining patients are used for the training set
    train_patients = df_patient_groups[~df_patient_groups['Patient ID'].isin(val_test_patients['Patient ID'])]

    assert df_patient_groups.shape[0] == (train_patients.shape[0] + val_patients.shape[0] + test_patients.shape[0])

    # Step 6: Extract image paths and merge back with df_onehot
    def get_images_from_patients(df_images, df_patients):
        return df_images[df_images['Patient ID'].isin(df_patients['Patient ID'])]

    train_df = get_images_from_patients(df, train_patients)
    val_df = get_images_from_patients(df, val_patients)
    test_df = get_images_from_patients(df, test_patients)

    assert df.shape[0] == (train_df.shape[0] + val_df.shape[0] + test_df.shape[0]), "Data split error: Check patient overlap"

    print(f"[Train set]: {train_df.shape[0]} images and {len(train_patients)} patients")
    print(f"[ Val  set]: {val_df.shape[0]} images and {len(val_patients)} patients")
    print(f"[Test  set]: {test_df.shape[0]} images and {len(test_patients)} patients")


    return train_df, val_df, test_df



In [None]:
train_df, val_df, test_df = split_data_by_patient_balanced(metadata_with_onehot, class_labels, 0.2, 0.2)

check_data_leakage(train_df, test_df)
check_data_leakage(train_df, val_df)

This relieves the imbalance somewhat, but it remains severe. A better split would require a more sophisticated algorithm.

In [None]:
check_label_imbalance(test_df, class_labels)

## Weighted loss for handling the class imbalance problem

The standard cross-entropy loss for the $i^{th}$ training example is:

$$\mathcal{L}_{cross-entropy}(x_i) = -(y_i \log(\hat{y}_i) + (1-y_i) \log(1 - \hat{y}_i)),$$

where $x_i$ is the input feature, $y_i$ is the label, and $\hat{y}_i = f(x_i)$ is the model's sigmoid output (the probability that the example is positive).

During training, either $y_i=0$ or $(1-y_i)=0$, so only one of the two terms contributes to the loss.

The average cross-entropy loss over the entire training set $\mathcal{D}$ of size $N$ is therefore computed as follows:

$$\mathcal{L}_{cross-entropy}(\mathcal{D}) = - \frac{1}{N}\big( \sum_{\text{positive examples}} \log (f(x_i)) + \sum_{\text{negative examples}} \log(1-f(x_i)) \big).$$

As this formula shows, if there are few positive training cases, the loss is determined mainly by the negative class.

That is, the positive class (a pathological condition in the X-ray) and the negative class contribute to the loss in proportion to:

$$freq_{p} = \frac{\text{number of positive examples}}{N}$$
$$freq_{n} = \frac{\text{number of negative examples}}{N}$$

<mark>Exercise</mark> Complete the function `compute_class_freqs`, which computes the class frequencies.

In [None]:
def compute_class_freqs(df, class_labels):
    ##### YOUR CODE START #####  

    ##### YOUR CODE END #####
    return positive_frequencies, negative_frequencies

In [None]:
freq_pos, freq_neg = compute_class_freqs(metadata_with_onehot, class_labels)
print("frequency of positive case : ", freq_pos)
print("frequency of negative case : ", freq_neg)

In [None]:
data = pd.concat([pd.DataFrame({"Class": class_labels, "Label": "Positive", "Value": freq_pos}),
                    pd.DataFrame({"Class": class_labels, "Label": "Negative", "Value": freq_neg})], ignore_index=True)
plt.xticks(rotation=90)
f = sns.barplot(x="Class", y="Value", hue="Label" ,data=data)

As the plot above shows, positive cases contribute far less to the loss than negative cases.

We want these contributions to be equal, which can be achieved by multiplying the loss by class-specific weights $w_{pos}$ and $w_{neg}$.

That is, we want
$$w_{pos} \times freq_{p} = w_{neg} \times freq_{n},$$
which is satisfied by choosing
$$w_{pos} = freq_{n}$$
$$w_{neg} = freq_{p}$$
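A quick numeric check of this choice, using a hypothetical class with 2 positive examples out of 100 (the counts are made up for illustration):

```python
# Hypothetical class: 2 positives among 100 examples
n_pos, n_total = 2, 100
freq_p = n_pos / n_total
freq_n = (n_total - n_pos) / n_total

# The choice proposed above: each weight is the frequency of the other class
w_pos, w_neg = freq_n, freq_p

pos_contribution = w_pos * freq_p
neg_contribution = w_neg * freq_n
print(pos_contribution, neg_contribution)
```

Both products are the same quantity, $freq_{p} \times freq_{n}$, so the two sides of the balance condition match by construction.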


In [None]:
pos_weights = freq_neg
neg_weights = freq_pos
pos_contribution = freq_pos * pos_weights 
neg_contribution = freq_neg * neg_weights

That is, multiplying by the class-specific weights `pos_weights` and `neg_weights` makes the contributions of positive and negative cases to the loss equal.

In [None]:
data = pd.concat([pd.DataFrame({"Class": class_labels, "Label": "Positive", "Value": pos_contribution}),
                  pd.DataFrame({"Class": class_labels, "Label": "Negative", "Value": neg_contribution})], ignore_index=True)

plt.xticks(rotation=90)
sns.barplot(x="Class", y="Value", hue="Label" ,data=data);

The weighted loss that applies these weights is defined as:

$$\mathcal{L}_{cross-entropy}^{w}(x) = - (w_{pos} y \log(f(x)) + w_{neg}(1-y) \log( 1 - f(x) ) ).$$

Since only the ratio between $w_{pos}$ and $w_{neg}$ matters, the whole loss can be divided by $w_{neg}$ to reduce the two weights to a single one:

$$\mathcal{L}_{cross-entropy}^{w}(x) = - (w \times y \log(f(x)) + (1-y) \log( 1 - f(x) ) ),$$
where
$$w = \frac{w_{pos}}{w_{neg}}$$
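The simplification can be checked numerically. The sketch below is plain Python, not the PyTorch loss used later in this notebook; the function `weighted_bce`, the frequencies, and the example pairs are all made up for illustration.

```python
import math

def weighted_bce(y, p, w_pos, w_neg):
    """Weighted binary cross-entropy for one label y in {0, 1} and probability p."""
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))

freq_p, freq_n = 0.02, 0.98   # hypothetical class frequencies
w_pos, w_neg = freq_n, freq_p
w = w_pos / w_neg             # single weight, roughly 49 for these frequencies

# (label, predicted probability) pairs, made up for illustration
examples = [(1, 0.3), (0, 0.3), (1, 0.9), (0, 0.1)]
for y, p in examples:
    two_weight = weighted_bce(y, p, w_pos, w_neg)
    one_weight = weighted_bce(y, p, w, 1.0)
    # Dividing by w_neg rescales every example's loss by the same constant
    assert math.isclose(two_weight / w_neg, one_weight)
```

A constant rescaling of the loss changes only its overall magnitude (which interacts with the learning rate), not the relative weighting of positive and negative cases. A per-class weight of this form is what the notebook later passes as `pos_weight` to `nn.BCEWithLogitsLoss`.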

This balances the contributions of positive and negative cases to the `loss`, which should help the model achieve better accuracy on minority classes as well.

<mark>Exercise</mark> Complete the function `calculate_class_weights`, which computes the class-specific weight $w$.
- Use the `compute_class_freqs` function implemented earlier

In [None]:
def calculate_class_weights(df, class_labels):
    """
    Calculate class weights for multi-label classification based on the frequency of each class.

    Args:
    - df: DataFrame containing one-hot encoded labels for each sample.
    - class_labels: List of class label columns in the DataFrame.

    Returns:
    - class_weights: Array with one weight per class.
    """
    
    ##### YOUR CODE START #####  

    ##### YOUR CODE END #####
    
    return class_weights

In [None]:
class_weights = calculate_class_weights(metadata_with_onehot, class_labels)
print(class_weights)

**Expected output:**
```
[  8.69980102  39.38904899  23.02399829  47.68432479   7.41931366
  43.56279809  65.50059312 492.92070485   4.63587011  18.39121411
   0.85749076  16.70968251  32.1225997   77.35080363  20.14673708]
```


# Training an X-ray image classification model using PyTorch
Now let us train an X-ray image classification model using the [DenseNet121](https://www.kaggle.com/pytorch/densenet121) architecture.

In [None]:
import time, os, shutil
import torch
from torch import nn
from torchvision import models, transforms
from torch.utils.data import Dataset, DataLoader
from PIL import Image

from tqdm import tqdm

from training_utilities import AverageMeter, create_dataloaders, save_checkpoint, load_checkpoint

## Datasets
All of the functions implemented so far can now be combined to complete `ChestXrayDataset`.

In addition, `torchvision.transforms` can apply various transformations to the input images to perform data augmentation.

1. `Resize`: resizes the 1024x1024 input image to 224x224.
2. `RandomHorizontalFlip`, `RandomRotation`, `RandomAffine`, `ColorJitter`: random image transformations.
3. `ToTensor`: converts the image to a PyTorch tensor with values in [0, 1].
4. `Normalize`: normalizes the image by subtracting the mean from the pixel values and dividing by the standard deviation.

$$\frac{x_i - \mu}{\sigma}$$

The mean and standard deviation are normally computed on the train dataset and then applied; in this lab we use precomputed values.
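The arithmetic behind `ToTensor` and `Normalize` can be sketched on a single pixel. This is plain Python for illustration, not the torchvision implementation; the helper names and the raw intensity are hypothetical, while 0.485 and 0.229 are the values this notebook passes to `Normalize`.

```python
MEAN, STD = 0.485, 0.229  # values passed to Normalize in this notebook

def normalize(x, mean=MEAN, std=STD):
    """Map a [0, 1] pixel value to its normalized value."""
    return (x - mean) / std

def unnormalize(z, mean=MEAN, std=STD):
    """Invert normalize, recovering the [0, 1] pixel value."""
    return z * std + mean

pixel_uint8 = 128           # hypothetical raw 8-bit intensity
x = pixel_uint8 / 255       # ToTensor-style scaling to [0, 1]
z = normalize(x)

print(round(z, 3))
print(abs(unnormalize(z) - x) < 1e-12)
```

The inverse mapping is the same operation the visualization helper in this notebook uses (`img * 0.229 + 0.485`) to bring images back to a displayable range.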

In [None]:
def load_ChestXray_datasets(meta_csv_file, image_root_dir):
    train_transforms = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(degrees=10),
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
        transforms.Normalize([0.485], [0.229])  # Normalizing grayscale images
    ])

    test_transforms = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485], [0.229])
    ])


    metadata = pd.read_csv(meta_csv_file)
    label_onehot = encode_labels_one_hot(metadata, 'Finding Labels')
    label_onehot.drop(columns = ["No Finding"], inplace = True)
    class_labels = sorted(label_onehot.columns.tolist())

    metadata_with_onehot = pd.concat([metadata, label_onehot], axis=1) #[['Image Index', 'Patient ID']]
    metadata_with_onehot_and_imagepath = add_full_image_path_to_df(df = metadata_with_onehot, index_column = "Image Index", target_column="Image Path", image_dir = image_root_dir)

    train_df, val_df, test_df = split_data_by_patient(metadata_with_onehot_and_imagepath, class_labels, val_size= 0.2, test_size= 0.2, random_state= 42)

    class_weights = calculate_class_weights(train_df, class_labels)

    train_dataset = ChestXrayDataset(train_df, classes = class_labels, transform=train_transforms)
    val_dataset = ChestXrayDataset(val_df, classes = class_labels, transform=test_transforms)
    test_dataset = ChestXrayDataset(test_df, classes = class_labels, transform=test_transforms)
    
    return train_dataset, val_dataset, test_dataset, class_labels, class_weights

class ChestXrayDataset(Dataset):
    def __init__(self, df, classes, transform=None):
        """
        Args:
            df (pandas.DataFrame) : DataFrame with column ["Image Path"] and one-hot labels
            classes (list) : list of target classes
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.df = df
        self.classes = classes
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_name = self.df["Image Path"].iloc[idx]
        image = Image.open(img_name).convert('RGB')  # Convert grayscale to RGB
        
        labels = self.df[self.classes].iloc[idx].values

        if self.transform:
            image = self.transform(image)

        return image, torch.tensor(labels, dtype=torch.float)

def add_full_image_path_to_df(df, index_column, target_column, image_dir):
    """
    Generates a DataFrame with image file names and their corresponding directory paths,
    and merges it with the provided one-hot encoded DataFrame based on 'Image Index'.

    Args:
    - df: The one-hot encoded DataFrame with 'Image Index' as one of the columns.
    - index_column: The column name in the DataFrame to use as the index (e.g., Image Index).
    - target_column: Target column name to save the full image path
    - image_dir: The root directory containing subdirectories of images.

    Returns:
    - A merged DataFrame containing the one-hot encoded data and the corresponding image paths.
    """

    image_path_list = [
        (image, os.path.join(image_dir, subdir, image))
        for subdir in os.listdir(image_dir)
        for image in os.listdir(os.path.join(image_dir, subdir))
    ]
    df_path = pd.DataFrame(image_path_list, columns = [index_column, target_column])
    df_merged = pd.merge(df, df_path, on = index_column, how='inner')

    assert df.shape[0] == df_merged.shape[0]
    assert df_merged[target_column].isnull().sum() == 0, f"Found records that do not have image files: {df_merged[df_merged[target_column].isnull()]}"

    return df_merged

In [None]:
data_root_dir = "/datasets/ChestX-ray14"
train_dataset, val_dataset, test_dataset, class_labels, class_weights = load_ChestXray_datasets(
    meta_csv_file= os.path.join(data_root_dir, "Data_Entry_2017.csv"), # "sample_labels.csv"
    image_root_dir= os.path.join(data_root_dir, "images")
)

In [None]:
X, y = train_dataset[2]
print("Train dataset size: ", len(train_dataset))
print("Val dataset size: ", len(val_dataset))
print("Test dataset size: ", len(test_dataset))
print("Image X shape: ", X.shape)
print("Target y: ", y, y.dtype)

In [None]:
import matplotlib.pyplot as plt

def visualize_few_samples(dataset, class_labels, cols=8, rows=5):
    figure, axes = plt.subplots(rows, cols, figsize=(cols * 2, rows * 2)) 
    axes = axes.flatten()

    for i in range(cols * rows):
        sample_idx = torch.randint(len(dataset), size=(1,)).item()
        img, label = dataset[sample_idx]
        img = img.permute(1, 2, 0).numpy()  # Convert to numpy array
        img = (img * 0.229 + 0.485)  # Unnormalize to [0, 1] for display
        label_names = [class_labels[idx] for idx, value in enumerate(label) if value == 1]
        axes[i].imshow(img)
        axes[i].set_title(", ".join(label_names))
        axes[i].axis('off')
    
    plt.tight_layout()
    plt.show()

In [None]:
visualize_few_samples(train_dataset, class_labels, cols = 5, rows = 3)

In [None]:
train_dataloader, val_dataloader, test_dataloader = create_dataloaders(train_dataset, val_dataset, test_dataset, device = "cpu", batch_size = 16, num_worker= 1)

for X, y in test_dataloader:
    print(f"Mini-batch shape of X [N, C, H, W]: {X.shape}")
    print(f"Mini-batch shape of y: {y.shape} {y.dtype}")
    break

## Evaluating model
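The following metrics can be used to evaluate model performance from several perspectives:

- EMR (Exact Match Ratio): the fraction of samples in a multi-label classification problem for which every label is predicted correctly.
- Accuracy: the fraction of all individual label predictions that are correct.
- F1 score: the harmonic mean of precision and recall, which gives a balanced evaluation even under class imbalance.
- ROC_AUC (Receiver Operating Characteristic - Area Under Curve)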
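To make the difference between the first three metrics concrete, the sketch below computes them by hand on a toy multi-label example. It is a plain-Python stand-in, not the scikit-learn functions used in the loops below; the targets and predictions are made up, and an F1 with an empty denominator is set to 0 here by assumption.

```python
# Toy multi-label targets and predictions: 4 samples x 3 classes (made-up values)
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]

n_samples, n_classes = len(y_true), len(y_true[0])
pairs = list(zip(y_true, y_pred))

# EMR: a sample counts only if every one of its labels matches
emr = sum(t == p for t, p in pairs) / n_samples

# Accuracy: every (sample, class) cell is scored independently
correct = sum(t[c] == p[c] for t, p in pairs for c in range(n_classes))
accuracy = correct / (n_samples * n_classes)

def f1_for_class(c):
    """Per-class F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(t[c] == 1 and p[c] == 1 for t, p in pairs)
    fp = sum(t[c] == 0 and p[c] == 1 for t, p in pairs)
    fn = sum(t[c] == 1 and p[c] == 0 for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: 0 when undefined

# Macro averaging: unweighted mean of the per-class F1 scores
f1_macro = sum(f1_for_class(c) for c in range(n_classes)) / n_classes

print(emr, round(accuracy, 4), round(f1_macro, 4))
```

Here two wrong cells out of twelve leave the accuracy high, while EMR drops to one half and the macro F1 is pulled down by the class that has no correct positive prediction. This is why accuracy alone can look deceptively good on imbalanced labels.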

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix, accuracy_score

def train_loop(model, device, dataloader, criterion, optimizer, epoch):
    # train for one epoch
    loss_meter = AverageMeter('Loss', ':.4e')
    emr_meter = AverageMeter('EMR', ':6.2f') # Exact Match Ratio
    acc_meter = AverageMeter('Accuracy', ':6.2f')
    metrics_list = [loss_meter, emr_meter, acc_meter, ]
    
    model.train() # switch to train mode

    tqdm_epoch = tqdm(dataloader, desc=f'Training Epoch {epoch + 1}', total=len(dataloader))
    for images, target in tqdm_epoch:
        images = images.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)

        output = model(images)
        loss = criterion(output, target)

        loss_meter.update(loss.item(), images.size(0))

        probs = torch.sigmoid(output)
        preds = (probs > 0.5).float()

        current_emr = accuracy_score(target.cpu().numpy(), preds.cpu().numpy())
        current_acc = accuracy_score(target.cpu().numpy().flatten(), preds.cpu().numpy().flatten())
        emr_meter.update(current_emr, images.size(0))
        acc_meter.update(current_acc, images.size(0))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tqdm_epoch.set_postfix(avg_metrics = ", ".join([str(x) for x in metrics_list]))

    tqdm_epoch.close()

    metrics = {
        "epoch" : epoch,
        "loss" : loss_meter.avg,
        "exact_match_ratio" : emr_meter.avg,
        "accuracy" : acc_meter.avg,
    }

    return metrics

<mark>Exercise</mark> To evaluate the deep learning model more accurately, complete the `evaluation_loop` function so that it computes the F1 score.
- Use `sklearn.metrics.f1_score` ([docs](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.f1_score.html)) with "macro" averaging

In [None]:
def evaluation_loop(model, device, dataloader, criterion, epoch = 0, phase = "validation"):
    loss_meter = AverageMeter('Loss', ':.4e')
    emr_meter = AverageMeter('EMR', ':6.2f')
    acc_meter = AverageMeter('Accuracy', ':6.2f')
    metrics_list = [loss_meter, emr_meter, acc_meter]

    all_targets_list = []
    all_probs_list = []
    all_preds_list = []

    model.eval() # switch to evaluate mode

    with torch.no_grad():
        tqdm_val = tqdm(dataloader, desc='Validation/Test', total=len(dataloader))
        for images, target in tqdm_val:
            images = images.to(device, non_blocking=True)
            target = target.to(device, non_blocking=True)

            output = model(images)
            
            loss = criterion(output, target)
            
            loss_meter.update(loss.item(), images.size(0))

            probs = torch.sigmoid(output)
            preds = (probs > 0.5).float()

            current_emr = accuracy_score(target.cpu().numpy(), preds.cpu().numpy())
            current_acc = accuracy_score(target.cpu().numpy().flatten(), preds.cpu().numpy().flatten())
            emr_meter.update(current_emr, images.size(0))
            acc_meter.update(current_acc, images.size(0))

            all_targets_list.append(target.cpu().numpy()) # (batch_size, # classes)
            all_probs_list.append(probs.cpu().numpy())
            all_preds_list.append(preds.cpu().numpy())

            tqdm_val.set_postfix(avg_metrics = ", ".join([str(x) for x in metrics_list]))

        tqdm_val.close()

    all_targets = np.concatenate(all_targets_list, axis=0)  # (# images, # classes)
    all_probs = np.concatenate(all_probs_list, axis=0)
    all_preds = np.concatenate(all_preds_list, axis=0)

    ## calculate metrics
    ##### YOUR CODE START #####
    f1_macro = None # TODO
    ##### YOUR CODE END #####

    per_class_roc_auc = roc_auc_score(all_targets, all_probs, average=None, multi_class='ovr')
    mean_roc_auc = np.array(per_class_roc_auc).mean()
    
    metrics = {
        "epoch" : epoch,
        f"{phase.capitalize()} loss" : loss_meter.avg,
        "exact_match_ratio" : emr_meter.avg,
        "accuracy" : acc_meter.avg,
        "f1_macro" : f1_macro,
        "mean_roc_auc" : mean_roc_auc,
        "per_class_roc_auc" : per_class_roc_auc,
    }

    return metrics


## DenseNet architecture
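DenseNet is a CNN architecture in which every layer takes the outputs of all preceding layers as its input. As the figure below shows, each layer is connected to all earlier layers, which enables efficient information flow and feature reuse.
- For example, the output of the first layer is passed to the 2nd, 3rd, 4th, and all later layers.
- Likewise, the output of the second layer is passed to the 3rd, 4th, and all later layers.

<img src="resources/densenet.png" alt="DenseNet architecture" width="400" align="middle"/>

For more details, see the paper [Gao Huang et al., Densely Connected Convolutional Networks, 2018](https://arxiv.org/pdf/1608.06993.pdf).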
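Because each layer's output is concatenated onto its input, the channel count grows linearly inside a dense block. The sketch below traces that bookkeeping for the DenseNet-121 configuration (64 initial feature maps, growth rate 32, blocks of 6, 12, 24, and 16 layers, and transitions that halve the channels). The helper `dense_block_channels` is hypothetical and only does arithmetic; it builds no layers.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Return the input width seen by each layer and the block's output width."""
    # Layer i sees the block input plus the outputs of the i layers before it
    widths = [in_channels + i * growth_rate for i in range(num_layers)]
    out_channels = in_channels + num_layers * growth_rate
    return widths, out_channels

block_config = (6, 12, 24, 16)  # layers per dense block in DenseNet-121
channels = 64                   # feature maps entering the first dense block
for i, num_layers in enumerate(block_config):
    _, channels = dense_block_channels(channels, num_layers, growth_rate=32)
    if i < len(block_config) - 1:
        channels //= 2          # transition layer halves the channel count

print(channels)  # 1024
```

The final width of 1024 is the number of input features of the classifier layer that gets replaced to match the number of classes.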

To use the DenseNet architecture, we load the DenseNet121 model in PyTorch and replace its final fully connected layer to match the num_classes of the ChestX-ray dataset.

In [None]:
def get_model(model_name, num_classes, config):
    if model_name == "resnet50":
        if config.get('pretrained', ""): #if pretrained model name is given
            print(f'Using pretrained model {config["pretrained"]}')
            model = models.resnet50(weights = config["pretrained"])
            model.fc = nn.Linear(model.fc.in_features, num_classes)
        else:
            model = models.resnet50()
            model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif model_name == "densenet121":
        if config.get('pretrained', ""):
            print(f'Using pretrained model {config["pretrained"]}')
            model = models.densenet121(weights = config["pretrained"]) 
        else:
            model = models.densenet121()
        model.classifier = nn.Linear(model.classifier.in_features, num_classes)

    else:
        raise Exception("Model not supported: {}".format(model_name))
    
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"Using model {model_name} with {total_params} parameters ({trainable_params} trainable)")

    return model

In [None]:
def train_main(config):
    ## data and preprocessing settings
    data_root_dir = config['data_root_dir']
    num_worker = config.get('num_worker', 4)

    ## Hyper parameters
    batch_size = config['batch_size']
    learning_rate = config['learning_rate']
    start_epoch = config.get('start_epoch', 0)
    num_epochs = config['num_epochs']

    ## checkpoint setting
    checkpoint_save_interval = config.get('checkpoint_save_interval', 10)
    checkpoint_path = config.get('checkpoint_path', "checkpoints/checkpoint.pth")
    best_model_path = config.get('best_model_path', "checkpoints/best_model.pth")
    load_from_checkpoint = config.get('load_from_checkpoint', None)

    ## variables
    best_metric = 0

    device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Using {device} device")

    train_dataset, val_dataset, test_dataset, class_labels, class_weights = load_ChestXray_datasets(
        meta_csv_file= os.path.join(data_root_dir, "Data_Entry_2017.csv" ), # To use 1% of dataset, use "sample_labels.csv"
        image_root_dir= os.path.join(data_root_dir, "images")
    )
    num_classes = len(class_labels)
    
    train_dataloader, val_dataloader, test_dataloader = create_dataloaders(
        train_dataset, val_dataset, test_dataset, 
        device, batch_size = batch_size, num_worker = num_worker
    )

    model = get_model(model_name = config["model_name"], num_classes= num_classes, config = config).to(device)

    criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(class_weights, dtype=torch.float32)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1) 


    if load_from_checkpoint:
        load_checkpoint_path = (best_model_path if load_from_checkpoint == "best" else checkpoint_path)
        start_epoch, best_metric = load_checkpoint(load_checkpoint_path, model, optimizer, scheduler, device)

    val_metrics_list = []

    if config.get('test_mode', False):
        # Only evaluate on the test dataset
        print("Running test evaluation...")
        test_metrics = evaluation_loop(model, device, test_dataloader, criterion, phase = "test")
        # evaluation_loop returns a dict of metrics, so report the f1_macro entry
        print(f"Test f1_macro: {test_metrics['f1_macro']}")
    else:
        # Train and validate using train/val datasets
        for epoch in range(start_epoch, num_epochs):
            train_metrics = train_loop(model, device, train_dataloader, criterion, optimizer, epoch)
            val_metrics = evaluation_loop(model, device, val_dataloader, criterion, epoch = epoch, phase = "validation")
            scheduler.step()

            if (epoch + 1) % checkpoint_save_interval == 0 or (epoch + 1) == num_epochs:
                is_best = val_metrics["f1_macro"] > best_metric
                best_metric = max(val_metrics["f1_macro"], best_metric)
                save_checkpoint(checkpoint_path, model, optimizer, scheduler, epoch, best_metric, is_best, best_model_path)

            ## print metrics
            val_metrics_list.append({k:v for k, v in val_metrics.items() if not k.startswith("per_class")})
            metrics_df = pd.DataFrame(val_metrics_list)
            print("Validation metrics :\n", metrics_df)
            # print("Class-wise ROC-AUC :\n", val_metrics["per_class_roc_auc"])
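
`train_main` passes `class_weights` to `nn.BCEWithLogitsLoss` as `pos_weight`. The sketch below shows what that argument does: it scales only the positive-label term of the loss, per class. The logits, targets, and weights are hypothetical values chosen for illustration, and the way `class_weights` is actually computed is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical batch: 2 samples, 2 classes
logits = torch.tensor([[2.0, -1.0], [-0.5, 0.3]])
targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
pos_weight = torch.tensor([1.0, 5.0])  # hypothetical per-class weights

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(logits, targets)

# Same quantity written out by hand:
# -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))], averaged over all elements.
# log(1 - sigmoid(x)) equals logsigmoid(-x).
manual = -(pos_weight * targets * F.logsigmoid(logits)
           + (1 - targets) * F.logsigmoid(-logits)).mean()

print(loss.item(), manual.item())
```

A weight above 1 makes a missed positive for that class cost more, which is a common way to counter rare labels.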

You are now ready to train.
- Training on the full dataset (40+ GB) takes a long time, so set aside enough time for the run.

In [None]:
config = {
    'data_root_dir': '/datasets/ChestX-ray14',
    'batch_size': 64,
    'learning_rate': 1e-3,
    'model_name': 'densenet121',
    'pretrained' : 'IMAGENET1K_V1',
    'num_epochs': 10,

    "checkpoint_save_interval" : 10,
    "checkpoint_path" : "checkpoints/checkpoint.pth",
    "best_model_path" : "checkpoints/best_model.pth",
    "load_from_checkpoint" : None,    # Options: "latest", "best", or None
}
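
One point worth checking with this configuration: `train_main` builds its `StepLR` scheduler with `step_size=50`, while `num_epochs` here is 10. A minimal sketch, assuming the scheduler is stepped once per epoch from epoch 0, shows that the learning rate then never decays during the run. The helper `lr_per_epoch` is hypothetical and exists only for this illustration.

```python
import torch


def lr_per_epoch(step_size, num_epochs=10, base_lr=1e-3, gamma=0.1):
    """Return the learning rate used in each epoch under a StepLR schedule."""
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.Adam([param], lr=base_lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    lrs = []
    for _ in range(num_epochs):
        lrs.append(optimizer.param_groups[0]["lr"])
        optimizer.step()   # stands in for one epoch of training
        scheduler.step()   # one scheduler step per epoch, as in train_main
    return lrs


lrs_50 = lr_per_epoch(step_size=50)  # the value used in train_main
lrs_3 = lr_per_epoch(step_size=3)    # a shorter step, for comparison
print(lrs_50)
print(lrs_3)
```

If a decay is wanted within a 10-epoch run, a smaller `step_size` or a larger `num_epochs` would be needed.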

In [None]:
train_main(config)

### Before finishing the lab, delete all saved checkpoints to free up disk space

In [None]:
import shutil, os
if os.path.exists('checkpoints/'):
    shutil.rmtree('checkpoints/')