### Overall Idea

This notebook studies how the distribution of a demographic feature (here, sex) in the training and test data affects model performance. The sampling ratios of that feature are controlled in both sets, so results can be compared across test distributions and can expose potential bias or generalization problems in skin lesion classification.

In [9]:
import pandas as pd
import numpy as np
import tabulate as tb
from typing import Dict
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import accuracy_score, f1_score
from tensorflow.keras.applications import MobileNetV2
SEED = 42
np.random.seed(SEED)

### 1. Data Preparation and Filtering

- Load metadata about skin lesion images, including image IDs, diagnoses, and demographic features.
- Create file paths for each image based on the metadata.
- Filter the dataset to focus on three specific lesion classes.
- Remove unnecessary columns to keep the dataset clean and relevant.

In [None]:
file_path = './data/metadata.csv'
df = pd.read_csv(file_path, sep=',')
df['file_path'] = 'data/images/' + df['image_id'] + '.jpg'

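# Keep only the label, the demographic feature, and the image path.
df = df.drop(columns=['image_id', 'lesion_id', 'dx_type', 'age', 'localization'])

# Show the class counts before filtering, then keep the three classes of interest.
print("Class counts:", df['dx'].value_counts().to_dict())
df = df[df['dx'].isin(['nv', 'mel', 'bkl'])]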

### 2. Balanced Subset Sampling

- Define a function to extract balanced subsets of the data based on class labels and specific feature values (e.g., male or female).
- Ensure each class has the same number of samples within the chosen feature group.
- This helps in creating fair and balanced datasets for training or testing.

In [10]:
def get_balanced_subset(
    df, class_col, feature_col, feature_value,
    samples_per_class, randomize=True, reset_index=False
):
    """
    Select a balanced subset of the data for a given feature value, with equal number of samples per class.

    Args:
        df: DataFrame
        class_col: column name of class labels
        feature_col: column name of feature
        feature_value: specific feature value to filter
        samples_per_class: number of samples per class
        randomize: whether to shuffle within class before selecting (uses the global SEED)
        reset_index: whether to reset index of returned DataFrame

    Returns:
        Balanced DataFrame subset
    """
    tmp = df[df[feature_col] == feature_value]

    counts = tmp[class_col].value_counts()
    for cl, count in counts.items():
        if count < samples_per_class:
            raise ValueError(f"Not enough samples for class '{cl}' in feature '{feature_value}'. "
                             f"Required: {samples_per_class}, Available: {count}")

    tmp = pd.concat([
        (g.sample(frac=1, random_state=SEED).head(samples_per_class) if randomize else g.head(samples_per_class))
        for _, g in tmp.groupby(class_col)
    ])

    if reset_index:
        tmp = tmp.reset_index(drop=True)

    return tmp

tmp_test = get_balanced_subset(
    df=df, class_col='dx', feature_col='sex', feature_value='male', 
    samples_per_class=2, randomize=True, reset_index=True)
print(tb.tabulate(tmp_test, headers='keys', tablefmt='psql'))

+----+------+-------+------------------------------+
|    | dx   | sex   | file_path                    |
|----+------+-------+------------------------------|
|  0 | bkl  | male  | data/images/ISIC_0026015.jpg |
|  1 | bkl  | male  | data/images/ISIC_0031132.jpg |
|  2 | mel  | male  | data/images/ISIC_0033209.jpg |
|  3 | mel  | male  | data/images/ISIC_0024410.jpg |
|  4 | nv   | male  | data/images/ISIC_0027839.jpg |
|  5 | nv   | male  | data/images/ISIC_0032326.jpg |
+----+------+-------+------------------------------+
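The same per-class sampling can also be written with pandas' `groupby(...).sample(n=...)`. The sketch below is an alternative formulation, not the notebook's implementation: `balanced_subset_groupby` and the toy DataFrame are hypothetical, and the rows it picks will generally differ from those chosen by `get_balanced_subset`, because the random draws are made differently.

```python
import pandas as pd

def balanced_subset_groupby(df, class_col, feature_col, feature_value,
                            samples_per_class, seed=42):
    """Hypothetical variant: draw samples_per_class rows per class for one feature value."""
    subset = df[df[feature_col] == feature_value]
    counts = subset[class_col].value_counts()
    short = counts[counts < samples_per_class]
    if not short.empty:
        raise ValueError(f"Not enough samples for classes: {short.to_dict()}")
    # groupby(...).sample draws n rows from every group in a single call.
    return subset.groupby(class_col).sample(n=samples_per_class, random_state=seed)

# Toy data (assumed for illustration only).
toy = pd.DataFrame({
    "dx":  ["nv", "nv", "nv", "mel", "mel", "bkl", "bkl", "nv"],
    "sex": ["male"] * 7 + ["female"],
})
picked = balanced_subset_groupby(toy, "dx", "sex", "male", samples_per_class=2)
print(picked["dx"].value_counts().to_dict())
```

Each class contributes exactly two rows here, and the single female row is never selected.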


### 3. Experimental Data Sampling with Feature Ratios

- Develop a more flexible sampling function to create subsets with specific ratios of feature values (e.g., 20% male, 80% female).
- Balance the classes within each feature group according to the desired ratio and total sample size.
- Optionally exclude samples already used in other datasets to avoid overlap.
- Validate that the resulting subsets respect the requested feature distributions within a certain tolerance.
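In this scheme the number of samples per class and feature value is truncated, `int(size * ratio / n_classes)`, so the returned subset can be slightly smaller than `size` and the realized ratio can drift from the requested one. The sketch below reproduces only that arithmetic; `per_class_counts` and `realized` are hypothetical helper names, not part of the notebook.

```python
from typing import Dict, Tuple

def per_class_counts(size: int, ratio: Dict[str, float], n_classes: int) -> Dict[str, int]:
    """Samples drawn per class for each feature value (truncated, as in get_exp_data)."""
    return {value: int(size * share / n_classes) for value, share in ratio.items()}

def realized(size: int, ratio: Dict[str, float], n_classes: int) -> Tuple[int, Dict[str, float]]:
    """Total rows actually returned and the resulting share of each feature value."""
    counts = per_class_counts(size, ratio, n_classes)
    total = sum(counts.values()) * n_classes
    shares = {v: c * n_classes / total for v, c in counts.items()} if total else {}
    return total, shares

total, shares = realized(100, {"male": 0.8, "female": 0.2}, n_classes=3)
print(total, shares)  # 96 {'male': 0.8125, 'female': 0.1875}
```

With three classes, a request for 100 samples at 80/20 yields 26 and 6 samples per class, that is 96 rows at 81.25% male. This stays within the default `max_diff` of 0.05, and the same figures appear in the experiment output in section 5.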


In [11]:
def get_exp_data(df, class_col, feature_col, ratio : Dict, size, randomize=True, exclude_column=None, exclude_df=None, max_diff=0.05):
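    '''
    Sample a class-balanced subset whose feature values follow the requested ratios.

    Args:
        df: DataFrame containing the data
        class_col: column name for class labels
        feature_col: column name of the feature to control (e.g. 'sex')
        ratio: dictionary mapping feature values to their requested share (e.g. {'male': 0.2, 'female': 0.8})
        size: target total number of samples; the result may be slightly smaller because per-class counts are truncated
        randomize: whether to shuffle the DataFrame (with the global SEED) before sampling
        exclude_column: column used to match rows that must be excluded (e.g. 'file_path')
        exclude_df: DataFrame whose exclude_column values are removed from df before sampling
        max_diff: maximum allowed absolute difference between requested and realized feature ratios

    Returns:
        DataFrame with the sampled rows and a reset index

    Raises:
        ValueError: if a group has too few samples or a realized ratio is off by more than max_diff
    '''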
    if randomize:
        df_rnd = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
    else:
        df_rnd = df.copy()
        
    if exclude_column is not None and exclude_df is not None:
        if exclude_column not in df_rnd.columns:
            raise ValueError(f"Column '{exclude_column}' not found in DataFrame.")
        if exclude_column not in exclude_df.columns:
            raise ValueError(f"Column '{exclude_column}' not found in exclude DataFrame.")
        df_rnd = df_rnd[~df_rnd[exclude_column].isin(exclude_df[exclude_column])]
        
    uniq_classes = df_rnd[class_col].unique()
    uniq_features = df_rnd[feature_col].unique()
    
    def get_exp_data_inner(tmp_df, size):
        df_tmp = None
        for uf in uniq_features:
            if ratio.get(uf) is None:
                print(f"Feature '{uf}' not found in ratios. Skipping.")
                continue            
            c_amt = int(size * ratio[uf] / len(uniq_classes))
            # c_amt is truncated and may be 0 (e.g. for a ratio of 0.0); that feature value then contributes no rows.
            tmp = get_balanced_subset(df=tmp_df, class_col=class_col, feature_col=feature_col, feature_value=uf, 
                                        samples_per_class=c_amt, randomize=False)
            if df_tmp is None:
                df_tmp = tmp
            else:
                df_tmp = pd.concat([df_tmp, tmp])
        return df_tmp
            
    df_res = get_exp_data_inner(df_rnd, size)
    
    if len(df_res) < size:
        print(f"Samples for ({len(df_res)}) are less than requested ({size}).")
    
    ratios_fet = df_res[feature_col].value_counts(normalize=True).to_dict()
    ratios_cls = df_res[class_col].value_counts(normalize=False).to_dict()
    print(f"[] Ratios for {feature_col}: {ratios_fet}")
    print(f"[] Ratios for {class_col}: {ratios_cls}")
    
    for k in ratio:
        if ratios_fet.get(k) is None:
            if ratio[k] > 0.0:
                raise ValueError(f"Feature '{k}' not found in DataFrame after sampling (try increasing the 'size' parameter).")
        elif abs(ratios_fet[k] - ratio[k]) > max_diff:
            raise ValueError(f"Feature '{k}' ratio {ratios_fet[k]} differs from requested {ratio[k]} by more than {max_diff}.")
    
    print()
    
    df_res = df_res.reset_index(drop=True)
    
    return df_res      

tmp_test = get_exp_data(
    df=df, class_col='dx', feature_col='sex', ratio={'male':0.2, 'female':0.8}, size=15, randomize=True)
print(tb.tabulate(tmp_test, headers='keys', tablefmt='psql'))

Feature 'unknown' not found in ratios. Skipping.
[] Ratios for sex: {'female': 0.8, 'male': 0.2}
[] Ratios for dx: {'bkl': 5, 'mel': 5, 'nv': 5}

+----+------+--------+------------------------------+
|    | dx   | sex    | file_path                    |
|----+------+--------+------------------------------|
|  0 | bkl  | female | data/images/ISIC_0032826.jpg |
|  1 | bkl  | female | data/images/ISIC_0028258.jpg |
|  2 | bkl  | female | data/images/ISIC_0027374.jpg |
|  3 | bkl  | female | data/images/ISIC_0031558.jpg |
|  4 | mel  | female | data/images/ISIC_0027102.jpg |
|  5 | mel  | female | data/images/ISIC_0032110.jpg |
|  6 | mel  | female | data/images/ISIC_0033038.jpg |
|  7 | mel  | female | data/images/ISIC_0030759.jpg |
|  8 | nv   | female | data/images/ISIC_0027539.jpg |
|  9 | nv   | female | data/images/ISIC_0031493.jpg |
| 10 | nv   | female | data/images/ISIC_0028462.jpg |
| 11 | nv   | female | data/images/ISIC_0025739.jpg |
| 12 | bkl  | male   | data/images/ISIC_0033

### 4. Data Loading and Model Preparation

- Implement image loading and preprocessing to prepare images for model input.
- Convert class labels from strings to integer codes for model compatibility.
- Create TensorFlow datasets from file paths and labels for efficient batch processing.
- Define two types of models:
  - A simple convolutional neural network built from scratch.
  - A transfer learning model based on MobileNetV2, using pretrained weights and a custom classification head.
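One detail of the label conversion deserves a small illustration: `astype('category').cat.codes` numbers the classes in sorted order of the values present in that particular DataFrame. The sketch below shows how the mapping would shift if a subset lacked a class, and one possible way to pin it with an explicit category list. `CLASSES` and `encode_labels` are hypothetical names introduced for this example. The notebook's balanced sampling keeps all three classes in every subset, so its codes stay consistent.

```python
import pandas as pd

CLASSES = ["bkl", "mel", "nv"]  # assumed fixed class order for this sketch

def encode_labels(labels, classes=CLASSES):
    """Map class names to integer codes using a fixed category list."""
    cat = pd.Categorical(labels, categories=classes)
    if (cat.codes == -1).any():
        raise ValueError("Found a label that is not in the class list.")
    return cat.codes

full = pd.Series(["nv", "mel", "bkl", "nv"])
partial = pd.Series(["nv", "mel"])  # 'bkl' is missing from this subset

print(full.astype("category").cat.codes.tolist())     # [2, 1, 0, 2]
print(partial.astype("category").cat.codes.tolist())  # [1, 0]  (mapping shifted)
print(encode_labels(partial).tolist())                # [2, 1]  (mapping pinned)
```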

In [12]:
def load_image(file_path, target_size=(224, 224)):
    image = tf.io.read_file(file_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, target_size)
    image = image / 255.0  # Normalize to [0, 1]
    return image

def get_data_for_model(df, class_col, files_col):
    image_paths = df[files_col].values
    # Map class names to integer codes (sorted order of the classes present in df).
    labels = df[class_col].astype('category').cat.codes.values
    dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    dataset = dataset.map(lambda path, label: (load_image(path), label))
    dataset = dataset.batch(32)
    return dataset
    
def create_simple_model(num_classes, input_shape=(224, 224, 3)):
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        MaxPooling2D(pool_size=(2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), 
                  loss='sparse_categorical_crossentropy', 
                  metrics=['accuracy'])
    return model, "CNN"


def create_mobile_net2(num_classes, input_shape=(224, 224, 3)):
    base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=input_shape)
    base_model.trainable = False
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(128, activation='relu')(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    model.compile(optimizer=Adam(learning_rate=0.001), 
                  loss='sparse_categorical_crossentropy', 
                  metrics=['accuracy'])
    return model, "mobile net 2"

### 5. Experiment Execution

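- Build a function to run repeated training and evaluation experiments.
- For each repetition:
  - Sample training data according to specified feature ratios.
  - Sample multiple test sets with different feature distributions, excluding the training images, to analyze model performance under varying conditions.
  - Train a freshly created model on the training set.
  - Evaluate the model on each test set, calculating accuracy and weighted F1 scores.
- Write the accumulated results to a CSV file as the experiment runs, so that partial results survive an interruption.
- Note that `get_exp_data` shuffles with the fixed `SEED`, so the sampled rows are the same in every repetition; the variation between repetitions comes from model initialization and training.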
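The notebook relies on scikit-learn's `accuracy_score` and `f1_score(average='weighted')` for these metrics. To make the weighted average concrete, the following is a pure-Python re-derivation for illustration only: per-class F1 scores are averaged with weights proportional to each class's number of true samples. `accuracy` and `weighted_f1` are hypothetical helpers and the label arrays are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of class c
        score += f1 * support / total
    return score

# Toy labels (assumed): three classes with two samples each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"accuracy={accuracy(y_true, y_pred):.3f}, weighted F1={weighted_f1(y_true, y_pred):.3f}")
```

Because every test set in this experiment is class-balanced, all classes have (nearly) equal support, so the weighted F1 is close to the unweighted mean of the per-class F1 scores.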


In [7]:
def perform_tests(df, train_meta, test_metas, reps, class_col, feature_split_col, exclude_column, files_col, get_model, epochs_num):
    res = []
    for r in range(reps):
        train = get_exp_data(df, class_col=class_col, feature_col=feature_split_col, ratio=train_meta['ratio'], size=train_meta['size'])
        tests = [
            get_exp_data(df, class_col=class_col, feature_col=feature_split_col, ratio=tm['ratio'], size=tm['size'], exclude_column=exclude_column, exclude_df=train) for tm in test_metas
        ]

        train_dataset = get_data_for_model(train, class_col=class_col, files_col=files_col)
        test_datasets = [
            get_data_for_model(test, class_col=class_col, files_col=files_col) for test in tests
        ]

        train_ratio = '/'.join([f"{k}:{v}" for k, v in train_meta['ratio'].items()])
        train_ratio_rel = '/'.join([f"{k}:{v:.4f}" for k, v in train[feature_split_col].value_counts(normalize=True).to_dict().items()])
        
        model, model_name = get_model(num_classes=len(df[class_col].unique()))
        model.fit(train_dataset, epochs=epochs_num)
        
        for test_dataset, test_meta, test_df in zip(test_datasets, test_metas, tests):
            predictions = model.predict(test_dataset)            
            y_true = test_df[class_col].astype('category').cat.codes.values
            y_pred = np.argmax(predictions, axis=1)
            acc = accuracy_score(y_true, y_pred)
            f1 = f1_score(y_true, y_pred, average='weighted')
            
            test_ratio = '/'.join([f"{k}:{v}" for k, v in test_meta['ratio'].items()])
            test_ratio_rel = '/'.join([f"{k}:{v:.4f}" for k, v in test_df[feature_split_col].value_counts(normalize=True).to_dict().items()])
            
            res.append([
                r,
                model_name,
                feature_split_col,
                train_meta['size'], 
                train_ratio,
                test_meta['size'],
                test_ratio,
                acc,
                f1,
                train_ratio_rel,
                test_ratio_rel
            ])    

            print(f"Rep: {r:2} | Model: {model_name} | Feature Split: {feature_split_col} | Ratio: {test_ratio} | Acc: {acc:.2f}")
            
            res_df = pd.DataFrame(res, columns=[
                'rep', 'model_name', 'feature_split_col', 
                'train_size', 'train_ratio', 'test_size', 'test_ratio',
                'accuracy', 'f1_score', 'train_ratio_rel', 'test_ratio_rel'
            ])
            
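            # Use double quotes outside: reusing the same quote type inside an f-string is a syntax error before Python 3.12.
            res_df.to_csv(f"res/res_{model_name.replace(' ', '_')}.csv", index=False)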
        
        
perform_tests(df=df, 
              train_meta={'ratio': {'male':0.0, 'female':1.0}, 'size': 500},
              test_metas=[
                  {'ratio': {'male':1.0, 'female':0.0}, 'size': 100},
                  {'ratio': {'male':0.8, 'female':0.2}, 'size': 100},
                  {'ratio': {'male':0.6, 'female':0.4}, 'size': 100},
                  {'ratio': {'male':0.4, 'female':0.6}, 'size': 100},
                  {'ratio': {'male':0.2, 'female':0.8}, 'size': 100},
                  {'ratio': {'male':0.0, 'female':1.0}, 'size': 100},
              ],
              reps=20,
              class_col='dx',
              feature_split_col='sex',
              exclude_column='file_path',
              files_col='file_path',
              get_model=create_mobile_net2,
              epochs_num=10
              )

Feature 'unknown' not found in ratios. Skipping.
Samples for (498) are less than requested (500).
[] Ratios for sex: {'female': 1.0}
[] Ratios for dx: {'bkl': 166, 'mel': 166, 'nv': 166}

Feature 'unknown' not found in ratios. Skipping.
Samples for (99) are less than requested (100).
[] Ratios for sex: {'male': 1.0}
[] Ratios for dx: {'bkl': 33, 'mel': 33, 'nv': 33}

Feature 'unknown' not found in ratios. Skipping.
Samples for (96) are less than requested (100).
[] Ratios for sex: {'male': 0.8125, 'female': 0.1875}
[] Ratios for dx: {'bkl': 32, 'mel': 32, 'nv': 32}

Feature 'unknown' not found in ratios. Skipping.
Samples for (99) are less than requested (100).
[] Ratios for sex: {'male': 0.6060606060606061, 'female': 0.3939393939393939}
[] Ratios for dx: {'bkl': 33, 'mel': 33, 'nv': 33}

Feature 'unknown' not found in ratios. Skipping.
Samples for (99) are less than requested (100).
[] Ratios for sex: {'female': 0.6060606060606061, 'male': 0.3939393939393939}
[] Ratios for dx: {'bkl':

### 6. Results Aggregation and Summary

- Load the experiment results from saved files.
- Group results by test feature ratios and calculate average performance metrics along with their variability.
- Present the summarized results as a table, one row per test ratio.
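The `test_ratio` column is stored as a string such as `male:0.8/female:0.2`, so sorting or plotting by it is lexicographic. A minimal sketch of turning such a string back into numbers, assuming the `key:value/key:value` format written by `perform_tests`; `parse_ratio` is a hypothetical helper.

```python
from typing import Dict

def parse_ratio(text: str) -> Dict[str, float]:
    """Parse 'male:0.8/female:0.2' into {'male': 0.8, 'female': 0.2}."""
    parts = (item.split(":", 1) for item in text.split("/"))
    return {key: float(value) for key, value in parts}

labels = ["male:1.0/female:0.0", "male:0.2/female:0.8", "male:0.6/female:0.4"]
# Order the test sets by their numeric share of male samples.
ordered = sorted(labels, key=lambda s: parse_ratio(s)["male"])
print(ordered)
# ['male:0.2/female:0.8', 'male:0.6/female:0.4', 'male:1.0/female:0.0']
```

A numeric column derived this way could serve as the x-axis when plotting accuracy against the share of male samples in the test set.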


In [8]:
res = pd.read_csv('res/res_mobile_net_2.csv')

gr = res.groupby(['test_ratio']).agg(
    Model=('model_name', 'first'),
    TrainRatio=('train_ratio', 'first'),
    TestRatio=('test_ratio', 'first'),
    Accuracy= ('accuracy', 'mean'),
    AccuracySTD= ('accuracy', 'std'),
    F1=('f1_score', 'mean'),
    F1STD=('f1_score', 'std')
).reset_index(drop=True)

gr = gr.round(2).sort_values(by='TestRatio', ascending=False)
              
print(tb.tabulate(gr, headers='keys', tablefmt='psql'))

+----+--------------+---------------------+---------------------+------------+---------------+------+---------+
|    | Model        | TrainRatio          | TestRatio           |   Accuracy |   AccuracySTD |   F1 |   F1STD |
|----+--------------+---------------------+---------------------+------------+---------------+------+---------|
|  5 | mobile net 2 | male:0.0/female:1.0 | male:1.0/female:0.0 |       0.62 |          0.03 | 0.61 |    0.03 |
|  4 | mobile net 2 | male:0.0/female:1.0 | male:0.8/female:0.2 |       0.62 |          0.03 | 0.61 |    0.04 |
|  3 | mobile net 2 | male:0.0/female:1.0 | male:0.6/female:0.4 |       0.64 |          0.03 | 0.64 |    0.03 |
|  2 | mobile net 2 | male:0.0/female:1.0 | male:0.4/female:0.6 |       0.66 |          0.03 | 0.65 |    0.03 |
|  1 | mobile net 2 | male:0.0/female:1.0 | male:0.2/female:0.8 |       0.66 |          0.04 | 0.65 |    0.04 |
|  0 | mobile net 2 | male:0.0/female:1.0 | male:0.0/female:1.0 |       0.68 |          0.04 | 0.68 |   