**This Jupyter Notebook is for training a model to classify different skin conditions across diverse skin tones!**

I am using transfer learning with EfficientNetB0 because it comes pretrained on ImageNet, a large image dataset, which should help the model generalize from my much smaller training set.

The AJL Kaggle competition is centered around AI equity, so I am also incorporating Fairlearn to assess fairness across different Fitzpatrick skin tones.

**Learning resources I used during this process included:**

FINAL-BT-TransferLearning.ipynb (for transfer learning concepts)

FINAL-BT-AlgoFairness.ipynb (for fairness assessment)

"How to do Transfer learning with Efficientnet" by [DLology](https://www.dlology.com/blog/transfer-learning-with-efficientnet/)

This draft was last updated on 02/22/25.

In [None]:
!pip install fairlearn

Collecting fairlearn
  Downloading fairlearn-0.12.0-py3-none-any.whl.metadata (7.0 kB)
Downloading fairlearn-0.12.0-py3-none-any.whl (240 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 240.0/240.0 kB 3.4 MB/s eta 0:00:00
Installing collected packages: fairlearn
Successfully installed fairlearn-0.12.0


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

# Path to dataset folder in Google Drive
dataset_path = ""  # Insert your dataset path here. Mine has been removed as I am posting!

# List all files
print(os.listdir(dataset_path))


['test.csv', 'sample_submission.csv', 'train.csv', 'test', 'train']


In [None]:
import pandas as pd  # Needed here because this cell runs before the main import cell

# Insert your file paths here. Mine have been removed as I am posting!
train_df = pd.read_csv('')
test_df = pd.read_csv('')

data_dir = ''
test_dir = ''

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.preprocessing import LabelEncoder


In [None]:
print(train_df.head())

# Checking if image directories contain files
print("Training images:", len(os.listdir(data_dir)))
print("Test images:", len(os.listdir(test_dir)))


                            md5hash  fitzpatrick_scale  fitzpatrick_centaur  \
0  fd06d13de341cc75ad679916c5d7e6a6                  4                    4   
1  a4bb4e5206c4e89a303f470576fc5253                  1                    1   
2  c94ce27e389f96bda998e7c3fa5c4a2e                  5                    5   
3  ebcf2b50dd943c700d4e2b586fcd4425                  3                    3   
4  c77d6c895f05fea73a8f3704307036c0                  1                    1   

                              label nine_partition_label  \
0                 prurigo-nodularis     benign-epidermal   
1  basal-cell-carcinoma-morpheiform  malignant-epidermal   
2                            keloid         inflammatory   
3              basal-cell-carcinoma  malignant-epidermal   
4                 prurigo-nodularis     benign-epidermal   

  three_partition_label            qc  ddi_scale  
0                benign           NaN         34  
1             malignant           NaN         12  
2        no

In [None]:
# I need to append '.jpg' to my file names so that they correctly reference the image files
# Otherwise, my generator won't find the right images

# def add_file_extension(df):
#     df['md5hash'] = df['md5hash'].astype(str) + '.jpg'
#     return df

# The file path layout changed, so the code block above had to be updated
# def add_file_extension(df):
#     df['file_path'] = df['label'] + '/' + df['md5hash'].astype(str) + '.jpg'
#     return df

# New implementation below; I forgot that test_df does not have a label column
def add_file_extension(df, is_test=False):
    if is_test:
        # Test images are stored directly in the test directory, without label subfolders
        df['file_path'] = df['md5hash'].astype(str) + '.jpg'
    else:
        # Train images are stored inside subfolders named after their labels
        df['file_path'] = df['label'] + '/' + df['md5hash'].astype(str) + '.jpg'
    return df

# Apply the function to both datasets
train_df = add_file_extension(train_df, is_test=False)  # Train images in subfolders
test_df = add_file_extension(test_df, is_test=True)  # Test images in root directory

# Strip whitespace from filenames
test_df['file_path'] = test_df['file_path'].apply(lambda x: x.strip())
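
Before handing a dataframe to a generator, it can help to confirm that every constructed path points at a real file. The sketch below is a minimal, self-contained illustration of that check. The helper name `find_missing_files`, the temporary folder, and the two made-up rows are assumptions for the example, not part of the competition data.

```python
import os
import tempfile

import pandas as pd


def find_missing_files(df, root_dir, path_col="file_path"):
    """Return the rows of df whose relative path does not exist under root_dir."""
    exists = df[path_col].apply(lambda p: os.path.isfile(os.path.join(root_dir, p)))
    return df[~exists]


# Build a tiny fake dataset: one image file that exists and one that does not
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "keloid"))
    open(os.path.join(root, "keloid", "abc.jpg"), "w").close()

    demo_df = pd.DataFrame({"label": ["keloid", "keloid"], "md5hash": ["abc", "missing"]})
    # Same path pattern as the training branch of add_file_extension
    demo_df["file_path"] = demo_df["label"] + "/" + demo_df["md5hash"] + ".jpg"

    missing = find_missing_files(demo_df, root)

print(missing["file_path"].tolist())
```

Running the same helper on `train_df` with `data_dir` (or `test_df` with `test_dir`) should return an empty dataframe if all paths are correct.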



In [None]:
# I need to encode my labels so that my model can work with them
# The labels are currently strings (like 'keloid', 'eczema'), so I convert them to numerical values
train_df['label_encoded'] = train_df['label'].astype('category').cat.codes
label_mapping = dict(enumerate(train_df['label'].astype('category').cat.categories))

# I need to split my data so that I have a training set and a validation set
# This gives me the ability to check how well my model generalizes
train_data, val_data = train_test_split(train_df, test_size=0.2, stratify=train_df['label_encoded'], random_state=42)
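
To make the encoding step concrete, here is a small sketch of how pandas category codes and the reverse mapping fit together. The four labels are made up for illustration; only the `astype('category')` pattern matches the cell above.

```python
import pandas as pd

# Made-up labels, only for illustration
labels = pd.Series(["keloid", "eczema", "keloid", "acne"])

categories = labels.astype("category")
codes = categories.cat.codes                          # integer code per row
mapping = dict(enumerate(categories.cat.categories))  # code -> label name

# Decoding the codes with the mapping recovers the original labels
decoded = codes.map(mapping)

print(codes.tolist())
print(mapping)
print(decoded.tolist())
```

Categories are ordered alphabetically, so the codes depend on which labels are present, not on the order in which they appear in the dataframe.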


In [None]:
# Image Augmentation - I am using this to expand my dataset and prevent overfitting
# Referencing: FINAL-BT-DataAugmentation.ipynb
# Note: no rescale=1./255 here. The Keras EfficientNet models include their own
# rescaling layer and expect raw pixel values in the 0-255 range.
train_datagen = ImageDataGenerator(rotation_range=30, horizontal_flip=True, zoom_range=0.2)
val_datagen = ImageDataGenerator()

In [None]:
# Define Paths

# Converting the label column to str, which flow_from_dataframe requires for class_mode='sparse'
train_data['label_encoded'] = train_data['label_encoded'].astype(str)
val_data['label_encoded'] = val_data['label_encoded'].astype(str)

def create_generator(df, data_gen, batch_size=32, target_size=(128, 128), shuffle=True):
    return data_gen.flow_from_dataframe(
        df, directory=data_dir,  # Switched from Kaggle to Google Drive path because of upload issues
        x_col='file_path', y_col='label_encoded',
        target_size=target_size, batch_size=batch_size,
        class_mode='sparse', shuffle=shuffle)

train_generator = create_generator(train_data, train_datagen)
# No shuffling for validation, so predictions stay in the same row order as val_data
val_generator = create_generator(val_data, val_datagen, shuffle=False)

Found 2288 validated image filenames belonging to 21 classes.
Found 572 validated image filenames belonging to 21 classes.
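
One thing worth checking at this point is how the generator numbers the classes. As far as I understand `flow_from_dataframe`, when no `classes` argument is given it sorts the class names as strings, so string codes like `'10'` land before `'2'`. The sketch below only mimics that ordering with plain Python (it does not call Keras), and the variable names are my own. Printing `train_generator.class_indices` in the notebook is the way to confirm the real mapping.

```python
# Integer category codes 0..20, converted to strings as in the cell above
codes_as_str = [str(code) for code in range(21)]

# Mimic a generator that orders class names by sorting them as strings
generator_order = sorted(codes_as_str)
class_indices = {name: idx for idx, name in enumerate(generator_order)}

# Reverse lookup: model output index -> original integer category code
index_to_code = {idx: int(name) for name, idx in class_indices.items()}

print(generator_order[:5])
print(class_indices["2"], class_indices["10"])
print(index_to_code[2])
```

If the real mapping looks like this, a predicted index has to be translated through a reverse lookup such as `index_to_code` before it is used with `label_mapping`.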


In [None]:
# Now I am defining my model
# I am using EfficientNetB0 as my base model because it is optimized for performance with fewer parameters
# Referencing: FINAL-BT-TransferLearning.ipynb
base_model = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(128, 128, 3))

# Adding layers to fine-tune it for my dataset
x = GlobalAveragePooling2D()(base_model.output)
x = Dropout(0.3)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.2)(x)
output = Dense(len(label_mapping), activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=output)

# Freezing the base model layers so that only the new layers are trained at first
for layer in base_model.layers:
    layer.trainable = False

Downloading data from https://storage.googleapis.com/keras-applications/efficientnetb0_notop.h5
16705208/16705208 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [None]:
# Compiling model with Adam optimizer
# Using sparse categorical crossentropy because my labels are integer-encoded
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the model
# I am using EarlyStopping to prevent overfitting and save time if the model stops improving (efficiency).
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(train_generator, validation_data=val_generator, epochs=10, callbacks=[early_stopping])

# Evaluating my model with F1 score since that's the Kaggle competition metric
# My earlier attempt called np.argmax(val_generator[i][1], axis=1), which raised an AxisError:
# with class_mode='sparse' the labels are already integer class indices, not one-hot vectors

# Collect labels and predictions batch by batch so that they stay aligned
true_labels = []
val_pred_labels = []
for i in range(len(val_generator)):
    batch_images, batch_labels = val_generator[i]
    batch_probs = model.predict_on_batch(batch_images)
    true_labels.extend(batch_labels.astype(int))
    val_pred_labels.extend(np.argmax(batch_probs, axis=1))

true_labels = np.array(true_labels)
val_pred_labels = np.array(val_pred_labels)

f1 = f1_score(true_labels, val_pred_labels, average='weighted')
print("Weighted F1 Score:", f1)





Epoch 1/10
72/72 ━━━━━━━━━━━━━━━━━━━━ 1100s 15s/step - accuracy: 0.1009 - loss: 2.9829 - val_accuracy: 0.1416 - val_loss: 2.8691
Epoch 2/10
72/72 ━━━━━━━━━━━━━━━━━━━━ 94s 1s/step - accuracy: 0.1340 - loss: 2.8775 - val_accuracy: 0.1416 - val_loss: 2.8649
Epoch 3/10
72/72 ━━━━━━━━━━━━━━━━━━━━ 89s 1s/step - accuracy: 0.1348 - loss: 2.9037 - val_accuracy: 0.1416 - val_loss: 2.8611
Epoch 4/10
72/72 ━━━━━━━━━━━━━━━━━━━━ 90s 1s/step - accuracy: 0.1145 - loss: 2.9129 - val_accuracy: 0.1416 - val_loss: 2.8676
Epoch 5/10
72/72 ━━━━━━━━━━━━━━━━━━━━ 141s 1s/step - accuracy: 0.1341 - loss: 2.8740 - val_accuracy: 0.1416 - val_loss: 2.8623
Epoch 6/10
72/72 ━━━━━━━━━━━━━━━━━━━━ 94s 1s/step - accuracy: 0.1346 - loss: 2.8436 - val_accuracy: 0.1416 - val_loss: 2.8717


AxisError: axis 1 is out of bounds for array of dimension 1
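
The error above comes from treating integer labels as if they were one-hot rows. The short sketch below reproduces it with NumPy alone; the three labels are made up, and the arrays only imitate what a generator would return for `class_mode='sparse'` versus `class_mode='categorical'`.

```python
import numpy as np

# Made-up labels for three images
sparse_labels = np.array([2.0, 0.0, 1.0])   # 1-D: one class index per image
one_hot_labels = np.eye(3)[[2, 0, 1]]        # 2-D: one row of 0s and 1s per image

# One-hot rows need argmax along axis 1 to recover the class index
from_one_hot = np.argmax(one_hot_labels, axis=1)

# Sparse labels are already class indices; a cast to int is enough
from_sparse = sparse_labels.astype(int)

# Asking for axis 1 on a 1-D array fails, which is the error in the output above
try:
    np.argmax(sparse_labels, axis=1)
    raised = False
except ValueError as err:  # numpy's AxisError is a subclass of ValueError
    raised = True
    print(type(err).__name__, "-", err)

print(from_one_hot.tolist(), from_sparse.tolist())
```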

In [None]:
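# Fairness Evaluation
# I am using Fairlearn to check for disparities across Fitzpatrick skin tones
# Referencing: FINAL-BT-AlgoFairness.ipynb

# Assign the result back instead of using inplace=True on a column (see the pandas warning below)
val_data['fitzpatrick_scale'] = val_data['fitzpatrick_scale'].fillna("Unknown").astype(str)

# val_pred_labels holds generator class indices, so convert the true labels the same way.
# This assumes val_pred_labels is in the same row order as val_data (validation generator not shuffled).
y_true_idx = val_data['label_encoded'].astype(str).map(train_generator.class_indices)

def weighted_f1(y_true, y_pred):
    # f1_score needs an explicit average for multiclass targets
    return f1_score(y_true, y_pred, average='weighted')

fairness_metric = MetricFrame(
    metrics=weighted_f1,
    y_true=y_true_idx,
    y_pred=val_pred_labels,
    sensitive_features=val_data['fitzpatrick_scale']
)
print("Fairness Metrics:", fairness_metric.by_group)
# demographic_parity_difference is defined in terms of selection rates for binary predictions,
# so for this 21-class problem I report the spread of per-group F1 instead
print("F1 difference between groups:", fairness_metric.difference())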

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  val_data['fitzpatrick_scale'].fillna("Unknown", inplace=True)  # Fill NaN values


NameError: name 'val_pred_labels' is not defined
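
The idea behind a per-group fairness report is simply to compute the same metric separately for each skin tone group and then compare. Here is a minimal sketch of that idea with plain pandas. It uses accuracy instead of F1 to stay dependency-free, the six rows are made up, and it is not a substitute for Fairlearn's `MetricFrame`.

```python
import pandas as pd

# Made-up predictions for six images across three Fitzpatrick groups
demo = pd.DataFrame({
    "fitzpatrick_scale": [1, 1, 2, 2, 2, 5],
    "y_true": [0, 1, 0, 1, 1, 2],
    "y_pred": [0, 1, 0, 0, 1, 0],
})

# Mark each prediction as right or wrong, then average within each group
demo["correct"] = demo["y_true"] == demo["y_pred"]
accuracy_by_group = demo.groupby("fitzpatrick_scale")["correct"].mean()

# The gap between the best and worst group is one simple disparity measure
gap = accuracy_by_group.max() - accuracy_by_group.min()

print(accuracy_by_group)
print("Gap between groups:", gap)
```

A large gap suggests the model serves some groups worse than others, although groups with very few images (like group 5 here) give noisy estimates.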

In [None]:
# Now I generate predictions for the Kaggle submission
# add_file_extension already built test_df['file_path']; rebuilding it here
# keeps this cell runnable on its own and strips any stray whitespace
test_df['file_path'] = test_df['md5hash'].astype(str).str.strip() + '.jpg'

test_generator = val_datagen.flow_from_dataframe(test_df, directory=test_dir,
    x_col='file_path', target_size=(128, 128), batch_size=32,
    class_mode=None, shuffle=False)

test_predictions = model.predict(test_generator)

# The training generator numbers classes by sorting the string codes ('0', '1', '10', ...),
# so translate each predicted index back to the original category code before looking up the name
index_to_code = {idx: int(code) for code, idx in train_generator.class_indices.items()}
test_df['label'] = [label_mapping[index_to_code[i]] for i in np.argmax(test_predictions, axis=1)]


In [None]:
# Saving my predictions in the format Kaggle expects
test_df[['md5hash', 'label']].to_csv('/content/drive/MyDrive/bttai_kaggle_training_tests/submission.csv', index=False)

# Now I can upload 'submission.csv' to Kaggle and check my final weighted F1 score
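
For reference, this is roughly what a weighted F1 score measures: the F1 of each class, averaged with weights proportional to how many true examples each class has. The sketch below computes it by hand with NumPy on made-up labels. The function name `weighted_f1_by_hand` is my own, and in the notebook itself `sklearn.metrics.f1_score(..., average='weighted')` does the real work.

```python
import numpy as np


def weighted_f1_by_hand(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for cls in np.unique(y_true):
        tp = np.sum((y_true == cls) & (y_pred == cls))   # correctly predicted as cls
        fp = np.sum((y_true != cls) & (y_pred == cls))   # wrongly predicted as cls
        fn = np.sum((y_true == cls) & (y_pred != cls))   # missed examples of cls
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = np.sum(y_true == cls)
        total += f1 * support
    return total / len(y_true)


# Made-up labels: class 0 F1 = 0.5, class 1 F1 = 0.8, class 2 F1 = 0.0
score = weighted_f1_by_hand([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
print(round(score, 2))  # (0.5*2 + 0.8*2 + 0.0*1) / 5 = 0.52
```

Because the weights follow class frequency, rare conditions contribute little to this score, which is one reason to look at per-group and per-class results as well.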

In [None]:
model.save('/content/drive/MyDrive/skin_condition_model.h5')

