# ISIC 2024 - Skin Cancer Detection with 3D-TBP

## Identify cancers among skin lesions cropped from 3D total body photographs

### Overview

In this notebook we develop image-based algorithms to identify histologically confirmed cancers among skin lesions cropped from 3D total body photographs (3D-TBP). The image quality closely resembles that of smartphone photos, which are regularly submitted for telehealth purposes, and the dataset is collected from a diverse population of patients. This notebook contains a binary classification algorithm that could be used in settings without access to specialized care, improving triage and the ability of dermatology algorithms to identify skin cancers.

### Description

Skin cancer can be deadly if not caught early, but many populations lack specialized dermatologic care. Over the past several years, dermoscopy-based AI algorithms have been shown to benefit clinicians in diagnosing melanoma, basal cell, and squamous cell carcinoma. However, determining which individuals should see a clinician in the first place has great potential impact. Triaging applications have a significant potential to benefit underserved populations and improve early skin cancer detection, the key factor in long-term patient outcomes.

Dermatoscope images reveal morphologic features not visible to the naked eye, but these images are typically only captured in dermatology clinics. Algorithms that benefit people in primary care or non-clinical settings must be adept at evaluating lower-quality images. This competition leverages 3D TBP to present a novel dataset of every single lesion from thousands of patients across three continents with images resembling cell phone photos.

This competition challenges you to develop AI algorithms that differentiate histologically-confirmed malignant skin lesions from benign lesions on a patient. Your work will help to improve early diagnosis and disease prognosis by extending the benefits of automated skin cancer detection to a broader population and settings.

### Evaluation

Submissions are evaluated on partial area under the ROC curve (pAUC) above 80% true positive rate (TPR) for binary classification of malignant examples. (See the implementation in the notebook ISIC pAUC-aboveTPR.)

The receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a given binary classifier system as its discrimination threshold is varied. However, there are regions in the ROC space where the values of TPR are unacceptable in clinical practice. Systems that aid in diagnosing cancers are required to be highly sensitive, so this metric focuses on the area under the ROC curve and above 80% TPR. Hence, scores lie in the range [0.0, 0.2].
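
To make the metric concrete, here is a minimal pure-Python sketch of the area between the ROC curve and the line TPR = 0.8, computed with the trapezoidal rule. It is an illustration, not the competition's official scorer: the helper names `roc_points` and `pauc_above_tpr` are hypothetical, and the sketch assumes that both classes are present in `y_true`.

```python
def roc_points(y_true, y_score):
    """Return ROC points (fpr, tpr), one per distinct score, starting at (0, 0)."""
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos  # assumed > 0, as is n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def pauc_above_tpr(y_true, y_score, min_tpr=0.8):
    """Area between the ROC curve and the line TPR = min_tpr (trapezoidal rule)."""
    points = roc_points(y_true, y_score)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue  # segment lies entirely at or below the threshold
        if y0 < min_tpr:
            # Clip the segment at the point where it crosses TPR = min_tpr
            x0 = x0 + (min_tpr - y0) * (x1 - x0) / (y1 - y0)
            y0 = min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


# Toy labels and scores (not competition data)
perfect = pauc_above_tpr([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
inverted = pauc_above_tpr([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
```

On these toy inputs a perfect ranking reaches the maximum of about 0.2 (up to floating-point rounding) and a fully inverted ranking scores 0.0, which matches the range stated above.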

The shaded regions in the following example represent the pAUC of two arbitrary algorithms (Ca and Cb) at an arbitrary minimum TPR:


In [101]:
!pip install catboost



In [102]:
import warnings
import os
from pathlib import Path

import numpy as np
import pandas as pd
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # Import for creating subplots
import plotly.figure_factory as ff  # Import for creating the KDE plot (create_distplot)

from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.model_selection import GroupKFold
import lightgbm as lgb
from catboost import CatBoostClassifier, Pool

warnings.filterwarnings('ignore')


In [103]:
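# Store the Kaggle API token where the Kaggle CLI expects it (os is imported above)
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
# Restrict the token's permissions to the current user
!chmod 600 ~/.kaggle/kaggle.json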


In [104]:
!kaggle competitions download -c isic-2024-challenge
# Unzip the archive, overwriting existing files without prompting
!unzip -o -q isic-2024-challenge.zip

isic-2024-challenge.zip: Skipping, found more recently modified local copy (use --force to force download)

In [105]:
# display  training data
train_data = pd.read_csv('/content/train-metadata.csv')
train_data.head()


| | isic_id | target | patient_id | age_approx | sex | anatom_site_general | clin_size_long_diam_mm | image_type | tbp_tile_type | tbp_lv_A | ... | lesion_id | iddx_full | iddx_1 | iddx_2 | iddx_3 | iddx_4 | iddx_5 | mel_mitotic_index | mel_thick_mm | tbp_lv_dnn_lesion_confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ISIC_0015670 | 0 | IP_1235828 | 60.0 | male | lower extremity | 3.04 | TBP tile: close-up | 3D: white | 20.244422 | ... | | Benign | Benign | | | | | | | 97.517282 |
| 1 | ISIC_0015845 | 0 | IP_8170065 | 60.0 | male | head/neck | 1.1 | TBP tile: close-up | 3D: white | 31.71257 | ... | IL_6727506 | Benign | Benign | | | | | | | 3.141455 |
| 2 | ISIC_0015864 | 0 | IP_6724798 | 60.0 | male | posterior torso | 3.4 | TBP tile: close-up | 3D: XP | 22.57583 | ... | | Benign | Benign | | | | | | | 99.80404 |
| 3 | ISIC_0015902 | 0 | IP_4111386 | 65.0 | male | anterior torso | 3.22 | TBP tile: close-up | 3D: XP | 14.242329 | ... | | Benign | Benign | | | | | | | 99.989998 |
| 4 | ISIC_0024200 | 0 | IP_8313778 | 55.0 | male | anterior torso | 2.73 | TBP tile: close-up | 3D: white | 24.72552 | ... | | Benign | Benign | | | | | | | 70.44251 |


In [106]:
# display column names
print(train_data.columns)

Index(['isic_id', 'target', 'patient_id', 'age_approx', 'sex',
       'anatom_site_general', 'clin_size_long_diam_mm', 'image_type',
       'tbp_tile_type', 'tbp_lv_A', 'tbp_lv_Aext', 'tbp_lv_B', 'tbp_lv_Bext',
       'tbp_lv_C', 'tbp_lv_Cext', 'tbp_lv_H', 'tbp_lv_Hext', 'tbp_lv_L',
       'tbp_lv_Lext', 'tbp_lv_areaMM2', 'tbp_lv_area_perim_ratio',
       'tbp_lv_color_std_mean', 'tbp_lv_deltaA', 'tbp_lv_deltaB',
       'tbp_lv_deltaL', 'tbp_lv_deltaLB', 'tbp_lv_deltaLBnorm',
       'tbp_lv_eccentricity', 'tbp_lv_location', 'tbp_lv_location_simple',
       'tbp_lv_minorAxisMM', 'tbp_lv_nevi_confidence', 'tbp_lv_norm_border',
       'tbp_lv_norm_color', 'tbp_lv_perimeterMM',
       'tbp_lv_radial_color_std_max', 'tbp_lv_stdL', 'tbp_lv_stdLExt',
       'tbp_lv_symm_2axis', 'tbp_lv_symm_2axis_angle', 'tbp_lv_x', 'tbp_lv_y',
       'tbp_lv_z', 'attribution', 'copyright_license', 'lesion_id',
       'iddx_full', 'iddx_1', 'iddx_2', 'iddx_3', 'iddx_4', 'iddx_5',
       'mel_mitotic_index', '

In [107]:
#list data types
print(train_data.dtypes)

isic_id                          object
target                            int64
patient_id                       object
age_approx                      float64
sex                              object
anatom_site_general              object
clin_size_long_diam_mm          float64
image_type                       object
tbp_tile_type                    object
tbp_lv_A                        float64
tbp_lv_Aext                     float64
tbp_lv_B                        float64
tbp_lv_Bext                     float64
tbp_lv_C                        float64
tbp_lv_Cext                     float64
tbp_lv_H                        float64
tbp_lv_Hext                     float64
tbp_lv_L                        float64
tbp_lv_Lext                     float64
tbp_lv_areaMM2                  float64
tbp_lv_area_perim_ratio         float64
tbp_lv_color_std_mean           float64
tbp_lv_deltaA                   float64
tbp_lv_deltaB                   float64
tbp_lv_deltaL                   float64


In [108]:
# examine null values
print(train_data.isnull().sum())

isic_id                              0
target                               0
patient_id                           0
age_approx                        2798
sex                              11517
anatom_site_general               5756
clin_size_long_diam_mm               0
image_type                           0
tbp_tile_type                        0
tbp_lv_A                             0
tbp_lv_Aext                          0
tbp_lv_B                             0
tbp_lv_Bext                          0
tbp_lv_C                             0
tbp_lv_Cext                          0
tbp_lv_H                             0
tbp_lv_Hext                          0
tbp_lv_L                             0
tbp_lv_Lext                          0
tbp_lv_areaMM2                       0
tbp_lv_area_perim_ratio              0
tbp_lv_color_std_mean                0
tbp_lv_deltaA                        0
tbp_lv_deltaB                        0
tbp_lv_deltaL                        0
tbp_lv_deltaLB           

In [109]:
class CFG:
    train_data = Path('/content/train-metadata.csv')
    test_data = Path('/content/test-metadata.csv')
    subm_data = Path('/content/sample_submission.csv')
    early_stop = 30
    N = 91700
    lgb_weight = 0.4
    ctb_weight = 0.6

    lgb_params = {
        'min_child_samples': 48,
        'num_iterations': 3000,
        'learning_rate': 0.03,
        'objective': 'binary',
        'extra_trees': True,
        'metric': 'binary',
        'reg_lambda': 0.8,
        'reg_alpha': 0.1,
        'num_leaves': 64,
        'device': 'cpu',
        'max_bin': 128,
        'max_depth': 4,
        'verbose': -1,
        'seed': 42
    }

    ctb_params = {
        'grow_policy': 'Depthwise',
        'loss_function': 'Logloss',
        'min_child_samples': 48,
        'learning_rate': 0.03,
        'random_state': 42,
        'task_type': 'CPU',
        'reg_lambda': 0.8,
        'num_trees': 3000,
        'depth': 4
    }
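
`CFG` defines `lgb_weight` and `ctb_weight`, but this part of the notebook does not show how they are applied. One plausible use, assumed here rather than taken from the notebook, is a weighted average of the two models' predicted probabilities. The function name `blend_predictions` is hypothetical and the probabilities are toy values.

```python
def blend_predictions(lgb_preds, ctb_preds, lgb_weight=0.4, ctb_weight=0.6):
    """Weighted average of two equally long lists of predicted probabilities."""
    if len(lgb_preds) != len(ctb_preds):
        raise ValueError('Prediction lists must have the same length')
    # Normalize in case the weights do not sum to 1
    total = lgb_weight + ctb_weight
    return [
        (lgb_weight * p_lgb + ctb_weight * p_ctb) / total
        for p_lgb, p_ctb in zip(lgb_preds, ctb_preds)
    ]


# Toy probabilities, not model output
blended = blend_predictions([0.2, 0.8], [0.4, 0.6])
```

Because pAUC depends only on the ranking of the scores, rescaling both weights by the same factor leaves the metric unchanged; only their ratio matters.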

In [110]:
class FeatureEngineering:
    """
    A class used to perform feature engineering on a dermatology dataset.
    This class includes methods for filtering, downsampling, setting datatypes,
    aggregating data, and extracting categorical columns.

    Methods
    -------
    filter_data(path: str) -> pl.DataFrame:
        Reads a dataset from a CSV file and removes redundant columns.

    downsample_data(df: pl.DataFrame, N: int) -> pl.DataFrame:
        Downsamples the negative class to reduce class imbalance.

    set_datatypes(df: pl.DataFrame) -> pl.DataFrame:
        Sets appropriate data types for each column, handling NA values where necessary.

    aggregate_data(df: pl.DataFrame) -> pl.DataFrame:
        Aggregates data by creating new features based on existing columns.

    extract_cat_cols(df: pl.DataFrame) -> list:
        Extracts and returns the names of categorical columns in the dataset.

    display_info(df: pd.DataFrame):
        Displays the shape, unique patient count, and memory usage of the dataset.

    process_data(path: str, N: int = None) -> tuple:
        A pipeline method that processes the data from start to finish, including
        filtering, downsampling, setting data types, and aggregating data.
    """

    @staticmethod
    def filter_data(path):
        """
        Reads a dataset from a CSV file and removes redundant or irrelevant columns.

        Parameters
        ----------
        path : str
            The file path to the CSV dataset.

        Returns
        -------
        df : pl.DataFrame
            A cleaned Polars DataFrame with redundant columns removed.
        """

        # Read the dataset as a Polars DataFrame
        df = pl.read_csv(path, low_memory=True)

        # Define the columns to be dropped
        columns_to_drop = [
            'isic_id',  # Redundant for loading train data
            'image_type',  # Only one unique value in train metadata
            'tbp_lv_location_simple',  # Similar information to 'tbp_lv_location'
            'copyright_license',  # Redundant information for lesion classification

            # Included only in train metadata, irrelevant for the broader task
            'lesion_id',
            'iddx_full',
            'iddx_1',
            'iddx_2',
            'iddx_3',
            'iddx_4',
            'iddx_5',
            'mel_mitotic_index',
            'mel_thick_mm',
            'tbp_lv_dnn_lesion_confidence'
        ]

        # Drop the defined columns if they exist in the DataFrame
        df = df.drop([col for col in columns_to_drop if col in df.columns])

        return df

    @staticmethod
    def downsample_data(df, N):
        """
        Downsamples the negative class to reduce the imbalance between positive and negative cases.

        Parameters
        ----------
        df : pl.DataFrame
            The Polars DataFrame containing the data to be downsampled.
        N : int
            The number of negative cases to sample.

        Returns
        -------
        df : pl.DataFrame
            A Polars DataFrame with N sampled negative cases and all positive cases.
        """

        # Separate positive and negative cases
        p_cases = df.filter(pl.col('target') == 1)
        n_cases = df.filter(pl.col('target') == 0)

        # Randomly sample N negative cases with shuffling
        n_cases = n_cases.sample(n=N, shuffle=True, seed=42)

        # Concatenate the positive cases with the downsampled negative cases
        df = pl.concat([n_cases, p_cases])

        return df

    @staticmethod
    def set_datatypes(df):
        """
        Sets the appropriate data types for each column, including handling missing values.

        Parameters
        ----------
        df : pl.DataFrame
            The Polars DataFrame with raw data.

        Returns
        -------
        df : pl.DataFrame
            The Polars DataFrame with columns cast to appropriate data types.
        """

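        # Handle missing values in the 'age_approx' column by replacing 'NA' with -1.
        # The string check only applies when the column was read as text; a numeric
        # column (e.g. when no 'NA' is present) is left untouched.
        if (('age_approx' in df.columns) and (df.schema['age_approx'] == pl.Utf8)
                and df.select(pl.col('age_approx').str.contains('NA').any()).item()):
            df = df.with_columns(
                pl.when(pl.col('age_approx') == 'NA').then(-1).otherwise(pl.col('age_approx'))
                .alias('age_approx')
            )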

        # Cast numeric columns to integer type
        int_cols = [
            'target',
            'age_approx',
            'tbp_lv_symm_2axis_angle'
        ]
        for col in int_cols:
            if col in df.columns:
                df = df.with_columns(pl.col(col).cast(pl.Int16))

        # Cast numeric columns to float type
        float_cols = [
            'clin_size_long_diam_mm',
            'tbp_lv_A',
            'tbp_lv_Aext',
            'tbp_lv_B',
            'tbp_lv_Bext',
            'tbp_lv_C',
            'tbp_lv_Cext',
            'tbp_lv_H',
            'tbp_lv_Hext',
            'tbp_lv_L',
            'tbp_lv_Lext',
            'tbp_lv_areaMM2',
            'tbp_lv_area_perim_ratio',
            'tbp_lv_color_std_mean',
            'tbp_lv_deltaA',
            'tbp_lv_deltaB',
            'tbp_lv_deltaL',
            'tbp_lv_deltaLB',
            'tbp_lv_deltaLBnorm',
            'tbp_lv_eccentricity',
            'tbp_lv_minorAxisMM',
            'tbp_lv_nevi_confidence',
            'tbp_lv_norm_border',
            'tbp_lv_norm_color',
            'tbp_lv_perimeterMM',
            'tbp_lv_radial_color_std_max',
            'tbp_lv_stdL',
            'tbp_lv_stdLExt',
            'tbp_lv_symm_2axis',
            'tbp_lv_x',
            'tbp_lv_y',
            'tbp_lv_z'
        ]
        for col in float_cols:
            if col in df.columns:
                df = df.with_columns(pl.col(col).cast(pl.Float32))

        # Cast categorical columns to categorical type
        cat_cols = [
            'sex',
            'anatom_site_general',
            'tbp_tile_type',
            'tbp_lv_location',
            'attribution'
        ]
        for col in cat_cols:
            if col in df.columns:
                df = df.with_columns(pl.col(col).cast(pl.Categorical))

        return df

    @staticmethod
    def aggregate_data(df):
        """
        Aggregates the dataset by creating new features, such as ratios and contrasts,
        based on existing columns.

        Parameters
        ----------
        df : pl.DataFrame
            The Polars DataFrame with data to be aggregated.

        Returns
        -------
        df : pl.DataFrame
            The Polars DataFrame with additional aggregated features.
        """

        # Create ratios of each color channel (A*, B*, C*, H*, L*) to its extended counterpart,
        # scaled by the channel's minimum value within each age group
        df = df.with_columns([
            pl.col('tbp_lv_A').truediv(pl.col('tbp_lv_Aext').mul(pl.col('tbp_lv_A').min()))
            .over('age_approx').cast(pl.Float32).alias('tbp_lv_ratio_A'),
            pl.col('tbp_lv_B').truediv(pl.col('tbp_lv_Bext').mul(pl.col('tbp_lv_B').min()))
            .over('age_approx').cast(pl.Float32).alias('tbp_lv_ratio_B'),
            pl.col('tbp_lv_C').truediv(pl.col('tbp_lv_Cext').mul(pl.col('tbp_lv_C').min()))
            .over('age_approx').cast(pl.Float32).alias('tbp_lv_ratio_C'),
            pl.col('tbp_lv_H').truediv(pl.col('tbp_lv_Hext').mul(pl.col('tbp_lv_H').min()))
            .over('age_approx').cast(pl.Float32).alias('tbp_lv_ratio_H'),
            pl.col('tbp_lv_L').truediv(pl.col('tbp_lv_Lext').mul(pl.col('tbp_lv_L').min()))
            .over('age_approx').cast(pl.Float32).alias('tbp_lv_ratio_L'),
        ])

        # Create contrast features by subtracting the extended values from the original values
        df = df.with_columns([
            pl.col('tbp_lv_A').sub(pl.col('tbp_lv_Aext')).cast(pl.Float32).alias('tbp_lv_contrast_A'),
            pl.col('tbp_lv_B').sub(pl.col('tbp_lv_Bext')).cast(pl.Float32).alias('tbp_lv_contrast_B'),
            pl.col('tbp_lv_C').sub(pl.col('tbp_lv_Cext')).cast(pl.Float32).alias('tbp_lv_contrast_C'),
            pl.col('tbp_lv_H').sub(pl.col('tbp_lv_Hext')).cast(pl.Float32).alias('tbp_lv_contrast_H'),
            pl.col('tbp_lv_L').sub(pl.col('tbp_lv_Lext')).cast(pl.Float32).alias('tbp_lv_contrast_L'),
        ])

        # Calculate patient-specific ratios and contrasts
        df = df.with_columns([
            pl.col('tbp_lv_ratio_A').truediv(pl.col('tbp_lv_ratio_A').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_ratio_A'),
            pl.col('tbp_lv_ratio_B').truediv(pl.col('tbp_lv_ratio_B').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_ratio_B'),
            pl.col('tbp_lv_ratio_C').truediv(pl.col('tbp_lv_ratio_C').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_ratio_C'),
            pl.col('tbp_lv_ratio_H').truediv(pl.col('tbp_lv_ratio_H').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_ratio_H'),
            pl.col('tbp_lv_ratio_L').truediv(pl.col('tbp_lv_ratio_L').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_ratio_L'),

            pl.col('tbp_lv_contrast_A').truediv(pl.col('tbp_lv_contrast_A').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_contrast_A'),
            pl.col('tbp_lv_contrast_B').truediv(pl.col('tbp_lv_contrast_B').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_contrast_B'),
            pl.col('tbp_lv_contrast_C').truediv(pl.col('tbp_lv_contrast_C').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_contrast_C'),
            pl.col('tbp_lv_contrast_H').truediv(pl.col('tbp_lv_contrast_H').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_contrast_H'),
            pl.col('tbp_lv_contrast_L').truediv(pl.col('tbp_lv_contrast_L').mean()).over('patient_id')
            .cast(pl.Float32).alias('tbp_lv_patient_contrast_L'),
        ])

        # Calculate age-specific ratios and contrasts
        df = df.with_columns([
            pl.col('tbp_lv_ratio_A').truediv(pl.col('tbp_lv_ratio_A').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_ratio_A'),
            pl.col('tbp_lv_ratio_B').truediv(pl.col('tbp_lv_ratio_B').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_ratio_B'),
            pl.col('tbp_lv_ratio_C').truediv(pl.col('tbp_lv_ratio_C').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_ratio_C'),
            pl.col('tbp_lv_ratio_H').truediv(pl.col('tbp_lv_ratio_H').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_ratio_H'),
            pl.col('tbp_lv_ratio_L').truediv(pl.col('tbp_lv_ratio_L').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_ratio_L'),

            pl.col('tbp_lv_contrast_A').truediv(pl.col('tbp_lv_contrast_A').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_contrast_A'),
            pl.col('tbp_lv_contrast_B').truediv(pl.col('tbp_lv_contrast_B').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_contrast_B'),
            pl.col('tbp_lv_contrast_C').truediv(pl.col('tbp_lv_contrast_C').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_contrast_C'),
            pl.col('tbp_lv_contrast_H').truediv(pl.col('tbp_lv_contrast_H').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_contrast_H'),
            pl.col('tbp_lv_contrast_L').truediv(pl.col('tbp_lv_contrast_L').mean()).over('age_approx')
            .cast(pl.Float32).alias('tbp_lv_age_contrast_L'),
        ])

        return df

    @staticmethod
    def extract_cat_cols(df):
        """
        Extracts and returns a list of names of categorical columns in the dataset.

        Parameters
        ----------
        df : pl.DataFrame
            The Polars DataFrame from which to extract categorical columns.

        Returns
        -------
        cat_cols : list
            A list of column names that are categorical in the DataFrame.
        """

        # Initialize an empty list to store the names of categorical columns
        cat_cols = []

        # Iterate through all columns and check if their dtype is Categorical
        for col in df.columns:
            if df[col].dtype == pl.Categorical:
                cat_cols.append(col)

        return cat_cols

    @staticmethod
    def display_info(df):
        """
        Displays basic information about the DataFrame, including shape,
        unique patient count, and memory usage.

        Parameters
        ----------
        df : pd.DataFrame
            The pandas DataFrame for which to display information.
        """

        # Display the shape of the DataFrame (rows, columns)
        print(f'Shape: {df.shape}')

        # Display the count of unique patients
        count = df['patient_id'].nunique()
        print(f'Unique patients: {count}')

        # Display the memory usage of the DataFrame in MB
        mem = df.memory_usage().sum() / 1024**2
        print('Memory usage: {:.2f} MB\n'.format(mem))

    def process_data(self, path, N=None):
        """
        A pipeline method that processes the data from start to finish, including
        filtering, downsampling, setting data types, and aggregating data.

        Parameters
        ----------
        path : str
            The file path to the CSV dataset.
        N : int, optional
            The number of negative cases to sample for downsampling. If None,
            no downsampling is performed.

        Returns
        -------
        df : pd.DataFrame
            The processed DataFrame after all transformations.
        cat_cols : list
            A list of categorical column names in the DataFrame.
        """

        # Load and clean the dataset
        df = self.filter_data(path)

        # Downsample negative samples if N is provided
        if N is not None:
            df = self.downsample_data(df, N)

        # Set proper datatypes for the DataFrame columns
        df = self.set_datatypes(df)

        # Aggregate the dataset to create new features
        df = self.aggregate_data(df)

        # Extract categorical columns
        cat_cols = self.extract_cat_cols(df)

        # Convert the Polars DataFrame to a Pandas DataFrame for further processing
        df = df.to_pandas()

        # Display information about the DataFrame
        self.display_info(df)

        return df, cat_cols
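
The patient-specific and age-specific features in `aggregate_data` follow one pattern: each value is divided by the mean of its group (`expr.truediv(expr.mean()).over(key)`). The pure-Python sketch below illustrates that pattern on toy data. The function name `ratio_to_group_mean` is hypothetical and the code is not part of the pipeline above.

```python
from collections import defaultdict


def ratio_to_group_mean(values, groups):
    """Divide each value by the mean of its group (e.g. one group per patient)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    # A group whose mean is 0 would raise ZeroDivisionError here
    return [value / (sums[group] / counts[group]) for value, group in zip(values, groups)]


# Toy data: two lesions for patient 'a', one for patient 'b'
ratios = ratio_to_group_mean([1.0, 3.0, 10.0], ['a', 'a', 'b'])
print(ratios)  # [0.5, 1.5, 1.0]
```

A value above 1 marks a lesion that is higher than that patient's average for the feature. A patient with a single lesion always gets exactly 1, so the feature carries no information for that patient.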


In [111]:
# Initialize class for feature engineering
fe = FeatureEngineering()

In [112]:
# Load and process train metadata
train_data, cat_cols = fe.process_data(CFG.train_data, CFG.N)

Shape: (92093, 71)
Unique patients: 1032
Memory usage: 23.45 MB



In [113]:
# Load and process test metadata
test_data, _ = fe.process_data(CFG.test_data)

Shape: (3, 70)
Unique patients: 3
Memory usage: 0.00 MB



In [114]:
class EDA:
    """
    A class used for exploratory data analysis (EDA) on a dermatology dataset.
    This class provides methods to visualize and analyze the distribution of gender,
    lesion locations, and age within the dataset.

    Attributes
    ----------
    sex_mapping : dict
        A mapping of encoded sex values to their respective labels.
    anatom_site_mapping : dict
        A mapping of encoded anatomical site values to their respective labels.

    Methods
    -------
    gender_distribution(data, spectrum, cat_cols, target_val, title):
        Visualizes the gender distribution of a specific target value using a pie chart.

    lesion_location_distribution(data, spectrum, cat_cols):
        Visualizes the distribution of lesion locations across different anatomical sites.

    age_distribution(data, age_col='age_approx'):
        Analyzes and visualizes the age distribution using a histogram, box plot, and KDE plot.
    """

    def __init__(self):
        """
        Initializes the EDA class with predefined mappings for sex and anatomical site categories.
        """
        self.sex_mapping = {0: 'Male', 1: 'Female', 2: 'Unknown'}
        self.anatom_site_mapping = {0: 'Lower Extremity', 1: 'Upper Extremity', 2: 'Head/Neck',
                                    3: 'Anterior Torso', 4: 'Posterior Torso', 5: 'Unknown'}

    def gender_distribution(self, data, spectrum, cat_cols, target_val, title):
        """
        Visualizes the gender distribution for a specific target value using a pie chart.

        Parameters
        ----------
        data : pd.DataFrame
            The dataset containing the gender and target columns.
        spectrum : str
            The color spectrum to be used for the pie chart.
        cat_cols : list
            List of categorical columns to be converted to category type.
        target_val : int
            The target value for which the gender distribution should be visualized.
        title : str
            The title of the pie chart.
        """
        # Copy the data to avoid modifying the original dataframe
        df = data.copy()

        # Convert categorical columns to 'category' dtype
        for col in cat_cols:
            df[col] = df[col].astype('category')

        # Map raw 'sex' values to readable labels by value rather than by category code,
        # since the order of category codes is not guaranteed to match a fixed mapping;
        # 'NA' and missing values are labeled 'Unknown'
        sex_labels = {'male': 'Male', 'female': 'Female'}
        df['sex'] = df['sex'].astype(object).map(sex_labels).fillna('Unknown')

        # Group by 'sex' and 'target', then count occurrences
        counts = df.groupby(['sex', 'target']).size().unstack(fill_value=0)
        cases = counts[target_val]

        # Spread colors evenly over the spectrum (guarding against a single group)
        n = len(cases)
        positions = [i / (n - 1) if n > 1 else 0.0 for i in range(n)]

        # Create a pie chart to visualize gender distribution
        fig = go.Figure(data=[go.Pie(labels=list(cases.index), values=cases.values,
                                     marker=dict(colors=px.colors.sample_colorscale(spectrum, positions)),
                                     hole=0.1, textinfo='none',
                                     hovertemplate='%{label}<br>%{value}<br>%{percent}<extra></extra>',
                                     textposition='inside')])

        # Update layout of the pie chart
        fig.update_layout(title=title, height=600, showlegend=False, plot_bgcolor='rgba(0,0,0,0)',
                          paper_bgcolor='rgba(0,0,0,0)', margin=dict(l=0, r=0, t=60, b=0))
        fig.show()

    def lesion_location_distribution(self, data, spectrum, cat_cols):
        """
        Visualizes the distribution of lesion locations across different anatomical sites using a scatter plot.

        Parameters
        ----------
        data : pd.DataFrame
            The dataset containing lesion location information.
        spectrum : str
            The color spectrum to be used for the scatter plot.
        cat_cols : list
            List of categorical columns to be converted to category type.
        """
        # Copy the data to avoid modifying the original dataframe
        df = data.copy()

        # Convert categorical columns to 'category' dtype
        for col in cat_cols:
            df[col] = df[col].astype('category')

        # Map raw 'anatom_site_general' values to readable labels by value rather than
        # by category code; 'NA' and missing values are labeled 'Unknown'
        site_labels = {'lower extremity': 'Lower Extremity', 'upper extremity': 'Upper Extremity',
                       'head/neck': 'Head/Neck', 'anterior torso': 'Anterior Torso',
                       'posterior torso': 'Posterior Torso'}
        sites = df['anatom_site_general'].astype(object).map(site_labels).fillna('Unknown')

        # Count occurrences for each anatomical site (sites without lesions get a count of 0)
        all_sites = list(site_labels.values()) + ['Unknown']
        counts = sites.value_counts().reindex(all_sites, fill_value=0)
        plot_data = pd.DataFrame({'anatom_site_general': all_sites, 'count': counts.values})

        # Create a scatter plot to visualize lesion location distribution
        fig = px.scatter(plot_data.sort_values('count', ascending=False), x='anatom_site_general', y='count',
                         size='count', color='count', size_max=180, color_continuous_scale=spectrum,
                         labels={'anatom_site_general': 'Anatomical Site', 'count': 'Count'},
                         hover_name='anatom_site_general', hover_data={'count': True})

        # Update layout of the scatter plot
        fig.update_layout(title_text='Lesion Location Distribution', title_x=0.5,
                          plot_bgcolor='rgba(0,0,0,0)', paper_bgcolor='rgba(0,0,0,0)',
                          margin=dict(l=20, r=20, t=50, b=20), xaxis_title='', yaxis_title='Count',
                          xaxis={'categoryorder':'total descending'}, height=720)
        fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')),
                          hovertemplate='<b>%{hovertext}</b><br>%{marker.size}<extra></extra>')
        fig.show()

    def age_distribution(self, data, age_col='age_approx'):
        """
        Analyzes and visualizes the age distribution using a histogram, box plot, and KDE plot.

        Parameters
        ----------
        data : pd.DataFrame
            The dataset containing the age information.
        age_col : str, optional
            The column name for age data (default is 'age_approx').
        """
        # Copy data to avoid modifying the original dataframe
        df = data.copy()

        # Calculate summary statistics for the age column
        summary_stats = df[age_col].describe()

        # Create a subplot figure with 3 plots (histogram, box plot, KDE plot)
        fig = make_subplots(rows=3, cols=1, shared_xaxes=True,
                            subplot_titles=('Age Distribution (Histogram)',
                                            'Age Distribution (Box Plot)',
                                            'Age Distribution (KDE Plot)'))

        # Histogram for age distribution
        hist = go.Histogram(x=df[age_col], nbinsx=20, name='Histogram', marker=dict(color='blue'))
        fig.add_trace(hist, row=1, col=1)

        # Box plot for age distribution
        box = go.Box(x=df[age_col], name='Box Plot', marker=dict(color='green'))
        fig.add_trace(box, row=2, col=1)

        # KDE plot for age distribution
        kde = ff.create_distplot([df[age_col].dropna()], group_labels=['KDE'], show_hist=False, colors=['red'])
        fig.add_trace(kde['data'][0], row=3, col=1)

        # Update layout of the figure
        fig.update_layout(height=900, title_text='Age Distribution Analysis', showlegend=False,
                          plot_bgcolor='rgba(0,0,0,0)', paper_bgcolor='rgba(0,0,0,0)',
                          margin=dict(l=0, r=0, t=50, b=0))
        fig.show()

        # Print summary statistics for the age column
        print("Summary Statistics:")
        print(summary_stats)


In [115]:
# Initialize class for exploratory data analysis
eda = EDA()

In [116]:
eda.gender_distribution(train_data, 'Purp', cat_cols, target_val=1, title='Gender distribution for malignant cases')

In [117]:
eda.gender_distribution(train_data, 'Teal', cat_cols, target_val=0, title='Gender distribution for benign cases')

In [118]:
eda.lesion_location_distribution(train_data, 'Peach', cat_cols)

In [119]:
eda.age_distribution(train_data)

Summary Statistics:
count    92093.000000
mean        57.579393
std         14.386586
min         -1.000000
25%         50.000000
50%         60.000000
75%         70.000000
max         85.000000
Name: age_approx, dtype: float64
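
The summary above reports a minimum age of -1, which is not a valid age and probably marks missing values filled with a sentinel. The following is a minimal sketch, assuming that -1 encodes a missing `age_approx` (an assumption, not something the output confirms), of masking the sentinel before summarizing. The helper name `age_summary` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd


def age_summary(data, age_col='age_approx', sentinel=-1):
    """Describe the age column after treating `sentinel` as missing (assumed encoding)."""
    # Replace the assumed sentinel with NaN so describe() ignores it
    ages = data[age_col].replace(sentinel, np.nan)
    return ages.describe()


# Toy data for illustration only (not taken from the competition dataset)
toy = pd.DataFrame({'age_approx': [45.0, 60.0, -1.0, 70.0, 55.0]})
stats = age_summary(toy)
print(stats)
```

If the sentinel assumption holds, the same masking could be applied before plotting so that the histogram and box plot are not skewed by an artificial bar at -1.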


In [120]:
class Metrics:
    @staticmethod
    def calculate_pauc(y_true, y_scores, tpr_threshold=0.8):
        """
        Calculate a partial-AUC-style score from the part of the ROC curve at or above a true positive rate threshold.

        Parameters:
        y_true (array-like): True binary labels.
        y_scores (array-like): Target scores, either probability estimates of the positive class or confidence values.
        tpr_threshold (float): Threshold for the true positive rate.

        Returns:
        float: Area under the ROC segment where TPR >= tpr_threshold, scaled by (1 - tpr_threshold),
               so the score lies between 0 and (1 - tpr_threshold).
        """
        fpr, tpr, thresholds = roc_curve(y_true, y_scores)
        mask = tpr >= tpr_threshold
        fpr_above_threshold = fpr[mask]
        tpr_above_threshold = tpr[mask]
        partial_auc = auc(fpr_above_threshold, tpr_above_threshold)
        pauc = partial_auc * (1 - tpr_threshold)
        return pauc

    @staticmethod
    def plot_cv(fold_scores, model_name):
        """
        Plot cross-validation scores.

        Parameters:
        fold_scores (list): List of pAUC scores for each fold.
        model_name (str): Name of the model being evaluated.
        """
        fold_scores = [round(score, 4) for score in fold_scores]
        mean_score = round(np.mean(fold_scores), 4)
        std_score = round(np.std(fold_scores), 4)

        fig = go.Figure()

        fig.add_trace(go.Scatter(
            x=list(range(1, len(fold_scores) + 1)),
            y=fold_scores,
            mode='lines+markers',
            name='Fold Scores',
            line=dict(color='#E30B5C', width=2),
            marker=dict(size=12, color='#E30B5C'),
            text=[f'{score:.4f}' for score in fold_scores],
            hovertemplate='Fold %{x}: %{text}<extra></extra>'
        ))

        fig.add_trace(go.Scatter(
            x=[1, len(fold_scores)],
            y=[mean_score, mean_score],
            mode='lines',
            name=f'Mean: {mean_score:.4f}',
            line=dict(dash='dash', color='#FFAC1C'),
            hoverinfo='none'
        ))

        fig.update_layout(
            title=f'{model_name} Cross-Validation pAUC Scores | Mean ± std: {mean_score:.4f} ± {std_score:.4f}',
            xaxis_title='Fold',
            yaxis_title='pAUC Score',
            plot_bgcolor='rgba(0,0,0,0)',
            paper_bgcolor='rgba(0,0,0,0)',
            xaxis=dict(gridcolor='lightgray', tickmode='linear', tick0=1, dtick=1, range=[0.5, len(fold_scores) + 0.5]),
            yaxis=dict(gridcolor='lightgray')
        )

        fig.show()

    @staticmethod
    def plot_cm(y_true, y_pred):
        """
        Plot a confusion matrix.

        Parameters:
        y_true (array-like): True binary labels.
        y_pred (array-like): Predicted scores or binary labels; scores are thresholded at 0.5.
        """
        labels = sorted(np.unique(y_true))
        cm = confusion_matrix(y_true, (np.asarray(y_pred) > 0.5).astype(int), labels=labels)

        fig = go.Figure(data=go.Heatmap(
            z=cm,
            x=labels,
            y=labels,
            colorscale='Redor',
            zmin=0,
            zmax=np.max(cm),
            text=cm,
            texttemplate='%{text:.0f}',
            hovertemplate='True: %{y}<br>Predicted: %{x}<br>Count: %{z:,.0f}<extra></extra>'
        ))

        fig.update_layout(
            plot_bgcolor='rgba(0,0,0,0)',
            paper_bgcolor='rgba(0,0,0,0)',
            xaxis_title='Predicted Labels',
            yaxis_title='True Labels',
            xaxis=dict(constrain='domain'),
            yaxis=dict(constrain='domain', scaleanchor='x'),
            width=800,
            height=800,
            margin=dict(t=80, b=80, l=80, r=80)
        )

        fig.show()
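
One common reading of "partial AUC above a TPR threshold" is the area between the ROC curve and the horizontal line TPR = threshold, which ranges from 0 to (1 - threshold). The following is a minimal sketch of that quantity, assuming a threshold in (0, 1] and linear interpolation where the curve crosses the threshold. The name `pauc_above_tpr` is illustrative and is not part of the `Metrics` class.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def pauc_above_tpr(y_true, y_scores, tpr_threshold=0.8):
    """Area between the ROC curve and the line TPR = tpr_threshold (illustrative helper)."""
    fpr, tpr, _ = roc_curve(y_true, y_scores)

    # Index of the first ROC point at or above the threshold (tpr is non-decreasing)
    idx = int(np.searchsorted(tpr, tpr_threshold, side='left'))

    if idx == 0:
        fpr_cross = fpr[0]
    else:
        # Linearly interpolate the FPR at which the curve crosses the threshold
        slope = (fpr[idx] - fpr[idx - 1]) / (tpr[idx] - tpr[idx - 1])
        fpr_cross = fpr[idx - 1] + (tpr_threshold - tpr[idx - 1]) * slope

    # Keep the crossing point plus every ROC point at or above the threshold
    fpr_seg = np.concatenate([[fpr_cross], fpr[idx:]])
    tpr_seg = np.concatenate([[tpr_threshold], tpr[idx:]])

    # Integrate only the height above the threshold line
    return auc(fpr_seg, tpr_seg - tpr_threshold)


# A perfectly separating score and a perfectly wrong one, as sanity checks
perfect = pauc_above_tpr([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
reversed_ = pauc_above_tpr([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
print(perfect, reversed_)
```

With the default threshold of 0.8, a perfect ranking reaches the upper bound of 0.2 and a fully inverted ranking gives 0, which makes the two extremes a convenient check when comparing metric implementations.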


In [121]:
class ModelDevelopment:
    """
    A class to develop, train, and infer using LightGBM and CatBoost models with cross-validation,
    ensemble learning, and performance evaluation. The methods handle both training and inference phases,
    and combine the results of LightGBM and CatBoost models into a weighted ensemble.

    Methods
    -------
    train_lgb(data, cat_cols, params, early_stop)
        Trains LightGBM models using GroupKFold cross-validation and evaluates them using pAUC scores.

    train_ctb(data, cat_cols, params, early_stop)
        Trains CatBoost models using GroupKFold cross-validation and evaluates them using pAUC scores.

    infer_lgb(data, cat_cols, models)
        Performs inference using trained LightGBM models and averages their predictions.

    infer_ctb(data, cat_cols, models)
        Performs inference using trained CatBoost models and averages their predictions.

    generate_preds(train, test, cat_cols, lgb_params, ctb_params, early_stop, lgb_weight, ctb_weight)
        Trains both LightGBM and CatBoost models, performs inference, and combines the predictions
        into a weighted ensemble. Evaluates the ensemble's performance on training data.
    """

    @staticmethod
    def train_lgb(data, cat_cols, params, early_stop):
        """
        Trains LightGBM models using GroupKFold cross-validation and evaluates them using pAUC scores.

        Parameters
        ----------
        data : pd.DataFrame
            The training dataset containing features and target.
        cat_cols : list of str
            List of categorical column names that need special handling.
        params : dict
            Dictionary of hyperparameters for LightGBM.
        early_stop : int
            Number of rounds for early stopping during training.

        Returns
        -------
        models : list of lgb.Booster
            A list of trained LightGBM models, one for each fold.
        """

        # Copy data to avoid modifying the caller's dataframe
        data = data.copy()

        # Convert categorical columns to category dtype
        for col in cat_cols:
            data[col] = data[col].astype('category')

        # Split features and label
        X = data.drop(['target', 'patient_id'], axis=1)
        y = data['target']
        groups = data['patient_id']

        # Initialize cross-validation strategy (GroupKFold)
        cv = GroupKFold(5)

        # Initialize lists to store models and cross-validation scores
        models = []
        scores = []

        # Perform cross-validation
        for fold, (train_index, valid_index) in enumerate(cv.split(X, y, groups)):
            # Split the data into training and validation sets for the current fold
            X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

            # Create LightGBM datasets
            train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols)
            valid_data = lgb.Dataset(X_valid, label=y_valid, categorical_feature=cat_cols, reference=train_data)

            # Train the model
            model = lgb.train(params,
                              train_data,
                              valid_sets=[valid_data],
                              callbacks=[lgb.early_stopping(early_stop, verbose=0),
                                         lgb.log_evaluation(0)])

            # Append the trained model to the list
            models.append(model)

            # Calculate and store the pAUC score for the current (validation) fold
            y_pred = model.predict(X_valid)
            score = Metrics.calculate_pauc(y_valid, y_pred)
            scores.append(score)

        # Plot the cross-validation results
        Metrics.plot_cv(scores, 'LightGBM')

        return models

    @staticmethod
    def train_ctb(data, cat_cols, params, early_stop):
        """
        Trains CatBoost models using GroupKFold cross-validation and evaluates them using pAUC scores.

        Parameters
        ----------
        data : pd.DataFrame
            The training dataset containing features and target.
        cat_cols : list of str
            List of categorical column names that need special handling.
        params : dict
            Dictionary of hyperparameters for CatBoost.
        early_stop : int
            Number of rounds for early stopping during training.

        Returns
        -------
        models : list of catboost.CatBoostClassifier
            A list of trained CatBoost models, one for each fold.
        """

        # Copy data to avoid modifying the caller's dataframe
        data = data.copy()

        # Convert categorical columns to string (CatBoost accepts only string or integer categorical values)
        for col in cat_cols:
            data[col] = data[col].astype(str)

        # Split features and label
        X = data.drop(['target', 'patient_id'], axis=1)
        y = data['target']
        groups = data['patient_id']

        # Initialize cross-validation strategy (GroupKFold)
        cv = GroupKFold(5)

        # Initialize lists to store models and cross-validation scores
        models = []
        scores = []

        # Perform cross-validation
        for fold, (train_index, valid_index) in enumerate(cv.split(X, y, groups)):
            # Split the data into training and validation sets for the current fold
            X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

            # Create CatBoost pools
            train_pool = Pool(X_train, y_train, cat_features=cat_cols)
            valid_pool = Pool(X_valid, y_valid, cat_features=cat_cols)

            # Initialize CatBoost model
            model = CatBoostClassifier(**params, verbose=0)

            # Train the model
            model.fit(train_pool,
                      eval_set=valid_pool,
                      early_stopping_rounds=early_stop)

            # Append the trained model to the list
            models.append(model)

            # Calculate and store the pAUC score for the current (validation) fold
            y_pred = model.predict_proba(valid_pool)[:, 1]
            score = Metrics.calculate_pauc(y_valid, y_pred)
            scores.append(score)

        # Plot the cross-validation results
        Metrics.plot_cv(scores, 'CatBoost')

        return models

    @staticmethod
    def infer_lgb(data, cat_cols, models):
        """
        Performs inference using trained LightGBM models and averages their predictions.

        Parameters
        ----------
        data : pd.DataFrame
            The dataset for which predictions are to be made.
        cat_cols : list of str
            List of categorical column names that need special handling.
        models : list of lgb.Booster
            List of trained LightGBM models.

        Returns
        -------
        preds : np.ndarray
            Averaged predictions from the ensemble of LightGBM models.
        """

        # Convert categorical columns to category dtype
        for col in cat_cols:
            data[col] = data[col].astype('category')

        # Average the predictions of the LightGBM classifiers
        preds = np.mean([model.predict(data) for model in models], axis=0)

        return preds

    @staticmethod
    def infer_ctb(data, cat_cols, models):
        """
        Performs inference using trained CatBoost models and averages their predictions.

        Parameters
        ----------
        data : pd.DataFrame
            The dataset for which predictions are to be made.
        cat_cols : list of str
            List of categorical column names that need special handling.
        models : list of catboost.CatBoostClassifier
            List of trained CatBoost models.

        Returns
        -------
        preds : np.ndarray
            Averaged predictions from the ensemble of CatBoost models.
        """

        # Convert categorical columns to string (CatBoost accepts only string or integer categorical values)
        for col in cat_cols:
            data[col] = data[col].astype(str)

        # Create CatBoost pool for inference
        inference_pool = Pool(data, cat_features=cat_cols)

        # Average the predictions of the CatBoost classifiers
        preds = np.mean([model.predict_proba(inference_pool)[:, 1] for model in models], axis=0)

        return preds

    def generate_preds(self, train, test, cat_cols, lgb_params, ctb_params, early_stop, lgb_weight, ctb_weight):
        """
        Trains both LightGBM and CatBoost models, performs inference, and combines the predictions
        into a weighted ensemble. Evaluates the ensemble's performance on training data.

        Parameters
        ----------
        train : pd.DataFrame
            The training dataset containing features and target.
        test : pd.DataFrame
            The test dataset for which predictions are to be made.
        cat_cols : list of str
            List of categorical column names that need special handling.
        lgb_params : dict
            Dictionary of hyperparameters for LightGBM.
        ctb_params : dict
            Dictionary of hyperparameters for CatBoost.
        early_stop : int
            Number of rounds for early stopping during training.
        lgb_weight : float
            The weight assigned to LightGBM predictions in the ensemble.
        ctb_weight : float
            The weight assigned to CatBoost predictions in the ensemble.

        Returns
        -------
        test_preds : np.ndarray
            Final ensemble predictions on the test data.
        """

        # Train LightGBM and CatBoost models
        lgb_models = self.train_lgb(train, cat_cols, lgb_params, early_stop)
        ctb_models = self.train_ctb(train, cat_cols, ctb_params, early_stop)

        # Extract features and label column from train data
        X = train.drop(['target', 'patient_id'], axis=1)
        y = train['target']

        # Infer LightGBM and CatBoost on train data
        train_lgb_preds = self.infer_lgb(X, cat_cols, lgb_models)
        train_ctb_preds = self.infer_ctb(X, cat_cols, ctb_models)

        # Weight-ensemble LightGBM and CatBoost predictions
        train_preds = train_lgb_preds * lgb_weight + train_ctb_preds * ctb_weight

        # Calculate the ensemble pAUC on the training data (in-sample, so it is optimistic)
        train_pauc = Metrics.calculate_pauc(y, train_preds)
        print(f'Ensemble pAUC: {train_pauc:.3f}')

        # Plot confusion matrix for Ensemble predictions on train data
        print('Ensemble confusion matrix:')
        Metrics.plot_cm(y, train_preds)

        # Prepare test data for inference
        test = test.drop('patient_id', axis=1)

        # Infer LightGBM and CatBoost on test data
        test_lgb_preds = self.infer_lgb(test, cat_cols, lgb_models)
        test_ctb_preds = self.infer_ctb(test, cat_cols, ctb_models)

        # Weight-ensemble LightGBM and CatBoost predictions
        test_preds = test_lgb_preds * lgb_weight + test_ctb_preds * ctb_weight

        return test_preds
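
The weighted ensemble in `generate_preds` is a plain linear combination, so the blended values stay on the probability scale only when `lgb_weight + ctb_weight` equals 1. That does not change a rank-based score such as pAUC, but it does affect the 0.5 cutoff used for the confusion matrix. The following is a minimal sketch, assuming one wants the blend to be independent of how the weights are scaled. `blend_predictions` is an illustrative helper, not part of the notebook.

```python
import numpy as np


def blend_predictions(preds_a, preds_b, weight_a, weight_b):
    """Weighted average of two prediction arrays, normalized by the weight sum."""
    total = weight_a + weight_b
    if total <= 0:
        raise ValueError('weights must sum to a positive value')
    preds_a = np.asarray(preds_a, dtype=float)
    preds_b = np.asarray(preds_b, dtype=float)
    # Dividing by the weight sum keeps the blend inside the range of its inputs
    return (weight_a * preds_a + weight_b * preds_b) / total


blended = blend_predictions([0.2, 0.8], [0.4, 0.6], weight_a=3, weight_b=1)
rescaled = blend_predictions([0.2, 0.8], [0.4, 0.6], weight_a=0.75, weight_b=0.25)
print(blended, rescaled)
```

Both calls give the same blend because only the ratio of the weights matters after normalization.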


In [122]:
# Initialize class for model training
md = ModelDevelopment()


In [123]:
# Generate predictions on test data using LightGBM and CatBoost
preds = md.generate_preds(train_data,
                          test_data,
                          cat_cols,
                          CFG.lgb_params,
                          CFG.ctb_params,
                          CFG.early_stop,
                          CFG.lgb_weight,
                          CFG.ctb_weight)

Ensemble pAUC: 0.197
Ensemble confusion matrix:
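
The ensemble pAUC printed above is computed on the same rows the fold models were trained on (each model saw four of the five folds), so it tends to be optimistic. The following is a minimal sketch of out-of-fold scoring, in which every row is predicted only by a model that never saw its group. It uses scikit-learn's `LogisticRegression` and synthetic data as stand-ins (assumptions for illustration), not the notebook's LightGBM and CatBoost models, and `out_of_fold_predictions` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold


def out_of_fold_predictions(X, y, groups, make_model, n_splits=5):
    """Predict each row with a model trained on folds that exclude its group."""
    oof = np.full(len(y), np.nan)
    cv = GroupKFold(n_splits=n_splits)
    for train_idx, valid_idx in cv.split(X, y, groups):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        # Only the held-out rows receive predictions from this fold's model
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return oof


# Synthetic stand-in data: 300 rows, 30 groups playing the role of patient_id
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
groups = rng.integers(0, 30, size=300)

oof = out_of_fold_predictions(X, y, groups, LogisticRegression)
print(f'Out-of-fold AUC: {roc_auc_score(y, oof):.3f}')
```

The same loop structure already exists in `train_lgb` and `train_ctb`; storing `y_pred` at `valid_index` there would give out-of-fold predictions for the real models.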


In [None]:
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb

# Define the parameter grid for LightGBM
param_grid = {
    'num_leaves': [31, 50, 100],
    'max_depth': [-1, 10, 20, 30],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0, 0.1, 0.5, 1],
}

# Initialize the LightGBM model
lgb_model = lgb.LGBMClassifier()

# Set up the RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=lgb_model,
                                   param_distributions=param_grid,
                                   n_iter=20,  # Number of different combinations to try
                                   scoring='roc_auc',  # Use AUC to score
                                   cv=5,  # 5-fold cross-validation
                                   verbose=1,
                                   random_state=42,
                                   n_jobs=-1)  # Use all available cores

# Perform the search on the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print("Best parameters found: ", random_search.best_params_)
print("Best AUC score: ", random_search.best_score_)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
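
The search above ranks candidates by full ROC AUC on plain 5-fold splits, whereas the models earlier in the notebook are validated with `GroupKFold` on `patient_id` and a pAUC score. The following is a minimal sketch, under stated assumptions, of a search that uses grouped folds and a pAUC-style scorer. It relies on flipping labels and scores so that "TPR at or above 0.8" becomes "FPR at or below 0.2", then calls `roc_auc_score` with `max_fpr`. That call returns a standardized value rather than the raw area, which should still rank candidates the same way. `LogisticRegression`, the synthetic data, and the name `pauc_scorer` are stand-ins for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, RandomizedSearchCV


def pauc_scorer(estimator, X, y, min_tpr=0.8):
    """Standardized partial AUC for the region TPR >= min_tpr (illustrative scorer)."""
    scores = estimator.predict_proba(X)[:, 1]
    # Flip labels and negate scores so the high-TPR region maps to the low-FPR region
    return roc_auc_score(1 - np.asarray(y), -scores, max_fpr=1 - min_tpr)


# Synthetic stand-in data: 400 rows, 40 groups playing the role of patient_id
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.7, size=400) > 0).astype(int)
groups = rng.integers(0, 40, size=400)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={'C': [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    scoring=pauc_scorer,        # callable with signature (estimator, X, y)
    cv=GroupKFold(n_splits=5),  # keeps each group inside a single fold
    random_state=42,
)
search.fit(X, y, groups=groups)
print('Best parameters found:', search.best_params_)
```

Passing `groups` to `fit` is what lets `GroupKFold` keep all rows of one patient in the same fold, so the validation score is not inflated by lesions from a patient the model has already seen.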


In [96]:
# Load submission data
subm_data = pd.read_csv(CFG.subm_data)
display(subm_data.head())

Unnamed: 0,isic_id,target
0,ISIC_0015657,0.3
1,ISIC_0015729,0.3
2,ISIC_0015740,0.3


In [97]:
# Assign predictions to submission DataFrame
subm_data['target'] = preds

In [98]:
# Save the submission dataframe
subm_data.to_csv('submission.csv', index=False)
display(subm_data.head())

Unnamed: 0,isic_id,target
0,ISIC_0015657,0.000491
1,ISIC_0015729,0.000236
2,ISIC_0015740,0.000809
