## This notebook is based on the following works:

1) **[Naive LightGBM](https://www.kaggle.com/code/bguberfain/naive-lightgbm)** by master [*Bruno G. do Amaral*](https://www.kaggle.com/bguberfain)


2) **[Tabular Ensemble: LGBM+Catboost](https://www.kaggle.com/code/snnclsr/tabular-ensemble-lgbm-catboost)** by master [*Sinan Calisir*](https://www.kaggle.com/snnclsr)


3) **[ISIC 2024 Skin Cancer - Getting Started](https://www.kaggle.com/code/coderinunderpants/isic-2024-skin-cancer-getting-started)** by contributor [*Joy Banikl*](https://www.kaggle.com/coderinunderpants)


4) **[ISIC: Tabular model + Image model features](https://www.kaggle.com/code/motono0223/isic-tabular-model-image-model-features)** by master [*motono0223*](https://www.kaggle.com/motono0223)


5) **[Only Tabular Features: XGB+CATB+LGBM Ensemble](https://www.kaggle.com/code/rzatemizel/only-tabular-features-xgb-catb-lgbm-ensemble)** by master [*rıza temizel*](https://www.kaggle.com/rzatemizel)


6) **[I'll be revising it tomorrow](https://www.kaggle.com/code/linguilin/i-ll-be-revising-it-tomorrow)** by contributor [*Ship of Theseus*](https://www.kaggle.com/linguilin)


7) **[ISIC - Detect Skin Cancer - Let's Learn Together](https://www.kaggle.com/code/dschettler8845/isic-detect-skin-cancer-let-s-learn-together)** by grandmaster [*Darien Schettler*](https://www.kaggle.com/dschettler8845)


8) **[ISIC 2024 Skin Cancer Detection hdf5](https://www.kaggle.com/code/mpwolke/isic-2024-skin-cancer-detection-hdf5)** by grandmaster [*Marília Prata*](https://www.kaggle.com/mpwolke)


9) **[AI in Dermoscopy. ADAE algorithm.](https://www.kaggle.com/competitions/isic-2024-challenge/discussion/515369)** by grandmaster [*Marília Prata*](https://www.kaggle.com/mpwolke)


This notebook is an ongoing experiment built with the masters' toolkits.
Please look at the [versions](https://www.kaggle.com/code/vyacheslavbolotin/lightgbm-catboost-with-new-features?scriptVersionId=189856305); the ensemble coefficients may differ between versions.

## *[Master](https://www.kaggle.com/snnclsr)* **[work](https://www.kaggle.com/code/snnclsr/tabular-ensemble-lgbm-catboost):**

In my previous work [here](https://www.kaggle.com/code/snnclsr/lgbm-baseline-with-new-features), I showed the effectiveness of additional tabular features. Since there are not enough positive samples, every bit of contribution will be important in the final stage.

With that motivation, I want to show how ensembles are useful in predictions.

**Edit:** New version adds two image model predictions from here: https://www.kaggle.com/code/motono0223/isic-tabular-model-image-model-features and one from my own model (augmentations used [here](https://www.kaggle.com/code/snnclsr/image-augmentations-from-winning-solutions)).
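
To make the motivation concrete, here is a minimal sketch (not taken from the original notebooks) of why averaging helps. It assumes two hypothetical models whose errors are independent and equally noisy, which real models only approximate. Under that assumption the blended prediction should have a lower mean squared error than either model alone.

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: one "true" signal and two models that see it with independent noise.
truth = [random.random() for _ in range(10_000)]
model_a = [t + random.gauss(0, 0.1) for t in truth]
model_b = [t + random.gauss(0, 0.1) for t in truth]
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return statistics.fmean((p - t) ** 2 for p, t in zip(pred, target))


mse_a, mse_b, mse_blend = mse(model_a, truth), mse(model_b, truth), mse(blend, truth)
print(f"A: {mse_a:.4f}  B: {mse_b:.4f}  blend: {mse_blend:.4f}")
```

The gain shrinks as the models' errors become more correlated, which is one reason to combine different model families (LightGBM, CatBoost, image models) rather than several copies of one.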

In [None]:
NO_RUN = False

import os

if NO_RUN and not os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    # To save some time.
    import pandas as pd
    df_sub = pd.read_csv("/kaggle/input/isic-2024-challenge/sample_submission.csv")
    df_sub.to_csv("submission.csv", index=False)
    exit(0)

# Imports

In [None]:
import random
import numpy as np
import pandas as pd
import pandas.api.types
import matplotlib.pyplot as plt

from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold
from sklearn.ensemble import VotingClassifier

import optuna
import catboost as cb
import lightgbm as lgb
import xgboost as xgb

OPTIMIZE_OPTUNA = False
SUBSAMPLE = False
SUBSAMPLE_RATIO = 0.5 # only effective if SUBSAMPLE=True
DISPLAY_FEATURE_IMPORTANCE = False

## Generating the image level predictions

In [None]:
!python /kaggle/input/isic-script-inference-effnetv1b0-f313ae/main.py /kaggle/input/isic-pytorch-training-baseline-image-only/AUROC0.5171_Loss0.3476_epoch35.bin
!mv submission.csv submission_effnetv1b0.csv
# My model
!python /kaggle/input/isic-2024-pl-submission-script-and-preds/pl_submission.py
!mv submission.csv submission_image3.csv

# Feature Engineering
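
Most of the engineered features in this section are simple ratios and products of the raw columns. As an illustration of how to read one of them, the sketch below reproduces the `border_length_ratio` formula on plain numbers: the lesion perimeter divided by the circumference of a circle with the same area. The helper name and the toy shapes are assumptions made for this example only.

```python
import math


def border_length_ratio(perimeter_mm: float, area_mm2: float) -> float:
    """Perimeter divided by the circumference of a circle with the same area."""
    equal_area_circumference = 2 * math.pi * math.sqrt(area_mm2 / math.pi)
    return perimeter_mm / equal_area_circumference


# A circle of radius 3 mm: the ratio is 1 (up to floating-point rounding).
circle = border_length_ratio(2 * math.pi * 3, math.pi * 3 ** 2)
# A 2 mm x 2 mm square: a longer border for the same area, so the ratio exceeds 1.
square = border_length_ratio(8.0, 4.0)
print(round(circle, 4), round(square, 4))
```

A value near 1 therefore suggests a compact, round outline, and larger values suggest a more irregular border.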

In [None]:
df_train = pd.read_csv("/kaggle/input/isic-2024-challenge/train-metadata.csv")
df_test = pd.read_csv("/kaggle/input/isic-2024-challenge/test-metadata.csv")

def feature_engineering(df):
    # New features to try...
    df["lesion_size_ratio"]              = df["tbp_lv_minorAxisMM"] / df["clin_size_long_diam_mm"]
    df["lesion_shape_index"]             = df["tbp_lv_areaMM2"] / (df["tbp_lv_perimeterMM"] ** 2)
    df["hue_contrast"]                   = (df["tbp_lv_H"] - df["tbp_lv_Hext"]).abs()
    df["luminance_contrast"]             = (df["tbp_lv_L"] - df["tbp_lv_Lext"]).abs()
    df["lesion_color_difference"]        = np.sqrt(df["tbp_lv_deltaA"] ** 2 + df["tbp_lv_deltaB"] ** 2 + df["tbp_lv_deltaL"] ** 2)
    df["border_complexity"]              = df["tbp_lv_norm_border"] + df["tbp_lv_symm_2axis"]
    df["color_uniformity"]               = df["tbp_lv_color_std_mean"] / df["tbp_lv_radial_color_std_max"]
    
    df["3d_position_distance"]           = np.sqrt(df["tbp_lv_x"] ** 2 + df["tbp_lv_y"] ** 2 + df["tbp_lv_z"] ** 2) 
    df["perimeter_to_area_ratio"]        = df["tbp_lv_perimeterMM"] / df["tbp_lv_areaMM2"]
    df["area_to_perimeter_ratio"]        = df["tbp_lv_areaMM2"] / df["tbp_lv_perimeterMM"]
    df["lesion_visibility_score"]        = df["tbp_lv_deltaLBnorm"] + df["tbp_lv_norm_color"]
    df["combined_anatomical_site"]       = df["anatom_site_general"] + "_" + df["tbp_lv_location"]
    df["symmetry_border_consistency"]    = df["tbp_lv_symm_2axis"] * df["tbp_lv_norm_border"]
    df["consistency_symmetry_border"]    = df["tbp_lv_symm_2axis"] * df["tbp_lv_norm_border"] / (df["tbp_lv_symm_2axis"] + df["tbp_lv_norm_border"])
    
    df["color_consistency"]              = df["tbp_lv_stdL"] / df["tbp_lv_Lext"]
    df["consistency_color"]              = df["tbp_lv_stdL"] * df["tbp_lv_Lext"] / (df["tbp_lv_stdL"] + df["tbp_lv_Lext"])
    df["size_age_interaction"]           = df["clin_size_long_diam_mm"] * df["age_approx"]
    df["hue_color_std_interaction"]      = df["tbp_lv_H"] * df["tbp_lv_color_std_mean"]
    df["lesion_severity_index"]          = (df["tbp_lv_norm_border"] + df["tbp_lv_norm_color"] + df["tbp_lv_eccentricity"]) / 3
    df["shape_complexity_index"]         = df["border_complexity"] + df["lesion_shape_index"]
    df["color_contrast_index"]           = df["tbp_lv_deltaA"] + df["tbp_lv_deltaB"] + df["tbp_lv_deltaL"] + df["tbp_lv_deltaLBnorm"]
    
    df["log_lesion_area"]                = np.log(df["tbp_lv_areaMM2"] + 1)
    df["normalized_lesion_size"]         = df["clin_size_long_diam_mm"] / df["age_approx"]
    df["mean_hue_difference"]            = (df["tbp_lv_H"] + df["tbp_lv_Hext"]) / 2
    df["std_dev_contrast"]               = np.sqrt((df["tbp_lv_deltaA"] ** 2 + df["tbp_lv_deltaB"] ** 2 + df["tbp_lv_deltaL"] ** 2) / 3)
    df["color_shape_composite_index"]    = (df["tbp_lv_color_std_mean"] + df["tbp_lv_area_perim_ratio"] + df["tbp_lv_symm_2axis"]) / 3
    df["3d_lesion_orientation"]          = np.arctan2(df["tbp_lv_y"], df["tbp_lv_x"])
    df["overall_color_difference"]       = (df["tbp_lv_deltaA"] + df["tbp_lv_deltaB"] + df["tbp_lv_deltaL"]) / 3
    
    df["symmetry_perimeter_interaction"] = df["tbp_lv_symm_2axis"] * df["tbp_lv_perimeterMM"]
    df["comprehensive_lesion_index"]     = (df["tbp_lv_area_perim_ratio"] + df["tbp_lv_eccentricity"] + df["tbp_lv_norm_color"] + df["tbp_lv_symm_2axis"]) / 4
    df["color_variance_ratio"]           = df["tbp_lv_color_std_mean"] / df["tbp_lv_stdLExt"]
    df["border_color_interaction"]       = df["tbp_lv_norm_border"] * df["tbp_lv_norm_color"]
    df["size_color_contrast_ratio"]      = df["clin_size_long_diam_mm"] / df["tbp_lv_deltaLBnorm"]
    df["age_normalized_nevi_confidence"] = df["tbp_lv_nevi_confidence"] / df["age_approx"]
    df["color_asymmetry_index"]          = df["tbp_lv_radial_color_std_max"] * df["tbp_lv_symm_2axis"]
    
    df["3d_volume_approximation"]        = df["tbp_lv_areaMM2"] * np.sqrt(df["tbp_lv_x"]**2 + df["tbp_lv_y"]**2 + df["tbp_lv_z"]**2)
    df["color_range"]                    = (df["tbp_lv_L"] - df["tbp_lv_Lext"]).abs() + (df["tbp_lv_A"] - df["tbp_lv_Aext"]).abs() + (df["tbp_lv_B"] - df["tbp_lv_Bext"]).abs()
    df["shape_color_consistency"]        = df["tbp_lv_eccentricity"] * df["tbp_lv_color_std_mean"]
    df["border_length_ratio"]            = df["tbp_lv_perimeterMM"] / (2 * np.pi * np.sqrt(df["tbp_lv_areaMM2"] / np.pi))
    df["age_size_symmetry_index"]        = df["age_approx"] * df["clin_size_long_diam_mm"] * df["tbp_lv_symm_2axis"]
    df["index_age_size_symmetry"]        = df["age_approx"] * df["tbp_lv_areaMM2"] * df["tbp_lv_symm_2axis"]
    # Until here..
    # df['np1']                           = np.sqrt(df["tbp_lv_deltaB"]**2 + df["tbp_lv_deltaL"]**2 + df["tbp_lv_deltaLB"]**2) / (df["tbp_lv_deltaB"] + df["tbp_lv_deltaL"] + df["tbp_lv_deltaLB"])
    # df['np2']                           = (df["tbp_lv_deltaA"] + df["tbp_lv_deltaLB"]) / np.sqrt(df["tbp_lv_deltaA"]**2 + df["tbp_lv_deltaLB"]**2)
    # df['np3']                           = ?
    # ...
    # df['npn']                           = ?
    
    new_num_cols = [
        "lesion_size_ratio",             # tbp_lv_minorAxisMM      / clin_size_long_diam_mm
        "lesion_shape_index",            # tbp_lv_areaMM2          / tbp_lv_perimeterMM **2
        "hue_contrast",                  # tbp_lv_H                - tbp_lv_Hext              abs
        "luminance_contrast",            # tbp_lv_L                - tbp_lv_Lext              abs
        "lesion_color_difference",       # tbp_lv_deltaA **2       + tbp_lv_deltaB **2 + tbp_lv_deltaL **2  sqrt  
        "border_complexity",             # tbp_lv_norm_border      + tbp_lv_symm_2axis
        "color_uniformity",              # tbp_lv_color_std_mean   / tbp_lv_radial_color_std_max
        
        "3d_position_distance",          # tbp_lv_x **2 + tbp_lv_y **2 + tbp_lv_z **2  sqrt
        "perimeter_to_area_ratio",       # tbp_lv_perimeterMM      / tbp_lv_areaMM2
        "area_to_perimeter_ratio",       # tbp_lv_areaMM2          / tbp_lv_perimeterMM
        "lesion_visibility_score",       # tbp_lv_deltaLBnorm      + tbp_lv_norm_color
        # "combined_anatomical_site"      # anatom_site_general     + "_" + tbp_lv_location ! categorical feature
        "symmetry_border_consistency",   # tbp_lv_symm_2axis       * tbp_lv_norm_border
        "consistency_symmetry_border",   # tbp_lv_symm_2axis       * tbp_lv_norm_border / (tbp_lv_symm_2axis + tbp_lv_norm_border)
        
        "color_consistency",             # tbp_lv_stdL             / tbp_lv_Lext
        "consistency_color",             # tbp_lv_stdL*tbp_lv_Lext / (tbp_lv_stdL + tbp_lv_Lext)
        "size_age_interaction",          # clin_size_long_diam_mm  * age_approx
        "hue_color_std_interaction",     # tbp_lv_H                * tbp_lv_color_std_mean
        "lesion_severity_index",         # (tbp_lv_norm_border + tbp_lv_norm_color + tbp_lv_eccentricity) / 3
        "shape_complexity_index",        # border_complexity       + lesion_shape_index
        "color_contrast_index",          # tbp_lv_deltaA + tbp_lv_deltaB + tbp_lv_deltaL + tbp_lv_deltaLBnorm
        
        "log_lesion_area",               # tbp_lv_areaMM2          + 1  np.log
        "normalized_lesion_size",        # clin_size_long_diam_mm  / age_approx
        "mean_hue_difference",           # (tbp_lv_H + tbp_lv_Hext) / 2
        "std_dev_contrast",              # sqrt((tbp_lv_deltaA **2 + tbp_lv_deltaB **2 + tbp_lv_deltaL **2) / 3)
        "color_shape_composite_index",   # (tbp_lv_color_std_mean + tbp_lv_area_perim_ratio + tbp_lv_symm_2axis) / 3
        "3d_lesion_orientation",         # tbp_lv_y                , tbp_lv_x  np.arctan2
        "overall_color_difference",      # (tbp_lv_deltaA + tbp_lv_deltaB + tbp_lv_deltaL) / 3
        
        "symmetry_perimeter_interaction",# tbp_lv_symm_2axis       * tbp_lv_perimeterMM
        "comprehensive_lesion_index",    # (tbp_lv_area_perim_ratio + tbp_lv_eccentricity + tbp_lv_norm_color + tbp_lv_symm_2axis) / 4
        "color_variance_ratio",          # tbp_lv_color_std_mean   / tbp_lv_stdLExt
        "border_color_interaction",      # tbp_lv_norm_border      * tbp_lv_norm_color
        "size_color_contrast_ratio",     # clin_size_long_diam_mm  / tbp_lv_deltaLBnorm
        "age_normalized_nevi_confidence",# tbp_lv_nevi_confidence  / age_approx
        "color_asymmetry_index",         # tbp_lv_symm_2axis       * tbp_lv_radial_color_std_max
        
        "3d_volume_approximation",       # tbp_lv_areaMM2          * sqrt(tbp_lv_x**2 + tbp_lv_y**2 + tbp_lv_z**2)
        "color_range",                   # abs(tbp_lv_L - tbp_lv_Lext) + abs(tbp_lv_A - tbp_lv_Aext) + abs(tbp_lv_B - tbp_lv_Bext)
        "shape_color_consistency",       # tbp_lv_eccentricity     * tbp_lv_color_std_mean
        "border_length_ratio",           # tbp_lv_perimeterMM      / (2 * pi * sqrt(tbp_lv_areaMM2 / pi))
        "age_size_symmetry_index",       # age_approx              * clin_size_long_diam_mm * tbp_lv_symm_2axis
         #"index_age_size_symmetry",      # age_approx              * sqrt(tbp_lv_areaMM2 * tbp_lv_symm_2axis)
        "index_age_size_symmetry",       # age_approx              * tbp_lv_areaMM2 * tbp_lv_symm_2axis
         # Until here..
         # 'np1',                         # in case of a positive manifestation
         # 'np2',                         # in case of a positive manifestation
         # 'np3'                          # = ?
         # ...
         # 'npn'                          # = ?
    ]
    
    
    
    
    # The following features have been added:
    
    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    # "area_to_perimeter_ratio",       # tbp_lv_areaMM2          / tbp_lv_perimeterMM
    # "consistency_symmetry_border",   # tbp_lv_symm_2axis       * tbp_lv_norm_border / (tbp_lv_symm_2axis + tbp_lv_norm_border)
    # "consistency_color",             # tbp_lv_stdL*tbp_lv_Lext / (tbp_lv_stdL + tbp_lv_Lext)
    # "index_age_size_symmetry",       # age_approx              * tbp_lv_areaMM2 * tbp_lv_symm_2axis
    
    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    # They were all added at once, on instinct.
    # That is probably not the right approach: it would be better to add them one at a time
    # and inspect the result after each addition (or, better still, automate the check).
     
    # Working only with the tabular values, as expert Andreas Bisi did, showed that a smaller
    # set of the features available to us can be enough,
    # so the next versions will try removing features rather than adding them.
    
    
    
    
    
    new_cat_cols = ["combined_anatomical_site"]
    
    return df, new_num_cols, new_cat_cols


num_cols = [
    'age_approx',                        # Approximate age of patient at time of imaging.
    'clin_size_long_diam_mm',            # Maximum diameter of the lesion (mm).+
    'tbp_lv_A',                          # A inside  lesion.+
    'tbp_lv_Aext',                       # A outside lesion.+
    'tbp_lv_B',                          # B inside  lesion.+
    'tbp_lv_Bext',                       # B outside lesion.+ 
    'tbp_lv_C',                          # Chroma inside  lesion.+
    'tbp_lv_Cext',                       # Chroma outside lesion.+
    'tbp_lv_H',                          # Hue inside the lesion; calculated as the angle of A* and B* in LAB* color space. Typical values range from 25 (red) to 75 (brown).+
    'tbp_lv_Hext',                       # Hue outside lesion.+
    'tbp_lv_L',                          # L inside lesion.+
    'tbp_lv_Lext',                       # L outside lesion.+
    'tbp_lv_areaMM2',                    # Area of lesion (mm^2).+
    'tbp_lv_area_perim_ratio',           # Border jaggedness, the ratio between lesions perimeter and area. Circular lesions will have low values; irregular shaped lesions will have higher values. Values range 0-10.+
    'tbp_lv_color_std_mean',             # Color irregularity, calculated as the variance of colors within the lesion's boundary.
    'tbp_lv_deltaA',                     # Average A contrast (inside vs. outside lesion).+
    'tbp_lv_deltaB',                     # Average B contrast (inside vs. outside lesion).+
    'tbp_lv_deltaL',                     # Average L contrast (inside vs. outside lesion).+
    'tbp_lv_deltaLB',                    #
    'tbp_lv_deltaLBnorm',                # Contrast between the lesion and its immediate surrounding skin. Low contrast lesions tend to be faintly visible such as freckles; high contrast lesions tend to be those with darker pigment. Calculated as the average delta LB of the lesion relative to its immediate background in LAB* color space. Typical values range from 5.5 to 25.+
    'tbp_lv_eccentricity',               # Eccentricity.+
    'tbp_lv_minorAxisMM',                # Smallest lesion diameter (mm).+
    'tbp_lv_nevi_confidence',            # Nevus confidence score (0-100 scale) is a convolutional neural network classifier estimated probability that the lesion is a nevus. The neural network was trained on approximately 57,000 lesions that were classified and labeled by a dermatologist.+,++
    'tbp_lv_norm_border',                # Border irregularity (0-10 scale); the normalized average of border jaggedness and asymmetry.+
    'tbp_lv_norm_color',                 # Color variation (0-10 scale); the normalized average of color asymmetry and color irregularity.+
    'tbp_lv_perimeterMM',                # Perimeter of lesion (mm).+
    'tbp_lv_radial_color_std_max',       # Color asymmetry, a measure of asymmetry of the spatial distribution of color within the lesion. This score is calculated by looking at the average standard deviation in LAB* color space within concentric rings originating from the lesion center. Values range 0-10.+
    'tbp_lv_stdL',                       # Standard deviation of L inside  lesion.+
    'tbp_lv_stdLExt',                    # Standard deviation of L outside lesion.+
    'tbp_lv_symm_2axis',                 # Border asymmetry; a measure of asymmetry of the lesion's contour about an axis perpendicular to the lesion's most symmetric axis. Lesions with two axes of symmetry will therefore have low scores (more symmetric), while lesions with only one or zero axes of symmetry will have higher scores (less symmetric). This score is calculated by comparing opposite halves of the lesion contour over many degrees of rotation. The angle where the halves are most similar identifies the principal axis of symmetry, while the second axis of symmetry is perpendicular to the principal axis. Border asymmetry is reported as the asymmetry value about this second axis. Values range 0-10.+
    'tbp_lv_symm_2axis_angle',           # Lesion border asymmetry angle.+
    'tbp_lv_x',                          # X-coordinate of the lesion on 3D TBP.+
    'tbp_lv_y',                          # Y-coordinate of the lesion on 3D TBP.+
    'tbp_lv_z',                          # Z-coordinate of the lesion on 3D TBP.+
]

df_train[num_cols] = df_train[num_cols].fillna(df_train[num_cols].median())
df_test [num_cols] = df_test [num_cols].fillna(df_train[num_cols].median())

df_train, new_num_cols, new_cat_cols = feature_engineering(df_train.copy())
df_test, _, _                        = feature_engineering(df_test.copy())

num_cols += new_num_cols

# anatom_site_general
cat_cols = ["sex", "tbp_tile_type", "tbp_lv_location", "tbp_lv_location_simple"] + new_cat_cols
train_cols = num_cols + cat_cols






df_eff = pd.read_csv("/kaggle/input/isic-inference-effnetv1b0-for-training-data/train_effnetv1b0.csv")
df_eff = df_eff[["target_effnetv1b0"]]

df_nex = pd.read_csv("/kaggle/input/nextvit/train_effnetv1b0.csv")
df_nex = df_nex[["target_effnetv1b0"]]
df_sel = pd.read_csv("/kaggle/input/selecsls42b-in1k-drop/train_effnetv1b0.csv")
df_sel = df_sel[["target_effnetv1b0"]]

df_train["target_nexnetv1b0"] = df_nex["target_effnetv1b0"]
df_train["target_selnetv1b0"] = df_sel["target_effnetv1b0"]


df_image_3 = pd.read_csv("/kaggle/input/isic-2024-pl-submission-script-and-preds/train_preds.csv")

df_train["target_effnetv1b0"] = df_eff["target_effnetv1b0"]
df_train["target_3"] = df_image_3["pred"]

train_cols += ["target_3","target_nexnetv1b0","target_effnetv1b0","target_selnetv1b0"]





# ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~
# More feature engineering of roughly the same kind has not helped so far;
# the more promising direction is to add one or more parallel prediction columns like the ones above.
# ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~


category_encoder = OrdinalEncoder(
    categories='auto',
    dtype=int,
    handle_unknown='use_encoded_value',
    unknown_value=-2,
    encoded_missing_value=-1,
)

X_cat = category_encoder.fit_transform(df_train[cat_cols])
for c, cat_col in enumerate(cat_cols):
    df_train[cat_col] = X_cat[:, c]

# CV Setup
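
The folds in this section are grouped by `patient_id`, so that all lesions from one patient land in the same fold. The sketch below is a small sanity check for that property, written in plain Python on toy data. The helper names and the toy ids are assumptions for illustration and are not part of the original notebook.

```python
from collections import defaultdict


def folds_per_group(groups, folds):
    """Map each group (e.g. a patient) to the set of folds its rows landed in."""
    seen = defaultdict(set)
    for group, fold in zip(groups, folds):
        seen[group].add(fold)
    return dict(seen)


def has_group_leak(groups, folds):
    """True if any group is split across more than one fold."""
    return any(len(f) > 1 for f in folds_per_group(groups, folds).values())


# Toy data (hypothetical patient ids and fold labels).
patient_id = ["p1", "p1", "p2", "p3", "p3", "p3"]
clean_folds = [0, 0, 1, 2, 2, 2]
leaky_folds = [0, 1, 1, 2, 2, 2]  # p1 appears in folds 0 and 1

print(has_group_leak(patient_id, clean_folds))
print(has_group_leak(patient_id, leaky_folds))
```

On the notebook's dataframe, `has_group_leak(df_train["patient_id"], df_train["fold"])` would be expected to return `False` once the `fold` column has been filled.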

In [None]:
N_SPLITS = 5

gkf = StratifiedGroupKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

if SUBSAMPLE:
    df_pos = df_train[df_train["target"] == 1]
    df_neg = df_train[df_train["target"] == 0]
    df_neg = df_neg.sample(frac=SUBSAMPLE_RATIO, random_state=42)
    df_train = pd.concat([df_pos, df_neg]).sample(frac=1.0, random_state=42).reset_index(drop=True)    

df_train["fold"] = -1

for idx, (train_idx, val_idx) in enumerate(gkf.split(df_train, df_train["target"], groups=df_train["patient_id"])):
    df_train.loc[val_idx, "fold"] = idx

# Competition Metric
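
The metric is the partial area under the ROC curve above 80% true positive rate. The functions in this section flip labels and scores so that this region becomes "false positive rate at most 0.2", call `roc_auc_score` with `max_fpr`, and then undo the standardization that scikit-learn applies to a partial AUC. The sketch below isolates only that last rescaling step. The function name is an assumption for this example, and `max_fpr` is fixed at 0.2 to match `min_tpr=0.80`.

```python
def unscale_partial_auc(scaled: float, max_fpr: float = 0.2) -> float:
    """Map a standardized partial AUC in [0.5, 1.0] back to the raw area,
    which lies in [0.5 * max_fpr**2, max_fpr] (same formula as comp_score)."""
    min_area = 0.5 * max_fpr ** 2  # area under the chance diagonal up to max_fpr
    max_area = max_fpr             # area of a perfect classifier up to max_fpr
    return min_area + (max_area - min_area) / (1.0 - 0.5) * (scaled - 0.5)


for scaled in (0.5, 0.75, 1.0):
    print(scaled, round(unscale_partial_auc(scaled), 4))
```

With `max_fpr = 0.2`, a perfect model scores 0.2 and a random one about 0.02, which is why the printed fold scores stay below 0.2.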

In [None]:
def comp_score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str, min_tpr: float=0.80):
    v_gt = abs(np.asarray(solution.values)-1)
    v_pred = np.array([1.0 - x for x in submission.values])
    max_fpr = abs(1-min_tpr)
    partial_auc_scaled = roc_auc_score(v_gt, v_pred, max_fpr=max_fpr)
    # change scale from [0.5, 1.0] to [0.5 * max_fpr**2, max_fpr]
    # https://math.stackexchange.com/questions/914823/shift-numbers-into-a-different-range
    partial_auc = 0.5 * max_fpr**2 + (max_fpr - 0.5 * max_fpr**2) / (1.0 - 0.5) * (partial_auc_scaled - 0.5)
    return partial_auc

def custom_lgbm_metric(y_true, y_hat):
    # TODO: Refactor with the above.
    min_tpr = 0.80
    v_gt = abs(y_true-1)
    v_pred = np.array([1.0 - x for x in y_hat])
    max_fpr = abs(1-min_tpr)
    partial_auc_scaled = roc_auc_score(v_gt, v_pred, max_fpr=max_fpr)
    # change scale from [0.5, 1.0] to [0.5 * max_fpr**2, max_fpr]
    # https://math.stackexchange.com/questions/914823/shift-numbers-into-a-different-range
    partial_auc = 0.5 * max_fpr**2 + (max_fpr - 0.5 * max_fpr**2) / (1.0 - 0.5) * (partial_auc_scaled - 0.5)
    return "pauc80", partial_auc, True


# LGBM Model
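
The Optuna search space in this section draws `lambda_l1` and `lambda_l2` with `log=True` over the range 1e-8 to 10. As a rough illustration of what sampling in the log domain means, here is a standard-library sketch. It is not Optuna's implementation, and the helper name is an assumption for this example.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
draws = [sample_log_uniform(1e-8, 10.0, rng) for _ in range(10_000)]

# The range spans nine decades, so about four ninths of the draws
# should fall below 1e-4 (the four lowest decades).
share_below_1e_4 = sum(d < 1e-4 for d in draws) / len(draws)
print(f"share below 1e-4: {share_below_1e_4:.3f}")
```

A plain uniform draw over the same range would almost never produce a value below 1e-4, so the log scale is what lets the search explore very small regularization strengths.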

In [None]:
def objective(trial):
    param = {
        "objective":         "binary",
        # "metric":           "custom",
        "verbosity":         -1,
        "boosting_type":     "gbdt",
        "lambda_l1":         trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2":         trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves":        trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction":  trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction":  trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq":      trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "device":            "gpu"
    }
    
    scores = []
    
    for fold in range(N_SPLITS):
        _df_train = df_train[df_train["fold"] != fold].reset_index(drop=True)
        _df_valid = df_train[df_train["fold"] == fold].reset_index(drop=True)
        dtrain = lgb.Dataset(_df_train[train_cols], label=_df_train["target"])
        gbm = lgb.train(param, dtrain)
        preds = gbm.predict(_df_valid[train_cols])
        score = comp_score(_df_valid[["target"]], pd.DataFrame(preds, columns=["prediction"]), "")
        scores.append(score)
        
    return np.mean(scores)

In [None]:
if OPTIMIZE_OPTUNA:
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=21)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
new_params = {
    "objective": "binary",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "n_estimators": 123,
    'learning_rate': 0.0341,    
    'lambda_l1': 7.38258, 
    'lambda_l2': 0.663075, 
    'num_leaves': 103, 
    'feature_fraction': 0.5392005444882538, 
    'bagging_fraction': 0.9577412548866563, 
    'bagging_freq': 6,
    'min_child_samples': 60,
    "device": "gpu"
}
lgb_scores = []
lgb_models = []
oof_df = pd.DataFrame()
for fold in range(N_SPLITS):
    _df_train = df_train[df_train["fold"] != fold].reset_index(drop=True)
    _df_valid = df_train[df_train["fold"] == fold].reset_index(drop=True)
    model = lgb.LGBMClassifier(**new_params)
    # model = VotingClassifier([(f"lgb_{i}", lgb.LGBMClassifier(random_state=i, **new_params)) for i in range(7)], voting="soft")
    model.fit(_df_train[train_cols], _df_train["target"])
    preds = model.predict_proba(_df_valid[train_cols])[:, 1]
    score = comp_score(_df_valid[["target"]], pd.DataFrame(preds, columns=["prediction"]), "")
    print(f"fold: {fold} - Partial AUC Score: {score:.5f}")
    lgb_scores.append(score)
    lgb_models.append(model)
    oof_single = _df_valid[["isic_id", "target"]].copy()
    oof_single["pred"] = preds
    oof_df = pd.concat([oof_df, oof_single])

In [None]:
# mrs = []

# for j in range(7):
    
#     lgb_params = {
#         "objective":         "binary",
#         "boosting_type":     "gbdt",
#         "n_estimators":      random.choice([500,1000,1500,2000]),
#         'learning_rate':     random.choice([0.003,0.005,0.01,0.03,0.05]),  
#         'lambda_l1':         random.choice([0.001,0.0001,0.00047,0.00077,0.00087]),
#         'lambda_l2':         random.choice([0.7,1.4,3,4,5,7,8.77]),
#         'num_leaves':        random.choice([50,100,150,200,250]),
# #         'feature_fraction':  random.choice([0.34,0.54,0.74,0.85]),
# #         'bagging_fraction':  random.choice([0.77,0.85,0.95]),
#         'bagging_freq':      random.choice([5,6,7,8]),
#         'min_child_samples': random.choice([40,50,60,70]),
#         "verbosity":         -1,
#         "device": "gpu"
#     }
    
#     lgb_scores = []
#     lgb_models = []
    
#     oof_df = pd.DataFrame()
#     for fold in range(N_SPLITS):
#         _df_train = df_train[df_train["fold"] != fold].reset_index(drop=True)
#         _df_valid = df_train[df_train["fold"] == fold].reset_index(drop=True)
#         model = VotingClassifier([(f"lgb_{i}", lgb.LGBMClassifier(random_state=i, **lgb_params)) for i in range(3)], voting="soft")
#         model.fit(_df_train[train_cols], _df_train["target"])
#         preds = model.predict_proba(_df_valid[train_cols])[:, 1]
#         score = comp_score(_df_valid[["target"]], pd.DataFrame(preds, columns=["prediction"]), "")
#         print(f"fold: {fold} - Partial AUC Score: {score:.5f}")
#         lgb_scores.append(score)
#         lgb_models.append(model)

#     mrs.append({'models':lgb_models, 'params':lgb_params, 'scores':lgb_scores})
#     print('\n',np.mean(lgb_scores),'\n')
    
# _score,i,j  = -1,0,0

# for mr in mrs:
#     _mean = np.mean(mr['scores'])
#     if _mean > _score: 
#         _score = _mean
#         j = i
#     i += 1
    
# lgb_scores = mrs[j]['scores']
# lgb_models = mrs[j]['models']
# lgb_params = mrs[j]['params']

In [None]:
lgbm_score = comp_score(oof_df["target"], oof_df["pred"], "")
print(f"LGBM Score: {lgbm_score:.5f}")

# LGBM Feature Importances

In [None]:
if DISPLAY_FEATURE_IMPORTANCE:
    # Make sure that this is a single model, not voting classifier. Will handle that later on.
    importances = np.mean([model.feature_importances_ for model in lgb_models], 0)
    df_imp = pd.DataFrame({"feature": model.feature_name_, "importance": importances}).sort_values("importance").reset_index(drop=True)

    plt.figure(figsize=(16, 12))
    plt.barh(df_imp["feature"], df_imp["importance"])
    plt.show()

# Catboost Model
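
The competition score is a ranking metric, so it needs continuous scores rather than hard class labels. This matters for CatBoost because `CatBoostClassifier.predict` returns class labels by default, while `predict_proba` returns probabilities. The sketch below uses a small pairwise AUC written for this example and toy data chosen to mimic a rare positive class, where every predicted probability stays below 0.5.

```python
def auc(y_true, y_score):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): probabilities rank most positives above the negatives,
# but none of them crosses the 0.5 threshold.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.20, 0.30, 0.25, 0.35, 0.45]
labels = [int(p >= 0.5) for p in proba]  # hard labels: all zeros here

print(auc(y_true, proba))   # uses the ranking information
print(auc(y_true, labels))  # all ties, so no better than chance
```

On this toy data the probabilities give an AUC of 8/9, while the thresholded labels collapse to 0.5.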

In [None]:
def objective(trial):
    param = {
        "objective":         trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth":             trial.suggest_int("depth", 1, 12),
        "boosting_type":     trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type":    trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]),
        # "task_type":       "GPU",
        # "used_ram_limit":  "3gb",
    }
    
    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    
    if param["bootstrap_type"] == "Bernoulli":
        param["subsample"]           = trial.suggest_float("subsample", 0.1, 1)

    scores = []
    
    for fold in range(N_SPLITS):
        _df_train = df_train[df_train["fold"] != fold].reset_index(drop=True)
        _df_valid = df_train[df_train["fold"] == fold].reset_index(drop=True)
        gbm = cb.CatBoostClassifier(**param)
        gbm.fit(_df_train[train_cols], _df_train["target"], eval_set=[(_df_valid[train_cols], _df_valid["target"])], verbose=0, early_stopping_rounds=100)
        # Use probabilities, not hard class labels, so the ranking metric has scores to rank.
        preds = gbm.predict_proba(_df_valid[train_cols])[:, 1]
        score = comp_score(_df_valid[["target"]], pd.DataFrame(preds, columns=["prediction"]), "")
        scores.append(score)
        
    return np.mean(scores)

In [None]:
if OPTIMIZE_OPTUNA:
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=21, timeout=500)
    print("Number of finished trials: {}".format(len(study.trials)))
    print("Best trial:")
    trial = study.best_trial
    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
cb_params = {
    'objective': 'Logloss',
    # "random_state": 42,
    # "colsample_bylevel": 0.3, # 0.01, 0.1
    "iterations": 228,
    "learning_rate": 0.0341,
    "cat_features": cat_cols,
    "max_depth": 8,
    "l2_leaf_reg": 8.218571,
    "task_type": "GPU",
    # "scale_pos_weight": 2,
    "verbose": 0,
}
cb_scores = []
cb_models = []
for fold in range(N_SPLITS):
    _df_train = df_train[df_train["fold"] != fold].reset_index(drop=True)
    _df_valid = df_train[df_train["fold"] == fold].reset_index(drop=True)
    # model = cb.CatBoostClassifier(**cb_params)
    model = VotingClassifier([(f"cb_{i}", cb.CatBoostClassifier(random_state=i, **cb_params)) for i in range(3)], voting="soft")
    # eval_set=(_df_valid[train_cols], _df_valid["target"]), early_stopping_rounds=50
    model.fit(_df_train[train_cols], _df_train["target"])
    preds = model.predict_proba(_df_valid[train_cols])[:, 1]
    score = comp_score(_df_valid[["target"]], pd.DataFrame(preds, columns=["prediction"]), "")
    print(f"fold: {fold} - Partial AUC Score: {score:.5f}")
    cb_scores.append(score)
    cb_models.append(model)

In [None]:
# mrs = []

# for j in range(7):
#     cb_params = {
#         'objective':         'Logloss',
#         # "random_state":      42,
#         # "colsample_bylevel": 0.3, # 0.01, 0.1
#         "iterations":         random.choice([400,700,1100]),
#         "learning_rate":      random.choice([0.05,0.03,0.01]),
#         "cat_features":       cat_cols,
#         "max_depth":          random.choice([7,8,9]),
#         "l2_leaf_reg":        random.choice([5,4,7]),
#         "task_type":          "GPU",
#         # "scale_pos_weight":  2,
#         "verbose":            0,
#     }

#     cb_scores = []
#     cb_models = []

#     for fold in range(N_SPLITS):
#         _df_train = df_train[df_train["fold"] != fold].reset_index(drop=True)
#         _df_valid = df_train[df_train["fold"] == fold].reset_index(drop=True)
#         model = VotingClassifier([(f"cb_{i}", cb.CatBoostClassifier(random_state=i, **cb_params)) for i in range(3)], voting="soft")
#         model.fit(_df_train[train_cols], _df_train["target"])
#         preds = model.predict_proba(_df_valid[train_cols])[:, 1]
#         score = comp_score(_df_valid[["target"]], pd.DataFrame(preds, columns=["prediction"]), "")
#         print(f"fold: {fold} - Partial AUC Score: {score:.5f}")
#         cb_scores.append(score)
#         cb_models.append(model)
        
#     mrs.append({'models':cb_models, 'params':cb_params, 'scores':cb_scores})
#     print('\n',np.mean(cb_scores),'\n')
    
# _score,i,j  = -1,0,0

# for mr in mrs:
#     _mean = np.mean(mr['scores'])
#     if _mean > _score: 
#         _score = _mean
#         j = i
#     i += 1
    
# cb_scores = mrs[j]['scores']
# cb_models = mrs[j]['models']
# cb_params = mrs[j]['params']

In [None]:
cb_score = np.mean(cb_scores)
print(f"CatBoost Score: {cb_score:.5f}")
# print(f"CatBoost Param: {cb_params:}")

# CatBoost Feature Importances

In [None]:
if DISPLAY_FEATURE_IMPORTANCE:
    # Same here.
    importances = np.mean([model.feature_importances_ for model in cb_models], 0)
    df_imp = pd.DataFrame({"feature": model.feature_names_, "importance": importances}).sort_values("importance").reset_index(drop=True)

    plt.figure(figsize=(16, 12))
    plt.barh(df_imp["feature"], df_imp["importance"])
    plt.show()

# Ensembling
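
The final prediction in this section is a weighted average of the LightGBM and CatBoost probabilities, with hand-picked weights. One possible way to choose such a weight, sketched below, is to scan a grid on out-of-fold predictions. This is an assumption and not what the notebook does: it presumes out-of-fold predictions are stored for both models (the notebook currently keeps them only for LightGBM), it uses a plain pairwise AUC as a stand-in for the competition metric, and all data are hypothetical.

```python
def auc(y_true, y_score):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def blend(a, b, w):
    """Weighted average of two prediction lists; w is the weight of the first."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]


# Hypothetical out-of-fold predictions from two models on the same rows.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
oof_lgb = [0.10, 0.30, 0.20, 0.60, 0.70, 0.40, 0.05, 0.90]
oof_cb = [0.20, 0.10, 0.50, 0.30, 0.60, 0.80, 0.15, 0.70]

weights = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
scores = {w: auc(y_true, blend(oof_lgb, oof_cb, w)) for w in weights}
best_w = max(scores, key=scores.get)
print(f"best weight for the first model: {best_w:.2f}, AUC: {scores[best_w]:.4f}")
```

Because the grid includes 0 and 1, the selected blend is never worse on these rows than either model alone. A weight tuned this way can still overfit the out-of-fold set, so small differences between nearby weights should be read with caution.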

In [None]:
X_cat = category_encoder.transform(df_test[cat_cols])
for c, cat_col in enumerate(cat_cols):
    df_test[cat_col] = X_cat[:, c]

In [None]:



df_3 = pd.read_csv("submission_image3.csv")
df_test["target_3"] = df_3["target"]

df_6 = pd.read_csv("/kaggle/input/nextvit/submission.csv")
df_test["target_nexnetv1b0"] = df_6["target"]
df_7 = pd.read_csv("/kaggle/input/selecsls42b-in1k-drop/submission.csv")
df_test["target_selnetv1b0"] = df_7["target"]

df_eff = pd.read_csv("submission_effnetv1b0.csv")
df_test["target_effnetv1b0"] = df_eff["target"]



# ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~
# More feature engineering of roughly the same kind has not helped so far;
# the more promising direction is to add one or more parallel prediction columns like the ones above.
# ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~ ~ + ~




In [None]:
lgb_preds = np.mean([model.predict_proba(df_test[train_cols])[:, 1] for model in lgb_models], 0)
cb_preds  = np.mean([model.predict_proba(df_test[train_cols])[:, 1] for model in cb_models],  0)

# Earlier weightings tried (LGBM / CatBoost): 0.70 / 0.30 and 0.555 / 0.445.
preds = lgb_preds * 0.595 + cb_preds * 0.405

In [None]:

df_sub = pd.read_csv("/kaggle/input/isic-2024-challenge/sample_submission.csv")
df_sub["target"] = preds
df_sub.to_csv("submission.csv", index=False)