## Skin Lesions Dataset Description

The HAM10000 dataset groups skin lesions into seven diagnostic categories:

- **Actinic Keratoses and Intraepithelial Carcinoma / Bowen's Disease (AKIEC)**
- **Basal Cell Carcinoma (BCC)**
- **Benign Keratosis-like Lesions (BKL)**:
  - *Solar Lentigines*
  - *Seborrheic Keratoses*
  - *Lichen-Planus like Keratoses*
- **Dermatofibroma (DF)**
- **Melanoma (MEL)**
- **Melanocytic Nevi (NV)**
- **Vascular Lesions (VASC)**:
  - *Angiomas*
  - *Angiokeratomas*
  - *Pyogenic Granulomas*
  - *Hemorrhage*

**Diagnosis Confirmation:**

- Over 50% of lesions in this dataset are confirmed through **histopathology (Histo)**, serving as the ground truth for these cases.
- For the remaining cases, confirmation methods include:
  - *Follow-up examination (Follow_Up)*
  - *Expert Consensus (Consensus)*
  - *Confirmation by In-Vivo Confocal Microscopy (Confocal)*
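
The share of each confirmation method can be checked directly from the metadata. The sketch below assumes the methods are stored in a `dx_type` column (as in the dataframes printed later in this notebook) and uses made-up toy values rather than the real file:

```python
import pandas as pd

# Toy `dx_type` column: the method names mirror the list above, the counts are invented.
dx_type = pd.Series(["histo", "histo", "histo", "follow_up", "consensus", "confocal"])

# Fraction of cases confirmed by each method (sums to 1).
confirmation_share = dx_type.value_counts(normalize=True)
histo_share = confirmation_share["histo"]

print(confirmation_share)
```

On the real metadata, the same call would show whether histopathology covers more than half of the cases.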

A single lesion can have several associated images. Images of the same lesion share a value in the `lesion_id` column of the `HAM10000_metadata` file, which makes it possible to track them.
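
As an illustration of tracking by `lesion_id`, the sketch below counts images per lesion with a pandas `groupby`. The identifiers are invented placeholders, not rows from the real metadata:

```python
import pandas as pd

# Toy metadata with the `lesion_id` / `image_id` columns (values are made up).
metadata = pd.DataFrame({
    "lesion_id": ["HAM_A", "HAM_A", "HAM_B", "HAM_C", "HAM_C", "HAM_C"],
    "image_id": ["img_1", "img_2", "img_3", "img_4", "img_5", "img_6"],
})

# Number of images available for each lesion.
images_per_lesion = metadata.groupby("lesion_id")["image_id"].count()

# Lesions that appear in more than one image.
multi_image_lesions = images_per_lesion[images_per_lesion > 1].index.tolist()

print(images_per_lesion.to_dict())
print(multi_image_lesions)
```

Grouping like this is also useful when splitting data, since keeping all images of one lesion in the same subset avoids near-duplicate images leaking between training and evaluation sets.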

Together, the seven lesion categories and four confirmation methods make the dataset a broad resource for research and analysis.


In [1]:
labels_dict = {
    'akiec': "Actinic Keratoses and Intraepithelial Carcinoma / Bowen's Disease (AKIEC)",
    'bcc': "Basal Cell Carcinoma (BCC)",
    'bkl': "Benign Keratosis-like Lesions",
    'df': "Dermatofibroma (DF)",
    'mel': "Melanoma (MEL)",
    'nv': "Melanocytic Nevi (NV)",
    'vasc': "Vascular Lesions",
    'histo': "Confirmed through Histopathology (Histo)",
    'follow_up': "Follow-up examination (Follow_Up)",
    'consensus': "Expert Consensus (Consensus)",
    'confocal': "Confirmation by In-Vivo Confocal Microscopy (Confocal)"
}


In [2]:
labels = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']

# Imports

In [3]:
# Standard library
import glob
import os
import random

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import sklearn
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    multilabel_confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

from tool_preprocessing import *
# from models import *
# import torch
# import torch.nn as nn
# import torchvision.models as models
# from efficientnet_pytorch import EfficientNet
# from torchinfo import summary 

# import torch.optim as optim
# from IPython.display import Image
# from torch.utils.data import DataLoader, Dataset

# from torchvision.datasets import ImageFolder
# from torchvision.transforms import transforms
# import torch.nn.functional as F

# import pydicom
# from PIL import Image

# import cv2

# Preprocessing

In this notebook, labels are initially treated as categorical values.

## Manual Part

If the images are organized into one folder per label, set the following flag to `True`.

In [20]:
flag_folder_sep = False
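
For reference, a folder-per-label layout can be turned into a dataframe roughly as follows. This is only a sketch under stated assumptions: `dataframe_from_folders` is a hypothetical helper, not the `make_dataset_by_folder` function from `tool_preprocessing`, and the `path` and `label` column names are chosen to match the ones used in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def dataframe_from_folders(root, label_column="label"):
    """Build a dataframe with one row per file, labeled by its parent folder name."""
    rows = []
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue  # ignore loose files next to the label folders
        for image_path in sorted(label_dir.iterdir()):
            if image_path.is_file():
                rows.append({"path": str(image_path), label_column: label_dir.name})
    return pd.DataFrame(rows, columns=["path", label_column])


# Tiny demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for label, names in {"mel": ["a.jpg"], "nv": ["b.jpg", "c.jpg"]}.items():
        (Path(tmp) / label).mkdir()
        for name in names:
            (Path(tmp) / label / name).touch()
    demo_df = dataframe_from_folders(tmp)

print(demo_df["label"].tolist())
```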

In [21]:
# base_path = 'C:/Users/lucas/OneDrive - unb.br/Documents/UnB/Semestres-ENE/TCC/COVID_Dataset_original'
base_path = 'C:/Users/lucas/OneDrive - unb.br/Documents/UnB/Semestres-ENE/TCC/The HAM10000 dataset'

In [22]:
if flag_folder_sep:
    # Images are stored in one folder per label.
    label_column = 'label'
    train_df, test_df, val_df = make_dataset_by_folder(base_path=base_path, label_column=label_column)
else:
    # Images and labels are described by metadata files.
    path_train_df = f'{base_path}/HAM10000_metadata'
    path_test_df = f'{base_path}/test.csv'

    path_train = f"{base_path}/treino"
    path_test = f"{base_path}/test"

    paths_image = [path_train, path_test]
    paths_df = [path_train_df, path_test_df]
    label_column = 'dx'

    train_df, test_df, val_df = make_dataset_by_df(paths_image, paths_df, label_column=label_column)

## Analysis

### Train

In [23]:
# image_analysis_train = image_analysis(train_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 10015

Channel Statistics:
Channel 'R':
  - Average: 194.6979202056175
  - Standard Deviation: 22.85509458222392
Channel 'G':
  - Average: 139.26262746509866
  - Standard Deviation: 30.1684115555478
Channel 'B':
  - Average: 145.4852413568536
  - Standard Deviation: 33.90319049131724
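
The statistics above come from the `image_analysis` helper, whose implementation is not shown in this notebook. The following is a minimal sketch of one way to obtain per-channel means and standard deviations, assuming images are loaded as H x W x 3 `uint8` arrays and that the statistics are pooled over all pixels (the real helper may aggregate differently). `channel_statistics` is a hypothetical name.

```python
import numpy as np


def channel_statistics(images):
    """Return per-channel (mean, std) pooled over all pixels of H x W x 3 arrays."""
    # Flatten every image to (num_pixels, 3) and stack them into one array.
    pixels = np.concatenate([img.reshape(-1, 3) for img in images]).astype(np.float64)
    return pixels.mean(axis=0), pixels.std(axis=0)


# Two synthetic 2x2 RGB images with constant colors, standing in for real photos.
image_a = np.tile(np.array([10, 20, 30], dtype=np.uint8), (2, 2, 1))
image_b = np.tile(np.array([30, 40, 50], dtype=np.uint8), (2, 2, 1))

channel_mean, channel_std = channel_statistics([image_a, image_b])
print(channel_mean, channel_std)
```

Values like these are typically used to normalize inputs before training, which is presumably why they are computed separately for each subset.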

In [24]:
train_df = check_images_existence(train_df, path_column='path')

In [25]:
dict_df_train = get_label_counts_and_print(train_df, label_column=label_column)
# shapes_train = analyze_image_shapes(train_df, min_shape=(800, 800), path_column='path')

Total number of images: 9010
Number of unique labels: 7
Label 'nv' has 6034 images.
Label 'mel' has 1001 images.
Label 'bkl' has 989 images.
Label 'bcc' has 462 images.
Label 'akiec' has 294 images.
Label 'vasc' has 127 images.
Label 'df' has 103 images.


### Test

In [26]:
# image_analysis_test = image_analysis(test_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 1511

Channel Statistics:
Channel 'R':
  - Average: 193.96235187636344
  - Standard Deviation: 24.606448726550262
Channel 'G':
  - Average: 141.7550379243572
  - Standard Deviation: 31.9625774054364
Channel 'B':
  - Average: 147.87214814814814
  - Standard Deviation: 35.78649271231254

In [27]:
test_df = check_images_existence(test_df, path_column='path')

Image not found in folder: None
Removed lines:
lesion_id       HAMTEST_0000496
image_id           ISIC_0035068
dx                           nv
dx_type               consensus
age                         NaN
sex                         NaN
localization                NaN
dataset                     NaN
path                       None
Name: 534, dtype: object
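
The output above shows a row whose `path` is `None` being dropped. `check_images_existence` is defined in `tool_preprocessing` and is not shown here; the sketch below illustrates comparable filtering under the assumption that a row is kept only when its path is a string pointing to an existing file. `drop_missing_images` is a hypothetical name.

```python
import os
import tempfile

import pandas as pd


def drop_missing_images(df, path_column="path"):
    """Keep only rows whose path is a string pointing to an existing file."""
    exists = df[path_column].apply(lambda p: isinstance(p, str) and os.path.isfile(p))
    return df[exists]


with tempfile.TemporaryDirectory() as tmp:
    # One real (empty) file, one missing path, and one path that does not exist.
    real_path = os.path.join(tmp, "ISIC_demo.jpg")
    open(real_path, "w").close()
    demo = pd.DataFrame({
        "image_id": ["a", "b", "c"],
        "path": [real_path, None, os.path.join(tmp, "missing.jpg")],
    })
    filtered = drop_missing_images(demo)

print(filtered["image_id"].tolist())
```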


In [28]:
dict_df_test = get_label_counts_and_print(test_df, label_column=label_column)
# shapes_test = analyze_image_shapes(test_df, min_shape=(461, 601), path_column='path')

Total number of images: 1511
Number of unique labels: 7
Label 'nv' has 908 images.
Label 'bkl' has 217 images.
Label 'mel' has 171 images.
Label 'bcc' has 93 images.
Label 'df' has 44 images.
Label 'akiec' has 43 images.
Label 'vasc' has 35 images.


Average image shape - Height: 450.0, Width: 600.0
Number of images with shape smaller than (461, 601): 1511


### Validation

In [29]:
# image_analysis_val = image_analysis(val_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 1005

Channel Statistics:
Channel 'R':
  - Average: 194.84180818500093
  - Standard Deviation: 23.428856516708883
Channel 'G':
  - Average: 139.9250088557214
  - Standard Deviation: 30.539483110296192
Channel 'B':
  - Average: 146.21765087525336
  - Standard Deviation: 34.290640143609394

In [30]:
val_df = check_images_existence(val_df, path_column='path')

In [31]:
dict_df_val = get_label_counts_and_print(val_df, label_column=label_column)
# shapes_val = analyze_image_shapes(val_df, min_shape=(461, 601), path_column='path')

Total number of images: 1005
Number of unique labels: 7
Label 'nv' has 671 images.
Label 'mel' has 112 images.
Label 'bkl' has 110 images.
Label 'bcc' has 52 images.
Label 'akiec' has 33 images.
Label 'vasc' has 15 images.
Label 'df' has 12 images.
Average image shape - Height: 450.0, Width: 600.0
Number of images with shape smaller than (461, 601): 1005


## Model Preparation

In [None]:
from model_processing import *

Convert from categorical to binary

Weights for the loss
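
The training set is strongly imbalanced (6034 `nv` images against 103 `df` images), which is the usual motivation for weighting the loss. One common heuristic, shown here as a sketch and not necessarily the scheme this project uses, is the "balanced" weight `total / (num_classes * count)`:

```python
import numpy as np

# Training-set label counts as printed in the Train analysis of this notebook.
train_counts = {
    "nv": 6034, "mel": 1001, "bkl": 989, "bcc": 462,
    "akiec": 294, "vasc": 127, "df": 103,
}

counts = np.array(list(train_counts.values()), dtype=np.float64)

# Rare classes receive proportionally larger weights.
class_weights = counts.sum() / (len(counts) * counts)
weights_by_label = dict(zip(train_counts, class_weights))

for label, weight in weights_by_label.items():
    print(f"{label}: {weight:.3f}")
```

With this choice every class contributes the same total weight, so the loss is not dominated by `nv`.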

In [None]:
def labels2dict(dataframe, label_column):
    """Map each unique label in `label_column` to an integer index."""
    labels = dataframe[label_column].unique()
    labels_dict = {}
    for index, label in enumerate(labels):
        labels_dict[label] = index

    return labels_dict

In [46]:
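# Map each label to an integer index, in order of first appearance in train_df.
# Note: this rebinds `labels` and `labels_dict` defined at the top of the notebook.
labels = train_df[label_column].unique()
labels_dict = {label: index for index, label in enumerate(labels)}

labels_dict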

{'nv': 0, 'mel': 1, 'bkl': 2, 'bcc': 3, 'akiec': 4, 'vasc': 5, 'df': 6}

In [47]:
test_df.head()

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization,dataset,path
0,HAMTEST_0000000,ISIC_0034524,nv,follow_up,40.0,female,back,vidir_molemax,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
1,HAMTEST_0000001,ISIC_0034525,nv,histo,70.0,male,abdomen,rosendahl,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
2,HAMTEST_0000002,ISIC_0034526,bkl,histo,70.0,male,back,rosendahl,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
3,HAMTEST_0000003,ISIC_0034527,nv,histo,35.0,male,trunk,vienna_dias,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
4,HAMTEST_0000004,ISIC_0034528,nv,follow_up,75.0,female,trunk,vidir_molemax,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...


In [None]:
[labels_dict[item] if item in labels_dict else item for item in list(train_df[label_column])]

In [57]:
def categorial2number(dataframe, labels_dict, label_column):
    """Return the labels in `label_column` as a list of integer indices.

    Labels missing from `labels_dict` are kept unchanged.
    """
    return [labels_dict.get(item, item) for item in dataframe[label_column]]


In [58]:
def dflabel2number(df_lists, labels_dict, label_column):
    """Convert the labels of three dataframes to lists of integer indices."""
    labels1 = categorial2number(df_lists[0], labels_dict, label_column)
    labels2 = categorial2number(df_lists[1], labels_dict, label_column)
    labels3 = categorial2number(df_lists[2], labels_dict, label_column)

    return labels1, labels2, labels3

In [59]:
train_label, test_label, val_label = dflabel2number([train_df, test_df, val_df], labels_dict, label_column)
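
The same conversion can also be sketched with pandas' `Series.map`, which additionally makes it easy to decode predictions back to label names. The toy `dx` values below are illustrative. Note one behavioral difference: `map` turns labels missing from the dictionary into `NaN`, whereas `categorial2number` keeps them unchanged.

```python
import pandas as pd

# Same mapping as the `labels_dict` shown in this notebook.
labels_dict = {'nv': 0, 'mel': 1, 'bkl': 2, 'bcc': 3, 'akiec': 4, 'vasc': 5, 'df': 6}

# Toy dataframe standing in for train_df / test_df / val_df.
demo_df = pd.DataFrame({"dx": ["nv", "mel", "df", "bkl"]})

# Categorical label -> integer index.
encoded = demo_df["dx"].map(labels_dict)

# Integer index -> categorical label, e.g. for reading model predictions.
index_to_label = {index: label for label, index in labels_dict.items()}
decoded = encoded.map(index_to_label)

print(encoded.tolist())
print(decoded.tolist())
```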

In [62]:
labels_dict

{'nv': 0, 'mel': 1, 'bkl': 2, 'bcc': 3, 'akiec': 4, 'vasc': 5, 'df': 6}

In [61]:
train_df

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization,dataset,path
1995,HAM_0000466,ISIC_0027749,nv,follow_up,45.0,female,lower extremity,vidir_molemax,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
2102,HAM_0000690,ISIC_0030182,nv,consensus,30.0,female,back,vidir_modern,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
3489,HAM_0005306,ISIC_0027056,nv,follow_up,60.0,male,back,vidir_molemax,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
2705,HAM_0003201,ISIC_0030386,nv,follow_up,50.0,female,trunk,vidir_molemax,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
7032,HAM_0006383,ISIC_0025183,nv,follow_up,75.0,female,trunk,vidir_molemax,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
...,...,...,...,...,...,...,...,...,...
9090,HAM_0003555,ISIC_0030321,df,histo,30.0,male,upper extremity,rosendahl,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
948,HAM_0004317,ISIC_0025504,df,histo,50.0,female,lower extremity,rosendahl,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
7594,HAM_0006371,ISIC_0033780,df,consensus,35.0,female,lower extremity,vidir_modern,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
3862,HAM_0004103,ISIC_0028880,df,histo,55.0,male,lower extremity,vidir_modern,C:/Users/lucas/OneDrive - unb.br/Documents/UnB...
