## Skin Lesions Dataset Description

The dataset comprises various types of skin lesions, each falling under specific categories:

- **Actinic Keratoses and Intraepithelial Carcinoma / Bowen's Disease (AKIEC)**
- **Basal Cell Carcinoma (BCC)**
- **Benign Keratosis-like Lesions (BKL)**:
  - *Solar Lentigines*
  - *Seborrheic Keratoses*
  - *Lichen-Planus like Keratoses*
- **Dermatofibroma (DF)**
- **Melanoma (MEL)**
- **Melanocytic Nevi (NV)**
- **Vascular Lesions (VASC)**:
  - *Angiomas*
  - *Angiokeratomas*
  - *Pyogenic Granulomas*
  - *Hemorrhage*

**Diagnosis Confirmation:**

- Over 50% of lesions in this dataset are confirmed through **histopathology (Histo)**, serving as the ground truth for these cases.
- For the remaining cases, confirmation methods include:
  - *Follow-up examination (Follow_Up)*
  - *Expert Consensus (Consensus)*
  - *Confirmation by In-Vivo Confocal Microscopy (Confocal)*

A lesion may have several associated images; the `lesion_id` column in the `HAM10000_metadata` file identifies which images belong to the same lesion.

Together, the seven diagnostic categories and the four confirmation methods make the dataset a comprehensive resource for research and analysis.
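Because one lesion can contribute several images, it can help to count images per lesion before splitting, so that images of the same lesion do not end up in both the training and validation sets. Whether `make_dataset_by_df` does this is not shown in this notebook. The sketch below uses a tiny made-up CSV excerpt (the IDs are illustrative, not real rows from `HAM10000_metadata`) and only the standard library; `images_per_lesion` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the real metadata file has more columns and rows.
sample_csv = """lesion_id,image_id,dx
HAM_A,ISIC_1,bkl
HAM_A,ISIC_2,bkl
HAM_B,ISIC_3,nv
HAM_C,ISIC_4,mel
HAM_C,ISIC_5,mel
HAM_C,ISIC_6,mel
"""


def images_per_lesion(csv_text):
    """Map each lesion_id to the list of image_ids that belong to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["lesion_id"]].append(row["image_id"])
    return dict(groups)


groups = images_per_lesion(sample_csv)
# Lesions with more than one image are the ones that matter for leakage.
multi_image = {k: v for k, v in groups.items() if len(v) > 1}
print(len(groups), "lesions,", len(multi_image), "with more than one image")
```

With the metadata already loaded in a pandas DataFrame, `df.groupby('lesion_id').size()` gives the same per-lesion counts.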


In [22]:
labels_dict = {
    'akiec': "Actinic Keratoses and Intraepithelial Carcinoma / Bowen's Disease (AKIEC)",
    'bcc': "Basal Cell Carcinoma (BCC)",
    'bkl': "Benign Keratosis-like Lesions (BKL)",
    'df': "Dermatofibroma (DF)",
    'mel': "Melanoma (MEL)",
    'nv': "Melanocytic Nevi (NV)",
    'vasc': "Vascular Lesions (VASC)",
    'histo': "Confirmed through Histopathology (Histo)",
    'follow_up': "Follow-up examination (Follow_Up)",
    'consensus': "Expert Consensus (Consensus)",
    'confocal': "Confirmation by In-Vivo Confocal Microscopy (Confocal)"
}


In [23]:
labels = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']

# Imports

In [24]:
import os
import glob
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import PIL
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import multilabel_confusion_matrix

from tool_preprocessing import *
# from models import *
# import torch
# import torch.nn as nn
# import torchvision.models as models
# from efficientnet_pytorch import EfficientNet
# from torchinfo import summary 

# import torch.optim as optim
# from IPython.display import Image
# from torch.utils.data import DataLoader, Dataset

# from torchvision.datasets import ImageFolder
# from torchvision.transforms import transforms
# import torch.nn.functional as F

# import pydicom
# from PIL import Image

# import cv2

# Preprocessing

In this notebook, labels are initially considered as categorical.

## Manual Part

Set the following flag to `True` if the images are organized into one folder per label. Otherwise, the labels are read from the metadata files.

In [25]:
flag_folder_sep = False
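For reference, a folder-per-label layout can be turned into path/label records along the following lines. This is only a sketch: `records_from_label_folders` is a hypothetical helper, not the `make_dataset_by_folder` function from `tool_preprocessing`, whose implementation is not shown here, and the accepted file extensions are an assumption.

```python
from pathlib import Path
import tempfile

# Assumed set of image extensions; adjust to the actual dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def records_from_label_folders(root, label_column="label"):
    """Return one record per image, using the parent folder name as the label."""
    records = []
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        for img in sorted(label_dir.iterdir()):
            if img.suffix.lower() in IMAGE_EXTENSIONS:
                records.append({"path": str(img), label_column: label_dir.name})
    return records


# Build a tiny throwaway layout to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    for label, names in {"nv": ["a.jpg", "b.jpg"], "mel": ["c.jpg"]}.items():
        (Path(tmp) / label).mkdir()
        for name in names:
            (Path(tmp) / label / name).touch()
    demo_records = records_from_label_folders(tmp)

print([r["label"] for r in demo_records])
```

Passing the resulting list to `pd.DataFrame(...)` yields a frame with `path` and `label` columns.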

In [26]:
# base_path = 'C:/Users/lucas/OneDrive - unb.br/Documents/UnB/Semestres-ENE/TCC/COVID_Dataset_original'
base_path = 'C:/Users/lucas/OneDrive - unb.br/Documents/UnB/Semestres-ENE/TCC/The HAM10000 dataset'

In [27]:
if flag_folder_sep:
    # Images are stored in one folder per label
    label_column = 'label'
    train_df, test_df, val_df = make_dataset_by_folder(base_path=base_path, label_column=label_column)
else:
    # Labels come from the metadata files; the 'dx' column holds the diagnosis
    path_train_df = f'{base_path}/HAM10000_metadata'
    path_test_df = f'{base_path}/test.csv'

    path_train = f"{base_path}/treino"  # training images ("treino" means "training")
    path_test = f"{base_path}/test"

    paths_image = [path_train, path_test]
    paths_df = [path_train_df, path_test_df]
    label_column = 'dx'

    train_df, test_df, val_df = make_dataset_by_df(paths_image, paths_df, label_column=label_column)

## Analysis

### Train

In [28]:
# image_analysis_train = image_analysis(train_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 10015

Channel Statistics:
Channel 'R':
  - Average: 194.6979202056175
  - Standard Deviation: 22.85509458222392
Channel 'G':
  - Average: 139.26262746509866
  - Standard Deviation: 30.1684115555478
Channel 'B':
  - Average: 145.4852413568536
  - Standard Deviation: 33.90319049131724
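Per-channel statistics like these are often rescaled to the 0-1 range so they can be used for input normalization. The sketch below only does that arithmetic on the values printed above. It assumes the statistics are pixel-level values on a 0-255 scale, which is not confirmed by the `image_analysis` output, and `to_unit_scale` is a hypothetical helper. The transforms used later in this notebook do not normalize, so this is optional.

```python
# Per-channel statistics printed above (assumed to be on a 0-255 scale).
train_mean_255 = {"R": 194.6979202056175, "G": 139.26262746509866, "B": 145.4852413568536}
train_std_255 = {"R": 22.85509458222392, "G": 30.1684115555478, "B": 33.90319049131724}


def to_unit_scale(stats_255):
    """Rescale 0-255 statistics to the 0-1 range, in R, G, B order."""
    return [round(stats_255[c] / 255.0, 4) for c in ("R", "G", "B")]


mean_01 = to_unit_scale(train_mean_255)
std_01 = to_unit_scale(train_std_255)
print("mean:", mean_01)
print("std: ", std_01)
```

If RGB inputs are used, lists like these could be passed to `transforms.Normalize(mean, std)` after `transforms.ToTensor()`.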

In [29]:
train_df = check_images_existence(train_df, path_column='path')

In [30]:
dict_train_qntd = get_label_counts_and_print(train_df, label_column=label_column)
# shapes_train = analyze_image_shapes(train_df, min_shape=(800, 800), path_column='path')

Total number of images: 9010
Number of unique labels: 7
Label 'nv' has 6034 images.
Label 'mel' has 1001 images.
Label 'bkl' has 989 images.
Label 'bcc' has 462 images.
Label 'akiec' has 294 images.
Label 'vasc' has 127 images.
Label 'df' has 103 images.


In [31]:
dict_train_qntd

{'nv': 6034,
 'mel': 1001,
 'bkl': 989,
 'bcc': 462,
 'akiec': 294,
 'vasc': 127,
 'df': 103}

### Test

In [32]:
# image_analysis_test = image_analysis(test_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 1511

Channel Statistics:
Channel 'R':
  - Average: 193.96235187636344
  - Standard Deviation: 24.606448726550262
Channel 'G':
  - Average: 141.7550379243572
  - Standard Deviation: 31.9625774054364
Channel 'B':
  - Average: 147.87214814814814
  - Standard Deviation: 35.78649271231254

In [33]:
test_df = check_images_existence(test_df, path_column='path')

Image not found in folder: None
Removed lines:
lesion_id       HAMTEST_0000496
image_id           ISIC_0035068
dx                           nv
dx_type               consensus
age                         NaN
sex                         NaN
localization                NaN
dataset                     NaN
path                       None
Name: 534, dtype: object


In [34]:
dict_test_qntd = get_label_counts_and_print(test_df, label_column=label_column)
# shapes_test = analyze_image_shapes(test_df, min_shape=(461, 601), path_column='path')

Total number of images: 1511
Number of unique labels: 7
Label 'nv' has 908 images.
Label 'bkl' has 217 images.
Label 'mel' has 171 images.
Label 'bcc' has 93 images.
Label 'df' has 44 images.
Label 'akiec' has 43 images.
Label 'vasc' has 35 images.


### Validation

In [35]:
# image_analysis_val = image_analysis(val_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 1005

Channel Statistics:
Channel 'R':
  - Average: 194.84180818500093
  - Standard Deviation: 23.428856516708883
Channel 'G':
  - Average: 139.9250088557214
  - Standard Deviation: 30.539483110296192
Channel 'B':
  - Average: 146.21765087525336
  - Standard Deviation: 34.290640143609394

In [36]:
val_df = check_images_existence(val_df, path_column='path')

In [37]:
dict_val_qntd = get_label_counts_and_print(val_df, label_column=label_column)
# shapes_val = analyze_image_shapes(val_df, min_shape=(461, 601), path_column='path')

Total number of images: 1005
Number of unique labels: 7
Label 'nv' has 671 images.
Label 'mel' has 112 images.
Label 'bkl' has 110 images.
Label 'bcc' has 52 images.
Label 'akiec' has 33 images.
Label 'vasc' has 15 images.
Label 'df' has 12 images.
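The counts printed for the three splits are easier to compare as fractions. The sketch below copies the counts from the outputs above; `fractions` is a hypothetical helper. On these numbers, the training and validation splits have nearly the same class proportions, while the test split has a smaller share of `nv`.

```python
# Counts copied from the outputs printed above.
train_counts = {'nv': 6034, 'mel': 1001, 'bkl': 989, 'bcc': 462, 'akiec': 294, 'vasc': 127, 'df': 103}
val_counts = {'nv': 671, 'mel': 112, 'bkl': 110, 'bcc': 52, 'akiec': 33, 'vasc': 15, 'df': 12}
test_counts = {'nv': 908, 'bkl': 217, 'mel': 171, 'bcc': 93, 'df': 44, 'akiec': 43, 'vasc': 35}


def fractions(counts):
    """Return each label's share of the total number of images."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in [("train", train_counts), ("val", val_counts), ("test", test_counts)]:
    frac = fractions(counts)
    print(name, {label: round(value, 3) for label, value in sorted(frac.items())})
```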


## Model Preparation

In [38]:
from model_preprocessing import *

Convert the labels from categorical to binary.

Compute the weights for the loss.

### Categorical to number

In [39]:
labels_dict = labels2dict(train_df, label_column)
labels_dict

{'nv': 0, 'mel': 1, 'bkl': 2, 'bcc': 3, 'akiec': 4, 'vasc': 5, 'df': 6}

In [40]:
train_label, test_label, val_label = dflabel2number([train_df, test_df, val_df], labels_dict, label_column)
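The mapping printed above coincides with ordering the labels by decreasing frequency in the training set. A minimal sketch that reproduces it is shown below. The internals of `labels2dict` and `dflabel2number` are not shown in this notebook, so `labels_to_index` and `encode` are hypothetical stand-ins, and the frequency ordering is an inference from the output.

```python
# Training-set counts copied from the output printed earlier.
train_counts = {'nv': 6034, 'mel': 1001, 'bkl': 989, 'bcc': 462, 'akiec': 294, 'vasc': 127, 'df': 103}


def labels_to_index(counts):
    """Assign 0..K-1 to labels in order of decreasing count (ties broken by name)."""
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def encode(labels, mapping):
    """Replace each categorical label with its integer index."""
    return [mapping[label] for label in labels]


mapping = labels_to_index(train_counts)
print(mapping)
print(encode(['mel', 'nv', 'df'], mapping))
```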

### Weights

In [45]:
weights = calculate_weights(train_df, labels_dict, dict_train_qntd)
if len(labels_dict) == 1:
    # Single-label case: keep only the largest weight
    weights = max(weights)
else:
    print(weights)

[0.21331502438562433, 1.2858570001427145, 1.3014589050989456, 2.7860235003092146, 4.3780369290573375, 10.13498312710911, 12.496532593619973]
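The printed weights are consistent with the inverse-frequency formula `n_samples / (n_classes * count)`, the same heuristic scikit-learn uses for `class_weight='balanced'`. This is inferred from the output; the source of `calculate_weights` is not shown here, so `balanced_weights` below is a hypothetical re-implementation under that assumption.

```python
# Counts and label indices copied from the outputs printed above.
train_counts = {'nv': 6034, 'mel': 1001, 'bkl': 989, 'bcc': 462, 'akiec': 294, 'vasc': 127, 'df': 103}
label_index = {'nv': 0, 'mel': 1, 'bkl': 2, 'bcc': 3, 'akiec': 4, 'vasc': 5, 'df': 6}


def balanced_weights(counts, mapping):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    weights = [0.0] * n_classes
    for label, index in mapping.items():
        weights[index] = n_samples / (n_classes * counts[label])
    return weights


weights_sketch = balanced_weights(train_counts, label_index)
print(weights_sketch)
```

Rare classes such as `df` receive a weight of about 12.5, while the dominant `nv` class is down-weighted to about 0.21, so errors on rare classes count more in a weighted loss.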


# Model

In [42]:
from models import *

ModuleNotFoundError: No module named 'torch'

## Dataset Class

In [None]:
import PIL.Image
import torch
from torch.utils.data import Dataset
from torchvision import transforms


class CT_Dataset(Dataset):
    def __init__(self, img_path, img_labels, img_transforms=None, grayscale=True):
        self.img_path = img_path
        self.img_labels = torch.Tensor(img_labels)
        if img_transforms is not None:
            # Custom transforms take priority (previously ignored when grayscale=False)
            self.transforms = img_transforms
        elif grayscale:
            # Default: single-channel 250x250 tensor
            self.transforms = transforms.Compose([transforms.Grayscale(),
                                                  transforms.Resize((250, 250)),
                                                  transforms.ToTensor()])
        else:
            # Default: three-channel 250x250 tensor
            self.transforms = transforms.Compose([transforms.Resize((250, 250)),
                                                  transforms.ToTensor()])

    def __getitem__(self, index):
        # Load the image and apply the transforms
        cur_path = self.img_path[index]
        cur_img = PIL.Image.open(cur_path).convert('RGB')
        cur_img = self.transforms(cur_img)

        return cur_img, self.img_labels[index]

    def __len__(self):
        return len(self.img_path)

In [None]:
train_dataset = CT_Dataset(img_path=np.array(train_df['path']), img_labels=np.array(train_label))
val_dataset = CT_Dataset(img_path=np.array(val_df['path']), img_labels=np.array(val_label))
test_dataset = CT_Dataset(img_path=np.array(test_df['path']), img_labels=np.array(test_label))

## GPU

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device 

In [None]:
print("Current GPU memory usage:", torch.cuda.memory_allocated() / (1024 ** 2), "MB")
print("Max GPU memory usage:", torch.cuda.max_memory_allocated() / (1024 ** 2), "MB")

torch.cuda.empty_cache()

## Random Seed

In [None]:
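random_seed = 124
random.seed(random_seed)     # Python's built-in RNG
np.random.seed(random_seed)  # NumPy RNG

torch.manual_seed(random_seed)  # PyTorch RNGs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # disable autotuning, which can vary between runs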

## Training

In [None]:
from train_function import *

In [None]:
batch_size = 32
epoch = 1

model_kernel = ResNet50(num_classes=7)
# model_kernel = ResNet101(num_classes=7)
# model_kernel = EfficientNetB(num_classes=7)
# model_kernel = EfficientNetB4(num_classes=7)
# model_kernel = EfficientNetB7(num_classes=7)
# model_kernel = VGG16(num_classes=7)

path_save_model = f'C:/Users/Lucas/Documents/PIBIC/DATASET/NIH-CHEST/model_/{model_kernel.get_name()}_{epoch}'
hist_kernel, model_kernel = train_model(
    model_kernel, train_dataset, val_dataset, test_dataset, device,
    path_save_model, weights, batch_size=batch_size, epochs=epoch,
)