## Skin Lesions Dataset Description

The HAM10000 dataset groups skin lesions into seven diagnostic categories:

- **Actinic Keratoses and Intraepithelial Carcinoma / Bowen's Disease (AKIEC)**
- **Basal Cell Carcinoma (BCC)**
- **Benign Keratosis-like Lesions (BKL)**:
  - *Solar Lentigines*
  - *Seborrheic Keratoses*
  - *Lichen Planus-like Keratoses*
- **Dermatofibroma (DF)**
- **Melanoma (MEL)**
- **Melanocytic Nevi (NV)**
- **Vascular Lesions (VASC)**:
  - *Angiomas*
  - *Angiokeratomas*
  - *Pyogenic Granulomas*
  - *Hemorrhage*

**Diagnosis Confirmation:**

- Over 50% of lesions in this dataset are confirmed through **histopathology (Histo)**, serving as the ground truth for these cases.
- For the remaining cases, confirmation methods include:
  - *Follow-up examination (Follow_Up)*
  - *Expert Consensus (Consensus)*
  - *Confirmation by In-Vivo Confocal Microscopy (Confocal)*

A lesion may have several associated images. Images of the same lesion share a value in the `lesion_id` column of the `HAM10000_metadata` file, which makes it possible to track them.
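Because several images can belong to one lesion, it can help to count images per lesion before splitting the data. A minimal sketch, assuming the metadata exposes `lesion_id`, `image_id` and `dx` columns; the rows below are invented placeholders, not real dataset entries:

```python
import pandas as pd

# Toy stand-in for the metadata table; the IDs are invented for illustration.
metadata = pd.DataFrame({
    "lesion_id": ["L1", "L1", "L2", "L3", "L3", "L3"],
    "image_id": ["img_a", "img_b", "img_c", "img_d", "img_e", "img_f"],
    "dx": ["bkl", "bkl", "nv", "mel", "mel", "mel"],
})

# Number of images recorded for each lesion.
images_per_lesion = metadata.groupby("lesion_id")["image_id"].count()

# Lesions that appear in more than one image.
multi_image_lesions = images_per_lesion[images_per_lesion > 1].index.tolist()

print(images_per_lesion.to_dict())
print(multi_image_lesions)
```

Keeping all images of a lesion in the same split avoids near-duplicate images leaking between training and evaluation data.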

The dataset thus spans seven diagnostic categories and four confirmation methods, making it a comprehensive resource for research and analysis.


In [1]:
labels_dict = {
    'akiec': "Actinic Keratoses and Intraepithelial Carcinoma / Bowen's Disease (AKIEC)",
    'bcc': "Basal Cell Carcinoma (BCC)",
    'bkl': "Benign Keratosis-like Lesions",
    'df': "Dermatofibroma (DF)",
    'mel': "Melanoma (MEL)",
    'nv': "Melanocytic Nevi (NV)",
    'vasc': "Vascular Lesions",
    'histo': "Confirmed through Histopathology (Histo)",
    'follow_up': "Follow-up examination (Follow_Up)",
    'consensus': "Expert Consensus (Consensus)",
    'confocal': "Confirmation by In-Vivo Confocal Microscopy (Confocal)"
}


In [2]:
labels = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']

# Imports

In [3]:
import glob
import os
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import sklearn
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    multilabel_confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Local helper module; the dataset and analysis functions used below are expected to come from here.
from tool_preprocessing import *
# from models import *
# import torch
# import torch.nn as nn
# import torchvision.models as models
# from efficientnet_pytorch import EfficientNet
# from torchinfo import summary 

# import torch.optim as optim
# from IPython.display import Image
# from torch.utils.data import DataLoader, Dataset

# from torchvision.datasets import ImageFolder
# from torchvision.transforms import transforms
# import torch.nn.functional as F

# import pydicom
# from PIL import Image

# import cv2

# Preprocessing

In this notebook, labels are initially treated as categorical values.
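Categorical labels usually need an integer encoding before they can be fed to a model or a metric. A minimal sketch using the `labels` list defined earlier; `label_to_index`, `index_to_label` and `encode` are hypothetical names introduced only for this illustration:

```python
# Same label order as the `labels` list defined earlier in the notebook.
labels = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']

# Hypothetical helper names, introduced only for this illustration.
label_to_index = {name: idx for idx, name in enumerate(labels)}
index_to_label = {idx: name for name, idx in label_to_index.items()}


def encode(label_values):
    """Convert an iterable of label strings to integer class indices."""
    return [label_to_index[value] for value in label_values]


encoded = encode(['nv', 'mel', 'akiec'])
print(encoded)
```

Fixing the order in one list keeps the index assigned to each class stable across the train, test and validation splits.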

## Manual Part

If the images are organized into one folder per label, set the following flag to `True`.

In [4]:
flag_folder_sep = True
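For reference, a folder-per-label layout can be turned into a table of paths and labels roughly as follows. This is a sketch, not the implementation of `make_dataset_by_folder`: the function name `list_images_by_folder`, the accepted extensions and the use of the folder name as the label are all assumptions.

```python
import tempfile
from pathlib import Path


def list_images_by_folder(base_path, extensions=(".png", ".jpg", ".jpeg")):
    """Return one {'path', 'label'} record per image, using the parent folder name as label."""
    records = []
    for label_dir in sorted(Path(base_path).iterdir()):
        if not label_dir.is_dir():
            continue
        for image_path in sorted(label_dir.iterdir()):
            if image_path.suffix.lower() in extensions:
                records.append({"path": str(image_path), "label": label_dir.name})
    return records


# Small self-contained demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for label, names in {"COVID": ["a.png"], "Normal": ["b.png", "c.png"]}.items():
        (Path(tmp) / label).mkdir()
        for name in names:
            (Path(tmp) / label / name).touch()
    records = list_images_by_folder(tmp)

print([(Path(r["path"]).name, r["label"]) for r in records])
```

The records can be passed to `pd.DataFrame(records)` to obtain `path` and `label` columns like those used in the later cells.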

In [5]:
base_path = 'C:/Users/lucas/OneDrive - unb.br/Documents/UnB/Semestres-ENE/TCC/COVID_Dataset_original'
# base_path = 'C:/Users/lucas/OneDrive - unb.br/Documents/UnB/Semestres-ENE/TCC/The HAM10000 dataset'

In [6]:
if flag_folder_sep:
    # Images are organized in one folder per label.
    label_column = 'label'
    train_df, test_df, val_df = make_dataset_by_folder(base_path=base_path, label_column=label_column)
else:
    # Labels come from metadata files; 'dx' is the diagnosis column.
    path_train_df = f'{base_path}/HAM10000_metadata'
    path_test_df = f'{base_path}/test.csv'

    path_train = f"{base_path}/treino"  # 'treino' is Portuguese for 'training'
    path_test = f"{base_path}/test"

    paths_image = [path_train, path_test]
    paths_df = [path_train_df, path_test_df]
    label_column = 'dx'

    train_df, test_df, val_df = make_dataset_by_df(paths_image, paths_df, label_column=label_column)

Is equal
Is equal
Is equal
Is equal
Is equal
Is equal
Label 'COVID' is not equal. Expected - 361, Actual - 362
Is equal
Label 'Normal' is not equal. Expected - 1019, Actual - 1020
Label 'Lung_Opacity' is not equal. Expected - 601, Actual - 602
Label 'COVID' is not equal. Expected - 361, Actual - 362
Label 'Viral_Pneumonia' is not equal. Expected - 134, Actual - 135
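The off-by-one messages above are consistent with an 80/10/10 split in which the expected per-label count is the truncated fraction of that label's total. This is inferred from the label counts printed later in this notebook, not from the helper's source, so the check below is only a sketch under that assumption:

```python
# Per-split label counts as printed by the notebook for train, test and validation.
counts = {
    "train": {"Normal": 8153, "Lung_Opacity": 4809, "COVID": 2892, "Viral_Pneumonia": 1076},
    "test": {"Normal": 1019, "Lung_Opacity": 601, "COVID": 362, "Viral_Pneumonia": 134},
    "val": {"Normal": 1020, "Lung_Opacity": 602, "COVID": 362, "Viral_Pneumonia": 135},
}
# Assumed split percentages (an inference, not taken from the helper's code).
split_percent = {"train": 80, "test": 10, "val": 10}

mismatches = []
for split, per_label in counts.items():
    for label, actual in per_label.items():
        total = sum(counts[s][label] for s in counts)
        # Integer arithmetic truncates the fractional part, e.g. 361.6 -> 361.
        expected = total * split_percent[split] // 100
        if expected != actual:
            mismatches.append((split, label, expected, actual))

for item in mismatches:
    print(item)
```

Under this assumption every mismatch is a single image, which suggests a rounding effect in the split rather than missing files.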


## Analysis

### Train

In [7]:
# image_analysis_train = image_analysis(train_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 10015

Channel Statistics:
Channel 'R':
  - Average: 194.6979202056175
  - Standard Deviation: 22.85509458222392
Channel 'G':
  - Average: 139.26262746509866
  - Standard Deviation: 30.1684115555478
Channel 'B':
  - Average: 145.4852413568536
  - Standard Deviation: 33.90319049131724
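Per-channel statistics like the ones above can be computed in a single pass over the images. The sketch below is not the implementation of `image_analysis`; it assumes RGB `uint8` arrays and computes the mean and standard deviation over all pixels of all images, which may differ from how the helper aggregates them. `channel_statistics` is a hypothetical name.

```python
import numpy as np


def channel_statistics(images):
    """Compute per-channel mean and standard deviation over all pixels of all images.

    `images` is an iterable of uint8 arrays shaped (height, width, 3) in RGB order.
    """
    pixel_count = 0
    channel_sum = np.zeros(3, dtype=np.float64)
    channel_sq_sum = np.zeros(3, dtype=np.float64)
    for image in images:
        pixels = image.reshape(-1, 3).astype(np.float64)
        pixel_count += pixels.shape[0]
        channel_sum += pixels.sum(axis=0)
        channel_sq_sum += (pixels ** 2).sum(axis=0)
    mean = channel_sum / pixel_count
    # Var = E[x^2] - E[x]^2; clip guards against tiny negative values from rounding.
    std = np.sqrt(np.clip(channel_sq_sum / pixel_count - mean ** 2, 0.0, None))
    return mean, std


# Synthetic images standing in for real data.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(4)]
mean, std = channel_statistics(images)
print(dict(zip("RGB", mean.round(2))), dict(zip("RGB", std.round(2))))
```

Accumulating sums instead of stacking all images keeps memory use constant, which matters for datasets with thousands of images.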

In [8]:
train_df = check_images_existence(train_df, path_column='path')

In [9]:
dict_df_train = get_label_counts_and_print(train_df, label_column=label_column)
# shapes_train = analyze_image_shapes(train_df, min_shape=(800, 800), path_column='path')

Total number of images: 16930
Number of unique labels: 4
Label 'Normal' has 8153 images.
Label 'Lung_Opacity' has 4809 images.
Label 'COVID' has 2892 images.
Label 'Viral_Pneumonia' has 1076 images.


### Test

In [10]:
# image_analysis_test = image_analysis(test_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 1511

Channel Statistics:
Channel 'R':
  - Average: 193.96235187636344
  - Standard Deviation: 24.606448726550262
Channel 'G':
  - Average: 141.7550379243572
  - Standard Deviation: 31.9625774054364
Channel 'B':
  - Average: 147.87214814814814
  - Standard Deviation: 35.78649271231254

In [11]:
test_df = check_images_existence(test_df, path_column='path')

In [12]:
dict_df_test = get_label_counts_and_print(test_df, label_column=label_column)
shapes_test = analyze_image_shapes(test_df, min_shape=(461, 601), path_column='path')

Total number of images: 2116
Number of unique labels: 4
Label 'Normal' has 1019 images.
Label 'Lung_Opacity' has 601 images.
Label 'COVID' has 362 images.
Label 'Viral_Pneumonia' has 134 images.
Average image shape - Height: 299.0, Width: 299.0
Number of images with shape smaller than (461, 601): 2116
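With an average shape of 299 x 299, every test image falls below the `min_shape` of (461, 601), which is why the count equals the total. The sketch below shows one way such a summary could be computed. It assumes `min_shape` is given as (height, width) and that an image counts as smaller when either dimension is below the minimum; `summarize_shapes` is a hypothetical name, not the helper used above.

```python
def summarize_shapes(shapes, min_shape):
    """Return the average (height, width) and the number of shapes below `min_shape`.

    Assumption: an image counts as "smaller" if its height or its width is below the minimum.
    """
    heights = [h for h, _ in shapes]
    widths = [w for _, w in shapes]
    average = (sum(heights) / len(shapes), sum(widths) / len(shapes))
    smaller = sum(1 for h, w in shapes if h < min_shape[0] or w < min_shape[1])
    return average, smaller


# Invented shapes for illustration; real shapes would come from opening each image file.
shapes = [(299, 299), (299, 299), (600, 800)]
average, smaller = summarize_shapes(shapes, min_shape=(461, 601))
print(average, smaller)
```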


### Validation

In [13]:
# image_analysis_val = image_analysis(val_df)

Smallest pixel value: 0

Largest pixel value: 255

Total images processed: 1005

Channel Statistics:
Channel 'R':
  - Average: 194.84180818500093
  - Standard Deviation: 23.428856516708883
Channel 'G':
  - Average: 139.9250088557214
  - Standard Deviation: 30.539483110296192
Channel 'B':
  - Average: 146.21765087525336
  - Standard Deviation: 34.290640143609394

In [14]:
val_df = check_images_existence(val_df, path_column='path')

In [15]:
dict_df_val = get_label_counts_and_print(val_df, label_column=label_column)
shapes_val = analyze_image_shapes(val_df, min_shape=(461, 601), path_column='path')

Total number of images: 2119
Number of unique labels: 4
Label 'Normal' has 1020 images.
Label 'Lung_Opacity' has 602 images.
Label 'COVID' has 362 images.
Label 'Viral_Pneumonia' has 135 images.
Average image shape - Height: 299.0, Width: 299.0
Number of images with shape smaller than (461, 601): 2119
