<a href="https://colab.research.google.com/github/Agoston03/Deep-Learning-42/blob/main/deep_learning_42_milestone2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a homework project for the course "Deep Learning a gyakorlatban Python és Lua alapokon" (Deep Learning in Practice with Python and Lua).  
The team members are:

* Gyulai Gergő László
* Horváth Ágoston
* Frink Dávid

You can read more information about our chosen homework at the link below:  
https://www.kaggle.com/competitions/isic-2024-challenge

## Download and setup

Install the Kaggle API client

In [26]:
!pip install kaggle==1.5.12



Configure Kaggle to access the API  
**Warning!** You need to copy your own kaggle.json file into Colab in order to authenticate.

In [27]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
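
Before calling the API, it can help to confirm that the credentials file is where the Kaggle client expects it. The sketch below is only an illustration: the helper name `check_kaggle_credentials` is hypothetical, and it assumes kaggle.json holds the `username` and `key` fields found in the token file downloaded from a Kaggle account.

```python
import json
import os
import stat

def check_kaggle_credentials(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the credentials file (empty if none)."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    problems = []
    # The file should be accessible by the owner only (chmod 600)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    with open(path) as f:
        content = json.load(f)
    # Assumed layout of the token file: {"username": ..., "key": ...}
    for field in ("username", "key"):
        if field not in content:
            problems.append(f"missing field: {field}")
    return problems

print(check_kaggle_credentials())
```

An empty list means the file exists, is private to the owner, and has both fields.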

Download the data

In [28]:
!kaggle competitions download -c isic-2024-challenge

isic-2024-challenge.zip: Skipping, found more recently modified local copy (use --force to force download)


Unpack the data  
**Warning!** This might take a few minutes.

In [None]:
# -n skips files that already exist, so re-running the cell does not stop at an overwrite prompt
!unzip -n isic-2024-challenge.zip

Archive:  isic-2024-challenge.zip

### Preparing train, test and valid set

Gather information about the dataset based on the metadata

In [None]:
import pandas as pd

In [None]:
metadata = pd.read_csv('train-metadata.csv')
metadata.head()

Let's check the number of benign and malignant samples

In [None]:
benign_data = metadata[metadata['target'] == 0]
malignant_data = metadata[metadata['target'] == 1]

print(f'Benign images: {len(benign_data)}')
print(f'Malignant images: {len(malignant_data)}')

In [None]:
benign_ids = benign_data['isic_id'].tolist()
malignant_ids = malignant_data['isic_id'].tolist()

print(f'Benign images ids: {benign_ids[:5]}')
print(f'Malignant images ids: {malignant_ids[:5]}')

In [None]:
test_split = 0.1
valid_split = 0.1

test_benign_ids = benign_ids[:int(len(benign_ids) * test_split)]
test_malignant_ids = malignant_ids[:int(len(malignant_ids) * test_split)]

valid_benign_ids = benign_ids[int(len(benign_ids) * test_split):int(len(benign_ids) * (test_split + valid_split))]
valid_malignant_ids = malignant_ids[int(len(malignant_ids) * test_split):int(len(malignant_ids) * (test_split + valid_split))]

train_benign_ids = benign_ids[int(len(benign_ids) * (test_split + valid_split)):]
train_malignant_ids = malignant_ids[int(len(malignant_ids) * (test_split + valid_split)):]

print(f'Test benign images: {len(test_benign_ids)}')
print(f'Test malignant images: {len(test_malignant_ids)}')
print(f'Valid benign images: {len(valid_benign_ids)}')
print(f'Valid malignant images: {len(valid_malignant_ids)}')
print(f'Train benign images: {len(train_benign_ids)}')
print(f'Train malignant images: {len(train_malignant_ids)}')
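
The slices above take the IDs in the order they appear in the metadata file, so any ordering in that file carries over into the splits. A minimal sketch of a shuffled per-class split follows. `split_ids` and the fixed `seed` are hypothetical additions, not part of the original notebook, and the toy IDs only stand in for `benign_ids` or `malignant_ids`.

```python
import random

def split_ids(ids, test_split=0.1, valid_split=0.1, seed=42):
    """Shuffle a copy of ids and cut it into (test, valid, train) lists."""
    shuffled = list(ids)
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_split)
    n_valid = int(len(shuffled) * valid_split)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return test, valid, train

# Toy IDs standing in for one class of real ISIC IDs
toy_ids = [f"ISIC_{i:07d}" for i in range(50)]
toy_test, toy_valid, toy_train = split_ids(toy_ids)
print(len(toy_test), len(toy_valid), len(toy_train))
```

Calling the helper once per class, as the cell above does with its slices, keeps the benign-to-malignant ratio roughly the same in every split.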

We want to train another model on the metadata.  
If both models predict malignant, then the lesion is probably malignant.  
Here we select the relevant metadata columns for that model.

In [None]:
COLUMNS = [
    'clin_size_long_diam_mm',
    'tbp_lv_areaMM2',
    'tbp_lv_area_perim_ratio',
    'tbp_lv_color_std_mean',
    'tbp_lv_deltaLBnorm',
    'tbp_lv_minorAxisMM',
    'tbp_lv_perimeterMM'
]

malignant_data[COLUMNS].head()
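
The decision rule described above (flag a lesion only when both models agree) can be sketched with two probability arrays. The function name, the example probabilities, and the 0.5 thresholds are all assumptions for illustration; the metadata model itself is not defined in this notebook.

```python
import numpy as np

def combine_predictions(image_probs, metadata_probs,
                        image_threshold=0.5, metadata_threshold=0.5):
    """Return 1 where both models exceed their threshold, else 0."""
    image_positive = np.asarray(image_probs) > image_threshold
    metadata_positive = np.asarray(metadata_probs) > metadata_threshold
    return (image_positive & metadata_positive).astype(int)

# Made-up probabilities for four lesions
image_probs = [0.9, 0.7, 0.2, 0.6]
metadata_probs = [0.8, 0.3, 0.9, 0.55]
print(combine_predictions(image_probs, metadata_probs))  # → [1 0 0 1]
```

Requiring agreement can only remove positive calls, so it tends to reduce false positives at the cost of sensitivity. That trade-off is worth checking against the metric used further below.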

We sample the IDs to get a smaller subset of the data.  
We will use it to quickly "test" each model before training it on the full dataset.

In [None]:
import random

sample_size = 10000

benign_sample_ids = random.sample(train_benign_ids, k=min(sample_size, len(train_benign_ids)))
malignant_sample_ids = random.sample(train_malignant_ids, k=min(sample_size, len(train_malignant_ids)))

print(f'Benign sample ids: {benign_sample_ids[:5]}')
print(f'Malignant sample ids: {malignant_sample_ids[:5]}')

print(f'Benign sample size: {len(benign_sample_ids)}')
print(f'Malignant sample size: {len(malignant_sample_ids)}')

### Loading the images

Prepare to load and show images

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
import io
import matplotlib.image as mpimg
import h5py

Visualize some of the data we have

In [None]:
# number of rows and columns displayed
nrows = 4
ncols = 4

fig = plt.gcf()
fig.set_size_inches(ncols * 3, nrows * 3)

next_benign_pix = [key for key in benign_ids[:int(ncols*nrows/2)]]
next_malignant_pix = [key for key in malignant_ids[:int(ncols*nrows/2)]]

with h5py.File('train-image.hdf5', 'r') as f:
  for i, img_key in enumerate(next_benign_pix + next_malignant_pix):
    image_data = f[img_key][()]
    image = Image.open(io.BytesIO(image_data))

    sp = plt.subplot(nrows, ncols, i + 1)
    plt.imshow(image)

plt.show()

## Defining the model

We need a **data generator** because the size of the data is enormous.  
It is a slightly complicated function and understanding code takes time, so here is **the idea briefly**:  
We read images from the file in shuffled order, and for each **malignant** image we **generate more** samples by yielding three randomly rotated copies of it.

In [None]:
import random
import numpy as np

def next_data_generator(benign_ids, malignant_ids):
  ids = benign_ids + malignant_ids
  random.shuffle(ids)
  # Set lookup is O(1); checking membership in the list would scan it for every image
  malignant_set = set(malignant_ids)

  while True:
    with h5py.File('train-image.hdf5', 'r') as f:
      for img_id in ids:
        image_data = f[img_id][()]
        image = Image.open(io.BytesIO(image_data))
        image = image.resize((224, 224))

        if img_id in malignant_set:
          # Oversample the rare malignant class: yield three randomly rotated copies
          for _ in range(3):
            rotated_image = image.rotate(random.randint(-10, 10))
            rotated_array = np.array(rotated_image)
            yield rotated_array, 1
        else:
          yield np.array(image), 0

In [None]:
def data_generator(benign_ids, malignant_ids, batch_size=32):
  get_next = next_data_generator(benign_ids, malignant_ids)

  while True:
    batch_data = []
    batch_labels = []

    for _ in range(batch_size):
      image_array, label = next(get_next)
      batch_data.append(image_array)
      batch_labels.append(label)

    yield np.array(batch_data), np.array(batch_labels)
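
The oversampling idea can be tried without the HDF5 file. The sketch below uses a synthetic PIL image in place of a real lesion photo (an assumption made so the example runs standalone) and mirrors the generator's rotate-three-times step. The helper name `rotated_copies` is hypothetical.

```python
import random

import numpy as np
from PIL import Image

def rotated_copies(image, n_copies=3, max_angle=10):
    """Return n_copies arrays, each a randomly rotated version of image."""
    copies = []
    for _ in range(n_copies):
        angle = random.randint(-max_angle, max_angle)
        # rotate() keeps the original size; uncovered corners are filled with black
        copies.append(np.array(image.rotate(angle)))
    return copies

# Synthetic 224x224 RGB image standing in for a malignant sample
fake_image = Image.fromarray(np.full((224, 224, 3), 128, dtype=np.uint8))
augmented = rotated_copies(fake_image)
print(len(augmented), augmented[0].shape)
```

Every copy has the same shape as the input, so the batches built by `data_generator` stack cleanly into one array.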

Importing the Keras components

In [None]:
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetV2B0
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

**Defining** the model for training.  
We use a **pretrained CNN** model for this assignment.  
We will try a few other models later, but this seems enough for now.  
What we will definitely have to do later is to experiment with **different loss functions and optimizers**.

In [None]:
base_model = EfficientNetV2B0(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

for layer in base_model.layers:
    layer.trainable = False

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base_model.input, outputs=predictions)

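# Use the metric object; the 'recall' string shortcut is not recognized by every Keras version
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.Recall(name='recall')])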
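
Since loss functions are flagged as something to experiment with, one candidate is a class-weighted binary cross-entropy, which penalizes missed malignant cases more heavily than missed benign ones. The NumPy sketch below only illustrates the formula. The function name and the weight of 10 are arbitrary assumptions, not values used in this project.

```python
import numpy as np

def weighted_bce(y_true, y_prob, positive_weight=10.0, eps=1e-7):
    """Mean binary cross-entropy with the positive-class term scaled by positive_weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log() never receives 0
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = -(positive_weight * y_true * np.log(y_prob)
                   + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_sample.mean())

# One missed malignant case and one false alarm, both equally confident
toy_true = [1, 0]
toy_prob = [0.1, 0.9]
loss_unweighted = weighted_bce(toy_true, toy_prob, positive_weight=1.0)
loss_weighted = weighted_bce(toy_true, toy_prob, positive_weight=10.0)
print(loss_unweighted, loss_weighted)
```

With a weight of 1 this reduces to the plain binary cross-entropy used in the cell above. Larger weights raise the cost of the missed malignant case only.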

**Training** the model.  
This might also be the subject of different **modifications** in the future.

In [None]:
import numpy as np

batch_size = 128
epochs = 5

# For now we only train on a small dataset to save resources
train_generator = data_generator(benign_sample_ids, malignant_sample_ids, batch_size=batch_size)

# The generator already yields full batches, so batch_size must not be passed to fit() as well
model.fit(train_generator,
          epochs = epochs,
          steps_per_epoch = 5)

Here we fill an array with the **predictions** to simplify the code for the visualizations later.

In [None]:
# Note: despite the "train" in the names, these are held-out test images
train_ids = test_benign_ids[:100] + test_malignant_ids
predictions_train = np.zeros((len(train_ids), 1))

# Open the HDF5 file once instead of once per image
with h5py.File('train-image.hdf5', 'r') as f:
  for i, img_id in enumerate(train_ids):
    image_data = f[img_id][()]
    image = Image.open(io.BytesIO(image_data)).resize((224, 224))
    image_array = tf.keras.preprocessing.image.img_to_array(image)
    image_array = np.expand_dims(image_array, axis=0)

    # Make prediction using the model
    prediction = model.predict(image_array, verbose=0)
    predictions_train[i] = prediction

# Now predictions_train contains the model's predictions for the selected test images
predictions_train.shape

Plot the **confusion matrix**.  
It tells a lot about the model, very intuitively.

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

true_labels_train = [0] * 100 + [1] * len(test_malignant_ids)

# Convert predictions to binary (0 or 1) using a threshold (e.g., 0.5)
predicted_labels_train = (predictions_train > 0.5).astype(int)

# Calculate the confusion matrix
cm = confusion_matrix(true_labels_train, predicted_labels_train)

# Plot the confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
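
Sensitivity and specificity can be read directly off a 2x2 confusion matrix. The sketch below uses a made-up matrix (not a result from this model) in scikit-learn's layout, where rows are true labels and columns are predicted labels. The helper name is hypothetical.

```python
import numpy as np

def sensitivity_specificity(matrix):
    """matrix is [[TN, FP], [FN, TP]], the layout returned by sklearn's confusion_matrix."""
    tn, fp, fn, tp = np.asarray(matrix).ravel()
    sensitivity = tp / (tp + fn)  # share of malignant cases that were caught
    specificity = tn / (tn + fp)  # share of benign cases that were cleared
    return sensitivity, specificity

# Made-up counts: 100 benign and 20 malignant images
example_cm = [[90, 10],
              [5, 15]]
sens, spec = sensitivity_specificity(example_cm)
print(f"Sensitivity: {sens:.2f}, specificity: {spec:.2f}")
```

The sensitivity here is the same quantity that the `recall` metric tracks during training.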

**Partial area under the ROC curve** (pAUC) above 80% true positive rate (TPR) for binary classification of malignant examples.

The receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a given binary classifier system as its discrimination threshold is varied. However, there are regions in the ROC space where the values of TPR are unacceptable in clinical practice. Systems that aid in diagnosing cancers are required to be highly sensitive, so this metric focuses on the area under the ROC curve AND above 80% TPR. Hence, scores fall in the range [0.0, 0.2].

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

def generate_pAUC_plot(algorithm_probs, true_labels, min_tpr=0.8):
    """
    Plot the ROC curve and the pAUC above a given minimum TPR for an algorithm.

    Parameters:
        algorithm_probs (list): Predicted probabilities from Algorithm.
        true_labels (list): Ground truth binary labels.
        min_tpr (float): Minimum TPR threshold for pAUC calculation (default=0.8).
    """

    # Compute ROC curves
    fpr_a, tpr_a, _ = roc_curve(true_labels, algorithm_probs)

    # Find indices where TPR is at or above min_tpr
    min_tpr_idx_a = np.where(tpr_a >= min_tpr)[0]

    # Filter FPR and TPR above min_tpr for algorithm
    fpr_a_high_tpr, tpr_a_high_tpr = fpr_a[min_tpr_idx_a], tpr_a[min_tpr_idx_a]

    # Calculate pAUC above min_tpr: the area between the ROC curve and the min_tpr line,
    # so the score stays within [0, 1 - min_tpr] (auc needs at least two points)
    pAUC_a = auc(fpr_a_high_tpr, tpr_a_high_tpr - min_tpr) if len(min_tpr_idx_a) > 1 else 0.0

    # Plot ROC curves
    plt.figure(figsize=(10, 6))
    plt.plot(fpr_a, tpr_a, label=f'Algorithm (pAUC={pAUC_a:.3f})', color='blue', linewidth=2)

    # Shade the pAUC region above min_tpr
    plt.fill_between(fpr_a_high_tpr, tpr_a_high_tpr, min_tpr, color='blue', alpha=0.2, label='pAUC region')

    # Add labels, legend, and grid
    plt.axhline(y=min_tpr, color='red', linestyle='--', label=f'Minimum TPR ({min_tpr*100:.0f}%)')
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.title('Partial AUC Above Minimum TPR')
    plt.legend(loc='lower right')
    plt.grid(alpha=0.3)
    plt.show()

generate_pAUC_plot(predictions_train, true_labels_train)
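
As a numeric cross-check of the metric, the area above the minimum TPR can also be computed on the flipped ROC curve, where it becomes an ordinary partial AUC up to 1 - min_tpr on the horizontal axis. This is a sketch under that assumption, with a hypothetical helper name and toy labels. It is not the official competition scorer.

```python
import numpy as np
from sklearn.metrics import roc_curve

def partial_auc_above_tpr(true_labels, probs, min_tpr=0.8):
    """Area under the ROC curve and above min_tpr, in [0, 1 - min_tpr]."""
    fpr, tpr, _ = roc_curve(true_labels, probs)
    # Flip the curve: x = 1 - TPR, y = 1 - FPR, reversed so that x is ascending
    x = (1.0 - tpr)[::-1]
    y = (1.0 - fpr)[::-1]
    max_x = 1.0 - min_tpr
    # Keep the points with x <= max_x ...
    stop = np.searchsorted(x, max_x, side="right")
    x_part, y_part = x[:stop], y[:stop]
    # ... and add a linearly interpolated end point exactly at max_x
    if stop < len(x):
        t = (max_x - x[stop - 1]) / (x[stop] - x[stop - 1])
        y_end = y[stop - 1] + t * (y[stop] - y[stop - 1])
        x_part = np.append(x_part, max_x)
        y_part = np.append(y_part, y_end)
    # Trapezoidal rule over the clipped curve
    return float(np.sum(np.diff(x_part) * (y_part[1:] + y_part[:-1]) / 2.0))

toy_labels = [0, 0, 1, 1]
pauc_perfect = partial_auc_above_tpr(toy_labels, [0.1, 0.2, 0.8, 0.9])   # perfect ranking
pauc_reversed = partial_auc_above_tpr(toy_labels, [0.9, 0.8, 0.2, 0.1])  # reversed ranking
print(pauc_perfect, pauc_reversed)
```

A perfect ranking reaches the upper bound of the range (about 0.2 for the default threshold), and a fully reversed ranking scores 0.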

**Extra:**  
For the next assignment we will compare two approaches:  
- Gather new data from a different source to enhance the CNN.
- Use the metadata to enhance the decisions.  

These two might not be compatible, so we need to try and evaluate both.