<p align="center">
  <img src="../src/assets/image/tensorflow_logo.png" alt="TensorFlow Image" width="50" align="left"/>
</p>

# Skin Problem Classification Notebook 


---

## Team Information

**Team ID:** C241-PS385  

**Members:**    
- Stefanus Bernard Melkisedek - [GitHub Profile](https://github.com/stefansphtr)
- Debby Trinita - [GitHub Profile](https://github.com/debbytrinita)
- Mhd. Reza Kurniawan Lubis - [GitHub Profile](https://github.com/rezakur)

## Chosen Development Environment

For this project, our team used Google Colab as the primary development environment, mainly because it offers free access to GPU and TPU resources. These resources speed up model training considerably.

## 1. Setup

### 1.1 Import Libraries

In [None]:
# Standard library imports
import io
import itertools
import sys
import zipfile
import os
import shutil
from shutil import copyfile
from datetime import datetime
import random

# Third-party imports
from IPython.display import Image, display
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, Dense, BatchNormalization, Dropout, MaxPooling2D, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model
import sklearn
from sklearn.metrics import confusion_matrix

### 1.2 Load TensorBoard Extension

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

## 2. Load and Preprocess Data

This section will cover the data loading and preprocessing steps. 

The step-by-step process is as follows:
1. [Mounting Google Drive](#step1) - This step is necessary to access the dataset stored in Google Drive.

2. [Extracting the Dataset and Creating Train & Validation Directories](#step2) - The dataset is stored in a zip file. We define the required paths, extract the contents of the zip file, count the images in each class, and create the training and validation directories.

3. [Splitting the Dataset into Training and Validation Sets](#step3) - We copy 80% of each class's images into the training directory and the remaining 20% into the validation directory.

<a id='step1'></a>

### 2.1 Mounting Google Drive

The code mounts Google Drive to access the dataset stored there.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

<a id='step2'></a>

### 2.2 Extracting the Dataset and Creating Train & Validation Directories

The code defines the path to the dataset zip file and the path where the dataset will be extracted. It then checks whether the data has already been extracted and, if not, extracts the zip file. Additionally, it prints the number of images in each class directory and creates the training and validation directories.

In [None]:
# Define the paths
ZIP_FILE_PATH = "/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/skin_problem_dataset.zip"
DESTINATION_PATH = "/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/"

# Define the skin classes
SKIN_CLASSES = ['acnes', 'blackheads', 'darkspots', 'wrinkles']

# Define root directory
ROOT_DIR = '/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/split'

In [None]:
def extract_dataset(zip_file_path, destination_path):
    # Check if the destination directory already exists
    if not os.path.exists(destination_path):
        # Create a ZipFile object
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            # Extract all the contents of the zip file to the destination directory
            zip_ref.extractall(destination_path)
            print(f"Zip file extracted to {destination_path}")
    else:
        print(f"Directory {destination_path} already exists, skipping extraction\n")

    # Add print statement to check the content of the data directory
    print(f"Content of {destination_path}: {os.listdir(destination_path)}\n")

def print_num_images(source_dir, skin_classes):
    # For each skin class, print the number of images
    for skin_class in skin_classes:
        class_dir = os.path.join(source_dir, skin_class)
        try:
            num_images = len(os.listdir(class_dir))
            print(f"The number of images in {skin_class} class: {num_images}\n")
        except FileNotFoundError:
            print(f"Directory {class_dir} does not exist.")
        except NotADirectoryError:
            print(f"{class_dir} is not a directory.")
        except Exception as e:
            print(f"An error occurred: {e}")

def create_train_val_dirs(root_path, skin_classes):
    # Empty directory to prevent FileExistsError if the function is run several times
    if os.path.exists(root_path):
        shutil.rmtree(root_path)

    # Build the paths of the training and validation directories
    train_dir = os.path.join(root_path, 'training')
    val_dir = os.path.join(root_path, 'validation')

    # Create directories for each skin class
    for skin_class in skin_classes:
        # Create train directory for this skin class
        train_class_dir = os.path.join(train_dir, skin_class)
        os.makedirs(train_class_dir, exist_ok=True)

        # Create validation directory for this skin class
        val_class_dir = os.path.join(val_dir, skin_class)
        os.makedirs(val_class_dir, exist_ok=True)

    # Show all the directories in the root directory
    for root, dirs, files in os.walk(root_path):
        for subdir in dirs:
            print(os.path.join(root, subdir))
            

In [None]:
try:
    extract_dataset(ZIP_FILE_PATH, DESTINATION_PATH)
    print_num_images(DESTINATION_PATH, SKIN_CLASSES)
    create_train_val_dirs(ROOT_DIR, SKIN_CLASSES)
except Exception as e:
    print(f"An error occurred: {e}")
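The extraction check described above can also be keyed to the expected class folders instead of the destination directory itself, since a destination that already holds the zip file always exists. The following is a minimal sketch under stated assumptions: the helper name `extract_if_missing`, the throwaway archive, and the idea that the archive unpacks the class folders directly into the destination are all hypothetical and are not taken from the project's Drive layout.

```python
import os
import tempfile
import zipfile


def extract_if_missing(zip_file_path, destination_path, expected_dirs):
    """Extract the archive only when at least one expected folder is absent.

    Hypothetical helper for illustration; returns True if extraction ran.
    """
    missing = [
        name for name in expected_dirs
        if not os.path.isdir(os.path.join(destination_path, name))
    ]
    if not missing:
        return False  # every expected folder is present, so skip extraction
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(destination_path)
    return True


# Demo on a throwaway archive that sits inside its own destination directory
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    for cls in ("acnes", "wrinkles"):
        zf.writestr(f"{cls}/sample.jpg", b"fake image bytes")

first_run = extract_if_missing(demo_zip, work_dir, ["acnes", "wrinkles"])
second_run = extract_if_missing(demo_zip, work_dir, ["acnes", "wrinkles"])
print(first_run, second_run)
```

The first call extracts because the class folders are missing; the second call finds them and skips the work, even though the destination directory existed both times.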

<a id='step3'></a>

### 2.3 Split the dataset into train and validation

This section will cover the process of splitting the dataset into training and validation sets. The training set will be used to train the model, while the validation set will be used to evaluate the model's performance.

In [None]:
def split_data(source_dir, training_dir, validation_dir, split_size):
    """
    Splits the data in the source directory into a training set and a validation set.

    Parameters:
    source_dir (str): The source directory.
    training_dir (str): The directory where the training set should be copied.
    validation_dir (str): The directory where the validation set should be copied.
    split_size (float): The proportion of the data to be used for the training set.

    Returns:
    None
    """
    # Keep only non-empty files
    files = [
        filename for filename in os.listdir(source_dir)
        if os.path.getsize(os.path.join(source_dir, filename)) > 0
    ]

    # Shuffle the files, then cut the list at the training proportion
    train_length = int(len(files) * split_size)
    shuffled = random.sample(files, len(files))
    train_set = shuffled[:train_length]
    val_set = shuffled[train_length:]

    for filename in train_set:
        copy_file(source_dir, training_dir, filename)

    for filename in val_set:
        copy_file(source_dir, validation_dir, filename)

def copy_file(SOURCE_DIR, DEST_DIR, filename):
    src_file = os.path.join(SOURCE_DIR, filename)
    dest_file = os.path.join(DEST_DIR, filename)
    try:
        copyfile(src_file, dest_file)
    except Exception as e:
        print(f"An error occurred while copying {src_file} to {dest_file}: {e}")

def clear_directory(DIR):
    if len(os.listdir(DIR)) > 0:
        for file in os.scandir(DIR):
            os.remove(file.path)

In [None]:
# Define paths
BASE_DIR = "/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/skin_problem_dataset"
TRAINING_DIR = "/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/split/training"
VALIDATION_DIR = "/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/split/validation"

# Define skin problems
SKIN_PROBLEMS = ["acnes", "blackheads", "darkspots", "wrinkles"]

# Define proportion of images used for training
split_size = .8

# Split data for each skin problem
for problem in SKIN_PROBLEMS:
    SOURCE_DIR = os.path.join(BASE_DIR, problem)
    TRAINING_PROBLEM_DIR = os.path.join(TRAINING_DIR, problem)
    VALIDATION_PROBLEM_DIR = os.path.join(VALIDATION_DIR, problem)

    clear_directory(TRAINING_PROBLEM_DIR)
    clear_directory(VALIDATION_PROBLEM_DIR)

    split_data(SOURCE_DIR, TRAINING_PROBLEM_DIR, VALIDATION_PROBLEM_DIR, split_size)

    print(f"\n\nOriginal {problem}'s directory has {len(os.listdir(SOURCE_DIR))} images")
    print(f"\n\nThere are {len(os.listdir(TRAINING_PROBLEM_DIR))} images of {problem} for training")
    print(f"\n\nThere are {len(os.listdir(VALIDATION_PROBLEM_DIR))} images of {problem} for validation")
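Because the split above shuffles with the global `random` state, each run can place different files in the training and validation sets. If a repeatable split is wanted, one option is a seeded `random.Random` instance. The sketch below shows only the list-splitting step; the helper name `split_filenames`, the seed value, and the file names are assumptions made for illustration.

```python
import random


def split_filenames(filenames, split_size, seed=42):
    """Return (train, validation) lists using a seeded shuffle.

    Hypothetical helper: sorting first makes the result independent of
    the order in which the file system lists the files.
    """
    rng = random.Random(seed)
    shuffled = rng.sample(sorted(filenames), len(filenames))
    train_length = int(len(shuffled) * split_size)
    return shuffled[:train_length], shuffled[train_length:]


names = [f"img_{i:03d}.jpg" for i in range(10)]
train_a, val_a = split_filenames(names, 0.8)
train_b, val_b = split_filenames(names, 0.8)

print(len(train_a), len(val_a))   # sizes of the two subsets
print(train_a == train_b)         # same seed, same split
```

With ten files and a proportion of 0.8, eight files go to training and two to validation, and repeating the call with the same seed reproduces the same assignment.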

## 3. Data Augmentation

This section will cover the data augmentation process.

The step by step process is as follows:

1. [Creating Data Generators](#step4) - We will create data generators for the training and validation sets.

2. [Visualizing Images](#step5) - We will visualize some images from the training and validation sets to understand the data better.

3. [Checking Batch Sizes](#step6) - We will check the batch sizes of the data generators.


<a id='step4'></a>

### 3.1 Creating Data Generators

This section will cover the creation of the image data generator. The image data generator is used to load the images from the dataset and preprocess them before feeding them into the model.

In [None]:
# Define constants
IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32
CLASS_MODE = 'categorical'
RESCALE = 1./255

# Define data augmentation parameters
data_augmentation_params = {
    "rescale": RESCALE,
    "rotation_range": 40,
    "width_shift_range": 0.2,
    "height_shift_range": 0.2,
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
    "fill_mode": 'nearest'
}

# Define generator parameters
generator_params = {
    "directory": TRAINING_DIR,
    "batch_size": BATCH_SIZE,
    "class_mode": CLASS_MODE,
    "target_size": IMAGE_SIZE
}

In [None]:
# Instantiate the ImageDataGenerator class for training data with augmentation
train_datagen = ImageDataGenerator(**data_augmentation_params)

# Pass in the appropriate arguments to the flow_from_directory method
train_generator = train_datagen.flow_from_directory(**generator_params)

# Instantiate the ImageDataGenerator class for validation data without augmentation
validation_datagen = ImageDataGenerator(rescale=RESCALE)

# Reuse the generator parameters, but read from the validation directory
validation_params = {**generator_params, "directory": VALIDATION_DIR}
validation_generator = validation_datagen.flow_from_directory(**validation_params)

In [None]:
# Get a single batch of images and labels from each generator
train_images, train_labels = next(train_generator)
validation_images, validation_labels = next(validation_generator)

# Print the shape of the images and labels
print(f"Image batch of training generator has shape: {train_images.shape}")
print(f"Image batch of validation generator has shape: {validation_images.shape}")

print("\n")

print(f"Label batch of training generator has shape: {train_labels.shape}")
print(f"Label batch of validation generator has shape: {validation_labels.shape}")
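To make two of the generator parameters more concrete, here is a NumPy-only sketch of the arithmetic behind `rescale` and `horizontal_flip`. It illustrates the operations and is not the `ImageDataGenerator` implementation; the 2x2 image is made up, and the real generator applies the flip at random rather than to every image.

```python
import numpy as np

# A made-up 2x2 RGB image with 8-bit pixel values, shape (height, width, channels)
image = np.array(
    [[[0, 0, 0], [255, 255, 255]],
     [[51, 102, 153], [204, 0, 255]]],
    dtype=np.uint8,
)

# rescale=1./255 multiplies every pixel, mapping the range 0..255 to 0..1
rescaled = image.astype(np.float32) * (1.0 / 255)

# A horizontal flip reverses the width axis (axis 1)
flipped = rescaled[:, ::-1, :]

print(rescaled.min(), rescaled.max())
print(np.array_equal(flipped[0, 0], rescaled[0, 1]))
```

Only the training generator uses the geometric transformations; the validation generator applies the rescaling alone so that evaluation images are left unaltered.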

<a id='step5'></a>

### 3.2 Visualizing Images

This section will cover the visualization of images from the training set. The images will be displayed along with their corresponding labels.

In [None]:
def visualize_images(dataset, num_images):
    """
    This function takes a dataset and a number of images to display. It selects a random sample of images
    from one batch of the dataset and displays them in a grid.

    Parameters:
    dataset (DirectoryIterator): The dataset to select images from. This should be the iterator returned by flow_from_directory.
    num_images (int): The number of images to display. This should be a positive integer no larger than the batch size.

    Returns:
    None
    """
    # Define the labels list (same order as the alphabetically sorted class directories)
    labels_list = ['Acnes', 'Blackheads', 'Darkspots', 'Wrinkles']

    # Take one batch from the dataset
    images, labels = next(dataset)

    # Select a few random images from the batch
    random_indices = random.sample(range(images.shape[0]), num_images)
    selected_images = images[random_indices]
    selected_labels = labels[random_indices]

    # Map the one-hot encoded labels back to their original string labels
    selected_labels = [labels_list[np.argmax(label)] for label in selected_labels]

    # Display the selected images and their labels in a grid with three columns
    num_rows = (num_images + 2) // 3
    plt.figure(figsize=(10, 10))
    for i in range(num_images):
        plt.subplot(num_rows, 3, i + 1)
        plt.imshow(selected_images[i])
        plt.title(f"Label: {selected_labels[i]}")
        plt.axis("off")
    plt.show()

In [None]:
# Visualize images from each generator
visualize_images(train_generator, 9)
visualize_images(validation_generator, 9)

<a id='step6'></a>

### 3.3 Checking Batch Sizes

This section will cover the checking of batch sizes for the data generators. The batch size is the number of images that will be fed into the model at once during training.

In [None]:
def check_batch_size(dataset):
    """
    This function takes a DirectoryIterator dataset and returns the size of one batch.

    Parameters:
    dataset (DirectoryIterator): The dataset to check the batch size of. This should be the iterator returned by flow_from_directory.

    Returns:
    int: The number of images in the batch.
    """
    # Take one batch from the dataset
    images, _ = next(dataset)

    # Return the batch size
    return images.shape[0]

In [None]:
# Check the batch size of the training and validation datasets
print(f"Batch size of the training dataset: {check_batch_size(train_generator)}")
print(f"Batch size of the validation dataset: {check_batch_size(validation_generator)}")
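The batch size also determines how many steps make up one epoch, and the final batch of an epoch is usually smaller than the rest. The sketch below works through that arithmetic; the helper name `batch_plan` and the image count of 1000 are assumptions chosen for illustration, not figures from the dataset.

```python
import math


def batch_plan(num_images, batch_size):
    """Return (steps_per_epoch, size_of_last_batch) for one pass over the data."""
    if num_images <= 0 or batch_size <= 0:
        raise ValueError("num_images and batch_size must be positive")
    steps = math.ceil(num_images / batch_size)
    last_batch = num_images - (steps - 1) * batch_size
    return steps, last_batch


steps, last_batch = batch_plan(1000, 32)
print(f"steps per epoch: {steps}, images in last batch: {last_batch}")
```

For 1000 images and a batch size of 32, an epoch has 32 steps: 31 full batches and a final batch of 8 images. This is why a batch-size check can report fewer images than `BATCH_SIZE` when it happens to draw the last batch.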

## 4. Build Model

This section will cover the model building process.

The step by step process is as follows:
- Define the model architecture
- Compile the model

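### 4.1 Define the Model Architecture for Learning Rate Tuning

This section defines an uncompiled copy of the model and trains it while a `LearningRateScheduler` raises the learning rate tenfold every 20 epochs, starting from 1e-5. Plotting the loss against the learning rate shows the range in which training is stable. The cells are commented out because the sweep only needs to run once; the chosen learning rate is set in the next section.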

In [None]:
# # Constants for the image dimensions
# IMG_HEIGHT = 224
# IMG_WIDTH = 224

In [None]:
# def create_uncompiled_model():

#     model = tf.keras.models.Sequential(
#         [
#           # First Convolutional Block
#           Conv2D(16, (3, 3), activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
#           MaxPooling2D(pool_size=(2, 2)),

#           # Second Convolutional Block
#           Conv2D(32, (3, 3), activation='relu'),
#           MaxPooling2D(pool_size=(2, 2)),

#           # Third Convolutional Block
#           Conv2D(64, (3, 3), activation='relu'),
#           MaxPooling2D(pool_size=(2, 2)),

#           # Flatten the output and add Dense layers
#           Flatten(),

#           # Add Dense layer
#           Dense(512, activation='relu'),
#           Dense(64, activation='relu'),
#           Dropout(0.5),

#           # Output layer
#           Dense(4, activation='softmax') # 4 skin problems
#         ]
#     )

#     return model

In [None]:
# def adjust_learning_rate(dataset):

#     model = create_uncompiled_model()

#     lr_schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-5 * 10**(epoch / 20))

#     # Select your optimizer
#     optimizer = tf.keras.optimizers.Adam()

#     # Compile the model passing in the appropriate loss
#     model.compile(loss='categorical_crossentropy',
#                   optimizer=optimizer,
#                   metrics=["accuracy"])

#     history = model.fit(
#         dataset,
#         validation_data=validation_generator,
#         epochs=100,
#         callbacks=[lr_schedule]
#     )

#     return history

In [None]:
# Run the training with dynamic LR
# lr_history = adjust_learning_rate(train_generator)

In [None]:
# plt.semilogx(lr_history.history["lr"], lr_history.history["loss"])
# plt.axis([1e-6, 1e-1, 0, 0.8])
# plt.show()
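To see which learning rates the commented-out sweep visits, the schedule `1e-5 * 10**(epoch / 20)` can be evaluated on its own. The function name `sweep_learning_rate` and its parameter names are assumptions introduced for this sketch; only the formula comes from the cell above.

```python
def sweep_learning_rate(epoch, start=1e-5, epochs_per_decade=20):
    """Learning rate at a given epoch: multiplied by 10 every `epochs_per_decade` epochs."""
    return start * 10 ** (epoch / epochs_per_decade)


# Print the learning rate at the start of each decade and at the last epoch
for epoch in (0, 20, 40, 60, 80, 99):
    print(f"epoch {epoch:>2}: lr = {sweep_learning_rate(epoch):.2e}")
```

Over 100 epochs the rate climbs from 1e-5 to roughly 0.89. The plot's x-axis stops at 1e-1, which corresponds to epoch 80, so the last 20 epochs of the sweep fall outside the plotted range.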

### 4.2 Define the Model Architecture after Tuning Learning Rate

This section defines the final model: three convolutional blocks (`Conv2D` followed by `MaxPooling2D`), a `Flatten` layer, two `Dense` layers with dropout, and an output layer with one unit per skin problem. The model is compiled with the Adam optimizer and the learning rate chosen for training.

In [None]:
# Constants for the image dimensions and number of classes
IMG_HEIGHT = 224
IMG_WIDTH = 224
NUM_CLASSES = 4  # Number of skin problems
LEARNING_RATE = 0.001

In [None]:
def create_model():
    """
    Creates a Sequential model with three convolutional blocks, a flatten layer,
    two dense layers, and an output layer.

    Returns:
        model: A Sequential model.
    """
    model = Sequential([
        # First Convolutional Block
        Conv2D(16, (3, 3), activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
        MaxPooling2D(pool_size=(2, 2)),

        # Second Convolutional Block
        Conv2D(32, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        # Third Convolutional Block
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        # Flatten the output of the convolutions
        Flatten(),

        # Add Dense layers
        Dense(512, activation='relu'),
        Dense(64, activation='relu'),
        Dropout(0.5),

        # Output layer: softmax yields one probability distribution over the classes
        Dense(NUM_CLASSES, activation='softmax')
    ])
    return model

def compile_model(model):
    """
    Compiles the model with categorical crossentropy as the loss function,
    Adam as the optimizer, and accuracy as the metric.

    Args:
        model: The model to compile.
    """
    optimizer = Adam(learning_rate=LEARNING_RATE)
    # The generators use class_mode='categorical' (one-hot labels), so use categorical crossentropy
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=["accuracy"])

def print_and_plot_model(model):
    """
    Prints the summary of the model and plots the model architecture.

    Args:
        model: The model to print and plot.
    """
    # Print the model summary (summary() prints directly and returns None)
    print("Model Summary:")
    model.summary()

    # Plot the model architecture and save it to a file
    plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)

    # Display the model architecture image
    display(Image(filename='model.png'))

In [None]:
# Create, compile, print and plot the model
model = create_model()
compile_model(model)
print_and_plot_model(model)
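The shapes reported by the model summary can be derived by hand. With Keras defaults, a 3x3 `Conv2D` uses no padding and shrinks each spatial dimension by 2, and a 2x2 `MaxPooling2D` halves it, rounding down. The sketch below repeats that arithmetic for the three blocks; the helper names `conv_valid` and `max_pool` are assumptions introduced here.

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution without padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output size of non-overlapping max pooling (rounds down)."""
    return size // pool


size = 224  # IMG_HEIGHT and IMG_WIDTH
channels = 3
for filters in (16, 32, 64):
    size = max_pool(conv_valid(size))
    channels = filters
    print(f"after block with {filters} filters: {size} x {size} x {channels}")

flattened = size * size * channels
first_dense_params = flattened * 512 + 512  # weights plus biases
print(f"flattened features: {flattened}")
print(f"parameters in Dense(512): {first_dense_params}")
```

The feature maps shrink from 224 to 111, 54, and finally 26 pixels per side, so the `Flatten` layer produces 26 * 26 * 64 = 43,264 features. The first dense layer alone then holds 22,151,680 parameters, which accounts for most of the model's size.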

## 5. Train Model

This section will cover the model training process.

The step by step process is as follows:
- Fit the model on training data
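As a sketch of what the fitting step could look like, the block below wraps `model.fit` with a TensorBoard callback and early stopping. It is not the team's training cell: the function name `train_model`, the choice of callbacks, the patience value, and the tiny stand-in model with random data are all assumptions, included only so the example runs on its own.

```python
import tempfile

import numpy as np
import tensorflow as tf


def train_model(model, train_data, val_data, epochs, log_dir):
    """Fit the model, logging to TensorBoard and stopping early if val_loss stalls."""
    callbacks = [
        tf.keras.callbacks.TensorBoard(log_dir=log_dir),
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=3, restore_best_weights=True
        ),
    ]
    return model.fit(
        train_data,
        validation_data=val_data,
        epochs=epochs,
        callbacks=callbacks,
        verbose=0,
    )


# Stand-in data: 16 random 8x8 RGB images with one-hot labels for 4 classes
rng = np.random.default_rng(0)
x = rng.random((16, 8, 8, 3), dtype=np.float32)
y = tf.keras.utils.to_categorical(rng.integers(0, 4, size=16), num_classes=4)

# Stand-in model, far smaller than the notebook's CNN
demo_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),
])
demo_model.compile(
    loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)
val_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)
history = train_model(
    demo_model, train_ds, val_ds, epochs=2, log_dir=tempfile.mkdtemp()
)
print(sorted(history.history.keys()))
```

In the notebook, the same function would presumably receive `model`, `train_generator`, and `validation_generator`, and the log directory could then be opened with `%tensorboard --logdir <log_dir>`.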

## 6. Evaluate Model

This section will cover the model evaluation process.

The step by step process is as follows:
- Evaluate the model on the test data
- Generate predictions
- Print classification report
- Plot confusion matrix
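The evaluation steps listed above could be sketched as follows, using scikit-learn for the classification report and confusion matrix. The probabilities and labels here are invented so the example runs without a trained model. In the notebook, the probabilities would come from `model.predict(...)` and the true labels from the generator's `classes` attribute, which only line up if that generator was created with `shuffle=False`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["acnes", "blackheads", "darkspots", "wrinkles"]

# Invented ground truth and predicted probabilities for eight images
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_prob = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.50, 0.30, 0.10, 0.10],  # a blackheads image mistaken for acnes
    [0.10, 0.10, 0.70, 0.10],
    [0.20, 0.10, 0.60, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.05, 0.20, 0.70],
])

# The predicted class is the index of the highest probability
y_pred = np.argmax(y_prob, axis=1)

labels = list(range(len(class_names)))
cm = confusion_matrix(y_true, y_pred, labels=labels)
report = classification_report(
    y_true, y_pred, labels=labels, target_names=class_names, zero_division=0
)

print(cm)
print(report)
```

Rows of the confusion matrix are true classes and columns are predicted classes, so the single off-diagonal entry marks the blackheads image that was predicted as acnes.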

## 7. Save Model

This section will cover the model saving process.

The step by step process is as follows:
- Save the model for future use
- Save the model architecture
- Save the model weights
- Save the model history
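Of the items above, the training history is the one Keras does not write to disk by itself: the model, its architecture, and its weights can be stored with `model.save(...)`, `model.to_json()`, and `model.save_weights(...)`, with file names and formats depending on the installed Keras version. The sketch below shows one way to keep the history as JSON. The helper name `save_history`, the file name, and the sample values are assumptions; in the notebook the dictionary would be the `history` attribute of the object returned by `model.fit`.

```python
import json
import os
import tempfile


def save_history(history_dict, path):
    """Write a {metric: [values per epoch]} dictionary to a JSON file."""
    # Cast to plain floats, since NumPy scalar types are not JSON serializable
    serializable = {
        metric: [float(value) for value in values]
        for metric, values in history_dict.items()
    }
    with open(path, "w") as f:
        json.dump(serializable, f, indent=2)


# Invented values standing in for a real training history
demo_history = {"loss": [1.2, 0.9], "accuracy": [0.4, 0.6]}
history_path = os.path.join(tempfile.mkdtemp(), "history.json")
save_history(demo_history, history_path)

with open(history_path) as f:
    restored = json.load(f)
print(restored)
```

Storing the history separately makes it possible to redraw the loss and accuracy curves later without retraining.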

## 8. Convert Model to TensorFlow Lite

This section will cover the model conversion process.

The step by step process is as follows:
- Convert the model to the TensorFlow Lite format (.tflite) with quantization to reduce the model size
- Save the converted model
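A conversion along these lines can be sketched with `tf.lite.TFLiteConverter`. Setting `optimizations` to `tf.lite.Optimize.DEFAULT` enables the converter's default post-training quantization, which stores weights at reduced precision. The function name `convert_to_tflite`, the output file name, and the tiny stand-in model are assumptions made so the example runs by itself; the notebook would pass its trained model instead.

```python
import os
import tempfile

import tensorflow as tf


def convert_to_tflite(model, output_path):
    """Convert a Keras model to TensorFlow Lite with default quantization.

    Returns the size of the converted model in bytes.
    """
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()  # serialized model as bytes

    with open(output_path, "wb") as f:
        f.write(tflite_model)
    return len(tflite_model)


# Stand-in model, far smaller than the notebook's CNN
demo_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

tflite_path = os.path.join(tempfile.mkdtemp(), "demo_model.tflite")
num_bytes = convert_to_tflite(demo_model, tflite_path)
print(f"Saved {num_bytes} bytes to {tflite_path}")
```

The resulting `.tflite` file is the artifact that would be handed to the mobile application for on-device inference.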

## 9. Integrate with Mobile Device (Android)

This section will cover the integration of the model with an Android application.

The step by step process is as follows:
- Integrate with an Android application developed by team Mobile Development
- Load the model in the Android application
- Perform inference on the device
- Display the results