<p align="center">
  <img src="../src/assets/image/tensorflow_logo.png" alt="TensorFlow Image" width="50" align="left"/>
</p>

# Skin Problem Classification Notebook 


---

## Team Information

**Team ID:** C241-PS385  

**Members:**    
- Stefanus Bernard Melkisedek - [GitHub Profile](https://github.com/stefansphtr)
- Debby Trinita - [GitHub Profile](https://github.com/debbytrinita)
- Mhd. Reza Kurniawan Lubis - [GitHub Profile](https://github.com/rezakur)

## Chosen Development Environment

For this project, our team used Google Colab as the primary development environment, mainly because it provides free access to GPU and TPU resources. These accelerators shorten model training time considerably.

## 1. Setup

### 1.1 Import Libraries

In [None]:
# Standard library imports
import io
import itertools
import sys
import zipfile
import os
import shutil
from shutil import copyfile
from datetime import datetime
import random

# Third-party imports
from IPython.display import Image
import numpy as np
import pandas as pd  # needed by the CSV checks in sections 2.5 and 2.6
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, Dense, BatchNormalization, Dropout, MaxPooling2D, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model
import sklearn
from sklearn.metrics import confusion_matrix

### 1.2 Load TensorBoard Extension

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

## 2. Load and Preprocess Data

This section will cover the data loading and preprocessing steps. 

The step-by-step process is as follows:
1. [Mounting Google Drive](#step1) - This step is necessary to access the dataset stored in Google Drive.
   
2. [Extracting the Dataset and Creating the Train and Validation Directories](#step2) - The dataset is stored in a zip file. We will extract it and create the directory tree that will hold the split.
   
3. [Splitting the Dataset](#step3) - We will copy the images of each class into the training and validation directories.
   
4. [Creating the Image Data Generator](#step4) - We will define the parameters and generators that load and preprocess the images.
   
5. [Checking Column Names](#step5) - We will check the column names to ensure that they are clean and consistent.
   
6. [Cleaning Column Names](#step6) - We will clean the column names to ensure that they are consistent and easy to work with.

<a id='step1'></a>

### 2.1 Mounting Google Drive

The code mounts Google Drive to access the dataset stored there.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

<a id='step2'></a>

### 2.2 Extracting the Dataset and Creating the Train and Validation Directories

The code defines the path to the dataset zip file and the path where the dataset will be extracted. It then checks whether the data has already been extracted and, if not, extracts the zip file. Additionally, it prints the number of images in each class directory and creates the training and validation directories.

In [None]:
# Define the paths
ZIP_FILE_PATH = "/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/skin_problem_dataset.zip"
DESTINATION_PATH = "/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/"

# Define the skin classes
SKIN_CLASSES = ['acnes', 'blackheads', 'darkspots', 'wrinkles']

# Define root directory
ROOT_DIR = '/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/split'

In [None]:
def extract_dataset(zip_file_path, destination_path):
    # Check if the destination directory already exists
    if not os.path.exists(destination_path):
        # Create a ZipFile object
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            # Extract all the contents of the zip file to the destination directory
            zip_ref.extractall(destination_path)
            print(f"Zip file extracted to {destination_path}")
    else:
        print(f"Directory {destination_path} already exists, skipping extraction\n")

    # Add print statement to check the content of the data directory
    print(f"Content of {destination_path}: {os.listdir(destination_path)}\n")

def print_num_images(source_dir, skin_classes):
    # For each skin class, print the number of images
    for skin_class in skin_classes:
        class_dir = os.path.join(source_dir, skin_class)
        try:
            num_images = len(os.listdir(class_dir))
            print(f"The number of images in {skin_class} class: {num_images}\n")
        except FileNotFoundError:
            print(f"Directory {class_dir} does not exist.")
        except NotADirectoryError:
            print(f"{class_dir} is not a directory.")
        except Exception as e:
            print(f"An error occurred: {e}")

def create_train_val_dirs(root_path, skin_classes):
    # Empty directory to prevent FileExistsError if the function is run several times
    if os.path.exists(root_path):
        shutil.rmtree(root_path)

    # Create train and validation directories
    # train and validation directories for skin-case
    train_dir = os.path.join(root_path, 'training')
    val_dir = os.path.join(root_path, 'validation')

    # Create directories for each skin class
    for skin_class in skin_classes:
        # Create train directory for this skin class
        train_class_dir = os.path.join(train_dir, skin_class)
        os.makedirs(train_class_dir, exist_ok=True)

        # Create validation directory for this skin class
        val_class_dir = os.path.join(val_dir, skin_class)
        os.makedirs(val_class_dir, exist_ok=True)

    # Show all the directories in the root directory
    for root, dirs, files in os.walk(root_path):
        for subdir in dirs:
            print(os.path.join(root, subdir))
            

In [None]:
try:
    extract_dataset(ZIP_FILE_PATH, DESTINATION_PATH)
    print_num_images(DESTINATION_PATH, SKIN_CLASSES)
    create_train_val_dirs(ROOT_DIR, SKIN_CLASSES)
except Exception as e:
    print(f"An error occurred: {e}")
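
In the cell above, `extract_dataset` skips extraction whenever the destination path exists. Because `DESTINATION_PATH` is the same `data/` folder that holds the zip file, that check may always pass. One possible alternative, shown here as a minimal sketch, is to test for a folder that the archive is expected to create. The helper `extract_if_missing`, the `marker_dir` argument, and the demo archive are hypothetical names used for illustration only.

```python
import os
import tempfile
import zipfile


def extract_if_missing(zip_path, extract_root, marker_dir):
    """Extract zip_path into extract_root unless marker_dir already exists there.

    marker_dir is a hypothetical name for a folder that the archive is
    assumed to contain at its top level.
    """
    target = os.path.join(extract_root, marker_dir)
    if os.path.isdir(target):
        return False  # already extracted, nothing to do
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_root)
    return True


# Build a tiny throwaway archive so the sketch runs without Google Drive.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo_dataset.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("demo_dataset/acnes/img_001.jpg", b"not a real image")

first_run = extract_if_missing(demo_zip, work_dir, "demo_dataset")
second_run = extract_if_missing(demo_zip, work_dir, "demo_dataset")
print(first_run, second_run)  # True False
```

The second call returns `False` because the marker folder is already present, which is the behaviour the original check appears to aim for.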

<a id='step3'></a>

### 2.3 Splitting the Dataset into Training and Validation Sets

This section will cover the process of splitting the dataset into training and validation sets. The training set will be used to train the model, while the validation set will be used to evaluate the model's performance.

In [None]:
def split_data(SOURCE_DIR, TRAINING_DIR, VALIDATION_DIR, SPLIT_SIZE):
    """
    Splits the data in the source directory into a training set and a validation set.

    Parameters:
    SOURCE_DIR (str): The source directory.
    TRAINING_DIR (str): The directory where the training set should be copied.
    VALIDATION_DIR (str): The directory where the validation set should be copied.
    SPLIT_SIZE (float): The proportion of the data to be used for the training set.

    Returns:
    None
    """
    files = [filename for filename in os.listdir(SOURCE_DIR) if os.path.getsize(os.path.join(SOURCE_DIR, filename)) > 0]

    train_length = int(len(files) * SPLIT_SIZE)
    shuffled = random.sample(files, len(files))
    train_set = shuffled[:train_length]
    validation_set = shuffled[train_length:]

    for filename in train_set:
        copy_file(SOURCE_DIR, TRAINING_DIR, filename)

    for filename in validation_set:
        copy_file(SOURCE_DIR, VALIDATION_DIR, filename)

def copy_file(SOURCE_DIR, DEST_DIR, filename):
    src_file = os.path.join(SOURCE_DIR, filename)
    dest_file = os.path.join(DEST_DIR, filename)
    try:
        copyfile(src_file, dest_file)
    except Exception as e:
        print(f"An error occurred while copying {src_file} to {dest_file}: {e}")

def clear_directory(DIR):
    if len(os.listdir(DIR)) > 0:
        for file in os.scandir(DIR):
            os.remove(file.path)

In [None]:
# Define paths
BASE_DIR = "/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/skin_problem_dataset"
TRAINING_DIR = "/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/split/training"
VALIDATION_DIR = "/content/drive/My Drive/Capstone_Project_Bangkit/Machine_Learning/data/split/validation"

# Define skin problems
SKIN_PROBLEMS = ["acnes", "blackheads", "darkspots", "wrinkles"]

# Define proportion of images used for training
split_size = .8

# Split data for each skin problem
for problem in SKIN_PROBLEMS:
    SOURCE_DIR = os.path.join(BASE_DIR, problem)
    TRAINING_PROBLEM_DIR = os.path.join(TRAINING_DIR, problem)
    VALIDATION_PROBLEM_DIR = os.path.join(VALIDATION_DIR, problem)

    clear_directory(TRAINING_PROBLEM_DIR)
    clear_directory(VALIDATION_PROBLEM_DIR)

    split_data(SOURCE_DIR, TRAINING_PROBLEM_DIR, VALIDATION_PROBLEM_DIR, split_size)

    print(f"\n\nOriginal {problem}'s directory has {len(os.listdir(SOURCE_DIR))} images")
    print(f"\n\nThere are {len(os.listdir(TRAINING_PROBLEM_DIR))} images of {problem} for training")
    print(f"\n\nThere are {len(os.listdir(VALIDATION_PROBLEM_DIR))} images of {problem} for validation")
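
`split_data` shuffles with the global `random` state, so each run can produce a different split. If a repeatable split is wanted, one option is to shuffle with a seeded generator. The sketch below works on a list of made-up file names instead of real directories; `split_filenames` and the seed value `42` are assumptions for illustration, not part of the notebook.

```python
import random


def split_filenames(filenames, split_size, seed=42):
    """Return (train, validation) lists; the seed is an arbitrary assumed value."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on directory listing order.
    shuffled = rng.sample(sorted(filenames), len(filenames))
    train_length = int(len(shuffled) * split_size)
    return shuffled[:train_length], shuffled[train_length:]


demo_files = [f"img_{i:03d}.jpg" for i in range(10)]
train_a, val_a = split_filenames(demo_files, 0.8)
train_b, val_b = split_filenames(demo_files, 0.8)

print(len(train_a), len(val_a))  # 8 2
print(train_a == train_b)        # True
```

With the same seed the two calls return identical lists, so the training and validation folders would contain the same files on every run.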

<a id='step4'></a>

### 2.4 Create the Image Data Generator

This section will cover the creation of the image data generator. The image data generator is used to load the images from the dataset and preprocess them before feeding them into the model.

In [None]:
# Define constants
IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32
CLASS_MODE = 'categorical'
RESCALE = 1./255

# Define data augmentation parameters
data_augmentation_params = {
    "rescale": RESCALE,
    "rotation_range": 40,
    "width_shift_range": 0.2,
    "height_shift_range": 0.2,
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
    "fill_mode": 'nearest'
}

# Define generator parameters
generator_params = {
    "directory": TRAINING_DIR,
    "batch_size": BATCH_SIZE,
    "class_mode": CLASS_MODE,
    "target_size": IMAGE_SIZE
}

In [None]:
# Instantiate the ImageDataGenerator class for training data with augmentation
train_datagen = ImageDataGenerator(**data_augmentation_params)

# Pass in the appropriate arguments to the flow_from_directory method
train_generator = train_datagen.flow_from_directory(**generator_params)

# Instantiate the ImageDataGenerator class for validation data without augmentation
validation_datagen = ImageDataGenerator(rescale=RESCALE)

# Reuse the shared parameters, but read from the validation directory
validation_params = {**generator_params, "directory": VALIDATION_DIR}
validation_generator = validation_datagen.flow_from_directory(**validation_params)

In [None]:
# Get a single batch of images and labels from each generator
train_images, train_labels = next(train_generator)
validation_images, validation_labels = next(validation_generator)

# Print the shape of the images and labels
print(f"Image batch of training generator has shape: {train_images.shape}")
print(f"Image batch of validation generator has shape: {validation_images.shape}")

print("\n")

print(f"Label batch of training generator has shape: {train_labels.shape}")
print(f"Label batch of validation generator has shape: {validation_labels.shape}")
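
With `class_mode='categorical'`, `flow_from_directory` yields one-hot label vectors and assigns class indices to the subdirectory names in alphanumeric order. The sketch below reproduces that mapping in plain Python for the four class folders used in this notebook. The helper `one_hot` is a hypothetical name; in the notebook itself, the real mapping can be read from `train_generator.class_indices`.

```python
SKIN_CLASSES = ["acnes", "blackheads", "darkspots", "wrinkles"]

# Indices follow the sorted order of the class folder names.
class_indices = {name: index for index, name in enumerate(sorted(SKIN_CLASSES))}


def one_hot(class_name):
    """Return the one-hot vector for a class name (illustrative helper)."""
    vector = [0.0] * len(class_indices)
    vector[class_indices[class_name]] = 1.0
    return vector


print(class_indices)         # {'acnes': 0, 'blackheads': 1, 'darkspots': 2, 'wrinkles': 3}
print(one_hot("darkspots"))  # [0.0, 0.0, 1.0, 0.0]
```

This is also why the label batches printed above have one column per class.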

<a id='step5'></a>

### 2.5 Checking Column Names

The code loads the CSV files from each directory into pandas DataFrames and prints the column names.

In [None]:
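# Check the column names in each directory
# Note: train_dir, val_dir, and test_dir are not defined earlier in this notebook;
# point them at the folders that contain '_classes.csv' before running this cell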
df_train = pd.read_csv(os.path.join(train_dir, '_classes.csv'))
df_validation = pd.read_csv(os.path.join(val_dir, '_classes.csv'))
df_test = pd.read_csv(os.path.join(test_dir, '_classes.csv'))

# Print the columns name in each directory
print(f"Columns in the training directory: {df_train.columns}\n")
print(f"Columns in the validation directory: {df_validation.columns}\n")
print(f"Columns in the test directory: {df_test.columns}\n")

> Some column names in the CSV files appear to contain extra spaces. We will strip that whitespace from the column names.

<a id='step6'></a>

### 2.6 Cleaning Column Names

The code strips leading and trailing whitespace from the column names and then prints the cleaned names.

In [None]:
# Strip leading and trailing whitespace from the column names
df_train.columns = df_train.columns.str.strip()
df_validation.columns = df_validation.columns.str.strip()
df_test.columns = df_test.columns.str.strip()

In [None]:
# Confirm that the whitespace is gone from the column names
print(f"Columns in the training directory: {df_train.columns}\n")
print(f"Columns in the validation directory: {df_validation.columns}\n")
print(f"Columns in the test directory: {df_test.columns}\n")
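
To show what the stripping step does, here is a small self-contained sketch that uses only the standard `csv` module instead of pandas. The header row is invented for illustration; the actual contents of `_classes.csv` are not shown in this notebook.

```python
import csv
import io

# Hypothetical CSV content with a space after each comma in the header.
raw_csv = "filename, Acne, Blackheads\nimg_001.jpg,1,0\n"

reader = csv.reader(io.StringIO(raw_csv))
header = next(reader)
clean_header = [name.strip() for name in header]

print(header)        # ['filename', ' Acne', ' Blackheads']
print(clean_header)  # ['filename', 'Acne', 'Blackheads']
```

Without the cleanup, a lookup such as `'Acne'` would not match the stored name `' Acne'`, which is the kind of mismatch the pandas `str.strip()` call guards against.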

## 3. Data Augmentation

This section will cover the data augmentation process.

The step-by-step process is as follows:

1. [Creating Data Generators](#step7) - We will create data generators for the training, validation, and test sets.

2. [Visualizing Images](#step8) - We will visualize some images from the training set to understand the data better.
   
3. [Checking Labels](#step9) - We will check the distribution of labels in the training, validation, and test sets.

4. [Checking Dataset Sizes](#step10) - We will check the sizes of the training, validation, and test sets.

5. [Checking Batch Sizes](#step11) - We will check the batch sizes of the data generators.

<a id='step7'></a>

### 3.1 Creating Data Generators

The code defines a function to create ImageDataGenerators for the training, validation, and test sets. It then calls this function to create the generators and generate a batch of data from each generator.

In [None]:
def create_data_generators(train_dir, val_dir, test_dir, img_height, img_width, batch_size):
    """
    This function creates ImageDataGenerators for the training, validation, and test sets.
    It also loads the datasets from CSV files.

    Parameters:
    train_dir (str): The directory where the training set is located.
    val_dir (str): The directory where the validation set is located.
    test_dir (str): The directory where the test set is located.
    img_height (int): The height of the images.
    img_width (int): The width of the images.
    batch_size (int): The batch size.

    Returns:
    tuple: A tuple containing the training, validation, and test generators.
    """
    labels = ['Acne', 'Blackheads', 'Dark Spots', 'Dry Skin', 'Eye bags', 'Normal Skin', 'Oily Skin', 'Pores', 'Skin Redness', 'Wrinkles']

    # Create an ImageDataGenerator for the training set
    train_datagen = ImageDataGenerator(
        rescale=1./255,
        rotation_range=20,
        zoom_range=0.2,
        horizontal_flip=True,
    )

    # Create an ImageDataGenerator for the validation set
    validation_datagen = ImageDataGenerator(rescale=1./255)

    # Create an ImageDataGenerator for the test set
    test_datagen = ImageDataGenerator(rescale=1./255)

    # Load the training set from the CSV file
    train_generator = train_datagen.flow_from_dataframe(
        df_train,
        directory=train_dir,
        x_col='filename',
        y_col=labels,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='raw')

    # Load the validation set from the CSV file
    validation_generator = validation_datagen.flow_from_dataframe(
        df_validation,
        directory=val_dir,
        x_col='filename',
        y_col=labels,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='raw')

    # Load the test set from the CSV file
    test_generator = test_datagen.flow_from_dataframe(
        df_test,
        directory=test_dir,
        x_col='filename',
        y_col=labels,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='raw')

    return train_generator, validation_generator, test_generator

In [None]:
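# Test generator creation
# IMG_HEIGHT and IMG_WIDTH are taken from IMAGE_SIZE, defined in section 2.4
IMG_HEIGHT, IMG_WIDTH = IMAGE_SIZE
train_generator, validation_generator, test_generator = create_data_generators(train_dir, val_dir, test_dir, IMG_HEIGHT, IMG_WIDTH, BATCH_SIZE)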

# Generate a batch of data from each generator
train_images, train_labels = next(train_generator)
validation_images, validation_labels = next(validation_generator)
test_images, test_labels = next(test_generator)

print("\n")

print(f"Image batch of the training generator has shape: {train_images.shape}")
print(f"Image batch of the validation generator has shape: {validation_images.shape}")
print(f"Image batch of the test generator has shape: {test_images.shape}")

print("\n")

print(f"Label batch of the training generator has shape: {train_labels.shape}")
print(f"Label batch of the validation generator has shape: {validation_labels.shape}")
print(f"Label batch of the test generator has shape: {test_labels.shape}")

<a id='step8'></a>

### 3.2 Visualizing Images

The code defines a function to visualize a random sample of images from a dataset. It then calls this function to visualize images from each dataset.

In [None]:
def visualize_images(dataset, num_images):
    """
    This function takes a dataset and a number of images to display. It selects a random sample of images from the dataset
    and displays them in a grid.

    Parameters:
    dataset (DataFrameIterator): The dataset to select images from. This should be a TensorFlow DataFrameIterator object.
    num_images (int): The number of images to display. This should be a positive integer.

    Returns:
    None
    """
    # Define the labels list
    labels_list = ['Acne', 'Blackheads', 'Dark Spots', 'Dry Skin', 'Eye bags', 'Normal Skin', 'Oily Skin', 'Pores', 'Skin Redness', 'Wrinkles']

    # Take one batch from the dataset
    images, labels = next(dataset)

    # Select a few random images from the batch (never more than the batch holds)
    num_images = min(num_images, images.shape[0])
    random_indices = random.sample(range(images.shape[0]), num_images)
    selected_images = images[random_indices]
    selected_labels = labels[random_indices]

    # Map each label vector to the name of its first highest-valued column
    selected_labels = [labels_list[np.argmax(label)] for label in selected_labels]

    # Display the selected images and their labels in a grid with three columns
    num_rows = max(1, -(-num_images // 3))  # ceiling division
    plt.figure(figsize=(10, 10))
    for i in range(num_images):
        ax = plt.subplot(num_rows, 3, i + 1)
        plt.imshow(selected_images[i])
        plt.title(f"Label: {selected_labels[i]}")
        plt.axis("off")
    plt.show()

In [None]:
# Visualize images from each dataset
visualize_images(train_generator, 9)
visualize_images(validation_generator, 9)
visualize_images(test_generator, 9)

<a id='step9'></a>

### 3.3 Checking Labels

The code defines a function to print out the first few labels from a dataset. It then calls this function to check labels from each dataset.

In [None]:
def check_labels(dataset, num_labels):
    """
    This function takes a dataset and prints out the first few labels in their original string form.

    Parameters:
    dataset (DataFrameIterator): The dataset to select labels from. This should be a TensorFlow DataFrameIterator object.
    num_labels (int): The number of labels to print. This should be a positive integer.

    Returns:
    None
    """
    # Define the labels list
    labels_list = ['Acne', 'Blackheads', 'Dark Spots', 'Dry Skin', 'Eye bags', 'Normal Skin', 'Oily Skin', 'Pores', 'Skin Redness', 'Wrinkles']

    # Take one batch from the dataset
    _, labels = next(dataset)

    # Select a few labels from the batch
    selected_labels = labels[:num_labels]

    # Map the one-hot encoded labels back to their original string labels
    selected_labels = [labels_list[np.argmax(label)] for label in selected_labels]

    # Print out the selected labels
    print(f"Labels: {selected_labels}")

In [None]:
# Check labels from each dataset
check_labels(train_generator, 5)
check_labels(validation_generator, 5)
check_labels(test_generator, 5)
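
`check_labels` uses `np.argmax`, which reports a single label per image. With `class_mode='raw'` and a list of label columns, each label is a vector with one entry per column. If an image can carry more than one label (an assumption, since the CSV contents are not shown here), `argmax` returns only the first column holding the highest value. The sketch below lists every active label instead; `decode_multi_hot` and the sample row are hypothetical.

```python
LABELS = ['Acne', 'Blackheads', 'Dark Spots', 'Dry Skin', 'Eye bags',
          'Normal Skin', 'Oily Skin', 'Pores', 'Skin Redness', 'Wrinkles']


def decode_multi_hot(row, labels=LABELS):
    """Return the names of all columns whose flag is set to 1."""
    return [name for name, flag in zip(labels, row) if flag == 1]


# Made-up label vector with two active columns.
demo_row = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(decode_multi_hot(demo_row))  # ['Acne', 'Oily Skin']
```

For this row, `argmax` would report only `'Acne'`.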

<a id='step10'></a>

### 3.4 Checking Dataset Sizes

The code defines a function that returns the number of batches and the number of images in a dataset. It then calls this function for each dataset and prints the results.

In [None]:
def check_dataset_size(dataset):
    """
    This function takes a DataFrameIterator dataset and returns its size.

    Parameters:
    dataset (DataFrameIterator): The dataset to check the size of. This should be a TensorFlow DataFrameIterator object.

    Returns:
    Tuple: Number of batches and total number of images in the dataset
    """
    # Compute the number of batches in the dataset
    num_batches = len(dataset)

    # Compute the total number of images in the dataset
    num_images = dataset.samples

    return num_batches, num_images

In [None]:
# Check sizes of each dataset

# Print out the number of batches and the total number of images in training dataset
num_batches, num_images = check_dataset_size(train_generator)
print(f"Number of batches in the training dataset: {num_batches}")
print(f"Total number of images in the training dataset: {num_images}\n")

# Print out the number of batches and the total number of images in validation dataset
num_batches, num_images = check_dataset_size(validation_generator)
print(f"Number of batches in the validation dataset: {num_batches}")
print(f"Total number of images in the validation dataset: {num_images}\n")

# Print out the number of batches and the total number of images in test dataset
num_batches, num_images = check_dataset_size(test_generator)
print(f"Number of batches in the test dataset: {num_batches}")
print(f"Total number of images in the test dataset: {num_images}\n")

<a id='step11'></a>

### 3.5 Checking Batch Sizes

The code defines a function to print out the size of a batch from a dataset. It then calls this function to check the batch sizes of each dataset.

In [None]:
def check_batch_size(dataset):
    """
    This function takes a DataFrameIterator dataset and prints out the size of a batch.

    Parameters:
    dataset (DataFrameIterator): The dataset to check the batch size of. This should be a TensorFlow DataFrameIterator object.

    Returns:
    None
    """
    # Take one batch from the dataset
    images, _ = next(dataset)

    # Print out the size of the batch
    print(f"Batch size: {images.shape[0]}")

In [None]:
# Check batch sizes of each dataset
check_batch_size(train_generator)
check_batch_size(validation_generator)
check_batch_size(test_generator)
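
The numbers reported by the two checks above are related: the length of a Keras iterator is the sample count divided by the batch size, rounded up, so the last batch of an epoch can be smaller than `BATCH_SIZE`. The sketch below works through that arithmetic; `batch_layout` is a hypothetical helper and the sample counts are invented.

```python
import math


def batch_layout(num_samples, batch_size):
    """Return (number of batches, size of the final batch) for one epoch."""
    if num_samples == 0:
        return 0, 0
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch


print(batch_layout(100, 32))  # (4, 4)
print(batch_layout(64, 32))   # (2, 32)
```

A printed batch size below `BATCH_SIZE` therefore does not necessarily indicate a problem; it may simply be the final batch of an epoch.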

## 4. Build Model

This section will cover the model building process.

The step-by-step process is as follows:
- Define the model architecture
- Compile the model

## 5. Train Model

This section will cover the model training process.

The step-by-step process is as follows:
- Fit the model on training data

## 6. Evaluate Model

This section will cover the model evaluation process.

The step-by-step process is as follows:
- Evaluate the model on the test data
- Generate predictions
- Print classification report
- Plot confusion matrix
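
As a rough illustration of what the confusion matrix step computes, the sketch below counts true versus predicted classes in plain Python. Rows are true classes and columns are predicted classes, the same layout that `sklearn.metrics.confusion_matrix` (imported in section 1.1) uses. The labels and predictions are invented placeholders, not results from this project, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(true_labels, predicted_labels, class_names):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for true_name, predicted_name in zip(true_labels, predicted_labels):
        matrix[index[true_name]][index[predicted_name]] += 1
    return matrix


classes = ["acnes", "blackheads", "darkspots", "wrinkles"]
# Placeholder labels for illustration only.
y_true = ["acnes", "acnes", "blackheads", "darkspots", "wrinkles", "wrinkles"]
y_pred = ["acnes", "blackheads", "blackheads", "darkspots", "wrinkles", "acnes"]

matrix = confusion_counts(y_true, y_pred, classes)
print(matrix)  # [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]]

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(classes)))
print(correct, len(y_true))  # 4 6
```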

## 7. Save Model

This section will cover the model saving process.

The step-by-step process is as follows:
- Save the model for future use
- Save the model architecture
- Save the model weights
- Save the model history
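
For the history step, one simple option is to write the metrics to a JSON file. The `history` attribute of the object returned by Keras `model.fit` is a dictionary that maps metric names to per-epoch lists. The sketch below stands in for it with a hand-written dictionary of placeholder values; `save_history`, the file name, and the numbers are all assumptions for illustration.

```python
import json
import os
import tempfile

# Placeholder values standing in for history.history; not real training results.
demo_history = {
    "loss": [1.2, 0.8],
    "accuracy": [0.45, 0.70],
}


def save_history(history_dict, path):
    """Write a metrics dictionary to JSON, casting values to plain floats."""
    serializable = {
        name: [float(value) for value in values]
        for name, values in history_dict.items()
    }
    with open(path, "w") as handle:
        json.dump(serializable, handle, indent=2)


history_path = os.path.join(tempfile.mkdtemp(), "history.json")
save_history(demo_history, history_path)

with open(history_path) as handle:
    restored = json.load(handle)
print(restored["loss"])  # [1.2, 0.8]
```

Casting to `float` keeps the file writable even if the metric values arrive as NumPy scalars.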

## 8. Convert Model to TensorFlow Lite

This section will cover the model conversion process.

The step-by-step process is as follows:
- Convert the model to the TensorFlow Lite format (.tflite) with quantization to reduce the model size
- Save the converted model

## 9. Integrate with Mobile Device (Android)

This section will cover the integration of the model with an Android application.

The step-by-step process is as follows:
- Integrate with the Android application developed by the Mobile Development team
- Load the model in the Android application
- Perform inference on the device
- Display the results