# Skin Problem Classification Notebook 
---

## Team Information

**Team ID:** C241-PS385  

**Members:**    
- Stefanus Bernard Melkisedek - [GitHub Profile](https://github.com/stefansphtr)
- Debby Trinita - [GitHub Profile](https://github.com/debbytrinita)
- Mhd. Reza Kurniawan Lubis - [GitHub Profile](https://github.com/rezakur)

## Chosen Development Environment

For this project, our team used Google Colab as the primary development environment, mainly because it provides free access to GPU and TPU resources. These accelerators substantially shorten model training time.

## 1. Import Libraries

In [None]:
# Standard library imports
import os
import random
import shutil
import zipfile

# Third-party imports
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam

## 2. Load and Preprocess Data

This section will cover the data loading and preprocessing steps. 

The step-by-step process is as follows:
1. **Mounting Google Drive** - This step is necessary to access the dataset stored in Google Drive.
   
2. **Extracting the Dataset** - The dataset is stored in a zip file. We will extract the contents of the zip file to access the dataset.
   
3. **Copying the Data to the Local Directory** - We will copy the dataset to the local directory to facilitate data loading.
   
4. **Defining Directories and Parameters** - We will define the directories and parameters required for data loading.
   
5. **Checking Column Names** - We will check the column names to ensure that they are clean and consistent.
   
6. **Cleaning Column Names** - We will clean the column names to ensure that they are consistent and easy to work with.

### 2.1 Mounting Google Drive

The code mounts Google Drive to access the dataset stored there.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
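The `google.colab` package only exists inside Colab, so the mount cell raises an import error when the notebook runs anywhere else. A minimal sketch of a guarded mount follows. The helper names `running_in_colab` and `mount_drive_if_available` are hypothetical and are not part of the original notebook.

```python
def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


def mount_drive_if_available(mount_point: str = '/content/drive') -> bool:
    """Mount Google Drive inside Colab; do nothing elsewhere (hypothetical helper)."""
    if not running_in_colab():
        print("Not running in Colab; skipping Google Drive mount.")
        return False
    from google.colab import drive
    drive.mount(mount_point, force_remount=True)
    return True
```

Outside Colab, `mount_drive_if_available()` returns `False`, and the dataset paths used later would have to point to a local copy instead.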

### 2.2 Extracting the Dataset

The code defines the path to the dataset zip file and the path where the dataset will be extracted. It then checks if the data is already extracted. If not, it extracts the zip file.

In [None]:
# Define the path to the dataset zip file
dataset_zip_file_path = '/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/skin_problem_dataset.zip'

# Define the path where the dataset will be extracted
extraction_path = '/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/'

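# Check for an extracted split folder rather than extraction_path itself:
# extraction_path also contains the zip file, so it always exists.
# NOTE: this assumes the archive unpacks to a top-level 'train' folder (see section 2.4).
if not os.path.exists(os.path.join(extraction_path, 'train')):
    try:
        # Open the dataset zip file in read mode and extract all files
        with zipfile.ZipFile(dataset_zip_file_path, 'r') as dataset_zip_file:
            dataset_zip_file.extractall(extraction_path)
    except (OSError, zipfile.BadZipFile) as e:
        # Covers a missing or unreadable archive as well as a corrupt one
        print(f"An error occurred while extracting the zip file: {e}")
else:
    print("Data is already extracted.")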
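The "extract only once" idea can be tried without touching Google Drive. The sketch below is an illustration rather than the notebook's own code: `extract_once` is a hypothetical helper, and the archive and file names are made up. It skips extraction when a marker entry (here a `train` folder) already exists in the target directory.

```python
import os
import tempfile
import zipfile


def extract_once(zip_path: str, target_dir: str, marker: str) -> bool:
    """Extract zip_path into target_dir unless target_dir/marker already exists.

    Returns True if extraction ran, False if it was skipped.
    """
    if os.path.exists(os.path.join(target_dir, marker)):
        return False
    with zipfile.ZipFile(zip_path, 'r') as archive:
        archive.extractall(target_dir)
    return True


# Demonstration on a throwaway archive (file names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo_dataset.zip')
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('train/_classes.csv', 'filename,Acne\nimg_001.jpg,1\n')

    first_run = extract_once(demo_zip, tmp, marker='train')   # extraction runs
    second_run = extract_once(demo_zip, tmp, marker='train')  # skipped: marker exists
    extracted_ok = os.path.isfile(os.path.join(tmp, 'train', '_classes.csv'))
    print(first_run, second_run, extracted_ok)
```

Checking for a marker inside the target directory matters when, as here, the archive sits in the same folder it is extracted into.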

### 2.3 Copying the Data to the Local Directory

The code defines the source and destination directories and copies the data from the source to the destination.

In [None]:
# Define source and destination directories
source_dir = '/content/drive/Shareddrives/Capstone_Project/Machine_Learning/data/'
destination_dir = '/content/data/'

In [None]:
# Copy the data to the local environment
try:
    if not os.path.exists(destination_dir):
        shutil.copytree(source_dir, destination_dir)
    else:
        print("Destination directory already exists. Files were not copied.")
except Exception as e:
    print(f"An error occurred while copying files: {e}")
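Because the source directory also holds the dataset zip file, `shutil.copytree` copies the archive to the local disk along with the extracted folders. One option, shown below on temporary folders with made-up contents, is to pass `ignore=shutil.ignore_patterns('*.zip')`. Treat this as a sketch under that assumption, not as the notebook's required behavior.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'drive_data')   # stands in for the Drive folder
    dst = os.path.join(tmp, 'local_data')   # stands in for /content/data/

    # Build a miniature source tree: one split folder plus a placeholder archive
    os.makedirs(os.path.join(src, 'train'))
    with open(os.path.join(src, 'train', '_classes.csv'), 'w') as f:
        f.write('filename,Acne\n')
    with open(os.path.join(src, 'skin_problem_dataset.zip'), 'wb') as f:
        f.write(b'placeholder')

    # Copy everything except files matching *.zip
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns('*.zip'))

    copied_entries = sorted(os.listdir(dst))
    csv_copied = os.path.isfile(os.path.join(dst, 'train', '_classes.csv'))
    print(copied_entries, csv_copied)
```

Skipping the archive saves local disk space and copy time; the extracted images and CSV files are all that the later steps read.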

### 2.4 Defining Directories and Parameters

The code defines the directories for the training, validation, and test sets. It also defines the batch size and image size.

In [None]:
# Define the directories
train_dir = os.path.join(destination_dir, 'train')
val_dir = os.path.join(destination_dir, 'valid')
test_dir = os.path.join(destination_dir, 'test')

In [None]:
# Define the batch size and image size
BATCH_SIZE = 32
IMG_HEIGHT, IMG_WIDTH = 224, 224
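Before building generators, it can help to confirm that each split directory exists, contains `_classes.csv`, and holds image files. The following is a minimal sketch. `summarize_split` is a hypothetical helper, the list of image extensions is an assumption, and the demonstration uses a temporary folder with made-up file names.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed image formats


def summarize_split(split_dir: str) -> dict:
    """Report whether _classes.csv exists and how many image files the folder holds."""
    if not os.path.isdir(split_dir):
        return {'exists': False, 'has_csv': False, 'n_images': 0}
    files = os.listdir(split_dir)
    return {
        'exists': True,
        'has_csv': '_classes.csv' in files,
        'n_images': sum(name.lower().endswith(IMAGE_EXTENSIONS) for name in files),
    }


# Demonstration on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    demo_train = os.path.join(tmp, 'train')
    os.makedirs(demo_train)
    for name in ('_classes.csv', 'img_001.jpg', 'img_002.jpg'):
        open(os.path.join(demo_train, name), 'w').close()

    train_summary = summarize_split(demo_train)
    missing_summary = summarize_split(os.path.join(tmp, 'valid'))
    print(train_summary)
    print(missing_summary)
```

In the notebook, the same helper could be called on `train_dir`, `val_dir`, and `test_dir` to catch a failed copy before any CSV is read.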

### 2.5 Checking Column Names

The code loads the CSV files from each directory into pandas DataFrames and prints the column names.

In [None]:
# Load the label CSV from each split directory
df_train = pd.read_csv(os.path.join(train_dir, '_classes.csv'))
df_validation = pd.read_csv(os.path.join(val_dir, '_classes.csv'))
df_test = pd.read_csv(os.path.join(test_dir, '_classes.csv'))

# Print the column names of each DataFrame
print(f"Columns in the training directory: {df_train.columns}\n")
print(f"Columns in the validation directory: {df_validation.columns}\n")
print(f"Columns in the test directory: {df_test.columns}\n")

> Some column names in the CSV files appear to contain extra whitespace. We will strip it so that each column can be referenced by its plain name.

### 2.6 Cleaning Column Names

The code strips leading and trailing whitespace from the column names and then prints the cleaned names to verify the result.

In [None]:
# Remove leading and trailing whitespace from the column names
df_train.columns = df_train.columns.str.strip()
df_validation.columns = df_validation.columns.str.strip()
df_test.columns = df_test.columns.str.strip()

In [None]:
# Verify that the column names no longer contain extra whitespace
print(f"Columns in the training directory: {df_train.columns}\n")
print(f"Columns in the validation directory: {df_validation.columns}\n")
print(f"Columns in the test directory: {df_test.columns}\n")
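To see why the cleanup is needed, consider a toy CSV whose header has a space after each comma. The rows below are made up and only illustrate the mechanism. By default, pandas keeps that space as part of the column name, so a lookup such as `df['Acne']` would fail until the names are stripped.

```python
import io

import pandas as pd

# Toy CSV with a space after each comma in the header (made-up rows)
raw_csv = "filename, Acne, Blackheads\nimg_001.jpg,1,0\nimg_002.jpg,0,1\n"

df_demo = pd.read_csv(io.StringIO(raw_csv))
columns_before = df_demo.columns.tolist()

# Same cleanup as in the notebook: strip whitespace on both sides of each name
df_demo.columns = df_demo.columns.str.strip()
columns_after = df_demo.columns.tolist()

print(columns_before)  # ['filename', ' Acne', ' Blackheads']
print(columns_after)   # ['filename', 'Acne', 'Blackheads']
```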

## 3. Data Augmentation

This section will cover the data augmentation process.

The step-by-step process is as follows:

1. **Creating Data Generators** - We will create data generators for the training, validation, and test sets.

2. **Visualizing Images** - We will visualize some images from the training set to understand the data better.

3. **Checking Labels** - We will check the distribution of labels in the training, validation, and test sets.

4. **Checking Dataset Sizes** - We will check the sizes of the training, validation, and test sets.

5. **Checking Batch Sizes** - We will check the batch sizes of the data generators.

### 3.1 Creating Data Generators

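The code defines a function that creates ImageDataGenerators for the training, validation, and test sets. Only the training generator applies augmentation (rotation, zoom, and horizontal flip); the validation and test generators only rescale pixel values. The code then calls this function and draws one batch from each generator to check the output shapes.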

In [None]:
def create_data_generators(train_dir, val_dir, test_dir, img_height, img_width, batch_size):
    """
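    This function creates ImageDataGenerators for the training, validation, and test sets.
    File names and labels are read from the module-level DataFrames df_train,
    df_validation, and df_test, which were loaded from the CSV files earlier.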

    Parameters:
    train_dir (str): The directory where the training set is located.
    val_dir (str): The directory where the validation set is located.
    test_dir (str): The directory where the test set is located.
    img_height (int): The height of the images.
    img_width (int): The width of the images.
    batch_size (int): The batch size.

    Returns:
    tuple: A tuple containing the training, validation, and test generators.
    """
    labels = ['Acne', 'Blackheads', 'Dark Spots', 'Dry Skin', 'Eye bags', 'Normal Skin', 'Oily Skin', 'Pores', 'Skin Redness', 'Wrinkles']

    # Create an ImageDataGenerator for the training set
    train_datagen = ImageDataGenerator(
        rescale=1./255,
        rotation_range=20,
        zoom_range=0.2,
        horizontal_flip=True,
    )

    # Create an ImageDataGenerator for the validation set
    validation_datagen = ImageDataGenerator(rescale=1./255)

    # Create an ImageDataGenerator for the test set
    test_datagen = ImageDataGenerator(rescale=1./255)

    # Load the training set from the CSV file
    train_generator = train_datagen.flow_from_dataframe(
        df_train,
        directory=train_dir,
        x_col='filename',
        y_col=labels,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='raw')

    # Load the validation set from the CSV file
    validation_generator = validation_datagen.flow_from_dataframe(
        df_validation,
        directory=val_dir,
        x_col='filename',
        y_col=labels,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='raw')

    # Load the test set from its DataFrame; shuffle=False keeps the file order
    # fixed so predictions can be matched to the rows of df_test
    test_generator = test_datagen.flow_from_dataframe(
        df_test,
        directory=test_dir,
        x_col='filename',
        y_col=labels,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='raw',
        shuffle=False)

    return train_generator, validation_generator, test_generator
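Two of the transformations configured above are easy to reproduce by hand, which makes their effect concrete. The sketch below is a plain NumPy analogue on a made-up array, not the Keras implementation: it shows what `rescale=1./255` and a horizontal flip do to an image in height x width x channel layout. The generator applies the flip at random and also rotates and zooms, which this sketch does not attempt.

```python
import numpy as np

# A tiny 2x3 RGB "image" with 8-bit pixel values (made-up data, max value 170)
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3) * 10

# rescale=1./255 multiplies every pixel, mapping the range [0, 255] to [0, 1]
rescaled = image.astype(np.float32) * (1.0 / 255)

# A horizontal flip reverses the width axis (axis 1)
flipped = rescaled[:, ::-1, :]

# The first column of the flipped image equals the last column of the original
columns_swapped = np.array_equal(flipped[:, 0, :], rescaled[:, -1, :])
print(rescaled.shape, float(rescaled.min()), float(rescaled.max()) <= 1.0)
print(columns_swapped)
```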

In [None]:
# Create the three generators as a quick check
train_generator, validation_generator, test_generator = create_data_generators(train_dir, val_dir, test_dir, IMG_HEIGHT, IMG_WIDTH, BATCH_SIZE)

# Generate a batch of data from each generator
train_images, train_labels = next(train_generator)
validation_images, validation_labels = next(validation_generator)
test_images, test_labels = next(test_generator)

print("\n")

print(f"Images from the training generator have shape: {train_images.shape}")
print(f"Images from the validation generator have shape: {validation_images.shape}")
print(f"Images from the test generator have shape: {test_images.shape}")

print("\n")

print(f"Labels from the training generator have shape: {train_labels.shape}")
print(f"Labels from the validation generator have shape: {validation_labels.shape}")
print(f"Labels from the test generator have shape: {test_labels.shape}")
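With `class_mode='raw'` and a list of label columns, each label batch is a 2-D array with one column per entry in `labels`. The sketch below uses a made-up batch in place of `train_labels` and assumes the CSV label columns hold 0/1 indicators, which the notebook has not confirmed at this point. It shows how per-class counts and labels-per-image could be read from such a batch.

```python
import numpy as np

labels = ['Acne', 'Blackheads', 'Dark Spots', 'Dry Skin', 'Eye bags',
          'Normal Skin', 'Oily Skin', 'Pores', 'Skin Redness', 'Wrinkles']

# Made-up batch of 4 multi-hot label rows standing in for train_labels
demo_labels = np.array([
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
])

# One column per label name
assert demo_labels.shape[1] == len(labels)

# Summing over rows gives how often each class occurs in the batch
class_counts = dict(zip(labels, demo_labels.sum(axis=0).tolist()))

# Summing over columns gives how many labels each image carries
labels_per_image = demo_labels.sum(axis=1).tolist()

print(class_counts)
print(labels_per_image)
```

If any image carries more than one label, the task is multi-label rather than single-label, which affects the choice of output activation and loss later on.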