<a href="https://colab.research.google.com/github/LatiefDataVisionary/deep-learning-college-task/blob/main/tasks/week_5_tasks/Task_CNN_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Section 0: Initial Setup**

This section installs the required libraries that may not be preinstalled in Colab and mounts Google Drive.

## **0.1. Install Libraries**

Install the `mtcnn` library, which is used for face detection, along with `opencv-python`.

In [5]:
!pip install opencv-python
!pip install mtcnn

Collecting mtcnn
  Downloading mtcnn-1.0.0-py3-none-any.whl.metadata (5.8 kB)
Collecting lz4>=4.3.3 (from mtcnn)
  Downloading lz4-4.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading mtcnn-1.0.0-py3-none-any.whl (1.9 MB)
Downloading lz4-4.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: lz4, mtcnn
Successfully installed lz4-4.4.4 mtcnn-1.0.0


## **0.2. Mount Google Drive**

Mount Google Drive in the notebook so the dataset can be accessed.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Section 1: Import Libraries and Environment Setup**

**Explanation:** Here we import all modules and libraries needed for the whole project and define global variables such as directory paths, image size, and training parameters.

### **1.1. Import Core Libraries**

**Explanation:** Import the main libraries used throughout this project: tensorflow, keras, numpy, matplotlib.pyplot, os, zipfile, glob, shutil, cv2 (OpenCV), and the MTCNN face detector.

In [6]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, InputLayer, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
import numpy as np
import matplotlib.pyplot as plt
import os
import zipfile
import cv2
import glob
import shutil
from mtcnn.mtcnn import MTCNN

### **1.2. Define Configurations**

**Explanation:** Define the configuration variables used throughout the notebook: dataset paths, image size, training batch size, and number of epochs. The number of classes (students) is not set here; it is determined later by the data generator.

Dataset link (Google Drive): https://drive.google.com/drive/folders/1S5mRxYOfTPAmfqqFFLfbV_D5eWj5J9ox?usp=sharing

In [7]:
# Define directory paths
ZIP_PATH = '/content/drive/MyDrive/Dataset/Dataset Sistem Presensi Wajah V1.0.zip' # Path to the raw zip file in Google Drive
RAW_DATA_PATH = '/content/raw_dataset' # Directory to extract the raw dataset
PROCESSED_PATH = '/content/processed_dataset' # Directory to save the processed (face-detected) dataset

# Define image parameters
IMG_HEIGHT = 128 # Smaller size for custom CNN from scratch
IMG_WIDTH = 128
CHANNELS = 3 # RGB color images

# Define training parameters
BATCH_SIZE = 32
EPOCHS = 50 # Will be controlled by Early Stopping
# NUM_CLASSES will be determined later by the data generator

### **1.3. Extract Dataset**

**Explanation:** Extract the dataset archive from Google Drive into the Colab environment so it can be accessed as a regular directory.

In [8]:
# Reuse the zip path defined in Section 1.2 instead of repeating the literal
zip_path = ZIP_PATH
extract_path = '/content/dataset' # Directory to extract the dataset

# Create the extraction directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)

# Extract the zip file
print(f"Extracting {zip_path} to {extract_path}...")
try:
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
    print("Extraction complete.")
except FileNotFoundError:
    print(f"Error: Zip file not found at {zip_path}")
except zipfile.BadZipFile:
    print(f"Error: Could not open or read zip file at {zip_path}. It might be corrupted.")
except Exception as e:
    print(f"An error occurred during extraction: {e}")

# Update TRAIN_DIR and TEST_DIR to point to the extracted directories
# Based on the previous output, the extracted content is in a subfolder
extracted_subfolder = os.path.join(extract_path, 'Dataset Sistem Presensi Wajah V1.0')
TRAIN_DIR = os.path.join(extracted_subfolder, 'Data Train')
TEST_DIR = os.path.join(extracted_subfolder, 'Data Test')


print(f"Updated TRAIN_DIR: {TRAIN_DIR}")
print(f"Updated TEST_DIR: {TEST_DIR}")

# Verify that the directories exist after extraction
if os.path.exists(TRAIN_DIR):
    print(f"TRAIN_DIR exists: {TRAIN_DIR}")
else:
    print(f"Error: TRAIN_DIR not found after extraction at {TRAIN_DIR}")

if os.path.exists(TEST_DIR):
    print(f"TEST_DIR exists: {TEST_DIR}")
else:
    print(f"Error: TEST_DIR not found after extraction at {TEST_DIR}")

# Now it's safe to list contents if needed for verification after extraction
print(f"Contents of {extract_path} after extraction: {os.listdir(extract_path)}")

Extracting /content/drive/MyDrive/Dataset/Dataset Sistem Presensi Wajah V1.0.zip to /content/dataset...
Extraction complete.
Updated TRAIN_DIR: /content/dataset/Dataset Sistem Presensi Wajah V1.0/Data Train
Updated TEST_DIR: /content/dataset/Dataset Sistem Presensi Wajah V1.0/Data Test
TRAIN_DIR exists: /content/dataset/Dataset Sistem Presensi Wajah V1.0/Data Train
TEST_DIR exists: /content/dataset/Dataset Sistem Presensi Wajah V1.0/Data Test
Contents of /content/dataset after extraction: ['Dataset Sistem Presensi Wajah V1.0']
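
The extraction cell above unpacks the archive every time it runs. A small guard can skip that work when the target folder is already populated. This is a minimal sketch, not part of the original notebook; `extract_if_needed` is a hypothetical helper name, and treating any non-empty folder as "already extracted" is an assumption.

```python
import os
import zipfile


def extract_if_needed(zip_path, extract_path):
    """Extract zip_path into extract_path unless that folder already has content.

    Hypothetical helper (not from the original notebook).
    Returns True if extraction ran, False if it was skipped.
    """
    # Assumption: a non-empty target folder means a previous run extracted it
    if os.path.isdir(extract_path) and os.listdir(extract_path):
        return False
    os.makedirs(extract_path, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)
    return True
```

In this notebook it could be called as `extract_if_needed(ZIP_PATH, '/content/dataset')`; errors such as `FileNotFoundError` or `zipfile.BadZipFile` still propagate and can be caught as in the cell above.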


## **Section 2: Advanced Preprocessing - Face Detection and Cropping**

**Explanation:** This is the most crucial stage and the main upgrade. We process the entire raw dataset in a single pass: detect the face in each image, crop it, and save it to a new, clean, ready-to-use directory structure. Cropping to the face removes most of the background noise and reduces the aspect-ratio distortion that comes from resizing whole images.
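
One caveat on aspect ratio: the pipeline in 2.4 resizes each crop directly to the target size, so a non-square detection box is still stretched slightly. A possible mitigation, sketched here as an assumption and not part of the original pipeline, is to expand the detected box to a square before cropping. `square_box` is a hypothetical helper; it takes a box in the `(x, y, width, height)` form that MTCNN reports.

```python
def square_box(x, y, width, height, img_width, img_height):
    """Expand a detection box to a square around its center, clamped to the image.

    Hypothetical helper (not from the original notebook).
    Returns (x1, y1, x2, y2). Near an image border the result can end up
    non-square because of the clamping.
    """
    side = max(width, height)
    cx = x + width / 2.0
    cy = y + height / 2.0
    x1 = max(0, int(round(cx - side / 2.0)))
    y1 = max(0, int(round(cy - side / 2.0)))
    x2 = min(img_width, x1 + side)
    y2 = min(img_height, y1 + side)
    return x1, y1, x2, y2


# A 40x80 box inside a 640x480 image becomes an 80x80 square
print(square_box(100, 50, 40, 80, 640, 480))  # (80, 50, 160, 130)
```

The returned corners can be used directly for slicing, for example `image[y1:y2, x1:x2]`.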

### **2.1. Unzip Raw Dataset**

**Explanation:** Extract the raw dataset archive from ZIP_PATH into the RAW_DATA_PATH directory so the images can be accessed as files.

In [None]:
# Create the raw data extraction directory if it doesn't exist
os.makedirs(RAW_DATA_PATH, exist_ok=True)

# Extract the zip file to the raw data path
print(f"Extracting {ZIP_PATH} to {RAW_DATA_PATH}...")
try:
    with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall(RAW_DATA_PATH)
    print("Extraction complete.")
except FileNotFoundError:
    print(f"Error: Zip file not found at {ZIP_PATH}")
except zipfile.BadZipFile:
    print(f"Error: Could not open or read zip file at {ZIP_PATH}. It might be corrupted.")
except Exception as e:
    print(f"An error occurred during extraction: {e}")

# Verify contents of the extracted raw data directory
print(f"Contents of {RAW_DATA_PATH} after extraction: {os.listdir(RAW_DATA_PATH)}")

# Determine the actual path to the raw image files inside the extracted folder
# Assuming the zip contains a single main folder
extracted_items = os.listdir(RAW_DATA_PATH)
if len(extracted_items) == 1 and os.path.isdir(os.path.join(RAW_DATA_PATH, extracted_items[0])):
    ACTUAL_RAW_DATA_ROOT = os.path.join(RAW_DATA_PATH, extracted_items[0])
else:
    # If structure is different, you might need to adjust this
    ACTUAL_RAW_DATA_ROOT = RAW_DATA_PATH
    print("Warning: Extracted data structure is not a single subfolder. Assuming raw images are directly in RAW_DATA_PATH.")

print(f"Actual root directory for raw images: {ACTUAL_RAW_DATA_ROOT}")

# List a few files to confirm
# Search recursively: the archive may keep images inside subfolders such as 'Data Train' and 'Data Test'
raw_image_files = glob.glob(os.path.join(ACTUAL_RAW_DATA_ROOT, '**', '*.*'), recursive=True)
print(f"Found {len(raw_image_files)} raw image files.")
if len(raw_image_files) > 5:
    print("First 5 raw files:", raw_image_files[:5])
elif len(raw_image_files) > 0:
    print("Raw files:", raw_image_files)
else:
    print("No raw image files found. Check ZIP_PATH and extraction process.")

### **2.2. Initialize Face Detector**

**Explanation:** Initialize the MTCNN model that will be used to detect the face in each image.

In [None]:
# Initialize MTCNN detector
detector = MTCNN()
print("MTCNN detector initialized.")
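
`detector.detect_faces` returns a list of dictionaries, each with a `'box'` (`[x, y, width, height]`) and a `'confidence'` score. When an image contains more than one detection, taking the first entry may not give the main face. The sketch below picks the highest-confidence detection instead; `pick_best_face` is a hypothetical helper, and the 0.9 threshold is an assumed value, not one taken from the notebook.

```python
def pick_best_face(results, min_confidence=0.9):
    """Return the highest-confidence detection, or None if none passes the threshold.

    Hypothetical helper (not from the original notebook).
    `results` is a list of dicts shaped like MTCNN's detect_faces output.
    """
    candidates = [r for r in results if r.get('confidence', 0.0) >= min_confidence]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r['confidence'])


# Example detections shaped like MTCNN's output (values are made up)
sample_results = [
    {'box': [10, 20, 50, 60], 'confidence': 0.93},
    {'box': [200, 40, 80, 90], 'confidence': 0.99},
    {'box': [5, 5, 12, 12], 'confidence': 0.41},
]
best = pick_best_face(sample_results)
print(best['box'])  # [200, 40, 80, 90]
```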

### **2.3. Prepare Processed Directory Structure**

**Explanation:** Create a new directory structure under PROCESSED_PATH to store the detected and cropped face images. It contains one sub-folder for training data and one for testing data, each with a sub-folder per class (based on the student ID number, NIM).

In [None]:
# Clean up and create the processed data directories
if os.path.exists(PROCESSED_PATH):
    print(f"Removing existing processed data directory: {PROCESSED_PATH}")
    shutil.rmtree(PROCESSED_PATH)

os.makedirs(PROCESSED_PATH)
os.makedirs(os.path.join(PROCESSED_PATH, 'train'))
os.makedirs(os.path.join(PROCESSED_PATH, 'test'))
print(f"Created processed data directories: {PROCESSED_PATH}/train and {PROCESSED_PATH}/test")

# Get unique class names (NIMs) from raw filenames
# Search recursively so images inside subfolders (e.g. 'Data Train') are also found
raw_filenames = [os.path.basename(p) for p in glob.glob(os.path.join(ACTUAL_RAW_DATA_ROOT, '**', '*.*'), recursive=True)]
# Keep image files only
image_filenames = [f for f in raw_filenames if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

# Extract NIM (first 10 characters of the filename) as class labels
class_names = sorted(list(set([f[:10] for f in image_filenames if len(f) >= 10])))

if not class_names:
    print("Error: No class names (NIMs) extracted from filenames. Check file naming convention and ACTUAL_RAW_DATA_ROOT.")
else:
    print(f"Found {len(class_names)} unique classes (NIMs).")
    # Create sub-folders for each class in train and test directories
    for class_name in class_names:
        os.makedirs(os.path.join(PROCESSED_PATH, 'train', class_name), exist_ok=True)
        os.makedirs(os.path.join(PROCESSED_PATH, 'test', class_name), exist_ok=True)
    print("Created class sub-folders in processed train and test directories.")

# Store class_names for later use
CLASS_NAMES = class_names
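
The same two steps, deriving class labels from filename prefixes and creating the per-class folders, can be tried in isolation on a throwaway directory. This is a minimal sketch under the notebook's naming assumption that the first 10 characters of each filename are the NIM; the helper names and the sample filenames are hypothetical.

```python
import os
import tempfile


def class_names_from_filenames(filenames, prefix_len=10):
    """Collect sorted unique labels from the first prefix_len characters of image filenames."""
    image_exts = ('.jpg', '.jpeg', '.png')
    return sorted({
        name[:prefix_len]
        for name in filenames
        if name.lower().endswith(image_exts) and len(name) >= prefix_len
    })


def make_class_dirs(root, class_names, splits=('train', 'test')):
    """Create root/<split>/<class_name> for every split and class."""
    for split in splits:
        for class_name in class_names:
            os.makedirs(os.path.join(root, split, class_name), exist_ok=True)


# Made-up filenames following the "NIM first" convention
names = ['2101010001_a.jpg', '2101010001_b.png', '2101010002_a.JPG', 'notes.txt']
classes = class_names_from_filenames(names)
print(classes)  # ['2101010001', '2101010002']

demo_root = tempfile.mkdtemp()  # Throwaway directory so nothing real is touched
make_class_dirs(demo_root, classes)
print(sorted(os.listdir(os.path.join(demo_root, 'train'))))  # ['2101010001', '2101010002']
```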

### **2.4. Run the Face Detection & Cropping Pipeline**

**Explanation:** Define and run a function that detects the face in each raw image, crops it, and saves it into the directory structure prepared under PROCESSED_PATH. The data is split manually into training and testing sets per class (by default 80% train, 20% test).

In [None]:
def process_and_save_faces(raw_data_root_dir, processed_train_dir, processed_test_dir, detector, img_width, img_height, split_ratio=0.8):
    """
    Processes raw images: detects faces, crops them, resizes, and saves to
    processed train/test directories based on NIM from filename.

    Args:
        raw_data_root_dir (str): Directory containing all raw images.
        processed_train_dir (str): Destination directory for processed training images.
        processed_test_dir (str): Destination directory for processed testing images.
        detector (MTCNN): Initialized MTCNN face detector.
        img_width (int): Target width for processed images.
        img_height (int): Target height for processed images.
        split_ratio (float): Ratio of data to use for training (e.g., 0.8 for 80% train).
    """
    print(f"Starting face detection and cropping pipeline from {raw_data_root_dir}...")

    # Group files by class (NIM)
    # Search recursively in case the raw images sit inside subfolders
    class_files = {}
    all_raw_files = glob.glob(os.path.join(raw_data_root_dir, '**', '*.*'), recursive=True)
    for filepath in all_raw_files:
        filename = os.path.basename(filepath)
        if len(filename) >= 10:
            nim = filename[:10]
            if nim in CLASS_NAMES:  # Ensure NIM is one of the classes identified in step 2.3 (global)
                class_files.setdefault(nim, []).append(filepath)

    total_processed = 0
    total_skipped = 0

    for nim, files in class_files.items():
        print(f"Processing class {nim} with {len(files)} images...")
        # Shuffle files for random train/test split
        np.random.shuffle(files)
        split_index = int(len(files) * split_ratio)

        train_files = files[:split_index]
        test_files = files[split_index:]

        print(f"  - {len(train_files)} for training, {len(test_files)} for testing.")

        # Process training files
        for i, filepath in enumerate(train_files):
            filename = os.path.basename(filepath)
            try:
                image = cv2.imread(filepath)
                if image is None:
                    print(f"Warning: Could not read image file: {filepath}. Skipping.")
                    total_skipped += 1
                    continue

                # Convert BGR to RGB (MTCNN expects RGB)
                image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

                # Detect faces
                results = detector.detect_faces(image_rgb)

                if results:
                    # Get the first detected face (assuming one main face per image)
                    x, y, width, height = results[0]['box']

                    # Add margin (adjust as needed)
                    margin_x = int(width * 0.2)
                    margin_y = int(height * 0.2)
                    x1 = max(0, x - margin_x)
                    y1 = max(0, y - margin_y)
                    x2 = min(image.shape[1], x + width + margin_x)
                    y2 = min(image.shape[0], y + height + margin_y)

                    # Crop the face with margin
                    face_crop = image[y1:y2, x1:x2]

                    # Resize the cropped face to target size
                    face_resized = cv2.resize(face_crop, (img_width, img_height))

                    # Save the processed face image to the training directory
                    dest_filepath = os.path.join(processed_train_dir, nim, filename)
                    cv2.imwrite(dest_filepath, face_resized)
                    total_processed += 1
                else:
                    print(f"Warning: No face detected in {filepath}. Skipping.")
                    total_skipped += 1

            except Exception as e:
                print(f"Error processing {filepath}: {e}. Skipping.")
                total_skipped += 1

        # Process testing files
        for i, filepath in enumerate(test_files):
            filename = os.path.basename(filepath)
            try:
                image = cv2.imread(filepath)
                if image is None:
                    print(f"Warning: Could not read image file: {filepath}. Skipping.")
                    total_skipped += 1
                    continue

                # Convert BGR to RGB (MTCNN expects RGB)
                image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

                # Detect faces
                results = detector.detect_faces(image_rgb)

                if results:
                    # Get the first detected face
                    x, y, width, height = results[0]['box']

                    # Add margin
                    margin_x = int(width * 0.2)
                    margin_y = int(height * 0.2)
                    x1 = max(0, x - margin_x)
                    y1 = max(0, y - margin_y)
                    x2 = min(image.shape[1], x + width + margin_x)
                    y2 = min(image.shape[0], y + height + margin_y)

                    # Crop the face with margin
                    face_crop = image[y1:y2, x1:x2]

                    # Resize the cropped face to target size
                    face_resized = cv2.resize(face_crop, (img_width, img_height))

                    # Save the processed face image to the testing directory
                    dest_filepath = os.path.join(processed_test_dir, nim, filename)
                    cv2.imwrite(dest_filepath, face_resized)
                    total_processed += 1
                else:
                    print(f"Warning: No face detected in {filepath}. Skipping.")
                    total_skipped += 1

            except Exception as e:
                print(f"Error processing {filepath}: {e}. Skipping.")
                total_skipped += 1


    print("\nFace detection and cropping pipeline finished.")
    print(f"Total images processed and saved: {total_processed}")
    print(f"Total images skipped (no face detected or error): {total_skipped}")

# Run the pipeline
# Ensure ACTUAL_RAW_DATA_ROOT is correctly determined in step 2.1
if 'ACTUAL_RAW_DATA_ROOT' in globals() and os.path.exists(ACTUAL_RAW_DATA_ROOT):
    process_and_save_faces(ACTUAL_RAW_DATA_ROOT,
                           os.path.join(PROCESSED_PATH, 'train'),
                           os.path.join(PROCESSED_PATH, 'test'),
                           detector,
                           IMG_WIDTH, IMG_HEIGHT)
else:
    print("Error: ACTUAL_RAW_DATA_ROOT is not set or does not exist. Cannot run processing pipeline.")
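
The pipeline above shuffles each class with `np.random.shuffle` and no seed, so the train/test assignment changes from run to run. If a repeatable split is wanted, the per-class split can be isolated and seeded. This is a minimal sketch, not the notebook's own method; `split_per_class` and the seed value 42 are assumptions chosen for illustration.

```python
import random


def split_per_class(class_files, split_ratio=0.8, seed=42):
    """Split each class's file list into train/test with a seeded shuffle.

    Hypothetical helper (not from the original notebook).
    class_files maps a class label to a list of file paths.
    Returns {label: (train_files, test_files)}.
    """
    rng = random.Random(seed)  # Local RNG so the global random state is untouched
    splits = {}
    for label in sorted(class_files):  # Sorted for a stable iteration order
        files = sorted(class_files[label])
        rng.shuffle(files)
        split_index = int(len(files) * split_ratio)
        splits[label] = (files[:split_index], files[split_index:])
    return splits


# Ten made-up filenames for a single class
demo = {'2101010001': [f'2101010001_{i}.jpg' for i in range(10)]}
train_files, test_files = split_per_class(demo)['2101010001']
print(len(train_files), len(test_files))  # 8 2
```

Because `int()` truncates, very small classes get uneven splits: a class with a single image ends up with no training files at a ratio of 0.8.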

### **2.5. Verify Processed Dataset**

**Explanation:** Count the images in the processed training and testing directories to confirm that the face detection and cropping pipeline ran successfully and the data was saved correctly.

In [None]:
# Function to count images in a directory, including subdirectories
def count_images_in_directory(directory):
    count = 0
    if not os.path.exists(directory):
        return 0
    for root, _, files in os.walk(directory):
        for file in files:
            if file.lower().endswith(('.jpg', '.jpeg', '.png')):
                count += 1
    return count

# Count images in processed train and test directories
train_count = count_images_in_directory(os.path.join(PROCESSED_PATH, 'train'))
test_count = count_images_in_directory(os.path.join(PROCESSED_PATH, 'test'))

print(f"Number of processed images in training directory ({os.path.join(PROCESSED_PATH, 'train')}): {train_count}")
print(f"Number of processed images in testing directory ({os.path.join(PROCESSED_PATH, 'test')}): {test_count}")

if train_count == 0 or test_count == 0:
    print("Warning: No processed images found in one or both directories. Check the processing pipeline and file paths.")
else:
    print("Processed dataset structure verified.")
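
The totals above do not show whether an individual class ended up empty, for example when no face was detected in any of its images. A per-class count makes that visible. This is a minimal sketch; `count_images_per_class` is a hypothetical helper that assumes the `<split>/<class>/<image>` layout created in step 2.3.

```python
import os


def count_images_per_class(split_dir):
    """Count image files in each immediate class sub-folder of split_dir.

    Hypothetical helper (not from the original notebook).
    Returns a dict {class_name: image_count}; empty if split_dir does not exist.
    """
    image_exts = ('.jpg', '.jpeg', '.png')
    counts = {}
    if not os.path.isdir(split_dir):
        return counts
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # Ignore stray files next to the class folders
        counts[class_name] = sum(
            1 for f in os.listdir(class_dir) if f.lower().endswith(image_exts)
        )
    return counts
```

In this notebook it could be used as `count_images_per_class(os.path.join(PROCESSED_PATH, 'train'))`, and classes whose count is 0 can then be listed with `[c for c, n in counts.items() if n == 0]`.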