<a href="https://colab.research.google.com/github/LatiefDataVisionary/deep-learning-college-task/blob/main/tasks/week_5_tasks/Task_ViT_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Section 0: Initial Setup**

**Explanation:** This section installs the libraries that may not be available in Colab by default and mounts Google Drive.

### **0.1. Install Libraries**

**Explanation:** Installs the additional library this notebook needs for image processing: OpenCV, published on PyPI as `opencv-python` and imported as `cv2`.

In [11]:
# Install OpenCV (the PyPI package is named opencv-python, not opencv)
!pip install opencv-python

### **0.2. Mount Google Drive**

**Explanation:** Mounts your Google Drive in this notebook so that it can read the image dataset stored there. After running the cell, follow the authorization prompts that appear.

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Section 1: Import Libraries and Environment Setup**

**Explanation:** Imports all required modules and defines the global variables, including the paths for the raw data and the processed data.

### **1.1. Import Core Libraries**

**Explanation:** Imports the main libraries used throughout the project: tensorflow, keras, numpy, matplotlib, os, zipfile, cv2 (OpenCV), glob, and shutil.

In [13]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization, InputLayer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import zipfile
import cv2 # Import OpenCV
import glob # To list files
import shutil # To manage directories
import random
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns # For confusion matrix visualization

### **1.2. Define Configurations**

**Explanation:** Defines the configuration variables used throughout the notebook: the dataset paths, the target image size, the training batch size, the number of epochs, and the number of classes (students).

In [14]:
# Define Directory Paths
# Path to the zip file on Google Drive
ZIP_PATH = '/content/drive/MyDrive/Dataset/Dataset Sistem Presensi Wajah V1.0.zip'
# Extraction path for the raw data (before face cropping)
RAW_DATA_PATH = '/content/raw_dataset'
# Output path for the cleaned dataset (after face cropping)
PROCESSED_PATH = '/content/processed_dataset'

# Define Image Parameters
IMG_HEIGHT = 128 # A smaller size suits a model trained from scratch
IMG_WIDTH = 128
CHANNELS = 3 # RGB color images

# Define Training Parameters
BATCH_SIZE = 32
EPOCHS = 50 # Will be controlled by Early Stopping
# NUM_CLASSES will be determined by the data generator later
NUM_CLASSES = None # Placeholder

## **Section 2: Advanced Preprocessing - Face Detection and Cropping**

**Explanation:** This is the most crucial stage and the main upgrade. The entire raw dataset is processed in a single pass: each image is cropped to the face region and saved into a new, clean, ready-to-use directory structure. This addresses aspect-ratio distortion and background noise. The stage was designed around OpenCV Haar Cascade face detection; in the current version (Sections 2.2 and 2.4) detection is replaced by a simple center crop.

### **2.1. Unzip Raw Dataset**

**Explanation:** Extracts the .zip file into `RAW_DATA_PATH`.

In [15]:
print(f"Extracting {ZIP_PATH} to {RAW_DATA_PATH}...")
try:
    with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall(RAW_DATA_PATH)
    print("Extraction complete.")
except FileNotFoundError:
    print(f"Error: Zip file not found at {ZIP_PATH}")
except zipfile.BadZipFile:
    print(f"Error: Could not open or read zip file at {ZIP_PATH}. It might be corrupted.")
except Exception as e:
    print(f"An error occurred during extraction: {e}")

# Verify extraction
if os.path.exists(RAW_DATA_PATH):
    print(f"\nContents of {RAW_DATA_PATH} after extraction: {os.listdir(RAW_DATA_PATH)}")
else:
    print(f"\nError: Raw data directory not found after extraction at {RAW_DATA_PATH}")

# The archive may extract into a single wrapper folder inside RAW_DATA_PATH.
# Find the folder that actually contains the images.
raw_image_folder = None
# Guard against a failed extraction, where RAW_DATA_PATH does not exist
for item in (os.listdir(RAW_DATA_PATH) if os.path.isdir(RAW_DATA_PATH) else []):
    item_path = os.path.join(RAW_DATA_PATH, item)
    if os.path.isdir(item_path):
        # Simple check: it contains subfolders or .jpg files (case-insensitive)
        sub_items = os.listdir(item_path)
        if any(os.path.isdir(os.path.join(item_path, sub_item)) for sub_item in sub_items) or \
           any(sub_item.lower().endswith('.jpg') for sub_item in sub_items):
            raw_image_folder = item_path
            break

if raw_image_folder:
    print(f"\nIdentified raw image folder: {raw_image_folder}")
    # Update RAW_DATA_PATH to point directly to the folder containing the images
    RAW_DATA_PATH = raw_image_folder
    print(f"Updated RAW_DATA_PATH: {RAW_DATA_PATH}")
else:
    print("\nWarning: Could not identify the main folder containing raw images within the extracted directory.")
    print("Please manually inspect the extracted contents and update RAW_DATA_PATH if necessary.")

Extracting /content/drive/MyDrive/Dataset/Dataset Sistem Presensi Wajah V1.0.zip to /content/raw_dataset...
Extraction complete.

Contents of /content/raw_dataset after extraction: ['Dataset Sistem Presensi Wajah V1.0']

Identified raw image folder: /content/raw_dataset/Dataset Sistem Presensi Wajah V1.0
Updated RAW_DATA_PATH: /content/raw_dataset/Dataset Sistem Presensi Wajah V1.0
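
The folder search above only guesses where the images live, so it can help to look at the archive layout directly. The following is a minimal sketch, not part of the original notebook: `summarize_zip` is a hypothetical helper that counts archive entries per top-level folder and per file extension using `zipfile.ZipFile.namelist()`.

```python
import os
import zipfile
from collections import Counter


def summarize_zip(zip_path):
    """Return (top_level_counts, extension_counts) for the files in a zip archive."""
    top_level = Counter()
    extensions = Counter()
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        for name in zip_ref.namelist():
            if name.endswith('/'):
                continue  # skip directory entries
            # Zip member names always use '/' as the separator
            top_level[name.split('/')[0]] += 1
            # Lower-case the extension so '.JPG' and '.jpg' are counted together
            extensions[os.path.splitext(name)[1].lower()] += 1
    return top_level, extensions
```

Calling `summarize_zip(ZIP_PATH)` before extraction would show which extensions the dataset actually uses and whether everything sits under one wrapper folder.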


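### **2.2. Initialize Face Detector (OpenCV Haar Cascade)**

**Explanation:** This cell originally loaded OpenCV's pre-trained Haar Cascade for frontal-face detection. The loading code is now commented out because face detection has been replaced by simple center cropping, so `face_cascade` is set to `None`.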

In [16]:
# Remove the Haar Cascade XML file loading as we are no longer using it
# haar_cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'

# Remove the check and initialization
# if not os.path.exists(haar_cascade_path):
#     print(f"Error: Haar Cascade file not found at {haar_cascade_path}")
#     print("You might need to find its location or download it manually.")
#     face_cascade = None # Set cascade to None if file is not found
# else:
#     # Initialize the Haar Cascade face detector
#     face_cascade = cv2.CascadeClassifier(haar_cascade_path)

# if face_cascade is not None and face_cascade.empty():
#     print("Error: Failed to load Haar Cascade classifier.")
#     face_cascade = None
# else:
#     print("OpenCV Haar Cascade face detector initialized successfully.")

print("OpenCV Haar Cascade initialization skipped as face detection is being replaced by simple cropping.")
face_cascade = None # Explicitly set to None as it's not used

OpenCV Haar Cascade initialization skipped as face detection is being replaced by simple cropping.
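
For reference, should Haar Cascade detection be re-enabled, `cv2.CascadeClassifier.detectMultiScale` returns one `(x, y, w, h)` rectangle per detected face. The following is a minimal sketch of the post-processing step only, not the notebook's method: `largest_face_box` is a hypothetical helper and the 20% margin is an arbitrary assumption. It needs no OpenCV; it picks the largest rectangle and pads it within the image bounds.

```python
def largest_face_box(boxes, img_height, img_width, margin=0.2):
    """Pick the largest (x, y, w, h) box and return padded (x1, y1, x2, y2), or None."""
    if len(boxes) == 0:
        return None  # no face detected
    # The largest box by area is assumed to be the main subject
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])
    pad_x = int(w * margin)
    pad_y = int(h * margin)
    # Clamp the padded box to the image bounds
    x1 = max(0, x - pad_x)
    y1 = max(0, y - pad_y)
    x2 = min(img_width, x + w + pad_x)
    y2 = min(img_height, y + h + pad_y)
    return x1, y1, x2, y2
```

The returned corners can be used directly as a slice, for example `img[y1:y2, x1:x2]`, with a center crop as the fallback when the function returns `None`.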


### **2.3. Prepare Processed Directory Structure**

**Explanation:** Creates the destination directory for the processed data (`PROCESSED_PATH`) with `train` and `test` sub-directories inside it. It then treats each subfolder of `RAW_DATA_PATH` as a class (student) and creates a matching sub-folder for every class in both the processed `train` and `test` directories.

In [17]:
# Remove existing processed directory if it exists to start fresh
if os.path.exists(PROCESSED_PATH):
    print(f"Removing existing processed directory: {PROCESSED_PATH}")
    shutil.rmtree(PROCESSED_PATH)

# Create the main processed directory
os.makedirs(PROCESSED_PATH, exist_ok=True)

# Define train and test paths within the processed directory
PROCESSED_TRAIN_DIR = os.path.join(PROCESSED_PATH, 'train')
PROCESSED_TEST_DIR = os.path.join(PROCESSED_PATH, 'test')

# Create train and test sub-directories
os.makedirs(PROCESSED_TRAIN_DIR, exist_ok=True)
os.makedirs(PROCESSED_TEST_DIR, exist_ok=True)

print(f"Created processed directories: {PROCESSED_TRAIN_DIR} and {PROCESSED_TEST_DIR}")

# Identify class names (NIMs) from the subdirectories within RAW_DATA_PATH
# Assuming RAW_DATA_PATH now points to the main folder containing class subfolders
class_nims = sorted([d for d in os.listdir(RAW_DATA_PATH) if os.path.isdir(os.path.join(RAW_DATA_PATH, d))])


if not class_nims:
    print("Warning: No class subfolders found within the raw data directory. Please check the structure of the extracted data.")
else:
    print(f"\nIdentified {len(class_nims)} unique classes (NIMs) from subfolders. Example: {class_nims[:10]}...")

    # Create class sub-folders in both train and test processed directories
    for nim in class_nims:
        os.makedirs(os.path.join(PROCESSED_TRAIN_DIR, nim), exist_ok=True)
        os.makedirs(os.path.join(PROCESSED_TEST_DIR, nim), exist_ok=True)
    print(f"\nCreated sub-folders for {len(class_nims)} classes in {PROCESSED_TRAIN_DIR} and {PROCESSED_TEST_DIR}.")

# Update NUM_CLASSES global variable
NUM_CLASSES = len(class_nims)
print(f"Updated NUM_CLASSES: {NUM_CLASSES}")

Removing existing processed directory: /content/processed_dataset
Created processed directories: /content/processed_dataset/train and /content/processed_dataset/test

Identified 2 unique classes (NIMs) from subfolders. Example: ['Data Test', 'Data Train']...

Created sub-folders for 2 classes in /content/processed_dataset/train and /content/processed_dataset/test.
Updated NUM_CLASSES: 2
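
The output above lists `'Data Test'` and `'Data Train'` as the two classes. Those are the split folders, so `NUM_CLASSES` here reflects the number of splits rather than the number of students. The following is a minimal sketch of one way to list classes instead, assuming (as Section 2.4 does) that each split folder holds one subfolder per NIM; `list_class_dirs` is a hypothetical helper, not part of the notebook.

```python
import os


def list_class_dirs(split_dir):
    """Return the sorted names of the immediate subfolders of split_dir."""
    if not os.path.isdir(split_dir):
        return []
    return sorted(
        d for d in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, d))
    )
```

For example, `list_class_dirs(os.path.join(RAW_DATA_PATH, 'Data Train'))` would give the per-student folder names under that assumption. An empty result would suggest that the split folder is flat and that the class has to be derived some other way, such as from the file names.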


### **2.4. Run the Simple Cropping Pipeline**

**Explanation:** This function reads images from the source directories, crops the largest centered square from each image (simple center crop), resizes it, and saves it to the destination directory organized per class. The raw `Data Train` and `Data Test` folders are processed separately and written to the matching processed `train`/`test` directories. **Note:** This step replaces the more complex face detection for the sake of speed, on the assumption that the face is at the center of the image.
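
The center-crop arithmetic can be checked on its own, without loading any image. This is a small sketch with a hypothetical helper name, `center_crop_box`; it mirrors the `min_dim` / `start_x` / `start_y` computation used in the pipeline.

```python
def center_crop_box(height, width):
    """Return (start_y, end_y, start_x, end_x) of the largest centered square."""
    min_dim = min(height, width)
    start_x = (width - min_dim) // 2
    start_y = (height - min_dim) // 2
    return start_y, start_y + min_dim, start_x, start_x + min_dim


# A 480x640 (height x width) landscape image keeps the middle 480 columns
print(center_crop_box(480, 640))  # (0, 480, 80, 560)
```

For a portrait image the roles swap: the full width is kept and rows are trimmed equally from the top and bottom.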

In [19]:
def process_and_save_faces_by_split(raw_train_dir, raw_test_dir, processed_train_dir, processed_test_dir, img_width, img_height):
    """
    Processes images from raw train and test directories by performing a simple
    center crop and resize, and saves them to the corresponding processed
    train and test directories. Assumes raw directories contain subdirectories
    named after classes (NIMs).

    Args:
        raw_train_dir (str): Directory containing raw training images (with class subdirectories).
        raw_test_dir (str): Directory containing raw testing images (with class subdirectories).
        processed_train_dir (str): Directory to save processed training images.
        processed_test_dir (str): Directory to save processed testing images.
        img_width (int): Target width for cropped face images.
        img_height (int): Target height for cropped face images.
    """

    def process_split(raw_dir, processed_dir, label):
        """Center-crops every .jpg under raw_dir/<class>/ into processed_dir/<class>/."""
        total_processed = 0
        skipped_error = 0

        print(f"\nStarting simple center cropping for {label} data from {raw_dir}...")

        for class_dir in os.listdir(raw_dir):
            class_path = os.path.join(raw_dir, class_dir)
            if not os.path.isdir(class_path):
                continue
            nim = class_dir # Class name is the directory name (NIM)

            for img_path in glob.glob(os.path.join(class_path, '*.jpg')):
                try:
                    # Read image
                    img = cv2.imread(img_path)
                    if img is None:
                        print(f"Warning: Could not read image {img_path}. Skipping.")
                        skipped_error += 1
                        continue

                    # Crop the largest centered square
                    h, w = img.shape[:2]
                    min_dim = min(h, w)
                    start_x = (w - min_dim) // 2
                    start_y = (h - min_dim) // 2
                    center_crop = img[start_y:start_y + min_dim, start_x:start_x + min_dim]

                    # Resize the cropped image to the target size (cv2.resize takes (width, height))
                    img_resized = cv2.resize(center_crop, (img_width, img_height))

                    # Save the processed image under its class folder
                    save_dir = os.path.join(processed_dir, nim)
                    os.makedirs(save_dir, exist_ok=True)
                    save_path = os.path.join(save_dir, os.path.basename(img_path))
                    cv2.imwrite(save_path, img_resized)
                    total_processed += 1

                except Exception as e:
                    print(f"An error occurred while processing {img_path}: {e}")
                    skipped_error += 1

        print(f"\nSimple center cropping for {label} data complete.")
        print(f"Total {label} images processed: {total_processed}")
        print(f"{label.capitalize()} images skipped (error): {skipped_error}")

    # Both splits go through the same steps; only the directories and label differ
    process_split(raw_train_dir, processed_train_dir, "training")
    process_split(raw_test_dir, processed_test_dir, "testing")


# Run the pipeline
# Assumes RAW_DATA_PATH now points to the directory containing 'Data Train' and 'Data Test' subfolders
raw_train_folder = os.path.join(RAW_DATA_PATH, 'Data Train')
raw_test_folder = os.path.join(RAW_DATA_PATH, 'Data Test')


# Check if the expected raw subfolders exist
if not os.path.exists(raw_train_folder):
    print(f"Error: Raw training data folder not found at {raw_train_folder}. Please check your extracted data structure.")
elif not os.path.exists(raw_test_folder):
    print(f"Error: Raw testing data folder not found at {raw_test_folder}. Please check your extracted data structure.")
else:
    process_and_save_faces_by_split(
        raw_train_dir=raw_train_folder,
        raw_test_dir=raw_test_folder,
        processed_train_dir=PROCESSED_TRAIN_DIR,
        processed_test_dir=PROCESSED_TEST_DIR,
        img_width=IMG_WIDTH,
        img_height=IMG_HEIGHT
    )


Starting simple center cropping for training data from /content/raw_dataset/Dataset Sistem Presensi Wajah V1.0/Data Train...

Simple center cropping for training data complete.
Total training images processed: 0
Training images skipped (error): 0

Starting simple center cropping for testing data from /content/raw_dataset/Dataset Sistem Presensi Wajah V1.0/Data Test...

Simple center cropping for testing data complete.
Total testing images processed: 0
Testing images skipped (error): 0
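
Both counters above are zero with no skips, which means the `*.jpg` glob matched no files inside the class subfolders. Possible causes include a different extension (such as `.JPG`, `.jpeg`, or `.png`) or images stored at a different folder depth; the output does not say which. The following is a minimal sketch for checking, using a hypothetical helper `count_images_by_dir` and an assumed set of extensions: it walks a folder and counts image files per directory, matching extensions case-insensitively.

```python
import os

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set; adjust to the dataset


def count_images_by_dir(root_dir, extensions=IMAGE_EXTENSIONS):
    """Map each directory under root_dir (as a relative path) to its image-file count."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        # str.endswith accepts a tuple of suffixes
        n = sum(1 for f in filenames if f.lower().endswith(extensions))
        if n:
            counts[os.path.relpath(dirpath, root_dir)] = n
    return counts
```

Running `count_images_by_dir(raw_train_folder)` would show at which depth the images sit. A key of `'.'` means the images are directly inside the split folder rather than in per-class subfolders.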


### **2.5. Verify Processed Dataset**

**Explanation:** Counts and prints the number of images in the processed training and testing directories to confirm that the cropping pipeline succeeded.

In [None]:
# Count images in processed directories
processed_train_count = sum([len(files) for r, d, files in os.walk(PROCESSED_TRAIN_DIR)])
processed_test_count = sum([len(files) for r, d, files in os.walk(PROCESSED_TEST_DIR)])

print(f"Total images in Data Train (after processing): {processed_train_count}")
print(f"Total images in Data Test (after processing): {processed_test_count}")

# Verify class counts in processed directories
processed_train_class_counts = {}
for class_folder in os.listdir(PROCESSED_TRAIN_DIR):
    class_path = os.path.join(PROCESSED_TRAIN_DIR, class_folder)
    if os.path.isdir(class_path):
        processed_train_class_counts[class_folder] = len(os.listdir(class_path))

print(f"\nNumber of classes (students) in Data Train (after processing): {len(processed_train_class_counts)}")
# print("Images per class in Data Train (after processing):")
# for cls, count in processed_train_class_counts.items():
#     print(f"  {cls}: {count}")

processed_test_class_counts = {}
for class_folder in os.listdir(PROCESSED_TEST_DIR):
    class_path = os.path.join(PROCESSED_TEST_DIR, class_folder)
    if os.path.isdir(class_path):
        processed_test_class_counts[class_folder] = len(os.listdir(class_path))

print(f"Number of classes (students) in Data Test (after processing): {len(processed_test_class_counts)}")
# print("Images per class in Data Test (after processing):")
# for cls, count in processed_test_class_counts.items():
#     print(f"  {cls}: {count}")

# Check if the number of classes matches the expected NUM_CLASSES.
# Each split is checked independently so that both mismatches are reported.
if len(processed_train_class_counts) != NUM_CLASSES:
    print(f"\nWarning: The number of classes in processed Data Train ({len(processed_train_class_counts)}) does not match the detected number of classes ({NUM_CLASSES}).")
if len(processed_test_class_counts) != NUM_CLASSES:
    print(f"\nWarning: The number of classes in processed Data Test ({len(processed_test_class_counts)}) does not match the detected number of classes ({NUM_CLASSES}).")
if len(processed_train_class_counts) == NUM_CLASSES and len(processed_test_class_counts) == NUM_CLASSES:
    print("\nThe number of classes in the processed directories matches the detected number of classes.")
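
The class-count check passes even when every class folder is empty, because it only compares the number of folders. The following is a short sketch, using a hypothetical helper `find_empty_classes`, that flags classes with no files given a `{class: count}` dictionary such as `processed_train_class_counts`.

```python
def find_empty_classes(class_counts):
    """Return the sorted class names whose image count is zero."""
    return sorted(name for name, count in class_counts.items() if count == 0)
```

A non-empty result from `find_empty_classes(processed_train_class_counts)` would indicate that the cropping step wrote nothing for those classes.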

### **2.6. Inspect Class Distribution**

**Explanation:** Visualizes the number of images per class in the processed training directory to check whether the train/test split or the cropping step introduced a significant imbalance.

In [None]:
# Sort classes by count for better visualization
if processed_train_class_counts:
    sorted_processed_class_counts = dict(sorted(processed_train_class_counts.items(), key=lambda item: item[1]))

    # Plot class distribution for processed data
    plt.figure(figsize=(15, 7))
    plt.bar(sorted_processed_class_counts.keys(), sorted_processed_class_counts.values())
    plt.xticks(rotation=90)
    plt.xlabel("Student (Class - NIM)") # Use NIM as class name
    plt.ylabel("Number of Images in Data Train (After Processing)")
    plt.title("Images per Student (Data Train After Processing)")
    plt.tight_layout()
    plt.show()

    # Check if all class counts are the same after processing
    first_count_processed = list(processed_train_class_counts.values())[0]
    all_counts_same_processed = all(count == first_count_processed for count in processed_train_class_counts.values())

    if all_counts_same_processed:
        print("\nAll students have the same number of images in Data Train (after processing).")
    else:
        print("\nThe number of images per student in Data Train (after processing) varies.")

else:
    print("\nNo class data found in Data Train (after processing) to inspect.")
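
Beyond checking whether all counts are equal, a single number can summarize how uneven the classes are. This is a minimal sketch with a hypothetical helper, `imbalance_ratio`, which divides the largest class count by the smallest; what counts as a "significant" ratio is left to the reader.

```python
def imbalance_ratio(class_counts):
    """Return max_count / min_count, or None if there are no classes or a class is empty."""
    counts = list(class_counts.values())
    if not counts or min(counts) == 0:
        return None  # ratio is undefined without data or with an empty class
    return max(counts) / min(counts)
```

A ratio of 1.0 means perfectly balanced classes; larger values mean the biggest class has that many times more images than the smallest.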

### **2.7. Visualize Sample Images**

**Explanation:** Displays several random images from the processed training directory together with their labels (NIMs) to inspect the result of the cropping step.

In [None]:
# Get all image files from the processed training directory
all_processed_train_images = glob.glob(os.path.join(PROCESSED_TRAIN_DIR, '*', '*.jpg'))

# Display up to 20 samples
num_samples_to_display = min(20, len(all_processed_train_images))
if num_samples_to_display == 0:
    print("No images found in the processed training directory to display.")
else:
    plt.figure(figsize=(15, 12)) # Create the figure only when there is something to show
    # Select random images
    sample_images_paths_processed = np.random.choice(all_processed_train_images, size=num_samples_to_display, replace=False)

    for i, img_path in enumerate(sample_images_paths_processed):
        plt.subplot(4, 5, i + 1) # Adjust subplot grid to 4 rows and 5 columns
        img = plt.imread(img_path) # imread in matplotlib reads as RGB
        plt.imshow(img)

        # Extract class name (NIM) from directory name
        class_name_processed = os.path.basename(os.path.dirname(img_path))

        plt.title(class_name_processed)
        plt.axis("off")

    plt.tight_layout()
    plt.show()