# NutriGenius: Face Detection - Exploratory Data Analysis

This notebook performs exploratory data analysis on the UTKFace dataset as groundwork for the face detection, age estimation, and gender estimation models in the NutriGenius application.

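## Table of Contents
1. [Introduction](#introduction)
2. [Dataset Setup](#dataset-setup)
3. [Dataset Exploration](#dataset-exploration)
4. [Age Distribution Analysis](#age-distribution-analysis)
5. [Gender Distribution Analysis](#gender-distribution-analysis)
6. [Face Detection](#face-detection)
7. [Preprocessing Functions](#preprocessing-functions)
8. [Conclusion](#conclusion)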

## 1. Introduction

This notebook explores the UTKFace dataset to support development of the face detection component of NutriGenius. The component detects the user's face and estimates age and gender, which then serve as inputs for personalizing nutrition article recommendations.

The exploration focuses on:
- Understanding how the face data is distributed by age and gender
- Checking that the data fits the application's needs
- Preparing the data for modeling

## 2. Dataset Setup

Before exploring, we need to set up the UTKFace dataset. It contains face images labeled with age, gender, and ethnicity.

In [None]:
import os
import sys
import shutil
import urllib.request
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import cv2
from PIL import Image
import glob
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Path configuration
# __file__ is not defined inside a notebook, so resolve the project root
# as two levels above the current working directory
BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
DATA_DIR = os.path.join(BASE_DIR, "data")
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw")
FACE_DATA_DIR = os.path.join(RAW_DATA_DIR, "UTKFace")
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed", "face")

# Create the directories if they do not exist yet
for directory in [DATA_DIR, RAW_DATA_DIR, FACE_DATA_DIR, PROCESSED_DATA_DIR]:
    os.makedirs(directory, exist_ok=True)

print(f"Base directory: {BASE_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Face data will be stored in: {FACE_DATA_DIR}")

### 2.1 Download Dataset

The UTKFace dataset contains more than 20,000 face images annotated with age, gender, and ethnicity. We provide the option to download either the full dataset (~200MB) or a smaller subset for quick prototyping.

In [None]:
def download_with_progress(url, dest_path):
    """Download a file while showing a progress bar."""
    try:
        with tqdm(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
            def report(blocks, block_size, total_size):
                # urlretrieve passes the running block count, the block size,
                # and the total size (-1 if the server does not report it)
                if total_size > 0:
                    t.total = total_size
                t.update(blocks * block_size - t.n)

            urllib.request.urlretrieve(url, dest_path, reporthook=report)
        return True
    except Exception as e:
        logger.error(f"Failed to download {url}: {e}")
        return False

def setup_face_dataset(use_full_dataset=False):
    """
    Set up the UTKFace dataset.
    
    Args:
        use_full_dataset: If True, download the full UTKFace dataset; if False, use the small subset
        
    Returns:
        bool: True on success, False on failure
    """
    # Check whether the dataset is already present
    sample_image_path = os.path.join(FACE_DATA_DIR, "1_0_0_20161219140623285.jpg")
    dataset_exists = os.path.exists(sample_image_path) or len(glob.glob(os.path.join(FACE_DATA_DIR, "*.jpg"))) > 0
    
    if dataset_exists:
        logger.info("UTKFace dataset already exists.")
        return True
    
    if use_full_dataset:
        # Full UTKFace dataset (~200MB)
        UTKFACE_URL = "https://drive.google.com/uc?id=0BxYys69jI14kYVM3aVhKS1VhRUk&export=download"
        UTKFACE_ZIP = os.path.join(DATA_DIR, "UTKFace.zip")
        
        logger.info("Downloading the full UTKFace dataset (~200MB). This may take a while...")
        success = download_with_progress(UTKFACE_URL, UTKFACE_ZIP)
        if not success:
            logger.warning("Failed to download the full UTKFace dataset. Trying the small subset...")
            return setup_face_dataset(use_full_dataset=False)
            
        # Extract the dataset
        try:
            logger.info("Extracting the UTKFace dataset...")
            with zipfile.ZipFile(UTKFACE_ZIP, 'r') as zip_ref:
                zip_ref.extractall(FACE_DATA_DIR)
            
            # Clean up the zip file
            os.remove(UTKFACE_ZIP)
            
            logger.info("UTKFace dataset setup succeeded!")
            return True
        except Exception as e:
            logger.error(f"Failed to extract the UTKFace dataset: {e}")
            return False
    else:
        # Small subset for quick prototyping (100 images)
        SAMPLE_UTKFACE_URL = "https://github.com/nutrigenius-samples/utkface-subset/archive/main.zip"
        SAMPLE_ZIP = os.path.join(DATA_DIR, "utkface_sample.zip")
        
        logger.info("Downloading the small UTKFace subset (100 images)...")
        success = download_with_progress(SAMPLE_UTKFACE_URL, SAMPLE_ZIP)
        
        if not success:
            logger.error("Failed to download the UTKFace sample dataset.")
            
            # Fallback: download a few sample images from another source
            logger.info("Trying to download a few sample images as a fallback...")
            alt_images = [
                ("https://raw.githubusercontent.com/nutrigenius-samples/sample-faces/main/face1.jpg", "5_1_0_face1.jpg"),
                ("https://raw.githubusercontent.com/nutrigenius-samples/sample-faces/main/face2.jpg", "7_0_0_face2.jpg"),
                ("https://raw.githubusercontent.com/nutrigenius-samples/sample-faces/main/face3.jpg", "3_1_0_face3.jpg"),
                ("https://raw.githubusercontent.com/nutrigenius-samples/sample-faces/main/face4.jpg", "10_0_0_face4.jpg"),
                ("https://raw.githubusercontent.com/nutrigenius-samples/sample-faces/main/face5.jpg", "6_1_0_face5.jpg"),
            ]
            
            alt_success = True
            for url, filename in alt_images:
                # Download first so that one failure does not short-circuit the remaining downloads
                downloaded = download_with_progress(url, os.path.join(FACE_DATA_DIR, filename))
                alt_success = alt_success and downloaded
            
            return alt_success
        
        # Extract the dataset
        try:
            logger.info("Extracting the UTKFace sample dataset...")
            with zipfile.ZipFile(SAMPLE_ZIP, 'r') as zip_ref:
                zip_ref.extractall(DATA_DIR)
            
            # Move the files to the correct location
            sample_dir = os.path.join(DATA_DIR, "utkface-subset-main")
            for file in os.listdir(sample_dir):
                if file.endswith(".jpg"):
                    shutil.move(os.path.join(sample_dir, file), FACE_DATA_DIR)
            
            # Clean up
            os.remove(SAMPLE_ZIP)
            if os.path.exists(sample_dir):
                shutil.rmtree(sample_dir)
            
            logger.info("UTKFace sample dataset setup succeeded!")
            return True
        except Exception as e:
            logger.error(f"Failed to extract the UTKFace sample dataset: {e}")
            return False

# Download the dataset (default: small subset)
USE_FULL_DATASET = False  # Set to True for the full dataset
success = setup_face_dataset(USE_FULL_DATASET)

if success:
    print("✅ Dataset is ready to use!")
    # Count the files as confirmation
    num_files = len(glob.glob(os.path.join(FACE_DATA_DIR, "*.jpg")))
    print(f"   Number of images in the dataset: {num_files}")
else:
    print("❌ Failed to set up the dataset. Check the log for details.")
    print("   Will try to continue with whatever sample images are available.")
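
A download can complete without error and still not deliver an archive, for instance when the server responds with an HTML page instead of the file. The following is a minimal sketch of a guard that could run before extraction, using only the standard library. `is_usable_zip` is a hypothetical helper name and is not part of the notebook.

```python
import os
import tempfile
import zipfile


def is_usable_zip(path):
    """Return True if `path` is a readable zip archive with no corrupt members."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None
        return archive.testzip() is None


# Demonstration on two temporary files: a real archive and an HTML page
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as archive:
        archive.writestr("1_0_0_demo.jpg", b"placeholder bytes, not a real image")

    bad = os.path.join(tmp, "bad.zip")
    with open(bad, "w") as handle:
        handle.write("<html><body>Download page</body></html>")

    good_ok = is_usable_zip(good)
    bad_ok = is_usable_zip(bad)

print(good_ok, bad_ok)  # True False
```

Checking the file before calling `extractall` lets the setup code log a clear message and fall back to another source, rather than surfacing a `BadZipFile` error from inside the extraction step.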

### 2.2 Expected File/Folder Structure

The UTKFace dataset is stored with the following file structure:

```
NutriGenius/
├── data/
│   ├── raw/
│   │   └── UTKFace/              # The UTKFace dataset is stored here
│   │       ├── 1_0_0_20170109142408075.jpg
│   │       ├── 1_0_0_20161219140623285.jpg
│   │       ├── ...
│   └── processed/
│       └── face/                 # The processed dataset
│           ├── train/
│           ├── val/
│           └── test/
```

UTKFace file naming format: `[age]_[gender]_[race]_[date&time].jpg`
- [age]: age in years (integer)
- [gender]: 0 for male, 1 for female
- [race]: 0 (White), 1 (Black), 2 (Asian), 3 (Indian), 4 (Others)
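
To make the naming convention concrete, here is a small sketch that decodes one file name with a regular expression. `parse_utkface_name` and `RACE_LABELS` are hypothetical names introduced for illustration. The pattern assumes a purely numeric timestamp field, which is stricter than the parsing used later in this notebook, and the second file name is made up to show a missing field.

```python
import re

# Race codes as listed in the naming convention above
RACE_LABELS = {0: "White", 1: "Black", 2: "Asian", 3: "Indian", 4: "Others"}

# age_gender_race_timestamp, followed by the file extension(s)
UTKFACE_PATTERN = re.compile(r"^(\d+)_([01])_([0-4])_(\d+)\.")


def parse_utkface_name(filename):
    """Return a dict of decoded labels, or None if the name does not match."""
    match = UTKFACE_PATTERN.match(filename)
    if match is None:
        return None
    age, gender, race, timestamp = match.groups()
    return {
        "age": int(age),
        "gender": "male" if gender == "0" else "female",
        "race": RACE_LABELS[int(race)],
        "timestamp": timestamp,
    }


print(parse_utkface_name("1_0_0_20161219140623285.jpg"))
# {'age': 1, 'gender': 'male', 'race': 'White', 'timestamp': '20161219140623285'}

print(parse_utkface_name("39_1_20170116174525125.jpg"))  # race field missing
# None
```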

### 2.3 Dataset Verification

Let's check whether the dataset is available and valid:

In [None]:
def verify_face_dataset():
    """Verify that the face dataset exists and looks valid."""
    # Check that the directory exists
    if not os.path.exists(FACE_DATA_DIR):
        logger.error(f"UTKFace dataset directory not found: {FACE_DATA_DIR}")
        return False
    
    # Check that there are image files
    image_files = glob.glob(os.path.join(FACE_DATA_DIR, "*.jpg"))
    if len(image_files) == 0:
        logger.error("No image files in the UTKFace directory")
        return False
    
    # Check that files can be opened and that their names follow the expected format
    valid_files = 0
    invalid_files = 0
    for img_path in image_files[:5]:  # Spot-check only the first 5 files
        try:
            # Try to open the image; verify() checks the file for corruption
            with Image.open(img_path) as img:
                img.verify()
            
            # Check the file name format
            filename = os.path.basename(img_path)
            parts = filename.split('_')
            if len(parts) >= 3:
                age = int(parts[0])
                gender = int(parts[1])
                if 0 <= age <= 116 and gender in [0, 1]:
                    valid_files += 1
                else:
                    invalid_files += 1
            else:
                invalid_files += 1
        except Exception:
            # Unreadable image or non-numeric label
            invalid_files += 1
    
    if valid_files > 0:
        logger.info(f"Dataset is valid with {len(image_files)} image files")
        return True
    else:
        logger.error("No valid files in the dataset")
        return False

# Verify the dataset
is_valid = verify_face_dataset()
if not is_valid:
    print("⚠️ Dataset is invalid or incomplete.")
    print("   Please download the dataset manually or run setup_face_dataset() again.")
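
The check above only spot-checks five files. The sketch below shows how every file name could be classified so that malformed names are counted by reason. `classify_name` is a hypothetical helper, the file names are made up for illustration, and the sketch requires all four fields, which is stricter than the three-part check in `verify_face_dataset`.

```python
from collections import Counter


def classify_name(filename):
    """Return 'ok' or a short reason why the name fails the UTKFace convention."""
    parts = filename.split("_")
    if len(parts) < 4:
        return "missing field"
    try:
        age, gender, race = (int(p) for p in parts[:3])
    except ValueError:
        return "non-numeric label"
    if not 0 <= age <= 116:
        return "age out of range"
    if gender not in (0, 1):
        return "unknown gender code"
    if race not in range(5):
        return "unknown race code"
    return "ok"


# Made-up file names for illustration
names = [
    "1_0_0_20161219140623285.jpg",
    "26_1_2_20170116184024662.jpg",
    "39_1_20170116174525125.jpg",     # race field missing
    "61_3_1_20170109150557335.jpg",   # gender code 3 is not defined
    "abc_0_0_20170109150557335.jpg",  # age is not a number
]
report = Counter(classify_name(name) for name in names)
print(report)
```

Running this over `os.listdir(FACE_DATA_DIR)` would show how many files are dropped later during label extraction, and why.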

## 3. Dataset Exploration

Let's explore the UTKFace dataset. The file names contain labels in the format: `[age]_[gender]_[race]_[date&time].jpg`

- age: 0-116
- gender: 0 (male), 1 (female)
- race: 0 (White), 1 (Black), 2 (Asian), 3 (Indian), 4 (Others)

In [None]:
# Get all image paths (the images live in FACE_DATA_DIR, not directly in RAW_DATA_DIR)
image_paths = glob.glob(os.path.join(FACE_DATA_DIR, "*.jpg"))
print(f"Total images found: {len(image_paths)}")

# Display a few sample paths
for path in image_paths[:5]:
    print(os.path.basename(path))

In [None]:
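# Function to extract labels from filename
def extract_labels(filename):
    try:
        parts = os.path.basename(filename).split('_')
        age = int(parts[0])
        gender_code = int(parts[1])
        race = int(parts[2])
    except (IndexError, ValueError):
        return None
    # Reject codes outside the documented ranges instead of silently mapping them
    if gender_code not in (0, 1) or race not in range(5):
        return None
    gender = "male" if gender_code == 0 else "female"
    return {"age": age, "gender": gender, "race": race, "path": filename}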

# Extract labels for all images
dataset = []
for path in tqdm(image_paths):
    labels = extract_labels(path)
    if labels:
        dataset.append(labels)

# Convert to dataframe
df = pd.DataFrame(dataset)
print(f"Processed {len(df)} images with valid labels")
df.head()

Let's check for missing values and other data issues.

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Check for extreme values
print("\nAge range:")
print(f"Min age: {df['age'].min()}, Max age: {df['age'].max()}")

# Check gender distribution
print("\nGender distribution:")
print(df['gender'].value_counts())

# Check race distribution
print("\nRace distribution:")
print(df['race'].value_counts())

## 4. Age Distribution Analysis 

Let's analyze the age distribution in the dataset.

In [None]:
# Plot age distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['age'], bins=30, kde=True)
plt.title('Age Distribution in UTKFace Dataset')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)
plt.show()

# Age statistics
print("Age statistics:")
print(df['age'].describe())

In [None]:
# Create age groups for better analysis
def age_group(age):
    if age < 13:
        return 'Child (0-12)'
    elif age < 20:
        return 'Teen (13-19)'
    elif age < 30:
        return 'Young Adult (20-29)'
    elif age < 40:
        return 'Adult (30-39)'
    elif age < 50:
        return 'Middle Age (40-49)'
    elif age < 60:
        return 'Older Adult (50-59)'
    else:
        return 'Senior (60+)'

df['age_group'] = df['age'].apply(age_group)

# Plot age group distribution
plt.figure(figsize=(12, 6))
order = ['Child (0-12)', 'Teen (13-19)', 'Young Adult (20-29)', 'Adult (30-39)', 
         'Middle Age (40-49)', 'Older Adult (50-59)', 'Senior (60+)']
sns.countplot(x='age_group', data=df, order=order)
plt.title('Age Group Distribution')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
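
The same grouping can also be expressed with `pandas.cut`, which returns an ordered categorical so that later counts and plots keep the age order without a separate `order` list. This is a sketch on toy ages, not a replacement for the function above. The bin edges mirror the thresholds in `age_group`.

```python
import numpy as np
import pandas as pd

# Left-closed bins: [0, 13), [13, 20), ..., [60, inf)
bins = [0, 13, 20, 30, 40, 50, 60, np.inf]
labels = ['Child (0-12)', 'Teen (13-19)', 'Young Adult (20-29)', 'Adult (30-39)',
          'Middle Age (40-49)', 'Older Adult (50-59)', 'Senior (60+)']

ages = pd.Series([0, 12, 13, 19, 29, 45, 59, 60, 116])  # toy values
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(groups.tolist())
# With sort=False the counts follow the category order, i.e. age order
print(groups.value_counts(sort=False))
```

Setting `right=False` makes each bin include its lower edge and exclude its upper edge, which matches the `age < 13`, `age < 20`, ... comparisons in `age_group`.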

## 5. Gender Distribution Analysis 

Now let's analyze the gender distribution and its relationship with age.

In [None]:
# Plot gender distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='gender', data=df)
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.grid(True)

# Add count and percentage labels
total = len(df)
for p in plt.gca().patches:
    height = int(p.get_height())  # bar heights are floats; show counts as integers
    plt.gca().text(p.get_x() + p.get_width()/2.,
                   height + 5,
                   f'{height} ({100 * height / total:.1f}%)',
                   ha="center") 

plt.show()

In [None]:
# Plot age distribution by gender
plt.figure(figsize=(12, 6))
sns.kdeplot(data=df, x='age', hue='gender', common_norm=False, fill=True, alpha=0.5)
plt.title('Age Distribution by Gender')
plt.xlabel('Age')
plt.ylabel('Density')
plt.grid(True)
plt.show()

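# Gender distribution across age groups
# reindex(order) keeps the bars in age order rather than alphabetical order
gender_age_counts = df.groupby(['age_group', 'gender']).size().unstack(fill_value=0).reindex(order, fill_value=0)
# Let pandas create the figure; a separate plt.figure() call would leave an empty figure behind
gender_age_counts.plot(kind='bar', stacked=True, figsize=(14, 6))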
plt.title('Gender Distribution Across Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
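
Stacked counts make it hard to compare the gender balance between groups of very different sizes. One option is to look at within-group proportions. The sketch below uses a toy frame whose column names match `df`; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age_group": ["Child (0-12)", "Child (0-12)", "Teen (13-19)", "Teen (13-19)",
                  "Teen (13-19)", "Senior (60+)"],
    "gender": ["male", "female", "female", "female", "male", "male"],
})

# normalize='index' turns each row into proportions that sum to 1
share = pd.crosstab(toy["age_group"], toy["gender"], normalize="index")
print(share.round(2))
```

Applied to the real data as `pd.crosstab(df["age_group"], df["gender"], normalize="index")`, this would show which age groups deviate most from an even split.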

## 6. Face Detection

Let's test face detection on some sample images using OpenCV's Haar cascade frontal-face classifier.

In [None]:
# Load the face detection classifier
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

# Function to detect faces in an image
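def detect_faces(image_path):
    # Read the image; cv2.imread returns None instead of raising on failure
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)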
    
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    # Detect faces
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    
    # Draw rectangles around the faces
    for (x, y, w, h) in faces:
        cv2.rectangle(img_rgb, (x, y), (x+w, y+h), (255, 0, 0), 2)
    
    return img_rgb, faces

# Test on a few sample images (sampling without replacement avoids duplicates)
sample_images = df.sample(n=min(5, len(df)), random_state=42)['path'].tolist()

plt.figure(figsize=(15, 10))
for i, img_path in enumerate(sample_images):
    # Get original labels
    labels = df[df['path'] == img_path].iloc[0]
    age, gender = labels['age'], labels['gender']
    
    # Detect faces
    result_img, faces = detect_faces(img_path)
    
    # Display results
    plt.subplot(2, 3, i+1)
    plt.imshow(result_img)
    plt.title(f"Age: {age}, Gender: {gender}\nFaces detected: {len(faces)}")
    plt.axis('off')
    
plt.tight_layout()
plt.show()
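
Looking at five images gives only a rough impression. The sketch below shows one way the number of detections per image could be summarized over a larger sample, assuming each image contains exactly one face. `summarize_detections` is a hypothetical helper and the counts are invented. In practice they could come from something like `[len(detect_faces(p)[1]) for p in paths]`.

```python
from collections import Counter


def summarize_detections(face_counts):
    """Return the share of images with zero, exactly one, or several detections."""
    total = len(face_counts)
    if total == 0:
        raise ValueError("face_counts must not be empty")
    buckets = Counter(
        "none" if n == 0 else "one" if n == 1 else "multiple" for n in face_counts
    )
    return {key: buckets[key] / total for key in ("none", "one", "multiple")}


# Hypothetical detection counts for ten images
counts = [1, 1, 0, 1, 2, 1, 1, 0, 1, 1]
summary = summarize_detections(counts)
print(summary)  # {'none': 0.2, 'one': 0.7, 'multiple': 0.1}
```

Under the single-face assumption, the "one" share approximates the detection rate, "none" counts misses, and "multiple" points to false positives.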

## 7. Preprocessing Functions 

Let's implement the preprocessing functions that prepare the data for model training.

In [None]:
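# Function to preprocess images
def preprocess_image(image_path, target_size=(200, 200)):
    # Read image; cv2.imread returns None if the file is missing or unreadable
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    
    # Convert to RGB (our model will expect RGB)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    # Resize to target size; cv2.resize takes (width, height)
    img = cv2.resize(img, target_size)
    
    # Normalize pixel values to [0, 1], stored as float32 to halve memory versus float64
    img = img.astype(np.float32) / 255.0
    
    return img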

# Function to create a dataset for age prediction
def create_age_dataset(df, target_size=(200, 200), test_size=0.2, max_samples=None):
    # Shuffle the dataframe
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    # Limit the number of samples if specified
    if max_samples and max_samples < len(df):
        df = df.iloc[:max_samples]
    
    # Split into train and test sets
    train_df = df.iloc[:int(len(df) * (1 - test_size))]
    test_df = df.iloc[int(len(df) * (1 - test_size)):]
    
    print(f"Training samples: {len(train_df)}")
    print(f"Testing samples: {len(test_df)}")
    
    # Create arrays for images and labels
    X_train = np.array([preprocess_image(path, target_size) for path in tqdm(train_df['path'])])
    y_train = np.array(train_df['age'])
    
    X_test = np.array([preprocess_image(path, target_size) for path in tqdm(test_df['path'])])
    y_test = np.array(test_df['age'])
    
    return (X_train, y_train), (X_test, y_test)

# Function to create a dataset for gender prediction
def create_gender_dataset(df, target_size=(200, 200), test_size=0.2, max_samples=None):
    # Shuffle the dataframe
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    # Limit the number of samples if specified
    if max_samples and max_samples < len(df):
        # copy() so the column assignment below does not write to a slice
        df = df.iloc[:max_samples].copy()
    
    # Convert gender to binary
    df['gender_binary'] = df['gender'].apply(lambda x: 1 if x == 'female' else 0)
    
    # Split into train and test sets
    train_df = df.iloc[:int(len(df) * (1 - test_size))]
    test_df = df.iloc[int(len(df) * (1 - test_size)):]
    
    print(f"Training samples: {len(train_df)}")
    print(f"Testing samples: {len(test_df)}")
    
    # Create arrays for images and labels
    X_train = np.array([preprocess_image(path, target_size) for path in tqdm(train_df['path'])])
    y_train = np.array(train_df['gender_binary'])
    
    X_test = np.array([preprocess_image(path, target_size) for path in tqdm(test_df['path'])])
    y_test = np.array(test_df['gender_binary'])
    
    return (X_train, y_train), (X_test, y_test)

In [None]:
# Test preprocessing on a small subset
small_df = df.sample(10, random_state=42)

# Create a small age dataset
print("Creating small age dataset:")
(X_age_train, y_age_train), (X_age_test, y_age_test) = create_age_dataset(small_df, max_samples=10)

# Create a small gender dataset
print("\nCreating small gender dataset:")
(X_gender_train, y_gender_train), (X_gender_test, y_gender_test) = create_gender_dataset(small_df, max_samples=10)

# Display shapes
print("\nAge dataset shapes:")
print(f"X_train: {X_age_train.shape}, y_train: {y_age_train.shape}")
print(f"X_test: {X_age_test.shape}, y_test: {y_age_test.shape}")

print("\nGender dataset shapes:")
print(f"X_train: {X_gender_train.shape}, y_train: {y_gender_train.shape}")
print(f"X_test: {X_gender_test.shape}, y_test: {y_gender_test.shape}")
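
Both dataset functions load every image into a single in-memory array, so it helps to estimate the array size before running them on the full dataset. Dividing a `uint8` image by `255.0` in NumPy yields `float64` values unless the result is cast, so the value width matters. The sketch below does the arithmetic for 200x200 RGB images. `array_gib` is a hypothetical helper, and 20,000 is a round figure based on the dataset size quoted earlier.

```python
def array_gib(num_images, height=200, width=200, channels=3, bytes_per_value=8):
    """Size in GiB of a dense image array with the given shape and value width."""
    return num_images * height * width * channels * bytes_per_value / 2**30


for name, value_bytes in [("float64", 8), ("float32", 4), ("uint8", 1)]:
    print(f"{name}: {array_gib(20_000, bytes_per_value=value_bytes):.1f} GiB")
# float64: 17.9 GiB
# float32: 8.9 GiB
# uint8: 2.2 GiB
```

If these figures exceed the available memory, the `max_samples` argument or a batch-wise loader are possible ways to keep the working set small.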

## 8. Conclusion 

In this notebook, we've explored the UTKFace dataset and prepared preprocessing functions for the face detection and age/gender classification models.

Here's a summary of our findings:

1. The dataset contains facial images with age, gender, and race labels
2. Age distribution is skewed towards younger individuals
3. There's a slight gender imbalance that should be addressed during training
4. We've tried OpenCV's Haar cascade face detector on a handful of random samples; detection quality over the full dataset has not been measured yet
5. Preprocessing functions for age and gender prediction are ready for model training
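
One common way to address a class imbalance such as the gender imbalance noted above is to weight each class inversely to its frequency during training. The sketch below computes such weights with the `n_samples / (n_classes * class_count)` heuristic, the same formula scikit-learn uses for its "balanced" class weights. `inverse_frequency_weights` is a hypothetical helper and the label counts are toy values, not the real dataset counts.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 0 = male, 1 = female
toy_labels = [0] * 60 + [1] * 40
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

The resulting dictionary has the shape expected by the `class_weight` argument of Keras `Model.fit`, which is one place it could be used in the training notebook.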

### Next Steps

In the next notebook, we'll build and train the age and gender models using the preprocessing functions defined here. We'll use deep learning techniques with TensorFlow to create models that can accurately predict a user's age and gender from facial images.