# BDC - Satria Data 2021

Task: Gender Detection

## Authors

1. Muhammad Amanda
2. Naufal Zhafran A.
3. Wahyu Setianto

## Running On

Kaggle [using GPU]

## First Things First

Install the required library, import the libraries used throughout this notebook, and set the configuration variables.

1. Install the `MTCNN` library

`MTCNN` is the face-detection library used to preprocess the image data in this notebook.

In [1]:
!pip -q install mtcnn

2. Import the libraries

Import the libraries used in this notebook.

In [2]:
# General
import os, random
from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
from PIL import Image

# Tensorflow
import tensorflow as tf
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Metrics & data splitting (explicit names instead of star imports)
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
import cv2
from mtcnn import MTCNN

print("Tensorflow :", tf.__version__)

3. Set up `CONFIG`

Set the variables used as configuration in this notebook.

In [3]:
SEED = 2021
SIZE = (256, 256)
BATCH_SIZE = 32
FACE_THRESHOLD = 0.85
FACE_DETECTOR = MTCNN()
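
`SEED` is defined above, and the later cells pass it to `random.seed` and `train_test_split`. A minimal sketch, assuming one wants a single helper that seeds the Python and NumPy global generators in one place; `seed_everything` is a hypothetical name, not part of the notebook. TensorFlow has its own `tf.random.set_seed`, which is left out here to keep the snippet light.

```python
import random

import numpy as np

def seed_everything(seed: int) -> None:
    """Seed Python's and NumPy's global random generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)

# Seeding twice with the same value reproduces the same draws
seed_everything(2021)
first = (random.random(), np.random.rand())
seed_everything(2021)
second = (random.random(), np.random.rand())
print(first == second)
```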

## Dataset

Load the dataset tables that hold the `path` information for the image data.

In [4]:
train = pd.read_csv("../input/bdc-2021/train.csv")
test = pd.read_csv("../input/bdc-2021/submission.csv")
train.head()

Resolve the full `path` to each image file.

In [5]:
images = []
labels = []
test_images = []

TRAIN_DIR = "../input/bdc-2021/Training"
TEST_DIR = "../input/bdc-2021/Testing"

for no, label in train[["nomor", "jenis kelamin"]].values:
    TEMP_DIR = os.path.join(TRAIN_DIR, str(no))
    for file in os.listdir(TEMP_DIR):
        file_dir = os.path.join(TEMP_DIR, file)
        if ".ini" not in file_dir:
            images.append(file_dir)
            labels.append(label)

for no in test.id.values:
    file_dir = os.path.join(TEST_DIR, f"{no}.jpg")
    if os.path.isfile(file_dir):
        test_images.append(file_dir)
    else:
        test_images.append(None)
        print(file_dir)
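
The training loop above skips any file whose path contains `.ini`. A hedged alternative, assuming the folders should contribute only image files, is to keep an explicit whitelist of extensions. `list_image_files` and `IMAGE_EXTENSIONS` are hypothetical names introduced for illustration, and the set of extensions is an assumption about the dataset.

```python
import os
import tempfile

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed image formats

def list_image_files(directory):
    """Return sorted paths of files in `directory` whose extension is whitelisted."""
    paths = []
    for name in sorted(os.listdir(directory)):
        ext = os.path.splitext(name)[1].lower()
        if ext in IMAGE_EXTENSIONS:
            paths.append(os.path.join(directory, name))
    return paths

# Small self-check on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.jpg", "b.PNG", "desktop.ini"]:
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_image_files(tmp)]
    print(found)
```

A whitelist also rejects other stray files (for example thumbnails databases) that a `.ini` substring check would let through.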

Display and inspect a few images from the `train` data.

In [6]:
def read(path):
    """
    Read an image file.
    """
    img = Image.open(path)
    return img

def show_images(list_dir, label = None, load_image = read, seed = SEED):
    """
    Display 5 randomly sampled images (5 per label when labels are given).
    """
    random.seed(seed)
    unique = ["init"]
    if label is not None:
        unique = list(set(label))
    fig, axes = plt.subplots(len(unique), 5, figsize = (20, 5 * len(unique)))
    for i in range(len(unique)):
        if i == 0 and unique[i] == "init":
            data = random.sample(list_dir, 5)
        else:
            data = random.sample([x for x in zip(list_dir, label) if x[1] == unique[i]], 5)
        for j in range(5):
            if unique[0] != "init":
                img = load_image(data[j][0])
                axes[i, j].imshow(img)
                axes[i, j].set_title(f'Label : {data[j][1]}', fontsize = 14)
                axes[i, j].axis('off')
            else:
                img = load_image(data[j])
                axes[j].imshow(img)
                axes[j].axis('off')
    fig.tight_layout()
    plt.show()

In [7]:
show_images(images, labels, seed=20)

## Preprocess Data

Methods used:

1. Extract every face found in an image into its own new image, keeping the original label, using the `MTCNN` model.
2. For the test data, if more than one face is detected in an image, keep only the face with the highest confidence reported by the `MTCNN` model.
3. If no face is detected in an image, crop the center of the image so that it becomes a square of `jxj` pixels, where `j` is the shorter side of the image.
4. Finally, resize the image to `256x256` pixels.
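
The selection and cropping rules in steps 2 and 3 can be sketched with plain NumPy. This is a simplified illustration, not the notebook's implementation: the detections are hand-written dictionaries that mimic the `box` and `confidence` keys returned by `MTCNN`, the helper names are hypothetical, and the final resize of step 4 is omitted.

```python
import numpy as np

FACE_THRESHOLD = 0.85  # same value as in the config cell

def best_face_box(detections, threshold=FACE_THRESHOLD):
    """Return the box of the most confident detection above the threshold, or None."""
    kept = [d for d in detections if d["confidence"] > threshold]
    if not kept:
        return None
    return max(kept, key=lambda d: d["confidence"])["box"]

def center_crop_square(image):
    """Crop the central dim x dim region, where dim is the shorter side."""
    h, w = image.shape[:2]
    dim = min(h, w)
    top, left = (h - dim) // 2, (w - dim) // 2
    return image[top:top + dim, left:left + dim]

def crop_face_or_center(image, detections):
    """Crop the best face if there is one, otherwise fall back to a center crop."""
    box = best_face_box(detections)
    if box is None:
        return center_crop_square(image)
    x, y, w, h = box
    x, y = max(x, 0), max(y, 0)  # guard against negative box origins
    return image[y:y + h, x:x + w]

image = np.zeros((100, 60, 3), dtype=np.uint8)
fake_detections = [
    {"box": [5, 10, 20, 30], "confidence": 0.90},
    {"box": [30, 40, 25, 25], "confidence": 0.99},
]
face = crop_face_or_center(image, fake_detections)  # uses the 0.99 box
square = crop_face_or_center(image, [])             # no face: center crop
print(face.shape, square.shape)
```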

Below is an example of the preprocessing result.

In [8]:
def get_faces(path):
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    faces = FACE_DETECTOR.detect_faces(image)
    return faces

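def load_and_preprocess_image(path: str, size = SIZE):
    """
    Load and preprocess an image: crop the first detected face, or
    center-crop to a square if no face is found, then resize and scale to [0, 1].
    """
    image = img_to_array(load_img(path))
    faces = [x['box'] for x in get_faces(path) if x['confidence'] > FACE_THRESHOLD]
    if len(faces) > 0:
        x, y, w, h = faces[0]
        # MTCNN can report slightly negative box origins; clamp so the slice is not empty
        x, y = max(x, 0), max(y, 0)
        image = image[y:y+h, x:x+w]
    img = tf.convert_to_tensor(image, dtype=tf.float32)
    if len(faces) == 0:
        # No face: center-crop to a square whose side is the shorter image side
        shapes = tf.shape(img)
        h, w = shapes[-3], shapes[-2]
        dim = tf.minimum(h, w)
        img = tf.image.resize_with_crop_or_pad(img, dim, dim)
    img = tf.image.resize(img, size)
    img = tf.cast(img, tf.float32) / 255.0
    return img.numpy()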

In [9]:
show_images(images, labels, load_image = load_and_preprocess_image, seed=20)

Run the preprocessing on the entire image dataset.

In [10]:
import shutil

def image_preprocessing(new_dir, images, labels=None):
    # Start from an empty output directory
    if os.path.isdir(new_dir):
        shutil.rmtree(new_dir)
    os.mkdir(new_dir)
    
    new_images, new_labels = [], []
    if not labels:
        labels = [None for _ in range(len(images))]
    
    for path, label in tqdm(zip(images, labels), total=len(images)):
        image = img_to_array(load_img(path))
        faces = [x['box'] for x in sorted(get_faces(path), key=lambda x: x['confidence'], 
                                          reverse=True) if x['confidence'] > FACE_THRESHOLD]
        if len(faces) > 0:
            if label is not None:  # training image; a bare `if label` would skip label 0
                for j, (x, y, w, h) in enumerate(faces):
                    img = image[y:y+h, x:x+w]
                    img = tf.convert_to_tensor(img, dtype=tf.float32)
                    img = tf.image.resize(img, SIZE)

                    img_dir = os.path.join(new_dir, f'{j}_{path.split("/")[-1]}')
                    new_images.append(img_dir)
                    new_labels.append(label)
                    tf.keras.preprocessing.image.save_img(img_dir, img)
            else:
                x, y, w, h = faces[0]
                img = image[y:y+h, x:x+w]
                img = tf.convert_to_tensor(img, dtype=tf.float32)
                img = tf.image.resize(img, SIZE)
                
                img_dir = os.path.join(new_dir, path.split('/')[-1])
                new_images.append(img_dir)
                new_labels.append(label)
                tf.keras.preprocessing.image.save_img(img_dir, img)
        else :
            img = tf.convert_to_tensor(image, dtype=tf.float32)
            shapes = tf.shape(img)
            h, w = shapes[-3], shapes[-2]
            dim = tf.minimum(h, w)
            img = tf.image.resize_with_crop_or_pad(img, dim, dim)
            img = tf.image.resize(img, SIZE)

            img_dir = os.path.join(new_dir, path.split('/')[-1])
            new_images.append(img_dir)
            new_labels.append(label)
            tf.keras.preprocessing.image.save_img(img_dir, img)
    
    return new_images, new_labels

To save running time, this step is skipped and replaced by loading the preprocessed data saved in a previous run. To run the preprocessing in the current run instead, uncomment the code below.

**Warning**: running the code block below takes about 50 minutes with an Nvidia Tesla P100-PCIE GPU.

In [11]:
# new_train_dir = "./train"
# new_test_dir = "./test"

# new_images, new_labels = image_preprocessing(new_train_dir, images, labels)
# new_test_images, _ = image_preprocessing(new_test_dir, test_images)

**Note**: Comment out the two code blocks below if you run the preprocessing in the current run.

In [12]:
preprocessed = pd.read_csv("../input/bdc-2021/preprocessed/preprocessed.csv")
preprocessed.head()

In [13]:
preprocessed_dir = "../input/bdc-2021/preprocessed"
new_images = [os.path.join(preprocessed_dir, x) for x in preprocessed.image.values]
new_labels = preprocessed.label.values
new_test_images = [os.path.join(preprocessed_dir, "test", f"{x}.jpg") for x in test.id.values]

In [14]:
# TO BE DELETED
rancu = [
    "0db77c32-10cc-4595-84e6-90c37678f518",
    "4870d63f-c2bb-4798-8e96-3bc9e6d0d8f4",
    "3d6171d7-d21c-45fd-9f68-7945a79f4737",
    "5e2e3f87-ce6f-4fbe-a6c9-c058796858dc",
    "0bf9b43e-fd13-4ef9-b423-5a3d43fec20f",
]
rancu = [os.path.join(TEST_DIR, f"{x}.jpg") for x in rancu]
show_images(rancu, load_image = load_and_preprocess_image)

Check the label distribution in the data.

In [15]:
plt.figure(figsize=(5, 5))
sns.countplot(x=new_labels)
plt.show()

The numbers of samples labeled `0` and `1` are roughly equal.

**Splitting Data**

Split the training data into `train` and `valid` sets with an `85:15` ratio.

In [16]:
x_train, x_valid, y_train, y_valid = train_test_split(new_images, new_labels, test_size=0.15, 
                                                      stratify=new_labels, random_state=SEED)
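
To make the effect of `stratify` concrete, here is a small NumPy illustration of a stratified split: each class is shuffled and split separately, so both parts keep the original class ratio. It is a simplified sketch, not scikit-learn's implementation; `stratified_split_indices` is a hypothetical name and the toy labels are made up.

```python
import numpy as np

def stratified_split_indices(labels, test_size=0.15, seed=2021):
    """Split indices per class so each split keeps the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, valid_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_valid = int(round(len(idx) * test_size))
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return np.array(train_idx), np.array(valid_idx)

toy_labels = np.array([0] * 60 + [1] * 40)  # 60% class 0, 40% class 1
train_idx, valid_idx = stratified_split_indices(toy_labels)
print(len(train_idx), len(valid_idx))
print(toy_labels[train_idx].mean(), toy_labels[valid_idx].mean())
```

Both printed means equal the overall share of class `1`, which is what stratification is meant to preserve.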

**Tensorflow Data**

Load the image data using `Tensorflow Data` (the `tf.data` API) so that memory usage during model training is more efficient.

In [17]:
def decode_image(filename, label=None, image_size=SIZE):
    """
    Decode Image from String Path Tensor
    """
    bits = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(bits, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize(image, image_size)

    if label is None: # if test
        return image
    else:
        return image, label

In [18]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .map(decode_image)
    .cache()
    .repeat()
    .shuffle(1024)
    .batch(BATCH_SIZE)
)

# TF Valid Dataset
valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .map(decode_image)
    .batch(BATCH_SIZE)
    .cache()
)

# TF Test Dataset
test_dataset = (
    tf.data.Dataset
    .from_tensor_slices((new_test_images))
    .map(decode_image)
    .batch(BATCH_SIZE)
)

## Modelling

Build a model to classify gender.

Model currently used: `ResNet50`

In [19]:
tf.keras.backend.clear_session()

model = tf.keras.Sequential([
    tf.keras.applications.resnet50.ResNet50(
        include_top=False, weights=None, input_shape = (256,256,3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1024, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(5e-4)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.summary()

Train the model.

In [20]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('Resnet50_best_model.h5', monitor='val_loss', 
                                                save_best_only=True, save_weights_only=True, 
                                                mode='min')
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])
history = model.fit(train_dataset, epochs=50, validation_data=valid_dataset, 
          steps_per_epoch=len(x_train) // BATCH_SIZE,
          callbacks = [checkpoint])
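
Because `train_dataset` ends with `.repeat()`, it never runs out of batches, so `steps_per_epoch` is what tells Keras where one epoch ends. The arithmetic behind `len(x_train) // BATCH_SIZE` can be sketched as follows; the sample count is hypothetical, not the real size of the training set.

```python
BATCH_SIZE = 32  # same value as in the config cell

def steps_per_epoch(num_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of full batches in one pass over the data."""
    return num_samples // batch_size

num_samples = 1000  # hypothetical size, for illustration only
steps = steps_per_epoch(num_samples)
leftover = num_samples - steps * BATCH_SIZE  # samples beyond the last full batch
print(steps, leftover)
```

With floor division, the few samples beyond the last full batch are not counted in the current epoch; since the dataset repeats, they should simply appear in later batches.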

Model training history.

In [21]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (12, 10))
ax1.plot(range(1, len(history.history['loss']) + 1), history.history['loss'], label = 'loss')
ax1.plot(range(1, len(history.history['loss']) + 1), history.history['val_loss'], label = 'val_loss')
ax1.set_title('Loss at Training', fontsize = 14)
ax1.legend()
ax2.plot(range(1, len(history.history['loss']) + 1), history.history['accuracy'], label = 'accuracy')
ax2.plot(range(1, len(history.history['loss']) + 1), history.history['val_accuracy'], label = 'val_accuracy')
ax2.set_title('Accuracy at Training', fontsize = 14)
ax2.legend()
plt.show()

Load the weights from the epoch with the lowest `val_loss` during training.

In [22]:
model.load_weights('Resnet50_best_model.h5')
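
To see which epoch those weights come from, the recorded validation losses can be inspected directly. A minimal sketch, assuming a list shaped like `history.history['val_loss']`; `best_epoch` is a hypothetical helper and the loss values are made up.

```python
import numpy as np

def best_epoch(val_losses):
    """Return (1-based epoch, loss) of the lowest validation loss; ties pick the earliest."""
    idx = int(np.argmin(val_losses))
    return idx + 1, float(val_losses[idx])

toy_val_loss = [0.70, 0.55, 0.48, 0.52, 0.50]  # made-up values
epoch, loss = best_epoch(toy_val_loss)
print(epoch, loss)
```

In the notebook this would be called as `best_epoch(history.history['val_loss'])`.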

**Evaluating the Model**

Evaluate the model using accuracy and F1-score on the validation data.

In [23]:
val_pred_classes = np.array(model.predict(valid_dataset).flatten() >= .5, dtype = 'int')
print(f'Accuracy Valid Data : {accuracy_score(y_valid, val_pred_classes)}')
print(f'F1 Score Valid Data : {f1_score(y_valid, val_pred_classes)}')
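
For reference, both scores can be recomputed by hand from thresholded probabilities. The sketch below uses NumPy only and made-up values; `binary_metrics` is a hypothetical helper, and it treats label `1` as the positive class, which is also the default of `f1_score`.

```python
import numpy as np

def binary_metrics(y_true, probs, threshold=0.5):
    """Return (accuracy, F1) for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(probs) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = float(np.mean(y_pred == y_true))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1

y_true = [1, 0, 1, 1, 0, 0]               # made-up labels
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]    # made-up sigmoid outputs
accuracy, f1 = binary_metrics(y_true, probs)
print(accuracy, f1)
```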

Confusion Matrix

In [24]:
plt.figure(figsize = (7, 7))
sns.heatmap(confusion_matrix(y_valid, val_pred_classes, normalize = 'true'),
            annot=True, cmap=plt.cm.Blues)
plt.title('Normalized Confusion Matrix (Validation Data)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
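
With `normalize = 'true'`, each row of the matrix is divided by the number of samples of that actual class, so every row sums to 1 and the diagonal holds the per-class recall. A small NumPy sketch of that normalization, using made-up labels and a hypothetical helper name:

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, n_classes=2):
    """Confusion matrix with rows (actual classes) normalized to sum to 1."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no samples stay at zero instead of dividing by zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

y_true = [1, 0, 1, 1, 0, 0]  # made-up labels
y_pred = [1, 0, 0, 1, 1, 0]  # made-up predictions
cm = normalized_confusion_matrix(y_true, y_pred)
print(cm)
```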

## Creating the Submission

In [25]:
submission = pd.DataFrame({'id' :[x.split('/')[-1].split('.')[0] for x in new_test_images],
                           'jenis kelamin': np.array(model.predict(test_dataset).flatten() >= .5, dtype = 'int')})
test = test.merge(submission, on="id")
test.head()

In [26]:
test.to_csv("submission.csv", index=False)
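
The `id` column above is derived from each file path with `split('/')` and `split('.')`. A hedged alternative using `os.path` gives the same result for the UUID-style names in this dataset and also copes with file names that contain extra dots; `path_to_id` is a hypothetical helper.

```python
import os

def path_to_id(path: str) -> str:
    """Return the file name without its directory and final extension."""
    return os.path.splitext(os.path.basename(path))[0]

example = "../input/bdc-2021/preprocessed/test/0db77c32-10cc-4595-84e6-90c37678f518.jpg"
print(path_to_id(example))
```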