<a href="https://colab.research.google.com/github/Shirley-Dongxx/APS360_project/blob/main/APS360_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APS360 Project: Facial Age Progression and Regression


---
Contents to be added:


*   Image Preprocessing
*   CAAE Architecture
*   GAN Architecture
*   Baseline (optional)




## Data Preprocessing



*   Dataset used: The IMDB-WIKI dataset (only the WIKI set will be used)
*   Download the WIKI face only dataset at: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/


**What you need to do before running the code**

1.   Create a new folder called APS360_project directly under your MyDrive folder (in case you don't want to change any of the file paths in the code)
2.   Upload the wiki_crop dataset to your APS360_project folder (At present I am only testing on the folder 0-9)
3.   Upload the wiki.mat file to the APS360_project folder (which should have been included in your unzipped wiki_crop folder)
4.   Create a new folder called wiki_processed directly under APS360_project
5.   ~~Within the wiki_processed folder, create 3 folders named "train", "validation" and "test"~~
6.   ~~Within the above 3 folders, create 2 folders named "female" and "male" (since we want to train the two genders separately)~~
7.   ~~Within the gender folders, create 10 folders with folder name from 0-9~~

*(Steps 5-7 are no longer needed: the preprocessing function now creates these folders automatically.)*


### Import data from Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Load the mat file of the Wiki dataset


**Reference**

@article\
{Rothe-IJCV-2018,\
  author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},\
  title = {Deep expectation of real and apparent age from a single image without facial landmarks},\
  journal = {International Journal of Computer Vision},\
  volume={126},\
  number={2-4},\
  pages={144--157},\
  year={2018},\
  publisher={Springer}\
}

@InProceedings\
{Rothe-ICCVW-2015,\
  author = {Rasmus Rothe and Radu Timofte and Luc Van Gool},\
  title = {DEX: Deep EXpectation of apparent age from a single image},\
  booktitle = {IEEE International Conference on Computer Vision Workshops (ICCVW)},\
  year = {2015}, \
  month = {December},\
}



In [52]:
from datetime import datetime, timedelta
from scipy.io import loadmat
import numpy as np
import pandas as pd
import scipy
import os
import imghdr
import matplotlib.pyplot as plt
import itertools

from collections import defaultdict
from pprint import pprint
from shutil import copy2

import cv2
import sys
from google.colab.patches import cv2_imshow

from sklearn.model_selection import train_test_split

In [4]:
# convert a MATLAB datenum to a Python datetime
def matlab_datenum_to_dt(matlab_datenum):
    # MATLAB datenums are 366 days ahead of Python ordinals; the fractional part is the time of day
    return datetime.fromordinal(int(matlab_datenum) - 366) + timedelta(days=matlab_datenum % 1)
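
As a quick sanity check of the conversion above, the sketch below applies the same 366-day offset to one known value. It assumes the MATLAB convention that datenum 730486 corresponds to 1 January 2000; `datenum_to_datetime` is a hypothetical stand-alone name used only for this illustration.

```python
from datetime import datetime, timedelta

def datenum_to_datetime(matlab_datenum):
    # MATLAB datenums count days from year 0, Python ordinals from year 1,
    # so the two scales differ by 366 days.
    whole_days = int(matlab_datenum)
    day_fraction = matlab_datenum % 1  # time of day as a fraction of 24 hours
    return datetime.fromordinal(whole_days - 366) + timedelta(days=day_fraction)

midnight = datenum_to_datetime(730486)
noon = datenum_to_datetime(730486.5)
print(midnight)  # 2000-01-01 00:00:00
print(noon)      # 2000-01-01 12:00:00
```

Only the year of the date of birth is used later, so the time-of-day part does not affect the computed ages.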

In [5]:
# load wiki metadata
# Modified from https://gist.github.com/messefor/e2ee5fe1c18a040c90bbf91f2ee279e3

def load_wiki_meta():

    # Please change to your own path when testing!
    file_path = "/content/gdrive/MyDrive/APS360_project/wiki.mat"
    save_path = "/content/gdrive/MyDrive/APS360_project/wiki.pkl"

    mat = loadmat(file_path)

    print("Data header:",mat['__header__'])
    print("Data Version:",mat['__version__'])

    # Extract values
    data = mat['wiki'][0, 0]
    print("Column names are: ", data.dtype.names)

    # Data loaded in simple form
    col_keys = ('dob', 'photo_taken', 'gender', 'face_location', 'face_score', 'second_face_score')
    col_values = {k: data[k].squeeze() for k in col_keys}

    # Data loaded into numpy arrays
    col_keys_nested = ('full_path', 'name')
    for key in col_keys_nested:
        col_values[key] = np.array([x if not x else x[0] for x in data[key][0]])
    #print(col_values['name'][5])

    # Convert each face location to a plain tuple
    # Inputs:
    #    img - image (i.e. load with imread)
    #    box - location of face (i.e. img(box(2):box(4),box(1):box(3),:))
    #    crop_margin - margin around face as a fraction of the width, height
    #    [left above right below], default is [0.4 0.4 0.4 0.4]
    col_values['face_location'] =[tuple(x[0].tolist()) for x in data['face_location'].squeeze()]

    # Check all values extracted have same length
    set_nrows = {len(v) for _, v in col_values.items()}
    assert len(set_nrows) == 1

    # convert to panda data frame
    df_values = pd.DataFrame(col_values)

    # Convert matlab datenum to datetime
    df_values['dob'] = df_values['dob'].apply(matlab_datenum_to_dt)

    # Calculate ages when photo was taken
    df_values['photo_taken_age'] = df_values.apply(lambda x: x['photo_taken'] - x['dob'].year, axis=1)

    # Save the DataFrame
    # Use pickle rather than CSV so that the face_location tuples are not turned into strings
    df_values.to_pickle(save_path)

load_wiki_meta()

Data header: b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sat Jan 16 16:25:20 2016'
Data Version: 1.0
Column names are:  ('dob', 'photo_taken', 'full_path', 'gender', 'name', 'face_location', 'face_score', 'second_face_score')




In [6]:
# check if the image is in the correct format
DATA_PATH_FILTERED = "/content/gdrive/MyDrive/APS360_project/00"
def check_path():
  count = 0
  for root, dirs, files in os.walk(DATA_PATH_FILTERED, topdown = False):
    for name in files:
      print("loading...")
      filepath = os.path.join(root, name)
      count += 1
      if imghdr.what(filepath) != 'jpeg':
        print(filepath)
  print(count)

### Load the image into different datasets with labels

**The separation of image labels**

The image labels are separated based on the age of the person when the photo was taken.

Label | Age Range (Inclusive)
------|----------------------
0     | 0-5
1     | 6-10
2     | 11-15
3     | 16-20
4     | 21-30
5     | 31-40
6     | 41-50
7     | 51-60
8     | 61-70
9     | 71+
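
The table above can also be expressed as a lookup over the upper bound of each range. This is only a compact sketch of the same mapping, not the function used in the notebook; `AGE_UPPER_BOUNDS` and `age_to_label` are hypothetical names, and the sketch assumes the age is a non-negative integer.

```python
from bisect import bisect_left

# Inclusive upper age bound of labels 0..8; label 9 covers every age above 70.
AGE_UPPER_BOUNDS = [5, 10, 15, 20, 30, 40, 50, 60, 70]

def age_to_label(age):
    # bisect_left returns the index of the first bound that is >= age,
    # which is exactly the label of the range containing that age.
    return bisect_left(AGE_UPPER_BOUNDS, age)

labels = [age_to_label(a) for a in (0, 5, 6, 20, 21, 70, 71, 100)]
print(labels)  # [0, 0, 1, 3, 4, 8, 9, 9]
```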



In [7]:
# record the paths of data and metadata
# please change to your own path when running
META_PATH = "/content/gdrive/MyDrive/APS360_project/wiki.pkl"


IMG_SIZE = 2048

# count the number of faces in the image
def count_face(imagePath):

  image = cv2.imread(imagePath)
  # cv2.imread returns None when the file cannot be decoded as an image
  if image is None:
    return 0
  #cv2_imshow(image)
  gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

  faceCascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
  faces = faceCascade.detectMultiScale(
      gray,
      scaleFactor=1.3,
      minNeighbors=3,
      minSize=(30, 30)
  )

  print("[INFO] Found {0} Faces.".format(len(faces)))
  return len(faces)

# get the gender of the image
def get_gender(gender):
    if(gender == 1.0):
      return "male"
    else:
      return "female"

# check if the image is valid
def check_valid_image(path):
    # if no image can be retrieved from the path
    if not (os.path.isfile(path)):
      print("Invalid Path!")
      return False

    # if the file is too small (IMG_SIZE bytes or less) to be a usable image
    if not (os.path.getsize(path) > IMG_SIZE):
      print("Invalid Size!")
      return False
    
    # if there are no faces or too many faces
    if (count_face(path)!=1):
      print("Abort picture!")
      return False

    return True

# check if the age of the person is in our range
def check_valid_age(age):
    if (age < 0 or age > 100):
      return False
    return True

In [8]:
def get_age_label(age):
    if 0 <= age <= 5:
        label = 0
    elif 6 <= age <= 10:
        label = 1
    elif 11 <= age <= 15:
        label = 2
    elif 16 <= age <= 20:
        label = 3
    elif 21 <= age <= 30:
        label = 4
    elif 31 <= age <= 40:
        label = 5
    elif 41 <= age <= 50:
        label = 6
    elif 51 <= age <= 60:
        label = 7
    elif 61 <= age <= 70:
        label = 8
    else:
        label = 9

    return label

In [9]:
def getBaseFilename(filename):
    return filename.split('/')[-1]

In [10]:
def get_folder(filename):
    return int(filename.split('/')[0])

In [11]:
def print_age_distribution(age_count):
  fig = plt.figure()
  ax = fig.add_axes([0,0,1,1])
  age_label = [0,1,2,3,4,5,6,7,8,9]
  count_label = age_count
  ax.plot(age_label,count_label)
  plt.show()

def print_gender_distribution(gender_count):
  fig = plt.figure()
  ax = fig.add_axes([0,0,1,1])
  gender_label = ["female", "male"]
  count_label = gender_count
  ax.bar(gender_label,count_label)
  plt.show()

In [53]:
DATA_PATH = "/content/gdrive/MyDrive/APS360_project/wiki_crop"
DATA_SAVE_PATH = "/content/gdrive/MyDrive/APS360_project/wiki_processed"

def process_wiki_image(file_test = False):
    uploaded_folder = 19

    # create the output folders: <split>/<gender>/<age label>
    for split in ("train", "validation", "test"):
        for gender_name in ("female", "male"):
            for k in range(10):
                folder = os.path.join(DATA_SAVE_PATH, split, gender_name, str(k))
                if not os.path.isdir(folder):
                    print("Path ", folder, "has not formed")
                    # makedirs also creates any missing parent folders
                    os.makedirs(folder)

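    # one independent counter list per split (10 age labels, 2 genders)
    test_age_count, train_age_count, val_age_count = ([0] * 10 for _ in range(3))
    test_gender_count, train_gender_count, val_gender_count = ([0] * 2 for _ in range(3))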
    meta = pd.read_pickle(META_PATH)

    # Limit the ages in the dataset
    meta = meta[meta['photo_taken_age'] >= 0]
    meta = meta[meta['photo_taken_age'] <= 101]

    # Converting into numpy array
    meta = meta.values
    remove_indices = []
    for i in range(len(meta)):
        if get_folder(meta[i][6])>uploaded_folder:
            remove_indices.append(i)

    meta = np.array(list(itertools.compress(meta, [i not in remove_indices for i in range(len(meta))])))
    print(meta[0])

    # Split dataset into training validation and testing set
    D_train, D_test = train_test_split(meta, test_size=0.15, random_state=62)
    D_train, D_val = train_test_split(D_train, test_size=0.15, random_state=51)

    # Load the training set
    # test a few images at first
    n = 0
    for i in range(len(D_train)):
      age = D_train[i][-1]
      #print("Age is", age)

      full_path = os.path.join(DATA_PATH, D_train[i][6])
      #print("The path is:", full_path)

      # currently only testing on folder 0-9
      #print("folder is: ",get_folder(D_train[i][6]))

      if get_folder(D_train[i][6])>uploaded_folder:
        #print("out")
        new_age = None
        continue

      if not check_valid_image(full_path):
        continue

      n += 1
      print("n=",n)

      if file_test and n>5:
        print("Testing end...")
        print_age_distribution(train_age_count)
        print_gender_distribution(train_gender_count)
        return

      # obtain the gender and age of the person in the image
      gender = get_gender(D_train[i][2])
      age_label = get_age_label(age)
      train_age_count[age_label] += 1
      age_label = str(age_label)
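      # index 0 = female, 1 = male; a missing (NaN) gender is counted as female, matching get_gender
      train_gender_count[1 if gender == "male" else 0] += 1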

      new_file_path = DATA_SAVE_PATH + "/train/"+ gender + "/" + age_label +"/"
      #print("The new file path is:", new_file_path)

      # check if the folder exists
      if not os.path.isdir(new_file_path):
        print("Folder not found!")

      new_file_path = os.path.join(new_file_path, getBaseFilename(D_train[i][6]))
      if(file_test): 
        print("Now, the new file path is:", new_file_path)

      copy2(full_path, new_file_path)
      if(file_test):
        print("copy finished")

    for i in range(len(D_val)):
      age = D_val[i][-1]
      full_path = os.path.join(DATA_PATH, D_val[i][6])

      if get_folder(D_val[i][6])>uploaded_folder:
        new_age = None
        continue

      if not check_valid_image(full_path):
        continue

      # obtain the gender and age of the person in the image
      gender = get_gender(D_val[i][2])
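      age_label = get_age_label(age)
      val_age_count[age_label] += 1
      age_label = str(age_label)
      # index 0 = female, 1 = male; a missing (NaN) gender is counted as female, matching get_gender
      val_gender_count[1 if gender == "male" else 0] += 1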

      new_file_path = DATA_SAVE_PATH + "/validation/"+ gender + "/" + age_label +"/"
      #print("The new file path is:", new_file_path)

      # check if the folder exists
      if not os.path.isdir(new_file_path):
        print("Folder not found!")

      new_file_path = os.path.join(new_file_path, getBaseFilename(D_val[i][6]))
      if(file_test): 
        print("Now, the new file path is:", new_file_path)

      copy2(full_path, new_file_path)
      if(file_test):
        print("copy finished")

    # load testing data
    for i in range(len(D_test)):
      age = D_test[i][-1]
      full_path = os.path.join(DATA_PATH, D_test[i][6])

      if get_folder(D_test[i][6])>uploaded_folder:
        new_age = None
        continue

      if not check_valid_image(full_path):
        continue

      # obtain the gender and age of the person in the image
      gender = get_gender(D_test[i][2])
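      age_label = get_age_label(age)
      test_age_count[age_label] += 1
      age_label = str(age_label)
      # index 0 = female, 1 = male; a missing (NaN) gender is counted as female, matching get_gender
      test_gender_count[1 if gender == "male" else 0] += 1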

      new_file_path = DATA_SAVE_PATH + "/test/"+ gender + "/" + age_label +"/"
      #print("The new file path is:", new_file_path)

      # check if the folder exists
      if not os.path.isdir(new_file_path):
        print("Folder not found!")

      new_file_path = os.path.join(new_file_path, getBaseFilename(D_test[i][6]))
      if(file_test): 
        print("Now, the new file path is:", new_file_path)

      copy2(full_path, new_file_path)
      if(file_test):
        print("copy finished")

    print("training statistics:")
    print_age_distribution(train_age_count)
    print_gender_distribution(train_gender_count)

    print("validation statistics:")
    print_age_distribution(val_age_count)
    print_gender_distribution(val_gender_count)

    print("testing statistics:")
    print_age_distribution(test_age_count)
    print_gender_distribution(test_gender_count)

In [None]:
# for testing purpose only
process_wiki_image(file_test = True)

In [51]:
'''
meta = pd.read_pickle(META_PATH)
import itertools

# Limit the folder of the dataset set
#print(meta['full_path'])
#

# Limit the age of the dataset set
meta = meta[meta['photo_taken_age'] >= 0]
meta = meta[meta['photo_taken_age'] <= 101]

# Converting into numpy array
meta = meta.values

remove_indices = []
for i in range(len(meta)):
  if get_folder(meta[i][6])>19:
    remove_indices.append(i)


##elements_to_remove = meta[remove_indices]
#meta = np.delete(meta, np.where(meta == value_to_delete))
meta = np.array(list(itertools.compress(meta, [i not in remove_indices for i in range(len(meta))])))
print(meta[0])
print(len(meta))
#
# Split dataset into training validation and testing set
D_train, D_test = train_test_split(meta, test_size=0.15, random_state=62)
D_train, D_val = train_test_split(D_train, test_size=0.15, random_state=51)


print(len(D_train))
print(len(D_val))
print(len(D_test))
'''

[datetime.datetime(1981, 5, 5, 0, 0) 2009 1.0
 (111.29109473290997, 111.29109473290997, 252.66993081807996, 252.66993081807996)
 4.3009623883308095 nan '17/10000217_1981-05-05_2009.jpg'
 'Sami Jauhojärvi' 28]
12398
8957
1581
1860
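
The three split sizes printed above are consistent with holding out 15% twice: first for the test set, then 15% of the remainder for validation. The sketch below reproduces the arithmetic, assuming scikit-learn rounds the held-out count up and gives the rest to the other part; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_fraction=0.15, val_fraction=0.15):
    # Assumption: the held-out count is rounded up, the remainder is kept.
    n_test = math.ceil(n_samples * test_fraction)
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

sizes = split_sizes(12398)
print(sizes)  # (8957, 1581, 1860)
```

In other words, the effective proportions are roughly 72% training, 13% validation and 15% testing.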


**What you should have after data preprocessing**

A set of folders will have been created in your Google Drive to store the categorized dataset. Each of the train, validation and test sets is split by gender and then by the age labels listed above.
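
One way to check the result is to count the images in each `split/gender/label` folder. The following is a minimal sketch that assumes the layout described above and `.jpg` file names; `count_images` is a hypothetical helper, and the demo runs on a temporary folder rather than on Google Drive.

```python
import tempfile
from pathlib import Path

def count_images(root):
    # Map "split/gender/label" to the number of .jpg files in that folder.
    counts = {}
    for folder in sorted(Path(root).glob("*/*/*")):
        if folder.is_dir():
            key = folder.relative_to(root).as_posix()
            counts[key] = sum(1 for f in folder.iterdir() if f.suffix.lower() == ".jpg")
    return counts

# Tiny demo tree with one image, so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp) / "train" / "female" / "4"
    demo_folder.mkdir(parents=True)
    (demo_folder / "example.jpg").touch()
    demo_counts = count_images(tmp)

print(demo_counts)  # {'train/female/4': 1}
```

On the real data, the same function could be called with the `wiki_processed` path to see how many images ended up in each folder.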


## Architecture I: CAAE

---
Both the encoder and the generator use a four-layer convolutional structure, with pooling replaced by strided convolutions. The model also has two discriminators: one works on the personal features (z) and the other on the original and generated face images. Both help the generator produce more realistic images.

* Encoder
* Generator
* Discriminator-z
* Discriminator - image
* Others (training, data loading etc.)
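
To illustrate how strided convolutions take over the downsampling role of pooling, the sketch below tracks the spatial size of a feature map through four stride-2 layers. The input size (128), kernel size (5) and padding (2) are assumptions chosen for illustration, not values fixed by this notebook.

```python
def conv_output_size(size, kernel, stride, padding):
    # Standard convolution arithmetic: floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical settings: 128x128 input, 5x5 kernels, stride 2, padding 2.
size = 128
sizes = [size]
for _ in range(4):
    size = conv_output_size(size, kernel=5, stride=2, padding=2)
    sizes.append(size)

print(sizes)  # [128, 64, 32, 16, 8]
```

Each stride-2 layer halves the width and height, so no separate pooling layer is needed to shrink the feature map.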



## Architecture II: GAN


---
Generative Adversarial Networks (GANs) use an adversarial loss to force the generated images to be realistic. The network contains one generator and one discriminator. The generator is made up of convolution layers that map a sampled noise vector to a synthesized image. The discriminator takes in real and generated face images and tries to tell them apart. LeakyReLU activations are used in all discriminator layers to stabilize training.


* Generator
* Discriminator
* Others (training, data loading etc.)
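
The adversarial objective can be sketched on scalar values without any deep learning library. The code below is only an illustration under stated assumptions: it uses the binary cross-entropy form of the loss with the common non-saturating generator objective, and a LeakyReLU slope of 0.2. Neither choice is specified in this notebook, and all function names are hypothetical.

```python
import math

def leaky_relu(x, negative_slope=0.2):
    # Unlike ReLU, negative inputs keep a small slope instead of becoming 0.
    return x if x > 0 else negative_slope * x

def bce(prob, target):
    # Binary cross-entropy for one predicted probability; eps avoids log(0).
    eps = 1e-12
    return -(target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps))

def discriminator_loss(d_real, d_fake):
    # The discriminator wants D(real) near 1 and D(generated) near 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # The generator wants the discriminator to output 1 on generated images.
    return bce(d_fake, 1.0)

confident_d = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled_d = discriminator_loss(d_real=0.5, d_fake=0.5)
print(confident_d < fooled_d)                    # True
print(generator_loss(0.9) < generator_loss(0.1))  # True
```

The two losses pull in opposite directions on the discriminator's output for generated images, which is what makes the training adversarial.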

