# Title

### Preparing the Training, Validation and Test datasets

Firstly, you will need to download the 3 landmark list files from the [UTKFace Google Drive](https://drive.google.com/open?id=0BxYys69jI14kS1lmbW1jbkFHaW8) and copy them to the project directory.

We import the `concatenate_files` function from the `preselect.py` script and use that to join the 3 parts into a single file called `landmark_list_concatenated.txt` (while deleting the 3 original files). The concatenated file has 23,708 lines, one for each cropped image in the dataset.

In [1]:
from preselect import concatenate_files

# Concatenating the 3 landmark files into a single one
ll_parts = ["landmark_list_part1.txt", "landmark_list_part2.txt", "landmark_list_part3.txt"]
concatenate_files("landmark_list_concatenated.txt", files=ll_parts, delete=True)
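
The expected line count quoted above can be checked before moving on. The helpers below are a minimal sketch; `count_lines` and `check_line_count` are hypothetical names used for illustration and are not part of `preselect.py`.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


def check_line_count(path, expected):
    """Return True when the file exists and has exactly `expected` lines."""
    p = Path(path)
    return p.is_file() and count_lines(p) == expected


# Example (run after the concatenation cell):
# check_line_count("landmark_list_concatenated.txt", 23708)
```

If the check returns `False`, one of the three parts was probably not downloaded completely or was not copied into the project directory.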

Before filtering and splitting the dataset, we need to correct some minor typos in a few image names:

In [2]:
with open('landmark_list_concatenated.txt', 'r') as f:
    lines = f.readlines()

lines[8512] = lines[8512].replace("61_1_20170109142408075.jpg", "61_1_1_20170109142408075.jpg")  # Missing gender
lines[8513] = lines[8513].replace("61_3_20170109150557335.jpg", "61_1_3_20170109150557335.jpg")  # Missing gender
lines[13951] = lines[13951].replace("53__0_20170116184028385.jpg", "53_1_0_20170116184028385.jpg")  # Missing gender
lines[20080] = lines[20080].replace("39_1_20170116174525125.jpg", "39_1_1_20170116174525125.jpg")  # Missing gender
lines[20585] = lines[20585].replace("55_0_0_20170116232725357jpg", "55_0_0_20170116232725357.jpg")  # Missing period
lines[20621] = lines[20621].replace("24_0_1_20170116220224657 .jpg", "24_0_1_20170116220224657.jpg")  # Space in name
lines[20647] = lines[20647].replace("44_1_4_20170116235150272.pg", "44_1_4_20170116235150272.jpg")  # Wrong extension

with open('landmark_list_corrected.txt', "w") as f:
    f.writelines(lines)
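
Because the corrections above rely on hard-coded line indices, it can be useful to confirm that no malformed names remain. The sketch below assumes that every line starts with the image name, followed by whitespace-separated landmark coordinates, and that names follow the `age_gender_race_timestamp.jpg` pattern (gender 0-1, race 0-4). `NAME_PATTERN` and `find_malformed` are hypothetical names, not part of `preselect.py`.

```python
import re

# Assumed naming scheme: age_gender_race_timestamp.jpg
NAME_PATTERN = re.compile(r"\d{1,3}_[01]_[0-4]_\d+\.jpg")


def find_malformed(lines):
    """Return (index, name) pairs for lines whose first token is not a valid image name."""
    bad = []
    for index, line in enumerate(lines):
        tokens = line.split()
        name = tokens[0] if tokens else ""
        if not NAME_PATTERN.fullmatch(name):
            bad.append((index, name))
    return bad


# Example (run after the correction cell):
# with open("landmark_list_corrected.txt", "r") as f:
#     print(find_malformed(f.readlines()))
```

An empty list suggests that all names in the corrected file can be parsed by the later filtering steps.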

In order to keep the data consistent, we decided to focus on adults of known races. To remove the undesired images, we import the `preselect_landmarks` function from the `preselect.py` script and use it to filter out images where the subject's age is under 15 or above 64, or where their race is classified as 'other'. The script will produce a file called `ll_initial.txt` in the current directory with the 16,862 images that fit those criteria.

In [3]:
from preselect import preselect_landmarks

# Selecting images for people of known races and with ages between 15-64.
preselect_landmarks('landmark_list_corrected.txt', age=(15, 64),
                    race=['asian', 'black', 'indian', 'white'], filename='ll_initial',
                    randomise=True, randomseed=680780122122)
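
To make the selection criteria concrete, the sketch below decodes the labels embedded in an image name and applies the same age and race rule. It is an illustration only, not the implementation of `preselect_landmarks`; the helper names are hypothetical, the label codes are assumed from the UTKFace naming scheme (`age_gender_race_timestamp.jpg`), and the age bounds are treated as inclusive.

```python
# Label codes assumed from the UTKFace naming scheme
GENDERS = {0: "male", 1: "female"}
RACES = {0: "white", 1: "black", 2: "asian", 3: "indian", 4: "other"}


def parse_name(name):
    """Split an image name into its (age, gender, race) labels."""
    age, gender, race = name.split("_")[:3]
    return int(age), GENDERS[int(gender)], RACES[int(race)]


def is_selected(name, age=(15, 64), race=("asian", "black", "indian", "white")):
    """Return True when the image is inside the age range and its race is listed."""
    img_age, _, img_race = parse_name(name)
    return age[0] <= img_age <= age[1] and img_race in race
```

For example, `is_selected("61_1_3_20170109150557335.jpg")` is `True`, while an image labelled with race code 4 or with an age of 14 would be rejected.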

Now let's run `preselect_landmarks` two more times to produce a file with 7,500 males and another with 7,500 females.

In [4]:
# Selecting 7,500 male images from the initial dataset
preselect_landmarks('ll_initial.txt', gender="male", target=7500, randomise=True, randomseed=680780122122)

# Selecting 7,500 female images from the initial dataset
preselect_landmarks('ll_initial_male_remainder.txt', gender="female", target=7500, randomise=True, randomseed=680780122122)

To make things simpler, let's rename the files to `ll_males.txt` and `ll_females.txt`, respectively. We can also delete the intermediary/additional files generated.

In [5]:
import os
# Renaming the output files for clarity
os.rename('ll_initial_male_filtered.txt', 'll_males.txt')
os.rename('ll_initial_male_remainder_female_filtered.txt', 'll_females.txt')

# Deleting intermediary files
os.remove('landmark_list_concatenated.txt')
os.remove('landmark_list_corrected.txt')
os.remove('ll_initial.txt')
os.remove('ll_initial_male_remainder.txt')
os.remove('ll_initial_male_remainder_female_remainder.txt')

Now we can extract 750 lines (10%) from each file and concatenate them into a single balanced test file called `ll_test_50-50.txt` with 1,500 images.

In [6]:
# Splitting the male data set into test (750 images) and training/validation (6,750 images)
preselect_landmarks('ll_males.txt', filename="ll_males_test", gender="male", target=750, randomise=True, randomseed=680780122122)

# Splitting the female data set into test (750 images) and training/validation (6,750 images)
preselect_landmarks('ll_females.txt', filename="ll_females_test", gender="female", target=750, randomise=True, randomseed=680780122122)

# Concatenating the male/female test images into a single dataset
concatenate_files("ll_test_50-50.txt", files=['ll_males_test_filtered.txt', 'll_females_test_filtered.txt'], delete=True)

# Deleting intermediary files
os.remove('ll_males.txt')
os.remove('ll_females.txt')

Now, we'll split the remaining images into Training and Validation sets, but we won't concatenate them just yet.

In [7]:
# Splitting the male data set into training (6,000 images) and validation (750 images)
preselect_landmarks('ll_males_test_remainder.txt', filename="ll_males_validation", gender="male", target=750)

# Renaming the output files for clarity
os.rename('ll_males_validation_filtered.txt', 'll_males_validation.txt')
os.rename('ll_males_validation_remainder.txt', 'll_males_training.txt')

# Deleting intermediary files
os.remove('ll_males_test_remainder.txt')

# Splitting the female data set into training (6,000 images) and validation (750 images)
preselect_landmarks('ll_females_test_remainder.txt', filename="ll_females_validation", gender="female", target=750)

# Renaming the output files for clarity
os.rename('ll_females_validation_filtered.txt', 'll_females_validation.txt')
os.rename('ll_females_validation_remainder.txt', 'll_females_training.txt')

# Deleting intermediary files
os.remove('ll_females_test_remainder.txt')
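
At this point the test, validation and training files should not share any images. The following is a minimal sketch of such a leakage check, assuming that the first whitespace-separated token of each line is the image name; `image_names` and `find_overlaps` are hypothetical helpers, not part of `preselect.py`.

```python
from itertools import combinations


def image_names(path):
    """Collect the first token of every non-empty line (assumed to be the image name)."""
    with open(path, "r") as f:
        return {line.split()[0] for line in f if line.strip()}


def find_overlaps(paths):
    """Return {(path_a, path_b): shared_names} for every pair of files that share images."""
    names = {path: image_names(path) for path in paths}
    overlaps = {}
    for a, b in combinations(paths, 2):
        shared = names[a] & names[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Example (run after the cell above):
# find_overlaps(["ll_test_50-50.txt",
#                "ll_males_training.txt", "ll_males_validation.txt",
#                "ll_females_training.txt", "ll_females_validation.txt"])
```

An empty dictionary means that no image appears in more than one of the listed files.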

Now, we need to split the **training** set into thirds so that we can compose our 3 separate cohorts (25-75 split, 50-50 split, 75-25 split).

In [8]:
# Splitting the male training dataset into 3 files with 2,000 images each
preselect_landmarks('ll_males_training.txt', filename="ll_males_training_1", gender="male", target=2000)
preselect_landmarks('ll_males_training_1_remainder.txt', filename="ll_males_training_2", gender="male", target=2000)

# Renaming the output files for clarity
os.rename('ll_males_training_1_filtered.txt', 'll_males_training_cohort_1.txt')
os.rename('ll_males_training_2_filtered.txt', 'll_males_training_cohort_2.txt')
os.rename('ll_males_training_2_remainder.txt', 'll_males_training_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_males_training.txt')
os.remove('ll_males_training_1_remainder.txt')

# Splitting the female training dataset into 3 files with 2,000 images each
preselect_landmarks('ll_females_training.txt', filename="ll_females_training_1", gender="female", target=2000)
preselect_landmarks('ll_females_training_1_remainder.txt', filename="ll_females_training_2", gender="female", target=2000)

# Renaming the output files for clarity
os.rename('ll_females_training_1_filtered.txt', 'll_females_training_cohort_1.txt')
os.rename('ll_females_training_2_filtered.txt', 'll_females_training_cohort_2.txt')
os.rename('ll_females_training_2_remainder.txt', 'll_females_training_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_females_training.txt')
os.remove('ll_females_training_1_remainder.txt')

# 1 female cohort + 3 male cohorts
concatenate_files('ll_training_25-75_split.txt', files=['ll_females_training_cohort_3.txt', 'll_males_training_cohort_1.txt', 'll_males_training_cohort_2.txt', 'll_males_training_cohort_3.txt'])

# 2 female cohorts + 2 male cohorts
concatenate_files('ll_training_50-50_split.txt', files=['ll_females_training_cohort_1.txt', 'll_females_training_cohort_2.txt', 'll_males_training_cohort_1.txt', 'll_males_training_cohort_2.txt'])

# 3 female cohorts + 1 male cohort
concatenate_files('ll_training_75-25_split.txt', files=['ll_females_training_cohort_1.txt', 'll_females_training_cohort_2.txt', 'll_females_training_cohort_3.txt','ll_males_training_cohort_3.txt'])

# Deleting intermediary files
os.remove('ll_females_training_cohort_1.txt')
os.remove('ll_females_training_cohort_2.txt')
os.remove('ll_females_training_cohort_3.txt')
os.remove('ll_males_training_cohort_1.txt')
os.remove('ll_males_training_cohort_2.txt')
os.remove('ll_males_training_cohort_3.txt')

Now we do the same with the **validation** set:

In [9]:
# Splitting the male validation dataset into 3 files with 250 images each
preselect_landmarks('ll_males_validation.txt', filename="ll_males_validation_1", gender="male", target=250)
preselect_landmarks('ll_males_validation_1_remainder.txt', filename="ll_males_validation_2", gender="male", target=250)

# Renaming the output files for clarity
os.rename('ll_males_validation_1_filtered.txt', 'll_males_validation_cohort_1.txt')
os.rename('ll_males_validation_2_filtered.txt', 'll_males_validation_cohort_2.txt')
os.rename('ll_males_validation_2_remainder.txt', 'll_males_validation_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_males_validation.txt')
os.remove('ll_males_validation_1_remainder.txt')

# Splitting the female validation dataset into 3 files with 250 images each
preselect_landmarks('ll_females_validation.txt', filename="ll_females_validation_1", gender="female", target=250)
preselect_landmarks('ll_females_validation_1_remainder.txt', filename="ll_females_validation_2", gender="female", target=250)

# Renaming the output files for clarity
os.rename('ll_females_validation_1_filtered.txt', 'll_females_validation_cohort_1.txt')
os.rename('ll_females_validation_2_filtered.txt', 'll_females_validation_cohort_2.txt')
os.rename('ll_females_validation_2_remainder.txt', 'll_females_validation_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_females_validation.txt')
os.remove('ll_females_validation_1_remainder.txt')

# 1 female cohort + 3 male cohorts
concatenate_files('ll_validation_25-75_split.txt', files=['ll_females_validation_cohort_3.txt', 'll_males_validation_cohort_1.txt', 'll_males_validation_cohort_2.txt', 'll_males_validation_cohort_3.txt'])

# 2 female cohorts + 2 male cohorts
concatenate_files('ll_validation_50-50_split.txt', files=['ll_females_validation_cohort_1.txt', 'll_females_validation_cohort_2.txt', 'll_males_validation_cohort_1.txt', 'll_males_validation_cohort_2.txt'])

# 3 female cohorts + 1 male cohort
concatenate_files('ll_validation_75-25_split.txt', files=['ll_females_validation_cohort_1.txt', 'll_females_validation_cohort_2.txt', 'll_females_validation_cohort_3.txt','ll_males_validation_cohort_3.txt'])

# Deleting intermediary files
os.remove('ll_females_validation_cohort_1.txt')
os.remove('ll_females_validation_cohort_2.txt')
os.remove('ll_females_validation_cohort_3.txt')
os.remove('ll_males_validation_cohort_1.txt')
os.remove('ll_males_validation_cohort_2.txt')
os.remove('ll_males_validation_cohort_3.txt')
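
Each split file is named after its intended female-male proportion, so a quick tally of the gender codes can confirm that the cohorts were combined as planned. This is a minimal sketch, assuming each line starts with an `age_gender_race_timestamp.jpg` name in which gender code 0 means male and 1 means female; `gender_share` is a hypothetical helper.

```python
from collections import Counter

# Gender codes assumed from the UTKFace naming scheme
GENDER_CODES = {"0": "male", "1": "female"}


def gender_share(path):
    """Return the fraction of male and female images listed in a split file."""
    counts = Counter()
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue
            parts = line.split("_")
            if len(parts) < 4 or parts[1] not in GENDER_CODES:
                raise ValueError(f"Unexpected image name in line: {line!r}")
            counts[GENDER_CODES[parts[1]]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gender: counts[gender] / total for gender in GENDER_CODES.values()}


# Example (run after the cells above):
# gender_share("ll_training_25-75_split.txt")
```

For a file built from 1 female cohort and 3 male cohorts of equal size, the female share should come out at 0.25 and the male share at 0.75.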

### Preparing the Image Files

1. You'll also need to download the "Aligned & Cropped Faces" images from UTKFace's [Google Drive](https://drive.google.com/drive/folders/0BxYys69jI14kU0I1YUQyY1ZDRUE?resourcekey=0-01Pth1hq20K4kuGVkp3oBw) (you want "UTKFace.tar.gz" which sits at about 102MB).
2. Copy the _UTKFace.tar.gz_ file into the repo folder.
3. If you are on a Mac/Unix/Linux machine, you can run the `make.sh` file to unzip the above file and correct 5 known issues with image naming. **Alternatively, you can perform the following steps manually:**
   1. Unzip _UTKFace.tar.gz_ .
   2. Within the "UTKFace" directory, you will need to find and rename the following files:
       * 24_0_1_20170116220224657 .jpg.chip.jpg -> 24_0_1_20170116220224657.jpg.chip.jpg
       * 55_0_0_20170116232725357jpg.chip.jpg -> 55_0_0_20170116232725357.jpg.chip.jpg
       * 61_1_20170109142408075.jpg.chip.jpg -> 61_1_1_20170109142408075.jpg.chip.jpg
       * 61_1_20170109150557335.jpg.chip.jpg -> 61_1_3_20170109150557335.jpg.chip.jpg
       * 39_1_20170116174525125.jpg.chip.jpg -> 39_1_1_20170116174525125.jpg.chip.jpg
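
On systems where `make.sh` cannot be run, the manual steps above can also be scripted in Python. The sketch below is an assumption-laden alternative and does not reproduce the contents of `make.sh`: it assumes the archive unpacks into a "UTKFace" folder, and `extract_archive` and `fix_image_names` are hypothetical helper names. The rename table is copied from the list above.

```python
import os
import tarfile

# Old name -> new name, copied from the manual renaming steps
RENAMES = {
    "24_0_1_20170116220224657 .jpg.chip.jpg": "24_0_1_20170116220224657.jpg.chip.jpg",
    "55_0_0_20170116232725357jpg.chip.jpg": "55_0_0_20170116232725357.jpg.chip.jpg",
    "61_1_20170109142408075.jpg.chip.jpg": "61_1_1_20170109142408075.jpg.chip.jpg",
    "61_1_20170109150557335.jpg.chip.jpg": "61_1_3_20170109150557335.jpg.chip.jpg",
    "39_1_20170116174525125.jpg.chip.jpg": "39_1_1_20170116174525125.jpg.chip.jpg",
}


def extract_archive(archive="UTKFace.tar.gz", destination="."):
    """Unpack the gzip-compressed tar archive into `destination`."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)


def fix_image_names(folder="UTKFace", renames=RENAMES):
    """Rename the known-bad files found in `folder`; return the old names that were renamed."""
    renamed = []
    for old, new in renames.items():
        src = os.path.join(folder, old)
        if os.path.exists(src):
            os.rename(src, os.path.join(folder, new))
            renamed.append(old)
    return renamed


# Example (run from the repo folder):
# extract_archive()
# print(fix_image_names())
```

Files that are missing or already renamed are skipped, so the script can be run more than once without errors.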

### Training the Network

Now that the dataset and all the prep work have been completed, the neural network is ready to be trained.

We created two different network types to compare how efficiently and effectively each learns the facial detection task. Each model is trained on the varying male-to-female proportions prepared earlier, and the resulting models are then compared on a final, unbiased 50-50 split in the testing stage.

#### Convolutional Network 

#### Residual Network

### Testing the Models

Now that all models have been trained on the different subsets of the dataset, we can compare the trained models against each other in the testing phase. The isolated 50-50 male-to-female set, which was not used during the training/validation phase, is now used for testing.

#### Conclusion