# The Effect of Biased Gender Data on Face Recognition
* **Course:** Neural Networks and Deep Learning (COMP9444) 22T3
* **Mentor:** Kunzie Xie
* **Team:** Dr. Teeth and the Electric Mayhem
* **Members:**
    * _Daniel Gotilla_ (z5343046@student.unsw.edu.au)
    * _John Conlon_ (z5257381@student.unsw.edu.au)
    * _John Pham_ (z3216645@student.unsw.edu.au)
    * _Marco Seidenberg_ (z5264260@student.unsw.edu.au)
    * _Oscar Feng_ (z5396050@student.unsw.edu.au)

## Problem Statement



## Data Preparation

### Downloading, Concatenating, Validating and Correcting the Original Data

We chose to leverage the [UTKFace](https://github.com/aicip/UTKFace) dataset, built by Yang Song and Zhifei Zhang at the [Advanced Imaging and Collaborative Information Processing (AICIP) Lab](https://aicip.github.io/) at [The University of Tennessee, Knoxville (UTK)](http://www.utk.edu/). This dataset contains over 20,000 face images annotated with age, gender and ethnicity, plus facial landmark coordinates, and is available for non-commercial research purposes.

The dataset is provided in separate files (through links to Google Drive):
1. **[In-the-wild Faces (1.3GB)](https://drive.google.com/open?id=0BxYys69jI14kSVdWWllDMWhnN2c)** — original images, containing one or more faces. _Not used in our study._
2. **[Aligned & Cropped Faces (107MB)](https://drive.google.com/drive/folders/0BxYys69jI14kU0I1YUQyY1ZDRUE?usp=sharing)** — above images, cropped tightly around the faces of each subject; filenames contain the age, gender, race and datetime metadata.
3. **[Landmarks (68 points, 12MB)](https://drive.google.com/open?id=0BxYys69jI14kS1lmbW1jbkFHaW8)** — 3 separate text files listing all the aligned/cropped faces (in 2, above) and their associated facial landmarks.

Firstly, we need to download the 3 landmark list files from the [UTKFace Google Drive](https://drive.google.com/open?id=0BxYys69jI14kS1lmbW1jbkFHaW8) and copy them to the project directory (without renaming); you will have three files named:
* `landmark_list_part1.txt` (9,780 lines)
* `landmark_list_part2.txt` (10,719 lines)
* `landmark_list_part3.txt` (3,209 lines)

We can use the `concatenate_files` function (below) to join the 3 landmark list files into a single file called `landmark_list_concatenated.txt`, deleting the 3 original files in the process. (This function will be used extensively below to consolidate the training, validation and testing datasets.)

In [1]:
from os import remove
from random import shuffle, seed

def concatenate_files(target: str, files: list[str], *, delete: bool = False, randomise: bool = False,
                      randomseed=None):
    """
    Concatenates multiple files, producing a target file and (optionally) randomising lines and/or deleting the input files.

    :param target: Name for new (output) file concatenating the input files.
    :param files: List of files to concatenate.
    :param delete: Should the input files be deleted after concatenation?
    :param randomise: Should the lines of the input files be randomised?
    :param randomseed: If so, should we use a specific seed value?
    :return: No return value but a file with the target name will be created in the current directory.
    """
    lines = []

    for name in files:
        with open(name, 'r') as f:
            for line in f:
                lines.append(line)

    if randomise:
        if randomseed is not None:
            seed(randomseed)
        shuffle(lines)

    with open(target, "w") as new_file:
        for line in lines:
            new_file.write(line)

    if delete:
        for name in files:
            remove(name)

    print("Input files successfully concatenated as", target)

# Concatenating the 3 landmark files into a single one
ll_parts = ["landmark_list_part1.txt", "landmark_list_part2.txt", "landmark_list_part3.txt"]
concatenate_files("landmark_list_concatenated.txt", files=ll_parts, delete=True)

Input files successfully concatenated as landmark_list_concatenated.txt


The three partial files were replaced by a single, consolidated file named `landmark_list_concatenated.txt` with 23,708 lines, one for each cropped image in the dataset.

If you open and examine the file, you'll see something like this:
```
1_0_2_20161219140530307.jpg -4 71 -4 96 -3 120 -1 144 9 166 28 179 53 186 77 192 100 194 121 191 142 183 161 174 180 161 192 142 195 120 194 97 192 74 16 53 29 39 48 33 68 34 86 40 113 39 129 33 148 32 164 37 175 49 100 59 101 72 101 85 101 99 78 112 89 113 100 116 110 114 120 111 39 62 51 61 61 60 71 65 60 63 50 62 124 64 134 59 144 59 155 62 144 62 134 62 55 137 72 134 87 132 97 133 107 131 120 132 136 133 121 143 109 146 98 147 88 146 72 145 61 138 87 137 97 138 107 136 130 135 108 139 98 140 88 139
```

Each line follows the pattern `A_G_E_DDDDDDDDTTTTTTTTT.jpg X1 Y1 X2 Y2 … X68 Y68` where:
* `A` stands for the subject's age in years (1 to 3 digits);
* `G` stands for the subject's gender, where
    * 0 is male and
    * 1 is female;
* `E` stands for the subject's (perceived) ethnicity, where
    * 0 is "white",
    * 1 is "black",
    * 2 is "asian",
    * 3 is "indian",
    * 4 is "other";
* `DDDDDDDDTTTTTTTTT` stands for the date/time the image was added to the dataset;
* `X1 Y1 X2 Y2 … X68 Y68` stands for the 68 (x,y) coordinate pairs for the facial landmarks where
    * Pairs 1 through 17 map the contour of the subject's face, from ear to ear around chin;
    * Pairs 18 through 22 map the contour of the subject's left eyebrow;
    * Pairs 23 through 27 map the contour of the subject's right eyebrow;
    * Pairs 28 through 31 map the contour of the line on top of the subject's nose;
    * Pairs 32 through 36 map the contour of the bottom part of the subject's nose;
    * Pairs 37 through 42 map the contour of the subject's left eye;
    * Pairs 43 through 48 map the contour of the subject's right eye;
    * Pairs 49 through 60 map the outer contour of the subject's lips;
    * Pairs 61 through 68 map the inner contour of the subject's lips.
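
To make the line format above concrete, here is a minimal sketch of a parser for a single landmark line. The `LandmarkRecord` type and the `parse_landmark_line` helper are hypothetical names introduced only for illustration (they are not part of UTKFace or of the rest of this notebook), and the sample line uses made-up coordinates.

```python
from typing import NamedTuple

GENDERS = {"0": "male", "1": "female"}
RACES = {"0": "white", "1": "black", "2": "asian", "3": "indian", "4": "other"}


class LandmarkRecord(NamedTuple):
    """Hypothetical container for one parsed landmark line."""
    filename: str
    age: int
    gender: str
    race: str
    timestamp: str
    points: list[tuple[int, int]]


def parse_landmark_line(line: str) -> LandmarkRecord:
    """Parse 'A_G_E_DDDDDDDDTTTTTTTTT.jpg X1 Y1 ... X68 Y68' into its parts."""
    filename, *coords = line.split()
    age, gender, race, timestamp = filename.split(".")[0].split("_")
    values = [int(c) for c in coords]
    if len(values) != 136:
        raise ValueError(f"expected 136 coordinates, found {len(values)}")
    # Pair up consecutive values as (x, y)
    points = list(zip(values[0::2], values[1::2]))
    return LandmarkRecord(filename, int(age), GENDERS[gender], RACES[race],
                          timestamp, points)


# Synthetic example: the filename is real, the coordinates are made up
sample = "1_0_2_20161219140530307.jpg " + " ".join(str(i) for i in range(136))
record = parse_landmark_line(sample)
print(record.age, record.gender, record.race, len(record.points))
```

A line with a malformed filename (such as the ones found during validation further down) would raise a `ValueError` or `KeyError` here, which is one way of spotting it early.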

Now let us download the "Aligned & Cropped Faces" images from UTKFace's [Google Drive](https://drive.google.com/drive/folders/0BxYys69jI14kU0I1YUQyY1ZDRUE?resourcekey=0-01Pth1hq20K4kuGVkp3oBw) (you want `UTKFace.tar.gz`, which sits at about 102MB). Copy the `UTKFace.tar.gz` file into the project directory and extract it; you should have a folder named `UTKFace` with 23,708 files. It is a good sign that the number of image files matches the number of lines in `landmark_list_concatenated.txt`, but let us double-check that every image in the folder is listed in the landmark file and vice versa. Note that the images are listed in the landmark file with the `.jpg` extension, whereas the actual image files in the `UTKFace` directory use the `.jpg.chip.jpg` extension. The following script accounts for that when validating the dataset:


In [2]:
from os import listdir, path

directory = 'UTKFace'  # directory with images
landmarks_file = 'landmark_list_concatenated.txt'  # list of images with landmarks
probs = 0  # issues found
matches = 0  # matching files found

with open(landmarks_file, 'r') as file:
    lines = file.readlines()
images = [line.split()[0] + ".chip.jpg" for line in lines]
image_set = set()
for image in images:
    file = path.join(directory, image)
    if path.exists(file):
        image_set.add(image)
    else:
        print("Missing image:", image)
        probs += 1

# iterate over the files in the image directory
for filename in listdir(directory):
    if filename not in image_set:
        print("Unlisted image:", filename)
        probs += 1
    else:
        splitname: list[str] = filename.split(".")[0].split("_")
        if len(splitname) != 4 or int(splitname[0]) < 1 or int(splitname[0]) > 200 or int(splitname[1]) > 1 or int(splitname[2]) > 4 or not splitname[3].isdigit():
            print("Misnamed image:", filename)
            probs += 1
        else:
            matches += 1
print("\nIssues found:", probs)
print("Matching images:", matches)

Missing image: 61_3_20170109150557335.jpg.chip.jpg
Missing image: 53__0_20170116184028385.jpg.chip.jpg
Missing image: 24_0_1_20170116220224657.chip.jpg
Missing image: 44_1_4_20170116235150272.pg.chip.jpg
Unlisted image: 24_0_1_20170116220224657 .jpg.chip.jpg
Misnamed image: 39_1_20170116174525125.jpg.chip.jpg
Unlisted image: 61_1_20170109150557335.jpg.chip.jpg
Misnamed image: 55_0_0_20170116232725357jpg.chip.jpg
Misnamed image: 61_1_20170109142408075.jpg.chip.jpg
Unlisted image: 53_1_0_20170116184028385.jpg.chip.jpg
Unlisted image: 44_1_4_20170116235150272.jpg.chip.jpg

Issues found: 11
Matching images: 23701


We can see that most of the reported images are missing a metadata field or have a minor typo in their name, such as a missing period or character or an extraneous space. When the name in the landmark file and the name on disk disagree, the image is reported twice: once as a "missing image" and once as an "unlisted image". The following script will apply the corrections to the landmark file, yielding a new file named `landmark_list_corrected.txt`.

In [3]:
if path.exists('landmark_list_concatenated.txt'):

    with open('landmark_list_concatenated.txt', 'r') as f:
        lines = f.readlines()

    # Correcting specific lines (zero-based indices) of the landmark file so they match the actual image names
    lines[8512] = lines[8512].replace("61_1_20170109142408075.jpg", "61_1_1_20170109142408075.jpg")  # Missing gender
    lines[8513] = lines[8513].replace("61_3_20170109150557335.jpg", "61_1_3_20170109150557335.jpg")  # Missing gender
    lines[13951] = lines[13951].replace("53__0_20170116184028385.jpg", "53_1_0_20170116184028385.jpg")  # Missing gender
    lines[20080] = lines[20080].replace("39_1_20170116174525125.jpg", "39_1_1_20170116174525125.jpg")  # Missing gender
    lines[20585] = lines[20585].replace("55_0_0_20170116232725357jpg", "55_0_0_20170116232725357.jpg")  # Missing period
    lines[20621] = lines[20621].replace("24_0_1_20170116220224657 .jpg", "24_0_1_20170116220224657.jpg")  # Space in name
    lines[20647] = lines[20647].replace("44_1_4_20170116235150272.pg", "44_1_4_20170116235150272.jpg")  # Wrong extension

    with open('landmark_list_corrected.txt', "w") as f:
        f.writelines(lines)

    print("Corrections successfully applied to 'landmark_list_corrected.txt'")

else:

    print("ERROR: Could not find 'landmark_list_concatenated.txt'")


Corrections successfully applied to 'landmark_list_corrected.txt'
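
The missing/unlisted pairs reported by the validation script differ by only a character or two, so candidate corrections like the ones hard-coded above could also be suggested automatically. The following is a minimal sketch using `get_close_matches` from the standard library's `difflib`; the `suggest_corrections` helper and the 0.9 cutoff are assumptions made for illustration, and any suggestion should still be reviewed by hand.

```python
from difflib import get_close_matches


def suggest_corrections(missing: list[str], unlisted: list[str],
                        cutoff: float = 0.9) -> dict[str, str]:
    """Map each missing name to its most similar unlisted name, if close enough."""
    suggestions = {}
    for name in missing:
        match = get_close_matches(name, unlisted, n=1, cutoff=cutoff)
        if match:
            suggestions[name] = match[0]
    return suggestions


# Two of the pairs reported by the validation script
missing = ["53__0_20170116184028385.jpg.chip.jpg",
           "44_1_4_20170116235150272.pg.chip.jpg"]
unlisted = ["53_1_0_20170116184028385.jpg.chip.jpg",
            "44_1_4_20170116235150272.jpg.chip.jpg"]

for wrong, right in suggest_corrections(missing, unlisted).items():
    print(f"{wrong} -> {right}")
```

This only pairs names up; it cannot tell which side (the landmark file or the image filename) holds the correct metadata.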


However, we still need to correct 5 issues in the image filenames. If you are using a Unix-compatible system (such as Linux, macOS or Windows with WSL), you can run the following shell script:

```
#!/usr/bin/env sh

UTKZIP='UTKFace.tar.gz'
UTKDIR='UTKFace'

# Unpack the archive if the image directory does not exist yet
if [ ! -d "$UTKDIR" ]; then
    if [ -f "$UTKZIP" ]; then
        echo "Unzipping $UTKZIP to $UTKDIR..."
        tar -xvf "$UTKZIP" 2>/dev/null
        echo "Done. Feel free to delete $UTKZIP."
    else
        echo "$UTKDIR directory not found. Please download $UTKZIP from https://gotil.la/3ziDcCX"
    fi
fi

# Rename a misnamed image if it is present. Usage: fix_name <old> <new>
cnt=0
fix_name() {
    if [ -f "$UTKDIR/$1" ]; then
        cnt=$((cnt+1))
        # printf is used because the handling of "\t" by echo varies between shells
        printf 'Renaming %s\tto %s\n' "$1" "$2"
        mv "$UTKDIR/$1" "$UTKDIR/$2"
    fi
}

if [ -d "$UTKDIR" ]; then
    echo "$UTKDIR directory found."
    fix_name "24_0_1_20170116220224657 .jpg.chip.jpg" "24_0_1_20170116220224657.jpg.chip.jpg"
    fix_name "39_1_20170116174525125.jpg.chip.jpg" "39_1_1_20170116174525125.jpg.chip.jpg"
    fix_name "61_1_20170109150557335.jpg.chip.jpg" "61_1_3_20170109150557335.jpg.chip.jpg"
    fix_name "55_0_0_20170116232725357jpg.chip.jpg" "55_0_0_20170116232725357.jpg.chip.jpg"
    fix_name "61_1_20170109142408075.jpg.chip.jpg" "61_1_1_20170109142408075.jpg.chip.jpg"
    if [ "$cnt" -eq 0 ]; then
        echo "Nothing to do; all files correctly named."
    fi
else
    echo "$UTKDIR directory not found."
fi
```

Alternatively, you can apply the following corrections by hand:

* `24_0_1_20170116220224657 .jpg.chip.jpg` -> `24_0_1_20170116220224657.jpg.chip.jpg` _(Space in filename)_
* `55_0_0_20170116232725357jpg.chip.jpg` -> `55_0_0_20170116232725357.jpg.chip.jpg` _(Missing period)_
* `61_1_20170109142408075.jpg.chip.jpg` -> `61_1_1_20170109142408075.jpg.chip.jpg` _(Missing gender)_
* `61_1_20170109150557335.jpg.chip.jpg` -> `61_1_3_20170109150557335.jpg.chip.jpg` _(Missing race)_
* `39_1_20170116174525125.jpg.chip.jpg` -> `39_1_1_20170116174525125.jpg.chip.jpg` _(Missing gender)_
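
On systems without a POSIX shell, the same five renames could also be scripted in Python. This is a minimal sketch, assuming the images sit in a `UTKFace` folder in the current directory; the `FIXES` mapping is copied from the list above and `rename_misnamed` is a hypothetical helper name.

```python
from pathlib import Path

# Old name -> corrected name, copied from the list of manual corrections
FIXES = {
    "24_0_1_20170116220224657 .jpg.chip.jpg": "24_0_1_20170116220224657.jpg.chip.jpg",
    "55_0_0_20170116232725357jpg.chip.jpg": "55_0_0_20170116232725357.jpg.chip.jpg",
    "61_1_20170109142408075.jpg.chip.jpg": "61_1_1_20170109142408075.jpg.chip.jpg",
    "61_1_20170109150557335.jpg.chip.jpg": "61_1_3_20170109150557335.jpg.chip.jpg",
    "39_1_20170116174525125.jpg.chip.jpg": "39_1_1_20170116174525125.jpg.chip.jpg",
}


def rename_misnamed(directory: str, fixes: dict[str, str] = FIXES) -> int:
    """Rename any misnamed images found in `directory`; return how many were renamed."""
    root = Path(directory)
    renamed = 0
    for old, new in fixes.items():
        source = root / old
        if source.is_file():
            source.rename(root / new)
            renamed += 1
    return renamed


# Only touches the dataset if the folder is actually present
if Path("UTKFace").is_dir():
    print("Files renamed:", rename_misnamed("UTKFace"))
```

Running it a second time renames nothing, since the old names no longer exist.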

Once that is done, feel free to re-run the dataset validation script a few cells above to check that there are no more mismatches.

### Analysing and Filtering the Data

Now let's look at the data we have; the following table shows the breakdown of the dataset by gender, age and race (we will not use the datetime metadata field).

#### Original UTKFace Dataset

| **Gender/Age** | **Asian** | **Black** | **Indian** | **Other** |  **White** |  **Total** |
|:---------------|----------:|----------:|-----------:|----------:|-----------:|-----------:|
| **Female**     | **1,859** | **2,210** |  **1,715** |   **932** |  **4,601** | **11,317** |
| 1 - 14         |       461 |       110 |        297 |       280 |        783 |      1,931 |
| 15 - 24        |       414 |       346 |        383 |       277 |        667 |      2,087 |
| 25 - 44        |       872 |     1,473 |        847 |       346 |      1,655 |      5,193 |
| 45 - 64        |        39 |       205 |        146 |        22 |        798 |      1,210 |
| 65 and over    |        73 |        76 |         42 |         7 |        698 |        896 |
| **Male**       | **1,575** | **2,318** |  **2,261** |   **760** |  **5,477** | **12,391** |
| 1 - 14         |       511 |        87 |        246 |       163 |        713 |      1,720 |
| 15 - 24        |       137 |       246 |        175 |       121 |        486 |      1,165 |
| 25 - 44        |       601 |     1,426 |      1,102 |       374 |      2,056 |      5,559 |
| 45 - 64        |       179 |       400 |        649 |        97 |      1,560 |      2,885 |
| 65 and over    |       147 |       159 |         89 |         5 |        662 |      1,062 |
| **Total**      | **3,434** | **4,528** |  **3,976** | **1,692** | **10,078** | **23,708** |
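
A breakdown like the one in the table above can be tallied directly from the metadata encoded in each filename. The sketch below shows one way of doing so with `collections.Counter`; the `age_band` and `tally` helpers are hypothetical names, the age bands are the ones used in the table, and the sample lines are made up (real lines carry 136 coordinates).

```python
from collections import Counter

GENDERS = {"0": "male", "1": "female"}
RACES = {"0": "white", "1": "black", "2": "asian", "3": "indian", "4": "other"}
AGE_BANDS = [(1, 14), (15, 24), (25, 44), (45, 64)]


def age_band(age: int) -> str:
    """Return the label of the age band that contains `age`."""
    if age >= 65:
        return "65 and over"
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low} - {high}"
    raise ValueError(f"unexpected age: {age}")


def tally(lines: list[str]) -> Counter:
    """Count images per (gender, age band, race) from landmark-file lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        age, gender, race = line.split()[0].split(".")[0].split("_")[:3]
        counts[(GENDERS[gender], age_band(int(age)), RACES[race])] += 1
    return counts


# Made-up lines for illustration only
sample = [
    "30_1_0_20170101000000000.jpg 0 0",
    "31_1_0_20170101000000001.jpg 0 0",
    "70_0_2_20170101000000002.jpg 0 0",
]
for key, count in sorted(tally(sample).items()):
    print(key, count)
```

To reproduce the table, the lines of `landmark_list_corrected.txt` would be passed to `tally` instead of the sample list.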

After reviewing some sample images and discussing them within the team, we decided to discard images of children (age < 15) and older adults (age ≥ 65), as well as those where the race was classified as "Other", for consistency. This reduced the dataset to the following:

#### Filtered UTKFace Dataset

| **Gender/Age** | **Asian** | **Black** | **Indian** | **White** |  **Total** |
|:---------------|----------:|----------:|-----------:|----------:|-----------:|
| **Female**     | **1,325** | **2,024** |  **1,376** | **3,120** |  **7,845** |
| 15 - 24        |       414 |       346 |        383 |       667 |      1,810 |
| 25 - 44        |       872 |     1,473 |        847 |     1,655 |      4,847 |
| 45 - 64        |        39 |       205 |        146 |       798 |      1,188 |
| **Male**       |   **917** | **2,072** |  **1,926** | **4,102** |  **9,017** |
| 15 - 24        |       137 |       246 |        175 |       486 |      1,044 |
| 25 - 44        |       601 |     1,426 |      1,102 |     2,056 |      5,185 |
| 45 - 64        |       179 |       400 |        649 |     1,560 |      2,788 |
| **Total**      | **2,242** | **4,096** |  **3,302** | **7,222** | **16,862** |

We can use the `preselect_landmarks` function (below) to filter out images where the subject's age is under 15 or above 64, or where their race is classified as 'other'. The script will produce a file called `ll_initial.txt` in the current directory listing the 16,862 images that fit those criteria. (Note that we specify the seed value through the `randomseed` parameter to enable reproducibility.)

In [4]:
from sys import exit
from os.path import exists

def preselect_landmarks(landmarks_file: str, age=None, gender=None, race=None,
                        *, log: bool = False, randomise: bool = False,
                        randomseed=None, target=None, filename=None) -> None:
    """ Preselect Landmarks

    Iterates through an original file listing images and associated landmarks
    applying the given filters and generates another file with the subset of
    images that passed *all* filters.

    :param landmarks_file: name of original landmark file (str, Required)
    :param log: Whether to log why each image was discarded and to print a
        summary message with the number of images filtered (Default: False)
    :param target: (maximum) number of images to select based on the provided
        filters. If defined, the function may output two files with suffixes:
        • "_filtered": list of images which meet all filter criteria;
        • "_remainder": list of images which do not meet filter criteria or
            exceed target number;
    :param randomise: shuffle landmarks before preselection? (Default: False)
    :param randomseed: seed value (int) when randomising (Default: None)
    :param age: tuple containing min (int) and max (int) values (Default: None)
    :param gender: 'male' or 'female' (str, Default: None)
    :param race: either a str with a single race or a list for multiple races,
        values 'white', 'black', 'asian', 'indian' and 'other' (Default: None)
    :param filename: filename for target files
    :return: nothing, may print Errors, Warnings and log messages to stdout
    """
    with open(landmarks_file, 'r') as file:
        landmarks = file.readlines()

    # Initialise maps
    genders: dict[str, str] = {
        '0': "male",
        '1': "female"
    }
    races: dict[str, str] = {
        '0': "white",
        '1': "black",
        '2': "asian",
        '3': "indian",
        '4': "other"
    }

    # Read parameters for valid filters and capture those for new filename
    filters = []
    if isinstance(age, tuple) and age[0] <= age[1]:
        filters.append(str(age[0]) + "-" + str(age[1]))
    if isinstance(gender, str) and gender in genders.values():
        filters.append(gender)
    if isinstance(race, str):
        race = [race]
    if isinstance(race, list) and all(r in races.values() for r in race):
        filters.append("-".join(race))

    # Abort if no valid filters were found or invalid target
    if len(filters) == 0:
        print("Error: No valid filters to apply.")
        exit(1)
    if target is not None and (not isinstance(target, int) or target < 1):
        print("Error: Target needs to be greater than zero.")
        exit(1)

    # Abort if files already exist with target name (avoid overwriting).
    if filename:
        filtered_landmarks_file = filename
    else:
        filtered_landmarks_file = landmarks_file.split(".")[0]
        filtered_landmarks_file += "_" + "_".join(filters)
    filtered_landmarks_file += "_filtered" if target else ""
    filtered_landmarks_file += ".txt"
    if exists(filtered_landmarks_file):
        print(f"Error: File '{filtered_landmarks_file}' already exists in "
              f"current directory; delete or rename and run script again.")
        exit(1)
    if filename:
        remainder_landmarks_file = filename
    else:
        remainder_landmarks_file = landmarks_file.split(".")[0]
        remainder_landmarks_file += "_" + "_".join(filters)
    remainder_landmarks_file += "_remainder.txt"
    if target and exists(remainder_landmarks_file):
        print(
            f"Error: File '{remainder_landmarks_file}' already exists in "
            f"current directory; delete or rename and run script again.")
        exit(1)

    if randomise:
        if randomseed is not None:
            seed(randomseed)
        shuffle(landmarks)

    filtered_landmarks: list[str] = []
    remainder_landmarks: list[str] = []
    for line in landmarks:
        # Iterate over all lines in landmark file and apply filters
        keep = True

        # Retrieve metadata from filename
        imagename = line.split()[0]
        splitname: list[str] = imagename.split(".")[0].split("_")
        # 0 is presumed if no (numeric) age is provided in the metadata, so
        # such images will fail any minimum age filter greater than 0.
        line_age = int(splitname[0]) if splitname[0].isdigit() else 0
        # An image with no gender metadata will fail all gender filters
        if len(splitname) > 1 and splitname[1] in genders:
            line_gender = genders[splitname[1]]
        else:
            line_gender = ""
        # An image without race metadata will fail any race filters
        if len(splitname) > 2 and splitname[2] in races:
            line_race = races[splitname[2]]
        else:
            line_race = ""

        # Check if a given line passes *all* filters
        if isinstance(age, tuple) and (line_age < age[0] or line_age > age[1]):
            if log:
                print(f"Image {imagename} skipped due to age ({line_age}).", )
            keep = False
        if isinstance(gender, str) and line_gender != gender:
            if log:
                print(f"Image {imagename} skipped due to gender ({line_gender}).", )
            keep = False
        if isinstance(race, list) and line_race not in race:
            if log:
                print(f"Image {imagename} skipped due to race ({line_race}).", )
            keep = False

        if keep and (target is None or target > len(filtered_landmarks)):
            filtered_landmarks.append(line)
            if log:
                print(f"Image {imagename} added to filtered list.")
        else:
            remainder_landmarks.append(line)
            if log and target:
                print(f"Image {imagename} added to remainder list.")

    if len(filtered_landmarks) == 0:
        print("Warning: No images passed all filters.")
        exit(1)

    with open(filtered_landmarks_file, 'w') as file:
        file.writelines(filtered_landmarks)
    if log:
        print(len(filtered_landmarks),
              "filtered images saved to file",
              filtered_landmarks_file)

    if target and len(remainder_landmarks) != 0:
        with open(remainder_landmarks_file, 'w') as file:
            file.writelines(remainder_landmarks)
        if log:
            print(len(remainder_landmarks),
                  "remainder images saved to file",
                  remainder_landmarks_file)


# Selecting images for people of known races and with ages between 15-64.
preselect_landmarks('landmark_list_corrected.txt', age=(15, 64),
                    race=['asian', 'black', 'indian', 'white'], filename='ll_initial',
                    randomise=True, randomseed=680780122122)

### Splitting the data into Training, Validation and Test datasets

The plan is to train our models on 3 datasets with differing ratios of female/male images and then test their performance against female-only, male-only and evenly-split datasets, as shown below:

![](Data_Preparation_Plan.png)
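
As a rough sketch of the arithmetic behind this plan, the snippet below derives the size of each subset from the figures quoted in the steps that follow (7,500 images per gender, 10% held out for testing, 10% for validation, and three equal cohorts). The `plan_split` helper and its parameter names are hypothetical and serve only to illustrate the calculation.

```python
def plan_split(per_gender: int = 7500, test_percent: int = 10,
               validation_percent: int = 10, cohorts: int = 3) -> dict[str, int]:
    """Derive per-gender subset sizes for the test/validation/training split."""
    test = per_gender * test_percent // 100
    validation = per_gender * validation_percent // 100
    training = per_gender - test - validation
    return {
        "test_per_gender": test,
        "validation_per_gender": validation,
        "training_per_gender": training,
        "training_cohort": training // cohorts,
        "validation_cohort": validation // cohorts,
    }


sizes = plan_split()
# Every training/validation file is built from 4 cohorts in total:
# 1 + 3, 2 + 2 or 3 + 1 cohorts of female + male images.
training_file_size = 4 * sizes["training_cohort"]
validation_file_size = 4 * sizes["validation_cohort"]
print(sizes)
print("Images per training file:", training_file_size)
print("Images per validation file:", validation_file_size)
```

Because every file combines four cohorts, all three training files end up the same size, so only the gender ratio differs between them.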

First, let's run `preselect_landmarks` two more times to produce a file with 7,500 randomly-ordered males and another with 7,500 randomly-ordered females.

In [5]:
# Selecting 7,500 male images from the initial dataset
preselect_landmarks('ll_initial.txt', gender="male", target=7500, randomise=True, randomseed=680780122122)

# Selecting 7,500 female images from the initial dataset
preselect_landmarks('ll_initial_male_remainder.txt', gender="female", target=7500, randomise=True, randomseed=680780122122)

To make things easier to follow, let's import the `os` module and use it to rename the files to `ll_males.txt` and `ll_females.txt`, respectively. We can also delete the intermediary/additional files generated so far (and will do so at each step from now on).

In [6]:
import os
# Renaming the output files for clarity
os.rename('ll_initial_male_filtered.txt', 'll_males.txt')
os.rename('ll_initial_male_remainder_female_filtered.txt', 'll_females.txt')

# Deleting intermediary files
os.remove('landmark_list_concatenated.txt')
os.remove('landmark_list_corrected.txt')
os.remove('ll_initial.txt')
os.remove('ll_initial_male_remainder.txt')
os.remove('ll_initial_male_remainder_female_remainder.txt')

Now we can extract 750 lines (10%) from each of `ll_males.txt` and `ll_females.txt` and concatenate them into a single balanced test file called `ll_test_50-50_split.txt` with 1,500 images. We will keep all three files (male-only, female-only and balanced) for our testing phase.

In [7]:
# Splitting the male data set into test (750 images) and training/validation (6,750 images)
preselect_landmarks('ll_males.txt', filename="ll_males_test", gender="male", target=750, randomise=True, randomseed=680780122122)

# Splitting the female data set into test (750 images) and training/validation (6,750 images)
preselect_landmarks('ll_females.txt', filename="ll_females_test", gender="female", target=750, randomise=True, randomseed=680780122122)

# Renaming the output files for clarity
os.rename('ll_males_test_filtered.txt', 'll_test_males.txt')
os.rename('ll_females_test_filtered.txt', 'll_test_females.txt')

# Concatenating the male/female test images into a single dataset
concatenate_files("ll_test_50-50_split.txt", files=['ll_test_males.txt', 'll_test_females.txt'], delete=False)

# Deleting intermediary files
os.remove('ll_males.txt')
os.remove('ll_females.txt')

Input files successfully concatenated as ll_test_50-50_split.txt


We split the remaining male/female data into Training (80%: 6,000 images per gender) and Validation (10%: 750 images per gender) sets, but we won't concatenate them just yet.

In [8]:
# Splitting the male data set into training (6,000 images) and validation (750 images)
preselect_landmarks('ll_males_test_remainder.txt', filename="ll_males_validation", gender="male", target=750)

# Renaming the output files for clarity
os.rename('ll_males_validation_filtered.txt', 'll_males_validation.txt')
os.rename('ll_males_validation_remainder.txt', 'll_males_training.txt')

# Deleting intermediary files
os.remove('ll_males_test_remainder.txt')

# Splitting the female data set into training (6,000 images) and validation (750 images)
preselect_landmarks('ll_females_test_remainder.txt', filename="ll_females_validation", gender="female", target=750)

# Renaming the output files for clarity
os.rename('ll_females_validation_filtered.txt', 'll_females_validation.txt')
os.rename('ll_females_validation_remainder.txt', 'll_females_training.txt')

# Deleting intermediary files
os.remove('ll_females_test_remainder.txt')

Now, we need to split each of the male and female **training** sets into thirds (cohorts of 2,000 images) so that we can compose our 3 separate training files (25-75 split, 50-50 split, 75-25 split).

In [9]:
# Splitting the male training dataset into 3 files with 2,000 images each
preselect_landmarks('ll_males_training.txt', filename="ll_males_training_1", gender="male", target=2000)
preselect_landmarks('ll_males_training_1_remainder.txt', filename="ll_males_training_2", gender="male", target=2000)

# Renaming the output files for clarity
os.rename('ll_males_training_1_filtered.txt', 'll_males_training_cohort_1.txt')
os.rename('ll_males_training_2_filtered.txt', 'll_males_training_cohort_2.txt')
os.rename('ll_males_training_2_remainder.txt', 'll_males_training_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_males_training.txt')
os.remove('ll_males_training_1_remainder.txt')

# Splitting the female training dataset into 3 files with 2,000 images each
preselect_landmarks('ll_females_training.txt', filename="ll_females_training_1", gender="female", target=2000)
preselect_landmarks('ll_females_training_1_remainder.txt', filename="ll_females_training_2", gender="female", target=2000)

# Renaming the output files for clarity
os.rename('ll_females_training_1_filtered.txt', 'll_females_training_cohort_1.txt')
os.rename('ll_females_training_2_filtered.txt', 'll_females_training_cohort_2.txt')
os.rename('ll_females_training_2_remainder.txt', 'll_females_training_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_females_training.txt')
os.remove('ll_females_training_1_remainder.txt')

# 25:75 Training file: 1 female cohort + 3 male cohorts
concatenate_files('ll_training_25-75_split.txt', files=['ll_females_training_cohort_3.txt', 'll_males_training_cohort_1.txt', 'll_males_training_cohort_2.txt', 'll_males_training_cohort_3.txt'])

# 50:50 Training file: 2 female cohorts + 2 male cohorts
concatenate_files('ll_training_50-50_split.txt', files=['ll_females_training_cohort_1.txt', 'll_females_training_cohort_2.txt', 'll_males_training_cohort_1.txt', 'll_males_training_cohort_2.txt'])

# 75:25 Training file: 3 female cohorts + 1 male cohort
concatenate_files('ll_training_75-25_split.txt', files=['ll_females_training_cohort_1.txt', 'll_females_training_cohort_2.txt', 'll_females_training_cohort_3.txt','ll_males_training_cohort_3.txt'])

# Deleting intermediary files
os.remove('ll_females_training_cohort_1.txt')
os.remove('ll_females_training_cohort_2.txt')
os.remove('ll_females_training_cohort_3.txt')
os.remove('ll_males_training_cohort_1.txt')
os.remove('ll_males_training_cohort_2.txt')
os.remove('ll_males_training_cohort_3.txt')

Input files successfully concatenated as ll_training_25-75_split.txt
Input files successfully concatenated as ll_training_50-50_split.txt
Input files successfully concatenated as ll_training_75-25_split.txt


We do the same with the male/female **validation** sets:

In [10]:
# Splitting the male validation dataset into 3 files with 250 images each
preselect_landmarks('ll_males_validation.txt', filename="ll_males_validation_1", gender="male", target=250)
preselect_landmarks('ll_males_validation_1_remainder.txt', filename="ll_males_validation_2", gender="male", target=250)

# Renaming the output files for clarity
os.rename('ll_males_validation_1_filtered.txt', 'll_males_validation_cohort_1.txt')
os.rename('ll_males_validation_2_filtered.txt', 'll_males_validation_cohort_2.txt')
os.rename('ll_males_validation_2_remainder.txt', 'll_males_validation_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_males_validation.txt')
os.remove('ll_males_validation_1_remainder.txt')

# Splitting the female validation dataset into 3 files with 250 images each
preselect_landmarks('ll_females_validation.txt', filename="ll_females_validation_1", gender="female", target=250)
preselect_landmarks('ll_females_validation_1_remainder.txt', filename="ll_females_validation_2", gender="female", target=250)

# Renaming the output files for clarity
os.rename('ll_females_validation_1_filtered.txt', 'll_females_validation_cohort_1.txt')
os.rename('ll_females_validation_2_filtered.txt', 'll_females_validation_cohort_2.txt')
os.rename('ll_females_validation_2_remainder.txt', 'll_females_validation_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_females_validation.txt')
os.remove('ll_females_validation_1_remainder.txt')

# 25:75 Validation file: 1 female cohort + 3 male cohorts
concatenate_files('ll_validation_25-75_split.txt', files=['ll_females_validation_cohort_3.txt', 'll_males_validation_cohort_1.txt', 'll_males_validation_cohort_2.txt', 'll_males_validation_cohort_3.txt'])

# 50:50 Validation file: 2 female cohorts + 2 male cohorts
concatenate_files('ll_validation_50-50_split.txt', files=['ll_females_validation_cohort_1.txt', 'll_females_validation_cohort_2.txt', 'll_males_validation_cohort_1.txt', 'll_males_validation_cohort_2.txt'])

# 75:25 Validation file: 3 female cohorts + 1 male cohort
concatenate_files('ll_validation_75-25_split.txt', files=['ll_females_validation_cohort_1.txt', 'll_females_validation_cohort_2.txt', 'll_females_validation_cohort_3.txt','ll_males_validation_cohort_3.txt'])

# Deleting intermediary files
os.remove('ll_females_validation_cohort_1.txt')
os.remove('ll_females_validation_cohort_2.txt')
os.remove('ll_females_validation_cohort_3.txt')
os.remove('ll_males_validation_cohort_1.txt')
os.remove('ll_males_validation_cohort_2.txt')
os.remove('ll_males_validation_cohort_3.txt')

Input files successfully concatenated as ll_validation_25-75_split.txt
Input files successfully concatenated as ll_validation_50-50_split.txt
Input files successfully concatenated as ll_validation_75-25_split.txt
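
Since the test images were set aside before the training and validation files were built, no image should appear in both. A minimal sketch of how that could be checked is shown below; `image_names`, `shared_images` and `read_lines` are hypothetical helpers, and the first example uses made-up lines.

```python
from os.path import exists


def image_names(lines: list[str]) -> set[str]:
    """Collect the image filename (first token) of every non-empty line."""
    return {line.split()[0] for line in lines if line.strip()}


def shared_images(first: list[str], second: list[str]) -> set[str]:
    """Return the image names that appear in both lists of lines."""
    return image_names(first) & image_names(second)


def read_lines(name: str) -> list[str]:
    """Read all lines of a landmark file."""
    with open(name, "r") as f:
        return f.readlines()


# Illustration with made-up names: one image is shared
train = ["20_0_0_a.jpg 1 2\n", "21_1_1_b.jpg 3 4\n"]
test = ["21_1_1_b.jpg 3 4\n", "22_0_2_c.jpg 5 6\n"]
print(shared_images(train, test))

# On the real files (only runs if they are present in the current directory)
test_file = "ll_test_50-50_split.txt"
for split in ("25-75", "50-50", "75-25"):
    for stage in ("training", "validation"):
        name = f"ll_{stage}_{split}_split.txt"
        if exists(name) and exists(test_file):
            overlap = shared_images(read_lines(name), read_lines(test_file))
            print(name, "shares", len(overlap), "images with the test set")
```

Note that the three training files are expected to overlap with each other, since they are composed from the same cohorts; only overlap with the test set would indicate leakage.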


We now have the following files ready for training, validation and testing:

#### Training
* 25% Female + 75% Male: `ll_training_25-75_split.txt` (8,000 lines)
* 50% Female + 50% Male: `ll_training_50-50_split.txt` (8,000 lines)
* 75% Female + 25% Male: `ll_training_75-25_split.txt` (8,000 lines)

#### Validation
* 25% Female + 75% Male: `ll_validation_25-75_split.txt` (1,000 lines)
* 50% Female + 50% Male: `ll_validation_50-50_split.txt` (1,000 lines)
* 75% Female + 25% Male: `ll_validation_75-25_split.txt` (1,000 lines)

#### Testing
* 100% Female: `ll_test_females.txt` (750 lines)
* 100% Male: `ll_test_males.txt` (750 lines)
* 50% Female + 50% Male: `ll_test_50-50_split.txt` (1,500 lines)
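
As a sanity check, the gender ratio of each file can be recomputed from the filenames it lists. The following sketch assumes the files are in the current directory and skips any that are not; `female_share` and the `EXPECTED` mapping are hypothetical names, with the expected shares taken from the list above.

```python
from os.path import exists


def female_share(lines: list[str]) -> float:
    """Fraction of lines whose gender field (second field of the filename) is 1."""
    genders = [line.split()[0].split("_")[1] for line in lines if line.strip()]
    if not genders:
        raise ValueError("no images to count")
    return genders.count("1") / len(genders)


# Made-up lines for illustration: 1 female and 3 male images
sample = ["30_1_0_x.jpg", "31_0_0_y.jpg", "32_0_1_z.jpg", "33_0_2_w.jpg"]
print(female_share(sample))

# Expected share of female images per file, as listed above
EXPECTED = {
    "ll_training_25-75_split.txt": 0.25,
    "ll_training_50-50_split.txt": 0.50,
    "ll_training_75-25_split.txt": 0.75,
    "ll_validation_25-75_split.txt": 0.25,
    "ll_validation_50-50_split.txt": 0.50,
    "ll_validation_75-25_split.txt": 0.75,
    "ll_test_females.txt": 1.00,
    "ll_test_males.txt": 0.00,
    "ll_test_50-50_split.txt": 0.50,
}
for name, share in EXPECTED.items():
    if exists(name):
        with open(name, "r") as f:
            actual = female_share(f.readlines())
        print(f"{name}: {actual:.0%} female (expected {share:.0%})")
```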

## Models

