# The Effect of Biased Gender Data on Face Recognition
* **Course:** Neural Networks and Deep Learning (COMP9444) 22T3
* **Mentor:** Kunzie Xie
* **Team:** Dr. Teeth and the Electric Mayhem
* **Members:**
    * _Daniel Gotilla_ (z5343046@student.unsw.edu.au)
    * _John Conlon_ (z5257381@student.unsw.edu.au)
    * _John Pham_ (z3216645@student.unsw.edu.au)
    * _Marco Seidenberg_ (z5264260@student.unsw.edu.au)
    * _Oscar Feng_ (z5396050@student.unsw.edu.au)

## Problem Statement



## Data Preparation

### Downloading, Concatenating, Validating and Correcting the Original Data

We chose to leverage the [UTKFace](https://github.com/aicip/UTKFace) dataset, built by Yang Song and Zhifei Zhang at the [Advanced Imaging and Collaborative Information Processing (AICIP) Lab](https://aicip.github.io/) at [The University of Tennessee, Knoxville (UTK)](http://www.utk.edu/). This dataset contains over 28k face images with annotations for age, gender, and ethnicity and facial landmark coordinates and is available for non-commercial, research purposes.

The dataset is provided in separate files (through links to Google Drive):
1. **[In-the-wild Faces (1.3GB)](https://drive.google.com/open?id=0BxYys69jI14kSVdWWllDMWhnN2c)** — original images, containing one or more faces. _Not used in our study._
2. **[Aligned & Cropped Faces (107MB)](https://drive.google.com/drive/folders/0BxYys69jI14kU0I1YUQyY1ZDRUE?usp=sharing)** — above images, cropped tightly around the faces of each subject; filenames contain the age, gender, race and datetime metadata.
3. **[Landmarks (68 points, 12MB)](https://drive.google.com/open?id=0BxYys69jI14kS1lmbW1jbkFHaW8)** — 3 separate text files listing all the aligned/cropped faces (in 2, above) and their associated facial landmarks.

Firstly, we need to download the 3 landmark list files from the [UTKFace Google Drive](https://drive.google.com/open?id=0BxYys69jI14kS1lmbW1jbkFHaW8) and copy them to the project directory (without renaming); you will have three files named:
* `landmark_list_part1.txt` (9,780 lines)
* `landmark_list_part2.txt` (10,719 lines)
* `landmark_list_part3.txt` (3,209 lines)

We can use the `concatenate_files` function (below) to join the 3 landmark list files into a single file called `landmark_list_concatenated.txt`, deleting the 3 original files in the process. (This function will be used extensively below to consolidate the training, validation and testing datasets.)

In [5]:
from os import remove
from random import shuffle, seed

def concatenate_files(target: str, files: list, *, delete: bool = False, randomise: bool = False,
                      randomseed=None):
    """
    Concatenates multiple files, producing a target file and (optionally) randomising lines and/or deleting the input files.

    :param target: Name for new (output) file concatenating the input files.
    :param files: List of files to concatenate.
    :param delete: Should the input files be deleted after concatenation?
    :param randomise: Should the lines of the input files be randomised?
    :param randomseed: If so, should we use a specific seed value?
    :return: No return value but a file with the target name will be created in the current directory.
    """
    lines = []

    for name in files:
        with open(name, 'r') as f:
            for line in f:
                lines.append(line)

    if randomise:
        if seed is not None:
            seed(randomseed)
        shuffle(lines)

    with open(target, "w") as new_file:
        for line in lines:
            new_file.write(line)

    if delete:
        for name in files:
            remove(name)

    print(f"Input files successfully concatenated as", target)

# Concatenating the 3 landmark files into a single one
ll_parts = ["landmark_list_part1.txt", "landmark_list_part2.txt", "landmark_list_part3.txt"]
concatenate_files("landmark_list_concatenated.txt", files=ll_parts, delete=True)

FileNotFoundError: [Errno 2] No such file or directory: './landmark_list_part1.txt'

The three partial files were substituted by a single, consolidated file named `landmark_list_concatenated.txt` with 23,708 lines, one for each cropped image in the dataset.

If you open and examine the file, you'll see something like this:
```
1_0_2_20161219140530307.jpg -4 71 -4 96 -3 120 -1 144 9 166 28 179 53 186 77 192 100 194 121 191 142 183 161 174 180 161 192 142 195 120 194 97 192 74 16 53 29 39 48 33 68 34 86 40 113 39 129 33 148 32 164 37 175 49 100 59 101 72 101 85 101 99 78 112 89 113 100 116 110 114 120 111 39 62 51 61 61 60 71 65 60 63 50 62 124 64 134 59 144 59 155 62 144 62 134 62 55 137 72 134 87 132 97 133 107 131 120 132 136 133 121 143 109 146 98 147 88 146 72 145 61 138 87 137 97 138 107 136 130 135 108 139 98 140 88 139
```

Each line follows the pattern `A_G_E_DDDDDDDDTTTTTTTTT.jpg X1 Y1 X2 Y2 … X68 Y68` where:
* `A` stands for the subject's age in years (1 to 3 digits);
* `G` stands for the subject's gender, where
    * 0 is male and
    * 1 is female;
* `E` stands for the subject's (perceived) ethnicity, where
    * 0 is "white",
    * 1 is "black",
    * 2 is "asian",
    * 3 is "indian",
    * 4 is "other";
* `DDDDDDDDTTTTTTTTT` stands for the date/time the image was added to the dataset;
* `X1 Y1 X2 Y2 … X68 Y68` stands for the 68 (x,y) coordinate pairs for the facial landamarks where
    * Pairs 1 through 17 map the contour of the subject's face, from ear to ear around chin;
    * Pairs 18 through 22 map the contour of the subject's left eyebrow;
    * Pairs 23 through 27 map the contour of the subject's right eyebrow;
    * Pairs 28 through 31 map the contour of the line on top of the subject's nose;
    * Pairs 32 through 36 map the contour of the bottom part of the subject's nose;
    * Pairs 37 through 42 map the contour of the subject's left eye;
    * Pairs 43 through 48 map the contour of the subject's right eye;
    * Pairs 49 through 60 map the outer contour of the subject's lips;
    * Pairs 61 through 68 map the inner contour of the subject's lips;

Now let us download the "Aligned & Cropped Faces" images from UTKFace's [Google Drive](https://drive.google.com/drive/folders/0BxYys69jI14kU0I1YUQyY1ZDRUE?resourcekey=0-01Pth1hq20K4kuGVkp3oBw)  (you want "UTKFace.tar.gz" which sits at about 102MB). Copy the _UTKFace.tar.gz_ file into the project directory and unzip it; you should have a folder named `UTKFace` with 23,708 files. It is a good sign that the number of image files matches the number of lines in `landmark_list_concatenated.txt`, but let us double-check if all the images in the folder are listed in the landmark file and vice-versa. Note that the images are listed in the landmark file with the `.jpg` extension, but the actual image files in the `UTKFace` directory use the `.jpg.chip.jpg` extension. The following script accounts for that when validating the dataset:


In [10]:
from os import listdir, path

directory = 'UTKFace'  # directory with images
landmarks_file = 'landmark_list_concatenated.txt'  # list of images with landmarks
probs = 0  # issues found
matches = 0  # matching files found

with open(landmarks_file, 'r') as file:
    lines = file.readlines()
images = [line.split()[0] + ".chip.jpg" for line in lines]
image_set = set()
for image in images:
    file = path.join(directory, image)
    if path.exists(file):
        image_set.add(image)
    else:
        print("Missing image:", image)
        probs += 1

# iterate over files in that directory
for filename in listdir(directory):
    f = path.join(directory, filename)
    # checking if it is a file
    if filename not in image_set:
        print("Unlisted image:", filename)
        probs += 1
    else:
        splitname: list[str] = filename.split(".")[0].split("_")
        if len(splitname) != 4 or int(splitname[0]) < 1 or int(splitname[0]) > 200 or int(splitname[1]) > 1 or int(splitname[2]) > 4 or not splitname[3].isdigit():
            print("Misnamed image:", filename)
            probs += 1
        else:
            matches += 1
print("\nIssues found:", probs)
print("Matching images:", matches)

FileNotFoundError: [Errno 2] No such file or directory: 'landmark_list_concatenated.txt'

We can see that most of the listed images are missing metadata fields or have minor typos such as missing a period/character or an extraneous space character. And when that is the case, the image is listed twice: once as a "missing image" and once as an "unlisted image". The following script will apply the corrections to the landmark file, yielding a new file named `landmark_list_corrected.txt`.

In [11]:
if path.exists('landmark_list_concatenated.txt'):

    with open('landmark_list_concatenated.txt', 'r') as f:
        lines = f.readlines()

    # Correcting specific lines of the spreadsheet so they match the actual image names
    lines[8512] = lines[8512].replace("61_1_20170109142408075.jpg", "61_1_1_20170109142408075.jpg")  # Missing gender
    lines[8513] = lines[8513].replace("61_3_20170109150557335.jpg", "61_1_3_20170109150557335.jpg")  # Missing gender
    lines[13951] = lines[13951].replace("53__0_20170116184028385.jpg", "53_1_0_20170116184028385.jpg")  # Missing gender
    lines[20080] = lines[20080].replace("39_1_20170116174525125.jpg", "39_1_1_20170116174525125.jpg")  # Missing gender
    lines[20585] = lines[20585].replace("55_0_0_20170116232725357jpg", "55_0_0_20170116232725357.jpg")  # Missing period
    lines[20621] = lines[20621].replace("24_0_1_20170116220224657 .jpg", "24_0_1_20170116220224657.jpg")  # Space in name
    lines[20647] = lines[20647].replace("44_1_4_20170116235150272.pg", "44_1_4_20170116235150272.jpg")  # Wrong extension

    with open('landmark_list_corrected.txt', "w") as f:
        f.writelines(lines)

    print("Corrections successfully applied to 'landmark_list_corrected.txt'")

else:

    print("ERROR: Could not find 'landmark_list_concatenated.txt'")


ERROR: Could not find 'landmark_list_concatenated.txt'


However, we still need to correct 5 issues in image filenames; if using an Unix-compatible system (such as Linux, Mac or Windows with WSL), you can run the following shell script:

```
#!/usr/bin/env sh

UTKZIP='UTKFace.tar.gz';
UTKDIR='UTKFace';

	if [ ! -d "$UTKDIR" ]; then
		if [ -f "$UTKZIP" ]; then
		  echo "Unzipping $UTKZIP to $UTKDIR..."
			tar -xvf UTKFace.tar.gz 2>/dev/null
			echo "Done. Feel free to delete $UTKZIP."
		else
		  echo "$UTKDIR directory not found. Please download $UTKZIP from https://gotil.la/3ziDcCX"
		fi
	fi
	if [ -d "$UTKDIR" ]; then
	  echo "$UTKDIR directory found."
	  cnt=0
		if [ -f "$UTKDIR/24_0_1_20170116220224657 .jpg.chip.jpg" ]; then
		    cnt=$((cnt+1))
		    echo "Renaming 24_0_1_20170116220224657 .jpg.chip.jpg\tto 24_0_1_20170116220224657.jpg.chip.jpg"
  			mv "$UTKDIR/24_0_1_20170116220224657 .jpg.chip.jpg" "$UTKDIR/24_0_1_20170116220224657.jpg.chip.jpg"
  	fi
		if [ -f "$UTKDIR/39_1_20170116174525125.jpg.chip.jpg" ]; then
		    cnt=$((cnt+1))
		    echo "Renaming 39_1_20170116174525125.jpg.chip.jpg\tto 39_1_1_20170116174525125.jpg.chip.jpg"
  			mv "$UTKDIR/39_1_20170116174525125.jpg.chip.jpg" "$UTKDIR/39_1_1_20170116174525125.jpg.chip.jpg"
  	fi
		if [ -f "$UTKDIR/61_1_20170109150557335.jpg.chip.jpg" ]; then
		    cnt=$((cnt+1))
		    echo "Renaming 61_1_20170109150557335.jpg.chip.jpg\tto 61_1_3_20170109150557335.jpg.chip.jpg"
  			mv "$UTKDIR/61_1_20170109150557335.jpg.chip.jpg" "$UTKDIR/61_1_3_20170109150557335.jpg.chip.jpg"
  	fi
		if [ -f "$UTKDIR/55_0_0_20170116232725357jpg.chip.jpg" ]; then
		    cnt=$((cnt+1))
		    echo "Renaming 55_0_0_20170116232725357jpg.chip.jpg\tto 55_0_0_20170116232725357.jpg.chip.jpg"
  			mv "$UTKDIR/55_0_0_20170116232725357jpg.chip.jpg" "$UTKDIR/55_0_0_20170116232725357.jpg.chip.jpg"
  	fi
		if [ -f "$UTKDIR/61_1_20170109142408075.jpg.chip.jpg" ]; then
		    cnt=$((cnt+1))
		    echo "Renaming 61_1_20170109142408075.jpg.chip.jpg\tto 61_1_1_20170109142408075.jpg.chip.jpg"
  			mv "$UTKDIR/61_1_20170109142408075.jpg.chip.jpg" "$UTKDIR/61_1_1_20170109142408075.jpg.chip.jpg"
  	fi
  	if [ "$cnt" -eq 0 ]; then
      echo "Nothing to do; all files correctly named."
  	fi
  else
    	  echo "$UTKDIR directory not found."
	fi
```

Alternatively, you can apply the following corrections by hand:

* `24_0_1_20170116220224657 .jpg.chip.jpg` -> `24_0_1_20170116220224657.jpg.chip.jpg` _(Space in filename)_
* `55_0_0_20170116232725357jpg.chip.jpg` -> `55_0_0_20170116232725357.jpg.chip.jpg` _(Missing period)_
* `61_1_20170109142408075.jpg.chip.jpg` -> `61_1_1_20170109142408075.jpg.chip.jpg` _(Missing gender)_
* `61_1_20170109150557335.jpg.chip.jpg` -> `61_1_3_20170109150557335.jpg.chip.jpg` _(Missing race)_
* `39_1_20170116174525125.jpg.chip.jpg` -> `39_1_1_20170116174525125.jpg.chip.jpg` _(Missing gender)_

Once that is done, feel free to re-run the dataset validation script a few fields above to check that there ar eno more mismatches.

### Analysing and Filtering the Data

Now let's look at the data we have; the following table shows the breakdown of the dataset by gender, age and race (we will not use the datetime metadata field).

#### Original UTKFace Dataset

| **Gender/Age** | **Asian** | **Black** | **Indian** | **Other** |  **White** |  **Total** |
|:---------------|----------:|----------:|-----------:|----------:|-----------:|-----------:|
| **Female**     | **1,859** | **2,210** |  **1,715** |   **932** |  **4,601** | **11,317** |
| 1 - 14         |       461 |       110 |        297 |       280 |        783 |      1,931 |
| 15 - 24        |       414 |       346 |        383 |       277 |        667 |      2,087 |
| 25 - 44        |       872 |     1,473 |        847 |       346 |      1,655 |      5,193 |
| 45 - 64        |        39 |       205 |        146 |        22 |        798 |      1,210 |
| 65 and over    |        73 |        76 |         42 |         7 |        698 |        896 |
| **Male**       | **1,575** | **2,318** |  **2,261** |   **760** |  **5,477** | **12,391** |
| 1 - 14         |       511 |        87 |        246 |       163 |        713 |      1,720 |
| 15 - 24        |       137 |       246 |        175 |       121 |        486 |      1,165 |
| 25 - 44        |       601 |     1,426 |      1,102 |       374 |      2,056 |      5,559 |
| 45 - 64        |       179 |       400 |        649 |        97 |      1,560 |      2,885 |
| 65 and over    |       147 |       159 |         89 |         5 |        662 |      1,062 |
| **Total**      | **3,434** | **4,528** |  **3,976** | **1,692** | **10,078** | **23,708** |

After review of some sample images and discussion with various teammembers, we decided to discard images for children (age < 15) and older adults (age ≥ 65) as well as those where the race was classified as "Other" for consistency. This reduced the dataset to the following:

#### Filtered UTKFace Dataset

| **Gender/Age** | **Asian** | **Black** | **Indian** | **White** |  **Total** |
|:---------------|----------:|----------:|-----------:|----------:|-----------:|
| **Female**     | **1,325** | **2,024** |  **1,376** | **3,120** |  **7,845** |
| 15 - 24        |       414 |       346 |        383 |       667 |      2,087 |
| 25 - 44        |       872 |     1,473 |        847 |     1,655 |      5,193 |
| 45 - 64        |        39 |       205 |        146 |       798 |      1,210 |
| **Male**       |   **917** | **2,072** |  **1,926** | **4,102** |  **9,017** |
| 15 - 24        |       137 |       246 |        175 |       486 |      1,165 |
| 25 - 44        |       601 |     1,426 |      1,102 |     2,056 |      5,559 |
| 45 - 64        |       179 |       400 |        649 |     1,560 |      2,885 |
| **Total**      | **2,242** | **4,096** |  **3,302** | **7,222** | **16,862** |

We can use the `preselect_landmarks` function (below) to filter out images where the subject's age is under 15 or above 64 and their race is classified as 'other'. The script will produce a file called `ll_initial.txt` in the current directory listing the 16,862 images that fit that criteria. (Note that we specify the seed value through the `randomseed` parameter to enable reproducibility.)

In [12]:
from sys import exit
from os.path import exists

def preselect_landmarks(landmarks_file: str, age=None, gender=None, race=None,
                        *, log: bool = False, randomise: bool = False,
                        randomseed=None, target=None, filename=None) -> None:
    """ Preselect Landmarks

    Iterates through an original file listing images and associated landmarks
    applying the given filters and generates another file with the subset of
    images that passed *all* filters.

    :param landmarks_file: name of original landmark file (str, Required)
    :param log: Whether to log why each image was discarded and to print a
        summary message with the number of images filtered (Default: False)
    :param target: (maximum) number of images to select based on the provided
        filters. If defined, the function may output two files with suffixes:
        • "_filtered": list of images which meet all filter criteria;
        • "_remainder": list of images which do not meet filter criteria or
            exceed target number;
    :param randomise: shuffle landmarks before preselection? (Default: False)
    :param randomseed: seed value (int) when randomising (Default: None)
    :param age: tuple containing min (int) and max (int) values (Default: None)
    :param gender: 'male' or 'female' (str, Default: None)
    :param race: either a str with a single race or a list for multiple races,
        values 'white', 'black', 'asian', 'indian' and 'other' (Default: None)
    :param filename: filename for target files
    :return: nothing, may print Errors, Warnings and log messages to stdout
    """
    with open(landmarks_file, 'r') as file:
        landmarks = file.readlines()

    # Initialise maps
    genders: dict[str: str] = {
        '0': "male",
        '1': "female"
    }
    races: dict[str: str] = {
        '0': "white",
        '1': "black",
        '2': "asian",
        '3': "indian",
        '4': "other"
    }

    # Read parameters for valid filters and capture those for new filename
    filters = []
    if isinstance(age, tuple) and age[0] <= age[1]:
        filters.append(str(age[0]) + "-" + str(age[1]))
    if isinstance(gender, str) and gender in genders.values():
        filters.append(gender)
    if isinstance(race, str):
        race = [race]
    if isinstance(race, list) and all(r in races.values() for r in race):
        filters.append("-".join(race))

    # Abort if no valid filters were found or invalid target
    if len(filters) == 0:
        print("Error: No valid filters to apply.")
        exit(1)
    if target is not None and (not isinstance(target, int) or target < 1):
        print("Error: Target needs to be greater than zero.")
        exit(1)

    # Abort if files already exist with target name (avoid overwriting).
    if filename:
        filtered_landmarks_file = filename
    else:
        filtered_landmarks_file = landmarks_file.split(".")[0]
        filtered_landmarks_file += "_" + "_".join(filters)
    filtered_landmarks_file += "_filtered" if target else ""
    filtered_landmarks_file += ".txt"
    if exists(filtered_landmarks_file):
        print(f"Error: File '{filtered_landmarks_file}' already exists in "
              f"current directory; delete or rename and run script again.")
        exit(1)
    if filename:
        remainder_landmarks_file = filename
    else:
        remainder_landmarks_file = landmarks_file.split(".")[0]
        remainder_landmarks_file += "_" + "_".join(filters)
    remainder_landmarks_file += "_remainder.txt"
    if target and exists(remainder_landmarks_file):
        print(
            f"Error: File '{remainder_landmarks_file}' already exists in "
            f"current directory; delete or rename and run script again.")
        exit(1)

    if randomise:
        if seed is not None:
            seed(randomseed)
        shuffle(landmarks)

    filtered_landmarks: list[str] = []
    remainder_landmarks: list[str] = []
    for line in landmarks:
        # Iterate over all lines in landmark file and apply filters
        keep = True

        # Retrieve metadata from filename
        imagename = line.split()[0]
        splitname: list[str] = imagename.split(".")[0].split("_")
        # 0 is presumed if no age is provided in metadata, so this may fail
        # minimum age filters greater than 0.
        line_age = int(splitname[0]) if len(splitname) > 0 else 0
        # An image with no gender metadata will fail all gender filters
        if len(splitname) > 1 and splitname[1] in genders:
            line_gender = genders[splitname[1]]
        else:
            line_gender = ""
        # An image without race metadata will fail any race filters
        if len(splitname) > 2 and splitname[2] in races:
            line_race = races[splitname[2]]
        else:
            line_race = ""

        # Check if a given line passes *all* filters
        if isinstance(age, tuple) and (line_age < age[0] or line_age > age[1]):
            if log:
                print(f"Image {imagename} skipped due to age ({line_age}).", )
            keep = False
        if isinstance(gender, str) and line_gender != gender:
            if log:
                print(f"Image {imagename} skipped due to gender ({line_gender}).", )
            keep = False
        if isinstance(race, list) and line_race not in race:
            if log:
                print(f"Image {imagename} skipped due to race ({line_race}).", )
            keep = False

        if keep and (target is None or target > len(filtered_landmarks)):
            filtered_landmarks.append(line)
            if log:
                print(f"Image {imagename} added to filtered list.")
        else:
            remainder_landmarks.append(line)
            if log and target:
                print(f"Image {imagename} added to remainder list.")

    if len(filtered_landmarks) == 0:
        print("Warning: No images passed all filters.")
        exit(1)

    with open(filtered_landmarks_file, 'w') as file:
        file.writelines(filtered_landmarks)
    if log:
        print(len(filtered_landmarks),
              "filtered images saved to file",
              filtered_landmarks_file)

    if target and len(remainder_landmarks) != 0:
        with open(remainder_landmarks_file, 'w') as file:
            file.writelines(remainder_landmarks)
        if log:
            print(len(remainder_landmarks),
                  "remainder images saved to file",
                  remainder_landmarks_file)


# Selecting images for people of known races and with ages between 15-64.
preselect_landmarks('landmark_list_corrected.txt', age=(15, 64),
                    race=['asian', 'black', 'indian', 'white'], filename='ll_initial',
                    randomise=True, randomseed=680780122122)

FileNotFoundError: [Errno 2] No such file or directory: 'landmark_list_corrected.txt'

### Splitting the data into Training, Validation and Test datasets

The plan is to train our models on 3 datasets with differing ratios of female/male images and then test their performance against female-only, male-only and evenly-split datasets. This is the plan:

![data_preparation_plan](jupyter_images/Data_Preparation_Plan.png)

First, let's run `preselect_landmarks` two more times to produce a file with 7,500 randomly-ordered males and another with 7,500 randomly-ordered females.

In [1]:
# Selecting 7,500 male images from the initial dataset
preselect_landmarks('ll_initial.txt', gender="male", target=7500, randomise=True, randomseed=680780122122)

# Selecting 7,500 female images from the initial dataset
preselect_landmarks('ll_initial_male_remainder.txt', gender="female", target=7500, randomise=True, randomseed=680780122122)

NameError: name 'preselect_landmarks' is not defined

To make things easier to follow, let's import the `os` package and use it to rename the files to `ll_males.txt` and `ll_females.txt`, respectively. We can also delete the intermediary/additional files generated so far (and will do so at each step from now on).

In [14]:
import os
# Renaming the output files for clarity
os.rename('ll_initial_male_filtered.txt', 'll_males.txt')
os.rename('ll_initial_male_remainder_female_filtered.txt', 'll_females.txt')

# Deleting intermediary files
os.remove('landmark_list_concatenated.txt')
os.remove('landmark_list_corrected.txt')
os.remove('ll_initial.txt')
os.remove('ll_initial_male_remainder.txt')
os.remove('ll_initial_male_remainder_female_remainder.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'll_initial_male_filtered.txt' -> 'll_males.txt'

Now we can extract 750 lines (10%) from `ll_females.txt` and `ll_males.txt` and concatenate that into a single balanced test file called `ll_test_50-50_split.txt` with 1,500 images. We will keep all three files for our testing phase.

In [15]:
# Splitting the male data set into test (750 images) and training/validation (6,750 images)
preselect_landmarks('ll_males.txt', filename="ll_males_test", gender="male", target=750, randomise=True, randomseed=680780122122)

# Splitting the female data set into test (750 images) and training/validation (6,750 images)
preselect_landmarks('ll_females.txt', filename="ll_females_test", gender="female", target=750, randomise=True, randomseed=680780122122)

# Renaming the output files for clarity
os.rename('ll_males_test_filtered.txt', 'll_test_males.txt')
os.rename('ll_females_test_filtered.txt', 'll_test_females.txt')

# Concatenating the male/female test images into a single dataset
concatenate_files("ll_test_50-50_split.txt", files=['ll_test_males.txt', 'll_test_females.txt'], delete=False)

# Deleting intermediary files
os.remove('ll_males.txt')
os.remove('ll_females.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'll_males.txt'

We split the remaining male/female data into Training (80%: 6,000 images) and Validation (10%: 750 images) sets, but we won't contatenate them just yet.

In [8]:
# Splitting the male data set into training (6,000 images) and validation (750 images)
preselect_landmarks('ll_males_test_remainder.txt', filename="ll_males_validation", gender="male", target=750)

# Renaming the output files for clarity
os.rename('ll_males_validation_filtered.txt', 'll_males_validation.txt')
os.rename('ll_males_validation_remainder.txt', 'll_males_training.txt')

# Deleting intermediary files
os.remove('ll_males_test_remainder.txt')

# Splitting the female data set into training (6,000 images) and validation (750 images)
preselect_landmarks('ll_females_test_remainder.txt', filename="ll_females_validation", gender="female", target=750)

# Renaming the output files for clarity
os.rename('ll_females_validation_filtered.txt', 'll_females_validation.txt')
os.rename('ll_females_validation_remainder.txt', 'll_females_training.txt')

# Deleting intermediary files
os.remove('ll_females_test_remainder.txt')

Now, we need to split both the male/female **training** sets into thirds so that we can compose our 3 separate cohorts (25-75 split, 50-50 split, 75-25 split).

In [9]:
# Spliting the male training dataset into 3 files with 2,000 images each
preselect_landmarks('ll_males_training.txt', filename="ll_males_training_1", gender="male", target=2000)
preselect_landmarks('ll_males_training_1_remainder.txt', filename="ll_males_training_2", gender="male", target=2000)

# Renaming the output files for clarity
os.rename('ll_males_training_1_filtered.txt', 'll_males_training_cohort_1.txt')
os.rename('ll_males_training_2_filtered.txt', 'll_males_training_cohort_2.txt')
os.rename('ll_males_training_2_remainder.txt', 'll_males_training_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_males_training.txt')
os.remove('ll_males_training_1_remainder.txt')

# Spliting the female training dataset into 3 files with 2,000 images each
preselect_landmarks('ll_females_training.txt', filename="ll_females_training_1", gender="female", target=2000)
preselect_landmarks('ll_females_training_1_remainder.txt', filename="ll_females_training_2", gender="female", target=2000)

# Renaming the output files for clarity
os.rename('ll_females_training_1_filtered.txt', 'll_females_training_cohort_1.txt')
os.rename('ll_females_training_2_filtered.txt', 'll_females_training_cohort_2.txt')
os.rename('ll_females_training_2_remainder.txt', 'll_females_training_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_females_training.txt')
os.remove('ll_females_training_1_remainder.txt')

# 25:75 Training file: 1 female cohort + 3 male cohorts
concatenate_files('ll_training_25-75_split.txt', files=['ll_females_training_cohort_3.txt', 'll_males_training_cohort_1.txt', 'll_males_training_cohort_2.txt', 'll_males_training_cohort_3.txt'])

# 50:50 Training file: 2 female cohorts + 2 male cohorts
concatenate_files('ll_training_50-50_split.txt', files=['ll_females_training_cohort_1.txt', 'll_females_training_cohort_2.txt', 'll_males_training_cohort_1.txt', 'll_males_training_cohort_2.txt'])

# 75:25 Training file: 3 female cohorts + 1 male cohort
concatenate_files('ll_training_75-25_split.txt', files=['ll_females_training_cohort_1.txt', 'll_females_training_cohort_2.txt', 'll_females_training_cohort_3.txt','ll_males_training_cohort_3.txt'])

# Deleting intermediary files
os.remove('ll_females_training_cohort_1.txt')
os.remove('ll_females_training_cohort_2.txt')
os.remove('ll_females_training_cohort_3.txt')
os.remove('ll_males_training_cohort_1.txt')
os.remove('ll_males_training_cohort_2.txt')
os.remove('ll_males_training_cohort_3.txt')

Input files successfully concatenated as ll_training_25-75_split.txt
Input files successfully concatenated as ll_training_50-50_split.txt
Input files successfully concatenated as ll_training_75-25_split.txt


We do the same with the male/female **validation** sets:

In [10]:
# Spliting the male validation dataset into 3 files with 250 images each
preselect_landmarks('ll_males_validation.txt', filename="ll_males_validation_1", gender="male", target=250)
preselect_landmarks('ll_males_validation_1_remainder.txt', filename="ll_males_validation_2", gender="male", target=250)

# Renaming the output files for clarity
os.rename('ll_males_validation_1_filtered.txt', 'll_males_validation_cohort_1.txt')
os.rename('ll_males_validation_2_filtered.txt', 'll_males_validation_cohort_2.txt')
os.rename('ll_males_validation_2_remainder.txt', 'll_males_validation_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_males_validation.txt')
os.remove('ll_males_validation_1_remainder.txt')

# Spliting the female validation dataset into 3 files with 250 images each
preselect_landmarks('ll_females_validation.txt', filename="ll_females_validation_1", gender="female", target=250)
preselect_landmarks('ll_females_validation_1_remainder.txt', filename="ll_females_validation_2", gender="female", target=250)

# Renaming the output files for clarity
os.rename('ll_females_validation_1_filtered.txt', 'll_females_validation_cohort_1.txt')
os.rename('ll_females_validation_2_filtered.txt', 'll_females_validation_cohort_2.txt')
os.rename('ll_females_validation_2_remainder.txt', 'll_females_validation_cohort_3.txt')

# Deleting intermediary files
os.remove('ll_females_validation.txt')
os.remove('ll_females_validation_1_remainder.txt')

# 25:75 Validation file: 1 female cohort + 3 male cohorts
concatenate_files('ll_validation_25-75_split.txt', files=['ll_females_validation_cohort_3.txt', 'll_males_validation_cohort_1.txt', 'll_males_validation_cohort_2.txt', 'll_males_validation_cohort_3.txt'])

# 50:50 Validation file: 2 female cohorts + 2 male cohorts
concatenate_files('ll_validation_50-50_split.txt', files=['ll_females_validation_cohort_1.txt', 'll_females_validation_cohort_2.txt', 'll_males_validation_cohort_1.txt', 'll_males_validation_cohort_2.txt'])

# 75:25 Validation file: 3 female cohorts + 1 male cohort
concatenate_files('ll_validation_75-25_split.txt', files=['ll_females_validation_cohort_1.txt', 'll_females_validation_cohort_2.txt', 'll_females_validation_cohort_3.txt','ll_males_validation_cohort_3.txt'])

# Deleting intermediary files
os.remove('ll_females_validation_cohort_1.txt')
os.remove('ll_females_validation_cohort_2.txt')
os.remove('ll_females_validation_cohort_3.txt')
os.remove('ll_males_validation_cohort_1.txt')
os.remove('ll_males_validation_cohort_2.txt')
os.remove('ll_males_validation_cohort_3.txt')

Input files successfully concatenated as ll_validation_25-75_split.txt
Input files successfully concatenated as ll_validation_50-50_split.txt
Input files successfully concatenated as ll_validation_75-25_split.txt


We now have the following files ready for training, validation and testing:

#### Training
* 25% Female + 75% Male: `ll_training_25-75_split.txt` (8,000 lines)
* 50% Female + 50% Male: `ll_training_50-50_split.txt` (8,000 lines)
* 75% Female + 25% Male: `ll_training_75-25_split.txt` (8,000 lines)

#### Validation
* 25% Female + 75% Male: `ll_validation_25-75_split.txt` (1,000 lines)
* 50% Female + 50% Male: `ll_validation_50-50_split.txt` (1,000 lines)
* 75% Female + 25% Male: `ll_validation_75-25_split.txt` (1,000 lines)

#### Testing
* 100% Female: `ll_test_females.txt` (750 lines)
* 100% Male: `ll_test_males.txt` (750 lines)
* 50% Female + 50% Male: `ll_test_50-50_split.txt` (1,500 lines)





## Models

To examine the effects of different model architectures on model bias we created three different models to compare. These models were:

* A basic convolutional neural network
* A dense net
* A residual network

Each of these models would then be trained on three different training datasets. These datasets will have varying proportions of men and women and will be validated using a datset with similar proportions. This allows us to examine how resilient each model is to dataset bias.

##### Basic Convolutional NN

The first NN used in our experiment is a basic NN featuring four convolutional layers, one linear layer and pooling and dropout layers.

![alternative text](jupyter_images/convNN2_structure.png)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision.models

In [3]:
class convNN2(torch.nn.Module):
    def __init__(self):
        super(convNN2, self).__init__()

        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3)

        self.fc1 = nn.Linear(128, 6)

        self.pool = nn.MaxPool2d(2, 2)

        self.dropout = nn.Dropout2d(p=0.2)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        x = self.pool(x)

        bs, _, _, _ = x.shape
        x = F.adaptive_avg_pool2d(x, 1).reshape(bs, -1)
        x = self.dropout(x)
        out = self.fc1(x) 

        return out

### Resnet

A next step after the ConvNN2 model was the Resnet34 model. We decided on the Resnet model because of its good performance for general image classification, and we wished to see how it might be able to extend to facial landmark detection. The skip connections in the Resnet models also allowed us to train a larger network without running into vanishing gradient problems that we might have encountered had we simply naively added extra layers to our convNN2 model. In addition to Resnet34, we did test other variants including Resnet18 and Resnet50, but found that having 34 layers struck a good balance between performance and time required for training. 

Given that we decided to select three landmarks for our points of reference on the face, any additional resnet layers did not yield significant reductions in error rate. In the future if we wished to do our testing with additional facial landmarks, then it may be necessary to use Resnet101 or Resnet152.

![foo](jupyter_images/resnet-table.png)

Source: He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). *Deep Residual Learning for Image Recognition*

Structure of ResNet34

![foo](jupyter_images/resnetdiagram.png)

In [4]:
class resnet18(nn.Module):
    def __init__(self):
        super(resnet18, self).__init__()
        self.resnet = torchvision.models.ResNet(torchvision.models.resnet.BasicBlock, [2, 2, 2, 2], num_classes=6)
    
    def forward(self, x):
        return self.resnet.forward(x)

class resnet34(nn.Module):
    def __init__(self):
        super(resnet34, self).__init__()
        self.resnet = torchvision.models.ResNet(torchvision.models.resnet.BasicBlock, [3, 4, 6, 3], num_classes=6)
    
    def forward(self, x):
        return self.resnet.forward(x)

class resnet50(nn.Module):
    def __init__(self):
        super(resnet50, self).__init__()
        self.resnet = torchvision.models.ResNet(torchvision.models.resnet.Bottleneck, [3, 4, 6, 3], num_classes=6)

    def forward(self, x):
        return self.resnet.forward(x)

The Pytorch implementation of the Resnet models allowed for defining a specific number of output classes. Therefore the Resnet models used in our experiments uses 6 output classes, one for each coordinate of the three chosen landmark features. Thus the fully connected layer at the end of the Resnet is no longer a 1000 output fully connected layer, which the original authors of the Resnet paper used.

#### Dense Net

This network is based on the pytorch implementation of [densenet121](https://pytorch.org/vision/main/models/generated/torchvision.models.densenet121.html). A dense net was chosen as it is a common network used in image recognition and classification tasks.

Since the default pytorch implementation produces an output tensor of shape [1000,] a few linear layers were added to turn this into a [6,] output in order to describe just three landmarks.

The structure is as follows below

![foo](jupyter_images/densenetdiagram.png)

In [5]:
class denseNN(nn.Module):
    def __init__(self, device):
        super(denseNN, self).__init__()
        # Pytorch does not come with densenet121 installed and it must be downloaded.
        # 
        self.dense121 = torch.hub.load('pytorch/vision:v0.10.0', 'densenet121', pretrained=False)
        self.fc1 = nn.Linear(1000, 600)
        self.fc2 = nn.Linear(600, 100)
        self.fc3 = nn.Linear(100, 6)

    def forward(self, x):
        x = F.relu(self.dense121(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

### Training Process and Validation

To perform our training we must first create a pytorch [Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). This allows us to read in images and labels described by the dataset text files created earlier. This class also serves to extract other information from the name of the image. Features like age, race and gender are recorded in the image name. Some preprocessing is also required. Once images are loaded into a tensor, we divide all values by 255 to restrict each value to the range [0, 1]. If this is not done we run into issues training where the loss and some weights could reach infinity and cause issues in training.

In [6]:
from os import path
from torch import tensor, div
from torch.utils.data import Dataset
from torchvision.io import read_image


class CustomImageDataset(Dataset):
    """
    Custom Image Dataset Class
    """
    def __init__(self, annotations_file: str, img_dir: str, transform=None):
        """
        Create a Custom Image Dataset object

        Usage:
        dataset_obj = CustomImageDataset('landmarks.txt', '../images/')

        :param annotations_file: address of input file (relative to script)
        :param img_dir: path to image directory (relative to script)
        :param transform: function to be applied to every image requested
        """

        # Read the landmarks file for later querying
        with open(annotations_file, 'r') as file:
            lines = file.readlines()
        self.img_labels = [line.split() for line in lines]
        self.img_dir: str = img_dir
        self.transform = transform

        # Initialise Maps for the __getitem__ method
        self.genders: dict[str: str] = {
            '0': "male",
            '1': "female"
        }
        self.races: dict[str: str] = {
            '0': "white",
            '1': "black",
            '2': "asian",
            '3': "indian",
            '4': "other"
        }

    def __len__(self) -> int:
        """
        Returns the number of images in the Custom Image Dataset object.

        Usage:
        len(dataset_obj)

        :return: int
        """
        return len(self.img_labels)

    def __getitem__(self, idx: int):
        """
        Used by PyTorch to request a given image within the Dataset.

        Usage:
        dataset_obj[42]

        :param idx: number of requested image (should be less than __len__)
        :return: the requested image and its metadata as separate variables:
            'image': Scaled PyTorch tensor obj for the image file
            'age': int with the age of the person in the image
            'gender': str ('male' or 'female')
            'race': str ('white', 'black', 'asian', 'indian', 'others')
            'landamarks': PyTorch tensor obj with 68 pairs of x,y coords
        """

        # Reads file for given index as a tensor image
        imagename = self.img_labels[idx][0] + ".chip.jpg"
        image = read_image(path.join(self.img_dir, imagename)).float()
        image_scaled = div(image, 255)

        # Applies any transformations to image
        if self.transform:
            image_scaled = self.transform(image_scaled)

        # Reads the image metadata from the filename
        splitname: list[str] = imagename.split(".")[0].split("_")
        age: int = int(splitname[0])
        gender: str = self.genders[splitname[1]]
        race: str = self.races[splitname[2]]
        # datetime: int = int(splitname[3][:13])    // Not used

        # Reorganises the x,y landmark coordinates as a 68x2 tensor
        coords = self.img_labels[idx][1:]
        raw_landmarks: list[list[int, int]] = []
        for i in range(0, len(coords), 2):
            raw_landmarks.append([int(coords[i]), int(coords[i + 1])])
        landmarks = tensor(raw_landmarks).float()

        return image_scaled, age, gender, race, landmarks

In [7]:
import torch.optim as optim
import numpy as np

import time
import copy
import json

from torch.utils.data import DataLoader
from argparse import ArgumentParser

In [8]:
class Timer():
    def __init__(self):
        self.start_time = time.time()

    def start(self):
        self.start_time = time.time()

    def elapsed_time(self):
        current_time = time.time()

        duration = current_time - self.start_time
    
        hours = int(duration / 3600)
        minutes = int((duration % 3600) / 60)
        seconds = int((duration % 3600) % 60)

        return f"{hours}h {minutes}m {seconds}s"

#### Validation

The evaluate function is used for both validation and testing purposes. Here we calculate the straight line distance between the real label coordinates and the coordinates outputted by the model.

Average distance was chosen over MSE or other similar error functions as it is easier to interpret.

The evaluate function also outputs the standard deviation of these scores, as it may provide some useful information about the variance of the model. A high variance would mean the model is simply guessing in a tight area and not producing unique coordinates for each image.

In [9]:
def evaluate(model, valid_set_path, device):
    UTKFace = CustomImageDataset(valid_set_path, 'UTKFace')
    valid_set = DataLoader(UTKFace, 
                            500, 
                            shuffle=True)

    # We're calculating the distance ourselves as using MSE loss doesnt 
    # allow us to square root terms individually.
    model.eval()
    with torch.no_grad():
        for images, _, _, _, landmarks in valid_set:
            images, landmarks = images.to(device), landmarks.to(device)

            outputs = model(images).view([-1,3,2]) # organise into (x, y) pairs

            land_idx = [8, 30, 39]  # The labels we are training for
            difference = torch.square(outputs - landmarks[:, land_idx]).to(device)
            difference = torch.sqrt(difference[:, 0] + difference[:, 1])

    model.train()
    return torch.mean(difference).item(), torch.std(difference).item()

#### Now for the training

Here validation is only performed every 20 iterations as it adds a significant amount of time to the training process. Validating less frequently allows us to graph the results. Each time validation is performed, we check the current set of weights against the previous best scoring model. If the current weights perform better, we update the best model accordingly and record which iteration it occured in. 

Stochastic gradient descent (SGD) does not ensure that the final epoch and iteration will produce the best model. As such, validating regularly allows us to choose the best model thus far. While we cannot do this every iteration due to the overhead, we assume that validating regularly will yield a set of weights close to the optimal.

In [10]:
def train(model, train_loader, lr, device, valid_set, epochs=5):
    loss_func = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    # Initialise somewhere to save data for later graphing
    batches = len(train_loader)
    scores = np.empty([batches * epochs, 3])
    scores[:] = np.nan

    best_model = model
    best_scores = {"iteration": 0, 
                "mean": 1000,
                "std": 1000,
                "loss_list": []}

    timer = Timer()
    timer.start()

    for epoch in range(epochs):
        for i, data in enumerate(train_loader, 0):
            images, _, _, _, landmarks = data   # images, age, gender, race, landmarks

            # Zero paramter gradients
            optimizer.zero_grad()
            images, landmarks = images.to(device), landmarks.to(device)

            outputs = model(images)
            land_idx = [8, 30, 39]  # The indexs of the landmarks we are training with
            loss = loss_func(outputs, landmarks[:, land_idx].view(-1, 6))
            best_scores["loss_list"].append(loss.item())    # Record for graphing later
            loss.backward()
            optimizer.step()

            # Validation is performed every 20 iterations due to its high overhead.
            if i % 20 == 0:
                mean, std = evaluate(model, valid_set, device)
                scores[(epoch * batches) + i, 0] = (epoch * batches) + i
                scores[(epoch * batches) + i, 1] = mean
                scores[(epoch * batches) + i, 2] = std
                print(f"[{timer.elapsed_time()}] Epoch: {epoch}, iteration: {i}, loss: {loss.item()}, mean: {mean}, std: {std}")

                # If the current model is the best we have seen so far, preserver the weights
                if mean < best_scores["mean"]:
                    best_model = copy.deepcopy(model)    # We need to copy to preserve weights
                    best_scores["iteration"] = (epoch * batches) + i
                    best_scores["mean"] = mean
                    best_scores["std"] = std
            
    # Remove iterations where we did not do any validation
    filtered_scores = scores[~np.isnan(scores).any(axis=1)]

    return best_model, best_scores, filtered_scores

Since we used cli arguments to make the code in this notebook we can instead add the arguments to a string and pass the string to the `ArgumentParser` class.

* -f, --train_file
    - Used to specify the path to a training file. These should be text files with a list of images. For example `ll_training_75-25_split.txt` indicates the model should use the training set with the 75-25 gender split.
* -vf, --validation_file
    - Like the train file, this argument is used to indicate which subset of `UTKFace` to use for validation.
* -b, --batch
    - Set the batch size for training. All models were trained on a batch size of 32.
* -m, --model
    - Choose which model to train.
    - The models availible are:
        - "convNN2"
        - "dense"
        - "resnet18"
        - "resnet34"
        - "resnet50"
* -lr, --learning_rate
    - Specify the learning rate.
* --cuda
    - Including this argument will enable training on cuda.
* -e, --epochs
    - Specify the number of epochs to train for.

In [32]:
arg_str = "train.py -b 32 --learning_rate 0.0001 --cuda --train_file ll_training_75-25_split.txt --validation_file ll_validation_75-25_split.txt --epochs 50 -m dense"

# Read in args
parser = ArgumentParser(arg_str)
parser.add_argument("-tf", "--train_file",
                    help="Path to data file.", 
                    metavar="FILE_PATH", 
                    default="landmark_list.txt")
parser.add_argument("-vf", "--validation_file",
                    help="Choose file to use for validation.",
                    metavar="FILE_PATH",
                    default="landmark_list.txt")
parser.add_argument("-b", "--batch", 
                    help="Batch size for training.", 
                    type=int, 
                    metavar="INT",
                    default=64)
parser.add_argument("-m", "--model",
                    help="Choose which model structure to use.",
                    default="convNN2",
                    metavar="MODEL_NAME")
parser.add_argument("-lr", "--learning_rate",
                    help="Learning rate to run the optimizer function with.",
                    default=0.0001,
                    type=float,
                    metavar="FLOAT")
parser.add_argument("--cuda",
                    help="Add this argument to run the code using GPU acceleration.",
                    action="store_true")
parser.add_argument("-e", "--epochs",
                    help="Dictate number of epochs to train for.",
                    type=int,
                    metavar="INT",
                    default=5)

args, _  = parser.parse_known_args()

In [30]:
device = "cpu"
if args.cuda and torch.cuda.is_available():    
    device = "cuda"

model = None
if args.model == "convNN2":
    model = convNN2().to(device)
elif args.model == "resnet18":
    model = resnet18().to(device)
elif args.model == "resnet34":
    model = resnet34().to(device)
elif args.model == "resnet50":
    model = resnet50().to(device)
elif args.model == "dense":
    model = denseNN(device).to(device)

UTKFace = CustomImageDataset(args.train_file, 'UTKFace')
train_dataloader = DataLoader(UTKFace, 
                                batch_size=args.batch, 
                                shuffle=True)

In [33]:
print(f"Training {args.model} from {args.train_file} with batch_size={args.batch}\n")

# Train model
model, info, plots = train(model, train_dataloader, args.learning_rate, device, args.validation_file, epochs=args.epochs)

Training convNN2 from landmark_list.txt with batch_size=64



KeyboardInterrupt: 

Now we save the model weights and model performance over time for later evaluation

In [None]:
# save model and training/validation results
# filename includes batchsize, epoch number, learning rate
filename = f"{args.model}_{args.train_file}_batch{args.batch}_ep{args.epochs}_lr{args.learning_rate}"
model_path = f"./models/{filename}.pt"
scores_path = f"./model_scores/{filename}.csv"
torch.save(model.state_dict(), model_path)
np.savetxt(scores_path, plots, delimiter=",")

info["epochs"] = args.epochs
info["batch"] = args.batch

with open(f"./model_infos/{filename}.json", "w") as outfile:
    json.dump(info, outfile)

### Testing the Network

We can commence testing now that all the networks have been successfully trained. The models and training data are in the folders: *models*, *model_scores*, and *model_infos*.
Each model will be tested and compared on three different sets: a 50-50 split of male to female testing data, a solely male testing set, and a solely female testing set.

First we will load all the models into memory. This is simply done by insert them all into a list, initialising a model, and then loading each file into that subsequent model.

In [38]:
# Adding in model names
model_names = [
        "convNN2_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", 
        "convNN2_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt", 
        "convNN2_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt", 
        "dense_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", 
        "dense_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt",
        "dense_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt", 
        "resnet34_ll_training_50-50_split.txt_batch64_ep50_lr0.0001.pt", 
        "resnet34_ll_training_25-75_split.txt_batch64_ep50_lr0.0001.pt",
        "resnet34_ll_training_75-25_split.txt_batch64_ep50_lr0.0001.pt"]

models = [convNN2(), convNN2(), convNN2(), denseNN("cpu"), denseNN("cpu"), denseNN("cpu"), resnet34(), resnet34(), resnet34()]

for name, model in zip(model_names, models):
    model.load_state_dict(torch.load("./models/" + name))

Using cache found in /home/oscarfzs/.cache/torch/hub/pytorch_vision_v0.10.0
Using cache found in /home/oscarfzs/.cache/torch/hub/pytorch_vision_v0.10.0
Using cache found in /home/oscarfzs/.cache/torch/hub/pytorch_vision_v0.10.0


RuntimeError: Error(s) in loading state_dict for convNN2:
	Unexpected key(s) in state_dict: "conv4.weight", "conv4.bias", "fc2.weight", "fc2.bias". 
	size mismatch for fc1.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([6, 128]).
	size mismatch for fc1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([6]).

Once we have done this we can now calculate the scores of each model for every test set. This should compute the mean error and standard deviation and return it as a tuple.

In [None]:
for model in models:
    print(model.__class__.__name__)
    print("50-50 test file " + evaluate(model, "./ll_test_50-50", "cpu"))
    print("male test file " + evaluate(model, "./ll_test_males", "cpu"))
    print("female test file " + evaluate(model, "./ll_test_females", "cpu"))

Due to its simplicity, the error for the convNN2 is much higher than the the other two models. 
While we can see that ResNet34 outperformed DesneNet121 with a significant reduction in error between the two, 
both models had higher error in females for identifing landmarks for both genders, as expected at the start of this project.

Both models performed best overall on balanced data sets, though this difference was very slight. It also shows that a higher proportion of females or males in the dataset does not necessarily result in a lower error for each respective gender. However, the change in error was quite minimal, meaning we can discard the bias in the dataset as a source of the bias between males and females.

Now we can generate graphs for the change in mean error and standard deviation during training, and how it progressed over the iterations.
When loading in the model scores file, we need to parse this data to make it useful to us. Namely, we split it into a list by using ',' as dividers.
The model infos are already presented in json format, and can simply be read in directly. This gives us the iteration, mean, and standard deviation of the saved model.

In [39]:
import matplotlib.pyplot as plt

for model in models:
    epoch = []
    error = []
    std = []
    m_nopt = args.model.split(".pt")
    with open("./model_scores/" + m_nopt[0] + ".csv") as file:
        for line in file:
            scores = line.split(",")   
            epoch.append(float(scores[0]))
            error.append(float(scores[1]))
            std.append(float(scores[2]))
    
    file = open("./model_infos/" + m_nopt[0] + ".json")
    model_info = json.load(file)

    plt.plot(epoch, error)
    plt.xlabel("Iterations")
    plt.ylabel("Mean Error")
    plt.title('Error change during training')

    plt.scatter([model_info["iteration"]], [model_info["mean"]], color = 'red')
    plt.gca().legend(('Validation Error','Best performing iteration'))
    plt.show()

    plt.plot(epoch, std)
    plt.xlabel("Epochs")
    plt.ylabel("Standard Deviation")
    plt.title('STD change during training')
    plt.scatter([model_info["iteration"]], [model_info["std"]], color = 'red')
    plt.gca().legend(('Validation Error','Best performing iteration'))
    plt.show()
    

FileNotFoundError: [Errno 2] No such file or directory: './model_scores/convNN2.csv'

From the graphs we can see that the mean error and standard deviation rapidly decreases over time within the first few iterations. Initially, with the weights initialized to zero, the predicted landmarks all fall very far from the correct points.

Both the ConvNN2 and DenseNet seem to oscillate between certain points and do not converge to a single point like the Resnet34 does. This seems to indicate that they are not necessarily the most optimal model.

We can also see the loss during training with the following,

In [None]:
# Graph for loss during training
for model in models:
    plt.plot(list(range(len(model_info["loss_list"]))), model_info["loss_list"])
    plt.xlabel("Iterations")
    plt.ylabel("Loss")
    plt.title('Loss during training')
    plt.show()


We can visually check how accurate the predicted landmarks are on an image by comparing them against the input landmarks used by the model for training. For each model an image will be generated that plots 2 lines, with 5 images per line for easy comparison between different images.

In [None]:
def generate_images(train_dataloader, axs_flat):
    with torch.no_grad():
            for i, (image, _, _, _, labels) in enumerate(train_dataloader):
                output = model(image)
                output = output.reshape(3,2)
                image = image.squeeze()
                image = image.permute(1, 2, 0)    #Default was 3,200,200
                im = axs_flat[i].imshow(image)
                x = np.array(range(200))

                # Finding landmarks for chin, nose and eye this txt file for each image
                land_idx = [8, 30, 39]
                labels = labels.squeeze()
                labels = labels[land_idx, :]
                #ax.scatter(output[:,0], output[:,1], linewidth=2, color='red')
                axs_flat[i].scatter(output[:,0], output[:,1], linewidth=2, color='c', s = 5)
                axs_flat[i].scatter(labels[:,0], labels[:,1], linewidth=2, color='m', s = 5)

UTKFace = CustomImageDataset("./test_output_images", 'UTKFace')
train_dataloader = DataLoader(UTKFace, 
                                    batch_size=1, 
                                    shuffle=False)

for model in models:
    print(model.__class__.__name__)
    fig, axs = plt.subplots(2,5, figsize=(20,10))
    axs_flat = axs.flatten()

    generate_images(train_dataloader, axs_flat)
    #plt.subplots_adjust(wspace=0, hspace=0)
    fig.legend(('Predicted output','Expected output'))
    plt.show()

#### Analysis

As we can see the basic convolutional neural network essentially predicts the same three points for every single image. While that does produce the best possible mean error for this specific model, the results do show that the model is severely underfitting. It does not have enough layers, depth, or complexity to recognize the different landmark features on each face.

Currently, the model produces a very high mean error (54.377 and 54.895 pixels for females and males, respectively). On a 200x200 pixel image this is very far off. Thus, convNN2 will not be considered when evaluating our hypothesis.

Both ResNet34 and DenseNet121 vastly outperformed convNN2 and get significantly lower average mean error and standard deviation.
Their errors on different datasets can be seen below
##### ResNet34 Test Data
![ResNet34 Test Data](jupyter_images/resnet34table.jpg) 
##### DenseNet121 Test Data
![DenseNet121 Test Data](jupyter_images/densenet121table.jpg)

From the tables we can see that ResNet had a mean error about 45% lower than DenseNet, which was a lot more sensitive to data bias from the data sets.
The DenseNet had a higher error for women regardless of dataset bias. The error on the female test set was also very insensitive to dataset bias, indicating that the model is not properly learning how to predict the correct landmarks for women.

We can look at a graph for all three models to get a better understand of what is happening during training.

##### convNN2 Loss
![convNN2 loss](jupyter_images/lossconnv2.jpg)
##### DenseNet121 Loss
![densenet121 loss](jupyter_images/lossdense.jpg)
##### ResNet34 Loss
![resnet34 loss](jupyter_images/lossresnet.jpg)


Another issue may be due to underfitting of the model, which may be experimented with by increasing the amount of layers in the network in the future. If there is little to no increase in mean error, it would highlight that vanishing gradients are at play and that the number of layers must be reduced instead.

![Both Models Mean Error Change](jupyter_images/modelgraphs.jpg)

ResNet mean error converges asymptotically, while DenseNet continues oscillating. This indicates that the DenseNet model had issues during training and did not ouput a very optimal model. High oscillation tends to indicate that the learning rate is too high, or that the training might benefit from using an optimiser with momentum. DenseNet is ostensibly a more sophisticated model as it has both more layers and parameters, but this did not appear to hold in practice. It appeared the model was unable to properly fit to the female portion of the dataset, while ResNet despite being comparatively simpler, was able to. We theorise this is due to the DenseNet having vanishing gradient issues due to its high layer count. While DenseNet is intended to prevent vanishing gradients, the pytorch implementation only uses dense blocks. The layers inside these blocks possess dense connections, but subsequent layers only receive inputs from the previous dense block's output layer. This saves on computational complexity by bringing down the number of weights. However, this could allow vanishing gradients to exist. The ResNet model however used fewer layers and took advantage of skip connections. Since it uses summation as opposed to concatenation to bypass layers, it is more computationally efficient. Due to this, the model does not need to be split into blocks and the outputs from earlier layers can reach all later layers. This theory would explain our concerns about vanishing gradients. Additionally, when we examined the model weights for our trained DenseNets, we found that many of the earlier weight parameters had changed by no more that 1e-4, further supporting the vanishing gradient issues.

As discussed, research suggests that discrepencies between men and women in facial recognition are not due to dataset bias but rather differences in facial structure. Another difference between ResNet and denseNet is their choice of pooling algorithms. Max pooling is likely better at learning to find edges and contours, while average pooling has a propensity to smooth values. If we assume that women tend to have softer features and therefore less defined contours on their face, it may be harder for a model using average pooling to identify these contours. Since men tend to have more defined features with more contrast, the model may cope better. This is reflected in our models as DenseNet performed worse than ResNet and Resnet features many more max pooling layers while densenet contains many average pooling layers.

While underfitting can also be a symptom of poor hyperparameter selection or insufficient data, we do not believe this to be the case. The models were both trained on large datasets and we had spent some time testing  hyperparameters that would produce satisfactory results for all models.

#### Further Discussion

Clearly, the inherent differences between male and female facial structure can pose problems when developing facial recognition models. We have shown this is largely independent of any dataset bias. While we have identified the cause of these issues, further work would be required to determine the extent of this bias. Since we have only tested on three landmarks, we are unable to determine if these issues affect all landmarks or if there is a small subset that are responsible.


### Future Work

* Investigating which landmarks most heavily influence gender bias in model predictions may be a good next step. It would help affirm our conclusions if landmarks with less contrast were more biased, as it would support our theory that it is the difference in facial definition causing the bias.

* Examining different pooling algorithms would also help support our results. This could be done by training a version of denseNet with max pooling instead of average pooling between each dense block, and a version of Resnet with without max pooling. If the denseNet performance improves while the resnet decreases, this would support the literature.

* Testing different proportions of men and women in datasets would allow us to better understand how influenced the models are by bias and experiment with strategies and guidelines to make the models more resiliant to dataset bias. Due to time constraints this team only experimented with three different proportions, but more datapoints would provide considerably more information.

* Investigate the minimal number of facial features necessary to perform face recognition fairly.​

* This team would also like to suggest identifying how large of an impact vanishing gradients played on these results by performing the same tests with a smaller version of denseNet. If the scores increase or remain consistent that would point heavily to vanishing gradients as it indicates the model is overly complex.

* This team is also interested on performing a similar experiment with different racial groups. Here we believe dataset bias plays a bigger role as datasets created in certain countries are likely to have the same distributions of ethnicities as their local populace. We also expect differences in skin tone and again, facial structure would also contribute. This experiment was cut for time.

### Conclusion

Our experiments thus far have provided promising insights into why facial recognition performs worse for certain groups, and more specifically genders. We were able to pinpoint differences in facial structure as a major contributing factor and some model design choices such as pooling algorithms that may exacerbate these issues. Additionally, we were able to find a model that was relatively unbiased while still producing low errors. These findings have helped verify existing research and can help influence future facial recognition development to be fairer.