In [13]:
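import json

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader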

### Testing the Network

We can commence testing now that all the networks have been successfully trained. The models and training data are in the folders: *models*, *model_scores*, and *model_infos*.
Each model will be tested and compared on three different sets: a 50-50 split of male to female testing data, a solely male testing set, and a solely female testing set.

First we will load all the models into memory. This is done by listing the saved model files, initialising a model of the matching architecture for each one, and then loading each file into its corresponding model.

In [1]:


# Saved model file names, listed in the same order as the `models` list below
model_names = [
    "convNN2_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt",
    "convNN2_ll_training_27-75_split.txt_batch32_ep50_lr0.0001.pt",
    "convNN2_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt",
    "dense_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt",
    "dense_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt",
    "dense_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt",
    "resnet34_ll_training_50-50_split.txt_batch64_ep50_lr0.0001.pt",
    "resnet34_ll_training_25-75_split.txt_batch64_ep50_lr0.0001.pt",
    "resnet34_ll_training_75-25_split.txt_batch64_ep50_lr0.0001.pt",
]

models = [convNN2(), convNN2(), convNN2(), denseNN(), denseNN(), denseNN(), resnet34(), resnet34(), resnet34()]

for name, model in zip(model_names, models):
    model.load_state_dict(torch.load("./models/" + name))

NameError: name 'convNN2' is not defined

With the models loaded, we can calculate the scores of each model on every test set. For each model and test set, `evaluate` computes the mean error and standard deviation and returns them as a tuple.

In [None]:
for model in models:
    print(model.__class__.__name__)
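    # evaluate returns a (mean, std) tuple, which cannot be concatenated to a string,
    # so each result is passed to print as a separate argument
    print("50-50 test file", evaluate(model, "./ll_test_50-50", "cpu"))
    print("male test file", evaluate(model, "./ll_test_males", "cpu"))
    print("female test file", evaluate(model, "./ll_test_females", "cpu"))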
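
The `evaluate` function itself is not shown in this notebook. As a rough sketch of the kind of computation it is described as performing, the snippet below computes the mean and standard deviation of the Euclidean distance between predicted and true landmarks and returns them as a tuple. The function name `landmark_error_stats`, the input layout, and the sample coordinates are all assumptions made for illustration, not the project's actual implementation.

```python
import math
import statistics


def landmark_error_stats(predictions, targets):
    """Return (mean, std) of the Euclidean distance between landmarks.

    predictions, targets: one entry per image, each a list of (x, y) tuples.
    This layout is an assumption for this sketch.
    """
    distances = [
        math.dist(pred_point, true_point)
        for pred_points, true_points in zip(predictions, targets)
        for pred_point, true_point in zip(pred_points, true_points)
    ]
    # Population standard deviation over all landmark distances
    return statistics.fmean(distances), statistics.pstdev(distances)


# Two hypothetical images with three landmarks each (pixel coordinates)
preds = [[(3.0, 4.0), (10.0, 10.0), (0.0, 0.0)],
         [(6.0, 8.0), (5.0, 5.0), (1.0, 1.0)]]
truth = [[(0.0, 0.0), (10.0, 10.0), (0.0, 0.0)],
         [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]]

mean_err, std_err = landmark_error_stats(preds, truth)
print(f"mean error: {mean_err:.3f} px, std: {std_err:.3f} px")
```

Because the result is a tuple, it is unpacked before being formatted into the output string.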

Due to its simplicity, the error for convNN2 is much higher than that of the other two models.
While ResNet34 outperformed DenseNet121 with a significant reduction in error,
both models had a higher landmark error on the female test set than on the male test set, as expected at the start of this project.

Both models performed best overall when trained on balanced datasets, though the difference was very slight. The results also show that a higher proportion of females or males in the training set does not necessarily result in a lower error for the respective gender. Since the change in error was minimal, we can rule out dataset bias as the source of the difference between males and females.

Now we can generate graphs for the change in mean error and standard deviation during training, and how it progressed over the iterations.
When loading a model scores file, we need to parse the data to make it useful. Namely, we split each line into a list using ',' as the delimiter.
The model infos are already stored in JSON format and can be read in directly. They give us the iteration, mean, and standard deviation of the saved model.

In [None]:


for name in model_names:
    epoch = []
    error = []
    std = []
    # Strip the ".pt" extension to get the base name shared by the score and info files
    m_nopt = name.split(".pt")[0]
    with open("./model_scores/" + m_nopt + ".csv") as file:
        for line in file:
            scores = line.split(",")
            epoch.append(float(scores[0]))
            error.append(float(scores[1]))
            std.append(float(scores[2]))

    # Read the summary of the saved model (iteration, mean, std)
    with open("./model_infos/" + m_nopt + ".json") as file:
        model_info = json.load(file)

    plt.plot(epoch, error)
    plt.xlabel("Iterations")
    plt.ylabel("Mean Error")
    plt.title('Error change during training')

    # Mark the saved model in red
    plt.scatter([model_info["iteration"]], [model_info["mean"]], color='red')
    plt.gca().legend(('Validation Error', 'Best performing iteration'))
    plt.show()

    plt.plot(epoch, std)
    plt.xlabel("Iterations")
    plt.ylabel("Standard Deviation")
    plt.title('STD change during training')
    plt.scatter([model_info["iteration"]], [model_info["std"]], color='red')
    plt.gca().legend(('Validation STD', 'Best performing iteration'))
    plt.show()
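
The parsing step can also be checked on its own, without plotting. The sketch below parses a few score rows and compares the row with the lowest mean error against the saved model's summary. The sample contents, the helper name `parse_scores`, and the idea that the saved model is the one with the lowest mean error are assumptions for illustration only.

```python
import csv
import io
import json

# Hypothetical file contents: "iteration,mean,std" rows and a JSON summary
scores_text = "0,60.0,20.0\n1,30.0,12.0\n2,12.5,6.0\n3,14.0,7.5\n"
info_text = '{"iteration": 2, "mean": 12.5, "std": 6.0}'


def parse_scores(handle):
    """Parse comma-separated rows into (iteration, mean, std) float tuples."""
    rows = []
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        iteration, mean, std = (float(value) for value in record[:3])
        rows.append((iteration, mean, std))
    return rows


rows = parse_scores(io.StringIO(scores_text))
model_info = json.loads(info_text)

# Assumption: the saved model corresponds to the lowest mean error
best = min(rows, key=lambda row: row[1])
matches = best[0] == model_info["iteration"] and best[1] == model_info["mean"]
print("lowest-error row:", best, "matches saved model info:", matches)
```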
    

From the graphs we can see that the mean error and standard deviation decrease rapidly within the first few iterations. Initially, with freshly initialised weights, the predicted landmarks all fall very far from the correct points.

Both convNN2 and DenseNet seem to oscillate between certain values and do not converge to a single value as ResNet34 does. This suggests that the saved models for these two architectures are not necessarily optimal.
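
One way to put a number on "oscillating" versus "converging" is to look at the spread of the last few validation errors. The following is a minimal sketch under that assumption. The function `tail_oscillation`, the window size, and both error curves are hypothetical and do not come from the trained models.

```python
import statistics


def tail_oscillation(errors, window=5):
    """Return (std, mean absolute step) over the last `window` errors."""
    tail = errors[-window:]
    if len(tail) < 2:
        raise ValueError("need at least two error values")
    steps = [abs(later - earlier) for earlier, later in zip(tail, tail[1:])]
    return statistics.pstdev(tail), statistics.fmean(steps)


# Hypothetical validation-error curves
converging = [50.0, 20.0, 10.0, 8.0, 7.5, 7.2, 7.1, 7.05, 7.0]
oscillating = [50.0, 20.0, 14.0, 9.0, 15.0, 9.5, 14.5, 9.0, 15.0]

for label, curve in (("converging", converging), ("oscillating", oscillating)):
    spread, step = tail_oscillation(curve)
    print(f"{label}: tail std = {spread:.3f}, mean step = {step:.3f}")
```

A curve that has settled shows a small tail spread and small steps, while an oscillating one keeps both values large.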

We can also plot the loss during training with the following:

In [None]:
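# Graph for loss during training
for name in model_names:
    # Load the info file for this model so that each plot shows its own loss
    with open("./model_infos/" + name.split(".pt")[0] + ".json") as file:
        model_info = json.load(file)
    plt.plot(list(range(len(model_info["loss_list"]))), model_info["loss_list"])
    plt.xlabel("Iterations")
    plt.ylabel("Loss")
    plt.title('Loss during training')
    plt.show()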


We can visually check how accurate the predicted landmarks are on an image by comparing them against the labelled landmarks. For each model, a figure is generated with 2 rows of 5 images each, for easy comparison between different images.

In [None]:
def generate_images(model, dataloader, axs_flat):
    # Switch to inference mode (affects layers such as batch normalisation)
    model.eval()
    with torch.no_grad():
        for i, (image, _, _, _, labels) in enumerate(dataloader):
            # Stop once every subplot has been filled
            if i >= len(axs_flat):
                break
            output = model(image)
            output = output.reshape(3, 2)
            image = image.squeeze()
            image = image.permute(1, 2, 0)    # Channels-first (3, 200, 200) to channels-last for imshow
            axs_flat[i].imshow(image)

            # Select the chin, nose and eye landmarks from the full set of labels
            land_idx = [8, 30, 39]
            labels = labels.squeeze()
            labels = labels[land_idx, :]
            # Predicted landmarks in cyan, expected landmarks in magenta
            axs_flat[i].scatter(output[:, 0], output[:, 1], linewidth=2, color='c', s=5)
            axs_flat[i].scatter(labels[:, 0], labels[:, 1], linewidth=2, color='m', s=5)

UTKFace = CustomImageDataset("./test_output_images", 'UTKFace')
test_dataloader = DataLoader(UTKFace,
                             batch_size=1,
                             shuffle=False)

for model in models:
    print(model.__class__.__name__)
    fig, axs = plt.subplots(2, 5, figsize=(20, 10))
    axs_flat = axs.flatten()

    generate_images(model, test_dataloader, axs_flat)
    fig.legend(('Predicted output', 'Expected output'))
    plt.show()

#### Analysis

As we can see, the basic convolutional neural network essentially predicts the same three points for every image. While that does produce the best possible mean error for this specific model, the results show that the model is severely underfitting. It does not have enough layers, depth, or complexity to recognise the different landmark features on each face.

Currently, the model produces a very high mean error (54.377 and 54.895 pixels for females and males, respectively). On a 200x200 pixel image this is very far off. Thus, convNN2 will not be considered when evaluating our hypothesis.
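
A collapse to near-constant predictions can also be detected numerically by measuring how much each predicted coordinate varies across images. The sketch below assumes each prediction is a flat list `[x1, y1, x2, y2, x3, y3]`. The helper names, the tolerance, and the sample values are hypothetical.

```python
import statistics


def prediction_spread(predictions):
    """Population std of each output coordinate across all images."""
    return [statistics.pstdev(column) for column in zip(*predictions)]


def looks_constant(predictions, tolerance=1.0):
    """True if no coordinate varies by more than `tolerance` pixels (std)."""
    return max(prediction_spread(predictions)) < tolerance


# Hypothetical outputs for three images: one collapsed model, one varied model
collapsed = [[100.0, 150.0, 100.2, 110.0, 80.0, 90.1],
             [100.1, 150.2, 100.0, 110.1, 80.2, 90.0],
             [99.9, 149.9, 100.1, 109.9, 80.1, 90.2]]
varied = [[90.0, 160.0, 95.0, 115.0, 70.0, 85.0],
          [110.0, 140.0, 105.0, 100.0, 88.0, 97.0],
          [101.0, 152.0, 99.0, 121.0, 79.0, 92.0]]

print("collapsed model constant:", looks_constant(collapsed))
print("varied model constant:", looks_constant(varied))
```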

Both ResNet34 and DenseNet121 vastly outperformed convNN2, with significantly lower mean error and standard deviation.
Their errors on the different datasets can be seen below.
##### ResNet34 Test Data
![ResNet34 Test Data](jupyter_images/resnet34table.jpg) 
##### DenseNet121 Test Data
![DenseNet121 Test Data](jupyter_images/densenet121table.jpg)

From the tables we can see that ResNet had a mean error about 45% lower than DenseNet, which was considerably more sensitive to bias in the training sets.
DenseNet had a higher error for women regardless of dataset bias. Its error on the female test set was also largely insensitive to dataset bias, indicating that the model is not properly learning to predict the correct landmarks for women.

We can look at the loss graphs for all three models to get a better understanding of what is happening during training.

##### convNN2 Loss
![convNN2 loss](jupyter_images/lossconnv2.jpg)
##### DenseNet121 Loss
![densenet121 loss](jupyter_images/lossdense.jpg)
##### ResNet34 Loss
![resnet34 loss](jupyter_images/lossresnet.jpg)


Another issue may be underfitting, which could be tested in future by increasing the number of layers in the network. If this yields little to no improvement in mean error, it would suggest that vanishing gradients are at play and that the number of layers must be reduced instead.

![Both Models Mean Error Change](jupyter_images/modelgraphs.jpg)

ResNet's mean error converges asymptotically, while DenseNet's continues to oscillate. This indicates that the DenseNet had issues during training and did not produce a well-optimised model. High oscillation tends to indicate that the learning rate is too high, or that training might benefit from an optimiser with momentum. DenseNet is ostensibly the more sophisticated model, as it has considerably more layers, but this advantage did not hold in practice. The model appeared unable to fit the female portion of the dataset properly, while ResNet, despite being comparatively simpler, was able to.

We theorise that this is due to the DenseNet suffering from vanishing gradients because of its high layer count. While DenseNet is intended to prevent vanishing gradients, the PyTorch implementation only uses dense connections within dense blocks. The layers inside a block are densely connected, but the layers after it only receive input from the previous block's output layer. This reduces computational cost by bringing down the number of weights, but it could allow vanishing gradients to occur.

The ResNet model, in contrast, used fewer layers and took advantage of skip connections. Since it bypasses layers using summation rather than concatenation, it is more computationally efficient. As a result, the model does not need to be split into blocks, and the outputs of earlier layers can reach all later layers. This would explain our concerns about vanishing gradients. Additionally, when we examined the weights of our trained DenseNets, we found that many of the earlier weight parameters had changed by no more than 1e-4, further supporting the vanishing gradient explanation.
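
A weight-change check of this kind can be sketched as follows. The snippet compares two snapshots of a model's parameters and lists those whose largest element-wise change does not exceed a threshold. It works on plain dictionaries of flat float lists. With PyTorch, each tensor in a `state_dict()` could be converted using `.flatten().tolist()`. The function names, parameter names, and values are hypothetical and do not reproduce the measurements reported above.

```python
def max_abs_change(before, after):
    """Map each parameter name to its largest absolute element-wise change."""
    changes = {}
    for name, old_values in before.items():
        new_values = after[name]
        changes[name] = max(abs(new - old) for old, new in zip(old_values, new_values))
    return changes


def stalled_parameters(before, after, threshold=1e-4):
    """Names of parameters whose largest change does not exceed `threshold`."""
    return [name for name, change in max_abs_change(before, after).items()
            if change <= threshold]


# Hypothetical snapshots taken before and after training
initial = {"features.conv0.weight": [0.10, -0.20, 0.05],
           "classifier.weight": [0.30, -0.10]}
trained = {"features.conv0.weight": [0.10003, -0.20001, 0.05002],
           "classifier.weight": [0.12, 0.25]}

print("parameters that barely moved:", stalled_parameters(initial, trained))
```

Early layers appearing in this list while later layers do not would be consistent with gradients shrinking as they propagate backwards, though it does not prove it.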

As discussed, research suggests that discrepancies between men and women in facial recognition are not due to dataset bias but rather to differences in facial structure. Another difference between ResNet and DenseNet is their choice of pooling algorithm. Max pooling is likely better at preserving edges and contours, while average pooling tends to smooth values. If we assume that women tend to have softer features, and therefore less defined facial contours, these contours may be harder for a model using average pooling to identify. Since men tend to have more defined features with more contrast, the model may cope better. This is consistent with our results: DenseNet, which uses average pooling between its dense blocks, performed worse than ResNet, which relies on max pooling and strided convolutions for downsampling.
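
The smoothing effect of average pooling can be illustrated with a toy one-dimensional example. The sketch below pools a strong and a faint "edge" with both methods. It is a simplified illustration with made-up signals and a hypothetical `pool_1d` helper, not the pooling layers used inside either network.

```python
def pool_1d(values, size, reduce):
    """Apply `reduce` to consecutive, non-overlapping windows of length `size`."""
    return [reduce(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def average(window):
    return sum(window) / len(window)


# A single bright pixel stands in for a contour. The faint one has less contrast.
strong_edge = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
faint_edge = [0.0, 0.0, 0.2, 0.0, 0.0, 0.0]

for label, signal in (("strong", strong_edge), ("faint", faint_edge)):
    print(label,
          "max pooled:", pool_1d(signal, 3, max),
          "average pooled:", pool_1d(signal, 3, average))
```

Max pooling passes the peak value through unchanged, while average pooling divides it across the window, so a faint contour ends up closer to the background level.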

While underfitting can also be a symptom of poor hyperparameter selection or insufficient data, we do not believe this to be the case. Both models were trained on large datasets, and we spent some time testing hyperparameters that would produce satisfactory results for all models.

#### Further Discussion

Clearly, the inherent differences between male and female facial structure can pose problems when developing facial recognition models. Our results indicate that this is largely independent of any dataset bias. While we have identified a likely cause of these issues, further work is required to determine the extent of the bias. Since we have only tested on three landmarks, we are unable to determine whether these issues affect all landmarks or whether a small subset is responsible.


### Future Work

* Investigating which landmarks most heavily influence gender bias in model predictions may be a good next step. It would help affirm our conclusions if landmarks with less contrast were more biased, as it would support our theory that it is the difference in facial definition causing the bias.

* Examining different pooling algorithms would also help support our results. This could be done by training a version of DenseNet with max pooling instead of average pooling between each dense block, and a version of ResNet without max pooling. If the DenseNet's performance improves while the ResNet's decreases, this would support the literature.

* Testing different proportions of men and women in datasets would allow us to better understand how strongly the models are influenced by bias, and to experiment with strategies and guidelines that make the models more resilient to dataset bias. Due to time constraints, this team only experimented with three different proportions, but more data points would provide considerably more information.

* Investigating the minimal number of facial features necessary to perform face recognition fairly.

* This team would also suggest measuring how large an impact vanishing gradients had on these results by performing the same tests with a smaller version of DenseNet. If performance improves or remains consistent, that would point strongly to vanishing gradients, as it would indicate that the full model is overly complex.

* This team is also interested in performing a similar experiment with different racial groups. Here we believe dataset bias plays a bigger role, as datasets created in certain countries are likely to have the same distribution of ethnicities as the local populace. We also expect that differences in skin tone and, again, facial structure would contribute. This experiment was cut for time.

### Conclusion

Our experiments thus far have provided promising insights into why facial recognition performs worse for certain groups, and more specifically genders. We were able to pinpoint differences in facial structure as a major contributing factor and some model design choices such as pooling algorithms that may exacerbate these issues. Additionally, we were able to find a model that was relatively unbiased while still producing low errors. These findings have helped verify existing research and can help influence future facial recognition development to be fairer.